tiktag
Rust library + CLI for text anonymization. Ships a built-in multilingual NER model (Xenova/distilbert-base-multilingual-cased-ner-hrl, quantized ONNX).
Install
Or build from source:
Quickstart
Library
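use std::path::Path;
use tiktag::Tiktag;

// Assumed arguments: the config path and the input text are illustrative.
let mut tiktag = Tiktag::new(Path::new("models/profiles.toml"))?;
let out = tiktag.anonymize("Alice met Bob in Paris.")?;
// Assumed field name, mirroring the anonymized_text key of the JSON output.
println!("{}", out.anonymized_text);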
- Tiktag::new loads the tokenizer and ONNX session once (~350 ms).
- Tiktag::anonymize(&mut self, text) reuses that state (milliseconds to tens of milliseconds per call). Wrap the instance in a mutex to share it across threads.
- Both return Result<_, TiktagError>.
- Placeholder numbering is stable within a single call; there is no cross-document identity.
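One way to share a single loaded instance across threads is sketched below. It uses a stand-in type instead of the real Tiktag so that it runs without the model bundle: the Anonymizer struct, its string-replace body, and the [PERSON_1] placeholder format are illustrative assumptions. Only the Arc<Mutex<...>> pattern is the point.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for `Tiktag`: holds state and needs `&mut self` per call.
struct Anonymizer {
    calls: usize,
}

impl Anonymizer {
    fn anonymize(&mut self, text: &str) -> String {
        self.calls += 1;
        // Placeholder transformation; the real library runs NER here.
        text.replace("Alice", "[PERSON_1]")
    }
}

/// Anonymize each input on its own thread, sharing one instance behind a mutex.
fn run_shared(inputs: Vec<String>) -> (Vec<String>, usize) {
    // Load once, then hand every thread a clone of the same Arc.
    let shared = Arc::new(Mutex::new(Anonymizer { calls: 0 }));
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // The lock guard provides the exclusive `&mut` access the method needs.
                let mut guard = shared.lock().expect("mutex poisoned");
                guard.anonymize(&text)
            })
        })
        .collect();
    // Joining in spawn order keeps outputs aligned with inputs.
    let outputs: Vec<String> = handles
        .into_iter()
        .map(|handle| handle.join().expect("thread panicked"))
        .collect();
    let calls = shared.lock().expect("mutex poisoned").calls;
    (outputs, calls)
}

fn main() {
    let inputs = vec!["Alice called.".to_string(), "No names here.".to_string()];
    let (outputs, calls) = run_shared(inputs);
    println!("{outputs:?} after {calls} calls");
}
```

Because every call holds the lock for its full duration, inference is serialized. If throughput matters more than memory, one instance per worker thread may be a better fit.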
CLI
Flags: --stdin, --json, --debug-json, --show-tokens.
The CLI resolves models/profiles.toml by checking next to the binary first, then the current working directory, so it can be run from anywhere once the model bundle sits next to the tiktag binary.
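The lookup order can be sketched roughly as follows. This is not the CLI's actual code: the resolve_config function is a hypothetical helper written only to illustrate "binary directory first, working directory second".

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Return the first existing candidate: next to the binary, then the working directory.
fn resolve_config(exe_dir: &Path, cwd: &Path, relative: &str) -> Option<PathBuf> {
    [exe_dir, cwd]
        .iter()
        .map(|base| base.join(relative))
        .find(|candidate| candidate.is_file())
}

fn main() -> std::io::Result<()> {
    let exe = env::current_exe()?;
    // `current_exe` returns the binary's path; its parent is the directory probed first.
    let exe_dir = exe.parent().unwrap_or(Path::new("."));
    let cwd = env::current_dir()?;
    match resolve_config(exe_dir, &cwd, "models/profiles.toml") {
        Some(path) => println!("using {}", path.display()),
        None => eprintln!("models/profiles.toml not found next to the binary or in the working directory"),
    }
    Ok(())
}
```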
JSON
--json fields: schema_version, provenance, profile, anonymized_text, stats.
stats.timings varies by machine; pipelines that hash output should ignore it.
Additive field changes keep schema_version; breaking changes bump it.
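A pipeline that hashes output could drop stats.timings before hashing, along these lines. The sketch assumes the JSON has already been parsed: the Value enum is a minimal stand-in for a JSON library's value type, and the total_ms and entities keys, the integer schema_version, and the sample text are invented for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Minimal stand-in for a parsed JSON value; a real pipeline would use a JSON library.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Value {
    Text(String),
    Int(u64),
    Object(BTreeMap<String, Value>),
}

/// Hash a parsed `--json` document with `stats.timings` removed.
fn stable_hash(doc: &Value) -> u64 {
    let mut doc = doc.clone();
    if let Value::Object(top) = &mut doc {
        if let Some(Value::Object(stats)) = top.get_mut("stats") {
            stats.remove("timings");
        }
    }
    // DefaultHasher::new() uses fixed keys, but its algorithm is not guaranteed
    // stable across Rust releases; use a fixed digest for hashes that are stored.
    let mut hasher = DefaultHasher::new();
    doc.hash(&mut hasher);
    hasher.finish()
}

/// Build a sample document; keys other than the documented top-level ones are invented.
fn sample(text: &str, total_ms: u64) -> Value {
    let timings = BTreeMap::from([("total_ms".to_string(), Value::Int(total_ms))]);
    let stats = BTreeMap::from([
        ("entities".to_string(), Value::Int(1)),
        ("timings".to_string(), Value::Object(timings)),
    ]);
    Value::Object(BTreeMap::from([
        ("schema_version".to_string(), Value::Int(1)),
        ("anonymized_text".to_string(), Value::Text(text.to_string())),
        ("stats".to_string(), Value::Object(stats)),
    ]))
}

fn main() {
    let fast = sample("[PERSON_1] called.", 12);
    let slow = sample("[PERSON_1] called.", 48);
    println!("hashes match: {}", stable_hash(&fast) == stable_hash(&slow));
}
```

Using a sorted map keeps key order out of the hash, so two runs that differ only in timing values compare equal.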
Dev
Built-in profile
Fixed config path: models/profiles.toml. model_dir resolves relative to the config directory.
Required files under model_dir: tokenizer.json, config.json, onnx/model_quantized.onnx.
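A pre-flight check for a model bundle might look like the sketch below. The helper names resolve_model_dir and missing_files and the "ner" directory value are hypothetical; only the three file names and the relative-to-config rule come from the text above.

```rust
use std::path::{Path, PathBuf};

// File names as listed in this section.
const REQUIRED_FILES: [&str; 3] = ["tokenizer.json", "config.json", "onnx/model_quantized.onnx"];

/// Resolve `model_dir` against the directory that contains the config file.
fn resolve_model_dir(config_path: &Path, model_dir: &str) -> PathBuf {
    let config_dir = config_path.parent().unwrap_or(Path::new("."));
    // `join` returns `model_dir` unchanged when it is already absolute.
    config_dir.join(model_dir)
}

/// List the required files that are absent under `model_dir`.
fn missing_files(model_dir: &Path) -> Vec<&'static str> {
    REQUIRED_FILES
        .iter()
        .copied()
        .filter(|name| !model_dir.join(name).is_file())
        .collect()
}

fn main() {
    // "ner" is a hypothetical `model_dir` value, not one taken from the real config.
    let model_dir = resolve_model_dir(Path::new("models/profiles.toml"), "ner");
    let missing = missing_files(&model_dir);
    if missing.is_empty() {
        println!("model bundle complete in {}", model_dir.display());
    } else {
        eprintln!("missing under {}: {}", model_dir.display(), missing.join(", "));
    }
}
```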
Caveat
Model-based anonymization can miss entities. Use tiktag as an assistive control, not your only compliance/safety gate.
Model attribution
- Model source: Xenova/distilbert-base-multilingual-cased-ner-hrl
- Model license/terms: see model card on Hugging Face.
See AGENTS.md for the authoritative contract and known footguns.