txtfp

High-performance text fingerprinting SDK for Rust with classical sketches (MinHash + LSH, SimHash, TLSH), Unicode-correct canonicalization, and semantic embeddings (ONNX local).

Overview

txtfp produces compact, deterministic, byte-stable hashes for text deduplication, near-duplicate detection, and semantic search:

Method	Use case	Output	Complexity
MinHash	Set-similarity dedup (Jaccard)	`[u64; H]`	O(n) sketch
LSH	Sub-linear near-duplicate lookup	bucketed index	O(1) avg query
SimHash	Bit-LSH near-dup (Hamming)	`u64`	O(n) sketch
TLSH	Byte-level locality-sensitive hash	hex string	O(n) sketch
Embedding	Semantic similarity (ANN)	`Vec<f32>`	model-dependent

It is the text counterpart to audiofp (audio) and imgfprint (image), and is consumed by the cross-modal ucfp integrator.

Perfect for:

LLM training-set deduplication
RAG retrieval ranking
Content moderation
Plagiarism detection
Email / document de-dup at scale

Features

Byte-stable hash layouts — MinHashSig<H> and SimHash64 are repr(C) bytemuck::Pod. Schema-versioned, semver-frozen, golden-byte enforced (18 fixtures).
Production canonicalization — NFKC + simple casefold + Bidi/format strip; defends against Trojan Source, ZWJ injection, NFC bombs.
no_std + alloc-clean default features — builds for wasm32-unknown-unknown out of the box.
Streaming + offline fingerprinters — every classical sketcher has both a Fingerprinter (whole-doc) and StreamingFingerprinter (chunk-fed) variant.
Local embeddings — LocalProvider (ONNX via ort + Hugging Face Hub) for semantic similarity. Cloud-hosted providers are out of scope; implement EmbeddingProvider against your HTTP client of choice.
Markup helpers — HTML → text, Markdown → text.
Unicode security — UTS #39 confusable skeleton behind the security feature.
CJK tokenizer — jieba-rs with OnceLock-lazy dictionary for Simplified Chinese.
Cross-SDK parity — EmbeddingProvider, Embedding, semantic_similarity, FORMAT_VERSION aligned with imgfprint / audiofp.

Installation

[dependencies]
txtfp = "0.2"

Upgrading from 0.1.x? v0.2.0 flipped the default hash family from MurmurHash3_x64_128 to Xxh3_64 for both MinHash and SimHash — signature bytes change. Pin to 0.1 or pass HashFamily::MurmurHash3_x64_128 explicitly for v0.1.x / Python datasketch byte parity. v0.2.1 is API- and bytes-compatible with v0.2.0 (patch release).

Feature flags

Feature	Default	Pulls
`std`	✅	libstd. Without it, `no_std + alloc`.
`minhash`	✅	MinHash sketcher.
`simhash`	✅	SimHash sketcher.
`lsh`	✅	Banded LSH index over MinHash signatures.
`markup`		`html_to_text`, `markdown_to_text`.
`cjk`		`CjkTokenizer` (jieba, Simplified Chinese).
`tlsh`		`TlshFingerprinter`.
`security`		UTS #39 confusable skeleton in the canonicalizer.
`serde`		`Serialize` / `Deserialize` on signatures (incl. const-generic MinHash).
`parallel`		Rayon-powered batch helpers.
`semantic`		`LocalProvider` via `ort` + Hugging Face Hub.

For Japanese / Korean tokenization or PDF text extraction, implement the Tokenizer / Canonicalizer upstream of this crate against your preferred dedicated library (lindera, vibrato, pdf-extract, poppler, …). Cloud-hosted embedding endpoints (OpenAI, Voyage, Cohere, …) are similarly out of scope; implement EmbeddingProvider against any HTTP client of choice — see USAGE.md for a worked example.

Minimal build (no_std + alloc, MinHash + SimHash only — drops LSH):

[dependencies]
txtfp = { version = "0.3", default-features = false, features = ["minhash", "simhash"] }

Without LSH (still on default std):

[dependencies]
txtfp = { version = "0.3", default-features = false, features = ["std", "minhash", "simhash"] }

With local ONNX embeddings:

[dependencies]
txtfp = { version = "0.3", features = ["semantic"] }

Quick Start

use txtfp::{
    Canonicalizer, Fingerprinter, MinHashFingerprinter,
    ShingleTokenizer, WordTokenizer, jaccard,
};

fn main() -> Result<(), txtfp::Error> {
    let canon = Canonicalizer::default();
    let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };
    let fp = MinHashFingerprinter::<_, 128>::new(canon, tok);

    let a = fp.fingerprint("the quick brown fox jumps over the lazy dog at noon today")?;
    let b = fp.fingerprint("the quick brown fox jumps over the lazy dog at dusk today")?;

    let similarity = jaccard(&a, &b);
    println!("Jaccard estimate: {:.2}", similarity);

    if similarity > 0.6 {
        println!("near-duplicate");
    }
    Ok(())
}

LSH for sub-linear near-duplicate lookup

# #[cfg(feature = "lsh")]
# fn demo() -> Result<(), txtfp::Error> {
use txtfp::{
    Canonicalizer, Fingerprinter, LshIndex, LshIndexBuilder,
    MinHashFingerprinter, ShingleTokenizer, WordTokenizer,
};

let canon = Canonicalizer::default();
let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };
let fp = MinHashFingerprinter::<_, 128>::new(canon, tok);

// Optimize bands/rows for a Jaccard threshold of 0.7.
let mut idx: LshIndex<128> = LshIndexBuilder::for_threshold(0.7, 128)?.build();

idx.insert(0, fp.fingerprint("the quick brown fox jumps over the lazy dog at noon today")?);
idx.insert(1, fp.fingerprint("astronomers detect cosmic background radiation")?);

let probe = fp.fingerprint("the quick brown fox jumps over the lazy dog at dusk today")?;
let neighbours = idx.query_with_threshold(&probe, 0.5);
println!("near-duplicates: {neighbours:?}");
# Ok(()) }

Documentation

For the complete API reference and worked examples, see USAGE.md.

Architecture

Pipeline

input bytes
    │
    ▼
canonicalize  (NFKC + casefold + Bidi/format strip)
    │
    ▼
tokenize      (Word | Grapheme | Shingle | CJK)
    │
    ▼
sketch        (MinHash | SimHash | TLSH | Embedding)
    │
    ▼
compare       (jaccard | hamming | cosine_estimate | semantic_similarity)

Every layer is independently swappable: pick a canonicalizer config, plug any Tokenizer, choose a HashFamily, and the same input always produces the same byte-stable signature.

Signature byte layouts (frozen for v0.1.x)

MinHashSig<H>                       SimHash64
├── schema: u16  (= 1)              └── 8 bytes (u64, little-endian)
├── _pad:   [u8; 6] (zero)
└── hashes: [u64; H], LE

Total size: 8 + 8*H bytes

These layouts are enforced by 18 byte-frozen golden-test fixtures (tests/data/golden/). Failing a golden test is a hard breakage that requires a major-version bump.

Algorithms

MinHash uses double-hashing (Indyk–Motwani 1998 + Kirsch–Mitzenmacher 2008): one xxh3_128 per shingle, then derive H slots as low + (i * high). v0.2.0+ default; pass HashFamily::MurmurHash3_x64_128 for datasketch byte parity.
SimHash is Charikar 2002: token-weighted bag, 64-lane signed accumulator, sign-extract.
LSH is banded: bands * rows == H. LshIndexBuilder::for_threshold numerically minimizes false-positive + false-negative integral over [0, threshold] and [threshold, 1] to pick the partition.
TLSH wraps tlsh2 128/1.
Local embeddings load HF Hub ONNX models, tokenize with tokenizers, run ort 2.0, and pool with Pooling::{Cls, Mean, MeanNoNorm, Max}. The pooling default is looked up per-model (BGE → Cls, E5 → Mean, etc.).

Performance

Single-thread throughput on a 2024-class x86_64 laptop, fat-LTO release with RUSTFLAGS="-C target-cpu=native" and mimalloc as the benches' global allocator, measured with cargo bench --features lsh over the 5 KB lorem_ipsum (ASCII) corpus:

v0.2.0+ baseline (HashFamily::Xxh3_64 default):

Operation	Time	Throughput
MinHash sketch (h=128)	~110 µs/doc	~9K docs/sec
MinHash sketch (h=64)	~76 µs/doc	~13K docs/sec
SimHash sketch (b=64)	~205 µs/doc	~5K docs/sec¹
Canonicalize NFKC (ASCII)	~540 ns/doc	~1.9M docs/sec
LSH insert (h=128)	~1.9 µs/sig	~530K signatures/sec
LSH query (10K-doc index)	~393 µs²	~2.5K queries/sec
Hamming compare (`hamming`)	~0.5 ns	~2B comparisons/sec
Jaccard compare (h=128)	~50 ns	~20M comparisons/sec

¹ SimHash 5 KB throughput improved 40% from v0.1.2 (345 µs → 205 µs) via the streaming ±1-per-occurrence accumulator under Weighting::Tf.

² LSH query is slower than v0.1.x on adversarial bench corpora — xxh3's collision profile produces 1.62× more bucket candidates than MurmurHash3 on a 9/10-shared-words corpus. Per-candidate cost is unchanged. Pin HashFamily::MurmurHash3_x64_128 if your workload matches the bench shape and you need v0.1.x query latency. See CHANGELOG.md for the analysis.

Run benchmarks:

RUSTFLAGS="-C target-cpu=native" cargo bench --features lsh

Optimization knobs

The canonicalizer takes a single-pass ASCII fast path. v0.2.0 extends it to ASCII + droppable bidi/format codepoints (BOM, ZWSP, RLO, variation selectors) — measured 17× faster on a 5 KB corpus with one BOM and a ZWSP every 80 bytes (170 µs → 9.8 µs).
Tokenizer::for_each_token is a callback-style API that skips per-token String allocation; classical sketchers route through it.
mimalloc gives ~2× on LSH insert (alloc-heavy), ~6% on SimHash, marginal elsewhere.
The MinHash slot-update inner loop and the SimHash 64-lane accumulator are already auto-vectorized by LLVM (verified via release-build assembly: vpcmpltuq + AVX-512 mask blending on ymm registers). No hand-rolled SIMD planned.
LshIndex::extend_par (v0.2.0, parallel feature) shards bulk insert by band across the rayon thread pool: measured 1.74× speedup on 8 cores for 10K-doc bench.

Stability

Hash byte struct layouts (MinHashSig<H>, SimHash64, TlshFingerprint): frozen since v0.1.0. Golden tests enforce on every PR.
Hash byte values: changed once at v0.2.0 with the default-hasher flip from MurmurHash3 to xxh3. The struct layout did not change. v0.1.x byte parity is one builder call away (with_hasher(HashFamily::MurmurHash3_x64_128)); golden fixtures regenerated, no further byte changes planned for v0.2.x.
EmbeddingProvider, Embedding, semantic_similarity: parity-compatible with imgfprint 0.4.x and audiofp 0.2.x.
FORMAT_VERSION = 1: mirrored across the cross-modal sibling crates so the integrator (ucfp) can refuse to open a database whose layout predates the running build.
Cross-config comparisons are gated by FingerprintMetadata::config_hash. Two fingerprints with different non-zero config_hash values must not be compared.
SemVer enforcement: every PR runs cargo-semver-checks (added in v0.2.1) against the published baseline. Accidental SemVer breaks fail CI.

Security

OOM protection: streaming sketchers cap their internal buffer at 16 MiB; oversized chunks are rejected at update time.
Trojan Source / homoglyph defense: canonicalizer strips Bidi controls and the Cf category. security feature adds the UTS #39 confusable skeleton so Cyrillic 'а' folds to Latin 'a'.
NFC bombs bounded: NFKC growth capped at 18× (Unicode-spec-mandated worst case).
Deterministic output: same input always produces the same byte-identical signature; no hidden RNG, no clock dependency.
Cryptographic-level attacks on the hash families: out of scope. MurmurHash3, xxh3, and SimHash are non-cryptographic by design.

Comparison with alternatives

Feature	txtfp	datasketch (py)	sourmash (py)	rapidfuzz
MinHash	✓	✓	✓	—
Banded LSH	✓	✓	✓	—
SimHash	✓	✓	—	—
TLSH	✓	—	—	—
Streaming sketches	✓	✓	✓	—
Unicode canonicalization	✓	—	—	~
Trojan Source defense	✓	—	—	—
Local ONNX embeddings	✓	—	—	—
Byte-stable hash layouts	✓	—	—	—
`no_std + alloc`	✓	—	—	—
Pure Rust (no Python GIL)	✓	—	—	✓

Examples

See the examples/ directory:

dedup.rs — MinHash + LSH end-to-end deduplication
near_dup.rs — SimHash near-duplicate detection
semantic.rs — Local ONNX embedding similarity (requires semantic)
regen_goldens.rs — Regenerate the byte-frozen test fixtures (do not run on a patch release; only when intentionally bumping a minor)

cargo run --example dedup --features lsh --release
cargo run --example near_dup --release
cargo run --example semantic --features semantic --release

Contributing

Contributions welcome. The contract:

Fork the repository.
Branch (git checkout -b feature/x).
Run the matrix locally: cargo test --no-default-features --features "std,minhash,simhash,lsh,tlsh,markup,security,serde,parallel".
Run clippy: cargo clippy --all-targets -- -D warnings.
Run benches if the change touches a hot path: cargo bench.
Never regenerate golden fixtures unless you're explicitly bumping a minor version.
Open a PR. CI gates on fmt, clippy, doc, deny, audit, semver-checks, and a 60-second fuzz smoke (canonicalize and minhash_streaming targets under fuzz/).
Releases: see RELEASING.md.

Development

git clone https://github.com/themankindproject/txtfp
cd txtfp

# Default-feature smoke
cargo test

# Full classical surface (no semantic — pulls heavy ONNX deps)
cargo test --features "lsh,markup,security,serde,parallel,tlsh,cjk"

# Build the docs
cargo doc --no-deps --open

# Run the fuzz harness locally (requires nightly + cargo-fuzz)
cd fuzz && cargo +nightly fuzz run canonicalize -- -max_total_time=60

License

Licensed under the [MIT

txtfp 0.3.0