# txtfp — Internal Architecture & Algorithm Reference
> This document is the **internal** engineering reference for `txtfp`.
> It explains how every algorithm works at the bit level, how data flows
> through the pipeline, and how each module is organized. For the
> user-facing SDK guide, see [USAGE.md](USAGE.md).
---
## Table of Contents
1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Core Algorithms](#core-algorithms)
- [MinHash](#1-minhash-jaccard-similarity)
- [SimHash](#2-simhash-cosine-similarity)
- [TLSH](#3-tlsh-trend-micro-lsh)
- [Banded LSH](#4-banded-lsh)
- [Shingling](#5-shingling)
4. [Data Structures & Memory Layouts](#data-structures--memory-layouts)
5. [Tokenization Internals](#tokenization-internals)
6. [Similarity Metrics](#similarity-metrics)
7. [Edge Cases & Invariants](#edge-cases--invariants)
8. [Module Map](#module-map)
---
## Project Overview
**txtfp** is a Rust SDK (v0.2.2, edition 2024, MSRV 1.88) for extracting compact, deterministic, byte-stable fingerprints from text. It serves:
- LLM training-set deduplication
- RAG retrieval ranking
- Content moderation & plagiarism detection
- Email / document de-dup at scale
### Design Pillars
| **Determinism** | Same input + same config → same bytes. No RNG, no clock. |
| **Byte stability** | `MinHashSig<H>` and `SimHash64` are `repr(C)` / `repr(transparent)` `bytemuck::Pod`. Golden tests enforce on every PR. |
| **Modularity** | Four-stage pipeline; each stage swappable via traits. |
| **Minimal footprint** | Default features compile `no_std + alloc` for WASM. Heavy deps (ONNX, CJK dicts) are opt-in. |
| **Cross-SDK parity** | `EmbeddingProvider`, `Embedding`, `semantic_similarity`, `FORMAT_VERSION` identical across `audiofp` / `imgfprint` / `txtfp`. |
---
## Architecture
### The Four-Stage Pipeline
Every fingerprint flows through four independent, composable stages:
```
input (&str)
│
▼
┌─────────────────────────────────────────────────┐
│ Stage 1: Canonicalizer │
│ NFKC normalize → strip Bidi/Cf → casefold │
│ → optional UTS #39 confusable skeleton │
└─────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Stage 2: Tokenizer │
│
▼
┌─────────────────────────────────────────────────┐
│ Stage 3: Fingerprinter │
│
▼
┌─────────────────────────────────────────────────┐
│ Stage 4: Comparison │
└─────────────────────────────────────────────────┘
```
**Key principle**: Each stage is independently swappable. You can switch tokenizers without touching fingerprinters, change hash families without re-canonicalizing, or compare signatures produced by different pipeline configurations (guarded by `config_hash`).
### Pipeline Flow Diagram
```mermaid
flowchart TD
A["Raw Text Input (&str)"] --> B["Canonicalizer"]
subgraph "Stage 1: Canonicalization"
B --> B1["NFKC Normalize"]
B1 --> B2["Strip Bidi/Cf chars"]
B2 --> B3["Simple Casefold"]
B3 --> B4{"security feature?"}
B4 -->|yes| B5["UTS #39 Confusable Skeleton"]
B4 -->|no| B6["Canonical String"]
B5 --> B6
end
B6 --> C["Tokenizer"]
subgraph "Stage 2: Tokenization"
C --> C1["WordTokenizer (UAX #29)"]
C --> C2["GraphemeTokenizer"]
C --> C3["ShingleTokenizer<T, k>"]
C --> C4["CjkTokenizer (jieba)"]
end
C1 --> D["Token Stream"]
C2 --> D
C3 --> D
C4 --> D
D --> E["Fingerprinter"]
subgraph "Stage 3: Fingerprinting"
E --> E1["MinHash → MinHashSig<H>"]
E --> E2["SimHash → SimHash64"]
E --> E3["TLSH → TlshFingerprint"]
E --> E4["Embedding → Vec<f32>"]
end
E1 --> F["Comparison"]
E2 --> F
E3 --> F
E4 --> F
subgraph "Stage 4: Comparison"
F --> F1["jaccard() → f32 [0,1]"]
F --> F2["hamming() → u32 [0,64]"]
F --> F3["cosine_estimate() → f32 [-1,1]"]
F --> F4["tlsh_distance() → i32"]
F --> F5["semantic_similarity() → f32 [-1,1]"]
end
```
### Runtime Sequence (MinHash path)
```mermaid
sequenceDiagram
participant User
participant FP as MinHashFingerprinter
participant Canon as Canonicalizer
participant Tok as ShingleTokenizer
participant Hash as hash128()
participant Sig as MinHashSig<H>
User->>FP: fingerprint(input)
FP->>Canon: canonicalize(input)
Canon-->>FP: canonical: String
FP->>FP: Initialize slots[0..H] = u64::MAX
FP->>Tok: for_each_token(canonical, callback)
loop For each shingle
Tok-->>FP: &str token
FP->>Hash: hash128(family, token.as_bytes(), seed)
Hash-->>FP: (lo: u64, hi: u64)
FP->>FP: for i in 0..H: slot[i] = min(slot[i], lo + i*hi)
end
FP->>Sig: MinHashSig { schema: 1, hashes: slots }
Sig-->>User: MinHashSig<H>
```
### Module Organization
```
src/
├── lib.rs # Re-exports, VERSION, FORMAT_VERSION
├── error.rs # Error enum (#[non_exhaustive]), Result alias
├── fingerprint.rs # Fingerprint enum, FingerprintMetadata, config_hash
├── canonical/
│ ├── mod.rs # Canonicalizer, CanonicalizerBuilder, fast paths
│ ├── bidi.rs # is_bidi_control(), is_format()
│ ├── casefold.rs # simple() via caseless crate
│ └── confusable.rs # UTS #39 skeleton (security feature)
├── tokenize/
│ ├── mod.rs # Tokenizer trait, TokenStream enum
│ ├── word.rs # WordTokenizer (UAX #29)
│ ├── grapheme.rs # GraphemeTokenizer (UAX #29 extended)
│ ├── shingle.rs # ShingleTokenizer<T> (k-gram adaptor)
│ └── cjk.rs # CjkTokenizer (jieba, OnceLock dict)
├── classical/
│ ├── mod.rs # Fingerprinter + StreamingFingerprinter traits
│ ├── hash.rs # HashFamily enum, hash128(), murmur3_x64_128()
│ ├── minhash/
│ │ ├── mod.rs # Re-exports
│ │ ├── sig.rs # MinHashSig<H> (repr(C), Pod)
│ │ ├── fingerprinter.rs # MinHashFingerprinter, sketch_canonical()
│ │ ├── streaming.rs # MinHashStreaming (16 MiB cap)
│ │ └── jaccard.rs # jaccard() with SIMD (wide crate)
│ ├── simhash/
│ │ ├── mod.rs # Re-exports
│ │ ├── sig.rs # SimHash64 (repr(transparent), Pod)
│ │ ├── fingerprinter.rs # SimHashFingerprinter, accumulate_bits()
│ │ ├── streaming.rs # SimHashStreaming
│ │ └── distance.rs # hamming(), cosine_estimate()
│ ├── lsh/
│ │ ├── mod.rs # Re-exports
│ │ ├── builder.rs # LshIndexBuilder, for_threshold(), quadrature
│ │ └── index.rs # LshIndex<H>, band tables, identity hasher
│ └── tlsh.rs # TlshFingerprinter, tlsh_distance(), MIN_INPUT_BYTES
├── semantic/
│ ├── mod.rs # Re-exports
│ ├── embedding.rs # Embedding struct, validation
│ ├── provider.rs # EmbeddingProvider trait, semantic_similarity()
│ ├── local.rs # LocalProvider (ort 2.0, HF Hub, pooling table)
│ ├── pooling.rs # Pooling enum (Cls, Mean, MeanNoNorm, Max)
│ └── chunk.rs # ChunkingStrategy, chunk_for_model()
└── markup/
├── mod.rs
├── html.rs # html_to_text (html2text, strips script/style)
└── markdown.rs # markdown_to_text (pulldown-cmark)
```
---
## Core Algorithms
### 1. MinHash (Jaccard Similarity)
#### Theory
MinHash estimates the Jaccard similarity `J(A, B) = |A ∩ B| / |A ∪ B|` between two sets by exploiting the property that for a random permutation π:
```
Pr[min π(A) = min π(B)] = J(A, B)
```
Instead of true random permutations (expensive), txtfp uses the **double-hashing construction** (Indyk–Motwani 1998, Kirsch–Mitzenmacher 2008):
1. Hash each token/shingle once with a 128-bit hash → `(lo, hi)`
2. Derive H virtual hash functions: `h_i = lo + i × hi` (wrapping u64 arithmetic)
3. Keep the minimum value seen per slot across all tokens
This reduces per-token cost from O(H) independent hashes to O(1) hash + O(H) additions.
#### Algorithm Walkthrough
```
INPUT: "the quick brown fox" with ShingleTokenizer{k=3, inner=WordTokenizer}
SEED: 0x00C0_FFEE_5EED
HASH: HashFamily::Xxh3_64 (default v0.2.0+)
H: 128 slots
Step 1 — Canonicalize:
"the quick brown fox" → "the quick brown fox" (ASCII fast path)
Step 2 — Tokenize (shingles, k=3):
["the quick brown", "quick brown fox"]
Step 3 — Sketch:
Initialize: slots[0..128] = u64::MAX
Shingle "the quick brown":
(lo, hi) = xxh3_128("the quick brown", seed)
For i = 0..128:
h_i = lo.wrapping_add(i.wrapping_mul(hi))
slots[i] = min(slots[i], h_i)
Shingle "quick brown fox":
(lo, hi) = xxh3_128("quick brown fox", seed)
For i = 0..128:
h_i = lo.wrapping_add(i.wrapping_mul(hi))
slots[i] = min(slots[i], h_i)
Step 4 — Output:
MinHashSig<128> { schema: 1, _pad: [0;6], hashes: slots }
```
#### Comparison: `jaccard(a, b)`
```
matches = count(i where a.hashes[i] == b.hashes[i])
jaccard_estimate = matches / H
Variance: σ² = p(1-p)/H where p = true Jaccard
For H=128, p=0.5: σ = 0.044 (95% CI: ±0.088)
```
#### Hash Families
| `Xxh3_64` (default v0.2+) | ~3× faster | ✗ | `xxh3_128_with_seed` → split into (lo, hi) |
| `MurmurHash3_x64_128` | Reference | ✓ | `murmur3_x64_128(key, seed)` → (h1, h2) |
#### MinHash Pipeline Diagram
```mermaid
flowchart TD
A["Input: 'the quick brown fox'"] --> B["Canonicalizer<br/>(ASCII fast path)"]
B --> C["ShingleTokenizer k=3"]
C --> D["Shingles:<br/>['the quick brown', 'quick brown fox']"]
D --> E["MinHash Sketch"]
subgraph "Double-Hashing Inner Loop"
E --> E1["slots[0..H] = u64::MAX"]
E1 --> E2["For each shingle:"]
E2 --> E3["(lo, hi) = hash128(family, shingle, seed)"]
E3 --> E4["For i in 0..H:"]
E4 --> E5["h_i = lo.wrapping_add(i * hi)"]
E5 --> E6["slots[i] = min(slots[i], h_i)"]
E6 --> E4
E4 -->|done| E7["Next shingle"]
E7 --> E2
E2 -->|all done| E8["MinHashSig<H>"]
end
E8 --> F["Output: schema=1 | pad=0 | hashes[128]"]
G["Sig A"] --> H["jaccard(A, B)"]
F --> H
I["Sig B"] --> H
H --> J["matches/H ≈ J(A,B)"]
```
#### Memory Layout
```mermaid
flowchart LR
subgraph "MinHashSig<H> — repr(C), bytemuck::Pod"
direction LR
S["schema: u16<br/>= 1<br/>[2 bytes]"]
P["_pad: [u8;6]<br/>= 0<br/>[6 bytes]"]
H["hashes: [u64; H]<br/>[8×H bytes, LE]"]
S --> P --> H
end
subgraph "Sizes"
S1["H=64 → 520 bytes"]
S2["H=128 → 1032 bytes"]
S3["H=256 → 2056 bytes"]
end
```
#### Variance & Accuracy
```
0.1 | 0.027 | ±0.053
0.3 | 0.041 | ±0.080
0.5 | 0.044 | ±0.088
0.7 | 0.041 | ±0.080
0.9 | 0.027 | ±0.053
```
#### Key Properties
- **Permutation-invariant**: Token order doesn't matter (set semantics)
- **Duplicate-insensitive**: `min(x, x) = x` — repeated tokens don't change the signature
- **Sub-additive**: Adding tokens can only decrease slot values
- **Deterministic**: Same input + config → identical bytes
---
### 2. SimHash (Cosine Similarity)
#### Theory
SimHash (Charikar 2002) projects a weighted token bag into a single 64-bit hash such that the Hamming distance between two hashes approximates the angular distance between the original vectors:
```
cos(θ) ≈ cos((hamming(a, b) / 64) × π)
```
#### Algorithm Walkthrough
```
INPUT: "the the fox" with Weighting::Tf
SEED: 0x00C0_FFEE_5EED
HASH: HashFamily::Xxh3_64
Step 1 — Canonicalize:
"the the fox" → "the the fox"
Step 2 — Tokenize (WordTokenizer):
Tokens in document order: ["the", "the", "fox"]
Step 3 — Sketch (Tf weighting — streaming hot path):
Initialize: acc[0..64] = 0_i64
Token "the" (1st occurrence):
(lo, _hi) = hash128(Xxh3_64, b"the", seed)
For bit b = 0..63:
if (lo >> b) & 1 == 1: acc[b] += 1 (saturating)
else: acc[b] -= 1 (saturating)
Token "the" (2nd occurrence):
Same hash → same bits → acc amplified again
acc[b] += 1 or -= 1 for each bit
Token "fox":
(lo, _hi) = hash128(Xxh3_64, b"fox", seed)
Different hash bits → different acc updates
Step 4 — Round to bits:
result: u64 = 0
For b = 0..63:
if acc[b] > 0: result |= (1 << b)
Output: SimHash64(result)
```
#### The `accumulate_bits` Inner Loop
This is the performance-critical function:
```rust
fn accumulate_bits(acc: &mut [i64; 64], lo: u64, w: i64) {
for (b, slot) in acc.iter_mut().enumerate() {
if (lo >> b) & 1 == 1 {
*slot = slot.saturating_add(w);
} else {
*slot = slot.saturating_sub(w);
}
}
}
```
**Saturating arithmetic** prevents overflow from adversarial inputs. LLVM auto-vectorizes this loop on x86_64 (verified: `vpcmpltuq` + AVX-512 mask blending).
#### Weighting Strategies
| `Tf` (default) | ±1 per occurrence | No | ✓ Streaming |
| `Uniform` | ±1 per distinct token | Yes | ✗ |
| `IdfWeighted(table)` | ±(tf × idf) | Yes | ✗ |
**Tf streaming insight**: `+1+1+1` for a triplicate token is mathematically identical to `+3` after dedup. The streaming path avoids the HashMap entirely.
#### Comparison
```
hamming(a, b) = popcnt(a.0 ^ b.0) // 0..=64, hardware POPCNT
cosine_estimate(a, b) = cos((hamming / 64) × π) // [-1.0, 1.0]
```
| 0 | 1.0 | Identical |
| 8 | 0.92 | Very similar |
| 16 | 0.71 | Similar |
| 32 | 0.0 | Uncorrelated |
| 48 | -0.71 | Dissimilar |
| 64 | -1.0 | Opposite |
#### SimHash Pipeline Diagram
```mermaid
flowchart TD
A["Input: 'the the fox'"] --> B["Canonicalizer"]
B --> C["WordTokenizer"]
C --> D["Tokens: ['the', 'the', 'fox']"]
D --> E["SimHash Sketch"]
subgraph "accumulate_bits (Tf streaming)"
E --> E1["acc[64] = {0}"]
E1 --> E2["For each token occurrence:"]
E2 --> E3["(lo, _) = hash128(token)"]
E3 --> E4["For b = 0..63:"]
E4 --> E5{"bit b set?"}
E5 -->|1| E6["acc[b] += 1 (saturating)"]
E5 -->|0| E7["acc[b] -= 1 (saturating)"]
E6 --> E8["next bit"]
E7 --> E8
E8 --> E4
E4 -->|done| E9["next token"]
E9 --> E2
end
E2 -->|all done| F["Round: bit[b] = 1 if acc[b] > 0"]
F --> G["SimHash64(u64)"]
H["SimHash64 A"] --> I["XOR → POPCNT"]
G --> I
J["SimHash64 B"] --> I
I --> K["hamming distance"]
K --> L["cos((dist/64) × π)"]
```
---
### 3. TLSH (Trend Micro LSH)
#### Theory
TLSH (Oliver et al. 2013) is a byte-level locality-sensitive hash. Unlike MinHash/SimHash which operate on tokenized text, TLSH processes **raw bytes** through a trigram sliding window and quantized histogram. It captures statistical features of the byte stream.
#### Algorithm Walkthrough
```
INPUT: "the quick brown fox jumps over the lazy dog" (44 bytes)
(Must be ≥ 50 bytes after canonicalization — this example is simplified)
VARIANT: 128/1 (32-byte body + 1-byte checksum)
Step 1 — Canonicalize:
"the quick brown fox..." → lowercase bytes
Step 2 — Convert to bytes and feed to tlsh2 crate:
a. TRIGRAM SLIDING WINDOW:
Bytes: [t, h, e, ' ', q, u, i, c, k, ...]
Window 0: [t, h, e] → hash → bucket index
Window 1: [h, e, ' '] → hash → bucket index
Window 2: [e, ' ', q] → hash → bucket index
... (n-2 windows for n bytes)
b. HISTOGRAM:
256 buckets accumulate trigram counts
Checksum accumulated separately
c. QUANTIZATION:
Bucket counts → quartile boundaries → 2-bit codes per bucket
Header: length, quartile ratios, checksum
d. OUTPUT:
70-char hex string: "T1" prefix + header + body
e.g. "T1A12..." (128/1 variant)
Step 3 — Compare:
tlsh_distance(A, B) = header_diff + body_diff
Returns i32: lower = more similar
```
#### Distance Thresholds
| 0 | Identical | Exact duplicate |
| 1–50 | High | Near-duplicate |
| 51–100 | Moderate | Related content |
| 101–200 | Low | Different content |
| 200+ | Very low | Unrelated |
#### TLSH Pipeline Diagram
```mermaid
flowchart TD
A["Input Text (≥50 bytes)"] --> B["Canonicalizer"]
B --> C["as_bytes()"]
C --> D["tlsh2::TlshBuilder128_1"]
subgraph "TLSH Internal (tlsh2 crate)"
D --> D1["Trigram sliding window"]
D1 --> D2["256-bucket histogram"]
D2 --> D3["Quartile quantization"]
D3 --> D4["Header (length + quartiles + checksum)"]
D4 --> D5["Body (32 bytes, 2-bit codes)"]
end
D5 --> E["TlshFingerprint (70-char hex)"]
F["Fingerprint A"] --> G["tlsh_distance()"]
E --> G
H["Fingerprint B"] --> G
G --> I["i32 distance (lower = more similar)"]
```
#### Key Differences from MinHash/SimHash
- Operates on **raw bytes**, not tokens
- Bypasses the tokenizer entirely (uses `sketch_bytes()` internally)
- Minimum 50 bytes input requirement
- Distance metric (lower = better) vs similarity metric (higher = better)
- Wraps the `tlsh2` crate — txtfp doesn't reimplement the algorithm
---
### 4. Banded LSH
#### Theory
Banded LSH collapses O(N) brute-force similarity search to O(1) average-case by partitioning MinHash signatures into bands and hashing each band into buckets:
```
P(at least one band collision) = 1 - (1 - t^r)^b
where:
t = true Jaccard similarity
b = number of bands
r = rows per band (b × r = H)
```
This produces an S-curve: documents above the threshold almost always collide; documents below almost never do.
#### Algorithm Walkthrough
```
SETUP: H=128, threshold=0.7
OPTIMIZATION: LshIndexBuilder::for_threshold(0.7, 128)
→ Enumerates factor pairs: (1,128), (2,64), (4,32), (8,16), (16,8), (32,4), (64,2), (128,1)
→ 200-point trapezoidal quadrature for each pair
→ Minimizes: ∫₀ᵗ P(s)ds + ∫ₜ¹ (1-P(s))ds
→ Picks optimal: b=16, r=8
INDEXING Document A (MinHashSig<128>):
Band 0: hashes[0..8] → xxh3_64(bytes) → bucket_key_0
Band 1: hashes[8..16] → xxh3_64(bytes) → bucket_key_1
...
Band 15: hashes[120..128] → xxh3_64(bytes) → bucket_key_15
Insert doc_id into each band's HashMap at the computed key.
QUERYING with probe signature:
1. Compute 16 band keys from probe
2. Probe each band's HashMap
3. Union all candidate doc_ids (deduplicate)
4. Optional: verify each candidate with exact jaccard()
5. Return filtered results
```
#### Probability Examples
```
threshold=0.7, b=16, r=8:
t=0.5: P = 1-(1-0.5⁸)¹⁶ = 1-(1-0.0039)¹⁶ = 0.061 (6% — good, low FP)
t=0.7: P = 1-(1-0.7⁸)¹⁶ = 1-(1-0.0576)¹⁶ = 0.604 (60% — transition)
t=0.9: P = 1-(1-0.9⁸)¹⁶ = 1-(1-0.4305)¹⁶ = 0.9995 (99.95% — high recall)
```
#### Band/Row Trade-offs (H=128)
| (bands, rows) | Sweet-spot threshold | Character |
|---------------|---------------------|-----------|
| (8, 16) | ~0.95 | Exact dedup only |
| (16, 8) | ~0.85 | Strict near-dup |
| (32, 4) | ~0.65 | Moderate fuzzy |
| (64, 2) | ~0.45 | High recall |
#### LSH Index Structure Diagram
```mermaid
flowchart TD
subgraph "LshIndex<128> (b=16, r=8)"
A["MinHashSig<128>"] --> B["Partition: 16 bands × 8 rows"]
B --> B0["Band 0: hashes[0..8]"]
B --> B1["Band 1: hashes[8..16]"]
B --> BN["..."]
B --> B15["Band 15: hashes[120..128]"]
B0 --> C0["xxh3_64(8 × u64 bytes) → key"]
B1 --> C1["xxh3_64 → key"]
B15 --> C15["xxh3_64 → key"]
C0 --> T0["Band 0 HashMap<br/>(identity hasher)<br/>key → SmallVec<doc_id>"]
C1 --> T1["Band 1 HashMap"]
C15 --> T15["Band 15 HashMap"]
end
subgraph "Query"
Q["Probe sig"] --> Q1["Compute 16 band keys"]
Q1 --> Q2["Probe each HashMap"]
Q2 --> Q3["Union candidates"]
Q3 --> Q4["verify: jaccard(probe, candidate) ≥ threshold"]
Q4 --> Q5["Results: Vec<u64>"]
end
```
#### `for_threshold` Optimization Diagram
```mermaid
flowchart LR
A["Input: threshold=0.7, H=128"] --> B["Enumerate factor pairs of 128"]
B --> C["For each (b, r):"]
C --> D["FP = ∫₀⁰·⁷ (1-(1-t^r)^b) dt<br/>(200-pt trapezoid)"]
C --> E["FN = ∫₀.₇¹ (1-(1-(1-t^r)^b)) dt"]
D --> F["Error = FP + FN"]
E --> F
F --> G["Track minimum"]
G --> H["Return best (b, r)"]
```
#### Internal Data Structures
- **Band tables**: `Vec<HashMap<u64, SmallVec<[u64; 4]>>>` — one HashMap per band
- **Identity hasher**: Band keys are already xxh3_64 digests; re-hashing is waste
- **Reverse map**: `HashMap<u64, MinHashSig<H>>` — stores signatures for `query_with_threshold` verification
- **Parallel insert** (`extend_par`): Shards by band across rayon pool, contention-free
---
### 5. Shingling
#### Theory
Shingling converts a token stream into overlapping k-grams, preserving local word order. This is the standard input representation for MinHash-based document similarity.
#### Algorithm Walkthrough
```
INPUT: "the quick brown fox" with k=3, inner=WordTokenizer
Step 1 — Inner tokenization:
WordTokenizer → ["the", "quick", "brown", "fox"] (n=4 tokens)
Step 2 — Sliding window (k=3):
Number of shingles = n - k + 1 = 4 - 3 + 1 = 2
Window 0: tokens[0..3] = ["the", "quick", "brown"] → "the quick brown"
Window 1: tokens[1..4] = ["quick", "brown", "fox"] → "quick brown fox"
Step 3 — Output:
["the quick brown", "quick brown fox"]
```
#### Edge Cases
```
k=0: → empty stream (no shingles)
k=1: → same as inner tokenizer (each token is its own shingle)
k > n (e.g., k=5, n=2 tokens "a b"):
→ single shingle: "a b" (datasketch convention)
```
#### `for_each_token` Hot Path (zero-allocation)
```mermaid
flowchart TD
A["Input text"] --> B["Inner tokenizer: for_each_token"]
B --> C["Concatenate all tokens into flat String"]
C --> D["Record byte ranges: Vec<(start, end)>"]
D --> E["For window i = 0..=(n-k):"]
E --> F["Extract ranges[i..i+k] from flat"]
F --> G["Join with ' ' into reusable buffer"]
G --> H["Callback(&buffer)"]
H --> E
E -->|done| I["End — zero per-shingle allocation"]
```
#### Shingle Count Formula
```
shingles(n, k) = max(1, n - k + 1) if n ≥ 1
= 0 if n = 0
Special: if n < k and n ≥ 1 → returns 1 (all tokens joined)
```
---
## Data Structures & Memory Layouts
### MinHashSig<H>
```rust
#[repr(C)]
pub struct MinHashSig<const H: usize> {
pub schema: u16, // = 1 (frozen)
pub _pad: [u8; 6], // = 0 (alignment)
pub hashes: [u64; H], // little-endian minimum hash values
}
// Total: 8 + 8*H bytes
// Properties: bytemuck::Pod, Copy, PartialEq, Eq, Hash
// Zero-copy: bytemuck::cast_slice(&[MinHashSig<H>]) → &[u8]
```
### SimHash64
```rust
#[repr(transparent)]
pub struct SimHash64(pub u64);
// Total: 8 bytes (little-endian)
// Properties: bytemuck::Pod, Copy, PartialEq, Eq, Hash
```
### TlshFingerprint
```rust
pub struct TlshFingerprint(pub String);
// 70-char hex string for 128/1 variant
// Not Pod — wraps the tlsh2 crate's output
```
### Embedding
```rust
pub struct Embedding {
pub vector: Vec<f32>, // n-dimensional dense vector
pub model_id: Option<String>, // producer model identifier
}
// Validation: rejects empty, NaN, ±Inf at construction
// Methods: dot(), normalize(), l2_norm(), dim()
```
### FingerprintMetadata
```rust
pub struct FingerprintMetadata {
pub algorithm: &'static str, // "minhash-h128", "simhash-b64", etc.
pub config_hash: u64, // xxh3_64 of canonicalizer|tokenizer|algo
pub model_id: Option<String>,
pub schema_version: u16,
pub byte_size: usize,
}
```
### Fingerprint (unified enum)
```rust
pub enum Fingerprint {
MinHash(MinHashSig<128>), // minhash feature
SimHash(SimHash64), // simhash feature
Tlsh(TlshFingerprint), // tlsh feature
Embedding(Embedding), // semantic feature
}
```
---
## Tokenization Internals
### Tokenizer Trait
```rust
pub trait Tokenizer: Send + Sync {
fn tokens<'a>(&'a self, input: &'a str) -> TokenStream<'a>;
fn name(&self) -> Cow<'static, str>;
fn for_each_token(&self, input: &str, f: &mut dyn FnMut(&str));
}
```
Two paths:
- `tokens()` → `TokenStream` (iterator, may allocate)
- `for_each_token()` → callback (zero-allocation hot path, used by all classical sketchers)
### Implementations
| `WordTokenizer` | UAX #29 word boundaries, filters non-word | Zero-sized, zero-alloc | `"word-uax29"` |
| `GraphemeTokenizer` | UAX #29 extended grapheme clusters | Zero-sized, zero-alloc | `"grapheme-uax29"` |
| `ShingleTokenizer<T>` | k-gram windows over inner T | Reusable buffer + range table | `"shingle-k=<k>/<inner>"` |
| `CjkTokenizer` | jieba-rs segmentation (Simplified Chinese) | OnceLock dictionary | `"cjk-jieba"` |
### Notable Behaviors
- **WordTokenizer**: "don't" = 1 token (UAX #29 MidLetter rule). Filters whitespace/punctuation.
- **GraphemeTokenizer**: `🇺🇸` = 1 token, `👨👩👧👦` = 1 token, `é` (combining) = 1 token. Yields everything including whitespace.
- **ShingleTokenizer**: k=0 → empty. k > n → single shingle of all tokens.
- **CjkTokenizer**: Dictionary loaded once via `OnceLock`. Falls back to per-codepoint on missing dict.
---
## Similarity Metrics
| `jaccard(a, b)` | `&MinHashSig<H>` × 2 | `f32 [0,1]` | matching slots / H | SIMD (wide crate, 4 lanes) |
| `hamming(a, b)` | `SimHash64` × 2 | `u32 [0,64]` | `popcnt(a ^ b)` | POPCNT (x86), cnt (AArch64) |
| `cosine_estimate(a, b)` | `SimHash64` × 2 | `f32 [-1,1]` | `cos((hamming/64) × π)` | — |
| `tlsh_distance(a, b)` | `&TlshFingerprint` × 2 | `Result<i32>` | header_diff + body_diff | — |
| `semantic_similarity(a, b)` | `&Embedding` × 2 | `Result<f32> [-1,1]` | `dot(a,b) / (‖a‖×‖b‖)` | — |
### Guards on `semantic_similarity`
- `model_id` mismatch → `Error::ModelMismatch`
- Dimension mismatch → `Error::DimensionMismatch`
- Zero-norm vector → `Error::InvalidInput`
---
## Edge Cases & Invariants
### Canonicalization
| Empty input | Returns `""` |
| Pure ASCII | Fast path: `to_ascii_lowercase()` (~540 ns / 5 KB) |
| ASCII + BOM/ZWSP/RLO | Fast path: filter-and-lowercase (17× faster than full pipeline) |
| `İ` (U+0130) | Expands to `i` + combining dot; re-normalizes |
| Bidi between combining marks | Pre-normalization strip prevents reordering |
| Idempotency | `canonicalize(canonicalize(x)) == canonicalize(x)` (fuzz-proven) |
### Fingerprinting
| Empty / whitespace-only input | `Error::InvalidInput("empty document")` |
| UTF-8 split across chunks (streaming) | Carries partial sequences to next `update()` |
| Trailing incomplete UTF-8 at `finalize()` | `Error::InvalidInput` |
| Buffer > 16 MiB (streaming) | `Error::InvalidInput` (configurable via `with_max_bytes`) |
| TLSH input < 50 bytes | `Error::InvalidInput` |
### Comparison
| Different `model_id` on embeddings | `Error::ModelMismatch` |
| Different dimensions | `Error::DimensionMismatch` |
| Wrong schema version on deser | `Error::SchemaMismatch` |
| Different `config_hash` | Caller-enforced (not auto-checked by comparison fns) |
---
## Module Map
### Key Constants
```rust
pub const VERSION: &str = "0.2.2";
pub const FORMAT_VERSION: u32 = 1; // Cross-SDK (audiofp, imgfprint, txtfp)
pub const DEFAULT_SEED: u64 = 0x00C0_FFEE_5EED; // MinHash + SimHash
pub const MIN_INPUT_BYTES: usize = 50; // TLSH minimum
```
### Trait Hierarchy
```
Fingerprinter (offline, &self)
├── MinHashFingerprinter<T, H> → MinHashSig<H>
├── SimHashFingerprinter<T> → SimHash64
└── TlshFingerprinter → TlshFingerprint
StreamingFingerprinter (chunked, &mut self)
├── MinHashStreaming<T, H> → MinHashSig<H>
└── SimHashStreaming<T> → SimHash64
Tokenizer (Send + Sync)
├── WordTokenizer
├── GraphemeTokenizer
├── ShingleTokenizer<T>
└── CjkTokenizer
EmbeddingProvider (Send + Sync)
└── LocalProvider (ort + HF Hub)
```
### Feature Gate Map
```
default = ["std", "minhash", "simhash", "lsh"]
minhash → MinHashFingerprinter, MinHashSig, jaccard, MinHashStreaming
simhash → SimHashFingerprinter, SimHash64, hamming, cosine_estimate, SimHashStreaming
lsh → LshIndex, LshIndexBuilder (requires minhash)
tlsh → TlshFingerprinter, tlsh_distance (dep: tlsh2)
semantic → LocalProvider, Embedding, semantic_similarity (deps: ort, tokenizers, hf-hub)
cjk → CjkTokenizer with jieba, Simplified Chinese only (dep: jieba-rs)
markup → html_to_text, markdown_to_text (deps: html2text, pulldown-cmark)
security → UTS #39 confusable skeleton (dep: unicode-security)
serde → Serialize/Deserialize on all signature types
parallel → LshIndex::extend_par (dep: rayon)
```
### Testing Infrastructure
- **Golden tests** (`tests/data/golden/`): 18 byte-frozen fixtures. Failing = major-version bump required.
- **Property tests** (proptest): Canonicalization idempotency, MinHash permutation invariance, SimHash determinism.
- **Fuzz harnesses** (`fuzz/`): `canonicalize` and `minhash_streaming` targets.
- **Integration tests**: `golden_minhash.rs`, `golden_simhash.rs`, `golden_canonical.rs`, `tlsh.rs`, `lsh.rs`, `adversarial.rs`, `i18n.rs`.
- **CI gates**: fmt, clippy, doc, deny, audit, semver-checks, 60s fuzz smoke.