sketchir
Sketching primitives for IR: MinHash/SimHash/LSH-style signatures.
What it is
sketchir is the index-only layer for:
- near-duplicate detection (shingles + MinHash)
- text fingerprinting (SimHash)
- approximate similarity candidate generation (LSH)
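
For readers new to the technique, the shingles + MinHash idea can be sketched with the standard library alone. This is an illustrative sketch, not sketchir's implementation: the function names (`char_shingles`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and a seeded `DefaultHasher` stands in for a proper hash family.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Split text into overlapping character n-grams (shingles).
/// Texts shorter than `n` yield a single shingle holding the whole text.
fn char_shingles(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.is_empty() {
        return HashSet::new();
    }
    if chars.len() < n {
        return std::iter::once(text.to_string()).collect();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// One signature slot per seed: the minimum seeded hash over all shingles.
fn minhash_signature(shingles: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX) // empty shingle set
        })
        .collect()
}

/// The fraction of matching slots estimates the Jaccard similarity
/// of the two underlying shingle sets.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Two texts with identical shingle sets produce identical signatures, so the estimate is 1.0; texts with no shingles in common match only on hash collisions, so the estimate stays near 0.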
Best starting points
- Near-duplicate detection: `MinHashTextLSH` + `BlockingConfig`
- SimHash fingerprints: `SimHashFingerprint` / `SimHashLSH`
- Dense-vector LSH (batch): `LSHIndex` -- multi-table random projection, add/build/search lifecycle
- Dense-vector LSH (incremental): `DenseSimHashLSH` -- SimHash + Hamming-1 probing, no build step
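
To make "SimHash + Hamming-1 probing, no build step" concrete, here is a toy sketch of the general technique. It is not the `DenseSimHashLSH` API: `ToySimHashIndex`, `simhash_code`, and `hamming1_probes` are hypothetical names, and the hyperplanes are supplied by the caller, whereas a real index would typically draw them at random.

```rust
use std::collections::HashMap;

/// Sign-bit code: bit i is set when the dot product with hyperplane i is >= 0.
fn simhash_code(v: &[f32], hyperplanes: &[Vec<f32>]) -> u64 {
    hyperplanes.iter().enumerate().fold(0u64, |code, (i, h)| {
        let dot: f32 = v.iter().zip(h).map(|(a, b)| a * b).sum();
        if dot >= 0.0 { code | (1u64 << i) } else { code }
    })
}

/// The code itself plus every code at Hamming distance exactly 1.
fn hamming1_probes(code: u64, num_bits: u32) -> Vec<u64> {
    let mut probes = Vec::with_capacity(num_bits as usize + 1);
    probes.push(code);
    for bit in 0..num_bits {
        probes.push(code ^ (1u64 << bit));
    }
    probes
}

/// Hypothetical incremental index: items are searchable right after insert.
struct ToySimHashIndex {
    hyperplanes: Vec<Vec<f32>>,
    buckets: HashMap<u64, Vec<usize>>,
}

impl ToySimHashIndex {
    fn new(hyperplanes: Vec<Vec<f32>>) -> Self {
        assert!(hyperplanes.len() <= 64, "codes are packed into a u64");
        Self { hyperplanes, buckets: HashMap::new() }
    }

    fn insert(&mut self, id: usize, v: &[f32]) {
        let code = simhash_code(v, &self.hyperplanes);
        self.buckets.entry(code).or_default().push(id);
    }

    /// Candidate ids from the query's bucket and its Hamming-1 neighbours.
    fn query(&self, v: &[f32]) -> Vec<usize> {
        let code = simhash_code(v, &self.hyperplanes);
        let mut out = Vec::new();
        for probe in hamming1_probes(code, self.hyperplanes.len() as u32) {
            if let Some(ids) = self.buckets.get(&probe) {
                out.extend_from_slice(ids);
            }
        }
        out
    }
}
```

Probing the neighbouring buckets trades a few extra lookups per query (one per bit) for better recall on vectors that fall just across a single hyperplane.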
Tuning knobs (BlockingConfig)
| Param | Typical | Tradeoff |
|---|---|---|
| `ngram_size` | 3-9 chars | Smaller = more sensitive to noise; larger = stricter. |
| `num_bands * num_hashes_per_band` | 100-256 | More = better Jaccard estimation, higher storage. |
| `num_bands` | 20-50 | Controls the recall/precision curve (S-curve). |
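
The S-curve mentioned in the table is the standard banded-LSH result: with `b` bands of `r` hashes each, two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b`. The helper functions below are a hypothetical sketch for exploring settings, not part of sketchir; their parameters mirror `num_bands` and `num_hashes_per_band`.

```rust
/// Probability that two documents with Jaccard similarity `s` collide
/// in at least one band: 1 - (1 - s^r)^b.
fn candidate_probability(s: f64, num_bands: u32, num_hashes_per_band: u32) -> f64 {
    let band_match = s.powi(num_hashes_per_band as i32);
    1.0 - (1.0 - band_match).powi(num_bands as i32)
}

/// Rule-of-thumb similarity at which the S-curve rises most steeply:
/// approximately (1 / b)^(1 / r).
fn approx_threshold(num_bands: u32, num_hashes_per_band: u32) -> f64 {
    (1.0 / num_bands as f64).powf(1.0 / num_hashes_per_band as f64)
}
```

For a fixed total number of hashes, more bands (fewer hashes per band) push the threshold down and favor recall; fewer bands push it up and favor precision.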
Example (MinHash blocking)
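use sketchir::{BlockingConfig, MinHashTextLSH};

// Assumed signatures: new(cfg) and insert_text(id, text); check the crate docs.
let cfg = BlockingConfig::default();
let mut index = MinHashTextLSH::new(cfg).unwrap();
index.insert_text("doc-a", "the quick brown fox jumps over the lazy dog");
index.insert_text("doc-b", "the quick brown fox jumps over the lazy dog!");
// Near-identical texts should share at least one band and surface as a pair.
let pairs = index.candidate_pairs();
assert!(!pairs.is_empty());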
License
MIT OR Apache-2.0