sketchir: sketching primitives for IR.
This crate is intended for index-only similarity sketches used in:
- near-duplicate detection (MinHash / shingles)
- text fingerprinting (SimHash)
- approximate similarity search (LSH-style candidate generation)
Scope here is primitives: signatures, basic indexing, deterministic behavior. Higher-level workflows (crawl dedupe pipelines, content extraction, etc.) belong elsewhere.
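To make the MinHash / shingles idea concrete, here is a minimal, std-only sketch of Jaccard estimation over word shingles. It does not use or mirror this crate's API: `shingles`, `seeded_hash`, `minhash_signature`, and `estimate_jaccard` are hypothetical names chosen for illustration, and the seeded use of `DefaultHasher` is an assumption made to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Split text into overlapping word k-shingles (hypothetical helper).
/// Texts with fewer than k words produce an empty set.
fn shingles(text: &str, k: usize) -> HashSet<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    // `windows` panics on 0, so clamp k to at least 1.
    words.windows(k.max(1)).map(|w| w.join(" ")).collect()
}

/// Hash one shingle under a seed; each seed simulates one hash function.
/// Note: `DefaultHasher::new()` is deterministic within a build, but its
/// algorithm is not guaranteed stable across Rust releases. A real index
/// that persists signatures would want a fixed, documented hash instead.
fn seeded_hash(seed: u64, item: &str) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    item.hash(&mut h);
    h.finish()
}

/// MinHash signature: for each seed, keep the minimum hash over the set.
/// An empty set yields u64::MAX in every slot.
fn minhash_signature(set: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            set.iter()
                .map(|s| seeded_hash(seed, s))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Estimate Jaccard similarity as the fraction of matching signature slots.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Longer signatures reduce the variance of the estimate at the cost of more hashing per document; the slot count (`num_hashes` here) is the main knob in this sketch.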
§Dense-vector LSH: LSHIndex vs DenseSimHashLSH
Both index dense f32 vectors but differ in lifecycle and recall strategy:
- LSHIndex: batch workflow (add, build, search). Multi-table random projection with tunable num_tables/num_functions for recall control.
- DenseSimHashLSH: incremental insertion (no build step). Single SimHash table with Hamming-distance-1 neighbor probing for approximate recall.
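The single-table, Hamming-distance-1 probing strategy can be sketched as follows. This is an illustrative toy, not the `DenseSimHashLSH` implementation: `ToySimHashIndex`, its methods, and the `SplitMix64` generator are all hypothetical, and the hyperplanes are drawn uniformly from [-1, 1) as a simplification (Gaussian components are the more usual choice for sign random projection).

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [-1, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        let unit = (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32;
        unit * 2.0 - 1.0
    }
}

/// Hypothetical single-table SimHash index over dense f32 vectors.
struct ToySimHashIndex {
    planes: Vec<Vec<f32>>,             // one random hyperplane per fingerprint bit
    buckets: HashMap<u64, Vec<usize>>, // fingerprint -> ids stored under it
}

impl ToySimHashIndex {
    fn new(dim: usize, bits: usize, seed: u64) -> Self {
        assert!((1..=64).contains(&bits), "fingerprint must fit in a u64");
        let mut rng = SplitMix64(seed);
        let planes: Vec<Vec<f32>> = (0..bits)
            .map(|_| (0..dim).map(|_| rng.next_f32()).collect::<Vec<f32>>())
            .collect();
        Self { planes, buckets: HashMap::new() }
    }

    /// Bit i is set when the vector lies on the non-negative side of plane i.
    fn fingerprint(&self, v: &[f32]) -> u64 {
        let mut fp = 0u64;
        for (i, plane) in self.planes.iter().enumerate() {
            let dot: f32 = plane.iter().zip(v).map(|(p, x)| p * x).sum();
            if dot >= 0.0 {
                fp |= 1u64 << i;
            }
        }
        fp
    }

    /// Incremental insertion: no separate build step.
    fn insert(&mut self, id: usize, v: &[f32]) {
        let fp = self.fingerprint(v);
        self.buckets.entry(fp).or_default().push(id);
    }

    /// Probe the exact bucket plus every bucket one bit flip away.
    fn candidates(&self, query: &[f32]) -> Vec<usize> {
        let fp = self.fingerprint(query);
        let mut out: Vec<usize> = self.buckets.get(&fp).cloned().unwrap_or_default();
        for i in 0..self.planes.len() {
            if let Some(ids) = self.buckets.get(&(fp ^ (1u64 << i))) {
                out.extend_from_slice(ids);
            }
        }
        out.sort_unstable();
        out.dedup();
        out
    }
}
```

In this sketch a query touches `bits + 1` buckets, so probing cost grows linearly with fingerprint width. A multi-table design instead spends memory on extra tables to raise recall, which is the trade-off the comparison above describes.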
Re-exports§
pub use blocking::BlockingConfig;
pub use blocking::MinHashTextLSH;
pub use dense_simhash::DenseSimHashLSH;
pub use lsh::LSHIndex;
pub use lsh::MinHashLSH;
pub use lsh::SimHashLSH;
pub use minhash::MinHash;
pub use minhash::MinHashSignature;
pub use simhash::simhash_fingerprint;
pub use simhash::SimHashFingerprint;
Modules§
- blocking
- Text blocking helpers built on MinHash + LSH.
- dense_simhash
- Dense-vector SimHash and an embedding LSH index.
- lsh
- LSH-style indexing helpers.
- minhash
- MinHash for Jaccard similarity estimation.
- simhash
- SimHash: binary fingerprints for fast near-duplicate detection.
Enums§
- Error
- Errors for sketchir indexes and operations.