Expand description
Pure-Rust core for lshdedup: MinHash + Locality-Sensitive Hashing.
§Algorithm
- Shingle. Split each document into k-grams over characters or words. Identical k-grams in different documents drive collisions.
- MinHash. Compute a fixed-size signature using N hash functions
over the shingle set. We use the classical
(a*h(x) + b) mod 2^32family seeded by the user-suppliedseed, so signatures are deterministic. - LSH. Slice the signature into B bands of R = N/B rows each. Hash each band; documents that share any band hash are candidate near-duplicates.
- Verify. For each candidate, compute exact Jaccard from the signatures (Hamming-equality of signature positions / N).
Standard references: Broder 1997 (MinHash) and Indyk & Motwani 1998 (LSH).
Structs§
- Config
- Index configuration.
- Hit
- One match returned by
Index::near_duplicates. - Index
- MinHash + LSH index. Insert documents, then query with a fresh string.
Enums§
- Dedup
Error - All errors surfaced by
lshdedup-core. - Shingle
Unit - Choice of shingle unit.
Type Aliases§
- Result
- Crate-wide result alias.