Skip to main content

Crate lshdedup_core

Crate lshdedup_core 

Source
Expand description

Pure-Rust core for lshdedup: MinHash + Locality-Sensitive Hashing.

§Algorithm

  1. Shingle. Split each document into k-grams over characters or words. Identical k-grams in different documents drive collisions.
  2. MinHash. Compute a fixed-size signature using N hash functions over the shingle set. We use the classical (a*h(x) + b) mod 2^32 family seeded by the user-supplied seed, so signatures are deterministic.
  3. LSH. Slice the signature into B bands of R = N/B rows each. Hash each band; documents that share any band hash are candidate near-duplicates.
  4. Verify. For each candidate, compute exact Jaccard from the signatures (Hamming-equality of signature positions / N).

Standard references: Broder 1997 (MinHash) and Indyk & Motwani 1998 (LSH).

Structs§

Config
Index configuration.
Hit
One match returned by Index::near_duplicates.
Index
MinHash + LSH index. Insert documents, then query with a fresh string.

Enums§

DedupError
All errors surfaced by lshdedup-core.
ShingleUnit
Choice of shingle unit.

Type Aliases§

Result
Crate-wide result alias.