sketchir 0.2.0

Sketching primitives for IR: MinHash/SimHash/LSH-style signatures.

Tuning Knobs

Param                              Typical      Tradeoff
ngram_size                         3-9 chars    Smaller = more sensitive to noise; larger = stricter.
num_bands * num_hashes_per_band    100-256      More = better Jaccard estimation, higher storage.
num_bands                          20-50        Controls the recall/precision curve (S-curve).
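
The S-curve in the table can be made concrete with the standard LSH banding analysis: two items with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, where b is the number of bands and r the number of hashes per band. The sketch below is standalone and does not use sketchir; the function names are hypothetical, and it assumes the crate follows the usual banding scheme.

```rust
/// Probability that two items with Jaccard similarity `s` agree on every
/// hash in at least one band, given `bands` bands of `rows` hashes each.
/// Standard banding formula: 1 - (1 - s^rows)^bands.
fn candidate_probability(s: f64, bands: u32, rows: u32) -> f64 {
    1.0 - (1.0 - s.powi(rows as i32)).powi(bands as i32)
}

/// Rough similarity at which the S-curve rises most steeply: (1/bands)^(1/rows).
fn approx_threshold(bands: u32, rows: u32) -> f64 {
    (1.0 / bands as f64).powf(1.0 / rows as f64)
}
```

With 20 bands of 5 hashes (100 hashes in total), the approximate threshold falls near 0.55: pairs well above it are almost always returned as candidates, and pairs well below it rarely are. Raising num_bands at a fixed total shifts the threshold down (more recall, more false candidates).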

What it is

sketchir is the index-only layer for:

  • near-duplicate detection (shingles + MinHash)
  • text fingerprinting (SimHash)
  • approximate similarity candidate generation (LSH)
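
As a rough illustration of the first item, the following standalone sketch shingles text into character n-grams and builds a MinHash signature using only the standard library. It is not sketchir's implementation: the function names, the seeded use of DefaultHasher, and the signature layout are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Character n-gram shingles of `text`. `n` must be non-zero; the result is
/// empty when the text has fewer than `n` characters.
fn shingles(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// MinHash signature: for each seed, the minimum seeded hash over all shingles.
/// An empty shingle set yields a signature filled with u64::MAX.
fn minhash(shingles: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Estimated Jaccard similarity: the fraction of signature positions that agree.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

The estimate converges on the true Jaccard similarity of the two shingle sets as the number of hashes grows, which is why more hashes cost storage but tighten the estimate.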

Best starting points

  • Near-duplicate detection: MinHashTextLSH + BlockingConfig
  • SimHash fingerprints: SimHashFingerprint / SimHashLSH
  • Generic LSH interface: LSHIndex
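
For orientation, a SimHash fingerprint can be sketched in a few lines of standalone Rust. This is a minimal, unweighted 64-bit variant over whitespace-separated tokens and is not the code behind SimHashFingerprint; the function names and the choice of DefaultHasher are assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash: every token votes +1 or -1 on each bit position according
/// to its hash, and the fingerprint keeps the bits with a positive total.
fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.hash(&mut h);
        let bits = h.finish();
        for (i, c) in counts.iter_mut().enumerate() {
            if (bits >> i) & 1 == 1 {
                *c += 1;
            } else {
                *c -= 1;
            }
        }
    }
    counts
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &c)| if c > 0 { acc | (1u64 << i) } else { acc })
}

/// Hamming distance between two fingerprints; smaller means more similar.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Texts that share most of their tokens tend to produce fingerprints a small Hamming distance apart, so a distance threshold can serve as a cheap near-duplicate test.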

Example (MinHash blocking)

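use sketchir::{BlockingConfig, MinHashTextLSH};

// Build an index with the default blocking parameters.
let cfg = BlockingConfig::default();
let mut index = MinHashTextLSH::new(cfg).unwrap();

// Insert two near-duplicate texts under the ids "a" and "b".
index.insert_text("a", "hello world");
index.insert_text("b", "hello  world!");

// The near-duplicates are expected to come back as a candidate pair.
let pairs = index.candidate_pairs();
assert!(!pairs.is_empty());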

License

MIT OR Apache-2.0