Crate sketchir

sketchir: sketching primitives for information retrieval (IR).

This crate is intended for index-only similarity sketches used in:

  • near-duplicate detection (MinHash / shingles)
  • text fingerprinting (SimHash)
  • approximate similarity search (LSH-style candidate generation)
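To make the first item concrete, here is a minimal, self-contained sketch of MinHash over character shingles. It is not sketchir's API: the function names (`shingles`, `seeded_hash`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and seeding `DefaultHasher` with a counter is an assumption made so the example needs only the standard library.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Set of character k-grams ("shingles") of `text`.
fn shingles(text: &str, k: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if k == 0 || chars.len() < k {
        return HashSet::new();
    }
    chars.windows(k).map(|w| w.iter().collect::<String>()).collect()
}

/// Hash one shingle under a seed; each seed acts as a separate hash function.
/// `DefaultHasher::new()` uses fixed keys, so results are stable within one
/// build, but the algorithm is not guaranteed stable across Rust releases.
fn seeded_hash(seed: u64, item: &str) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    item.hash(&mut h);
    h.finish()
}

/// MinHash signature: for each seed, the minimum hash over all shingles.
fn minhash_signature(set: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            set.iter()
                .map(|s| seeded_hash(seed, s))
                .min()
                .unwrap_or(u64::MAX) // empty set: sentinel value
        })
        .collect()
}

/// The fraction of matching signature positions estimates Jaccard similarity.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Two documents are compared by shingling each, computing signatures with the same `num_hashes`, and passing both signatures to `estimate_jaccard`. More hash functions lower the variance of the estimate at the cost of a longer signature.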

The scope is limited to primitives: signatures, basic indexing, and deterministic behavior. Higher-level workflows (crawl dedupe pipelines, content extraction, etc.) belong elsewhere.

§Dense-vector LSH: LSHIndex vs DenseSimHashLSH

Both index dense f32 vectors but differ in lifecycle and recall strategy:

  • LSHIndex: batch workflow (add, build, search). Multi-table random projection with tunable num_tables / num_functions for recall control.
  • DenseSimHashLSH: incremental insertion (no build step). Single SimHash table with Hamming-distance-1 neighbor probing for approximate recall.
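The single-table strategy described for DenseSimHashLSH can be illustrated with a toy index. This is a sketch under stated assumptions, not the crate's implementation: `ToySimHashIndex`, `SplitMix64`, and every method name below are hypothetical, and hyperplane components are drawn uniformly from [-1, 1) rather than from a Gaussian to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG (SplitMix64), used only to avoid external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [-1, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        let unit = (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32;
        unit * 2.0 - 1.0
    }
}

/// Hypothetical single-table SimHash index (not sketchir's actual API).
struct ToySimHashIndex {
    planes: Vec<Vec<f32>>,              // one random hyperplane per code bit
    buckets: HashMap<u64, Vec<usize>>,  // bit code -> ids stored under it
}

impl ToySimHashIndex {
    fn new(dim: usize, bits: usize, seed: u64) -> Self {
        assert!((1..=64).contains(&bits), "code must fit in a u64");
        let mut rng = SplitMix64(seed);
        let planes: Vec<Vec<f32>> = (0..bits)
            .map(|_| (0..dim).map(|_| rng.next_f32()).collect::<Vec<f32>>())
            .collect();
        Self { planes, buckets: HashMap::new() }
    }

    /// One bit per hyperplane: set when the dot product is non-negative.
    fn code(&self, v: &[f32]) -> u64 {
        let mut code = 0u64;
        for (i, plane) in self.planes.iter().enumerate() {
            let dot: f32 = plane.iter().zip(v).map(|(p, x)| p * x).sum();
            if dot >= 0.0 {
                code |= 1u64 << i;
            }
        }
        code
    }

    /// Incremental insertion: the vector is searchable immediately.
    fn insert(&mut self, id: usize, v: &[f32]) {
        let code = self.code(v);
        self.buckets.entry(code).or_default().push(id);
    }

    /// Ids from the query's own bucket plus every bucket whose code
    /// differs from the query code in exactly one bit (Hamming distance 1).
    fn candidates(&self, q: &[f32]) -> Vec<usize> {
        let code = self.code(q);
        let mut out: Vec<usize> = self.buckets.get(&code).cloned().unwrap_or_default();
        for bit in 0..self.planes.len() {
            if let Some(ids) = self.buckets.get(&(code ^ (1u64 << bit))) {
                out.extend(ids);
            }
        }
        out
    }
}
```

With `bits` hyperplanes, a query inspects `bits + 1` buckets. A multi-table design in the style described for LSHIndex would instead keep several independent tables and take the union of their exact-bucket hits, trading memory for recall.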

Re-exports§

pub use blocking::BlockingConfig;
pub use blocking::MinHashTextLSH;
pub use dense_simhash::DenseSimHashLSH;
pub use lsh::LSHIndex;
pub use lsh::MinHashLSH;
pub use lsh::SimHashLSH;
pub use minhash::MinHash;
pub use minhash::MinHashSignature;
pub use simhash::simhash_fingerprint;
pub use simhash::SimHashFingerprint;

Modules§

blocking
Text blocking helpers built on MinHash + LSH.
dense_simhash
Dense-vector SimHash and an embedding LSH index.
lsh
LSH-style indexing helpers.
minhash
MinHash for Jaccard similarity estimation.
simhash
SimHash: binary fingerprints for fast near-duplicate detection.

Enums§

Error
Errors for sketchir indexes and operations.