
cluster_canonicals_lsh

Function cluster_canonicals_lsh 

pub fn cluster_canonicals_lsh(
    canonicals: &[String],
    threshold: f64,
    num_perm: usize,
    band_rows: usize,
) -> Vec<(Vec<usize>, f64)>

cluster_canonicals_lsh(canonicals, threshold, num_perm, band_rows): the scalable path.

MinHash-LSH generates candidate pairs in roughly O(n) time, skipping the O(n²) comparisons between dissimilar pairs. Each candidate is then verified with the exact ratio, so the clusters and min_sim match the exact path, up to LSH recall (which is tuned high via band_rows). This is a filter-then-verify design in the BayesLSH-Lite / SourcererCC lineage. Use it past the O(n²) wall (more than about 100k strings); when exact recall is required, use cluster_canonicals.
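The filter-then-verify idea can be illustrated with a minimal sketch. This is not this crate's implementation: the names `shingles`, `seeded_hash`, `minhash`, `jaccard`, and `lsh_verified_pairs` are hypothetical, and Jaccard similarity over character 3-grams is assumed as a stand-in for the "exact ratio", which the description above does not define. The sketch stops at verified pairs and does not form clusters.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical helper: character 3-gram shingles (whole string if shorter than 3 chars).
fn shingles(s: &str) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() < 3 {
        return std::iter::once(s.to_string()).collect();
    }
    chars.windows(3).map(|w| w.iter().collect::<String>()).collect()
}

// Seeded FNV-1a style hash; each seed plays the role of one MinHash permutation.
fn seeded_hash(s: &str, seed: u64) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed.wrapping_mul(0x9E3779B97F4A7C15);
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// MinHash signature: the minimum hash of the set under each of num_perm seeds.
fn minhash(set: &HashSet<String>, num_perm: usize) -> Vec<u64> {
    (0..num_perm as u64)
        .map(|seed| set.iter().map(|sh| seeded_hash(sh, seed)).min().unwrap_or(u64::MAX))
        .collect()
}

// Exact Jaccard similarity, used here as the assumed verification step.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count() as f64;
    if union == 0.0 { 1.0 } else { a.intersection(b).count() as f64 / union }
}

// Filter (LSH banding) then verify (exact similarity >= threshold).
// Returns (i, j, similarity) with i < j, sorted by (i, j).
fn lsh_verified_pairs(
    items: &[String],
    threshold: f64,
    num_perm: usize,
    band_rows: usize,
) -> Vec<(usize, usize, f64)> {
    assert!(band_rows > 0 && band_rows <= num_perm);
    let sets: Vec<HashSet<String>> = items.iter().map(|s| shingles(s)).collect();
    let sigs: Vec<Vec<u64>> = sets.iter().map(|s| minhash(s, num_perm)).collect();

    // Items whose signatures agree on every row of some band become candidates.
    let mut candidates: HashSet<(usize, usize)> = HashSet::new();
    for band in 0..num_perm / band_rows {
        let mut buckets: HashMap<&[u64], Vec<usize>> = HashMap::new();
        for (i, sig) in sigs.iter().enumerate() {
            let key = &sig[band * band_rows..(band + 1) * band_rows];
            buckets.entry(key).or_default().push(i);
        }
        for ids in buckets.values() {
            for (k, &i) in ids.iter().enumerate() {
                for &j in &ids[k + 1..] {
                    candidates.insert((i, j));
                }
            }
        }
    }

    // Verification removes LSH false positives; false negatives stay missed.
    let mut out: Vec<(usize, usize, f64)> = candidates
        .into_iter()
        .map(|(i, j)| (i, j, jaccard(&sets[i], &sets[j])))
        .filter(|&(_, _, sim)| sim >= threshold)
        .collect();
    out.sort_by_key(|&(i, j, _)| (i, j));
    out
}
```

Under standard banding, with b = num_perm / band_rows bands of r = band_rows rows each, a pair with similarity s becomes a candidate with probability 1 − (1 − s^r)^b. Fewer rows per band therefore raises recall at the cost of more candidates to verify, which is one plausible reading of "tuned high via band_rows" above.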