pub fn cluster_canonicals_lsh(
    canonicals: &[String],
    threshold: f64,
    num_perm: usize,
    band_rows: usize,
) -> Vec<(Vec<usize>, f64)>
cluster_canonicals_lsh(canonicals, threshold, num_perm, band_rows): the scalable path.
MinHash-LSH generates candidate pairs in roughly O(n) time, skipping the dissimilar pairs that
dominate the O(n²) all-pairs comparison. Each candidate is then verified with the exact ratio,
so the clusters and min_sim match the exact path except for any similar pairs that LSH fails to
surface (recall is tuned high via band_rows). This is a filter-then-verify design in the
BayesLSH-Lite / SourcererCC lineage. Use it past the O(n²) wall (more than about 100k strings);
when exact recall is required, use cluster_canonicals.
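
The filter-then-verify idea can be sketched roughly as follows. This is an illustrative sketch, not this crate's implementation: the helper names (shingles, minhash, jaccard, verified_pairs, candidate_probability) are hypothetical, character 3-gram shingling is an assumption, Jaccard similarity stands in for the "exact ratio", and the final grouping of verified pairs into clusters is omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Character 3-gram shingles (assumed tokenization; the real one may differ).
fn shingles(s: &str) -> HashSet<String> {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() < 3 {
        return std::iter::once(s.to_string()).collect();
    }
    chars.windows(3).map(|w| w.iter().collect::<String>()).collect()
}

/// MinHash signature: for each seeded hash function, keep the minimum hash over all shingles.
fn minhash(set: &HashSet<String>, num_perm: usize) -> Vec<u64> {
    (0..num_perm as u64)
        .map(|seed| {
            set.iter()
                .map(|sh| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    sh.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Exact Jaccard similarity, standing in here for the "exact ratio".
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count() as f64;
    if union == 0.0 { 1.0 } else { a.intersection(b).count() as f64 / union }
}

/// Standard LSH S-curve: probability that two items with similarity `s` share at least one band.
/// Assumes `num_perm` is a multiple of `band_rows`.
fn candidate_probability(s: f64, num_perm: usize, band_rows: usize) -> f64 {
    let bands = (num_perm / band_rows) as i32;
    1.0 - (1.0 - s.powi(band_rows as i32)).powi(bands)
}

/// Filter (LSH banding) then verify (exact similarity). `band_rows` must be non-zero.
fn verified_pairs(
    strings: &[String],
    threshold: f64,
    num_perm: usize,
    band_rows: usize,
) -> Vec<(usize, usize, f64)> {
    let sets: Vec<HashSet<String>> = strings.iter().map(|s| shingles(s)).collect();
    let sigs: Vec<Vec<u64>> = sets.iter().map(|s| minhash(s, num_perm)).collect();

    // Filter: items whose signatures agree on every row of some band share a bucket.
    let mut buckets: HashMap<(usize, &[u64]), Vec<usize>> = HashMap::new();
    for (i, sig) in sigs.iter().enumerate() {
        for (band, rows) in sig.chunks(band_rows).enumerate() {
            buckets.entry((band, rows)).or_default().push(i);
        }
    }
    let mut candidates: HashSet<(usize, usize)> = HashSet::new();
    for ids in buckets.values() {
        for (k, &i) in ids.iter().enumerate() {
            for &j in &ids[k + 1..] {
                candidates.insert((i, j)); // i < j because ids are pushed in index order
            }
        }
    }

    // Verify: keep only candidates whose exact similarity meets the threshold.
    let mut out: Vec<(usize, usize, f64)> = candidates
        .into_iter()
        .map(|(i, j)| (i, j, jaccard(&sets[i], &sets[j])))
        .filter(|&(_, _, sim)| sim >= threshold)
        .collect();
    out.sort_by(|a, b| (a.0, a.1).cmp(&(b.0, b.1)));
    out
}
```

Because every surviving pair is re-scored exactly, false positives from the LSH stage never reach the output; only false negatives (similar pairs that never share a bucket) can differ from the exact path. Fewer rows per band means more bands for a fixed num_perm, which raises candidate_probability and therefore recall, at the cost of more candidates to verify.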