Skip to main content

Crate difflib_fast

Crate difflib_fast 

Source
Expand description

difflib-fast — fast, byte-for-byte exact difflib Ratcliff–Obershelp (“gestalt”) similarity.

ratio / gestalt::gestalt_ratio are a drop-in for difflib.SequenceMatcher(None, a, b, autojunk=False).ratio(): ratio = 2·M / (len(a)+len(b)), where M is the total size of the Ratcliff–Obershelp matching blocks. The result — including difflib’s tie-break (longest; earliest-a; earliest-b) and its argument-order asymmetry — is reproduced exactly, but M is computed via a suffix automaton (LCS in O(|a|+|b|) regardless of character frequency) instead of difflib’s popular-character b2j rescans. On long, small-alphabet text (e.g. canonicalized source code) this is the difference between difflib’s pathological case and a linear scan.

Beyond the per-pair ratio, cluster_canonicals does an exact single-linkage clustering of a corpus at a similarity threshold — prebuild each string’s automaton once, then an early-exit all-pairs join (length blocking + quick_ratio filter + threshold-aware RO), in parallel via rayon — and reports each cluster with its exact minimum pairwise ratio. cluster_canonicals_lsh is the scalable MinHash-LSH variant (candidate generation + exact verification) for very large corpora past the O(n²) wall.

Two independent implementations of M live here: the suffix-automaton path (gestalt) and a straight port of difflib’s b2j recursion (ratio_reference); the test suite asserts they are bit-identical, which is the crate’s core correctness gate.

Re-exports§

pub use gestalt::gestalt_ratio;

Modules§

gestalt
Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.

Functions§

b2j_work
Estimated b2j inner-loop work Σ_c count_a(c)·count_b(c) — exactly the positions the first-block scan visits; the dispatch signal, also exposed for benchmarking. (oa·ob folds the non-ASCII tail.)
cluster_canonicals
cluster_canonicals(canonicals, threshold)[(member indices, min pairwise ratio)].
cluster_canonicals_chars
Exact single-linkage clustering over pre-collected char vectors: returns each cluster (member indices, sorted) with its exact minimum pairwise ratio. O(n²) early-exit join, rayon-parallel.
cluster_canonicals_lsh
cluster_canonicals_lsh(canonicals, threshold, num_perm, band_rows): the scalable path.
ratio
Fast exact difflib ratio. Bit-identical to difflib.SequenceMatcher(None, a, b, autojunk=False).ratio().
ratio_b2j
The difflib b2j ratio path directly (bypasses the length dispatch) — exposed for benchmarking and as a second, structurally-distinct exact implementation. Prefer ratio for real use.
ratio_many
Exact difflib ratio for many (a, b) pairs at once, computed in parallel across all cores (rayon). ratio_many(pairs)[i] equals ratio(&pairs[i].0, &pairs[i].1), bit-for-bit.
ratio_reference
Reference (b2j) difflib ratio — structurally distinct from the suffix-automaton path; the test suite asserts gestalt_ratio equals this exactly. Prefer ratio for real use (far faster).