Expand description
difflib-fast — fast, byte-for-byte exact difflib Ratcliff–Obershelp (“gestalt”) similarity.
ratio / gestalt::gestalt_ratio are a drop-in for
difflib.SequenceMatcher(None, a, b, autojunk=False).ratio():
ratio = 2·M / (len(a)+len(b)), where M is the total size of the Ratcliff–Obershelp matching
blocks. The result — including difflib’s tie-break (longest; earliest-a; earliest-b) and its
argument-order asymmetry — is reproduced exactly, but M is computed via a suffix automaton
(LCS in O(|a|+|b|) regardless of character frequency) instead of difflib’s popular-character
b2j rescans. On long, small-alphabet text (e.g. canonicalized source code) this is the
difference between difflib’s pathological case and a linear scan.
Beyond the per-pair ratio, cluster_canonicals does an exact single-linkage clustering of a
corpus at a similarity threshold — prebuild each string’s automaton once, then an early-exit
all-pairs join (length blocking + quick_ratio filter + threshold-aware RO), in parallel via
rayon — and reports each cluster with its exact minimum pairwise ratio. cluster_canonicals_lsh
is the scalable MinHash-LSH variant (candidate generation + exact verification) for very large
corpora past the O(n²) wall.
Two independent implementations of M live here: the suffix-automaton path (gestalt) and a
straight port of difflib’s b2j recursion (ratio_reference); the test suite asserts they are
bit-identical, which is the crate’s core correctness gate.
Re-exports§
pub use gestalt::gestalt_ratio;
Modules§
- gestalt
- Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.
Functions§
- b2j_
work - Estimated b2j inner-loop work
Σ_c count_a(c)·count_b(c)— exactly the positions the first-block scan visits; the dispatch signal, also exposed for benchmarking. (oa·obfolds the non-ASCII tail.) - cluster_
canonicals cluster_canonicals(canonicals, threshold)→[(member indices, min pairwise ratio)].- cluster_
canonicals_ chars - Exact single-linkage clustering over pre-collected
charvectors: returns each cluster (member indices, sorted) with its exact minimum pairwise ratio. O(n²) early-exit join, rayon-parallel. - cluster_
canonicals_ lsh cluster_canonicals_lsh(canonicals, threshold, num_perm, band_rows): the scalable path.- ratio
- Fast exact difflib ratio. Bit-identical to
difflib.SequenceMatcher(None, a, b, autojunk=False).ratio(). - ratio_
b2j - The difflib
b2jratio path directly (bypasses the length dispatch) — exposed for benchmarking and as a second, structurally-distinct exact implementation. Preferratiofor real use. - ratio_
reference - Reference (b2j) difflib ratio — structurally distinct from the suffix-automaton path; the test
suite asserts
gestalt_ratioequals this exactly. Preferratiofor real use (far faster).