Expand description
difflib-fast — fast, byte-for-byte exact difflib Ratcliff–Obershelp (“gestalt”) similarity.
ratio / gestalt::gestalt_ratio are a drop-in for
difflib.SequenceMatcher(None, a, b, autojunk=False).ratio():
ratio = 2·M / (len(a)+len(b)), where M is the total size of the Ratcliff–Obershelp matching
blocks. The result — including difflib’s tie-break (longest; earliest-a; earliest-b) and its
argument-order asymmetry — is reproduced exactly, but M is computed via a suffix automaton
(LCS in O(|a|+|b|) regardless of character frequency) instead of difflib’s popular-character
b2j rescans. On long, small-alphabet text (e.g. canonicalized source code) this is the
difference between difflib’s pathological case and a linear scan.
Beyond the per-pair ratio, cluster_canonicals does an exact single-linkage clustering of a
corpus at a similarity threshold — prebuild each string’s automaton once, then an early-exit
all-pairs join (length blocking + quick_ratio filter + threshold-aware RO), in parallel via
rayon — and reports each cluster with its exact minimum pairwise ratio. cluster_canonicals_lsh
is the scalable MinHash-LSH variant (candidate generation + exact verification) for very large
corpora past the O(n²) wall.
Two independent implementations of M back this: the suffix-automaton path (gestalt) and a
straight port of difflib’s b2j recursion (a test-only reference oracle); the test suite asserts
they are bit-identical, which is the crate’s core correctness gate.
Re-exports§
pub use gestalt::gestalt_ratio;pub use rationer::Concurrency;pub use rationer::PreparedRationer;pub use rationer::Rationer;pub use rationer::RationerBuilder;
Modules§
- gestalt
- Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.
- rationer
Rationer— stateful similarity / clustering handle with explicit backend control.- simjoin
simjoin— exact all-pairs weighted-cosine similarity join: given a corpus of sparse non-negative vectors, find every pair(i, j)whose cosine similarity is≥ t.
Functions§
- cluster_
canonicals cluster_canonicals(canonicals, threshold)→[(member indices, min pairwise ratio)].- cluster_
canonicals_ lsh cluster_canonicals_lsh(canonicals, threshold, num_perm, band_rows): the scalable path.- ratio
- Fast exact difflib ratio. Bit-identical to
difflib.SequenceMatcher(None, a, b, autojunk=False).ratio(). - ratio_
many - Exact difflib
ratiofor many(a, b)pairs at once, computed in parallel across all cores (rayon).ratio_many(pairs)[i]equalsratio(&pairs[i].0, &pairs[i].1), bit-for-bit.