Skip to main content

Crate difflib_fast

Crate difflib_fast 

Source
Expand description

difflib-fast — fast, byte-for-byte exact difflib Ratcliff–Obershelp (“gestalt”) similarity.

ratio / gestalt::gestalt_ratio are a drop-in for difflib.SequenceMatcher(None, a, b, autojunk=False).ratio(): ratio = 2·M / (len(a)+len(b)), where M is the total size of the Ratcliff–Obershelp matching blocks. The result — including difflib’s tie-break (longest; earliest-a; earliest-b) and its argument-order asymmetry — is reproduced exactly, but M is computed via a suffix automaton (LCS in O(|a|+|b|) regardless of character frequency) instead of difflib’s popular-character b2j rescans. On long, small-alphabet text (e.g. canonicalized source code) this is the difference between difflib’s pathological case and a linear scan.

Beyond the per-pair ratio, cluster_canonicals does an exact single-linkage clustering of a corpus at a similarity threshold — prebuild each string’s automaton once, then an early-exit all-pairs join (length blocking + quick_ratio filter + threshold-aware RO), in parallel via rayon — and reports each cluster with its exact minimum pairwise ratio. cluster_canonicals_lsh is the scalable MinHash-LSH variant (candidate generation + exact verification) for very large corpora past the O(n²) wall.

Two independent implementations of M back this: the suffix-automaton path (gestalt) and a straight port of difflib’s b2j recursion (a test-only reference oracle); the test suite asserts they are bit-identical, which is the crate’s core correctness gate.

Re-exports§

pub use gestalt::gestalt_ratio;
pub use rationer::Concurrency;
pub use rationer::PreparedRationer;
pub use rationer::Rationer;
pub use rationer::RationerBuilder;

Modules§

gestalt
Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.
rationer
Rationer — stateful similarity / clustering handle with explicit backend control.

Functions§

cluster_canonicals
cluster_canonicals(canonicals, threshold)[(member indices, min pairwise ratio)].
cluster_canonicals_lsh
cluster_canonicals_lsh(canonicals, threshold, num_perm, band_rows): the scalable path.
ratio
Fast exact difflib ratio. Bit-identical to difflib.SequenceMatcher(None, a, b, autojunk=False).ratio().
ratio_many
Exact difflib ratio for many (a, b) pairs at once, computed in parallel across all cores (rayon). ratio_many(pairs)[i] equals ratio(&pairs[i].0, &pairs[i].1), bit-for-bit.