Expand description
Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.
difflib’s ratio is slow because find_longest_match is O(|a|·|b|) (it rescans every
occurrence of popular characters). The matched-block total M is the recursive
longest-common-substring decomposition; a suffix automaton finds the longest common
substring in O(|a|+|b|) regardless of character frequency, and recursing on the left/
right remainders reproduces difflib’s M with the same tie-break (longest; earliest in
a; earliest in b). So gestalt_ratio equals difflib.SequenceMatcher.ratio() exactly.
Hot-path engineering for the all-pairs join:
- transitions are stored CSR (per-state sorted slice) → O(log deg) lookup, O(len) memory, no hashing — and crucially no O(deg) scan at the high-degree root, which the scan hammers when strings are dissimilar;
- the b-side automaton is prebuilt once per string (
build_sam) and reused for every pair, so the all-pairs cost is n builds + n² scans, not n² builds.
Operates on char (code points), so it is bit-identical to difflib on non-ASCII.
Structs§
- Sam
- Suffix automaton with CSR (sorted-per-state) transitions — built once, queried by scans.
Functions§
- build_
sam - Build (and finalize) the suffix automaton of
b— prebuild once, reuse across pairs. - gestalt_
edge - Edge test that also yields the exact ratio:
Some(ratio)iffRO(a,b) >= threshold, elseNone. Keeps the reject early-exit (abort the instant the upper boundM + Σ min(window) < need) but computes full M on the qualifying branch so the cached ratio feeds the clustermin_sim— caching edge ratios here meansmin_simonly recomputes the rare non-edge (chained) intra-cluster pairs, not the whole dense blob. Bit-identical to(let r = ratio; (r >= threshold).then_some(r)). - gestalt_
qualifies - Threshold-aware exact decision: does
RO(a,b) ≥ threshold? Computes the matched total M with two-sided early-exit — accept the instantM ≥ need, reject the instant the upper boundM + Σ min(window lengths) < need. Exact (the bound never drops a qualifying pair), and dissimilar pairs abort long before the full decomposition.need = ⌈threshold·(|a|+|b|)/2⌉. - gestalt_
qualifies_ ms - The threshold early-exit recursion given precomputed matching statistics (
fstate/fmatchforaof lengthnavssam_b). Split out so the scan can be done in an MLP batch separately. - gestalt_
ratio gestalt_ratio(a, b) -> float: exact difflib ratio, computed via suffix-automaton LCS.- gestalt_
ratio_ capped - Cluster
min_simhelper: returns the exact ratio whenRO(a,b) <= cap, otherwise a value> cap(it accept-early-exits the instant M exceeds the cap’s bound). Used to find a cluster’s minimum pairwise ratio withcur = cur.min(gestalt_ratio_capped(a, b, sam, cur)): in a dense cluster the dominant high-ratio pairs blow pastcurand are pruned after the first block or two, so full M is computed only for the genuinely-low pairs. Bit-identical minimum to the full ratio. - gestalt_
ratio_ chars - gestalt_
ratio_ prebuilt - Ratio of
avs the string the prebuiltsam_bwas built from (=b). - matching_
stats_ cost - Cost probe: run only the matching-statistics scan of
avssam_b(the unavoidable per-pair floor — RO’s first block is an LCS, Θ(|a|)) and return a checksum so it isn’t optimized out. Measures the pure scan throughput separate from the RO recursion.