Skip to main content

Module gestalt

Module gestalt 

Source
Expand description

Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.

difflib’s ratio is slow because find_longest_match is O(|a|·|b|) (it rescans every occurrence of popular characters). The matched-block total M is the recursive longest-common-substring decomposition; a suffix automaton finds the longest common substring in O(|a|+|b|) regardless of character frequency, and recursing on the left/ right remainders reproduces difflib’s M with the same tie-break (longest; earliest in a; earliest in b). So gestalt_ratio equals difflib.SequenceMatcher.ratio() exactly.

Hot-path engineering for the all-pairs join:

  • transitions are stored CSR (per-state sorted slice) → O(log deg) lookup, O(len) memory, no hashing — and crucially no O(deg) scan at the high-degree root, which the scan hammers when strings are dissimilar;
  • the b-side automaton is prebuilt once per string (build_sam) and reused for every pair, so the all-pairs cost is n builds + n² scans, not n² builds.

Operates on char (code points), so it is bit-identical to difflib on non-ASCII.

Structs§

Sam
Suffix automaton with CSR (sorted-per-state) transitions — built once, queried by scans.

Functions§

build_sam
Build (and finalize) the suffix automaton of b — prebuild once, reuse across pairs.
gestalt_edge
Edge test that also yields the exact ratio: Some(ratio) iff RO(a,b) >= threshold, else None. Keeps the reject early-exit (abort the instant the upper bound M + Σ min(window) < need) but computes full M on the qualifying branch so the cached ratio feeds the cluster min_sim — caching edge ratios here means min_sim only recomputes the rare non-edge (chained) intra-cluster pairs, not the whole dense blob. Bit-identical to (let r = ratio; (r >= threshold).then_some(r)).
gestalt_qualifies
Threshold-aware exact decision: does RO(a,b) ≥ threshold? Computes the matched total M with two-sided early-exit — accept the instant M ≥ need, reject the instant the upper bound M + Σ min(window lengths) < need. Exact (the bound never drops a qualifying pair), and dissimilar pairs abort long before the full decomposition. need = ⌈threshold·(|a|+|b|)/2⌉.
gestalt_qualifies_ms
The threshold early-exit recursion given precomputed matching statistics (fstate/fmatch for a of length na vs sam_b). Split out so the scan can be done in an MLP batch separately.
gestalt_ratio
gestalt_ratio(a, b) -> float: exact difflib ratio, computed via suffix-automaton LCS.
gestalt_ratio_capped
Cluster min_sim helper: returns the exact ratio when RO(a,b) <= cap, otherwise a value > cap (it accept-early-exits the instant M exceeds the cap’s bound). Used to find a cluster’s minimum pairwise ratio with cur = cur.min(gestalt_ratio_capped(a, b, sam, cur)): in a dense cluster the dominant high-ratio pairs blow past cur and are pruned after the first block or two, so full M is computed only for the genuinely-low pairs. Bit-identical minimum to the full ratio.
gestalt_ratio_chars
gestalt_ratio_prebuilt
Ratio of a vs the string the prebuilt sam_b was built from (= b).
matching_stats_cost
Cost probe: run only the matching-statistics scan of a vs sam_b (the unavoidable per-pair floor — RO’s first block is an LCS, Θ(|a|)) and return a checksum so it isn’t optimized out. Measures the pure scan throughput separate from the RO recursion.