Skip to main content

Module gestalt

Module gestalt 

Source
Expand description

Gestalt-Chain: fast exact Ratcliff–Obershelp via a suffix automaton.

difflib’s ratio is slow because find_longest_match is O(|a|·|b|) (it rescans every occurrence of popular characters). The matched-block total M is the recursive longest-common-substring decomposition; a suffix automaton finds the longest common substring in O(|a|+|b|) regardless of character frequency, and recursing on the left/ right remainders reproduces difflib’s M with the same tie-break (longest; earliest in a; earliest in b). So gestalt_ratio equals difflib.SequenceMatcher.ratio() exactly.

Hot-path engineering for the all-pairs join:

  • transitions are stored CSR (per-state sorted slice) → O(log deg) lookup, O(len) memory, no hashing — and crucially no O(deg) scan at the high-degree root, which the scan hammers when strings are dissimilar;
  • the b-side automaton is prebuilt once per string (build_sam) and reused for every pair, so the all-pairs cost is n builds + n² scans, not n² builds.
  • prefetch hints attempted on the per-iteration node[state] load — the SAM walk is a data-dependent pointer chase the hardware prefetcher cannot anticipate. A prfm pldl1keep (AArch64) / _mm_prefetch (x86) experiment was MEASURED on M3 Pro and did NOT pay off (-10% to -2% across mypy/sympy/django) — the SAMs fit in L2 already, and the prefetch instruction burns execution slots without producing a hit-rate improvement. See the tombstone near matching_stats_into for details.

Operates on char (code points), so it is bit-identical to difflib on non-ASCII.

Structs§

Sam
Suffix automaton with CSR (sorted-per-state) transitions — built once, queried by scans.

Functions§

build_sam
Build (and finalize) the suffix automaton of b — prebuild once, reuse across pairs.
delta_to_chain_cap
Map a user-facing delta (max acceptable RO ratio loss, in absolute units) to a chain-walk depth cap for longest_in. Empirically calibrated against gestalt::instrument’s chain depth histogram on the mypy/sympy corpora:
gestalt_edge
Edge test that also yields the exact ratio: Some(ratio) iff RO(a,b) >= threshold, else None. Keeps the reject early-exit (abort the instant the upper bound M + Σ min(window) < need) but computes full M on the qualifying branch so the cached ratio feeds the cluster min_sim — caching edge ratios here means min_sim only recomputes the rare non-edge (chained) intra-cluster pairs, not the whole dense blob. Bit-identical to (let r = ratio; (r >= threshold).then_some(r)).
gestalt_edge_with_ms
Stage-4b helper: same recursion + early-exit as gestalt_edge, but operates on a caller-provided (fstate, fmatch) instead of running matching_stats_into inline.
gestalt_edge_with_ms_delta
Approximate-RO variant: same as gestalt_edge_with_ms but caps the suffix-link chain walk inside longest_in to roughly 1/√delta ascents. delta = 0.0 means exact (no cap, default). delta ∈ (0, 1] caps the depth; the returned ratio’s worst-case absolute deviation from the exact RO is bounded by ~delta (empirically verified on canonical-Python corpora by tests/approx_ro.rs). The chain depth distribution on real workloads is heavy-headed (p99 ≈ 7), so even small delta values rarely actually fire the cap.
gestalt_qualifies
Threshold-aware exact decision: does RO(a,b) ≥ threshold? Computes the matched total M with two-sided early-exit — accept the instant M ≥ need, reject the instant the upper bound M + Σ min(window lengths) < need. Exact (the bound never drops a qualifying pair), and dissimilar pairs abort long before the full decomposition. need = ⌈threshold·(|a|+|b|)/2⌉.
gestalt_qualifies_ms
The threshold early-exit recursion given precomputed matching statistics (fstate/fmatch for a of length na vs sam_b). Split out so the scan can be done in an MLP batch separately.
gestalt_ratio
gestalt_ratio(a, b) -> float: exact difflib ratio, computed via suffix-automaton LCS.
gestalt_ratio_capped
Cluster min_sim helper: returns the exact ratio when RO(a,b) <= cap, otherwise a value > cap (it accept-early-exits the instant M exceeds the cap’s bound). Used to find a cluster’s minimum pairwise ratio with cur = cur.min(gestalt_ratio_capped(a, b, sam, cur)): in a dense cluster the dominant high-ratio pairs blow past cur and are pruned after the first block or two, so full M is computed only for the genuinely-low pairs. Bit-identical minimum to the full ratio.
gestalt_ratio_chars
gestalt_ratio_prebuilt
Ratio of a vs the string the prebuilt sam_b was built from (= b).
matching_stats_cost
Cost probe: run only the matching-statistics scan of a vs sam_b (the unavoidable per-pair floor — RO’s first block is an LCS, Θ(|a|)) and return a checksum so it isn’t optimized out. Measures the pure scan throughput separate from the RO recursion.