difflib-fast
The exact difflib similarity ratio — at suffix-automaton speed.
difflib.SequenceMatcher.ratio() tells you how similar two strings are — the way a human diff sees
it. It is the right metric, and it is slow. difflib-fast computes the exact same number,
byte-for-byte, with a suffix automaton — so it stays linear exactly where difflib falls apart.
use ratio;
// bit-for-bit identical to difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
assert_eq!;
= "0.1"
The idea
ratio = 2·M / (len(a) + len(b)), where M is the total size of the Ratcliff–Obershelp matching
blocks. The metric is exact and well-defined — including difflib's tie-break and its
argument-order asymmetry, both of which this crate reproduces bit-for-bit.
The catch is how you compute M. difflib does it by re-scanning every occurrence of each character;
on long, small-alphabet text — canonicalized source code, log lines, DNA — a handful of popular
characters turn that into a quadratic crawl. difflib-fast computes the longest common substring with
a suffix automaton in O(|a|+|b|) regardless of how often a character repeats. Same answer. No
crawl.
Two independent implementations of
Mlive in this crate — the suffix automaton, and a from-scratch port ofdifflib's own recursion. The test suite asserts they are bit-identical. That equality is the correctness guarantee.
What it does
Two things, both exact:
use ;
// 1. pairwise similarity — a drop-in for difflib's ratio
let r = ratio;
// 2. cluster a whole corpus by similarity (single-linkage, exact min pairwise ratio per cluster)
let corpus = vec!;
let clusters = cluster_canonicals; // → [([0, 1], 0.84…)]
The clustering path is the one built for scale: each string's automaton is prebuilt once and
reused across the whole n² join, dissimilar pairs early-exit the moment the threshold is decided,
and the work is spread across cores with rayon. A cluster's reported min_sim is its exact minimum
pairwise ratio.
API
| function | what |
|---|---|
ratio(a, b) -> f64 |
exact difflib ratio (dispatches automaton ⇄ b2j per input) |
ratio_many(&[(String, String)]) -> Vec<f64> |
exact ratio for a batch of pairs, in parallel (rayon) |
gestalt::gestalt_ratio(a, b) -> f64 |
the suffix-automaton path directly |
ratio_reference(a, b) -> f64 |
the b2j reference port (slow; the test oracle) |
cluster_canonicals(&[String], threshold) -> Vec<(Vec<usize>, f64)> |
exact single-linkage clusters + min pairwise ratio |
cluster_canonicals_chars(&[Vec<char>], threshold) |
same, over pre-collected char vectors |
cluster_canonicals_lsh(&[String], threshold, num_perm, band_rows) |
scalable MinHash-LSH variant (candidate-gen + exact verify) for very large corpora |
Correctness
This crate's entire reason to exist is being exactly difflib — so correctness is enforced, not
hoped for:
- two implementations, one answer — the suffix-automaton path and the
b2jreference port are asserted bit-identical on thousands of fuzzed pairs (fast_matches_reference); - 18k-assertion threshold gate — every early-exit decision matches the full ratio
(
qualifies_matches_ratio_threshold); difflibreference values, including non-ASCII.
Benchmarks
Exact byte-for-byte RO on real canonicalized Python (top-level function bodies, ast.dump-shape),
Apple M3 Pro (6 P + 6 E cores). pairs/s = pairwise ratio decisions per second. Full methodology,
per-repo tables, and the C++ / Python harnesses are in benchmarks.md.
Clustering throughput — the production path (prebuilt automaton + threshold early-exit + rayon),
cargo run --release --features bench --bin bench:
| repo | raw ratio, 1 thread | threshold @0.5, 1 thread | threshold @0.5, 12 threads |
|---|---|---|---|
| django | 24.3k pairs/s | 117k pairs/s | 936k pairs/s |
| sympy | 23.6k | 97.9k | 828k |
| ha | 21.5k | 69.6k | 468k |
| mypy | 14.1k | 60.6k | 271k |
| transformers | 3.5k¹ | 47.3k | 362k |
¹ transformers' model code has unusually long functions (per-pair RO is O(L·log L)), so raw
throughput is lower — the same reason difflib struggles there.
vs other exact-RO implementations — single thread, same metric, across the five repos above:
| competitor | difflib-fast speedup |
|---|---|
Python stdlib difflib (pure Python — the original) |
245–1070× |
C++ duckie/difflib (well-optimized b2j) |
1.4–3.4× |
CyDifflib (Cython difflib) |
~3–10× |
difflib (Rust difflib crate) |
11–50× |
gestalt_ratio (Rust Ratcliff–Obershelp crate) |
18–96× |
Against the original Python difflib it's a different universe: pure-Python SequenceMatcher
manages 15–77 exact-RO ratios/s on this code, so difflib-fast is 245–1070× faster single-threaded —
and ~1600–6600× on all 12 cores. CyDifflib has no in-process parallelism (GIL-bound, no nogil /
batch API), so on all 12 cores — difflib-fast (rayon) vs CyDifflib (multiprocessing) on the same
qualifying-pairs task — the gap is 60–242×. CyDifflib's default autojunk=True is faster but differs from the exact ratio on ~100%
of these pairs (mean |Δ| ≈ 0.22): a different metric, not a drop-in. The libraries that beat
difflib-fast on raw speed (RapidFuzz, strsim) likewise compute a different metric (Indel /
Levenshtein), not difflib's.
Python package
Swap difflib.SequenceMatcher(...).ratio() for difflib_fast.ratio(...) and the same number comes
back — hundreds of times faster per call, and thousands of times faster when you score a whole
corpus. That's the entire change: import difflib_fast.
# from source (needs a Rust toolchain; pip drives maturin automatically — no manual build):
# or grab a prebuilt abi3 wheel (CPython 3.9+) from the GitHub Releases page — no Rust needed.
# one pair → one float, byte-for-byte difflib (autojunk=False)
# 0.8947368421052632
# cluster a corpus (single-linkage, exact min pairwise ratio per cluster)
# → [([0, 1], 0.86…)]
All cores, no contention
ratio is overloaded: hand it a list of pairs and it returns a list of ratios, computing them
in parallel across every core inside Rust — with the GIL released. You don't touch a
ThreadPoolExecutor, you don't fight the GIL; you just pass the batch:
=
# list[float], one per pair — fanned out over all cores
This matters because Python can't parallelize the stdlib version: difflib in a thread pool stays
GIL-bound and doesn't speed up at all. The batch form sidesteps that entirely — the parallelism lives
in Rust, not in Python threads. Measured on real canonicalized Python (mypy, M3 Pro, 12 threads,
benchmarks/bench_python.py):
| from Python | throughput | vs stdlib difflib |
|---|---|---|
ratio(a, b) — one call |
2.4k pairs/s | 104× |
ratio(pairs) — batch, all cores |
15k pairs/s | 628× |
cluster_canonicals(corpus) — the real workload |
199k eff. pairs/s | 8,541× |
difflib in a ThreadPoolExecutor |
23 pairs/s | 1.0× — threads don't help |
Clustering wins biggest because each string's automaton is built once and reused across the whole
n² join (and dissimilar pairs early-exit) — so it's not 12× the per-call speed, it's a different
algorithm. The number you'd actually feel: a corpus whose all-pairs comparison would take stdlib
difflib ~30 minutes clusters in ~0.2 s.
The package is typed (py.typed + .pyi stubs — pyright/mypy see the overloads), gated behind the
python cargo feature so the pure-Rust crate keeps zero Python dependency by default. Built with
maturin (mixed layout: compiled _difflib_fast +
python/difflib_fast/ package); build locally into a venv with maturin develop --release --features python.
How it works
The metric is Ratcliff–Obershelp (Ratcliff & Obershelp, 1988) computed over a suffix automaton
(Blumer et al., 1985) via matching statistics (Chang & Lawler, 1994); short, diverse inputs take a
lighter b2j path instead, chosen per-input by a cheap work estimate. The composition — exact
byte-for-byte RO this way, tuned for an all-pairs clustering join — is original to this crate.
src/gestalt.rs is the engine: its module doc and inline comments cover the
automaton, the endpos range structure that lets one prebuilt automaton serve the whole RO recursion,
the threshold-engine early-exit math, and the measured performance floor.
Exact difflib. None of the wait.
Made with ⚡ by @prostomarkeloff