difflib-fast

The exact difflib similarity ratio — at suffix-automaton speed.

difflib.SequenceMatcher.ratio() tells you how similar two strings are — the way a human diff sees it. It is the right metric, and it is slow. difflib-fast computes the exact same number, byte-for-byte, with a suffix automaton — so it stays linear exactly where difflib falls apart.

use difflib_fast::ratio;

// bit-for-bit identical to difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
assert_eq!(ratio("the quick brown fox", "the quick brown dog"), 0.8947368421052632);

difflib-fast = "0.1"

The idea

ratio = 2·M / (len(a) + len(b)), where M is the total size of the Ratcliff–Obershelp matching blocks. The metric is exact and well-defined — including difflib's tie-break and its argument-order asymmetry, both of which this crate reproduces bit-for-bit.

The catch is how you compute M. difflib does it by re-scanning every occurrence of each character; on long, small-alphabet text — canonicalized source code, log lines, DNA — a handful of popular characters turn that into a quadratic crawl. difflib-fast computes the longest common substring with a suffix automaton in O(|a|+|b|) regardless of how often a character repeats. Same answer. No crawl.

Two independent implementations of M live in this crate — the suffix automaton, and a from-scratch port of difflib's own recursion. The test suite asserts they are bit-identical. That equality is the correctness guarantee.

What it does

Two things, both exact:

use difflib_fast::{ratio, cluster_canonicals};

// 1. pairwise similarity — a drop-in for difflib's ratio
let r = ratio("def add(a, b): return a + b", "def add(x, y): return x + y");

// 2. cluster a whole corpus by similarity (single-linkage, exact min pairwise ratio per cluster)
let corpus = vec![
    "def add(a, b): return a + b".to_string(),
    "def add(x, y): return x + y".to_string(),
    "totally unrelated".to_string(),
];
let clusters = cluster_canonicals(&corpus, 0.5);  // → [([0, 1], 0.84…)]

The clustering path is the one built for scale: each string's automaton is prebuilt once and reused across the whole n² join, dissimilar pairs early-exit the moment the threshold is decided, and the work is spread across cores with rayon. A cluster's reported min_sim is its exact minimum pairwise ratio.

API

function	what
`ratio(a, b) -> f64`	exact `difflib` ratio (dispatches automaton ⇄ `b2j` per input)
`ratio_many(&[(String, String)]) -> Vec<f64>`	exact `ratio` for a batch of pairs, in parallel (rayon)
`gestalt::gestalt_ratio(a, b) -> f64`	the suffix-automaton path directly
`ratio_reference(a, b) -> f64`	the `b2j` reference port (slow; the test oracle)
`cluster_canonicals(&[String], threshold) -> Vec<(Vec<usize>, f64)>`	exact single-linkage clusters + min pairwise ratio
`cluster_canonicals_chars(&[Vec<char>], threshold)`	same, over pre-collected `char` vectors
`cluster_canonicals_lsh(&[String], threshold, num_perm, band_rows)`	scalable MinHash-LSH variant (candidate-gen + exact verify) for very large corpora

Correctness

This crate's entire reason to exist is being exactly difflib — so correctness is enforced, not hoped for:

two implementations, one answer — the suffix-automaton path and the b2j reference port are asserted bit-identical on thousands of fuzzed pairs (fast_matches_reference);
18k-assertion threshold gate — every early-exit decision matches the full ratio (qualifies_matches_ratio_threshold);
difflib reference values, including non-ASCII.

cargo test

Benchmarks

Exact byte-for-byte RO on real canonicalized Python (top-level function bodies, ast.dump-shape), Apple M3 Pro (6 P + 6 E cores). pairs/s = pairwise ratio decisions per second. Full methodology, per-repo tables, and the C++ / Python harnesses are in benchmarks.md.

Clustering throughput — the production path (prebuilt automaton + threshold early-exit + rayon), cargo run --release --features bench --bin bench:

repo	raw ratio, 1 thread	threshold @0.5, 1 thread	threshold @0.5, 12 threads
django	24.3k pairs/s	117k pairs/s	936k pairs/s
sympy	23.6k	97.9k	828k
ha	21.5k	69.6k	468k
mypy	14.1k	60.6k	271k
transformers	3.5k¹	47.3k	362k

¹ transformers' model code has unusually long functions (per-pair RO is O(L·log L)), so raw throughput is lower — the same reason difflib struggles there.

vs other exact-RO implementations — single thread, same metric, across the five repos above:

competitor	difflib-fast speedup
Python stdlib `difflib` (pure Python — the original)	245–1070×
C++ `duckie/difflib` (well-optimized b2j)	1.4–3.4×
CyDifflib (Cython `difflib`)	~3–10×
`difflib` (Rust `difflib` crate)	11–50×
`gestalt_ratio` (Rust Ratcliff–Obershelp crate)	18–96×

Against the original Python difflib it's a different universe: pure-Python SequenceMatcher manages 15–77 exact-RO ratios/s on this code, so difflib-fast is 245–1070× faster single-threaded — and ~1600–6600× on all 12 cores. CyDifflib has no in-process parallelism (GIL-bound, no nogil / batch API), so on all 12 cores — difflib-fast (rayon) vs CyDifflib (multiprocessing) on the same qualifying-pairs task — the gap is 60–242×. CyDifflib's default autojunk=True is faster but differs from the exact ratio on ~100% of these pairs (mean |Δ| ≈ 0.22): a different metric, not a drop-in. The libraries that beat difflib-fast on raw speed (RapidFuzz, strsim) likewise compute a different metric (Indel / Levenshtein), not difflib's.

Python package

Swap difflib.SequenceMatcher(...).ratio() for difflib_fast.ratio(...) and the same number comes back — hundreds of times faster per call, and thousands of times faster when you score a whole corpus. That's the entire change: import difflib_fast.

# from source (needs a Rust toolchain; pip drives maturin automatically — no manual build):
pip install git+https://github.com/prostomarkeloff/difflib-fast

# or grab a prebuilt abi3 wheel (CPython 3.9+) from the GitHub Releases page — no Rust needed.

import difflib_fast

# one pair → one float, byte-for-byte difflib (autojunk=False)
difflib_fast.ratio("the quick brown fox", "the quick brown dog")   # 0.8947368421052632

# cluster a corpus (single-linkage, exact min pairwise ratio per cluster)
difflib_fast.cluster_canonicals(["def f(a): ...", "def f(x): ...", "other"], 0.5)
# → [([0, 1], 0.86…)]

All cores, no contention

ratio is overloaded: hand it a list of pairs and it returns a list of ratios, computing them in parallel across every core inside Rust — with the GIL released. You don't touch a ThreadPoolExecutor, you don't fight the GIL; you just pass the batch:

pairs = [(a, b) for a in corpus for b in corpus]
difflib_fast.ratio(pairs)            # list[float], one per pair — fanned out over all cores

This matters because Python can't parallelize the stdlib version: difflib in a thread pool stays GIL-bound and doesn't speed up at all. The batch form sidesteps that entirely — the parallelism lives in Rust, not in Python threads. Measured on real canonicalized Python (mypy, M3 Pro, 12 threads, benchmarks/bench_python.py):

from Python	throughput	vs stdlib `difflib`
`ratio(a, b)` — one call	2.4k pairs/s	104×
`ratio(pairs)` — batch, all cores	15k pairs/s	628×
`cluster_canonicals(corpus)` — the real workload	199k eff. pairs/s	8,541×
`difflib` in a `ThreadPoolExecutor`	23 pairs/s	1.0× — threads don't help

Clustering wins biggest because each string's automaton is built once and reused across the whole n² join (and dissimilar pairs early-exit) — so it's not 12× the per-call speed, it's a different algorithm. The number you'd actually feel: a corpus whose all-pairs comparison would take stdlib difflib ~30 minutes clusters in ~0.2 s.

The package is typed (py.typed + .pyi stubs — pyright/mypy see the overloads), gated behind the python cargo feature so the pure-Rust crate keeps zero Python dependency by default. Built with maturin (mixed layout: compiled _difflib_fast + python/difflib_fast/ package); build locally into a venv with maturin develop --release --features python.

How it works

The metric is Ratcliff–Obershelp (Ratcliff & Obershelp, 1988) computed over a suffix automaton (Blumer et al., 1985) via matching statistics (Chang & Lawler, 1994); short, diverse inputs take a lighter b2j path instead, chosen per-input by a cheap work estimate. The composition — exact byte-for-byte RO this way, tuned for an all-pairs clustering join — is original to this crate.

src/gestalt.rs is the engine: its module doc and inline comments cover the automaton, the endpos range structure that lets one prebuilt automaton serve the whole RO recursion, the threshold-engine early-exit math, and the measured performance floor.

Exact difflib. None of the wait.

Made with ⚡ by @prostomarkeloff

difflib-fast 0.1.1