"""Type stubs for difflib-fast — exact difflib Ratcliff-Obershelp similarity + clustering."""
"""Exact ``difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()`` — byte-for-byte.
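Returns the Ratcliff-Obershelp similarity ``2*M / (len(a) + len(b))``, where ``M`` is the total
number of matched elements. The result is identical to Python's ``difflib``, including its
argument-order asymmetry (``ratio(a, b)`` may differ from ``ratio(b, a)``). It is computed via a
suffix automaton, so it stays linear on long, repetitive inputs where ``difflib`` degrades.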
"""
"""Exact :func:`ratio` for many ``(a, b)`` pairs at once, computed across all cores inside Rust.
Element ``i`` of the result equals ``ratio(*pairs[i])``, bit-for-bit. The parallel fan-out happens in
Rust with the GIL released, which is the contention-free way to score a batch from Python: no
``ThreadPoolExecutor`` and no per-call overhead. Returns one float per input pair, in input order.
"""
"""Exact single-linkage clustering of ``canonicals`` by RO similarity.
Two strings are linked when their exact ratio is ``>= threshold``, and each cluster is a connected
group of linked strings. Returns one ``(member_indices, min_pairwise_ratio)`` tuple per cluster of
at least two members. ``member_indices`` are sorted indices into ``canonicals``;
``min_pairwise_ratio`` is the exact minimum ratio over all pairs in the cluster.
"""
"""Scalable MinHash-LSH variant of :func:`cluster_canonicals` for very large corpora.
Generates candidate pairs via MinHash-LSH (``num_perm`` permutations, ``band_rows`` rows per band),
then verifies each candidate with the exact ratio. Because every candidate is verified, clusters
match the exact path except for pairs that LSH fails to propose (recall is tuned via ``band_rows``).
Use :func:`cluster_canonicals` when exact recall is required.
"""