Crate textdistance

§textdistance.rs

[ github.com ] [ docs.rs ] [ crates.io ]

A Rust library with lots of different algorithms to compare how similar two sequences are.

Features:

  • 💪 Based on popular and battle-tested textdistance Python library (and written by the same author).
  • 📚 Contains 20+ algorithms for all purposes.
  • 🔬 Includes state-of-the-art algorithms like EntropyNCD and Sift4.
  • 🪶 Zero-dependency.
  • 🐜 #![no_std] support (embedded systems).
  • 🔨 Works with any iterators, including bytes, code points, Unicode grapheme clusters, words, and numbers.
  • ❤️ Friendly and consistent API for all algorithms.
  • 📏 Optional normalization of the result on the 0.0-1.0 interval.
  • 🛡 No unsafe code.
  • 🦀 Pure Rust.

§Available algorithms

Edit-based:

  1. DamerauLevenshtein, both the unrestricted and the restricted (optimal string alignment) variants.
  2. Hamming
  3. Jaro
  4. JaroWinkler
  5. Levenshtein
  6. Sift4Common
  7. Sift4Simple
  8. SmithWaterman

Token-based:

  1. Bag
  2. Cosine (aka Orchini, Tucker, Otsuka–Ochiai)
  3. EntropyNCD (Entropy-based Normalized Compression Distance)
  4. Jaccard (aka Tanimoto, Critical Success Index)
  5. Overlap (aka Szymkiewicz–Simpson)
  6. Roberts
  7. SorensenDice (aka F1, Czekanowski, Zijdenbos)
  8. Tversky

Sequence-based:

  1. LCSSeq (Longest Common SubSequence)
  2. LCSStr (Longest Common SubString)
  3. RatcliffObershelp (aka Gestalt pattern matching)

Naive:

  1. Prefix
  2. Suffix
  3. Length

Normalization for other metrics:

  1. LIG3 normalization for Hamming by Levenshtein
  2. MLIPNS normalization for Hamming
  3. YujianBo normalization for Levenshtein

§Installation

cargo add textdistance

Or if you’re going to use it in a no_std project:

cargo add --no-default-features textdistance
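
Either command adds a dependency entry to your Cargo.toml. It ends up looking roughly like the sketch below (the version is a placeholder; cargo add picks the latest release for you):

# default build:
textdistance = "1"

# no_std build (default features disabled):
textdistance = { version = "1", default-features = false }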

§Usage

The textdistance::str module provides shortcut functions for each algorithm for calculating the distance/similarity between two strings:

use textdistance::str::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2);

The textdistance::nstr module is the same but all algorithms return a normalized value (between 0.0 and 1.0):

use textdistance::nstr::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2./4.);

For more advanced usage, each algorithm is provided as a struct implementing the Algorithm trait:

use textdistance::{Algorithm, DamerauLevenshtein};

let a = DamerauLevenshtein::default();
let r = a.for_str("abc", "acbd");
assert!(r.val() == 2);
assert!(r.nval() == 2./4.);
  1. The Algorithm trait provides for_str, for_vec, and for_iter to calculate the result for two strings, vectors (slices), or iterators respectively. In addition, the for_words and for_bigrams methods split the text into words or bigrams before calculating the distance.
  2. Each method returns a textdistance::Result that provides methods to get the absolute (val) or normalized (nval) value of the metric, the distance (dist and ndist), or the similarity (sim and nsim).
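
For example, here is a hedged sketch combining for_words with the different views of Result (Jaccard is used only as a sample token-based algorithm; no specific numbers are asserted):

use textdistance::{Algorithm, Jaccard};

let alg = Jaccard::default();
// Split both inputs into words before comparing them as token collections.
let r = alg.for_words("the quick brown fox", "the quick brown dog");
// The same Result can be read in several ways:
println!("val={}", r.val());     // absolute value of the metric
println!("nval={}", r.nval());   // normalized to the 0.0-1.0 interval
println!("ndist={}", r.ndist()); // as a normalized distance
println!("nsim={}", r.nsim());   // as a normalized similarity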

§Unicode support

The for_str method (and so all functions in the str and nstr modules) uses String.chars to split the string and then runs it through the for_iter method. So, é will be considered two distinct characters (“latin small letter e” and “combining acute accent”). Usually, that’s ok and this is how Python works. You can read more in the official Rust documentation.

If you want é to be considered as a single symbol, use the unicode-segmentation crate:

use textdistance::{Algorithm, DamerauLevenshtein};
use unicode_segmentation::UnicodeSegmentation;

let s1 = "a̐éö̲\r\n";
let s2 = "éa̐ö̲\r\n";
let g1 = s1.graphemes(true);
let g2 = s2.graphemes(true);
let a = DamerauLevenshtein::default();
let r = a.for_iter(g1, g2);
assert!(r.val() == 1);

§Choosing the algorithm

The algorithm to use depends on your use case. First, you need to decide on the algorithm category:

  1. Edit-based algorithms work well on short sequences for detecting typos and minor changes.
  2. Token-based algorithms work well on longer sequences for comparing long texts with noticeable modifications.
  3. Sequence-based algorithms work well for calculating diff size between the original and the changed version of the sequence.
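
As a rough illustration of the first two categories (a sketch only: Levenshtein stands in for the edit-based family, Jaccard over words for the token-based one):

use textdistance::{Algorithm, Jaccard, Levenshtein};

// Edit-based: short strings, small changes such as typos.
let lev = Levenshtein::default();
assert!(lev.for_str("color", "colour").val() == 1);

// Token-based: longer texts compared as collections of words.
let jac = Jaccard::default();
let r = jac.for_words(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown cat jumps over the lazy dog",
);
assert!(r.nsim() > 0.7);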

If you go with edit-based, the next thing is to decide what kind of changes you need to detect:

  • ✏️ Substitution. One character is replaced by another.
  • ➕ Addition. A new character is added.
  • 🗑 Deletion. A character is removed.
  • 🔄 Transposition. Two sequential characters are swapped.
| alg | sub | add | del | trans |
|-----|-----|-----|-----|-------|
| Hamming | ✅ | ❌ | ❌ | ❌ |
| Jaro | ❌ | ❌ | ❌ | ✅ |
| JaroWinkler | ❌ | ❌ | ❌ | ✅ |
| Sift4 | ✅ | ✅ | ✅ | ✅ |
| Levenshtein | ✅ | ✅ | ✅ | ❌ |
| DamerauLevenshtein | ✅ | ✅ | ✅ | ✅ |
  • Hamming is the fastest one but detects only substitutions.
  • Sift4 is very fast but not as well-known and battle-tested as other algorithms.
  • Jaro is slower than Sift4 but well-known and battle-tested.
  • JaroWinkler is like Jaro but gives more weight to strings with a matching prefix.
  • Levenshtein detects everything except transpositions and is faster than DamerauLevenshtein (but slower than the other algorithms).
  • DamerauLevenshtein ticks all the boxes but is very slow.
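
To make these trade-offs concrete, here is a small sketch using the shortcut functions on a single adjacent transposition ("abcd" vs "acbd"):

use textdistance::str::{damerau_levenshtein, hamming, levenshtein};

// "abcd" -> "acbd": the adjacent characters "b" and "c" are swapped.
assert!(hamming("abcd", "acbd") == 2);             // sees two substitutions
assert!(levenshtein("abcd", "acbd") == 2);         // two edits; no transposition support
assert!(damerau_levenshtein("abcd", "acbd") == 1); // a single transposition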

Here are some real-world use cases:

  • DamerauLevenshtein with some optimizations is used in cargo to correct typos in command names.
  • Jaro is included in the Elixir standard library (String.jaro_distance). It is used by the compiler and by mix (cargo for Elixir) to provide the “did you mean?” functionality for typos in module or command names.
  • RatcliffObershelp variation is included in the Python standard library (difflib.SequenceMatcher).

§Benchmarks

Legend:

  • 🐌 is very slow (> 5 ms)
  • 🐢 is slow (1 ms – 5 ms)
  • 🐇 is fast (500 µs – 1 ms)
  • 🐎 is very fast (< 500 µs)

Edit-based (and their normalizations):

| algorithm | time |
|-----------|------|
| hamming | 🐎 19.203 µs |
| mlipns | 🐎 20.625 µs |
| sift4_simple | 🐎 143.69 µs |
| sift4_common | 🐎 238.86 µs |
| jaro | 🐢 1.7148 ms |
| jaro_winkler | 🐢 1.7174 ms |
| levenshtein | 🐢 4.5999 ms |
| yujian_bo | 🐢 4.6044 ms |
| lig3 | 🐌 6.5563 ms |
| smith_waterman | 🐌 9.5146 ms |
| damerau_levenshtein_restricted | 🐌 10.301 ms |
| damerau_levenshtein | 🐌 41.938 ms |

Token-based:

| algorithm | time |
|-----------|------|
| cosine | 🐇 508.59 µs |
| sorensen_dice | 🐇 510.75 µs |
| tversky | 🐇 512.41 µs |
| overlap | 🐇 513.76 µs |
| bag | 🐇 523.06 µs |
| jaccard | 🐇 580.79 µs |
| roberts | 🐇 714.79 µs |
| entropy_ncd | 🐇 731.68 µs |

Sequence-based:

| algorithm | time |
|-----------|------|
| lcsstr | 🐢 3.2658 ms |
| lcsseq | 🐌 7.4349 ms |
| ratcliff_obershelp | 🐌 36.308 ms |

Naive:

| algorithm | time |
|-----------|------|
| length | 🐎 2.5300 µs |
| prefix | 🐎 22.473 µs |
| suffix | 🐎 38.821 µs |

The benchmarks are powered by criterion and live in the benches directory. They are quite simple: grab 10 open-source licenses, take a 200-char prefix from each, and cross-compare these prefixes. The numbers may be very different for a different kind or length of input, when comparing words rather than characters, or when running the benchmarks on a different machine. The goal of these benchmarks is to provide basic guidance rather than give a definitive answer. If performance is critical for your application, I encourage you to run your own benchmarks on the real data you have.
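
For reference, a minimal criterion benchmark for a single algorithm might look like the sketch below (an illustrative stand-alone bench file with placeholder inputs, not the crate's actual benchmark code):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use textdistance::str::levenshtein;

fn bench_levenshtein(c: &mut Criterion) {
    // Placeholder inputs; the real benches cross-compare license prefixes.
    let a = "the quick brown fox jumps over the lazy dog";
    let b = "the quick brown dog jumps over the lazy fox";
    c.bench_function("levenshtein", |bencher| {
        bencher.iter(|| levenshtein(black_box(a), black_box(b)));
    });
}

criterion_group!(benches, bench_levenshtein);
criterion_main!(benches);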

§Versioning

We stick to SemVer:

  1. The patch number is for bug fixes. The results of an algorithm may change in some corner cases if we find that the previous implementation doesn't match the algorithm described in the original paper.
  2. The minor number is for new algorithms and features.
  3. The major number is for big changes in the API. We try to avoid breaking things, but we prefer providing a friendly and convenient API over keeping backward compatibility.

§Limitations

  • In the original textdistance, most of the algorithms are adjusted to work on any number of input sequences. However, Rust doesn't support variadic arguments, so all algorithms are currently implemented for exactly two inputs.
  • All algorithms in the crate implement the same Algorithm trait. Hence metrics that place additional requirements on the input sequence elements beyond Eq (like Editex and MRA, which work only with ASCII letters) currently cannot be implemented.
  • Most of the implemented algorithms have certain properties (like commutativity) that make their behavior match expectations and make normalization simple. So, I haven't yet implemented Needleman-Wunsch or Gotoh, mostly because they are tricky to normalize and I'm still not 100% sure that I did it correctly in the original textdistance.

§Acknowledgments

These are the libraries that I used as reference implementations and sources of test cases:

Special thanks to Trevor Gross for transferring to me the ownership of the textdistance name on crates.io.

§Testing locally

To run everything locally, all you need is Rust, Python, and task. Execute task all to run all code formatters, linters, and tests.

Thank you ❤️

Modules§

nstr
Helper functions providing the default normalized implementation of distance/similarity algorithms for strings.
str
Helper functions providing the default implementation of distance/similarity algorithms for strings.

Structs§

Bag
Bag distance is the maximum number of items in one sequence that are not present in the other.
Cosine
Cosine similarity is the cosine of the angle between two vectors.
DamerauLevenshtein
Damerau-Levenshtein distance is an edit distance between two sequences.
EntropyNCD
Entropy-based Normalized Compression Distance.
Hamming
Hamming distance is the number of positions at which the corresponding symbols are different.
Jaccard
Jaccard similarity is a ratio of intersection to union of two sets.
Jaro
Jaro similarity is calculated based on the number of transpositions to turn one string into the other.
JaroWinkler
Jaro-Winkler similarity is a variation of Jaro with a better rating for strings with a matching prefix.
LCSSeq
The length of the Longest common subsequence.
LCSStr
The length of the Longest common substring.
LIG3
LIG3 similarity is a normalization of Hamming by Levenshtein.
Length
Length distance is the absolute difference between the lengths of the two sequences.
Levenshtein
Levenshtein distance is an edit distance between two sequences.
MLIPNS
MLIPNS similarity is a normalization for Hamming that returns either 0 or 1.
Overlap
Overlap similarity is the size of the intersection divided by the smaller of the size of the two sets.
Prefix
Prefix similarity is the length of the longest common prefix for the given sequences.
RatcliffObershelp
Ratcliff/Obershelp similarity extends LCSStr by recursively finding matches on both sides of the longest common substring.
Result
Result of a distance/similarity algorithm.
Roberts
Roberts similarity.
Sift4Common
Sift4 distance is an edit algorithm designed to be “fast and relatively accurate”.
Sift4Simple
Sift4 distance is an edit algorithm designed to be “fast and relatively accurate”.
SmithWaterman
Smith-Waterman similarity is edit-based and designed for nucleic acid (and protein) sequences.
SorensenDice
Sørensen–Dice similarity is a ratio of common chars to total chars in the given strings.
Suffix
Suffix similarity is the length of the longest common suffix of the given sequences.
Tversky
Tversky similarity is a generalization of SorensenDice and Jaccard.
YujianBo
Yujian-Bo distance is a normalization of Levenshtein.

Traits§

Algorithm
A base trait for all distance/similarity algorithms.