Expand description
§Stringmetrics library
Stringmetrics
is a library for applying text- and token- based comparison
algorithms to determine the similarity of two strings or sets. It currently
includes a variety of implementations of Levenshtein
distance, Hamming
distance, and Jaccard
Similarity, with more string
metrics expected to be added in the future. It also includes helpful
tokenizers for things like splitting sentences into words.
§Example
use stringmetrics::levenshtein;
assert_eq!(levenshtein("kitten", "sitting"), 3);
§Algorithm Descriptions
This section seeks to give an overview of the different algorithms contained in this module. See individual functions for usage guidelines.
§Hamming Distance Algorithm
The hamming distance between two equal length strings is the number of
substitutions required to change one string into the other. Computation is
very simple; all that is required is to iterate through and track
differences. Hamming distance is implemented by hamming
and
hamming_iter
§Levenshtein distance Algorithm
The funcition levenshtein implements the following algorithm. Basically, the tool parses from top left to bottom right to create a table like follows, for the classic example:
j → 0 1 2 3 4 5 6 7
i str B →
↓ s i t t i n g
0 s [0, 1, 2, 3, 4, 5, 6, 7]
1 t k [1, 1, 2, 3, 4, 5, 6, 7]
2 r i [2, 2, 1, 2, 3, 4, 5, 6]
3 t [3, 3, 2, 1, 2, 3, 4, 5]
4 A t [4, 4, 3, 2, 1, 2, 3, 4]
5 ↓ e [5, 5, 4, 3, 2, 2, 3, 4]
6 n [6, 6, 5, 4, 3, 3, 2, 3]
The outer rows (at i=0 and j=0) are just incrementing and can be filled in automatically. Then, the algorithm works its way left-to-right then top-to-bottom to fill in the table. Rules are:
- Insertion cost is the value of the field to the left plus one
- Deletion cost is the value of the field above plus one
- Substitution cost is the value of the field left above if the two relevant characters are the same. If they are different, add one to this value.
- Take the minimum of these three values and put it in the current box.
Eventually, the above matrix can be filled out and the final “distance” is the number at the bottom right; in this case, 3.
This library uses an algorithm published by Sten
Hjelmqvist
and described on the Levenshtein distance Wikipedia
page
but adapted to use a single vector. Main memory usage is only that of a
Vec<u32>
in the same length as the shortest string.
Please note: this library eventually aims to replace the current algorithm with one that is more performant across varying lengths of strings. The interface will not change.
§Limited Levenshtein algorithm
This uses the same algorithm as above, but stops matching at a desired limit
to avoid spending resources on obviously different strings. Use this version
where possible, implemented by levenshtein_limit
.
use stringmetrics::levenshtein_limit;
assert_eq!(levenshtein_limit("superlongstring", "", 3), 3);
§Weighted Levenshtein algorithm
Implemented by levenshtein_weight
and levenshtein_weight_iter
, a
weighted Levenshtein algorithm just allows applying differing penalties for
insertion, deletion, and substitution. It’s basically the same algorithm as
above, except the added values are specified (rather than always one), and
the initial row population is related to insertion and deletion weights.
For example, the following code uses insertion, deletion, and substitution weights of 4, 3, and 2, respectively. There is also a limit of 100:
use stringmetrics::{levenshtein_weight, LevWeights};
let weights = LevWeights::new(4, 3, 2);
assert_eq!(levenshtein_weight("kitten", "sitting", 100, &weights), 8);
Creates the following matrix:
j → 0 1 2 3 4 5 6 7
i str B →
↓ s i t t i n g
0 s [ 0, 4, 8, 12, 16, 20, 24, 28]
1 t k [ 3, 2, 6, 10, 14, 18, 22, 26]
2 r i [ 6, 5, 2, 6, 10, 14, 18, 22]
3 t [ 9, 8, 5, 2, 6, 10, 14, 18]
4 A t [12, 11, 8, 5, 2, 6, 10, 14]
5 ↓ e [15, 14, 11, 8, 5, 4, 8, 12]
6 n [18, 17, 14, 11, 8, 7, 4, 8]
Rules are similar to above:
- Insertion cost is the value of the field to the left plus insertion weight
- Deletion cost is the value of the field above plus deletion weight
- Substitution cost is the value of the field left above if the two relevant characters are the same. If they are different, add substitution weight to this value.
- Take the minimum of these three values and put it in the current box.
The result of 8 is representative of one added letter (+4) and two substitutions (+2*2). The substitution could alternatively be counted as an insertion followed by a deletion but the algorithm “chooses” against it since the cost would be much higher (4+3=7 when the substitution cost is only 2).)
§Note on string comparisons
All string-based levenshtein algorithms use bytes rather than characters by
default. This speeds things up significantly, and usually the difference is
unimportant. However, if you are working with CJK character sets or emojis,
you may prefer the somewhat more accurate (but slower) chars()
usage. This
example presents the case well:
use stringmetrics::{levenshtein_limit, levenshtein_limit_iter};
// Default implementation; these are not the "expected" values
// In many cases, this may be acceptable
assert_eq!(levenshtein_limit("鱼", "雪", 100), 2);
assert_eq!(levenshtein_limit("😙", "🔬", 100), 2);
// "Expected" values by using iterator functions
assert_eq!(levenshtein_limit_iter("鱼".chars(), "雪".chars(), 100), 1);
assert_eq!(levenshtein_limit_iter("😙".chars(), "🔬".chars(), 100), 1);
If accurate matching on further extended unicode is required, the unicode
segmentation
crate
can be used to split on the iterable graphemes(true)
.
§Jaccard Similarity
Jaccard similarity or the Jaccard Index of two sets is the number of items
found in both sets, divided by the number of unique items in the two sets.
This is often used for things like n-gram string similarity. Relevant
functions are jaccard
and jaccard_set
use stringmetrics::jaccard;
let crew1 = ["Einar", "Olaf", "Harald"];
let crew2 = ["Olaf", "Harald", "Birger"];
assert_eq!(jaccard(crew1.iter(), crew2.iter()), 0.5);
Modules§
- errors
- This module includes errors used by this crate.
Structs§
- LevWeights
- A struct that holds the costs of insertion, deletion, and substitution. Used for levenshthein algorithms that require weight specifications.
Functions§
- hamming
- Hamming distance computations
- hamming_
iter - Apply the hamming function to iterators
- jaccard
- Calculate the Jaccard index on two iterators using
jaccard_set
- jaccard_
set - Calculate the Jaccard index on two
HashSet
s. - levenshtein
- Basic Levenshtein distance computation
- levenshtein_
limit - Levenshtein distance computation with a limit
- levenshtein_
limit_ iter - Levenshthein distance computation on anything with
Iterator
with items that havePartialEq
. - levenshtein_
weight - Levenshtein distance computations with adjustable weights and a limit
- levenshtein_
weight_ iter - Weighted Levenshthein distance computation on anything with
Iterator
with items that havePartialEq
. - try_
levenshtein - The same alrogithm as
levenshtein_limit
but return anOption
to indicate if the limit is exceeded - try_
levenshtein_ iter - The same algorithm as [
levenshtein_limit_iter
] but return anOption
to indicate if the limit is exceeded - try_
levenshtein_ weight - The same algorithm as
levenshtein_weight
but return anOption
to indicate if the limit is exceeded - try_
levenshtein_ weight_ iter - The same algorithm as
levenshtein_weight_iter
but return anOption
to indicate if the limit is exceeded