Module stringmetrics::algorithms

Expand description

Stringmetric Algorithms

This module includes the various implementations for Levenshthein and Hamming distance, as well as the Jaccard index. See these modules for in-depth explanation of how the algorithms work, or the function docs for usage information

All relevant functions can be directly imported from stringmetrics, no need to access them nested modules (see the example below).

Example

use stringmetrics::levenshtein;
let a = "this is a book";
let b = "i am a cook";
assert_eq!(levenshtein(a, b), 6);

Algorithm Descriptions

This section seeks to give an overview of the different algorithms contained in this module. See individual functions for usage guidelines.

Hamming Distance Algorithm

The hamming distance between two equal length strings is the number of substitutions required to change one string into the other. Computation is very simple; all that is required is to iterate through and track differences. Hamming distance is implemented by hamming and hamming_iter

Levenshtein distance Algorithm

The funcition levenshtein implements the following algorithm. Basically, the tool parses from top left to bottom right to create a table like follows, for the classic example:

     j → 0  1  2  3  4  5  6  7
i             str B  →
↓           s  i  t  t  i  n  g
0  s    [0, 1, 2, 3, 4, 5, 6, 7]
1  t  k [1, 1, 2, 3, 4, 5, 6, 7]
2  r  i [2, 2, 1, 2, 3, 4, 5, 6]
3     t [3, 3, 2, 1, 2, 3, 4, 5]
4  A  t [4, 4, 3, 2, 1, 2, 3, 4]
5  ↓  e [5, 5, 4, 3, 2, 2, 3, 4]
6     n [6, 6, 5, 4, 3, 3, 2, 3]

The outer rows (at i=0 and j=0) are just incrementing and can be filled in automatically. Then, the algorithm works its way left-to-right then top-to-bottom to fill in the table. Rules are:

Insertion cost is the value of the field to the left plus one
Deletion cost is the value of the field above plus one
Substitution cost is the value of the field left above if the two relevant characters are the same. If they are different, add one to this value.
Take the minimum of these three values and put it in the current box.

Eventually, the above matrix can be filled out and the final “distance” is the number at the bottom right; in this case, 3.

This library uses an algorithm published by Sten Hjelmqvist and described on the Levenshtein distance Wikipedia page but adapted to use a single vector. Main memory usage is only that of a Vec<u32> in the same length as the shortest string.

Limited Levenshtein algorithm

This uses the same algorithm as above, but stops matching at a desired limit to avoid spending resources on obviously different strings. Use this version where possible, implemented by levenshtein_limit.

use stringmetrics::levenshtein_limit;
assert_eq!(levenshtein_limit("superlongstring", "", 3), 3);

Weighted Levenshtein algorithm

Implemented by levenshtein_weight and levenshtein_weight_iter, a weighted Levenshtein algorithm just allows applying differing penalties for insertion, deletion, and substitution. It’s basically the same algorithm as above, except the added values are specified (rather than always one), and the initial row population is related to insertion and deletion weights.

For example, the following code uses insertion, deletion, and substitution weights of 4, 3, and 2, respectively. There is also a limit of 100:

use stringmetrics::{levenshtein_weight, LevWeights};
let weights = LevWeights::new(4, 3, 2);
assert_eq!(levenshtein_weight("kitten", "sitting", 100, &weights), 8);

Creates the following matrix:

      j → 0   1   2   3   4   5   6   7
i             str B  →
↓             s   i   t   t   i   n   g
0  s    [ 0,  4,  8, 12, 16, 20, 24, 28]
1  t  k [ 3,  2,  6, 10, 14, 18, 22, 26]
2  r  i [ 6,  5,  2,  6, 10, 14, 18, 22]
3     t [ 9,  8,  5,  2,  6, 10, 14, 18]
4  A  t [12, 11,  8,  5,  2,  6, 10, 14]
5  ↓  e [15, 14, 11,  8,  5,  4,  8, 12]
6     n [18, 17, 14, 11,  8,  7,  4,  8]

Rules are similar to above:

Insertion cost is the value of the field to the left plus insertion weight
Deletion cost is the value of the field above plus deletion weight
Substitution cost is the value of the field left above if the two relevant characters are the same. If they are different, add substitution weight to this value.
Take the minimum of these three values and put it in the current box.

The result of 8 is representative of one added letter (+4) and two substitutions (+2*2). The substitution could alternatively be counted as an insertion followed by a deletion but the algorithm “chooses” against it since the cost would be much higher (4+3=7 when the substitution cost is only 2).)

Note on string comparisons

All string-based levenshtein algorithms use bytes rather than characters by default. This speeds things up significantly, and usually the difference is unimportant. However, if you are working with CJK character sets or emojis, you may prefer the somewhat more accurate (but slower) chars() usage. This example presents the case well:

use stringmetrics::{levenshtein_limit, levenshtein_limit_iter};

// Default implementation; these are not the "expected" values
// In many cases, this may be acceptable
assert_eq!(levenshtein_limit("鱼", "雪", 100), 2);
assert_eq!(levenshtein_limit("😙", "🔬", 100), 2);

// "Expected" values by using iterator functions
assert_eq!(levenshtein_limit_iter("鱼".chars(), "雪".chars(), 100), 1);
assert_eq!(levenshtein_limit_iter("😙".chars(), "🔬".chars(), 100), 1);

If accurate matching on further extended unicode is required, the unicode segmentation crate can be used to split on the iterable graphemes(true).

Structs

LevWeights

A struct that holds the costs of insertion, deletion, and substitution. Used for levenshthein algorithms that require weight specifications.

Functions

hamming

Hamming distance computations

hamming_iter

Apply the hamming function to iterators

jaccard

Calculate the Jaccard index on two iterators using jaccard_set

jaccard_set

Calculate the Jaccard index on two HashSets.

levenshtein

Basic Levenshtein distance computation

levenshtein_limit

Levenshtein distance computation with a limit

levenshtein_limit_iter

Levenshthein distance computation on anything with Iterator with items that have PartialEq.

levenshtein_weight

Levenshtein distance computations with adjustable weights and a limit

levenshtein_weight_iter

Weighted Levenshthein distance computation on anything with Iterator with items that have PartialEq.