1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
//! This module contains functions for applying various closeness algorithms.
//!
//! See the indiviual functions for usage and examples. This page intends to
//! give a high level overview of how the functions are implemented.
//!
//! # Levenshtein distance algorithm
//!
//! The funcition [levenshtein][crate::algorithms::levenshtein] implements the
//! algorithm below:
//!
//! $$ \operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) &\text{if }
//! \min(i,j) = 0, \\
//! \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\
//! \operatorname{lev}_{a,b}(i,j-1) + 1 \\
//! \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \ne b_i)}
//! \end{cases}
//! &\text{otherwise} \end{cases}$$
//!
//! _(erm... I can't seem to get KaTeX working. Let me know on GitHub if you can
//! help!)_
//!
//! Easier shown than read. Basically, the algorithm parses from top left to
//! bottom right to create a table like follows for the classic example:
//!
//! ```text
//! j → 0 1 2 3 4 5 6 7
//! i str B ->
//! ↓ s i t t i n g
//! 0 s [0, 1, 2, 3, 4, 5, 6, 7]
//! 1 t k [1, 1, 2, 3, 4, 5, 6, 7]
//! 2 r i [2, 2, 1, 2, 3, 4, 5, 6]
//! 3 t [3, 3, 2, 1, 2, 3, 4, 5]
//! 4 A t [4, 4, 3, 2, 1, 2, 3, 4]
//! 5 ↓ e [5, 5, 4, 3, 2, 2, 3, 4]
//! 6 n [6, 6, 5, 4, 3, 3, 2, 3]
//! ```
//!
//! The outer rows (at i=0 and j=0) are just incrementing and can be filled in
//! automatically. Then, the algorithm works its way left-to-right then
//! top-to-bottom to fill in the table. Rules are:
//!
//! 1. Insertion cost is the value of the field to the left plus one
//! 2. Deletion cost is the value of the field above plus one
//! 3. Substitution cost is the value of the field left above if the two
//! relevant characters are the same. If they are different, add one to this
//! value.
//! 4. Take the minimum of these three values and put it in the current box.
//!
//! Eventually, the above matrix can be filled out and the final "distance" is
//! the number at the bottom right; in this case, 3.
//!
//! This library uses [an algorithm published by Sten
//! Hjelmqvist](https://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm-2)
//! and described on [the Levenshtein distance Wikipedia
//! page](https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows)
//! that only uses two vectors at a time, rather than constructing the entire
//! matrix, for memory optimizations. Memory use is only that for a vector of
//! u32 twice the length of string B.
//!
//! ## Limited Levenshtein algorithm
//!
//! This is easy; same algorithm as above, just stop matching when you hit a
//! desired limit to avoid spending resources on obviously different strings.
//! Use this version where possible, implemented by [`levenshtein_limit`].
//!
//! ```
//! use stringmetrics::algorithms::levenshtein_limit;
//! assert_eq!(levenshtein_limit("superlongstring", "", 3), 3);
//! ```
//!
//! ## Weighted Levenshtein algorithm
//!
//! Implemented by [`levenshtein_weight`] and [`levenshtein_limit_weight`], a
//! weighted Levenshtein algorithm just allows applying differing penalties for
//! insertion, deletion, and substitution. It's basically the same algorithm as
//! above, except the added values are weights rather than one, and the initial
//! row population is related to insertion and deletion weights.
//!
//! For example, the following code:
//!
//! ```
//! use stringmetrics::algorithms::levenshtein_weight;
//! assert_eq!(levenshtein_weight("kitten", "sitting", 4, 3, 2), 8);
//! ```
//! Creates the following matrix:
//!
//! ```text
//! j → 0 1 2 3 4 5 6 7
//! i str B ->
//! ↓ s i t t i n g
//! 0 s [ 0, 4, 8, 12, 16, 20, 24, 28]
//! 1 t k [ 3, 2, 6, 10, 14, 18, 22, 26]
//! 2 r i [ 6, 5, 2, 6, 10, 14, 18, 22]
//! 3 t [ 9, 8, 5, 2, 6, 10, 14, 18]
//! 4 A t [12, 11, 8, 5, 2, 6, 10, 14]
//! 5 ↓ e [15, 14, 11, 8, 5, 4, 8, 12]
//! 6 n [18, 17, 14, 11, 8, 7, 4, 8]
//! ```
//!
//!
mod basic;
mod damerau;
mod jaccard;
mod levenshtein;
pub use self::basic::hamming;
pub use self::damerau::damerau_levenshtein;
pub use self::jaccard::{jaccard, jaccard_set};
pub use self::levenshtein::{
levenshtein, levenshtein_limit, levenshtein_limit_weight, levenshtein_weight,
};