Expand description
§Ngrammatic
This crate provides fuzzy search/string matching using N-grams.
This implementation is character-based, rather than word based, matching solely based on string similarity. It is modelled somewhat after the python ngram module with some inspiration from chappers’ blog post on fuzzy matching with ngrams.
The crate is implemented in three parts: the Corpus, which is an
index connecting strings (words, symbols, whatever) to their Ngrams,
and SearchResults, which contains a fuzzy match result, with the
word and a similarity measure in the range of 0.0 to 1.0.
The general usage pattern is to construct a Corpus, .add() your
list of valid symbols to it, and then perform .search()es of valid,
unknown, misspelled, etc symbols on the Corpus. The results come
back as a vector of up to 10 results, sorted from highest similarity
to lowest.
Licensed under the MIT license.
§Installation
This crate is published on crates.io.
To use it, add this to your Cargo.toml:
[dependencies]
ngrammatic = "0.7"Or, for a possible improvement search speeds, enable the “rayon” feature:
[dependencies]
ngrammatic = { version = "0.7", features = ["rayon"] }Benchmark results show rayon offers a 30-80% performance improvement in search, but about a 2% performance decline for corpus creation. When enabling rayon, you’ll likely see better results using serial corpus creation and parallel search, but this is no substitute for running your own benchmarks.
§Usage example
To do fuzzy matching, build up your corpus of valid symbols like this:
use ngrammatic::{CorpusBuilder, Pad};
let mut corpus = CorpusBuilder::default()
.arity(2)
.pad_full(Pad::Auto)
.finish();
// Build up the list of known words
corpus.add_text("pie");
corpus.add_text("animal");
corpus.add_text("tomato");
corpus.add_text("seven");
corpus.add_text("carbon");
// Now we can try an unknown/misspelled word, and find a similar match
// in the corpus
let results = corpus.search("tomacco", 0.25, 10);
let top_match = results.first();
assert!(top_match.is_some());
assert!(top_match.unwrap().similarity > 0.5);
assert_eq!(top_match.unwrap().text,String::from("tomato"));§Benchmarking
Some benchmarks exist to compare the performance of various scenarios.
In order to run them, rayon must be enabled, e.g.:
$ cargo bench --features rayonThe benchmarks of the top-domains.txt file can take quite a long time to complete, as they’re working against a very large dataset.
Here’s a sample of data collected from the 0.7 version on my development machine (approx. 15-20% faster than 0.5!):
| Test | lower bound (ms) | typical (ms) | upper bound (ms) |
|---|---|---|---|
| novel parallel insertion case sensitive | 69.424 | 69.680 | 69.951 |
| novel parallel insertion case insensitive | 69.865 | 70.118 | 70.392 |
| novel serial insertion case sensitive | 67.468 | 67.715 | 67.976 |
| novel serial insertion case insensitive | 68.839 | 69.253 | 69.698 |
| random text parallel insertion case sensitive | 107.67 | 107.94 | 108.22 |
| random text parallel insertion case insensitive | 109.89 | 110.41 | 110.97 |
| random text serial insertion case sensitive | 107.46 | 107.90 | 108.41 |
| random text serial insertion case insensitive | 108.04 | 108.39 | 108.77 |
| domain names parallel insertion case sensitive | 108.76 | 109.15 | 109.56 |
| domain names parallel insertion case insensitive | 108.96 | 109.27 | 109.59 |
| domain names serial insertion case sensitive | 107.00 | 107.23 | 107.46 |
| domain names serial insertion case insensitive | 107.70 | 107.93 | 108.19 |
| novel parallel search no match | 0.49750 | 0.50972 | 0.52333 |
| novel parallel search match | 2.9885 | 3.0266 | 3.0700 |
| novel serial search no match | 0.75857 | 0.75919 | 0.75993 |
| novel serial search match | 6.7814 | 6.8057 | 6.8324 |
| random text parallel search no match | 0.47633 | 0.49054 | 0.50553 |
| random text parallel search match | 2.6893 | 2.7269 | 2.7673 |
| random text serial search no match | 1.3826 | 1.3860 | 1.3901 |
| random text serial search match | 13.709 | 13.940 | 14.182 |
| domain names parallel search no match | 15.018 | 15.194 | 15.379 |
| domain names parallel search match | 1590.0 | 1593.5 | 1597.1 |
| domain names serial search no match | 61.813 | 62.431 | 63.085 |
| domain names serial search match | 4466.2 | 4475.0 | 4498.9 |
Do note that those search times against the top domain names corpus were taking several seconds to complete in the case where a perfect match exists. It’s unclear at the moment why search results with perfect matches always take significantly longer.
§Areas for future improvement
Adding string interning to the corpus was a really big performance and memory
win. Unfortunately, updating Ngram with interning is quite a bit more
complicated, and I’m open to proposals for how to accomplish it.
In the meantime, replacing Strings in Ngram with SmolStrs netted about a 15%
performance win vs without.
New major gains will likely require some thoughtful cpu and memory profiling.
Structs§
- Corpus
- Holds a corpus of words and their ngrams, allowing fuzzy matches of candidate strings against known strings in the corpus.
- Corpus
Builder - Build an Ngram Corpus, one setting at a time.
- Identity
KeyTransformer - Identity key transformer.
- Linked
KeyTransformer - Linked key transformer.
- Lower
KeyTransformer - Lowercasing key transformer.
- Ngram
- Stores a “word”, with all its n-grams. The “arity” member determines the value of “n” used in generating the n-grams.
- Ngram
Builder - Build an
Ngram, one setting at a time. - Search
Result - Holds a fuzzy match search result string, and its associated similarity to the query text.
Enums§
- Pad
- Determines how strings are padded before calculating the grams.
Having some sort of padding is especially important for small words
Auto pad pre/appends
arity-1 space chars Read more about the effect of ngram padding
Traits§
- KeyTransformer
- Trait for key transformers.