Crate ngrammatic

Crate ngrammatic 

Source
Expand description

§Ngrammatic

Build status Crates.io Documentation

This crate provides fuzzy search/string matching using N-grams.

This implementation is character-based, rather than word based, matching solely based on string similarity. It is modelled somewhat after the python ngram module with some inspiration from chappers’ blog post on fuzzy matching with ngrams.

The crate is implemented in three parts: the Corpus, which is an index connecting strings (words, symbols, whatever) to their Ngrams, and SearchResults, which contains a fuzzy match result, with the word and a similarity measure in the range of 0.0 to 1.0.

The general usage pattern is to construct a Corpus, .add() your list of valid symbols to it, and then perform .search()es of valid, unknown, misspelled, etc symbols on the Corpus. The results come back as a vector of up to 10 results, sorted from highest similarity to lowest.

Licensed under the MIT license.

§Installation

This crate is published on crates.io.

To use it, add this to your Cargo.toml:

[dependencies]
ngrammatic = "0.7"

Or, for a possible improvement search speeds, enable the “rayon” feature:

[dependencies]
ngrammatic = { version = "0.7", features = ["rayon"] }

Benchmark results show rayon offers a 30-80% performance improvement in search, but about a 2% performance decline for corpus creation. When enabling rayon, you’ll likely see better results using serial corpus creation and parallel search, but this is no substitute for running your own benchmarks.

§Usage example

To do fuzzy matching, build up your corpus of valid symbols like this:

use ngrammatic::{CorpusBuilder, Pad};

let mut corpus = CorpusBuilder::default()
    .arity(2)
    .pad_full(Pad::Auto)
    .finish();

// Build up the list of known words
corpus.add_text("pie");
corpus.add_text("animal");
corpus.add_text("tomato");
corpus.add_text("seven");
corpus.add_text("carbon");

// Now we can try an unknown/misspelled word, and find a similar match
// in the corpus
let results = corpus.search("tomacco", 0.25, 10);
let top_match = results.first();

assert!(top_match.is_some());
assert!(top_match.unwrap().similarity > 0.5);
assert_eq!(top_match.unwrap().text,String::from("tomato"));

§Benchmarking

Some benchmarks exist to compare the performance of various scenarios.

In order to run them, rayon must be enabled, e.g.:

$ cargo bench --features rayon

The benchmarks of the top-domains.txt file can take quite a long time to complete, as they’re working against a very large dataset.

Here’s a sample of data collected from the 0.7 version on my development machine (approx. 15-20% faster than 0.5!):

Testlower bound (ms)typical (ms)upper bound (ms)
novel parallel insertion case sensitive69.42469.68069.951
novel parallel insertion case insensitive69.86570.11870.392
novel serial insertion case sensitive67.46867.71567.976
novel serial insertion case insensitive68.83969.25369.698
random text parallel insertion case sensitive107.67107.94108.22
random text parallel insertion case insensitive109.89110.41110.97
random text serial insertion case sensitive107.46107.90108.41
random text serial insertion case insensitive108.04108.39108.77
domain names parallel insertion case sensitive108.76109.15109.56
domain names parallel insertion case insensitive108.96109.27109.59
domain names serial insertion case sensitive107.00107.23107.46
domain names serial insertion case insensitive107.70107.93108.19
novel parallel search no match0.497500.509720.52333
novel parallel search match2.98853.02663.0700
novel serial search no match0.758570.759190.75993
novel serial search match6.78146.80576.8324
random text parallel search no match0.476330.490540.50553
random text parallel search match2.68932.72692.7673
random text serial search no match1.38261.38601.3901
random text serial search match13.70913.94014.182
domain names parallel search no match15.01815.19415.379
domain names parallel search match1590.01593.51597.1
domain names serial search no match61.81362.43163.085
domain names serial search match4466.24475.04498.9

Do note that those search times against the top domain names corpus were taking several seconds to complete in the case where a perfect match exists. It’s unclear at the moment why search results with perfect matches always take significantly longer.

§Areas for future improvement

Adding string interning to the corpus was a really big performance and memory win. Unfortunately, updating Ngram with interning is quite a bit more complicated, and I’m open to proposals for how to accomplish it.

In the meantime, replacing Strings in Ngram with SmolStrs netted about a 15% performance win vs without.

New major gains will likely require some thoughtful cpu and memory profiling.

Structs§

Corpus
Holds a corpus of words and their ngrams, allowing fuzzy matches of candidate strings against known strings in the corpus.
CorpusBuilder
Build an Ngram Corpus, one setting at a time.
IdentityKeyTransformer
Identity key transformer.
LinkedKeyTransformer
Linked key transformer.
LowerKeyTransformer
Lowercasing key transformer.
Ngram
Stores a “word”, with all its n-grams. The “arity” member determines the value of “n” used in generating the n-grams.
NgramBuilder
Build an Ngram, one setting at a time.
SearchResult
Holds a fuzzy match search result string, and its associated similarity to the query text.

Enums§

Pad
Determines how strings are padded before calculating the grams. Having some sort of padding is especially important for small words Auto pad pre/appends arity-1 space chars Read more about the effect of ngram padding

Traits§

KeyTransformer
Trait for key transformers.