ngrammatic 0.5.0

Character-oriented ngram generator and fuzzy matching library.

This crate provides fuzzy search/string matching using N-grams.

This implementation is character-based, rather than word based, matching solely based on string similarity.
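For example, with an arity of 2 and automatic padding, each string is decomposed into overlapping character bigrams, and two strings are considered similar when they share many grams. The sketch below only illustrates that idea using a plain Jaccard overlap; it is not the crate's internal representation or its exact scoring formula.

use std::collections::HashSet;

// Split `text` into overlapping character n-grams of the given arity,
// padding both ends with spaces (similar in spirit to Pad::Auto).
fn grams(text: &str, arity: usize) -> HashSet<String> {
    let pad = " ".repeat(arity - 1);
    let padded: Vec<char> = format!("{pad}{text}{pad}").chars().collect();
    padded
        .windows(arity)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    let a = grams("tomato", 2);
    let b = grams("tomacco", 2);
    // A Jaccard-style overlap of the two gram sets; the crate's own
    // similarity formula differs, but the intuition is the same.
    let shared = a.intersection(&b).count() as f32;
    let total = a.union(&b).count() as f32;
    println!("similarity: {:.2}", shared / total);
}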

Licensed under the MIT license.

Documentation

https://docs.rs/ngrammatic/latest/ngrammatic/

Installation

This crate is published on crates.io.

To use it, add this to your Cargo.toml:

[dependencies]
ngrammatic = "0.5"

Or, for a possible improvement in search speed, enable the "rayon" feature:

[dependencies]
ngrammatic = { version = "0.5", features = ["rayon"] }

Benchmark results show a 0-80% performance improvement in search, but a 0-3% performance decline in corpus creation. With such a wide variation in impact, the feature is not enabled by default. When it is enabled, you can use any combination of serial and parallel methods, depending on which performs better for your use case.
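If you're unsure whether the feature helps for your data, a quick wall-clock measurement of the call you actually make is usually enough to decide. A minimal sketch (the corpus contents and query below are placeholders for your own, and the result type being a collection with len() is inferred from the usage example further down):

use std::time::Instant;
use ngrammatic::{CorpusBuilder, Pad};

fn main() {
    let mut corpus = CorpusBuilder::new().arity(2).pad_full(Pad::Auto).finish();
    for word in ["pie", "animal", "tomato", "seven", "carbon"] {
        corpus.add_text(word);
    }

    // Time the call you actually make. With the "rayon" feature enabled,
    // repeat the measurement for the parallel code path (see the crate
    // documentation for the parallel method names) and keep whichever is
    // faster for your data.
    let start = Instant::now();
    let results = corpus.search("tomacco", 0.25);
    println!("{} result(s) in {:?}", results.len(), start.elapsed());
}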

Usage

To do fuzzy matching, build up your corpus of valid symbols like this:

use ngrammatic::{CorpusBuilder, Pad};

let mut corpus = CorpusBuilder::new()
    .arity(2)
    .pad_full(Pad::Auto)
    .finish();

// Build up the list of known words
corpus.add_text("pie");
corpus.add_text("animal");
corpus.add_text("tomato");
corpus.add_text("seven");
corpus.add_text("carbon");

// Now we can try an unknown/misspelled word, and find a similar match
// in the corpus
let word = String::from("tomacco");
if let Some(top_result) = corpus.search(&word, 0.25).first() {
    if top_result.similarity > 0.99 {
        println!("{}", top_result.text);
    } else {
        println!("{} (did you mean {}? [{:.0}% match])",
                 word,
                 top_result.text,
                 top_result.similarity * 100.0);
    }
} else {
    println!("🗙 {}", word);
}
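The example above only looks at the best hit, but search returns every candidate that clears the similarity threshold, best match first (that is what the .first() call relies on). Continuing the same example, listing a few runners-up is just iteration; treating the return value as a sorted collection is an inference from the usage above rather than something new.

// Show up to the three best candidates instead of only the top one.
for result in corpus.search("tomacco", 0.25).iter().take(3) {
    println!("{} ({:.0}% match)", result.text, result.similarity * 100.0);
}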

Benchmarking

Some benchmarks exist to compare the performance of various scenarios.

In order to run them, rayon must be enabled, e.g.:

cargo bench --features rayon

The benchmarks against the top-domains.txt file can take quite a long time to complete, as they work with a very large dataset.
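The lower bound / typical / upper bound figures below look like criterion-style output. If you want to benchmark a scenario of your own in a downstream project, a minimal bench could look like the sketch below; criterion as the harness, the file name, and the bench names are assumptions for illustration, not ngrammatic's own bench code.

// benches/search.rs in your own project (hypothetical file name).
// Assumes criterion in [dev-dependencies] and a [[bench]] entry with
// harness = false in Cargo.toml.
use criterion::{criterion_group, criterion_main, Criterion};
use ngrammatic::{CorpusBuilder, Pad};

fn bench_search(c: &mut Criterion) {
    // Build a small corpus once, outside the measured loop.
    let mut corpus = CorpusBuilder::new().arity(2).pad_full(Pad::Auto).finish();
    for word in ["pie", "animal", "tomato", "seven", "carbon"] {
        corpus.add_text(word);
    }

    c.bench_function("search no match", |b| {
        b.iter(|| corpus.search("zzzzzz", 0.25))
    });
    c.bench_function("search match", |b| {
        b.iter(|| corpus.search("tomacco", 0.25))
    });
}

criterion_group!(benches, bench_search);
criterion_main!(benches);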

Here's a sample of data collected from the 0.5 version on my development machine:

| Test | lower bound (ms) | typical (ms) | upper bound (ms) |
|------|------------------|--------------|-------------------|
| novel parallel insertion case sensitive | 84.477 | 84.857 | 85.236 |
| novel parallel insertion case insensitive | 83.806 | 84.185 | 84.581 |
| novel serial insertion case sensitive | 82.052 | 82.356 | 82.676 |
| novel serial insertion case insensitive | 81.476 | 81.793 | 82.135 |
| random text parallel insertion case sensitive | 154.46 | 154.95 | 155.45 |
| random text parallel insertion case insensitive | 156.2 | 156.74 | 157.31 |
| random text serial insertion case sensitive | 152.56 | 153.03 | 153.54 |
| random text serial insertion case insensitive | 153.9 | 154.28 | 154.67 |
| domain names parallel insertion case sensitive | 154.16 | 154.64 | 155.16 |
| domain names parallel insertion case insensitive | 154.61 | 155.58 | 156.71 |
| domain names serial insertion case sensitive | 151.89 | 152.29 | 152.69 |
| domain names serial insertion case insensitive | 152.76 | 153.28 | 153.83 |
| novel parallel search no match | 0.51049 | 0.51999 | 0.5306 |
| novel parallel search match | 3.1019 | 3.1362 | 3.1746 |
| novel serial search no match | 0.77879 | 0.77967 | 0.78066 |
| novel serial search match | 6.9894 | 7.0099 | 7.0331 |
| random text parallel search no match | 0.50239 | 0.51616 | 0.53169 |
| random text parallel search match | 2.8231 | 2.8733 | 2.9281 |
| random text serial search no match | 1.4228 | 1.4238 | 1.4248 |
| random text serial search match | 16.335 | 16.594 | 16.864 |
| domain names parallel search no match | 21.821 | 22.029 | 22.25 |
| domain names parallel search match | 2068.7 | 2071.8 | 2075.1 |
| domain names serial search no match | 81.061 | 81.687 | 82.348 |
| domain names serial search match | 5898.3 | 5906.6 | 5914.9 |

Note that searches against the top domain names corpus take several seconds to complete when a perfect match exists. It's currently unclear why searches that find a perfect match consistently take significantly longer.

Areas for future improvement

Adding string interning to the corpus was a really big performance and memory win. Unfortunately, updating Ngram with interning is quite a bit more complicated, and I'm open to proposals for how to accomplish it.

Interning the grams themselves (HashMap<String, usize>) might be less of a win, though: the default symbol for StringInterner is a u32, while the gram string itself is the same size or smaller for arity <= 4. Because we don't actually care about the 'stringness' of the grams, something like a tinyvec pre-sized to the arity could be a win.
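As a purely illustrative sketch of that last idea: a gram can be stored inline in a small fixed-capacity value and used directly as a hash-map key, avoiding a heap-allocated String per gram. The Gram type below is hypothetical and independent of the crate's actual types; a real version would likely reach for tinyvec or similar rather than hand-rolling it.

use std::collections::HashMap;

// A hypothetical inline gram key: up to 4 chars stored directly in the
// value (no heap allocation), usable as a HashMap key because it derives
// Hash/Eq.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Default)]
struct Gram {
    chars: [char; 4],
    len: u8,
}

impl Gram {
    fn new(window: &[char]) -> Self {
        let mut g = Gram::default();
        for (i, &c) in window.iter().take(4).enumerate() {
            g.chars[i] = c;
        }
        g.len = window.len().min(4) as u8;
        g
    }
}

fn main() {
    // Count bigrams of "tomato" without allocating a String per gram.
    let chars: Vec<char> = "tomato".chars().collect();
    let mut counts: HashMap<Gram, usize> = HashMap::new();
    for w in chars.windows(2) {
        *counts.entry(Gram::new(w)).or_default() += 1;
    }
    println!("{} distinct grams", counts.len());
}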