# textprep

Text preprocessing primitives: normalization, tokenization, n-grams, string similarity, stopwords, and fast keyword matching.

```toml
[dependencies]
textprep = "0.1.4"
```
## Normalization

`scrub` normalizes text to a canonical form for indexing and comparison: NFC normalization, case folding, and diacritics stripping.

```rust
use textprep::scrub;

assert_eq!(scrub("Héllo WORLD"), "hello world");
assert_eq!(scrub("Cafe\u{301}"), "cafe"); // combining accent stripped
```
For search pipelines that need stricter normalization (NFKC, bidi control removal, zero-width stripping), use `ScrubConfig`:

```rust
use textprep::{scrub_with, ScrubConfig};

let cfg = ScrubConfig::search_key();
let key = scrub_with("ﬁle\u{200B}  NAME", &cfg);
// NFKC + lowercase + collapsed whitespace
```
## Tokenization

Split text into words or sentences, with character offsets:

```rust
use textprep::{words, sentences, tokenize_with_offsets};

let w = words("The quick brown fox.");
assert_eq!(w, ["The", "quick", "brown", "fox"]);

let s = sentences("First sentence. Second one!");
assert_eq!(s.len(), 2);

// With character offsets (not byte offsets)
let tokens = tokenize_with_offsets("héllo world");
assert_eq!(tokens[0].text, "héllo");
assert_eq!(tokens[0].start, 0);
assert_eq!(tokens[0].end, 5);
```
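Because tokens carry character offsets while Rust `&str` slicing is byte-indexed, mapping one to the other comes up often. A minimal stdlib helper (independent of textprep; `char_to_byte_offset` is an illustrative name, not part of the crate) sketches the conversion:

```rust
/// Convert a character offset into a byte offset, so a character-based
/// span can be used to slice the original `&str`.
fn char_to_byte_offset(s: &str, char_off: usize) -> usize {
    s.char_indices()
        .nth(char_off)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len()) // past-the-end offsets clamp to the string's byte length
}

fn main() {
    let text = "héllo world";
    // "héllo" spans characters 0..5, but occupies 6 bytes (é is 2 bytes in UTF-8).
    let slice = &text[char_to_byte_offset(text, 0)..char_to_byte_offset(text, 5)];
    assert_eq!(slice, "héllo");
}
```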
## Fast keyword matching

`FlashText` provides linear-time multi-pattern keyword search (Aho-Corasick based):

```rust
use textprep::FlashText;

let mut ft = FlashText::new();
ft.add_keyword("rust");
ft.add_keyword("python");

let matches = ft.find("i like rust and python");
assert_eq!(matches.len(), 2);
// matches[0].start/end are character offsets
```
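For intuition about the linear-time claim: the naive alternative rescans the text once per keyword, so its cost grows with the number of patterns, while an Aho-Corasick automaton scans the text once regardless of how many keywords are registered. A stdlib sketch of the naive approach for contrast (not textprep's implementation; note it reports byte offsets, unlike FlashText's character offsets):

```rust
/// Naive multi-keyword search: one pass over the text per keyword,
/// i.e. O(patterns × text). Returns (keyword, byte_offset) pairs.
fn naive_find<'a>(text: &str, keywords: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut hits = Vec::new();
    for kw in keywords {
        for (byte_start, _) in text.match_indices(kw) {
            hits.push((*kw, byte_start));
        }
    }
    hits.sort_by_key(|&(_, start)| start); // order hits by position in the text
    hits
}

fn main() {
    let hits = naive_find("rust and python and rust", &["rust", "python"]);
    assert_eq!(hits, [("rust", 0), ("python", 9), ("rust", 20)]);
}
```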
## N-grams

Character-level and word-level n-gram generation:

```rust
use textprep::{char_ngrams, word_ngrams};

let cg = char_ngrams("hello", 3);
// ["hel", "ell", "llo"]

let words = vec!["the", "quick", "brown", "fox"];
let wg = word_ngrams(&words, 2);
// ["the quick", "quick brown", "brown fox"]
```
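Conceptually, character n-grams are a sliding window over characters, not bytes, which keeps multi-byte UTF-8 characters intact. A self-contained sketch of the idea (`char_ngrams_sketch` is an illustrative stand-in, not the crate's function):

```rust
/// Sliding-window character n-grams over Unicode scalar values.
/// Windowing over a Vec<char> rather than the byte string avoids
/// splitting multi-byte characters.
fn char_ngrams_sketch(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(char_ngrams_sketch("hello", 3), ["hel", "ell", "llo"]);
    // Multi-byte safe: "café" is 4 chars but 5 bytes.
    assert_eq!(char_ngrams_sketch("café", 2), ["ca", "af", "fé"]);
}
```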
## String similarity

Jaccard similarity at word and character-ngram levels:

```rust
use textprep::{word_jaccard, trigram_jaccard};

let sim = word_jaccard("the quick fox", "fox the quick");
assert!(sim == 1.0); // same words, different order

let sim = trigram_jaccard("kitten", "sitten");
assert!(sim > 0.5);
```
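Jaccard similarity is simply |intersection| / |union| over the two token sets, which is why word order does not affect the score. A stdlib sketch of the word-level variant (`word_jaccard_sketch` is illustrative, not the crate's function):

```rust
use std::collections::HashSet;

/// Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-split words.
fn word_jaccard_sketch(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

fn main() {
    // Same word set, different order: similarity is 1.0.
    assert_eq!(word_jaccard_sketch("the quick fox", "fox the quick"), 1.0);
    // Partial overlap: {quick, fox} ∩ {lazy, fox} = {fox}; the union has 3 words.
    assert!((word_jaccard_sketch("quick fox", "lazy fox") - 1.0 / 3.0).abs() < 1e-9);
}
```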
## Stopwords

Built-in English stopword list, plus loadable lists for other languages:

```rust
use textprep::is_english_stopword;

assert!(is_english_stopword("the"));
assert!(!is_english_stopword("fox"));
```
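The usual next step is filtering a token stream against the list. A minimal stdlib sketch of that pattern, with a tiny hardcoded list standing in for the crate's full English one (`remove_stopwords` is an illustrative helper, not part of the crate):

```rust
/// Drop tokens that appear in a stopword list, preserving order.
fn remove_stopwords<'a>(tokens: &[&'a str], stops: &[&str]) -> Vec<&'a str> {
    tokens.iter().copied().filter(|t| !stops.contains(t)).collect()
}

fn main() {
    let stops = ["the", "a", "of"]; // tiny illustrative list
    let kept = remove_stopwords(&["the", "quick", "fox"], &stops);
    assert_eq!(kept, ["quick", "fox"]);
}
```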
## Unicode utilities

Direct access to normalization forms and text cleaning:

```rust
use textprep::{nfkc, fold};
use textprep::strip_diacritics;
use textprep::decode_entities;

let normalized = nfkc("ﬁ"); // "fi" (compatibility decomposition)
let lowered = fold("Straße"); // "straße"
let plain = strip_diacritics("café"); // "cafe"
let decoded = decode_entities("&amp; &lt;"); // "& <"
```
## Feature flags

| Feature | What it adds |
|---|---|
| `casefold` | Full Unicode NFKC_Casefold (e.g. sharp-s to "ss") |
| `serde` | Serialize/deserialize for `Token`, `KeywordMatch`, `ScrubConfig` |
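Optional features are enabled through the usual Cargo syntax; a dependency line with both flags on might look like:

```toml
[dependencies]
textprep = { version = "0.1.4", features = ["casefold", "serde"] }
```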
## License

MIT OR Apache-2.0