# textprep

Text preprocessing primitives: normalization, tokenization, n-grams, string similarity, stopwords, and fast keyword matching.

```toml
[dependencies]
textprep = "0.1.4"
```
## Normalization

`scrub` normalizes text to a canonical form for indexing and comparison: NFC normalization, case folding, and diacritics stripping.

```rust
use textprep::scrub;

assert_eq!(scrub("Héllo WORLD"), "hello world");
assert_eq!(scrub("Cafe\u{301}"), "cafe"); // combining accent stripped
```
For search pipelines that need stricter normalization (NFKC, bidi control removal, zero-width stripping), use `ScrubConfig`:

```rust
use textprep::{scrub_with, ScrubConfig};

let cfg = ScrubConfig::search_key();
let key = scrub_with("ﬁle\u{200B}  NAME", &cfg);
// NFKC + lowercase + collapsed whitespace
```
## Tokenization

Split text into words or sentences, with character offsets:

```rust
use textprep::{words, sentences, tokenize_with_offsets};

let w = words("The quick brown fox.");
assert_eq!(w, ["The", "quick", "brown", "fox"]);

let s = sentences("First sentence. Second one!");
assert_eq!(s.len(), 2);

// With character offsets (not byte offsets)
let tokens = tokenize_with_offsets("héllo world");
assert_eq!(tokens[0].text, "héllo");
assert_eq!(tokens[0].start, 0);
assert_eq!(tokens[0].end, 5);
```
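Because tokens carry character offsets while Rust `&str` slicing is byte-indexed, mapping one to the other comes up often. A minimal stdlib helper (independent of textprep; `char_to_byte_offset` is an illustrative name, not part of the crate) sketches the conversion:

```rust
/// Convert a character offset into a byte offset, so a character-based
/// span can be used to slice the original `&str`.
fn char_to_byte_offset(s: &str, char_off: usize) -> usize {
    s.char_indices()
        .nth(char_off)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len()) // past-the-end offsets clamp to the string's byte length
}

fn main() {
    let text = "héllo world";
    // "héllo" spans characters 0..5, but occupies 6 bytes (é is 2 bytes in UTF-8).
    let slice = &text[char_to_byte_offset(text, 0)..char_to_byte_offset(text, 5)];
    assert_eq!(slice, "héllo");
}
```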
## Fast keyword matching

`FlashText` provides linear-time multi-pattern keyword search (Aho-Corasick based):

```rust
use textprep::FlashText;

let mut ft = FlashText::new();
ft.add_keyword("rust");
ft.add_keyword("python");

let matches = ft.find("i like rust and python");
assert_eq!(matches.len(), 2);
// matches[0].start/end are character offsets
```
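For intuition about the linear-time claim: the naive alternative rescans the text once per keyword, so its cost grows with the number of patterns, while an Aho-Corasick automaton scans the text once regardless of how many keywords are registered. A stdlib sketch of the naive approach for contrast (not textprep's implementation; note it reports byte offsets, unlike FlashText's character offsets):

```rust
/// Naive multi-keyword search: one pass over the text per keyword,
/// i.e. O(patterns × text). Returns (keyword, byte_offset) pairs.
fn naive_find<'a>(text: &str, keywords: &[&'a str]) -> Vec<(&'a str, usize)> {
    let mut hits = Vec::new();
    for kw in keywords {
        for (byte_start, _) in text.match_indices(kw) {
            hits.push((*kw, byte_start));
        }
    }
    hits.sort_by_key(|&(_, start)| start); // order hits by position in the text
    hits
}

fn main() {
    let hits = naive_find("rust and python and rust", &["rust", "python"]);
    assert_eq!(hits, [("rust", 0), ("python", 9), ("rust", 20)]);
}
```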
## N-grams

Character-level and word-level n-gram generation:

```rust
use textprep::{char_ngrams, word_ngrams};

let cg = char_ngrams("hello", 3);
// ["hel", "ell", "llo"]

let words = vec!["the", "quick", "brown", "fox"];
let wg = word_ngrams(&words, 2);
// ["the quick", "quick brown", "brown fox"]
```
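Conceptually, character n-grams are a sliding window over characters, not bytes, which keeps multi-byte UTF-8 characters intact. A self-contained sketch of the idea (`char_ngrams_sketch` is an illustrative stand-in, not the crate's function):

```rust
/// Sliding-window character n-grams over Unicode scalar values.
/// Windowing over a Vec<char> rather than the byte string avoids
/// splitting multi-byte characters.
fn char_ngrams_sketch(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(char_ngrams_sketch("hello", 3), ["hel", "ell", "llo"]);
    // Multi-byte safe: "café" is 4 chars but 5 bytes.
    assert_eq!(char_ngrams_sketch("café", 2), ["ca", "af", "fé"]);
}
```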
## String similarity

Jaccard similarity at word and character-ngram levels:

```rust
use textprep::{word_jaccard, trigram_jaccard};

let sim = word_jaccard("the quick fox", "fox the quick");
assert!(sim == 1.0); // same words, different order

let sim = trigram_jaccard("kitten", "sitten");
assert!(sim > 0.5);
```
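Jaccard similarity is simply |intersection| / |union| over the two token sets, which is why word order does not affect the score. A stdlib sketch of the word-level variant (`word_jaccard_sketch` is illustrative, not the crate's function):

```rust
use std::collections::HashSet;

/// Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-split words.
fn word_jaccard_sketch(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

fn main() {
    // Same word set, different order: similarity is 1.0.
    assert_eq!(word_jaccard_sketch("the quick fox", "fox the quick"), 1.0);
    // Partial overlap: {quick, fox} ∩ {lazy, fox} = {fox}; the union has 3 words.
    assert!((word_jaccard_sketch("quick fox", "lazy fox") - 1.0 / 3.0).abs() < 1e-9);
}
```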
## Stopwords

Built-in English stopword list, plus loadable lists for other languages:

```rust
use textprep::is_english_stopword;

assert!(is_english_stopword("the"));
assert!(!is_english_stopword("fox"));
```
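The usual next step is filtering a token stream against the list. A minimal stdlib sketch of that pattern, with a tiny hardcoded list standing in for the crate's full English one (`remove_stopwords` is an illustrative helper, not part of the crate):

```rust
/// Drop tokens that appear in a stopword list, preserving order.
fn remove_stopwords<'a>(tokens: &[&'a str], stops: &[&str]) -> Vec<&'a str> {
    tokens.iter().copied().filter(|t| !stops.contains(t)).collect()
}

fn main() {
    let stops = ["the", "a", "of"]; // tiny illustrative list
    let kept = remove_stopwords(&["the", "quick", "fox"], &stops);
    assert_eq!(kept, ["quick", "fox"]);
}
```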
## Unicode utilities

Direct access to normalization forms and text cleaning:

```rust
use textprep::{nfkc, fold};
use textprep::strip_diacritics;
use textprep::decode_entities;

let normalized = nfkc("ﬁ"); // "fi" (compatibility decomposition)
let lowered = fold("Straße"); // "straße"
let plain = strip_diacritics("café"); // "cafe"
let decoded = decode_entities("&amp; &lt;"); // "& <"
```
## Feature flags

| Feature | What it adds |
|---|---|
| `casefold` | Full Unicode NFKC_Casefold (e.g. sharp-s to "ss") |
| `serde` | Serialize/deserialize for `Token`, `KeywordMatch`, `ScrubConfig` |
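Optional features are enabled through the usual Cargo syntax; a dependency line with both flags on might look like:

```toml
[dependencies]
textprep = { version = "0.1.4", features = ["casefold", "serde"] }
```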
## License

MIT OR Apache-2.0