textprep
Text preprocessing primitives: normalization, tokenization, and fast keyword matching.
Contract
-
Invariants (must never change):
- Normalization:
scrubdefaults to NFC normalization + lower case. - Offsets:
TokenandKeywordMatchreturn character offsets (usize) into the original input string.- Note: these are not byte offsets. For slicing, convert via
.chars()(or build a byte-index map if you need fast repeated slicing).
- Note: these are not byte offsets. For slicing, convert via
- No panic on Unicode: All functions must handle invalid UTF-8 gracefully.
- Normalization:
-
Support / Dependencies:
- Unicode: Relies on
unicode-normalizationandunicode-segmentation. - Keyword Matching: Uses Aho-Corasick (
FlashTextequivalent) for linear-time multi-pattern search.
- Unicode: Relies on
Usage
use ;
// Normalization
let raw = "Héllö World!";
let key = scrub; // "hello world!"
// Fast Keyword Matching
let mut ft = new;
ft.add_keyword;
let text = "I live in the Big Apple.";
let found = ft.find;
// found[0].start/end are CHAR offsets (not byte offsets)
assert_eq!;