textprep

Text preprocessing primitives: normalization, tokenization, and fast keyword matching.

Contract

Invariants (must never change):
- Normalization: scrub defaults to NFC normalization + lower case.
- Offsets: Token and KeywordMatch return character offsets (usize) into the original input string.
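  - Note: these are not byte offsets, so they cannot be used directly in a &str slice such as &text[start..end]. For one-off slicing, iterate with .chars() (skip/take) or convert to byte offsets with .char_indices(); build a byte-index map if you need fast repeated slicing.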
- No panic on Unicode: All functions must handle invalid UTF-8 gracefully.
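
The offset note above can be made concrete with a small std-only sketch. The helper names char_range_to_bytes and byte_index_map are hypothetical and not part of textprep; they only illustrate how a half-open range of char offsets might be turned into a byte range that is safe to slice with.

```rust
/// Convert a half-open range of CHAR offsets into the matching BYTE range of `s`.
/// Returns None if the range is reversed or runs past the end of the string.
/// (Hypothetical helper for illustration; not part of textprep.)
fn char_range_to_bytes(s: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte position of every char boundary, plus the end of the string.
    let mut boundaries = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()));
    let byte_start = boundaries.nth(start)?;
    let byte_end = if end == start {
        byte_start
    } else {
        // `nth` is relative to what has already been consumed.
        boundaries.nth(end - start - 1)?
    };
    Some((byte_start, byte_end))
}

/// Precompute the byte position of each char offset for fast repeated slicing.
/// `map[i]` is the byte offset of char `i`; the last entry is `s.len()`.
fn byte_index_map(s: &str) -> Vec<usize> {
    s.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .collect()
}

fn main() {
    let text = "Héllö, Big Apple!";

    // "Big Apple" spans chars 7..16, but bytes 9..18 (é and ö take 2 bytes each).
    let (b0, b1) = char_range_to_bytes(text, 7, 16).expect("range in bounds");
    assert_eq!((b0, b1), (9, 18));
    assert_eq!(&text[b0..b1], "Big Apple");

    // Same lookup through a reusable map: O(n) to build, O(1) per slice.
    let map = byte_index_map(text);
    assert_eq!(&text[map[7]..map[16]], "Big Apple");
}
```

The one-shot helper walks the string on every call, so the map is the better fit when many matches from the same input need slicing.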

Support / Dependencies:
- Unicode: Relies on unicode-normalization and unicode-segmentation.
- Keyword Matching: Uses Aho-Corasick (FlashText equivalent) for linear-time multi-pattern search.
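
To show the shape of keyword matching with char offsets, here is a simplified std-only sketch. It is not the crate's implementation: it uses a plain trie without Aho-Corasick failure links, so it restarts at each position and has no linear-time guarantee. TrieMatcher, Node, and Match are hypothetical names, and the leftmost-longest, non-overlapping policy with no word-boundary check is an assumption made for the example.

```rust
use std::collections::HashMap;

/// One trie node; `value` is set when a keyword ends at this node.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    value: Option<String>,
}

/// A match with half-open CHAR offsets into the searched text.
/// (Hypothetical type; field names mirror the usage example only.)
#[derive(Debug, PartialEq)]
struct Match {
    start: usize,
    end: usize,
    value: String,
}

/// Simplified trie matcher: no failure links, so NOT true Aho-Corasick.
#[derive(Default)]
struct TrieMatcher {
    root: Node,
}

impl TrieMatcher {
    /// Register `keyword` and the value to report when it is found.
    fn add_keyword(&mut self, keyword: &str, value: &str) {
        let mut node = &mut self.root;
        for c in keyword.chars() {
            node = node.children.entry(c).or_default();
        }
        node.value = Some(value.to_string());
    }

    /// Leftmost-longest, non-overlapping scan (an assumed policy).
    fn find(&self, text: &str) -> Vec<Match> {
        let chars: Vec<char> = text.chars().collect();
        let mut out = Vec::new();
        let mut i = 0;
        while i < chars.len() {
            let mut node = &self.root;
            let mut best: Option<(usize, &String)> = None;
            let mut j = i;
            // Walk the trie from position i, remembering the longest keyword seen.
            while j < chars.len() {
                match node.children.get(&chars[j]) {
                    Some(next) => {
                        node = next;
                        j += 1;
                        if let Some(v) = &node.value {
                            best = Some((j, v));
                        }
                    }
                    None => break,
                }
            }
            match best {
                Some((end, v)) => {
                    out.push(Match { start: i, end, value: v.clone() });
                    i = end; // skip past the match so results do not overlap
                }
                None => i += 1,
            }
        }
        out
    }
}

fn main() {
    let mut m = TrieMatcher::default();
    m.add_keyword("Big Apple", "New York");

    let found = m.find("I live in the Big Apple.");
    assert_eq!(found.len(), 1);
    assert_eq!((found[0].start, found[0].end), (14, 23));
    assert_eq!(found[0].value, "New York");
}
```

A real Aho-Corasick automaton adds failure links so the scan never backs up in the input, which is where the linear-time bound comes from.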

Usage

use textprep::{scrub, FlashText};

// Normalization
let raw = "Héllö World!";
let key = scrub(raw); // "hello world!"

// Fast Keyword Matching
let mut ft = FlashText::new();
ft.add_keyword("Big Apple", "New York");

let text = "I live in the Big Apple.";
let found = ft.find(text);
// found[0].start/end are CHAR offsets (not byte offsets)
assert_eq!(found[0].value, "New York");

textprep 0.1.1
