textprep 0.1.1


textprep

Text preprocessing primitives: normalization, tokenization, and fast keyword matching.

Contract

  • Invariants (must never change):

    • Normalization: scrub defaults to NFC normalization + lowercasing.
    • Offsets: Token and KeywordMatch return character offsets (usize) into the original input string.
      • Note: these are not byte offsets. For slicing, convert via .chars() (or build a byte-index map if you need fast repeated slicing).
    • No panic on Unicode: All functions must handle any Unicode input (combining marks, non-BMP code points, mixed scripts) without panicking. (Rust's &str is always valid UTF-8, so "invalid UTF-8" cannot reach these APIs.)
  • Support / Dependencies:

    • Unicode: Relies on unicode-normalization and unicode-segmentation.
    • Keyword Matching: Uses Aho-Corasick (FlashText equivalent) for linear-time multi-pattern search.
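Since the contract above returns character offsets, slicing the original &str requires a conversion to byte offsets first. A minimal sketch of that conversion using only the standard library (the function name and the sample offsets are illustrative, not part of the crate's API):

```rust
// Convert a character-offset range (as Token / KeywordMatch report)
// into a byte-offset range usable for slicing the original &str.
fn char_range_to_byte_range(s: &str, start: usize, end: usize) -> (usize, usize) {
    let byte_at = |char_idx: usize| {
        s.char_indices()
            .nth(char_idx)
            .map(|(byte_idx, _)| byte_idx)
            .unwrap_or(s.len()) // past-the-end offsets map to the string's byte length
    };
    (byte_at(start), byte_at(end))
}

fn main() {
    let text = "Héllö World";
    // "World" spans chars 6..11, but starts at byte 8 because
    // 'é' and 'ö' each occupy two bytes in UTF-8.
    let (b0, b1) = char_range_to_byte_range(text, 6, 11);
    assert_eq!(&text[b0..b1], "World");
}
```

Note that `nth` walks the string from the start, so this is O(n) per lookup; for many slices over the same input, precompute a map once (see below in Usage).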

Usage

use textprep::{scrub, FlashText};

// Normalization
let raw = "Héllö World!";
let key = scrub(raw); // "héllö world!" (NFC + lowercase; diacritics are preserved)

// Fast Keyword Matching
let mut ft = FlashText::new();
ft.add_keyword("Big Apple", "New York");

let text = "I live in the Big Apple.";
let found = ft.find(text);
// found[0].start/end are CHAR offsets (not byte offsets)
assert_eq!(found[0].value, "New York");
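For repeated slicing against match offsets, the byte-index map mentioned in the Contract section can be built once per input. A sketch using only the standard library (the helper name and the hard-coded offsets stand in for real match results):

```rust
// Precompute char-offset -> byte-offset lookup for one input string.
// Index i holds the byte position where character i starts; a final
// sentinel entry (s.len()) lets `end` offsets be looked up too.
fn build_byte_index(s: &str) -> Vec<usize> {
    let mut map: Vec<usize> = s.char_indices().map(|(b, _)| b).collect();
    map.push(s.len());
    map
}

fn main() {
    let text = "I live in the Big Apple.";
    let idx = build_byte_index(text);
    // e.g. a match reported at char offsets 14..23 ("Big Apple")
    let (start, end) = (14, 23);
    assert_eq!(&text[idx[start]..idx[end]], "Big Apple");
}
```

Building the map is O(n) once; each subsequent slice is then O(1) instead of re-walking the string per match.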