# textprep
Text preprocessing primitives: normalization, tokenization, and fast keyword matching.
## Contract
- **Invariants (must never change)**:
- **Normalization**: `scrub` defaults to **NFC** normalization plus **lowercasing**.
- **Offsets**: `Token` and `KeywordMatch` return **character offsets** (usize) into the *original* input string.
- **Note**: these are **not byte offsets**. For slicing, convert via `.chars()` (or build a byte-index map if you need fast repeated slicing).
- **No panics on Unicode**: All functions must handle any valid Unicode input (combining marks, multi-byte characters, empty strings) without panicking. Rust `&str` already guarantees valid UTF-8, so the risk is panicking on edge cases, not on invalid bytes.
- **Support / Dependencies**:
- **Unicode**: Relies on `unicode-normalization` and `unicode-segmentation`.
- **Keyword Matching**: `FlashText` exposes a FlashText-style API backed by Aho-Corasick for linear-time multi-pattern search.
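Because the contract returns character offsets, slicing the original string requires a char-to-byte conversion first. A minimal std-only sketch (the `char_to_byte_offset` helper is illustrative, not part of the textprep API):

```rust
/// Convert a character offset into a byte offset so the original
/// string can be sliced. Std-only sketch; textprep is not required.
fn char_to_byte_offset(s: &str, char_offset: usize) -> usize {
    s.char_indices()
        .nth(char_offset)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len()) // an offset past the end maps to the end
}

fn main() {
    let text = "Héllö World!";
    // 'W' is at char offset 6 but byte offset 8, because 'é' and 'ö'
    // each occupy two bytes in UTF-8.
    let byte = char_to_byte_offset(text, 6);
    assert_eq!(byte, 8);
    assert_eq!(&text[byte..], "World!");
}
```

`char_indices` walks the string once, so this is O(n) per lookup; for repeated slicing, precompute a byte-index map instead (see the Usage section).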
## Usage
```rust
use textprep::{scrub, FlashText};
// Normalization
let raw = "Héllö World!";
let key = scrub(raw); // "héllö world!" (NFC + lowercase; accents are preserved)
// Fast Keyword Matching
let mut ft = FlashText::new();
ft.add_keyword("Big Apple", "New York");
let text = "I live in the Big Apple.";
let found = ft.find(text);
// found[0].start/end are CHAR offsets (not byte offsets)
assert_eq!(found[0].value, "New York");
```
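For fast repeated slicing by character offsets, the contract suggests a byte-index map. One way to sketch it with std only (the `build_byte_index_map` helper is illustrative, not part of the textprep API):

```rust
/// Precompute a char-offset -> byte-offset table so each subsequent
/// slice by character offsets is O(1). Std-only sketch of the
/// "byte-index map" mentioned in the contract.
fn build_byte_index_map(s: &str) -> Vec<usize> {
    let mut map: Vec<usize> = s.char_indices().map(|(b, _)| b).collect();
    map.push(s.len()); // sentinel: exclusive `end` offsets can index one past the last char
    map
}

fn main() {
    let text = "Héllö World!";
    let map = build_byte_index_map(text);
    // Char offsets 6..12 cover "World!", but the bytes start at 8
    // because 'é' and 'ö' are two bytes each in UTF-8.
    assert_eq!(&text[map[6]..map[12]], "World!");
}
```

Building the map is O(n) once; after that, converting any `start`/`end` pair from a `Token` or `KeywordMatch` into a byte range is two array lookups.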