UCFP canonical text layer.
This module normalizes text into a deterministic, versioned format. Downstream stages (perceptual, semantic, index) can rely on this for stable identity.
What we do
- Unicode normalization (NFKC by default, configurable)
- Casing and punctuation handling (lowercase, optional stripping)
- Whitespace normalization (collapses to single spaces)
- Tokenization with byte offsets for downstream accuracy
- Versioned hashes so you can tell which canonicalization was used
Pure function guarantee
No I/O, no clock calls, no OS/locale dependence. Give us the same text and config, you get the same result on any machine.
Invariants worth knowing
- Input should be trusted UTF-8 (usually from ingest stage)
- We don't re-validate ingest constraints here
- Output depends only on text + config
- Hash = SHA-256(version || 0x00 || canonical_text)
Bottom line: same input + same config = same output forever.