content-canonical 0.1.0

UCFP canonical text layer.

This module normalizes text into a deterministic, versioned format. Downstream stages (perceptual, semantic, index) can rely on this for stable identity.

What we do

Unicode normalization (NFKC by default, configurable)
Casing and punctuation handling (lowercase, optional stripping)
Whitespace normalization (collapses to single spaces)
Tokenization with byte offsets for downstream accuracy
Versioned hashes so you can tell which canonicalization was used

Pure function guarantee

No I/O, no clock calls, no OS/locale dependence. Give us the same text and config, you get the same result on any machine.

Invariants worth knowing

Input should be trusted UTF-8 (usually from ingest stage)
We don't re-validate ingest constraints here
Output depends only on text + config
Hash = SHA-256(version || 0x00 || canonical_text)

Bottom line: same input + same config = same output forever.