UCFP canonical text layer.
This module normalizes text into a deterministic, versioned format. Downstream stages (perceptual, semantic, index) can rely on this for stable identity.
§What we do
- Unicode normalization (NFKC by default, configurable)
- Casing and punctuation handling (lowercase, optional stripping)
- Whitespace normalization (collapses to single spaces)
- Tokenization with byte offsets for downstream accuracy
- Versioned hashes so you can tell which canonicalization was used
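The casing and whitespace steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the names `collapse_whitespace_sketch` and `canonicalize_sketch` are hypothetical, NFKC normalization is left out because `std` does not provide it, and punctuation stripping is omitted.

```rust
/// Hypothetical helper (not the crate's `collapse_whitespace`):
/// trims both edges and replaces every run of Unicode whitespace,
/// including newlines and tabs, with a single ASCII space.
fn collapse_whitespace_sketch(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical helper: lowercase first, then collapse whitespace.
/// `str::to_lowercase` uses Unicode case mappings and does not consult
/// the OS locale, which fits the "no locale dependence" guarantee.
/// Assumption: the real pipeline also applies NFKC before this step.
fn canonicalize_sketch(input: &str) -> String {
    collapse_whitespace_sketch(&input.to_lowercase())
}
```

Under these assumptions, `canonicalize_sketch("  Hello \n\t World  ")` yields `"hello world"`.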
§Pure function guarantee
No I/O, no clock calls, and no OS or locale dependence. Given the same text and config, you get the same result on any machine.
§Invariants worth knowing
- Input should be trusted UTF-8 (usually from the ingest stage)
- We don’t re-validate ingest constraints here
- Output depends only on text + config
- Hash = SHA-256(version || 0x00 || canonical_text)
Bottom line: same input + same config = same output forever.
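The hash invariant above fixes the byte layout that goes into SHA-256. A minimal sketch of that preimage follows, assuming the version is serialized as UTF-8 text (the crate may encode it differently). The name `hash_preimage_sketch` is hypothetical, and the SHA-256 step itself is omitted because it needs a crate outside `std`.

```rust
/// Hypothetical helper: builds the bytes that would be fed to SHA-256,
/// laid out as version || 0x00 || canonical_text.
/// Assumption: `version` is a UTF-8 string that contains no NUL byte.
fn hash_preimage_sketch(version: &str, canonical_text: &str) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(version.len() + 1 + canonical_text.len());
    bytes.extend_from_slice(version.as_bytes());
    bytes.push(0x00); // separator between version and text
    bytes.extend_from_slice(canonical_text.as_bytes());
    bytes
}
```

The `0x00` separator keeps the boundary between version and text unambiguous: without it, version `"1"` with text `"2x"` and version `"12"` with text `"x"` would produce the same bytes and therefore the same hash.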
Structs§
- CanonicalizeConfig - Configuration for the canonical text pipeline.
- CanonicalizedDocument - Canonical representation of a text document.
- Token - A token with its UTF-8 byte offsets in the canonical text.
Enums§
- CanonicalError - Errors that can occur during canonicalization.
Functions§
- canonicalize - Main entry point. Takes an input payload and config and returns a canonicalized document.
- collapse_whitespace - Collapses repeated whitespace, trims edges, and normalizes newlines to single spaces.
- hash_canonical_bytes - Compute the canonical identity hash for canonical text and version.
- hash_text - Hash arbitrary text bytes with SHA-256 and return a hex digest.
- tokenize - Tokenizes canonical text and produces byte offsets.
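To illustrate what byte offsets mean for tokens, here is a minimal sketch that splits on whitespace and records each token's UTF-8 byte range. `TokenSketch` and `tokenize_sketch` are hypothetical names. The crate's actual `Token` fields and token boundary rules are not shown on this page, so whitespace splitting and a half-open `start..end` range are assumptions.

```rust
/// Hypothetical stand-in for the crate's `Token`.
#[derive(Debug, PartialEq)]
struct TokenSketch {
    text: String,
    start: usize, // inclusive byte offset into the canonical text
    end: usize,   // exclusive byte offset into the canonical text
}

/// Hypothetical helper: splits on whitespace and keeps byte offsets.
/// `char_indices` yields byte positions, so multi-byte characters
/// advance the offset by their full UTF-8 width.
fn tokenize_sketch(canonical: &str) -> Vec<TokenSketch> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in canonical.char_indices() {
        if ch.is_whitespace() {
            // A whitespace character closes the token in progress, if any.
            if let Some(s) = start.take() {
                tokens.push(TokenSketch { text: canonical[s..i].to_string(), start: s, end: i });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a token that runs to the end of the input.
    if let Some(s) = start {
        tokens.push(TokenSketch { text: canonical[s..].to_string(), start: s, end: canonical.len() });
    }
    tokens
}
```

Because the offsets are byte positions, slicing the canonical text with `&canonical[token.start..token.end]` recovers the token exactly, even when earlier tokens contain multi-byte characters.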