Crate canonical

Expand description

UCFP canonical text layer.

This module normalizes text into a deterministic, versioned format. Downstream stages (perceptual, semantic, index) can rely on this for stable identity.

§What we do

Unicode normalization (NFKC by default, configurable)
Casing and punctuation handling (lowercase, optional stripping)
Whitespace normalization (collapses to single spaces)
Tokenization with byte offsets for downstream accuracy
Versioned hashes so you can tell which canonicalization was used

§Pure function guarantee

No I/O, no clock calls, no OS/locale dependence. Give us the same text and config, you get the same result on any machine.

§Invariants worth knowing

Input should be trusted UTF-8 (usually from ingest stage)
We don’t re-validate ingest constraints here
Output depends only on text + config
Hash = SHA-256(version || 0x00 || canonical_text)

Bottom line: same input + same config = same output forever.

Structs§

CanonicalizeConfig: Configuration for the canonical text pipeline.
CanonicalizedDocument: Canonical representation of a text document.
Token: A token with its UTF-8 byte offsets in the canonical text.

Enums§

CanonicalError: Errors that can occur during canonicalization.

Functions§

canonicalize: Main entry point. Takes an input payload and config and returns a canonicalized document.
collapse_whitespace: Collapses repeated whitespace, trims edges, and normalizes newlines to single spaces.
hash_canonical_bytes: Compute the canonical identity hash for canonical text and version.
hash_text: Hash arbitrary text bytes with SHA-256 and return a hex digest.
tokenize: Tokenizes canonical text and produces byte offsets.

Crate canonical

Crate canonical Copy item path

§What we do

§Pure function guarantee

§Invariants worth knowing

Structs§

Enums§

Functions§

Crate canonical