Skip to main content

Crate canonical

Crate canonical 

Source
Expand description

UCFP canonical text layer.

This module normalizes text into a deterministic, versioned format. Downstream stages (perceptual, semantic, index) can rely on this for stable identity.

§What we do

  • Unicode normalization (NFKC by default, configurable)
  • Casing and punctuation handling (lowercase, optional stripping)
  • Whitespace normalization (collapses to single spaces)
  • Tokenization with byte offsets for downstream accuracy
  • Versioned hashes so you can tell which canonicalization was used

§Pure function guarantee

No I/O, no clock calls, no OS/locale dependence. Give us the same text and config, you get the same result on any machine.

§Invariants worth knowing

  • Input should be trusted UTF-8 (usually from ingest stage)
  • We don’t re-validate ingest constraints here
  • Output depends only on text + config
  • Hash = SHA-256(version || 0x00 || canonical_text)

Bottom line: same input + same config = same output forever.

Structs§

CanonicalizeConfig
Configuration for the canonical text pipeline.
CanonicalizedDocument
Canonical representation of a text document.
Token
A token with its UTF-8 byte offsets in the canonical text.

Enums§

CanonicalError
Errors that can occur during canonicalization.

Functions§

canonicalize
Main entry point. Takes an input payload and config and returns a canonicalized document.
collapse_whitespace
Collapses repeated whitespace, trims edges, and normalizes newlines to single spaces.
hash_canonical_bytes
Compute the canonical identity hash for canonical text and version.
hash_text
Hash arbitrary text bytes with SHA-256 and return a hex digest.
tokenize
Tokenizes canonical text and produces byte offsets.