content-canonical 0.1.0

Content canonicalization and text normalization library

UCFP canonical text layer.

This module normalizes text into a deterministic, versioned canonical form. Downstream stages (perceptual, semantic, index) can rely on that form as a stable identity for the content.

What we do

  • Unicode normalization (NFKC by default, configurable)
  • Casing and punctuation handling (lowercase, optional stripping)
  • Whitespace normalization (collapses runs of whitespace to single spaces)
  • Tokenization with byte offsets for downstream accuracy
  • Versioned hashes so you can tell which canonicalization was used
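
The steps in this list can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not this library's API: `CanonicalConfig`, `canonicalize`, `tokenize`, and all field names are hypothetical, and the order of the steps is an assumption. Punctuation is taken to mean any code point in a Unicode "P" category, which may differ from what the library strips.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalConfig:
    # Hypothetical config; field names are illustrative only.
    normalization_form: str = "NFKC"
    lowercase: bool = True
    strip_punctuation: bool = False


def canonicalize(text: str, config: CanonicalConfig = CanonicalConfig()) -> str:
    # 1. Unicode normalization (NFKC folds compatibility forms such as ligatures).
    out = unicodedata.normalize(config.normalization_form, text)
    # 2. Casing. str.lower() does not consult the OS locale.
    if config.lowercase:
        out = out.lower()
    # 3. Optional punctuation stripping: drop code points in any "P*" category.
    if config.strip_punctuation:
        out = "".join(ch for ch in out if not unicodedata.category(ch).startswith("P"))
    # 4. Whitespace: split() with no argument splits on runs of whitespace
    #    and discards leading/trailing whitespace.
    return " ".join(out.split())


def tokenize(canonical: str) -> list[tuple[str, int, int]]:
    """Return (token, byte_start, byte_end) with offsets into the UTF-8 encoding."""
    tokens = []
    pos = 0
    for tok in canonical.split(" "):
        size = len(tok.encode("utf-8"))
        if tok:
            tokens.append((tok, pos, pos + size))
        pos += size + 1  # step over the single separating space
    return tokens


canonical = canonicalize("  \uff28\u00e9llo\t \ufb01sh  ")
print(canonical)            # héllo fish
print(tokenize(canonical))  # [('héllo', 0, 6), ('fish', 7, 11)]
```

The offsets are byte offsets, not character offsets: "é" occupies two bytes in UTF-8, so the first token ends at byte 6 even though it has five characters. Note that the output of `unicodedata.normalize` depends on the Unicode data version bundled with the Python runtime, so a sketch like this is only as stable as that version.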

Pure function guarantee

No I/O, no clock calls, no OS or locale dependence. Given the same text and config, you get the same result on any machine.

Invariants worth knowing

  • Input is expected to be trusted, valid UTF-8 (usually from the ingest stage)
  • We don't re-validate ingest constraints here
  • Output depends only on text + config
  • Hash = SHA-256(version || 0x00 || canonical_text), where || denotes byte concatenation
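
The hash invariant can be illustrated with a short sketch. It is written in Python with the standard `hashlib` module and is not this library's API: `canonical_hash` is a hypothetical name, and encoding the version as a UTF-8 string is an assumption, since the byte encoding of the version is not specified here.

```python
import hashlib


def canonical_hash(version: str, canonical_text: str) -> str:
    """SHA-256 over version || 0x00 || canonical_text, returned as hex."""
    h = hashlib.sha256()
    h.update(version.encode("utf-8"))         # assumed: version as UTF-8 bytes
    h.update(b"\x00")                         # separator byte
    h.update(canonical_text.encode("utf-8"))  # canonical text as UTF-8 bytes
    return h.hexdigest()


# Same text, different version: the hashes differ, so a hash records
# which canonicalization version produced it.
a = canonical_hash("0.1.0", "hello world")
b = canonical_hash("0.2.0", "hello world")
print(a != b)  # True

# The 0x00 separator keeps the version/text boundary unambiguous:
# without it, ("1", "0a") and ("10", "a") would hash the same bytes.
print(canonical_hash("1", "0a") != canonical_hash("10", "a"))  # True
```

Under these assumptions, the separator works because a version string would not itself contain a 0x00 byte, so the first 0x00 in the preimage always marks where the version ends.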

Bottom line: same input + same config = same output forever.