Text normalization, version norm-v1 (RFC-005 §9).
norm-v1 is deliberately small and exactly specified, because its output feeds content hashes and indexes — any change must come with a version bump:
- strip a UTF-8 BOM at the start of the document;
- normalize CRLF and lone CR to LF;
- remove control characters except \n and \t;
- trim trailing whitespace on each line.
Unicode NFC normalization is intentionally not part of norm-v1 (deferred to a future norm-v2 with RFC-014 language work); Japanese text passes through byte-identical apart from the rules above.
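The four rules above can be read as a small pipeline. The following is a minimal sketch under stated assumptions, not the crate's implementation: the name `normalize_sketch` is hypothetical, "control characters" is taken to mean Unicode general category Cc (what `char::is_control` tests), "trailing whitespace" is taken to mean Unicode White_Space (what `str::trim_end` removes), and the rules are assumed to apply in the order listed.

```rust
/// Hypothetical sketch of the norm-v1 rules. Not the crate's actual code:
/// the rule order and the exact character classes are assumptions.
pub fn normalize_sketch(input: &str) -> String {
    // Rule 1: strip a single UTF-8 BOM (U+FEFF) at the very start.
    let text = input.strip_prefix('\u{FEFF}').unwrap_or(input);

    // Rule 2: CRLF first, then any remaining lone CR, both become LF.
    let text = text.replace("\r\n", "\n").replace('\r', "\n");

    // Rules 3 and 4, applied line by line. Splitting on '\n' and joining
    // with '\n' preserves a final newline if the input had one.
    let lines: Vec<String> = text
        .split('\n')
        .map(|line| {
            // Rule 3: drop control characters, keeping tab. LF never
            // appears inside a line here, so it is preserved by the join.
            let kept: String = line
                .chars()
                .filter(|c| !c.is_control() || *c == '\t')
                .collect();
            // Rule 4: trim trailing whitespace (assumed: Unicode White_Space).
            kept.trim_end().to_string()
        })
        .collect();

    lines.join("\n")
}
```

Because no NFC step is applied, a decomposed sequence such as `e` followed by U+0301 comes out unchanged, which matches the byte-identical pass-through described above. If the real definition of trailing whitespace is narrower (for example, only space and tab), the `trim_end` call would need to be replaced with a matching `trim_end_matches` predicate.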
Constants
- NORMALIZATION_VERSION - Version constant recorded with every extraction. Text normalization stage version (RFC-005 §9).
Functions
- normalize_document - Apply norm-v1 to a whole document.