Module normalize

Text normalization, version norm-v1 (RFC-005 §9).

norm-v1 is deliberately small and exactly specified because its output feeds content hashes and indexes; any change to the rules must come with a version bump:

  1. strip a UTF-8 BOM at the start of the document;
  2. normalize CRLF and lone CR to LF;
  3. remove control characters except \n and \t;
  4. trim trailing whitespace on each line.

Unicode NFC normalization is intentionally not part of norm-v1 (deferred to a future norm-v2 with RFC-014 language work); Japanese text passes through byte-identical apart from the rules above.
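
The four rules above can be sketched as follows. This is a minimal illustration, not the crate's actual implementation: the function name `norm_v1_sketch` is hypothetical, and two details are assumptions that the description does not pin down. "Control characters" is taken to mean whatever `char::is_control` reports (Unicode category Cc), and "trailing whitespace" is taken to mean whatever `str::trim_end` strips (Unicode `White_Space`, which includes U+3000 at the end of a line).

```rust
/// Hypothetical sketch of the four norm-v1 rules, applied in the listed order.
/// Not the crate's real `normalize_document`; the name and signature are assumed.
fn norm_v1_sketch(input: &str) -> String {
    // 1. Strip a single UTF-8 BOM (U+FEFF) at the start of the document.
    let text = input.strip_prefix('\u{FEFF}').unwrap_or(input);

    // 2. Normalize CRLF first, then any remaining lone CR, to LF.
    let text = text.replace("\r\n", "\n").replace('\r', "\n");

    // 3. Remove control characters except \n and \t.
    //    Assumption: "control" means `char::is_control` (category Cc).
    let text: String = text
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();

    // 4. Trim trailing whitespace on each line.
    //    Assumption: "whitespace" means what `str::trim_end` strips.
    //    Splitting on '\n' (rather than `lines()`) keeps a final newline intact.
    text.split('\n')
        .map(|line| line.trim_end())
        .collect::<Vec<_>>()
        .join("\n")
}
```

Under these assumptions, `norm_v1_sketch("\u{FEFF}a  \r\nb\t\rc\u{0007}d \n")` yields `"a\nb\ncd\n"`. No NFC step is applied, so a decomposed sequence such as `"e\u{0301}"` comes back unchanged.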

Constants

NORMALIZATION_VERSION
Version of the text normalization stage (RFC-005 §9), recorded with every extraction.

Functions

normalize_document
Apply norm-v1 to a whole document.