Module normalization

Unicode normalization (UAX #15).

v3++ b-5: implements the four standard normalization forms (NFC, NFD, NFKC, NFKD) directly against the embedded UCD 14.0.0 tables in super::normalization_data. The implementation intentionally has no third-party dependencies, so that:

  • Both the tree-walk evaluator and the wasm-AOT backend share one dataset and one algorithm, avoiding silent drift between executors.
  • Bumping the Unicode version is a single regenerate-and-commit step (see tools/gen_normalization_tables.py).

The four entry points (to_nfd, to_nfkd, to_nfc, to_nfkc) all return owned Strings. Hangul syllables are decomposed / composed algorithmically per UAX #15 section 16 - keeping them in the data tables would cost ~88 KB for the syllable block alone with no performance gain.

§Algorithm sketch

  • NFD: decode each char -> full canonical decomposition (data table + Hangul algorithm; table entries are pre-flattened by the generator, so no runtime recursion is needed) -> canonical reorder (stable sort on CCC within each non-starter run) -> re-encode.
  • NFKD: same as NFD but using the compatibility table.
  • NFC: run NFD, then a single left-to-right composition pass that pairs each starter with subsequent characters via COMPOSITION_PAIRS plus the algorithmic Hangul composer.
  • NFKC: run NFKD, then the same composition pass.
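
The canonical reorder step named in the NFD bullet can be sketched as follows. This is a minimal illustration, not the module's code: sketch_ccc is a hypothetical stand-in for the real CCC lookup and knows only three combining marks, and canonical_reorder_sketch is an assumed name and signature.

```rust
/// Hypothetical stand-in for the module's CCC lookup; covers only a few marks.
fn sketch_ccc(cp: u32) -> u8 {
    match cp {
        0x0323 => 220,          // COMBINING DOT BELOW
        0x0301 | 0x0302 => 230, // COMBINING ACUTE ACCENT, COMBINING CIRCUMFLEX ACCENT
        _ => 0,                 // everything else is treated as a starter
    }
}

/// Sort every maximal run of non-starters by CCC, keeping equal-CCC marks in order.
fn canonical_reorder_sketch(buf: &mut [u32]) {
    let mut i = 0;
    while i < buf.len() {
        if sketch_ccc(buf[i]) == 0 {
            // Starters are anchors: they never move and they end the current run.
            i += 1;
            continue;
        }
        let start = i;
        while i < buf.len() && sketch_ccc(buf[i]) != 0 {
            i += 1;
        }
        // slice::sort_by_key is a stable sort, which the algorithm requires.
        buf[start..i].sort_by_key(|&cp| sketch_ccc(cp));
    }
}
```

With this toy table, the sequence a, U+0301, U+0323 is reordered to a, U+0323, U+0301, while a, U+0302, U+0301 is left alone because both marks share class 230.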

Excluded composites (Full_Composition_Exclusion plus the explicit CompositionExclusions.txt list) are absent from COMPOSITION_PAIRS at generation time, so the composition pass never needs to consult an exclusion table at runtime.
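
A single left-to-right composition pass of the kind described above might look like the sketch below. It is an illustration under stated assumptions: toy_ccc and toy_compose_pair are hypothetical miniature tables standing in for the real CCC data and COMPOSITION_PAIRS, compose_sketch is an assumed name, and the Hangul composer is left out. Because excluded composites are simply missing from the pair table, the loop has no exclusion check.

```rust
/// Hypothetical miniature CCC table.
fn toy_ccc(cp: u32) -> u8 {
    match cp {
        0x0323 => 220,                            // COMBINING DOT BELOW
        0x0301 | 0x0302 | 0x0305 | 0x030A => 230, // acute, circumflex, overline, ring above
        _ => 0,
    }
}

/// Hypothetical miniature pair table: (first, second) -> composed.
fn toy_compose_pair(first: u32, second: u32) -> Option<u32> {
    match (first, second) {
        (0x0041, 0x030A) => Some(0x00C5), // A + ring above
        (0x0061, 0x0301) => Some(0x00E1), // a + acute
        (0x0061, 0x0323) => Some(0x1EA1), // a + dot below
        (0x1EA1, 0x0302) => Some(0x1EAD), // a-dot-below + circumflex
        _ => None,
    }
}

/// Compose an already decomposed and reordered sequence.
fn compose_sketch(input: &[u32]) -> Vec<u32> {
    let mut out: Vec<u32> = Vec::with_capacity(input.len());
    let mut starter: Option<usize> = None; // index in `out` of the last starter
    let mut last_ccc: u8 = 0;              // CCC of the last code point pushed to `out`
    for &cp in input {
        let class = toy_ccc(cp);
        if let Some(si) = starter {
            // `cp` can reach the starter if nothing sits between them (last_ccc == 0)
            // or if the last intervening mark has a strictly lower CCC.
            let unblocked = last_ccc == 0 || last_ccc < class;
            if unblocked {
                if let Some(composed) = toy_compose_pair(out[si], cp) {
                    out[si] = composed; // replace the starter in place; `cp` is consumed
                    continue;
                }
            }
        }
        if class == 0 {
            starter = Some(out.len());
        }
        last_ccc = class;
        out.push(cp);
    }
    out
}
```

With these toy tables, a, U+0323, U+0302 composes in two steps to U+1EAD, whereas a, U+0305, U+0301 stays unchanged: the overline does not combine with the starter and has the same class as the acute accent, so it blocks it.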

Enums§

DecompKind
Mode flag for decompose_to_buffer.

Constants§

HANGUL_L_BASE
First Hangul leading consonant jamo (U+1100).
HANGUL_L_COUNT
Count of leading-consonant jamos.
HANGUL_N_COUNT
HANGUL_V_COUNT * HANGUL_T_COUNT — block size per leading jamo.
HANGUL_S_BASE
First precomposed Hangul syllable (U+AC00).
HANGUL_S_COUNT
Total count of precomposed Hangul syllables.
HANGUL_T_BASE
Hangul trailing-consonant filler (T_BASE itself never composes; the real trailing jamo range is T_BASE + 1 ..= T_BASE + T_COUNT - 1).
HANGUL_T_COUNT
Count of trailing-consonant jamos (including the filler at offset 0).
HANGUL_V_BASE
First Hangul vowel jamo (U+1161).
HANGUL_V_COUNT
Count of vowel jamos.
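
How these constants drive the Hangul arithmetic can be sketched as below. The numeric values are the standard ones from the Unicode Standard and are assumed, not confirmed, to match this module's constants. The function names and signatures are hypothetical, and the pairwise shape of hangul_compose_sketch (L + V, then LV + T) is one possible design rather than a description of the module's hangul_compose.

```rust
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7; // filler: offset 0 means "no trailing consonant"
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28; // includes the filler at offset 0
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588 syllables per leading jamo
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172 precomposed syllables

/// Push the jamo sequence for `cp` onto `out`; return false if `cp` is not a syllable.
fn hangul_decompose_sketch(cp: u32, out: &mut Vec<u32>) -> bool {
    if !(S_BASE..S_BASE + S_COUNT).contains(&cp) {
        return false;
    }
    let s_index = cp - S_BASE;
    out.push(L_BASE + s_index / N_COUNT);
    out.push(V_BASE + (s_index % N_COUNT) / T_COUNT);
    let t_index = s_index % T_COUNT;
    if t_index != 0 {
        out.push(T_BASE + t_index);
    }
    true
}

/// Compose one pair: L + V -> LV syllable, or LV syllable + T -> LVT syllable.
fn hangul_compose_sketch(first: u32, second: u32) -> Option<u32> {
    if (L_BASE..L_BASE + L_COUNT).contains(&first)
        && (V_BASE..V_BASE + V_COUNT).contains(&second)
    {
        let l_index = first - L_BASE;
        let v_index = second - V_BASE;
        return Some(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT);
    }
    // Only an LV syllable (no trailing consonant yet) accepts a real trailing jamo.
    if (S_BASE..S_BASE + S_COUNT).contains(&first)
        && (first - S_BASE) % T_COUNT == 0
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&second)
    {
        return Some(first + (second - T_BASE));
    }
    None
}
```

For example, U+D55C decomposes to U+1112, U+1161, U+11AB, and composing U+1112 with U+1161 gives U+D558, which then composes with U+11AB back to U+D55C.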

Functions§

canonical_reorder
Canonical reorder pass (the Canonical Ordering Algorithm, definition D109 of the Unicode Standard, as used by UAX #15): within every run of non-starters (CCC > 0) sort code points by CCC ascending, stably. Starters (CCC == 0) are anchors that break runs.
ccc
Canonical_Combining_Class for cp. Returns 0 for any code point not present in CCC_TABLE - the table only stores non-zero classes (Not_Reordered is the default).
compose
Canonical composition pass (UAX #15 section 16). Operates on a Vec<u32> that has already been decomposed and reordered.
compose_pair
Composition pair lookup: (first, second) -> composed. Returns None when no canonical composition exists or when the composite is on the exclusion list (filtered out at table-generation time, so the runtime never re-checks).
decompose_and_reorder
Common scaffold: decompose into a Vec<u32> then canonical-reorder.
decompose_to_buffer
Decompose input into out using the requested table. The payload tables are already fully expanded (the generator script flattens nested decompositions), so a single lookup per code point is sufficient - no recursion needed at runtime.
encode
Re-encode a Vec<u32> to a String. Any code point that does not round-trip through char::from_u32 (surrogates, values > U+10FFFF) is silently dropped. Such values cannot appear in our tables, but handling them defensively keeps from_u32_unchecked out of the picture.
encode_ccc_table_bytes
Encode the canonical-combining-class table.
encode_composition_table_bytes
Encode the canonical composition pair table.
encode_decomp_table_bytes
Encode NFD_INDEX + NFD_POOL into the byte layout the wasm runtime expects.
hangul_compose
Algorithmic Hangul composition. Tries L + V (and optionally + T) -> precomposed syllable. Returns None when the pair is not a valid jamo pairing.
hangul_decompose_into
Algorithmic Hangul decomposition. When cp is in the precomposed syllable block, writes the L / V (/ optional T) jamo sequence to out and returns true; returns false if cp is not a precomposed Hangul syllable.
nfd_lookup
Look up the canonical decomposition of cp in NFD_INDEX / NFD_POOL. Returns None if cp has no canonical decomposition.
nfkd_lookup
Compatibility analog of nfd_lookup. Falls back to the canonical entry when no compatibility mapping exists (the generator script duplicates canonical-only entries into NFKD as well).
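to_nfc
Public entry point: returns the input normalized to NFC as an owned String.
to_nfd
Public entry point: returns the input normalized to NFD as an owned String.
to_nfkc
Public entry point: returns the input normalized to NFKC as an owned String.
to_nfkd
Public entry point: returns the input normalized to NFKD as an owned String.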
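
The defensive re-encoding described for encode can be sketched in a few lines. The name encode_sketch and its slice parameter are assumptions made for illustration; only char::from_u32 from the standard library is relied on.

```rust
/// Re-encode code points to a String, dropping anything that is not a valid char.
fn encode_sketch(code_points: &[u32]) -> String {
    code_points
        .iter()
        // char::from_u32 returns None for surrogates and for values above U+10FFFF,
        // so filter_map discards them without any unsafe code.
        .filter_map(|&cp| char::from_u32(cp))
        .collect()
}
```

Passing 0x48, 0xD800, 0x69, 0x110000 yields "Hi": the surrogate and the out-of-range value are dropped.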