Unicode normalization (UAX #15).
v3++ b-5: implements the four standard normalization forms - NFC,
NFD, NFKC, NFKD - directly against the embedded UCD 14.0.0 tables
in super::normalization_data. The implementation deliberately has no
third-party dependencies, so that:
- Both the tree-walk evaluator and the wasm-AOT backend share one dataset and one algorithm, avoiding silent drift between executors.
- Bumping the Unicode version is a single regenerate-and-commit step (see tools/gen_normalization_tables.py).
The four entry points (to_nfd, to_nfkd, to_nfc,
to_nfkc) all return owned Strings. Hangul syllables are
decomposed / composed algorithmically per UAX #15 section 16 -
keeping them in the data tables would cost ~88 KB for the syllable
block alone with no performance gain.
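The arithmetic behind the algorithmic decomposition can be sketched as follows. This is a minimal illustration based on the Hangul syllable arithmetic published in the Unicode Standard, not the module's actual code: the constant names and the function hangul_decompose_sketch are assumptions chosen for this example, and the numeric values are the standard ones rather than values read from this crate.

```rust
// Standard Hangul syllable arithmetic constants (Unicode Standard).
// Names are illustrative and need not match the module's constants.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // trailing filler (one before the first trailing consonant)
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28; // includes the filler at offset 0
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588 syllables per leading consonant
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172 syllables in total

/// Hypothetical helper: appends the jamo sequence for `cp` to `out` and
/// returns true, or returns false (leaving `out` untouched) when `cp` is
/// not a precomposed Hangul syllable.
fn hangul_decompose_sketch(cp: u32, out: &mut Vec<u32>) -> bool {
    if !(S_BASE..S_BASE + S_COUNT).contains(&cp) {
        return false;
    }
    let s_index = cp - S_BASE;
    out.push(L_BASE + s_index / N_COUNT);
    out.push(V_BASE + (s_index % N_COUNT) / T_COUNT);
    let t_index = s_index % T_COUNT;
    if t_index != 0 {
        // Offset 0 is the filler, meaning "no trailing consonant".
        out.push(T_BASE + t_index);
    }
    true
}
```

For example, U+D55C yields the sequence U+1112, U+1161, U+11AB, while U+AC00 has no trailing consonant and yields only U+1100, U+1161. No per-syllable table entry is involved, which is the saving the paragraph above describes.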
§Algorithm sketch
- NFD: decode each char -> recursive canonical decomposition (data table + Hangul algorithm) -> canonical reorder (stable sort on CCC within each non-starter run) -> re-encode.
- NFKD: same as NFD but using the compatibility table.
- NFC: run NFD, then a single left-to-right composition pass that pairs each starter with subsequent characters via COMPOSITION_PAIRS plus the algorithmic Hangul composer.
- NFKC: run NFKD, then the same composition pass.
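The canonical reorder step named in the list above can be sketched as below. This is an illustration under stated assumptions rather than the module's code: ccc_sketch is a hypothetical stand-in that knows only three combining marks, whereas the real lookup would consult the generated table.

```rust
/// Hypothetical stand-in for a CCC lookup. It covers only three combining
/// marks; every other code point is treated as a starter (CCC 0).
fn ccc_sketch(cp: u32) -> u8 {
    match cp {
        0x0327 => 202, // COMBINING CEDILLA
        0x0323 => 220, // COMBINING DOT BELOW
        0x0301 => 230, // COMBINING ACUTE ACCENT
        _ => 0,
    }
}

/// Sorts each maximal run of non-starters by CCC, ascending.
/// Starters are never moved and always end the current run.
fn canonical_reorder_sketch(cps: &mut [u32]) {
    let mut i = 0;
    while i < cps.len() {
        if ccc_sketch(cps[i]) == 0 {
            i += 1;
            continue;
        }
        let start = i;
        while i < cps.len() && ccc_sketch(cps[i]) != 0 {
            i += 1;
        }
        // `sort_by_key` on slices is stable, so marks that share a CCC
        // keep their original relative order.
        cps[start..i].sort_by_key(|&cp| ccc_sketch(cp));
    }
}
```

With this stand-in, the sequence U+0061, U+0301, U+0323 is reordered to U+0061, U+0323, U+0301, because dot below (220) sorts before acute (230).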
Excluded composites (Full_Composition_Exclusion plus the explicit
CompositionExclusions.txt list) are absent from
COMPOSITION_PAIRS at generation time, so the composition pass
never needs to consult an exclusion table at runtime.
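A minimal sketch of such a composition pass follows, to show how a pair table that simply omits excluded composites removes the need for any runtime exclusion check. Everything here is an assumption made for illustration: ccc_sketch and compose_pair_sketch are tiny hypothetical stand-ins for the generated tables, compose_sketch is not the module's compose function, and Hangul composition is left out.

```rust
/// Hypothetical stand-in CCC lookup covering only the marks used below.
fn ccc_sketch(cp: u32) -> u8 {
    match cp {
        0x093C => 7,                     // DEVANAGARI SIGN NUKTA
        0x0327 => 202,                   // COMBINING CEDILLA
        0x0323 => 220,                   // COMBINING DOT BELOW
        0x0301 | 0x0302 | 0x0305 => 230, // acute, circumflex, overline
        _ => 0,
    }
}

/// Hypothetical stand-in pair table. The pair (U+0915, U+093C) is
/// deliberately absent: its composite U+0958 is a composition exclusion,
/// so a generated table would not contain it either.
fn compose_pair_sketch(first: u32, second: u32) -> Option<u32> {
    match (first, second) {
        (0x0061, 0x0301) => Some(0x00E1),
        (0x0061, 0x0323) => Some(0x1EA1),
        (0x1EA1, 0x0302) => Some(0x1EAD),
        _ => None,
    }
}

/// Single left-to-right pass over decomposed, canonically ordered input.
fn compose_sketch(input: &[u32]) -> Vec<u32> {
    let mut out: Vec<u32> = Vec::with_capacity(input.len());
    let mut starter: Option<usize> = None; // index of the last starter in `out`
    let mut last_ccc: u8 = 0; // CCC of the last code point pushed to `out`
    for &cp in input {
        let cc = ccc_sketch(cp);
        if let Some(si) = starter {
            // `cp` may combine with the starter if it directly follows it,
            // or if every mark in between has a strictly lower CCC. The
            // input is ordered, so checking the last mark is enough.
            let adjacent = out.len() == si + 1;
            if adjacent || last_ccc < cc {
                if let Some(composed) = compose_pair_sketch(out[si], cp) {
                    out[si] = composed;
                    continue;
                }
            }
        }
        if cc == 0 {
            starter = Some(out.len());
        }
        last_ccc = cc;
        out.push(cp);
    }
    out
}
```

In this sketch, U+0061 U+0323 U+0302 composes in two steps to U+1EAD, U+0061 U+0305 U+0301 stays unchanged because the overline blocks the acute, and U+0915 U+093C stays decomposed purely because the pair is missing from the table.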
Enums§
- DecompKind - Mode flag for decompose_to_buffer.
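One way a mode flag of this kind could select between the two tables is sketched below. The variant names, the lookup helpers, and the function signature are all hypothetical; the page does not show the real definition of DecompKind, and Hangul handling is omitted here.

```rust
/// Hypothetical mode flag; the real DecompKind variants are not shown
/// on this page and may be named differently.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DecompKindSketch {
    Canonical,
    Compatibility,
}

// Tiny stand-in tables holding already fully expanded mappings.
const E_ACUTE: &[u32] = &[0x0065, 0x0301]; // U+00E9 -> e + combining acute
const FI_LIGATURE: &[u32] = &[0x0066, 0x0069]; // U+FB01 -> f + i

fn canonical_lookup_sketch(cp: u32) -> Option<&'static [u32]> {
    match cp {
        0x00E9 => Some(E_ACUTE),
        _ => None,
    }
}

/// Falls back to the canonical mapping when no compatibility mapping exists.
fn compat_lookup_sketch(cp: u32) -> Option<&'static [u32]> {
    match cp {
        0xFB01 => Some(FI_LIGATURE),
        _ => canonical_lookup_sketch(cp),
    }
}

/// One lookup per code point: code points without a mapping are copied
/// through unchanged.
fn decompose_to_buffer_sketch(input: &str, kind: DecompKindSketch, out: &mut Vec<u32>) {
    for ch in input.chars() {
        let cp = ch as u32;
        let mapping = match kind {
            DecompKindSketch::Canonical => canonical_lookup_sketch(cp),
            DecompKindSketch::Compatibility => compat_lookup_sketch(cp),
        };
        match mapping {
            Some(seq) => out.extend_from_slice(seq),
            None => out.push(cp),
        }
    }
}
```

With these stand-in tables, the input "\u{FB01}\u{E9}" keeps the ligature in canonical mode but expands it to f and i in compatibility mode; the accented letter is decomposed in both modes.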
Constants§
- HANGUL_L_BASE - First Hangul leading consonant jamo (U+1100).
- HANGUL_L_COUNT - Count of leading-consonant jamos.
- HANGUL_N_COUNT - HANGUL_V_COUNT * HANGUL_T_COUNT: block size per leading jamo.
- HANGUL_S_BASE - First precomposed Hangul syllable (U+AC00).
- HANGUL_S_COUNT - Total count of precomposed Hangul syllables.
- HANGUL_T_BASE - Hangul trailing-consonant filler (T_BASE itself never composes; the real trailing jamo range is T_BASE + 1 ..= T_BASE + T_COUNT - 1).
- HANGUL_T_COUNT - Count of trailing-consonant jamos (including the filler at offset 0).
- HANGUL_V_BASE - First Hangul vowel jamo (U+1161).
- HANGUL_V_COUNT - Count of vowel jamos.
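To show how these constants fit together, here is a hedged sketch of pairwise Hangul composition. The numeric values are the standard ones from the Unicode Standard and are assumed, not read from this crate; the function name and its (first, second) signature are likewise assumptions and may differ from hangul_compose.

```rust
// Standard Hangul arithmetic values (assumed; the crate's constants
// carry a HANGUL_ prefix and are not reproduced from its source).
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7;
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

/// Hypothetical pairwise composer: L + V -> LV syllable, LV + T -> LVT
/// syllable, anything else -> None.
fn hangul_compose_sketch(first: u32, second: u32) -> Option<u32> {
    // Leading consonant followed by a vowel.
    if (L_BASE..L_BASE + L_COUNT).contains(&first)
        && (V_BASE..V_BASE + V_COUNT).contains(&second)
    {
        let l_index = first - L_BASE;
        let v_index = second - V_BASE;
        return Some(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT);
    }
    // LV syllable (no trailing consonant yet) followed by a real trailing
    // consonant. The range starts at T_BASE + 1 because T_BASE is the filler.
    if (S_BASE..S_BASE + S_COUNT).contains(&first)
        && (first - S_BASE) % T_COUNT == 0
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&second)
    {
        return Some(first + (second - T_BASE));
    }
    None
}
```

For instance, U+1112 followed by U+1161 composes to U+D558, and U+D558 followed by U+11AB composes to U+D55C. A syllable that already carries a trailing consonant, or a second argument equal to the filler U+11A7, yields None.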
Functions§
- canonical_reorder - Canonical reorder pass (UAX #15 D109): within every run of non-starters (CCC > 0), sort code points by CCC ascending, stably. Starters (CCC == 0) are anchors that break runs.
- ccc - Canonical_Combining_Class for cp. Returns 0 for any code point not present in CCC_TABLE; the table stores only non-zero classes (Not_Reordered is the default).
- compose - Canonical composition pass (UAX #15 section 16). Operates on a Vec<u32> that has already been decomposed and reordered.
- compose_pair - Composition pair lookup: (first, second) -> composed. Returns None when no canonical composition exists or when the composite is on the exclusion list (filtered out at table-generation time, so the runtime never re-checks).
- decompose_and_reorder - Common scaffold: decompose into a Vec<u32>, then canonical-reorder.
- decompose_to_buffer - Decompose input into out using the requested table. The payload tables are already fully expanded (the generator script flattens nested decompositions), so a single lookup per code point is sufficient; no recursion is needed at runtime.
- encode - Re-encode a Vec<u32> to a String. Any code point that does not round-trip through char::from_u32 (surrogates, > U+10FFFF) is silently dropped. Such values cannot appear in our tables, but coding defensively keeps from_u32_unchecked out of the picture.
- encode_ccc_table_bytes - Encode the canonical-combining-class table.
- encode_composition_table_bytes - Encode the canonical composition pair table.
- encode_decomp_table_bytes - Encode NFD_INDEX + NFD_POOL into the byte layout the wasm runtime expects.
- hangul_compose - Algorithmic Hangul composition. Tries L + V (and optionally + T) -> precomposed syllable. Returns None when the pair is not a valid jamo pairing.
- hangul_decompose_into - Algorithmic Hangul decomposition. Writes the L / V (/ optional T) jamo sequence into out when cp is in the syllable block; returns false if cp is not a precomposed Hangul syllable.
- nfd_lookup - Look up the canonical decomposition of cp in NFD_INDEX / NFD_POOL. Returns None if cp has no canonical decomposition.
- nfkd_lookup - Compatibility analog of nfd_lookup. Falls back to the canonical entry when no compatibility mapping exists (the generator script duplicates canonical-only entries into NFKD as well).
- to_nfc - Public entry point for NFC; returns an owned String.
- to_nfd - Public entry point for NFD; returns an owned String.
- to_nfkc - Public entry point for NFKC; returns an owned String.
- to_nfkd - Public entry point for NFKD; returns an owned String.
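The defensive re-encoding described for encode can be sketched in a few lines. This is an assumed shape, not the module's source: encode_sketch is a hypothetical name, and it takes a slice rather than a Vec<u32> for convenience.

```rust
/// Hypothetical re-encoder: converts code points to a String, silently
/// skipping any value that `char::from_u32` rejects (surrogates and
/// values above U+10FFFF), so no unchecked conversion is needed.
fn encode_sketch(cps: &[u32]) -> String {
    cps.iter().filter_map(|&cp| char::from_u32(cp)).collect()
}
```

Given the input [0x48, 0xD800, 0x69, 0x110000], the surrogate and the out-of-range value are dropped and the result is "Hi".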