Module normalization_data

AUTO-GENERATED — do not edit by hand.

Generated from the Unicode Character Database (UCD) version 14.0.0:

  • UnicodeData.txt — decomposition mapping + CCC
  • DerivedNormalizationProps.txt — Full_Composition_Exclusion
  • CompositionExclusions.txt — explicit exclusion list

Regenerate with:

python3 crates/relon-unicode/tools/gen_normalization_tables.py \
    > crates/relon-unicode/src/normalization_data.rs

Last regenerated: 2026-05-18 (UCD 14.0.0).

Bump procedure when a new UCD ships:

  1. Drop the new *.txt UCD files where the script expects them (see gen_normalization_tables.py for the exact paths).
  2. Re-run the script. It fully expands multi-level decompositions, filters out characters with Full_Composition_Exclusion, and excludes Hangul syllables, so none of that needs manual fix-up.
  3. Run cargo test -p relon-ir unicode to confirm round-trip conformance.

Hangul syllables (U+AC00..=U+D7A3) are decomposed and composed algorithmically per UAX #15 §16 — keeping them out of the tables saves ~88 KB.
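
To illustrate why Hangul needs no table entries, here is a minimal sketch of the arithmetic decomposition and composition. The constants are the standard ones from the Unicode Standard's Hangul syllable algorithm. The function names `hangul_decompose` and `hangul_compose` are hypothetical and are not taken from this crate, so the real helper may be shaped differently.

```rust
// Standard constants for Hangul syllable arithmetic.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

/// Hypothetical helper: splits a precomposed syllable into (L, V, optional T).
/// Returns None for code points outside U+AC00..=U+D7A3.
fn hangul_decompose(cp: u32) -> Option<(u32, u32, Option<u32>)> {
    let s_index = cp.checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t_index = s_index % T_COUNT;
    let t = if t_index == 0 { None } else { Some(T_BASE + t_index) };
    Some((l, v, t))
}

/// Hypothetical helper: composes L+V into LV, or LV+T into LVT.
fn hangul_compose(first: u32, second: u32) -> Option<u32> {
    // Leading consonant followed by a vowel.
    if (L_BASE..L_BASE + L_COUNT).contains(&first)
        && (V_BASE..V_BASE + V_COUNT).contains(&second)
    {
        let l_index = first - L_BASE;
        let v_index = second - V_BASE;
        return Some(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT);
    }
    // LV syllable (no trailing consonant yet) followed by a trailing consonant.
    if (S_BASE..S_BASE + S_COUNT).contains(&first)
        && (first - S_BASE) % T_COUNT == 0
        && (T_BASE + 1..T_BASE + T_COUNT).contains(&second)
    {
        return Some(first + (second - T_BASE));
    }
    None
}
```

For example, `hangul_decompose(0xD55C)` yields `Some((0x1112, 0x1161, Some(0x11AB)))`, and composing those three jamo pairwise with `hangul_compose` gives back `0xD55C`.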

Statics

CCC_TABLE
Canonical_Combining_Class, sparse (only non-zero entries). Sorted by code point. Lookup falls back to 0 when absent.
COMPOSITION_PAIRS
Canonical composition pairs, sorted by (first, second). Excludes any pair whose composite has Full_Composition_Exclusion = True or appears in CompositionExclusions.txt. Hangul composition runs through its own algorithmic helper.
NFD_INDEX
Sorted by code point. Each entry is (cp, payload_offset, payload_len). payload_offset indexes into NFD_POOL. Hangul syllables are excluded; callers must run the algorithmic decompose first.
NFD_POOL
NFKD_INDEX
NFKD_POOL
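
The lookups described above can be sketched as binary searches over the sorted tables. The tables below are tiny hypothetical stand-ins that reuse the documented names. Their element types (`(u32, u8)`, `(u32, u32, u32)`, `(u32, u16, u8)`, `u32`) are assumptions, as are the helper names `ccc`, `compose_pair`, and `nfd_lookup`. The generated statics may use different types.

```rust
// Hypothetical stand-in tables; the real ones are generated and far larger.
// Assumed shape: (code point, non-zero combining class), sorted by code point.
static CCC_TABLE: &[(u32, u8)] = &[
    (0x0300, 230), // COMBINING GRAVE ACCENT
    (0x0301, 230), // COMBINING ACUTE ACCENT
    (0x0323, 220), // COMBINING DOT BELOW
    (0x0327, 202), // COMBINING CEDILLA
];

// Assumed shape: (first, second, composite), sorted by (first, second).
static COMPOSITION_PAIRS: &[(u32, u32, u32)] = &[
    (0x0041, 0x0300, 0x00C0),
    (0x0041, 0x0301, 0x00C1),
    (0x0065, 0x0301, 0x00E9),
];

// Assumed shape: flat pool of fully expanded decompositions.
static NFD_POOL: &[u32] = &[0x0041, 0x0300, 0x0065, 0x0301, 0x0055, 0x0308, 0x0304];

// (cp, payload_offset, payload_len), sorted by code point.
static NFD_INDEX: &[(u32, u16, u8)] = &[
    (0x00C0, 0, 2),
    (0x00E9, 2, 2),
    (0x01D5, 4, 3), // multi-level mapping stored already expanded
];

/// Canonical_Combining_Class lookup; absent entries fall back to 0.
fn ccc(cp: u32) -> u8 {
    CCC_TABLE
        .binary_search_by_key(&cp, |&(c, _)| c)
        .map(|i| CCC_TABLE[i].1)
        .unwrap_or(0)
}

/// Table-driven composition; Hangul pairs are expected to be handled elsewhere.
fn compose_pair(first: u32, second: u32) -> Option<u32> {
    COMPOSITION_PAIRS
        .binary_search_by_key(&(first, second), |&(a, b, _)| (a, b))
        .ok()
        .map(|i| COMPOSITION_PAIRS[i].2)
}

/// Returns the canonical decomposition slice for `cp`, if the table has one.
/// Hangul syllables are not in the table, so they return None here.
fn nfd_lookup(cp: u32) -> Option<&'static [u32]> {
    let i = NFD_INDEX.binary_search_by_key(&cp, |&(c, _, _)| c).ok()?;
    let (_, offset, len) = NFD_INDEX[i];
    let start = offset as usize;
    Some(&NFD_POOL[start..start + len as usize])
}
```

With these stand-in tables, `ccc(0x0041)` is 0 because starters are absent from the sparse table, and `nfd_lookup(0x01D5)` returns the three code points U+0055, U+0308, U+0304 in a single lookup. A caller would check the Hangul range before consulting `nfd_lookup`, since `nfd_lookup(0xAC00)` returns `None`.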