Unicode-aware tables, algorithms, and the glob matcher shared by the tree-walk evaluator and the wasm-AOT / native codegen backends.
This crate is a leaf: it depends on no other relon-* crate
(matching relon-util / relon-cap), so it sits at the very
bottom of the workspace dep graph. It consolidates every Unicode
dataset, the SIMD ASCII fast path, and the linear-time glob
matcher that previously lived under relon-ir/src/unicode/ and
relon-ir/src/glob.rs. Pulling them into a standalone crate lets
relon-evaluator consume the shared tables without an edge to
relon-ir (the evaluator is a tree-walk engine and never touches
the IR surface), keeping the dep graph honest.
relon-ir keeps same-named re-exports so the codegen backends
that reach for relon_ir::ascii_fold_simd / relon_ir::glob /
etc. compile unchanged.
§Module map
- case_folding: UCD simple (1:1) upper / lower folding tables, generated at build time from char::to_uppercase / char::to_lowercase. Drives the wasm-AOT __casefold_lookup helper.
- full_case_folding: UAX #21 full case folding (multi-codepoint mappings, Greek final sigma, Turkish / Azerbaijani locale overrides). Generated from data/SpecialCasing.txt via tools/gen_full_case_folding.py.
- full_case_folding_data: raw generated tables for full_case_folding. Pulled in via include!() from full_case_folding.rs rather than declared as a sibling module, matching the pre-split layout so the generated symbols stay in a single namespace.
- combining_marks: Mn + Mc + Me range table used by every case-fold body to decide whether a codepoint resets the word boundary.
- whitespace: non-ASCII White_Space ranges (the ASCII subset is special-cased on the wasm fast path).
- normalization: UAX #15 NFD / NFKD / NFC / NFKC algorithms on top of the normalization_data tables. UCD version pinned at 14.0.0; regenerate via tools/gen_normalization_tables.py.
- normalization_data: generated UCD 14.0.0 decomposition, canonical-combining-class, and composition-pair tables.
- ascii_fold_simd: v3++ item 4 SIMD ASCII fast path for the tree-walk upper / lower / title bodies. Only the wasm32 arm uses unsafe v128 intrinsics; other targets stay on the chunked scalar fallback.
- glob: linear-time Unicode-aware glob matcher backing the glob_match(s, pattern) -> Bool stdlib function.
UCD version: Unicode 14.0.0 across every regeneration script. When a future Unicode bump lands, regenerate the four data-bearing siblings in one commit so the wasm-AOT data section and the tree-walk algorithm stay consistent.
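The chunked scalar fallback mentioned for ascii_fold_simd can be pictured with a small sketch. This is not the crate's implementation: the function name upper_ascii_fast_sketch, the 8-byte chunk size, and the choice to hand the remainder to char::to_uppercase are all assumptions made for illustration. It only shows the general shape of the technique: fold whole ASCII chunks byte-wise and fall back to the Unicode-aware path at the first chunk that contains a non-ASCII byte.

```rust
/// Hypothetical sketch of a chunked scalar ASCII fast path for upper-casing.
/// The name and the chunk size are illustrative assumptions, not crate API.
fn upper_ascii_fast_sketch(input: &str) -> String {
    const CHUNK: usize = 8; // assumed chunk width
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;

    // Fast path: consume full chunks while every byte in the chunk is ASCII.
    while i + CHUNK <= bytes.len() {
        let chunk = &bytes[i..i + CHUNK];
        if !chunk.is_ascii() {
            break;
        }
        for &b in chunk {
            out.push(b.to_ascii_uppercase() as char);
        }
        i += CHUNK;
    }

    // Slow path: every byte before `i` was ASCII, so `i` is a char boundary
    // and slicing the original string here is safe.
    for c in input[i..].chars() {
        out.extend(c.to_uppercase());
    }
    out
}
```

With this sketch, upper_ascii_fast_sketch("hello, world! straße") yields "HELLO, WORLD! STRASSE": the first two 8-byte chunks take the byte-wise path and the tail goes through the Unicode-aware mapping.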
Modules§
- ascii_fold_simd: v3++ item 4, SIMD ASCII fast-path for case folding.
- case_folding: v3+ a-4 Unicode-aware case folding tables embedded into wasm-AOT upper/lower stdlib bodies.
- combining_marks: v3++ b-4 Unicode combining-mark range table embedded into the wasm-AOT title/upper/lower stdlib bodies.
- full_case_folding: v3++ b-6 full Unicode case folding (UAX #21).
- glob: Linear-time Unicode-aware glob pattern matcher.
- normalization: Unicode normalization (UAX #15).
- normalization_data: AUTO-GENERATED; do not edit by hand.
- whitespace: v3++ b-4 Unicode whitespace range table embedded into the wasm-AOT title stdlib body.
Functions§
- cp_in_ranges: Binary-search a sorted (start, end) range table for cp; used by every compile-time membership predicate (whitespace, combining-marks, full-fold locale ranges). The wasm body emits the same comparison via a hand-unrolled loop instead so the per-cp cost stays O(log N) on both sides.
- encode_u32_pair_table: Encode a (u32, u32) table into the wasm data-section layout shared by case-folding, combining-mark, whitespace, and full-fold range tables: [count: u32 LE][(a: u32 LE, b: u32 LE) × N]. The runtime helpers all binary-search with the same (addr + 4 + mid * 8) rebase arithmetic, so the byte format is identical regardless of whether the pair encodes (input_cp, output_cp) or (start, end).
- encoded_u32_pair_table_size: Byte size of encode_u32_pair_table's output: header + 8 bytes per entry. Codegen calls this to pre-size data sections.
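To make the byte layout and the rebase arithmetic concrete, here is a minimal sketch of the three functions plus a lookup over the encoded bytes. The actual signatures are not shown on this page, so every name carries a _sketch suffix and the parameter and return types are assumptions. The sketch also assumes range ends are inclusive and that the table is sorted and non-overlapping.

```rust
use std::cmp::Ordering;

/// Sketch: membership test over sorted, non-overlapping ranges.
/// Inclusive `end` is an assumption.
fn cp_in_ranges_sketch(cp: u32, ranges: &[(u32, u32)]) -> bool {
    ranges
        .binary_search_by(|&(start, end)| {
            if cp < start {
                Ordering::Greater // this range lies above cp
            } else if cp > end {
                Ordering::Less // this range lies below cp
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Sketch: 4-byte count header plus 8 bytes per (u32, u32) entry.
fn encoded_size_sketch(entries: usize) -> usize {
    4 + entries * 8
}

/// Sketch: [count: u32 LE][(a: u32 LE, b: u32 LE) x N].
fn encode_pairs_sketch(table: &[(u32, u32)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(encoded_size_sketch(table.len()));
    out.extend_from_slice(&(table.len() as u32).to_le_bytes());
    for &(a, b) in table {
        out.extend_from_slice(&a.to_le_bytes());
        out.extend_from_slice(&b.to_le_bytes());
    }
    out
}

/// Read one little-endian u32 at byte offset `at`.
fn read_u32_le(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Sketch of a runtime-style lookup over the encoded bytes, using the
/// `4 + mid * 8` offset (the table's base address is taken as 0 here).
fn in_encoded_ranges_sketch(bytes: &[u8], cp: u32) -> bool {
    let count = read_u32_le(bytes, 0) as usize;
    let (mut lo, mut hi) = (0usize, count);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let base = 4 + mid * 8;
        let start = read_u32_le(bytes, base);
        let end = read_u32_le(bytes, base + 4);
        if cp < start {
            hi = mid;
        } else if cp > end {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    false
}
```

For a three-entry table the encoded form is 4 + 3 * 8 = 28 bytes, and the slice-based and byte-based lookups agree because both walk the same sorted pairs; only the addressing differs.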