intl
Pure-Rust, #![no_std] internationalization primitives — a long-term, pure-Rust
analog of ICU (collation, number formatting, normalization, transliteration, …).
The foundational layer, available today, is the unicode module: Unicode
rune analysis driven by the official Unicode Character Database (UCD), with
character properties compiled directly into Rust match dispatch by an offline
code generator — so every lookup is a const fn, allocates nothing, and needs
no runtime initialization.
- no_std, no alloc: usable in embedded, kernel, and WASM contexts.
- Tables as code: the UCD is converted into a two-level paged match ("switch/case") index, not parsed at runtime.
- Feature-selectable ranges: compile only the slice of the codepoint space you need. Anything outside the compiled range resolves to the neutral default (Unassigned/false), so every lookup is total.
- Targets Unicode 17.0.0.
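To make the "tables as code" idea concrete, here is a minimal sketch of a two-level paged match index. It is not the generated code: the 256-codepoint page size, the truncated Category enum, and the names page_00, page_06, category_u32, and category are all assumptions made for illustration, and only a handful of ranges are filled in.

```rust
/// Hypothetical, heavily truncated category set, used only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    UppercaseLetter,
    LowercaseLetter,
    DecimalNumber,
    Unassigned,
}

/// Second level for page 0x00 (U+0000..=U+00FF), keyed by the low byte.
/// Only a few ranges are listed; real data would classify every codepoint.
const fn page_00(low: u8) -> Category {
    match low {
        0x30..=0x39 => Category::DecimalNumber,
        0x41..=0x5A => Category::UppercaseLetter,
        0x61..=0x7A => Category::LowercaseLetter,
        _ => Category::Unassigned,
    }
}

/// Second level for page 0x06 (U+0600..=U+06FF).
const fn page_06(low: u8) -> Category {
    match low {
        0x60..=0x69 => Category::DecimalNumber, // Arabic-Indic digits
        _ => Category::Unassigned,
    }
}

/// First level: dispatch on the page number (assumed here to be cp >> 8).
/// Pages with no arm fall through to the neutral default.
pub const fn category_u32(cp: u32) -> Category {
    match cp >> 8 {
        0x00 => page_00((cp & 0xFF) as u8),
        0x06 => page_06((cp & 0xFF) as u8),
        _ => Category::Unassigned,
    }
}

/// `char` front end over the `u32` lookup.
pub const fn category(c: char) -> Category {
    category_u32(c as u32)
}

/// Because every step is a `const fn`, the lookup can run at compile time.
pub const CATEGORY_OF_A: Category = category('A');
```

Both levels are plain match expressions, so the lookup needs no static initialization and no heap, and the catch-all arms are what make it total.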
Usage
[dependencies]
intl = "0.1"
use ;
assert_eq!;
assert_eq!;
assert!;
assert!; // Arabic-Indic digit three
assert!;
assert!; // a reserved codepoint
Every predicate exists both as a free const fn taking a char
(intl::unicode::is_uppercase('A')) and as a method via the CharExt trait
('A'.is_uppercase()).
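The pairing of a free const fn with an extension-trait method can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the crate's source: the trait is named CharExt after the text above, but its body and the two hard-coded digit ranges are placeholders. The sketch uses is_decimal_digit because char has no inherent method of that name, so the call below unambiguously resolves to the trait method.

```rust
/// Free function form: a `const fn`, so it is usable in const contexts.
/// The two ranges are placeholders, not a complete Nd table.
pub const fn is_decimal_digit(c: char) -> bool {
    matches!(c, '0'..='9' | '\u{0660}'..='\u{0669}')
}

/// Extension trait giving `char` the same predicate as a method.
pub trait CharExt {
    fn is_decimal_digit(self) -> bool;
}

impl CharExt for char {
    #[inline]
    fn is_decimal_digit(self) -> bool {
        // Delegates to the free function, so there is one source of truth.
        is_decimal_digit(self)
    }
}

/// The free function can be evaluated at compile time; trait methods cannot
/// be called in const contexts on stable Rust, which is one reason to offer both.
pub const SEVEN_IS_DIGIT: bool = is_decimal_digit('7');
```

With the trait in scope, '7'.is_decimal_digit() and is_decimal_digit('7') return the same value.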
Range tiers
Cargo features select how much of the codepoint space is compiled in, trading coverage for binary size. The tiers are nested (each implies the smaller ones):
| feature | codepoints compiled |
|---|---|
| ascii | U+0000..=U+007F |
| latin1 | U+0000..=U+00FF |
| bmp | U+0000..=U+FFFF (default) |
| full | U+0000..=U+10FFFF |
# Latin-1 only, no default BMP tables:
intl = { version = "0.1", default-features = false, features = ["latin1"] }
# Everything, including supplementary planes:
intl = { version = "0.1", default-features = false, features = ["full"] }
A codepoint outside the compiled tier reports GeneralCategory::Unassigned
(and false for every boolean predicate) — exactly as a genuinely unassigned
codepoint would.
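The out-of-tier behavior can be pictured as a bounds check in front of the table lookup. The sketch below is an assumption-laden model, not the crate's implementation: it passes the tier as a runtime value (the hypothetical Tier enum) so the example is self-contained, whereas the crate selects the tier with Cargo features, and its stand-in table is deliberately incomplete.

```rust
/// Hypothetical runtime stand-in for the compile-time feature tiers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Ascii,
    Latin1,
    Bmp,
    Full,
}

impl Tier {
    /// Highest codepoint covered by each tier, taken from the table above.
    pub const fn max_codepoint(self) -> u32 {
        match self {
            Tier::Ascii => 0x7F,
            Tier::Latin1 => 0xFF,
            Tier::Bmp => 0xFFFF,
            Tier::Full => 0x10FFFF,
        }
    }
}

/// Deliberately incomplete stand-in for a generated Alphabetic table:
/// ASCII letters plus the main Latin-1 letter ranges.
const fn table_is_alphabetic(cp: u32) -> bool {
    matches!(cp, 0x41..=0x5A | 0x61..=0x7A | 0xC0..=0xD6 | 0xD8..=0xF6 | 0xF8..=0xFF)
}

/// Total lookup: anything above the tier's limit gets the neutral `false`.
pub const fn is_alphabetic_in(tier: Tier, c: char) -> bool {
    let cp = c as u32;
    if cp > tier.max_codepoint() {
        return false; // outside the compiled range
    }
    table_is_alphabetic(cp)
}
```

In this model, 'é' (U+00E9) is alphabetic under Tier::Latin1 but reports false under Tier::Ascii, which mirrors the trade-off a smaller tier makes.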
What the unicode module covers
- General_Category (the 29 UAX #44 categories) and their major Groups, via general_category / general_category_u32.
- Boolean predicates: is_alphabetic, is_uppercase, is_lowercase, is_whitespace (from the derived Unicode properties), plus the category-derived is_letter, is_mark, is_numeric, is_decimal_digit, is_punctuation, is_symbol, is_separator, is_control, is_format, and is_assigned.
- Normalization (UAX #15): nfd, nfc, nfkd, nfkc as streaming, allocation-free iterator adaptors over Iterator<Item = char>; quick-check helpers is_nfc / is_nfd / is_nfkc / is_nfkd (and tri-state quick_check_* → IsNormalized); plus canonical_combining_class. Validated against the full official NormalizationTest.txt conformance suite.
- Full, unconditional case mapping: to_uppercase, to_lowercase, to_titlecase, and case_fold, each returning a CaseMapIter (1–3 chars, e.g. ß→SS; no allocation).
- Script and Script_Extensions (UAX #24) via script / script_u32 and script_extensions / script_extensions_u32 (Script enum with .long_name(); ScriptExtensions with .contains() / .iter()).
- East_Asian_Width (UAX #11) via east_asian_width / east_asian_width_u32 (EastAsianWidth enum, with .is_wide()).
- Numeric_Type and exact Numeric_Value via numeric_type and numeric_value / numeric_value_u32 (NumericValue is a rational numerator / denominator, with .to_i64() / .as_f64()).
- UNICODE_VERSION of the embedded tables.
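An allocation-free case-mapping result of one to three chars can be modeled as a small fixed-capacity iterator. The sketch below shows that general technique only; SmallCharIter and sketch_to_uppercase are hypothetical names, the mapping covers just ASCII letters and ß, and nothing here is taken from the crate's actual CaseMapIter.

```rust
/// Hypothetical fixed-capacity iterator holding up to three chars inline.
#[derive(Debug, Clone)]
pub struct SmallCharIter {
    buf: [char; 3],
    len: u8,
    pos: u8,
}

impl SmallCharIter {
    const fn one(a: char) -> Self {
        Self { buf: [a, '\0', '\0'], len: 1, pos: 0 }
    }

    const fn two(a: char, b: char) -> Self {
        Self { buf: [a, b, '\0'], len: 2, pos: 0 }
    }
}

impl Iterator for SmallCharIter {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pos < self.len {
            let c = self.buf[self.pos as usize];
            self.pos += 1;
            Some(c)
        } else {
            None
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = (self.len - self.pos) as usize;
        (remaining, Some(remaining))
    }
}

/// Toy uppercase mapping: ASCII letters and ß only (placeholder data).
pub const fn sketch_to_uppercase(c: char) -> SmallCharIter {
    match c {
        'ß' => SmallCharIter::two('S', 'S'),
        'a'..='z' => SmallCharIter::one(((c as u8) - 32) as char),
        _ => SmallCharIter::one(c),
    }
}
```

Because the buffer lives inside the iterator, a caller in a no_std context can consume the result with a for loop or Iterator::eq without ever touching a heap.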
Regenerating the tables
The committed files under src/unicode/generated/ are produced from the
vendored UCD text files in data/ucd/<version>/ by the codegen tool. It is a
packaging-time tool run only when updating the data or the Unicode version —
the published crate never builds or invokes it, and codegen/ is a standalone
package (not a workspace member and not part of intl).
Output is deterministic and rustfmt-clean, so regeneration with the same data
yields no diff. To update the Unicode version, drop the new UCD files into
data/ucd/<version>/, bump the version in codegen, and re-run.
License
MIT — see LICENSE.