intl
Pure-Rust, #![no_std] internationalization primitives — a long-term, pure-Rust
analog of ICU (collation, number formatting, normalization, transliteration, …).
See ROADMAP.md for the plan toward ICU feature parity.
The foundational layer, available today, is the unicode module: Unicode
rune analysis driven by the official Unicode Character Database (UCD), with
character properties compiled directly into Rust match dispatch by an offline
code generator — so every lookup is a const fn, allocates nothing, and needs
no runtime initialization.
- `no_std`, no alloc: usable in embedded, kernel, and WASM contexts.
- Tables as code: the UCD is converted into a two-level paged `match` ("switch/case") index, not parsed at runtime.
- Feature-selectable ranges: compile only the slice of the codepoint space you need. Anything outside the compiled range resolves to the neutral default (`Unassigned`/`false`), so every lookup is total.
- Targets Unicode 17.0.0.
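The "tables as code" idea can be illustrated with a toy version. The sketch below is not the crate's generated code: the `Category` enum, the page functions, and the few ranges they cover are hypothetical stand-ins. It only shows the shape of a two-level paged `match` behind a `const fn`, assuming the high bits of the code point select a page and the low byte selects an entry within it.

```rust
/// Toy category set (hypothetical); the real tables cover far more.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Category {
    UppercaseLetter,
    LowercaseLetter,
    DecimalNumber,
    Other,      // stand-in for every category this toy does not model
    Unassigned, // neutral default for pages that are not compiled in
}

/// Second level: one function per 256-code-point page (assumed layout).
const fn page_00(low: u8) -> Category {
    match low {
        0x30..=0x39 => Category::DecimalNumber,   // 0-9
        0x41..=0x5A => Category::UppercaseLetter, // A-Z
        0x61..=0x7A => Category::LowercaseLetter, // a-z
        _ => Category::Other,
    }
}

const fn page_06(low: u8) -> Category {
    match low {
        0x60..=0x69 => Category::DecimalNumber, // Arabic-Indic digits
        _ => Category::Other,
    }
}

/// First level: dispatch on the page number (code point >> 8).
pub const fn toy_category(c: char) -> Category {
    let cp = c as u32;
    let low = (cp & 0xFF) as u8;
    match cp >> 8 {
        0x00 => page_00(low),
        0x06 => page_06(low),
        _ => Category::Unassigned, // page not compiled: the lookup stays total
    }
}

/// Because the lookup is a `const fn`, it can run at compile time.
pub const CATEGORY_OF_A: Category = toy_category('A');
```

Nothing here allocates or needs initialization: the "table" is ordinary code, and a code point on a page that was never generated falls through to the default arm.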
Usage
[dependencies]
intl = "0.1"
use ;
assert_eq!;
assert_eq!;
assert!;
assert!; // Arabic-Indic digit three
assert!;
assert!; // a reserved codepoint
Every predicate exists both as a free const fn taking a char
(intl::unicode::is_uppercase('A')) and as a method via the CharExt trait
('A'.is_uppercase()).
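The free-function-plus-trait pattern can be sketched in a few lines. This is a hypothetical reduction, not the crate's source: `CharExtSketch` and the two-range `is_decimal_digit` below are illustrative names and data only.

```rust
/// Toy free function (hypothetical): ASCII and Arabic-Indic digits only.
pub const fn is_decimal_digit(c: char) -> bool {
    matches!(c, '0'..='9' | '\u{0660}'..='\u{0669}')
}

/// Hypothetical extension trait in the style of `CharExt`.
pub trait CharExtSketch {
    fn is_decimal_digit(self) -> bool;
}

impl CharExtSketch for char {
    #[inline]
    fn is_decimal_digit(self) -> bool {
        // A bare call resolves to the free function above, not to this method.
        is_decimal_digit(self)
    }
}
```

With the trait in scope, `'7'.is_decimal_digit()` and `is_decimal_digit('7')` give the same answer. The sketch deliberately uses a method name that std's `char` does not define, because Rust resolves inherent methods before trait methods.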
Normalization and collation (the latter behind the alloc feature):
use ;
assert_eq!;
assert_eq!;
// With the `alloc` feature:
use compare;
use Ordering;
assert_eq!; // é (≈ e) sorts before z
Beyond the unicode module:
- `intl::locale` (`alloc`) parses and canonicalizes BCP-47 language tags (`Locale::parse("zh-hant-hk")` → `"zh-Hant-HK"`).
- `intl::plural` (`no_std`, no alloc) selects the CLDR `PluralCategory` for a number in a language: `plural_category` (cardinal) and `ordinal_category` ("1st"/"2nd"/"3rd"), with rules compiled from CLDR into a `match`. `plural_category("pl", &PluralOperands::from_int(5))` → `Many`. Validated against the CLDR sample data (cardinal + ordinal).
- `intl::number` (`alloc`) formats numbers in a locale's conventions: `format_decimal("de", 1234.5)` → `"1.234,5"`, `format_decimal("hi", 1234567.0)` → `"12,34,567"` (Indian grouping), `format_percent("en", 0.5)` → `"50%"`. Driven by CLDR symbols + patterns for a curated locale set.
- `intl::list` (`alloc`) joins items with locale connectors: `format_list("en", &["a","b","c"], ListStyle::And)` → `"a, b, and c"`.
- `intl::relative` (`alloc`) formats relative times: `format_relative("en", -2, RelativeUnit::Hour)` → `"2 hours ago"`, `format_relative("en", -1, RelativeUnit::Day)` → `"yesterday"` (plural- and number-aware).
These build out the CLDR/locale layer toward full ICU-style formatting.
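To make "rules compiled from CLDR into a `match`" concrete, here is a minimal sketch of the Polish cardinal rule for whole numbers. The names `PluralCategorySketch` and `polish_cardinal` are hypothetical and the function takes a plain `u64` rather than the crate's `PluralOperands`; it only assumes the CLDR rule for integers (one: 1; few: ends in 2 to 4 but not 12 to 14; many: every other integer).

```rust
/// Hypothetical stand-in for the crate's plural category type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PluralCategorySketch {
    One,
    Few,
    Many,
    Other, // CLDR uses this for Polish fractions, which this sketch does not model
}

/// Polish cardinal category for a non-negative integer, as one `match`.
pub const fn polish_cardinal(n: u64) -> PluralCategorySketch {
    match (n, n % 10, n % 100) {
        (1, _, _) => PluralCategorySketch::One,
        // 12, 13, 14 (and 112, 113, ...) end in 2-4 but are "many".
        (_, 2..=4, 12..=14) => PluralCategorySketch::Many,
        (_, 2..=4, _) => PluralCategorySketch::Few,
        // Every remaining integer: 0, 5-21, 25-31, ...
        _ => PluralCategorySketch::Many,
    }
}
```

For the input 5 this yields `Many`, which agrees with the `plural_category("pl", …)` example in the list above.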
Features
default = ["bmp"]. Range tiers are ascii ⊂ latin1 ⊂ bmp ⊂ full (below). The
alloc feature (still no_std) enables the allocating APIs
(unicode::collate, unicode::spoof, unicode::idna, intl::locale, …); it
implies full.
Range tiers
Cargo features select how much of the codepoint space is compiled in, trading coverage for binary size. The tiers are nested (each implies the smaller ones):
| feature | codepoints compiled |
|---|---|
| `ascii` | `U+0000..=U+007F` |
| `latin1` | `U+0000..=U+00FF` |
| `bmp` | `U+0000..=U+FFFF` (default) |
| `full` | `U+0000..=U+10FFFF` |
# Latin-1 only, no default BMP tables:
intl = { version = "0.1", default-features = false, features = ["latin1"] }
# Everything, including supplementary planes:
intl = { version = "0.1", default-features = false, features = ["full"] }
A codepoint outside the compiled tier reports GeneralCategory::Unassigned
(and false for every boolean predicate) — exactly as a genuinely unassigned
codepoint would.
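The out-of-tier behaviour can be modelled in a standalone sketch. In the crate the tier is fixed at compile time by Cargo features; here it is passed as a runtime value so the example runs on its own. `Tier`, `max_code_point`, and `is_uppercase_in` are hypothetical names, and the predicate knows only a few uppercase ranges.

```rust
/// Hypothetical model of the nested range tiers.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Ascii,
    Latin1,
    Bmp,
    Full,
}

impl Tier {
    /// Highest code point compiled in for this tier.
    pub const fn max_code_point(self) -> u32 {
        match self {
            Tier::Ascii => 0x7F,
            Tier::Latin1 => 0xFF,
            Tier::Bmp => 0xFFFF,
            Tier::Full => 0x10FFFF,
        }
    }
}

/// Toy uppercase predicate that respects the tier boundary.
pub const fn is_uppercase_in(tier: Tier, c: char) -> bool {
    if c as u32 > tier.max_code_point() {
        return false; // outside the compiled tier: neutral default
    }
    matches!(
        c,
        'A'..='Z'
            | '\u{00C0}'..='\u{00D6}'   // Latin-1 capitals before the multiplication sign
            | '\u{00D8}'..='\u{00DE}'   // Latin-1 capitals after it
            | '\u{1D400}'..='\u{1D419}' // mathematical bold capitals (supplementary plane)
    )
}
```

The same character can therefore answer `true` under one tier and `false` under a smaller one, but the lookup never fails or panics.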
What the unicode module covers
- `General_Category` (the 29 UAX #44 categories) and their major `Group`s, via `general_category` / `general_category_u32`.
- Boolean predicates: `is_alphabetic`, `is_uppercase`, `is_lowercase`, `is_whitespace` (from the derived Unicode properties), plus the category-derived `is_letter`, `is_mark`, `is_numeric`, `is_decimal_digit`, `is_punctuation`, `is_symbol`, `is_separator`, `is_control`, `is_format`, and `is_assigned`; plus the property predicates `is_math`, `is_dash`, `is_diacritic`, `is_hex_digit`, `is_quotation_mark`, `is_join_control`, and `is_default_ignorable`.
- Segmentation (UAX #29): extended grapheme cluster, word, and sentence boundary iteration via `graphemes(&str)`, `words(&str)`, and `sentences(&str)` (each yielding `&str`, allocation-free). Grapheme breaking handles combining marks, Hangul, Indic conjuncts, regional-indicator flags, and emoji ZWJ sequences; word and sentence breaking implement the full WB / SB rule sets. All three are validated against the official `GraphemeBreakTest` / `WordBreakTest` / `SentenceBreakTest` suites.
- Line breaking (UAX #14): `line_breaks(&str)` yielding break opportunities (mandatory vs allowed). ~99.98% conformant against `LineBreakTest` (a few CJK quotation/East-Asian-Width edge cases remain).
- Collation (UTS #10): DUCET root collation via `collate::compare` / `collate::Collator` (and `sort_key`), with non-ignorable or shifted variable handling. Validated against the full official `CollationTest` suite (both modes). Requires the `alloc` feature.
- Normalization (UAX #15): `nfd`, `nfc`, `nfkd`, `nfkc` as streaming, allocation-free iterator adaptors over `Iterator<Item = char>`; quick-check helpers `is_nfc` / `is_nfd` / `is_nfkc` / `is_nfkd` (and tri-state `quick_check_*` → `IsNormalized`); plus `canonical_combining_class`. Validated against the full official `NormalizationTest.txt` conformance suite.
- Full, unconditional case mapping: per-`char` `to_uppercase`, `to_lowercase`, `to_titlecase`, `case_fold` (each a `CaseMapIter`, 1–3 chars, e.g. `ß` → `SS`), plus whole-stream adaptors `uppercase` / `lowercase` / `fold` over `Iterator<Item = char>` (e.g. `uppercase("Weiß".chars())`; no allocation). `fold` gives caseless comparison.
- `Script` and `Script_Extensions` (UAX #24) via `script` / `script_u32` and `script_extensions` / `script_extensions_u32` (`Script` enum with `.long_name()`; `ScriptExtensions` with `.contains()` / `.iter()`).
- `East_Asian_Width` (UAX #11) via `east_asian_width` / `east_asian_width_u32` (`EastAsianWidth` enum, with `.is_wide()`).
- Bidirectional text (UAX #9): `bidi_class` (the `BidiClass` enum), `base_direction(&str)` (rules P2–P3), and (with `alloc`) the full reordering algorithm `bidi::process(&str, …) -> BidiInfo` (embedding levels + visual order). ~99.996% conformant against `BidiCharacterTest`.
- Identifiers (UAX #31): `is_xid_start`, `is_xid_continue`, and `is_identifier(&str)` for default identifier validation.
- Confusables / spoof detection (UTS #39): `spoof::skeleton`, `spoof::confusable`, and `spoof::is_single_script` (mixed-script detection). Requires `alloc`.
- IDNA / Punycode (UTS #46 / RFC 3492): `idna::to_ascii` / `idna::to_unicode` for domain names (mapping + NFC + Punycode). The mapping/Punycode core passes every clean-success line of IdnaTestV2; the contextual validity rules (CheckBidi/CheckJoiners) are not yet enforced. Requires `alloc`.
- `Numeric_Type` and exact `Numeric_Value` via `numeric_type` and `numeric_value` / `numeric_value_u32` (`NumericValue` is a rational `numerator / denominator`, with `.to_i64()` / `.as_f64()`).
- `UNICODE_VERSION` of the embedded tables.
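The whole-stream case adaptors in the list above follow a common iterator pattern: map each `char` to a short iterator of output chars and flatten. The sketch below shows that pattern with the standard library's `char::to_uppercase` standing in for the crate's own tables; `uppercase_sketch` is a hypothetical name, not the crate's `uppercase`.

```rust
/// Streaming uppercase over any `char` iterator, with no allocation.
/// Uses std's per-char mapping as a stand-in for generated tables.
pub fn uppercase_sketch<I>(chars: I) -> impl Iterator<Item = char>
where
    I: Iterator<Item = char>,
{
    // Each input char expands to one or more output chars (ß becomes S, S),
    // and `flat_map` splices those short iterators into one stream.
    chars.flat_map(char::to_uppercase)
}
```

Collecting is only needed if the caller wants a `String`: `uppercase_sketch("Weiß".chars()).collect::<String>()` gives `"WEISS"`, while comparing or scanning the stream directly stays allocation-free.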
Regenerating the tables
The committed files under src/unicode/generated/ are produced from the
vendored UCD text files in data/ucd/<version>/ by the codegen tool. It is a
packaging-time tool run only when updating the data or the Unicode version —
the published crate never builds or invokes it, and codegen/ is a standalone
package (not a workspace member and not part of intl).
Output is deterministic and rustfmt-clean, so regeneration with the same data
yields no diff. To update the Unicode version, drop the new UCD files into
data/ucd/<version>/, bump the version in codegen, and re-run.
License
MIT — see LICENSE.