intl
Pure-Rust, #![no_std] internationalization primitives — a pure-Rust analog of
ICU (collation, number formatting, normalization, transliteration, …). The core
Unicode algorithms ship with their official conformance suites passing 100%
(Normalization, Collation, Grapheme/Word/Sentence, Line break, Bidi).
The foundational layer, available today, is the unicode module: Unicode
rune analysis driven by the official Unicode Character Database (UCD), with
character properties compiled directly into Rust match dispatch by an offline
code generator — so every lookup is a const fn, allocates nothing, and needs
no runtime initialization.
- Everything by default — the full Unicode range, the formatters, and the character-name database are on out of the box; opt out for size.
- Always
no_std— and fully usable with no allocator (drop the default features and pick a range tier) for embedded, kernel, and WASM contexts. - Tables as code — the UCD is converted into a two-level paged
match("switch/case") index, not parsed at runtime. - Feature-selectable ranges — trim to only the slice of the codepoint space
you need. Anything outside the compiled range resolves to the neutral default
(
Unassigned/false), so every lookup is total. - Targets Unicode 17.0.0.
Usage
[]
= "0.1"
use ;
assert_eq!;
assert_eq!;
assert!;
assert!; // Arabic-Indic digit three
assert!;
assert!; // a reserved codepoint
Every predicate exists both as a free const fn taking a char
(intl::unicode::is_uppercase('A')) and as a method via the CharExt trait
('A'.is_uppercase()).
Normalization and collation (the latter behind the alloc feature):
use ;
assert_eq!;
assert_eq!;
// With the `alloc` feature:
use compare;
use Ordering;
assert_eq!; // é (≈ e) sorts before z
Beyond the unicode module:
-
intl::locale(alloc) parses and canonicalizes BCP-47 language tags (Locale::parse("zh-hant-hk")→"zh-Hant-HK"), and adds/removes likely subtags (Locale::maximize:en→en-Latn-US;Locale::minimize:zh-Hans-CN→zh), and negotiates a best match between a user's requested locales and what's available (negotiate). -
intl::plural(no_std, no alloc) selects the CLDRPluralCategoryfor a number in a language —plural_category(cardinal) andordinal_category("1st"/"2nd"/"3rd"), rules compiled from CLDR into amatch.plural_category("pl", &PluralOperands::from_int(5))→Many. Validated against the CLDR sample data (cardinal + ordinal). -
intl::number(alloc) formats numbers in a locale's conventions —format_decimal("de", 1234.5)→"1.234,5",format_decimal("hi", 1234567.0)→"12,34,567"(Indian grouping),format_percent("en", 0.5)→"50%",format_currency("en", 1234.5, "USD")→"$1,234.50",format_scientific("1.2345E4"),format_compact("1.5K","2.3M"), andparse_decimalback to anf64(parse_decimal("de", "1.234,5")→1234.5), plus native digit systems (to_numbering_system("2024", "arab")→"٢٠٢٤") and ordinals (format_ordinal("en", 21)→"21st"). -
intl::list(alloc) joins items with locale connectors —format_list("en", &["a","b","c"], ListStyle::And)→"a, b, and c". -
intl::relative(alloc) formats relative times —format_relative("en", -2, RelativeUnit::Hour)→"2 hours ago",format_relative("en", -1, RelativeUnit::Day)→"yesterday"(plural- and number-aware). -
intl::display(no_std, no alloc) gives locale display names —language_name("fr", "de")→Some("allemand"),region_name("en", "JP")→Some("Japan"). -
intl::unit(alloc) formats measurement units —format_unit("en", 5.0, Unit::Kilometer, UnitWidth::Long)→"5 kilometers"(plural- and number-aware, long/short widths) — and durations:format_duration("en", 3661, UnitWidth::Long)→"1 hour 1 minute 1 second". -
intl::message(alloc) is a subset of ICU MessageFormat —{arg}substitution,plural/selectordinal(with=Nand#), andselect, composing the plural rules and number formatting. -
intl::datetime(alloc) formats Gregorian dates/times —format_date("en", &dt, DateStyle::Long)→"June 4, 2026",format_date("de", &dt, DateStyle::Long)→"4. Juni 2026"(CLDR patterns, month/weekday names, am/pm; weekday via Sakamoto's algorithm). Alsoformat_skeleton("en", &dt, "yMMMd")→"Jun 4, 2026"(flexible field-set formatting), and renders Islamic (Hijri) and Persian dates with localized month names (format_islamic_date("en", 1445, 9, 1, DateStyle::Long)→"Ramadan 1, 1445 AH";format_persian_datelikewise). -
intl::spelloutspells integers out in words via the CLDR RBNF rules (locale-driven) —spell_cardinal("en", 1234)→"one thousand two hundred thirty-four",spell_cardinal("fr", 80)→"quatre-vingts", and ordinals viaspell_ordinal("en", 21)→"twenty-first". (alloc) -
intl::timezoneparses a POSIXTZstring ("PST8PDT,M3.2.0,M11.1.0/2") and computes the UTC offset / DST state for any date. With theiana-tzfeature it also loads the full IANA tz database (via the embeddedtimezone-datacrate):load_zone("America/New_York")thenoffset_at/abbrev_at/is_dst_at/to_localfor any instant, with historical transitions (theiana-tzfeature, on by default). -
intl::calendar(no_std, no alloc) converts dates between the Gregorian, civil (tabular) Islamic, Persian (Solar Hijri), Hebrew, and Chinese (lunisolar, 1900–2099 via an embedded lunar table) calendars through the Julian Day Number, gives the Japanese era/year, plus ISO-8601 week dates and day-of-week — pure integer arithmetic.DateTimealso does ISO-8601 timestamp parse/format, date arithmetic (add_seconds/add_days/weekday, leap- and carry-aware), andformat_gmt_offsetrenders a localized UTC offset (GMT+05:30,UTC−08:00). -
intl::translit(alloc) transliterates:latin_ascii("café"→"cafe", "Straße"→"Strasse"),remove_diacritics,cyrillic_to_latin(ISO 9),greek_to_latin(ELOT/ISO 843), andany_asciifor best-effort mixed-script ASCII ("Москва café Αθήνα"→"Moskva cafe Athina").
These build out the CLDR/locale layer toward full ICU-style formatting. The
locale data is compiled by the offline codegen into flat binary blobs committed
under src/cldr/ and embedded with include_bytes!, so the table layer is
no_std (no alloc dependency); only the formatting functions need alloc.
Features
Everything on by default, opt out for size. Out of the box you get the
whole Unicode codepoint space (full), every allocating API (alloc), every
Unicode algorithm component, the full character-name database (names),
and the IANA time-zone database (iana-tz) — all no_std. MSRV 1.86.
To shrink the build, set default-features = false and enable only the range
tier + the components you want.
Algorithm components (opt out individually)
Each component gates its module and its (sometimes large) generated table, so disabling one removes it from the build entirely:
| feature | what it provides | gated table |
|---|---|---|
normalization |
UAX #15 NFC/NFD/NFKC/NFKD | 659 KB |
segmentation |
UAX #29 grapheme/word/sentence + UAX #14 line | 429 KB |
bidi |
UAX #9 bidirectional algorithm | 61 KB |
case |
case mapping/folding (→ normalization, segmentation) | 284 KB |
collation |
UTS #10 collation (→ case, alloc) | 1.9 MB |
idna |
UTS #46 IDNA (→ normalization, alloc) | 464 KB |
confusables |
UTS #39 confusable/skeleton (→ normalization, alloc) | 369 KB |
identifiers |
UAX #31 identifiers | — |
names |
full character-name database (→ alloc) | 1.3 MB |
The foundational property lookups — General_Category, predicates, scripts,
East Asian Width, numeric values — are always available and not gated.
# Just normalization, nothing else:
= { = "0.1", = false, = ["full", "normalization"] }
# Just collation (pulls in case + normalization + alloc automatically):
= { = "0.1", = false, = ["collation"] }
# Everything except the 1.9 MB collation table and IANA tz:
= { = "0.1", = false, = [
"names", "segmentation", "bidi", "case", "idna", "confusables", "identifiers",
] }
Range tiers (opt out for size)
The range tiers select how much of the codepoint space is compiled in, trading coverage for binary size. They are nested (each implies the smaller ones):
| feature | codepoints compiled |
|---|---|
ascii |
U+0000..=U+007F |
latin1 |
U+0000..=U+00FF |
bmp |
U+0000..=U+FFFF |
full |
U+0000..=U+10FFFF (default) |
# Everything, the default:
= "0.1"
# Trim to the BMP and drop alloc/names for a smaller no_std build:
= { = "0.1", = false, = ["bmp"] }
# Minimal: ASCII tables only:
= { = "0.1", = false, = ["ascii"] }
# iana-tz is already on by default; drop it (and pick what you need) to avoid the dep:
= { = "0.1", = false, = ["full", "alloc"] }
A codepoint outside the compiled tier reports GeneralCategory::Unassigned
(and false for every boolean predicate) — exactly as a genuinely unassigned
codepoint would.
What the unicode module covers
General_Category(the 29 UAX #44 categories) and their majorGroups, viageneral_category/general_category_u32.- Boolean predicates:
is_alphabetic,is_uppercase,is_lowercase,is_whitespace(from the derived Unicode properties), plus the category-derivedis_letter,is_mark,is_numeric,is_decimal_digit,is_punctuation,is_symbol,is_separator,is_control,is_format, andis_assigned; plus the property predicatesis_math,is_dash,is_diacritic,is_hex_digit,is_quotation_mark,is_join_control, andis_default_ignorable. - Segmentation (UAX #29) — extended grapheme cluster, word, and sentence
boundary iteration via
graphemes(&str),words(&str), andsentences(&str)(each yielding&str, allocation-free). Grapheme breaking handles combining marks, Hangul, Indic conjuncts, regional-indicator flags, and emoji ZWJ sequences; word and sentence breaking implement the full WB / SB rule sets. All three validated against the officialGraphemeBreakTest/WordBreakTest/SentenceBreakTestsuites. - Line breaking (UAX #14) —
line_breaks(&str)yielding break opportunities (mandatory vs allowed). ~99.98% conformant againstLineBreakTest(a few CJK quotation/East-Asian-Width edge cases remain). - Collation (UTS #10) — DUCET root collation via
collate::compare/collate::Collator(andsort_key), with non-ignorable or shifted variable handling, strength levels (with_strength: accent-/case-insensitive), numeric ordering (with_numeric:file2 < file10), and locale tailoring (Tailoring::parse("&z < å < ä < ö")/Tailoring::for_locale("sv")for primary reordering). Validated against the full officialCollationTestsuite (both modes). Requires theallocfeature. - Normalization (UAX #15) —
nfd,nfc,nfkd,nfkcas streaming, allocation-free iterator adaptors overIterator<Item = char>; quick-check helpersis_nfc/is_nfd/is_nfkc/is_nfkd(and tri-statequick_check_*→IsNormalized); pluscanonical_combining_class. Validated against the full officialNormalizationTest.txtconformance suite. - Full, unconditional case mapping — per-
charto_uppercase,to_lowercase,to_titlecase,case_fold(each aCaseMapIter, 1–3 chars, e.g.ß→SS), plus whole-stream adaptorsuppercase/lowercase/foldoverIterator<Item = char>(e.g.uppercase("Weiß".chars()); no allocation).foldgives caseless comparison. ScriptandScript_Extensions(UAX #24) viascript/script_u32andscript_extensions/script_extensions_u32(Scriptenum with.long_name();ScriptExtensionswith.contains()/.iter()).East_Asian_Width(UAX #11) viaeast_asian_width/east_asian_width_u32(EastAsianWidthenum, with.is_wide()).- Bidirectional text (UAX #9) —
bidi_class(theBidiClassenum),base_direction(&str)(rules P2–P3), and (withalloc) the full reordering algorithmbidi::process(&str, …) -> BidiInfo(embedding levels + visual order). ~99.996% conformant againstBidiCharacterTest. - Identifiers (UAX #31) —
is_xid_start,is_xid_continue, andis_identifier(&str)for default identifier validation. - Confusables / spoof detection (UTS #39) —
spoof::skeleton,spoof::confusable, andspoof::is_single_script(mixed-script detection). Requiresalloc. - IDNA / Punycode (UTS #46 / RFC 3492) —
idna::to_ascii/idna::to_unicodefor domain names (mapping + NFC + Punycode). The mapping/Punycode core passes every clean-success line of IdnaTestV2; the contextual validity rules (CheckBidi/CheckJoiners) are not yet enforced. Requiresalloc. Numeric_Typeand exactNumeric_Valuevianumeric_typeandnumeric_value/numeric_value_u32(NumericValueis a rationalnumerator / denominator, with.to_i64()/.as_f64()).UNICODE_VERSIONof the embedded tables.
Regenerating the tables
The committed files under src/unicode/generated/ are produced from the
vendored UCD text files in data/ucd/<version>/ by the codegen tool. It is a
packaging-time tool run only when updating the data or the Unicode version —
the published crate never builds or invokes it, and codegen/ is a standalone
package (not a workspace member and not part of intl).
Output is deterministic and rustfmt-clean, so regeneration with the same data
yields no diff. To update the Unicode version, drop the new UCD files into
data/ucd/<version>/, bump the version in codegen, and re-run.
License
MIT — see LICENSE.