intl

Pure-Rust, #![no_std] internationalization primitives: a long-term effort toward an analog of ICU (collation, number formatting, normalization, transliteration, …).

The foundational layer, available today, is the unicode module. It provides Unicode character analysis driven by the official Unicode Character Database (UCD). An offline code generator compiles the character properties directly into Rust match dispatch, so every lookup is a const fn, allocates nothing, and needs no runtime initialization.

no_std, no alloc — usable in embedded, kernel, and WASM contexts.
Tables as code — the UCD is converted into a two-level paged match ("switch/case") index, not parsed at runtime.
Feature-selectable ranges — compile only the slice of the codepoint space you need. Anything outside the compiled range resolves to the neutral default (Unassigned / false), so every lookup is total.
Targets Unicode 17.0.0.
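
To make the "tables as code" idea concrete, here is a minimal sketch of a two-level paged match index. It is not the generated code itself: the Cat enum, the function names, and the 256-codepoint page size are assumptions chosen for illustration, and the single page shown is deliberately incomplete.

```rust
// Hypothetical miniature of a generated table; all names are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Cat {
    UppercaseLetter,
    LowercaseLetter,
    DecimalNumber,
    SpaceSeparator,
    Unassigned,
}

/// Level 1: dispatch on the high bits of the codepoint (the "page").
pub const fn category(cp: u32) -> Cat {
    match cp >> 8 {
        0x00 => page_00((cp & 0xFF) as u8),
        // Pages that were not generated resolve to the neutral default.
        _ => Cat::Unassigned,
    }
}

/// Level 2: dispatch on the low byte within one page.
const fn page_00(lo: u8) -> Cat {
    match lo {
        0x20 => Cat::SpaceSeparator,
        0x30..=0x39 => Cat::DecimalNumber,
        0x41..=0x5A => Cat::UppercaseLetter,
        0x61..=0x7A => Cat::LowercaseLetter,
        // Incomplete on purpose: a real page would list every assigned codepoint.
        _ => Cat::Unassigned,
    }
}

// Because the lookup is a const fn, it can be evaluated at compile time.
pub const CAT_OF_A: Cat = category('A' as u32);
```

Since both levels are plain match expressions, the compiler is free to lower them to jump tables or comparisons, and no static data has to be initialized or parsed at startup.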

Usage

[dependencies]
intl = "0.1"

use intl::unicode::{general_category, GeneralCategory, CharExt};

assert_eq!(general_category('A'), GeneralCategory::UppercaseLetter);
assert_eq!(general_category('中'), GeneralCategory::OtherLetter);

assert!('A'.is_uppercase());
assert!('٣'.is_numeric());          // Arabic-Indic digit three
assert!(' '.is_whitespace());
assert!(!'\u{0378}'.is_assigned()); // a reserved codepoint

Every predicate exists both as a free const fn taking a char (intl::unicode::is_uppercase('A')) and as a method via the CharExt trait ('A'.is_uppercase()).
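
The free-function-plus-trait pattern can be sketched as follows. The predicate is_ascii_vowel and the trait VowelExt are hypothetical names invented for this example (they are not part of intl); the point is only how a trait method can forward to a const fn.

```rust
/// Hypothetical free function; a stand-in for a generated table lookup.
pub const fn is_ascii_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u' | 'A' | 'E' | 'I' | 'O' | 'U')
}

/// Hypothetical extension trait that forwards to the free function.
pub trait VowelExt {
    fn is_ascii_vowel(self) -> bool;
}

impl VowelExt for char {
    #[inline]
    fn is_ascii_vowel(self) -> bool {
        // A plain path call, so this resolves to the free function above.
        is_ascii_vowel(self)
    }
}

// The free function is usable in const contexts; trait methods on stable Rust are not.
pub const E_IS_VOWEL: bool = is_ascii_vowel('e');
```

One caveat worth keeping in mind: for names that char already defines as inherent methods in the standard library (is_uppercase, is_numeric, is_whitespace, and so on), Rust's method resolution picks the inherent method before any trait method. The free-function form is therefore the unambiguous way to reach a crate's own tables.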

Range tiers

Cargo features select how much of the codepoint space is compiled in, trading coverage for binary size. The tiers are nested (each implies the smaller ones):

feature	codepoints compiled
`ascii`	`U+0000..=U+007F`
`latin1`	`U+0000..=U+00FF`
`bmp`	`U+0000..=U+FFFF` (default)
`full`	`U+0000..=U+10FFFF`

# Latin-1 only, no default BMP tables:
intl = { version = "0.1", default-features = false, features = ["latin1"] }
# Everything, including supplementary planes:
intl = { version = "0.1", default-features = false, features = ["full"] }

A codepoint outside the compiled tier reports GeneralCategory::Unassigned (and false for every boolean predicate) — exactly as a genuinely unassigned codepoint would.
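
The fallback behavior can be pictured with a small sketch. It assumes a constant TIER_MAX standing in for whatever the selected Cargo feature would fix at build time (here the latin1 bound), and a toy Cat enum with only a few Latin-1 entries filled in; neither is the crate's actual mechanism.

```rust
/// Hypothetical upper bound of the compiled tier; here the `latin1` range.
pub const TIER_MAX: u32 = 0x00FF;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Cat {
    UppercaseLetter,
    OtherLetter,
    Unassigned,
}

/// Total lookup: every u32 gets an answer, including values beyond U+10FFFF.
pub const fn category_u32(cp: u32) -> Cat {
    if cp > TIER_MAX {
        // Outside the compiled tier: indistinguishable from a truly unassigned codepoint.
        return Cat::Unassigned;
    }
    match cp {
        0x41..=0x5A | 0xC0..=0xD6 | 0xD8..=0xDE => Cat::UppercaseLetter,
        0xAA | 0xBA => Cat::OtherLetter,
        // Toy table: everything else is left out of this sketch.
        _ => Cat::Unassigned,
    }
}

/// Boolean predicates follow the same rule and report false outside the tier.
pub const fn is_assigned_u32(cp: u32) -> bool {
    !matches!(category_u32(cp), Cat::Unassigned)
}
```

With this tier, '中' (U+4E2D) reports Unassigned even though it is an assigned letter in Unicode, which is the trade-off the smaller tiers make for binary size.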

What the `unicode` module covers

General_Category (the 29 UAX #44 categories) and their major Groups, via general_category / general_category_u32.
Boolean predicates: is_alphabetic, is_uppercase, is_lowercase, is_whitespace (from the derived Unicode properties), plus the category-derived is_letter, is_mark, is_numeric, is_decimal_digit, is_punctuation, is_symbol, is_separator, is_control, is_format, and is_assigned.
Full, unconditional case mapping — to_uppercase, to_lowercase, to_titlecase, and case_fold, each returning a CaseMapIter (1–3 chars, e.g. ß → SS; no allocation).
Script and Script_Extensions (UAX #24) via script / script_u32 and script_extensions / script_extensions_u32 (Script enum with .long_name(); ScriptExtensions with .contains() / .iter()).
East_Asian_Width (UAX #11) via east_asian_width / east_asian_width_u32 (EastAsianWidth enum, with .is_wide()).
Numeric_Type and exact Numeric_Value via numeric_type and numeric_value / numeric_value_u32 (NumericValue is a rational numerator / denominator, with .to_i64() / .as_f64()).
UNICODE_VERSION of the embedded tables.
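
As an illustration of how a case mapping can return up to three chars without allocating, here is a minimal sketch of a fixed-capacity iterator. SmallCaseIter and to_uppercase_demo are hypothetical names, not the crate's CaseMapIter or to_uppercase, and the mapping covers only ASCII and ß.

```rust
/// Hypothetical fixed-capacity iterator; not the crate's actual `CaseMapIter`.
#[derive(Clone, Debug)]
pub struct SmallCaseIter {
    buf: [char; 3],
    len: u8,
    pos: u8,
}

impl SmallCaseIter {
    const fn one(a: char) -> Self {
        Self { buf: [a, '\0', '\0'], len: 1, pos: 0 }
    }
    const fn two(a: char, b: char) -> Self {
        Self { buf: [a, b, '\0'], len: 2, pos: 0 }
    }
}

impl Iterator for SmallCaseIter {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pos < self.len {
            let c = self.buf[self.pos as usize];
            self.pos += 1;
            Some(c)
        } else {
            None
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let n = (self.len - self.pos) as usize;
        (n, Some(n))
    }
}

impl ExactSizeIterator for SmallCaseIter {}

/// Toy uppercase mapping covering ASCII letters and ß only.
pub const fn to_uppercase_demo(c: char) -> SmallCaseIter {
    match c {
        'ß' => SmallCaseIter::two('S', 'S'),
        'a'..='z' => SmallCaseIter::one((c as u8 - 32) as char),
        _ => SmallCaseIter::one(c),
    }
}
```

The whole result lives in a small inline array, so the caller can iterate over it or write it into a caller-owned buffer without a heap.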

Regenerating the tables

The committed files under src/unicode/generated/ are produced from the vendored UCD text files in data/ucd/<version>/ by the codegen tool:

cargo run -p codegen

Output is deterministic and rustfmt-clean, so regeneration with the same data yields no diff. To update the Unicode version, drop the new UCD files into data/ucd/<version>/, bump the version in codegen, and re-run.
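
The determinism property can be sketched with a toy emitter. This is not the codegen tool: emit_match, its input shape, and the emitted Cat type are assumptions for illustration. The idea it shows is that sorting the input and formatting with fixed rules makes the output depend only on the data.

```rust
use std::fmt::Write;

/// Emit a `match`-based lookup from (start, end, variant) ranges.
/// The ranges are sorted first, so the same data always yields the same text.
pub fn emit_match(fn_name: &str, ranges: &[(u32, u32, &str)]) -> String {
    let mut sorted = ranges.to_vec();
    sorted.sort();

    let mut out = String::new();
    // Writing to a String cannot fail, so unwrap is safe here.
    writeln!(out, "pub const fn {fn_name}(cp: u32) -> Cat {{").unwrap();
    writeln!(out, "    match cp {{").unwrap();
    for (start, end, variant) in &sorted {
        if start == end {
            writeln!(out, "        0x{start:04X} => Cat::{variant},").unwrap();
        } else {
            writeln!(out, "        0x{start:04X}..=0x{end:04X} => Cat::{variant},").unwrap();
        }
    }
    writeln!(out, "        _ => Cat::Unassigned,").unwrap();
    writeln!(out, "    }}").unwrap();
    writeln!(out, "}}").unwrap();
    out
}
```

Feeding the same ranges in a different order produces byte-identical output, which is the property that keeps a regeneration against unchanged data from showing up as a diff.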

License

MIT — see LICENSE.