intl 0.1.1

Pure-Rust, no_std internationalization primitives. Provides a `unicode` module (General_Category + character predicates) with property tables compiled into const-fn match lookups and feature-selectable codepoint ranges.

intl

Pure-Rust, #![no_std] internationalization primitives — a long-term, pure-Rust analog of ICU (collation, number formatting, normalization, transliteration, …).

The foundational layer, available today, is the unicode module: Unicode rune analysis driven by the official Unicode Character Database (UCD), with character properties compiled directly into Rust match dispatch by an offline code generator — so every lookup is a const fn, allocates nothing, and needs no runtime initialization.

  • no_std, no alloc — usable in embedded, kernel, and WASM contexts.
  • Tables as code — the UCD is converted into a two-level paged match ("switch/case") index, not parsed at runtime.
  • Feature-selectable ranges — compile only the slice of the codepoint space you need. Anything outside the compiled range resolves to the neutral default (Unassigned / false), so every lookup is total.
  • Targets Unicode 17.0.0.
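To make the "tables as code" idea concrete, here is a minimal sketch of a two-level paged match lookup. It is not the crate's generated code: the `Gc` enum, the page functions, and `general_category_sketch` are hypothetical names, and the pages cover only a handful of ranges. The first level dispatches on the high bits of the codepoint, the second on the low byte, and anything without a page falls through to the neutral default.

```rust
/// Toy subset of General_Category, just enough for the sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Gc {
    UppercaseLetter,
    LowercaseLetter,
    DecimalNumber,
    Unassigned,
}

/// Level 2: one function per 256-codepoint page, matching on the low byte.
/// Toy page: only digits and ASCII letters are listed; a real page would
/// have an arm for every assigned codepoint.
const fn page_00(low: u8) -> Gc {
    match low {
        0x30..=0x39 => Gc::DecimalNumber,   // '0'..='9'
        0x41..=0x5A => Gc::UppercaseLetter, // 'A'..='Z'
        0x61..=0x7A => Gc::LowercaseLetter, // 'a'..='z'
        _ => Gc::Unassigned,
    }
}

/// Toy page for U+0300..=U+03FF: only the basic Greek alphabet is listed.
const fn page_03(low: u8) -> Gc {
    match low {
        0x91..=0xA1 | 0xA3..=0xA9 => Gc::UppercaseLetter, // Alpha..Rho, Sigma..Omega
        0xB1..=0xC9 => Gc::LowercaseLetter,               // alpha..omega
        _ => Gc::Unassigned,
    }
}

/// Level 1: dispatch on the page number (codepoint >> 8).
/// Pages that were not compiled in resolve to the default, so the lookup is total.
pub const fn general_category_sketch(cp: u32) -> Gc {
    match cp >> 8 {
        0x00 => page_00((cp & 0xFF) as u8),
        0x03 => page_03((cp & 0xFF) as u8),
        _ => Gc::Unassigned,
    }
}

/// Because everything is a `const fn`, a lookup can be evaluated at compile time.
pub const CAT_A: Gc = general_category_sketch('A' as u32);
```

Since the whole index is ordinary `match` code, there is no table to load or initialize, and the compiler is free to lower each page to a jump table or comparison chain as it sees fit.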

Usage

[dependencies]
intl = "0.1"

use intl::unicode::{general_category, GeneralCategory, CharExt};

assert_eq!(general_category('A'), GeneralCategory::UppercaseLetter);
assert_eq!(general_category('\u{4E2D}'), GeneralCategory::OtherLetter); // a CJK ideograph

assert!('A'.is_uppercase());
assert!('٣'.is_numeric());          // Arabic-Indic digit three
assert!(' '.is_whitespace());
assert!(!'\u{0378}'.is_assigned()); // a reserved codepoint

Every predicate exists both as a free const fn taking a char (intl::unicode::is_uppercase('A')) and as a method via the CharExt trait ('A'.is_uppercase()).
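The pairing of a free const fn with a trait method can be sketched as follows. This is an illustration of the pattern only, not the crate's source: `is_ascii_vowel` and `CharExtSketch` are hypothetical names chosen so the example stands alone. Trait methods cannot be `const` on stable Rust, which is one plausible reason to offer both forms: the free function works in const contexts, and the trait method is a thin wrapper for method-call syntax.

```rust
/// Free function form: callable in `const` contexts.
/// (Hypothetical predicate, used only to illustrate the pattern.)
pub const fn is_ascii_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u' | 'A' | 'E' | 'I' | 'O' | 'U')
}

/// Extension trait form: adds the predicate as a method on `char`.
pub trait CharExtSketch {
    fn is_ascii_vowel(self) -> bool;
}

impl CharExtSketch for char {
    #[inline]
    fn is_ascii_vowel(self) -> bool {
        // Delegates to the free function, so both forms always agree.
        is_ascii_vowel(self)
    }
}

/// The free function can initialize a constant; the trait method cannot.
pub const E_IS_VOWEL: bool = is_ascii_vowel('e');
```

With the trait in scope, `'e'.is_ascii_vowel()` and `is_ascii_vowel('e')` return the same value.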

Range tiers

Cargo features select how much of the codepoint space is compiled in, trading coverage for binary size. The tiers are nested (each implies the smaller ones):

feature   codepoints compiled
ascii     U+0000..=U+007F
latin1    U+0000..=U+00FF
bmp       U+0000..=U+FFFF (default)
full      U+0000..=U+10FFFF

# Latin-1 only, no default BMP tables:
intl = { version = "0.1", default-features = false, features = ["latin1"] }
# Everything, including supplementary planes:
intl = { version = "0.1", default-features = false, features = ["full"] }

A codepoint outside the compiled tier reports GeneralCategory::Unassigned (and false for every boolean predicate) — exactly as a genuinely unassigned codepoint would.
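The out-of-tier behavior can be modeled in a few lines. In this sketch the tier is a runtime enum rather than a Cargo feature, purely so the example runs in a single file; `Tier`, `table_is_uppercase`, and `is_uppercase_in` are hypothetical names, and the stand-in table lists only a few uppercase ranges.

```rust
/// The four tiers from the table above, modeled as a value instead of a
/// Cargo feature (an assumption made so the sketch is self-contained).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Ascii,
    Latin1,
    Bmp,
    Full,
}

impl Tier {
    /// Highest codepoint compiled in for this tier.
    pub const fn max_codepoint(self) -> u32 {
        match self {
            Tier::Ascii => 0x7F,
            Tier::Latin1 => 0xFF,
            Tier::Bmp => 0xFFFF,
            Tier::Full => 0x10FFFF,
        }
    }
}

/// Hypothetical stand-in for a generated table: a few uppercase ranges only.
const fn table_is_uppercase(cp: u32) -> bool {
    matches!(
        cp,
        0x41..=0x5A            // ASCII capitals
        | 0xC0..=0xD6          // Latin-1 capitals before the multiplication sign
        | 0xD8..=0xDE          // Latin-1 capitals after it
        | 0x1D400..=0x1D419    // mathematical bold capitals (supplementary plane)
    )
}

/// Codepoints above the tier ceiling get the neutral answer (`false`),
/// exactly as an unassigned codepoint would; the table is never consulted.
pub const fn is_uppercase_in(tier: Tier, cp: u32) -> bool {
    if cp > tier.max_codepoint() {
        return false;
    }
    table_is_uppercase(cp)
}
```

Under this model U+00C9 is uppercase with `Tier::Latin1` but reports `false` with `Tier::Ascii`, and U+1D400 is only recognized with `Tier::Full`.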

What the unicode module covers

  • General_Category (the 29 UAX #44 categories) and their major Groups, via general_category / general_category_u32.
  • Boolean predicates: is_alphabetic, is_uppercase, is_lowercase, is_whitespace (from the derived Unicode properties), plus the category-derived is_letter, is_mark, is_numeric, is_decimal_digit, is_punctuation, is_symbol, is_separator, is_control, is_format, and is_assigned.
  • Full, unconditional case mapping via to_uppercase, to_lowercase, to_titlecase, and case_fold, each returning a CaseMapIter (1–3 chars, e.g. ß → SS; no allocation).
  • Script and Script_Extensions (UAX #24) via script / script_u32 and script_extensions / script_extensions_u32 (Script enum with .long_name(); ScriptExtensions with .contains() / .iter()).
  • East_Asian_Width (UAX #11) via east_asian_width / east_asian_width_u32 (EastAsianWidth enum, with .is_wide()).
  • Numeric_Type and exact Numeric_Value via numeric_type and numeric_value / numeric_value_u32 (NumericValue is a rational numerator / denominator, with .to_i64() / .as_f64()).
  • UNICODE_VERSION of the embedded tables.
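An allocation-free iterator over one to three chars, as described for the case mappings above, could look roughly like this. It is a sketch under assumptions: `CaseMapIterSketch` and `to_uppercase_sketch` are hypothetical names, the crate's actual `CaseMapIter` layout is not shown in this README, and only a few mappings are hard-coded here.

```rust
/// Fixed-capacity iterator over 1 to 3 chars; uses only `core`, no heap.
#[derive(Debug, Clone)]
pub struct CaseMapIterSketch {
    chars: [char; 3],
    len: u8,
    pos: u8,
}

impl CaseMapIterSketch {
    /// `len` is the number of meaningful entries in `chars` (1..=3).
    const fn new(chars: [char; 3], len: u8) -> Self {
        Self { chars, len, pos: 0 }
    }
}

impl Iterator for CaseMapIterSketch {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pos < self.len {
            let c = self.chars[self.pos as usize];
            self.pos += 1;
            Some(c)
        } else {
            None
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = (self.len - self.pos) as usize;
        (remaining, Some(remaining))
    }
}

impl ExactSizeIterator for CaseMapIterSketch {}

/// Toy uppercase mapping: three special cases plus ASCII letters.
/// Everything else maps to itself in this sketch.
pub fn to_uppercase_sketch(c: char) -> CaseMapIterSketch {
    match c {
        // U+00DF maps to two chars: "SS"
        'ß' => CaseMapIterSketch::new(['S', 'S', '\0'], 2),
        // U+0149 maps to U+02BC U+004E
        '\u{0149}' => CaseMapIterSketch::new(['\u{02BC}', 'N', '\0'], 2),
        // U+0390 maps to three chars: U+0399 U+0308 U+0301
        '\u{0390}' => CaseMapIterSketch::new(['\u{0399}', '\u{0308}', '\u{0301}'], 3),
        'a'..='z' => CaseMapIterSketch::new([c.to_ascii_uppercase(), '\0', '\0'], 1),
        _ => CaseMapIterSketch::new([c, '\0', '\0'], 1),
    }
}
```

Because the result is a small value type, callers can push the chars into their own buffer or compare them one at a time without a heap.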

Regenerating the tables

The committed files under src/unicode/generated/ are produced from the vendored UCD text files in data/ucd/<version>/ by the codegen tool:

cargo run -p codegen

Output is deterministic and rustfmt-clean, so regeneration with the same data yields no diff. To update the Unicode version, drop the new UCD files into data/ucd/<version>/, bump the version in codegen, and re-run.
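As a rough picture of what such a generator does, the sketch below parses UCD-style range lines (the `0041..005A ; Lu # comment` shape used by the UCD's derived property files) and emits sorted match arms. It is not the crate's codegen: `parse_ucd_line`, `emit_match_arms`, and the emitted `Gc::` prefix are hypothetical, and a real generator would also merge ranges and split them into pages. It runs on the host, so it uses std freely.

```rust
/// Parse one UCD-style line such as "0041..005A ; Lu # comment" or "00AA ; Lo".
/// Returns (first, last, value), or None for blank and comment-only lines.
pub fn parse_ucd_line(line: &str) -> Option<(u32, u32, &str)> {
    // Drop the trailing comment, if any.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let (range, value) = data.split_once(';')?;
    let range = range.trim();
    // A single codepoint is treated as a one-element range.
    let (first, last) = match range.split_once("..") {
        Some((a, b)) => (a, b),
        None => (range, range),
    };
    let first = u32::from_str_radix(first.trim(), 16).ok()?;
    let last = u32::from_str_radix(last.trim(), 16).ok()?;
    Some((first, last, value.trim()))
}

/// Emit one match arm per range. Sorting first makes the output independent
/// of input order, which is one way to keep regeneration diff-free.
pub fn emit_match_arms(ucd: &str) -> String {
    let mut ranges: Vec<(u32, u32, &str)> =
        ucd.lines().filter_map(|l| parse_ucd_line(l)).collect();
    ranges.sort();

    let mut out = String::new();
    for (first, last, value) in ranges {
        if first == last {
            out.push_str(&format!("    0x{first:04X} => Gc::{value},\n"));
        } else {
            out.push_str(&format!("    0x{first:04X}..=0x{last:04X} => Gc::{value},\n"));
        }
    }
    out
}
```

Feeding the same lines in a different order produces byte-identical output, which is the property the "no diff" claim relies on.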

License

MIT — see LICENSE.