oxitext-icu — ICU4X-backed Unicode services for OxiText

oxitext-icu is the Unicode/CLDR services layer of OxiText. It provides CLDR-compliant text boundary analysis (line, word, grapheme-cluster, sentence), locale-aware collation (Unicode Collation Algorithm), case mapping, normalization (NFC/NFD/NFKC/NFKD), script detection and itemization, and locale-aware number/list/plural/date-time formatting. The CLDR line-break and word-boundary results feed directly into the layout engine in oxitext-layout.

This crate wraps the ICU4X family of crates (icu_segmenter, icu_collator, icu_casemap, icu_normalizer, icu_properties, icu_locale_core, icu_decimal, icu_list, icu_plurals, icu_datetime, fixed_decimal). ICU4X is the Unicode Consortium's Rust implementation of ICU-style internationalization components, so this crate is pure Rust end to end, with no dependency on the C/C++ ICU library (libicu). Unicode/CLDR data is compiled in via ICU4X's compiled_data feature, which bakes the tables into the binary at build time.

Binary-size note. Compiled CLDR data adds roughly 5–15 MB to the final binary depending on which modules are exercised (icu_segmenter ≈ 1–3 MB, icu_collator ≈ 2–5 MB, others smaller). For size-sensitive targets, use ICU4X's icu_provider_blob or icu_provider_fs to load data at runtime instead of baking it in.

Installation

[dependencies]
oxitext-icu = "0.1.0"

This crate has no default features; every type is available as soon as the dependency is added.

Quick Start

use oxitext_icu::{IcuSegmenter, SegmentKind};

let seg = IcuSegmenter::new();
let breaks = seg.break_points("Hello world", SegmentKind::Word);
assert!(breaks.len() >= 2);

Rich segmentation with byte offsets

use oxitext_icu::{IcuSegmenter, SegmentKind};

let seg = IcuSegmenter::new();
let segs = seg.segments("Hello world", SegmentKind::Word);
let words: Vec<&str> = segs.iter()
    .filter(|s| !s.text.trim().is_empty())
    .map(|s| s.text.as_str())
    .collect();
assert!(words.contains(&"Hello"));
assert!(words.contains(&"world"));
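
To make the byte-offset convention concrete, here is a minimal std-only sketch of how a list of boundary offsets maps back to slices of the input. It does not call this crate: `slices_between` is a hypothetical helper, the offsets are written by hand, and it assumes the boundary list is sorted, falls on UTF-8 character boundaries, and includes both `0` and `text.len()`.

```rust
/// Hypothetical helper (not part of oxitext-icu): turn boundary byte
/// offsets into the slices that lie between consecutive boundaries.
/// Assumes `offsets` is sorted and every offset is a char boundary.
fn slices_between<'a>(text: &'a str, offsets: &[usize]) -> Vec<&'a str> {
    offsets
        .windows(2) // each adjacent pair (start, end) delimits one segment
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

Because the offsets are bytes rather than characters, "héllo wörld" has its word boundaries at 0, 6, 7 and 13, not at 0, 5, 6 and 11.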

Locale-aware collation

use oxitext_icu::{IcuCollator, CollationStrength};
use std::cmp::Ordering;

// Primary strength: accents and case are ignored.
let c = IcuCollator::with_strength("en", CollationStrength::Primary)?;
assert_eq!(c.compare("Apple", "apple"), Ordering::Equal);

// Swedish: "z" sorts before "ä".
let sv = IcuCollator::new_for_locale("sv")?;
assert_eq!(sv.compare("z", "ä"), Ordering::Less);
# Ok::<(), oxitext_icu::CollateError>(())

API Overview

Segmentation — `segment` module

Item	Kind	Description
`IcuSegmenter`	struct	Multi-kind segmenter backed by compiled CLDR data; construction is allocation-free.
`IcuSegmenter::new()` / `with_locale(locale)` / `new_with_locale(id)`	fn	Construct (locale-aware variants return `Result<_, CollateError>`).
`break_points(text, kind)`	fn	Boundary byte offsets for the given `SegmentKind` → `Vec<usize>`.
`segment_strs(text, kind)`	fn	Borrowed `&str` slices between boundaries.
`segments(text, kind)`	fn	Rich `Vec<Segment>` (text + byte range + kind).
`iter_segments(text, kind)`	fn	Lazy `SegmentIter` over segments.
`word_boundaries(text)`	fn	Convenience UAX #29 word boundaries.
`line_break_opportunities(text)`	fn	UAX #14 line-break offsets (for word wrap).
`cjk_line_break_opportunities(text)`	fn	CJK-aware line-break offsets.
`needs_dictionary_segmentation(text)`	fn (assoc.)	`true` for Thai/CJK and other dictionary-segmented scripts.
`SegmentKind`	enum	`Line`, `Word`, `GraphemeCluster`, `Grapheme` (alias), `Sentence`.
`Segment`	struct	`text`, `byte_start`, `byte_end`, `kind`.
`SegmentIter`	struct	Iterator over `Segment` values.
`cldr_line_breaks(text)` (crate root)	fn	CLDR line-break offsets; pass straight to `oxitext-layout`'s `layout_with_break_points`.
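
As an illustration of what a consumer can do with line-break offsets, the following std-only sketch performs a greedy word wrap over a precomputed offset list. It is not how `oxitext-layout` works internally: `greedy_wrap` is a hypothetical function, width is measured in `char`s rather than shaped glyph advances, and the offsets are assumed to be sorted, to fall on character boundaries, and to end at `text.len()`.

```rust
/// Hypothetical greedy wrapper (not part of oxitext-icu or oxitext-layout).
/// `breaks` holds byte offsets at which a line may end; width is counted
/// in chars, ignoring trailing whitespace on each line.
fn greedy_wrap<'a>(text: &'a str, breaks: &[usize], max_chars: usize) -> Vec<&'a str> {
    let mut lines = Vec::new();
    let mut line_start = 0;
    // Most recent break opportunity seen on the current line.
    let mut prev_break = 0;
    for &b in breaks {
        if b <= line_start {
            continue; // skip a leading 0 or stale offsets
        }
        let width = text[line_start..b].trim_end().chars().count();
        // If extending to `b` overflows, end the line at the previous
        // opportunity (when there is one) and start a new line there.
        if width > max_chars && prev_break > line_start {
            lines.push(text[line_start..prev_break].trim_end());
            line_start = prev_break;
        }
        prev_break = b;
    }
    if line_start < text.len() {
        lines.push(text[line_start..].trim_end());
    }
    lines
}
```

A segment that is wider than `max_chars` on its own is kept whole on its own line instead of being split mid-word.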

Collation — `collate` module

Item	Kind	Description
`IcuCollator`	struct	Locale-aware comparator over compiled CLDR tables (no per-comparison allocation).
`IcuCollator::new(id)` / `new_for_locale(id)`	fn	Construct for a BCP-47 locale.
`IcuCollator::with_strength(id, strength)`	fn	Construct with an explicit `CollationStrength`.
`compare(a, b)`	fn	Locale-aware `Ordering`.
`sort_key(text)`	fn	Binary sort key (`Vec<u8>`) for repeated comparisons.
`compare_sort_keys(a, b)`	fn (assoc.)	Compare two precomputed sort keys.
`CollationStrength`	enum	`Primary`, `Secondary`, `Tertiary` (default), `Quaternary`, `Identical`.
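
Sort keys pay off when the same strings are compared many times, as in a sort: the key is computed once per string and the comparisons are then plain byte comparisons. The std-only sketch below shows that pattern with a stand-in key function. `toy_sort_key` is hypothetical and merely lowercases the input; a real collation key would come from `sort_key`, whose exact signature is not shown in this README.

```rust
/// Stand-in for a collation sort key (hypothetical, not the crate's
/// algorithm): lowercased UTF-8 bytes, so comparison ignores ASCII case.
fn toy_sort_key(s: &str) -> Vec<u8> {
    s.to_lowercase().into_bytes()
}

/// Sort in place, computing each key exactly once and comparing keys
/// bytewise afterwards.
fn sort_with_cached_keys(items: &mut [String]) {
    items.sort_by_cached_key(|s| toy_sort_key(s));
}
```

With a plain `sort()`, "Cherry" would come before "apple" because uppercase ASCII sorts first; with the cached keys the order is apple, banana, Cherry.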

Case mapping — `casemap` module

Item	Kind	Description
`CaseMapper`	struct	Locale-aware case conversion.
`to_uppercase(text, locale_id)` / `to_lowercase(...)` / `to_titlecase(...)`	fn	Locale-sensitive case conversion (e.g. Turkish dotted/dotless `i`).
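
The Turkish case shows why a locale parameter matters: Rust's standard `str::to_uppercase` is locale-independent and maps `i` to `I`, whereas Turkish expects the dotted capital `İ` (U+0130). The sketch below hard-codes just that one tailoring to illustrate the difference. `turkish_upper_sketch` is a hypothetical function, not this crate's implementation, and it ignores every other locale rule.

```rust
/// Hand-written sketch of Turkish uppercasing (hypothetical, illustration
/// only): the two `i` letters are special-cased; everything else defers
/// to std's locale-independent mapping.
fn turkish_upper_sketch(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            'i' => out.push('\u{0130}'), // dotted i -> dotted capital İ
            '\u{0131}' => out.push('I'), // dotless ı -> plain capital I
            other => out.extend(other.to_uppercase()),
        }
    }
    out
}
```

For "istanbul", the standard library yields "ISTANBUL" while the Turkish sketch yields "İSTANBUL".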

Normalization — `normalize` module

Item	Kind	Description
`Normalizer`	struct	Holds all four normalizers; cheap to construct.
`normalize(text, form)`	fn	Normalize to the requested `NormalizationForm`.
`is_normalized(text, form)`	fn	Test whether text is already in a form.
`nfc(text)`	fn	Shortcut for NFC.
`NormalizationForm`	enum	`Nfc`, `Nfd`, `Nfkc`, `Nfkd`.
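
Normalization matters because canonically equivalent strings are not byte-equal. The std-only sketch below spells "é" in two ways: precomposed (the form NFC produces) and decomposed (the form NFD produces). The constants and the `scalar_values` helper are hypothetical names used only for this illustration, and the snippet does not call the crate.

```rust
/// "é" as a single precomposed scalar value (what NFC yields).
const PRECOMPOSED: &str = "\u{00E9}";
/// "é" as base letter plus combining acute accent (what NFD yields).
const DECOMPOSED: &str = "e\u{0301}";

/// Hypothetical helper: list the Unicode scalar values of a string.
fn scalar_values(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

The two constants render identically, yet `PRECOMPOSED == DECOMPOSED` is `false`, so text should be normalized to a common form before it is compared or used as a lookup key.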

Character properties — `properties` module

Item	Kind	Description
`CharProperties`	struct	Property-query engine over compiled UCD data.
`script(c)`	fn	Resolve a character's `TextScript`.
`is_alphabetic(c)` / `is_whitespace(c)` / `is_numeric(c)`	fn	Boolean property queries.
`general_category(c)`	fn	General-category lookup.
`itemize(text)`	fn	Split text into `ScriptRun`s sharing one script.
`dominant_script(text)`	fn	Most frequent script in the string.
`has_rtl(text)`	fn	`true` if any RTL character is present.
`TextScript`	enum	`Latin`, `Greek`, `Cyrillic`, `Arabic`, `Hebrew`, `Han`, `Hiragana`, `Katakana`, `Hangul`, `Thai`, `Devanagari`, `Common`, `Inherited`, `Other`; `is_rtl()`.
`ScriptRun`	struct	`start`, `end` (byte offsets), `script`.
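
To show the shape of script itemization, here is a toy std-only sketch that splits text into runs with byte-offset `start`/`end` fields like those listed for `ScriptRun`. Everything in it is hypothetical: `ToyScript`, `ToyRun`, `toy_script` and `toy_itemize` are illustration names, the classifier knows only ASCII Latin letters and the basic Cyrillic block instead of real UCD data, and attaching `Common` characters to the neighbouring run is an assumption about policy, not a statement of what `itemize` does.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyScript { Latin, Cyrillic, Common }

#[derive(Debug, PartialEq, Eq)]
struct ToyRun { start: usize, end: usize, script: ToyScript }

/// Toy classifier: a tiny range check standing in for UCD script data.
fn toy_script(c: char) -> ToyScript {
    match c {
        'a'..='z' | 'A'..='Z' => ToyScript::Latin,
        '\u{0400}'..='\u{04FF}' => ToyScript::Cyrillic,
        _ => ToyScript::Common, // spaces, digits, punctuation, everything else
    }
}

/// Split `text` into maximal runs that share one script (byte offsets).
fn toy_itemize(text: &str) -> Vec<ToyRun> {
    let mut runs: Vec<ToyRun> = Vec::new();
    for (i, c) in text.char_indices() {
        let end = i + c.len_utf8();
        let script = toy_script(c);
        let extended = match runs.last_mut() {
            // Same script, or a Common char: stay in the current run.
            Some(run) if run.script == script || script == ToyScript::Common => {
                run.end = end;
                true
            }
            // A leading Common run adopts the first real script after it.
            Some(run) if run.script == ToyScript::Common => {
                run.script = script;
                run.end = end;
                true
            }
            _ => false,
        };
        if !extended {
            runs.push(ToyRun { start: i, end, script });
        }
    }
    runs
}
```

For "abc где" this yields a Latin run over bytes 0..4 (the space is absorbed into it) and a Cyrillic run over bytes 4..10, since each Cyrillic letter takes two bytes in UTF-8.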

Formatters — `number`, `list`, `plural`, `datetime` modules

Item	Kind	Description
`IcuNumberFormatter`	struct	Locale-aware numbers: `format_int`, `format_uint`, `format`, `format_with_precision`.
`IcuListFormatter`	struct	Locale-aware lists: `format(&[&str])`, `format_owned(&[String])`.
`ListType`	enum	`And`, `Or`, `Unit`.
`IcuPluralRules`	struct	Plural selection: `category_for`, `ordinal_category_for`, `select`, `categories`.
`PluralCategory`	enum	`Zero`, `One`, `Two`, `Few`, `Many`, `Other`.
`IcuDateTimeFormatter`	struct	Locale-aware date/time: `format_date`, `format_time`, `format_datetime`, `locale_id`.
`DateLength`	enum	`Full`, `Long`, `Medium` (default), `Short`.
`TimeLength`	enum	`Full`, `Long`, `Medium` (default), `Short`, `None`.
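
Plural categories are meant to drive message selection: the caller asks for a number's category and picks a string variant from it. The std-only sketch below hard-codes the CLDR ordinal rule for English to show that flow. `Category`, `english_ordinal_category` and `english_ordinal` are hypothetical names; the snippet does not call `IcuPluralRules`, whose method signatures are not shown in this README.

```rust
/// The categories English ordinals use (a subset of the six CLDR ones).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category { One, Two, Few, Other }

/// CLDR ordinal rule for English, written out by hand:
/// one: n % 10 = 1 and n % 100 != 11, and likewise for two and few.
fn english_ordinal_category(n: u64) -> Category {
    match (n % 10, n % 100) {
        (1, r) if r != 11 => Category::One,
        (2, r) if r != 12 => Category::Two,
        (3, r) if r != 13 => Category::Few,
        _ => Category::Other,
    }
}

/// Select the suffix from the category, as a message catalogue would.
fn english_ordinal(n: u64) -> String {
    let suffix = match english_ordinal_category(n) {
        Category::One => "st",
        Category::Two => "nd",
        Category::Few => "rd",
        Category::Other => "th",
    };
    format!("{n}{suffix}")
}
```

Other languages map numbers to categories differently, which is why the selection step is best left to locale data instead of being written inline like this.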

Error type

Variant	Description
`CollateError::InvalidLocale(String)`	The locale string is not a valid BCP-47 locale.
`CollateError::Icu(String)`	The ICU data provider returned an error (e.g. unknown tailoring).

IcuError is a crate-root type alias for CollateError, shared across the formatter constructors.

Cross-references

oxitext-layout — consumes the CLDR break offsets from cldr_line_breaks / line_break_opportunities for word wrapping.
oxitext — re-exports a curated subset of this crate under oxitext::icu behind the icu feature.
oxitext-core — shared text/style types.
oxitext-shape · oxitext-raster · oxitext-sdf · oxitext-bench — sibling crates in the OxiText pipeline.

License

Apache-2.0 — COOLJAPAN OU (Team Kitasan)

oxitext-icu 0.1.2