§RFC 9839 Unicode Subset Validators
This crate provides fast, zero-allocation checks to validate whether individual characters, strings, or raw byte slices conform to the subsets defined in RFC 9839:
- Unicode Scalars: all code points except the UTF-16 surrogate range. In Rust, all char values are scalars by construction, but functions are included for completeness and defensive validation of &str / byte data.
- XML Characters: the “Char” production from XML: {TAB, LF, CR} ∪ [0x20–0xD7FF] ∪ [0xE000–0xFFFD] ∪ [0x10000–0x10FFFF]. Excludes legacy controls and noncharacters such as U+FFFE/U+FFFF.
- Unicode Assignables: “not problematic” characters: useful controls, printable ASCII (excluding DEL/C1), and all assigned scalars minus the standardized noncharacters (…FFFE/FFFF in each plane and U+FDD0–FDEF).
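The three subsets reduce to a handful of range tests. The following is a minimal sketch of what such character-level checks could look like, assuming the range tables published in RFC 9839; the `sketch_*` names are hypothetical and this is not the crate's actual source.

```rust
// Hypothetical helpers illustrating the RFC 9839 ranges; not the crate's code.

/// Unicode scalar: any code point up to U+10FFFF outside the surrogate
/// range U+D800..=U+DFFF. Takes a raw u32 because every Rust `char`
/// already satisfies this.
pub const fn sketch_is_scalar(cp: u32) -> bool {
    cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF)
}

/// XML "Char" production: TAB, LF, CR, plus three contiguous ranges.
pub const fn sketch_is_xml_char(c: char) -> bool {
    matches!(
        c as u32,
        0x09 | 0x0A | 0x0D | 0x20..=0xD7FF | 0xE000..=0xFFFD | 0x10000..=0x10FFFF
    )
}

/// Unicode Assignable: useful controls, printable ASCII, and everything
/// else except DEL, C1 controls, and the noncharacters.
pub const fn sketch_is_assignable(c: char) -> bool {
    let cp = c as u32;
    match cp {
        0x09 | 0x0A | 0x0D => true, // TAB, LF, CR
        0x20..=0x7E => true,        // printable ASCII (DEL excluded)
        0xA0..=0xD7FF => true,      // skips C1 controls 0x80..=0x9F
        0xE000..=0xFDCF => true,    // up to the U+FDD0..=U+FDEF block
        0xFDF0..=0xFFFD => true,    // rest of the BMP minus U+FFFE/U+FFFF
        // Supplementary planes: drop the last two code points of each plane.
        0x10000..=0x10FFFF => (cp & 0xFFFF) < 0xFFFE,
        _ => false,
    }
}
```

Note that DEL (U+007F) passes the XML check but fails the Assignables check, which is one practical difference between the two subsets.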
§Features
- Character-level APIs (is_*_char) implemented as const fn with simple range tests.
- String-level APIs (is_*) with an ASCII fast-path: scan raw bytes first, and only fall back to chars() after the first non-ASCII byte.
- Byte-level APIs (is_*_bytes) for validating raw UTF-8 input. The tail is decoded once, returning false on invalid UTF-8.
- Zero allocations, no heap lookups, no tables.
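The ASCII fast-path described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the crate's implementation: `xml_char` and `sketch_is_xml_chars` are hypothetical names, and the ranges come from the XML Characters definition earlier on this page.

```rust
/// Hypothetical per-character test for the XML "Char" production.
const fn xml_char(c: char) -> bool {
    matches!(
        c as u32,
        0x09 | 0x0A | 0x0D | 0x20..=0xD7FF | 0xE000..=0xFFFD | 0x10000..=0x10FFFF
    )
}

/// Hypothetical string-level check with an ASCII fast path.
pub fn sketch_is_xml_chars(s: &str) -> bool {
    for (i, &b) in s.as_bytes().iter().enumerate() {
        if b >= 0x80 {
            // First non-ASCII byte. Every earlier byte was ASCII, so `i`
            // is a char boundary and the slice below cannot panic.
            return s[i..].chars().all(xml_char);
        }
        // ASCII byte: only TAB, LF, CR and 0x20..=0x7F are XML Chars.
        if !(b == 0x09 || b == 0x0A || b == 0x0D || b >= 0x20) {
            return false;
        }
    }
    // Pure ASCII input never leaves the byte loop.
    true
}
```

The byte loop touches no decoder state, so pure-ASCII input is checked with one comparison chain per byte; the `chars()` iterator only runs over the remainder once a multi-byte sequence appears.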
§Examples
use rfc9839::*;
// Scalars (always true for safe Rust strings)
assert!(is_unicode_scalar("hello 🌍"));
// XML Characters
assert!(is_xml_chars("ok\tline\n"));
assert!(!is_xml_chars("\u{0000}")); // NUL is disallowed
// Unicode Assignables
assert!(is_unicode_assignable("emoji 👍"));
assert!(!is_unicode_assignable("\u{007F}")); // DEL is excluded
§Performance
All string/byte checks run in O(n). ASCII data is validated in a tight loop;
non-ASCII triggers a one-time chars() traversal or UTF-8 decode. These
functions are designed for high-throughput pipelines, parsers, and
validators.
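For raw bytes, the same single-pass shape applies: validate ASCII directly, then decode whatever remains exactly once. The sketch below assumes the Unicode Assignables ranges from RFC 9839 and uses the standard library's `std::str::from_utf8` for the decode step; the function names are hypothetical and the crate may structure this differently.

```rust
/// Hypothetical per-character test for Unicode Assignables.
const fn assignable(c: char) -> bool {
    let cp = c as u32;
    match cp {
        0x09 | 0x0A | 0x0D | 0x20..=0x7E => true,
        0xA0..=0xD7FF | 0xE000..=0xFDCF | 0xFDF0..=0xFFFD => true,
        // Supplementary planes minus the ...FFFE/...FFFF noncharacters.
        0x10000..=0x10FFFF => (cp & 0xFFFF) < 0xFFFE,
        _ => false,
    }
}

/// Hypothetical byte-level check: O(n) overall, one UTF-8 decode at most.
pub fn sketch_is_assignable_bytes(bytes: &[u8]) -> bool {
    // Phase 1: walk the ASCII prefix without decoding.
    let mut i = 0;
    while i < bytes.len() && bytes[i] < 0x80 {
        let b = bytes[i];
        let ok = b == 0x09 || b == 0x0A || b == 0x0D || (0x20..=0x7E).contains(&b);
        if !ok {
            return false;
        }
        i += 1;
    }
    // Phase 2: decode the remaining tail once; invalid UTF-8 yields false.
    match std::str::from_utf8(&bytes[i..]) {
        Ok(tail) => tail.chars().all(assignable),
        Err(_) => false,
    }
}
```

Each byte is visited a bounded number of times (once in the prefix scan, or once by the UTF-8 validator and once by the `chars()` pass), which keeps the whole check linear in the input length.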
§Functions
- is_unicode_assignable: Returns true if all characters in s are Unicode Assignables.
- is_unicode_assignable_bytes: Returns true if bytes are all valid Unicode Assignables.
- is_unicode_assignable_char: Returns true if c is a Unicode Assignable character per RFC 9839.
- is_unicode_scalar: Returns true if all code points in s are Unicode scalar values.
- is_unicode_scalar_bytes: Returns true if bytes are valid Unicode scalar values.
- is_unicode_scalar_char: Returns true if c is a Unicode scalar value per RFC 9839.
- is_xml_char: Returns true if c is an XML Character as defined in RFC 9839.
- is_xml_chars: Returns true if all characters in s are XML Characters.
- is_xml_chars_bytes: Returns true if bytes are all valid XML Characters.