§RFC 9839 Unicode Subset Validators
This crate provides fast, zero-allocation checks to validate whether individual characters, strings, or raw byte slices conform to the subsets defined in RFC 9839:
- Unicode Scalars: all code points except the UTF-16 surrogate range. In Rust, all `char` values are scalars by construction, but functions are included for completeness and defensive validation of `&str` / byte data.
- XML Characters: the “Char” production from XML, `{TAB, LF, CR} ∪ [0x20–0xD7FF] ∪ [0xE000–0xFFFD] ∪ [0x10000–0x10FFFF]`. Excludes the C0 controls other than TAB, LF, and CR, and the noncharacters U+FFFE/U+FFFF.
- Unicode Assignables: “not problematic” characters, namely the useful controls (TAB, LF, CR), printable ASCII, and all scalars from U+00A0 upward, whether currently assigned or not, minus the standardized noncharacters (…FFFE/FFFF in each plane and U+FDD0–FDEF). DEL and the C1 controls are excluded.
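The three subsets above can be sketched as plain range tests on a `char`. This is a minimal illustration, assuming the ranges quoted in the list and in RFC 9839; the `sketch_` function names are hypothetical and are not this crate's API or implementation.

```rust
// Illustrative range predicates for the three RFC 9839 subsets.
// The `sketch_` names are hypothetical, chosen to avoid implying
// that this is how the crate itself is written.

/// Every Rust `char` is already a Unicode scalar value, so this is always true.
pub const fn sketch_is_scalar(_c: char) -> bool {
    true
}

/// XML "Char" production: TAB, LF, CR, then three contiguous ranges.
pub const fn sketch_is_xml_char(c: char) -> bool {
    matches!(
        c as u32,
        0x09 | 0x0A | 0x0D | 0x20..=0xD7FF | 0xE000..=0xFFFD | 0x1_0000..=0x10_FFFF
    )
}

/// Unicode Assignables: no C0 controls except TAB/LF/CR, no DEL, no C1,
/// no noncharacters (U+FDD0..=U+FDEF and the last two code points of each plane).
pub const fn sketch_is_assignable(c: char) -> bool {
    let cp = c as u32;
    match cp {
        0x09 | 0x0A | 0x0D => true,                // useful controls
        0x20..=0x7E => true,                       // printable ASCII
        0xA0..=0xD7FF => true,                     // skips DEL and C1 (0x7F..=0x9F)
        0xE000..=0xFDCF | 0xFDF0..=0xFFFD => true, // skips U+FDD0..=U+FDEF, U+FFFE/FFFF
        0x1_0000..=0x10_FFFF => (cp & 0xFFFF) < 0xFFFE, // skips ...FFFE/...FFFF per plane
        _ => false,
    }
}
```

Because the predicates are `const fn` built only from integer range patterns, they need no lookup tables and can be evaluated at compile time.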
§Features
- Character-level APIs (`is_*_char`) implemented as `const fn` with simple range tests.
- String-level APIs (`is_*`) with an ASCII fast-path: scan raw bytes first, and only fall back to `chars()` after the first non-ASCII byte.
- Byte-level APIs (`is_*_bytes`) for validating raw UTF-8 input. The tail is decoded once, returning `false` on invalid UTF-8.
- Zero allocations, no heap lookups, no tables.
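The ASCII fast-path described for the string-level APIs might look roughly like the sketch below. It is an assumption about the general technique, not the crate's actual source; `sketch_xml_char` and `sketch_is_xml_chars` are hypothetical names, and the XML Character ranges are the ones quoted on this page.

```rust
/// Hypothetical per-character predicate using the XML "Char" ranges.
const fn sketch_xml_char(c: char) -> bool {
    matches!(
        c as u32,
        0x09 | 0x0A | 0x0D | 0x20..=0xD7FF | 0xE000..=0xFFFD | 0x1_0000..=0x10_FFFF
    )
}

/// String-level check with an ASCII fast path (illustrative only).
pub fn sketch_is_xml_chars(s: &str) -> bool {
    let bytes = s.as_bytes();
    let mut i = 0;
    // Fast path: test ASCII bytes directly, with no UTF-8 decoding.
    while i < bytes.len() && bytes[i] < 0x80 {
        let b = bytes[i];
        // Within ASCII, only TAB, LF, CR and 0x20..=0x7F are XML Characters.
        if !(b == 0x09 || b == 0x0A || b == 0x0D || b >= 0x20) {
            return false;
        }
        i += 1;
    }
    // Slow path: `i` is either the end of the string or the first non-ASCII
    // byte. Both are char boundaries, so slicing cannot panic.
    s[i..].chars().all(sketch_xml_char)
}
```

For pure-ASCII input the `chars()` call sees an empty slice, so the whole check is a single byte loop with no allocation.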
§Examples
use rfc9839::*;
// Scalars (always true for safe Rust strings)
assert!(is_unicode_scalar("hello 🌍"));
// XML Characters
assert!(is_xml_chars("ok\tline\n"));
assert!(!is_xml_chars("\u{0000}")); // NUL is disallowed
// Unicode Assignables
assert!(is_unicode_assignable("emoji 👍"));
assert!(!is_unicode_assignable("\u{007F}")); // DEL is excluded
§Performance
All string/byte checks run in O(n) time in the length of the input. ASCII data is validated in a tight loop;
the first non-ASCII byte triggers a single `chars()` traversal or UTF-8 decode of the remaining input. These
functions are designed for high-throughput pipelines, parsers, and
validators.
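For raw bytes, the same single-pass idea can be sketched with `std::str::from_utf8` applied once to the non-ASCII tail. This is a hedged illustration under the assumption that the Unicode Assignables ranges are as given in RFC 9839; `sketch_assignable_char` and `sketch_is_assignable_bytes` are hypothetical names, not the crate's functions.

```rust
/// Hypothetical per-character predicate for Unicode Assignables.
const fn sketch_assignable_char(c: char) -> bool {
    let cp = c as u32;
    match cp {
        0x09 | 0x0A | 0x0D | 0x20..=0x7E | 0xA0..=0xD7FF => true,
        0xE000..=0xFDCF | 0xFDF0..=0xFFFD => true,
        0x1_0000..=0x10_FFFF => (cp & 0xFFFF) < 0xFFFE,
        _ => false,
    }
}

/// Byte-level check: ASCII loop first, then one UTF-8 decode of the tail.
pub fn sketch_is_assignable_bytes(bytes: &[u8]) -> bool {
    let mut i = 0;
    // Tight loop over the ASCII prefix; no decoding is needed here.
    while i < bytes.len() && bytes[i].is_ascii() {
        if !matches!(bytes[i], 0x09 | 0x0A | 0x0D | 0x20..=0x7E) {
            return false;
        }
        i += 1;
    }
    // Validate the remaining bytes as UTF-8 exactly once. Invalid UTF-8
    // (including encoded surrogates) is rejected by returning false.
    match std::str::from_utf8(&bytes[i..]) {
        Ok(rest) => rest.chars().all(sketch_assignable_char),
        Err(_) => false,
    }
}
```

Each byte is visited a bounded number of times, so the cost stays linear in the input length and no heap allocation is involved.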
Functions§
- `is_unicode_assignable`: Returns `true` if all characters in `s` are Unicode Assignables.
- `is_unicode_assignable_bytes`: Returns `true` if `bytes` are all valid Unicode Assignables.
- `is_unicode_assignable_char`: Returns `true` if `c` is a Unicode Assignable character per RFC 9839.
- `is_unicode_scalar`: Returns `true` if all code points in `s` are Unicode scalar values.
- `is_unicode_scalar_bytes`: Returns `true` if `bytes` are valid Unicode scalar values.
- `is_unicode_scalar_char`: Returns `true` if `c` is a Unicode scalar value per RFC 9839.
- `is_xml_char`: Returns `true` if `c` is an XML Character as defined in RFC 9839.
- `is_xml_chars`: Returns `true` if all characters in `s` are XML Characters.
- `is_xml_chars_bytes`: Returns `true` if `bytes` are all valid XML Characters.