Crate rfc9839


§RFC 9839 Unicode Subset Validators

This crate provides fast, zero-allocation checks for whether individual characters, strings, or raw byte slices conform to the subsets defined in RFC 9839:

  • Unicode Scalars: all code points except the UTF-16 surrogate range (U+D800–U+DFFF). In Rust, all char values are scalar values by construction, but the functions are included for completeness and for defensive validation of &str and byte data.
  • XML Characters: the XML “Char” production, {TAB, LF, CR} ∪ [0x20–0xD7FF] ∪ [0xE000–0xFFFD] ∪ [0x10000–0x10FFFF]. This excludes the C0 controls other than TAB, LF, and CR, as well as U+FFFE and U+FFFF. DEL, the C1 controls, and the noncharacters U+FDD0–FDEF remain allowed.
  • Unicode Assignables: the “not problematic” code points, namely TAB, LF, and CR; printable ASCII (0x20–0x7E); and every scalar value from 0xA0 upward except the standardized noncharacters (…FFFE/FFFF in each plane and U+FDD0–FDEF). DEL and the C1 controls are excluded. Code points that are currently unassigned but may be assigned in the future are included.
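The three subsets reduce to a handful of range tests on the code point. The following is a minimal sketch of what such character-level predicates might look like, written directly from the ranges listed above. The `sketch_*` names are hypothetical and are not the crate's API; the crate's own implementation may be structured differently.

```rust
// Illustrative sketch only: the `sketch_*` functions are hypothetical names,
// not the crate's API. The ranges follow the subset definitions listed above.

/// Unicode Scalars: every `char` qualifies, because Rust's `char` type
/// cannot hold a surrogate code point.
pub const fn sketch_is_scalar(_c: char) -> bool {
    true
}

/// XML Characters: TAB, LF, CR, plus three contiguous ranges.
pub const fn sketch_is_xml_char(c: char) -> bool {
    matches!(
        c as u32,
        0x9 | 0xA | 0xD | 0x20..=0xD7FF | 0xE000..=0xFFFD | 0x10000..=0x10FFFF
    )
}

/// Unicode Assignables: useful controls, printable ASCII, and everything
/// from U+00A0 upward except the noncharacters.
pub const fn sketch_is_assignable(c: char) -> bool {
    let cp = c as u32;
    match cp {
        // TAB, LF, CR.
        0x9 | 0xA | 0xD => true,
        // Printable ASCII; DEL (0x7F) and C1 (0x80..=0x9F) fall through.
        0x20..=0x7E => true,
        0xA0..=0xD7FF => true,
        // Rest of the BMP, skipping U+FDD0..=U+FDEF, U+FFFE, and U+FFFF.
        0xE000..=0xFDCF | 0xFDF0..=0xFFFD => true,
        // Supplementary planes: reject the last two code points of each plane.
        0x10000..=0x10FFFF => (cp & 0xFFFF) < 0xFFFE,
        _ => false,
    }
}
```

Note that the two non-trivial subsets disagree on DEL: `sketch_is_xml_char('\u{7F}')` holds, while `sketch_is_assignable('\u{7F}')` does not.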

§Features

  • Character-level APIs (is_*_char) implemented as const fn with simple range tests.
  • String-level APIs (is_*) with an ASCII fast-path: scan raw bytes first, and only fall back to chars() after the first non-ASCII byte.
  • Byte-level APIs (is_*_bytes) for validating raw UTF-8 input. The ASCII prefix is checked byte by byte, and the remainder is decoded once; the function returns false on invalid UTF-8.
  • Zero allocations, no heap lookups, no tables.
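To illustrate the ASCII fast-path described for the string-level APIs, here is a minimal sketch using the XML Characters subset. The names `xml_char_sketch` and `xml_chars_sketch` are hypothetical, and this is one plausible way to write such a check, not the crate's actual source.

```rust
// Hypothetical sketch of an ASCII fast path; not the crate's source.

/// Character-level test for the XML Characters subset.
const fn xml_char_sketch(c: char) -> bool {
    matches!(
        c as u32,
        0x9 | 0xA | 0xD | 0x20..=0xD7FF | 0xE000..=0xFFFD | 0x10000..=0x10FFFF
    )
}

/// String-level check: scan ASCII bytes directly, then decode the rest.
pub fn xml_chars_sketch(s: &str) -> bool {
    let bytes = s.as_bytes();
    let mut i = 0;
    // Fast path: an ASCII byte is its own code point, so no decoding is needed.
    while i < bytes.len() && bytes[i] < 0x80 {
        if !xml_char_sketch(bytes[i] as char) {
            return false;
        }
        i += 1;
    }
    // Slow path: `i` is at the first non-ASCII byte (a char boundary in a
    // valid &str) or at the end, so slicing here cannot panic.
    s[i..].chars().all(xml_char_sketch)
}
```

Because the slow path borrows a sub-slice of the input and iterates over it, the sketch performs no allocation.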

§Examples

use rfc9839::*;

// Scalars (always true for safe Rust strings)
assert!(is_unicode_scalar("hello 🌍"));

// XML Characters
assert!(is_xml_chars("ok\tline\n"));
assert!(!is_xml_chars("\u{0000}")); // NUL is disallowed

// Unicode Assignables
assert!(is_unicode_assignable("emoji 👍"));
assert!(!is_unicode_assignable("\u{007F}")); // DEL is excluded

§Performance

All string and byte checks run in O(n) time. ASCII data is validated in a tight loop; the first non-ASCII byte triggers a single chars() traversal (for &str input) or a single UTF-8 decode (for byte input) of the remainder. These functions are designed for high-throughput pipelines, parsers, and validators.
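The byte-level path can be sketched in the same spirit: check the ASCII prefix byte by byte, then decode the remainder once and reject invalid UTF-8. The function names below are hypothetical, and the use of `std::str::from_utf8` for the decode step is an assumption about how such a check could be written, not a statement about the crate's internals.

```rust
// Hypothetical sketch of byte-level validation; not the crate's source.

/// Character-level test for the Unicode Assignables subset.
const fn assignable_sketch(c: char) -> bool {
    let cp = c as u32;
    match cp {
        0x9 | 0xA | 0xD | 0x20..=0x7E | 0xA0..=0xD7FF => true,
        0xE000..=0xFDCF | 0xFDF0..=0xFFFD => true,
        // Supplementary planes: reject the last two code points of each plane.
        0x10000..=0x10FFFF => (cp & 0xFFFF) < 0xFFFE,
        _ => false,
    }
}

/// Byte-level check: ASCII prefix first, then one UTF-8 decode of the tail.
pub fn assignable_bytes_sketch(bytes: &[u8]) -> bool {
    // Find where the ASCII prefix ends (or the end of the input).
    let split = bytes
        .iter()
        .position(|b| !b.is_ascii())
        .unwrap_or(bytes.len());
    let (ascii, tail) = bytes.split_at(split);
    // ASCII bytes map directly to code points.
    if !ascii.iter().all(|&b| assignable_sketch(b as char)) {
        return false;
    }
    // Decode the remainder once; invalid UTF-8 fails the check.
    match std::str::from_utf8(tail) {
        Ok(rest) => rest.chars().all(assignable_sketch),
        Err(_) => false,
    }
}
```

Each byte is visited a bounded number of times (prefix scan, UTF-8 validation, character iteration), which keeps the whole check linear in the input length.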

§Functions

is_unicode_assignable
Returns true if all characters in s are Unicode Assignables.
is_unicode_assignable_bytes
Returns true if bytes is valid UTF-8 and every decoded character is a Unicode Assignable.
is_unicode_assignable_char
Returns true if c is a Unicode Assignable character per RFC 9839.
is_unicode_scalar
Returns true if all code points in s are Unicode scalar values.
is_unicode_scalar_bytes
Returns true if bytes is valid UTF-8, so that every decoded code point is a Unicode scalar value.
is_unicode_scalar_char
Returns true if c is a Unicode scalar value per RFC 9839.
is_xml_char
Returns true if c is an XML Character as defined in RFC 9839.
is_xml_chars
Returns true if all characters in s are XML Characters.
is_xml_chars_bytes
Returns true if bytes is valid UTF-8 and every decoded character is an XML Character.