Expand description
Portable, formally verified UTF-8 validation.
This crate provides two entry points for validating that a byte sequence is well-formed UTF-8:
-
verifytakes a plain&[u8]and is fully safe. It uses an 8-byte SWAR (SIMD-within-a-register) ASCII fast path, and handles the final partial chunk with overlapping in-bounds loads so that it never reads past the slice. -
SlackBufis the safe wrapper for zero-copy parsers that maintain at leastSLACKreadable bytes after every logical field (the “eps-copy” pattern used by hyperpb and UPB). The padding invariant is established once at the buffer level — typically viaSlackBuf::new_add_slack(withfeature = "alloc") orSlackBuf::new_embedded_slack— and per-fieldverify/to_str/le_u32calls are then safe and skip the per-string tail-handling cost.verify_with_slackis the underlyingunsafeper-call entry point; preferSlackBuffor new code.
Both share the same multi-byte path: a shift-encoded DFA (the Vognsen/Langdale encoding of Höhrmann’s UTF-8 automaton).
The crate is #![no_std] and zero-dependency by default.
§Features
alloc— addsSlackBuf::new_add_slack, which appends the padding to aVec<u8>itself. Everything else stays no-alloc.simdutf8— delegate inputs ≥ 128 bytes tosimdutf8::basic::from_utf8. Adds one dependency. Below the threshold the verified path runs.- Building with
-C target-cpu=x86-64-v3(ornativeon a Haswell+ machine) enables a 32-byte/iterationmovemaskASCII prefix scan (no runtime dispatch; not covered by the proofs below) and BMI2shrxfor the shift-DFA (~+40% on the multibyte path). verusis verification-only (CI); it does not change runtime behaviour and is not intended to be combined withsimdutf8.
§Verification
Under --features verus (portable 64-bit build, no simdutf8/avx2),
Verus proves functional correctness: verify and verify_with_slack
carry ensures ret == is_valid_utf8(b@), where spec::is_valid_utf8
is a direct transcription of Unicode §3.9 Table 3-7. Every bit-trick in
the SWAR fast path and the multi-byte decoder is connected to that table
by a by(bit_vector) lemma; nothing is assumed. Differential testing
against core::str::from_utf8 (proptest, libfuzzer) remains as a
cross-check on the trusted leaf-load specs.
Memory safety of the raw-pointer loads — the only unsafe on the
verified path — is checked by two tools targeting complementary parts:
-
Verus (SMT-backed; enable with
--features verus) verifies the slice-typed core end-to-end, including the multibyte state machine. The leaf load helpers areexternal_bodywith the specret == pack64(buf@, at)(andpack32/pack16for the sub-word loads) — the standard little-endian load contract. -
RefinedRust (Rocq-backed; build with
--cfg rr) verifies exactly those leaf bodies inraw: thatptr.add(at).cast().read_unaligned()is sound given separating ownership ofnbytes withat + N ≤ n. The slice-typed core is out of its reach (it does not currently model Rust slices).
Each tool’s trusted base is what the other proves. The connecting step —
that &[u8]::as_ptr() yields a pointer valid for len() initialized
bytes — is the standard-library contract for slices. The
simdutf8-feature delegation path, the cfg(avx2) prefix scan, and the
core::str::from_utf8 delegation on 32-bit targets are not covered by
these proofs. from_utf8’s call to from_utf8_unchecked is justified by
the functional-correctness proof of verify, on the assumption that
spec::is_valid_utf8 coincides with Rust’s str invariant — both are
Unicode §3.9, but neither tool checks that equivalence.
This crate is bool-only by design. Callers needing the byte position of a
validation error should use core::str::from_utf8.
Structs§
Constants§
- SLACK
- Number of readable bytes that must follow the logical end of the input
passed to
verify_with_slack.
Functions§
- from_
utf8 - Returns
Some(b as &str)ifbis well-formed UTF-8. - to_str
Deprecated - Renamed to
from_utf8. - verify
- Returns
trueifbis well-formed UTF-8. - verify_
with_ ⚠slack - Returns
trueifbuf[range]is well-formed UTF-8, using the slack-buffer fast path.