smoothutf8
Smooth sailing across input sizes and CPU architectures.
Portable, formally verified UTF-8 validation for Rust — #![no_std], zero-dependency by default, tuned for the short strings typical of serialized protocols. The portable build is mechanically verified for functional correctness against Unicode §3.9 Table 3-7; see Verification for what that does and does not cover.
[]
= "0.2"
use verify;
assert!;
assert!; // overlong NUL
On short ASCII inputs (≤32 bytes — the protobuf-field-value regime) verify_with_slack is 2–4× faster than both core::str::from_utf8 and simdutf8, because it has no runtime CPU dispatch and no tail handling. On long inputs it matches core::str::from_utf8 in the default build, and matches simdutf8 with feature = "simdutf8" enabled.
ASCII input, 250-sample criterion medians, c7i.metal-24xl (Sapphire Rapids; 48 KiB L1d, 2 MiB L2, 45 MiB shared L3; turbo off, core-pinned). stdlib is core::str::from_utf8; simdutf8 is simdutf8::basic::from_utf8. The +simdutf8 build is --features simdutf8 with -C target-cpu=x86-64-v3. Reproduce with cargo bench; raw data is in doc/throughput-data.csv and the plots are regenerated by python3 doc/gen-throughput-svg.py.
The shape of these curves follows from where the work is and what bounds it at each input size:
- 1–128 B (the short-string regime). Here the per-call fixed cost — function-call overhead, simdutf8's runtime CPU dispatch, the safe entry point's tail copy — dominates the per-byte work.
verify_with_slackhas none of those, so both smoothutf8 builds lead by 2–4× over stdlib and simdutf8. At 128 B the+simdutf8build hands off to simdutf8, which is why the red line joins the amber one from there on; the default build keeps running its own SWAR loop. - 128 B – ~32 KiB (L1-resident, compute-bound). The buffer fits in L1d, so throughput is set by instructions per byte. The
+simdutf8build and simdutf8 are the same code here and peak together at ~83 GiB/s; the default build runs the verified 16-byte/iteration scalar loop and plateaus at ~42 GiB/s alongside stdlib (which auto-vectorizes its ASCII fast path to the same width). - ~32 KiB – ~2 MiB (L2-resident). The first cache step-down. All implementations slow together; relative ordering is unchanged.
- ~2 MiB – ~45 MiB (L3-resident). Second step-down to ~22 GiB/s for the SIMD pair. The default build and stdlib are bandwidth-bound here too, so the relative gap narrows.
- Beyond L3 (DRAM-bound). Throughput is set by memory bandwidth, not by the validator. The SIMD pair settles at ~11 GiB/s and the default build at ~8 GiB/s — the wider loads still help because they issue fewer requests per cache line, but the curves converge towards a common floor as size grows.
This is the ASCII picture; the per-shape numbers for non-ASCII input are in the Build configurations table below.
Choosing an entry point
verify(b: &[u8]) -> bool — the safe default. Functionally equivalent to core::str::from_utf8(b).is_ok(). Use this unless you can guarantee readable bytes after your input. to_str(b) -> Option<&str> is the same check returning the string view on success.
SlackBuf<'a> — for zero-copy parsers that maintain at least SLACK (8) readable bytes after every logical field. The classic example is a protobuf decoder validating string fields inside a larger wire buffer: there is always more data after each field's end (the next field's tag, or the decoder's sentinel padding), so the invariant is free to satisfy. Construct the SlackBuf once per buffer; per-field verify / to_str / le_u32 calls are then safe and skip the per-string tail copy, which is the dominant cost on inputs of a few bytes.
use SlackBuf;
// Transport layer finishes reading the frame into a Vec, then:
let buf = new_add_slack; // appends SLACK zero bytes; needs feature = "alloc"
// — or, if you padded yourself / are using BytesMut etc.:
// let buf = SlackBuf::new_embedded_slack(&wire);
// Inside the length-delimited field decoder, per string field:
let field_end = pos + field_len;
let s: &str = buf.to_str.ok_or?;
SlackBuf::verify and verify share the same hot loop; on inputs ≥16 bytes they are equivalent. The slack path exists only to shave the ~5 ns tail cost on inputs shorter than that. unsafe verify_with_slack(buf, range) is the underlying per-call entry point and remains available; prefer SlackBuf for new code.
Build configurations
Throughput vs core::str::from_utf8 at 64 KiB (Sapphire Rapids, same setup as the plots above):
| build | ASCII | ~10% non-ASCII | ~30% non-ASCII | 100% non-ASCII | dependencies |
|---|---|---|---|---|---|
| default | 1.0× | 1.6× | 1.3× | 1.1× | none |
RUSTFLAGS="-C target-cpu=x86-64-v3" |
1.7× | 2.2× | 1.8× | 1.6× | none |
--features simdutf8 (with x86-64-v3) |
2.5× | 5.3× | 9.0× | 7.7× | simdutf8 |
The x86-64-v3 baseline (Haswell+, 2013–) enables both the AVX2 ASCII prefix scan and BMI2 shrx for the shift-DFA; -C target-feature=+avx2 alone gets the prefix scan but not shrx (LLVM treats them as independent features). Use -C target-cpu=native if you don't need portability across machines.
The verified path runs unconditionally for inputs <128 bytes in all configurations; the AVX2 prefix scan is external_body (see Verification).
32-bit targets
On 32-bit targets other than wasm32 (Cortex-M, i686, riscv32, …) the multibyte path delegates to core::str::from_utf8 and the 2 KB DFA table is compiled out entirely. The shift-DFA's (ROW[byte] >> state) & 63 is one instruction on AArch64 and 3–5 on x86-64, but ~10 on i686 (shrd/cmov) and ~13 on Cortex-M4 (emulated 64-bit shift plus two loads), so the standard library's branchy validator is the better choice there. wasm32 is opted in explicitly: it has native i64.shr_u, so the DFA runs at full speed even though pointers are 32-bit. The SWAR ASCII fast path and the slack-mode tail handling are kept on every target, so short-ASCII inputs still benefit. CI checks thumbv7em-none-eabihf, i686-unknown-linux-gnu, and wasm32-unknown-unknown.
Verification
The default build (no simdutf8, no target-cpu override, 64-bit target) is mechanically verified for functional correctness: under --features verus, both verify and verify_with_slack carry the postcondition ret == is_valid_utf8(buf@), where spec::is_valid_utf8 is a line-by-line transcription of Unicode §3.9 Table 3-7 — a reader can audit it against the standard in five minutes. The SWAR sign-bit ASCII test and the shift-DFA's (ROW[byte] >> state) & 63 step are each connected to that table by a by(bit_vector) lemma, and a 256-cell compile-time check pins ROW to spec_row; nothing in spec.rs is assumed or admitted. SlackBuf::verify carries the same postcondition (its body delegates to verify_with_slack). cargo verus verify --features verus reports 78 verified, 0 errors.
Verus is not foundational — it trusts Z3 and its own SMT encoding. The trusted base of the functional-correctness proof is four external_body items: the load64 and load8 leaf loads (ret == pack64(buf@, at) etc.; the standard contract for an unaligned read), SafeTail's stack-copy tail (byte j == buf@[at+j] for j < end−at; six lines of safe code), and the row() table lookup (whose contract ret == spec_row(byte) is checked at compile time against a literal transcription of spec_row for all 256 inputs — the residual trust is that the literal matches the spec fn). Differential testing against core::str::from_utf8 (table-driven cases, proptest, libfuzzer) cross-checks exactly that trusted base.
Memory safety of the raw-pointer loads — the only unsafe on the verified path — is additionally checked by RefinedRust (Rocq/Iris, reproducible — see verify/REFINEDRUST.md), which proves the leaf-load bodies — ptr.add(at).cast().read_unaligned() — sound given separating ownership of the buffer and the at + N ≤ n bound. 2 Qed, 0 failed. This is foundational (machine-checked in Rocq), but rests on an axiomatic spec for core::ptr::read_unaligned (src/raw/shims.rs) since RefinedRust's stdlib has none.
What is not covered by either proof, and is therefore trusted code reviewed and tested in the usual way:
- the
simdutf8-feature delegation path and thecfg(avx2)prefix scan (out of scope by design — third-party / intrinsic code); - the
core::str::from_utf8delegation on 32-bit targets (the standard-library implementation itself); - the bridge step "
&[u8]::as_ptr()yields a pointer valid forlen()initialized bytes" (the standard-library slice contract; neither tool models slices and raw pointers in the same proof); to_str's call tofrom_utf8_unchecked: justified by the functional-correctness proof ofverify, on the assumption thatspec::is_valid_utf8coincides with Rust'sstrinvariant — both are Unicode §3.9, but neither tool checks that equivalence.
The crate is also miri-clean under strict provenance and has 100% line coverage.
Minimum supported Rust version
1.60.
The MSRV is the minimum toolchain the library compiles on; we keep it there rather than tracking stable, to maximize compatibility for downstream crates. We will not, however, work around toolchain bugs in releases more than 12 months behind the current stable — if a problem only reproduces on a Rust older than that, the fix is to upgrade Rust. While the crate is pre-1.0, an MSRV bump is a minor (0.x) release and is noted in the CHANGELOG.
Dev-dependencies (criterion, proptest) may require a newer toolchain to run the test and benchmark suites; that does not affect downstream consumers.
Prior art
The multibyte validator is a shift-encoded DFA: Björn Höhrmann's UTF-8 automaton, with the per-byte transition packed into a single u64 per input byte so the state→state' chain is one shift and one mask (the encoding due to Per Vognsen and Geoff Langdale). The eps-copy slack-buffer pattern that motivates verify_with_slack is the one used by UPB and hyperpb.
License
Apache-2.0.