smoothutf8 0.2.3

Portable, formally verified UTF-8 validation with an optional slack-buffer fast path for zero-copy parsers
Documentation

smoothutf8

crates.io docs.rs CI MSRV deps.rs no_std License

Smooth sailing across input sizes and CPU architectures.

Portable, formally verified UTF-8 validation for Rust — #![no_std], zero-dependency by default, tuned for the short strings typical of serialized protocols. The portable build is mechanically verified for functional correctness against Unicode §3.9 Table 3-7; see Verification for what that does and does not cover.

[dependencies]
smoothutf8 = "0.2"
use smoothutf8::verify;
assert!(verify("hello, 世界! 🌍".as_bytes()));
assert!(!verify(&[0xC0, 0x80])); // overlong NUL

On short ASCII inputs (≤32 bytes — the protobuf-field-value regime) even the fully safe verify is 2–5× faster than core::str::from_utf8 and simdutf8, and verify_with_slack / SlackBuf::verify shave the remaining tail-dispatch branches off that. Since 0.2.2 the safe path never copies to the stack — partial chunks are covered by overlapping in-bounds loads — so the unsafe slack path is an increment, not a multiple. On long inputs the default build matches core::str::from_utf8, and a feature = "simdutf8" build matches simdutf8.

ASCII, ns/call verify (safe) verify_with_slack SlackBuf::verify core::str simdutf8
4 B Sapphire Rapids 1.49 1.40 1.33 7.05 7.16
4 B Graviton4 1.27 0.94 1.05 3.75 3.81
8 B Sapphire Rapids 2.23 1.37 1.31 8.23 7.96
8 B Graviton4 1.71 1.13 1.14 5.29 5.49
32 B Sapphire Rapids 2.99 2.42 2.76 6.91 6.39
32 B Graviton4 1.88 1.50 1.79 3.39 3.42
128 B Sapphire Rapids 6.58 5.91 5.94 10.60 5.26
128 B Graviton4 3.70 3.06 3.40 6.00 4.04

Default build (no target-cpu override, no simdutf8 feature on the smoothutf8 columns), 250-sample criterion medians, dedicated bare-metal. See doc/BENCHMARKS.md for methodology, the Graviton4 multibyte path, the full-size sweep plots, and the per-shape table.

Choosing an entry point

verify(b: &[u8]) -> bool — the safe default. Functionally equivalent to core::str::from_utf8(b).is_ok(). Use this unless you can guarantee readable bytes after your input. from_utf8(b) -> Option<&str> is the same check returning the string view on success.

SlackBuf<'a> — for zero-copy parsers that maintain at least SLACK (8) readable bytes after every logical field. The classic example is a protobuf decoder validating string fields inside a larger wire buffer: there is always more data after each field's end (the next field's tag, or the decoder's sentinel padding), so the invariant is free to satisfy. Construct the SlackBuf once per buffer; per-field verify / to_str / le_u32 calls are then safe and cover any sub-8-byte range with a single masked load instead of the safe path's overlapping load pair.

use smoothutf8::SlackBuf;

// Transport layer finishes reading the frame into a Vec, then:
let buf = SlackBuf::new_add_slack(&mut wire);  // appends SLACK zero bytes; needs feature = "alloc"
// — or, if you padded yourself / are using BytesMut etc.:
// let buf = SlackBuf::new_embedded_slack(&wire);

// Inside the length-delimited field decoder, per string field:
let field_end = pos + field_len;
let s: &str = buf.to_str(pos..field_end).ok_or(DecodeError::InvalidUtf8)?;

The slack path covers a 1–7 byte range with one masked load where verify issues an overlapping load pair. SlackBuf::verify adds a per-call range assert whose cost is at or below measurement noise (0–0.7 ns); unsafe verify_with_slack(buf, range) is the underlying zero-overhead entry point and remains available for callers who hold the range invariant elsewhere.

Build configurations

Throughput vs core::str::from_utf8 at 64 KiB (Sapphire Rapids, same setup as the plots above):

build ASCII ~10% non-ASCII ~30% non-ASCII 100% non-ASCII dependencies
default 1.0× 1.6× 1.3× 1.1× none
RUSTFLAGS="-C target-cpu=x86-64-v3" 1.7× 2.2× 1.8× 1.6× none
--features simdutf8 (with x86-64-v3) 2.5× 5.3× 9.0× 7.7× simdutf8

The x86-64-v3 baseline (Haswell+, 2013–) enables both the AVX2 ASCII prefix scan and BMI2 shrx for the shift-DFA; -C target-feature=+avx2 alone gets the prefix scan but not shrx (LLVM treats them as independent features). Use -C target-cpu=native if you don't need portability across machines.

The verified path runs unconditionally for inputs <128 bytes in all configurations; the AVX2 prefix scan is external_body (see Verification).

32-bit targets

On 32-bit targets other than wasm32 (Cortex-M, i686, riscv32, …) the multibyte path delegates to core::str::from_utf8 and the 2 KB DFA table is compiled out entirely. The shift-DFA's (ROW[byte] >> state) & 63 is one instruction on AArch64 and 3–5 on x86-64, but ~10 on i686 (shrd/cmov) and ~13 on Cortex-M4 (emulated 64-bit shift plus two loads), so the standard library's branchy validator is the better choice there. wasm32 is opted in explicitly: it has native i64.shr_u, so the DFA runs at full speed even though pointers are 32-bit. The SWAR ASCII fast path and the slack-mode tail handling are kept on every target, so short-ASCII inputs still benefit. CI checks thumbv7em-none-eabihf, i686-unknown-linux-gnu, and wasm32-unknown-unknown.

Verification

The default build (no simdutf8, no target-cpu override, 64-bit target) is mechanically verified for functional correctness: under --features verus, both verify and verify_with_slack carry the postcondition ret == is_valid_utf8(buf@), where spec::is_valid_utf8 is a line-by-line transcription of Unicode §3.9 Table 3-7 — a reader can audit it against the standard in five minutes. The SWAR sign-bit ASCII test and the shift-DFA's (ROW[byte] >> state) & 63 step are each connected to that table by a by(bit_vector) lemma, and a 256-cell compile-time check pins ROW to spec_row; nothing in spec.rs is assumed or admitted. SlackBuf::verify carries the same postcondition (its body delegates to verify_with_slack). cargo verus verify --features verus reports 84 verified, 0 errors.

Verus is not foundational — it trusts Z3 and its own SMT encoding. The trusted base of the functional-correctness proof is five external_body items: the load64, load32, load16, and load8 leaf loads (ret == pack64(buf@, at) etc.; the standard contract for an unaligned read — each a one-line body), and the row() table lookup (whose contract ret == spec_row(byte) is checked at compile time against a literal transcription of spec_row for all 256 inputs — the residual trust is that the literal matches the spec fn). Differential testing against core::str::from_utf8 (table-driven cases, proptest, libfuzzer) cross-checks exactly that trusted base.

Memory safety of the raw-pointer loads — the only unsafe on the verified path — is additionally checked by RefinedRust (Rocq/Iris, reproducible — see verify/REFINEDRUST.md), which proves the leaf-load bodiesptr.add(at).cast().read_unaligned() — sound given separating ownership of the buffer and the at + N ≤ n bound. 4 Qed, 0 failed. This is foundational (machine-checked in Rocq), but rests on an axiomatic spec for core::ptr::read_unaligned (src/raw/shims.rs) since RefinedRust's stdlib has none.

What is not covered by either proof, and is therefore trusted code reviewed and tested in the usual way:

  • the simdutf8-feature delegation path and the cfg(avx2) prefix scan (out of scope by design — third-party / intrinsic code);
  • the core::str::from_utf8 delegation on 32-bit targets (the standard-library implementation itself);
  • the bridge step "&[u8]::as_ptr() yields a pointer valid for len() initialized bytes" (the standard-library slice contract; neither tool models slices and raw pointers in the same proof);
  • from_utf8's call to from_utf8_unchecked: justified by the functional-correctness proof of verify, on the assumption that spec::is_valid_utf8 coincides with Rust's str invariant — both are Unicode §3.9, but neither tool checks that equivalence.

The crate is also miri-clean under strict provenance and has 100% line coverage.

Minimum supported Rust version

1.60.

The MSRV is the minimum toolchain the library compiles on; we keep it there rather than tracking stable, to maximize compatibility for downstream crates. We will not, however, work around toolchain bugs in releases more than 12 months behind the current stable — if a problem only reproduces on a Rust older than that, the fix is to upgrade Rust. While the crate is pre-1.0, an MSRV bump is a minor (0.x) release and is noted in the CHANGELOG.

Dev-dependencies (criterion, proptest) may require a newer toolchain to run the test and benchmark suites; that does not affect downstream consumers.

Prior art

The multibyte validator is a shift-encoded DFA: Björn Höhrmann's UTF-8 automaton, with the per-byte transition packed into a single u64 per input byte so the state→state' chain is one shift and one mask (the encoding due to Per Vognsen and Geoff Langdale). The eps-copy slack-buffer pattern that motivates verify_with_slack is the one used by UPB and hyperpb.

License

Apache-2.0.