Skip to main content

Crate smoothutf8

Crate smoothutf8 

Source
Expand description

Portable, formally verified UTF-8 validation.

This crate provides two entry points for validating that a byte sequence is well-formed UTF-8:

  • verify takes a plain &[u8] and is fully safe. It uses an 8-byte SWAR (SIMD-within-a-register) ASCII fast path, and handles the final partial chunk with overlapping in-bounds loads so that it never reads past the slice.

  • SlackBuf is the safe wrapper for zero-copy parsers that maintain at least SLACK readable bytes after every logical field (the “eps-copy” pattern used by hyperpb and UPB). The padding invariant is established once at the buffer level — typically via SlackBuf::new_add_slack (with feature = "alloc") or SlackBuf::new_embedded_slack — and per-field verify / to_str / le_u32 calls are then safe and skip the per-string tail-handling cost.

    verify_with_slack is the underlying unsafe per-call entry point; prefer SlackBuf for new code.

Both share the same multi-byte path: a shift-encoded DFA (the Vognsen/Langdale encoding of Höhrmann’s UTF-8 automaton).

The crate is #![no_std] and zero-dependency by default.

§Features

  • alloc — adds SlackBuf::new_add_slack, which appends the padding to a Vec<u8> itself. Everything else stays no-alloc.
  • simdutf8 — delegate inputs ≥ 128 bytes to simdutf8::basic::from_utf8. Adds one dependency. Below the threshold the verified path runs.
  • Building with -C target-cpu=x86-64-v3 (or native on a Haswell+ machine) enables a 32-byte/iteration movemask ASCII prefix scan (no runtime dispatch; not covered by the proofs below) and BMI2 shrx for the shift-DFA (~+40% on the multibyte path).
  • verus is verification-only (CI); it does not change runtime behaviour and is not intended to be combined with simdutf8.

§Verification

Under --features verus (portable 64-bit build, no simdutf8/avx2), Verus proves functional correctness: verify and verify_with_slack carry ensures ret == is_valid_utf8(b@), where spec::is_valid_utf8 is a direct transcription of Unicode §3.9 Table 3-7. Every bit-trick in the SWAR fast path and the multi-byte decoder is connected to that table by a by(bit_vector) lemma; nothing is assumed. Differential testing against core::str::from_utf8 (proptest, libfuzzer) remains as a cross-check on the trusted leaf-load specs.

Memory safety of the raw-pointer loads — the only unsafe on the verified path — is checked by two tools targeting complementary parts:

  • Verus (SMT-backed; enable with --features verus) verifies the slice-typed core end-to-end, including the multibyte state machine. The leaf load helpers are external_body with the spec ret == pack64(buf@, at) (and pack32/pack16 for the sub-word loads) — the standard little-endian load contract.

  • RefinedRust (Rocq-backed; build with --cfg rr) verifies exactly those leaf bodies in raw: that ptr.add(at).cast().read_unaligned() is sound given separating ownership of n bytes with at + N ≤ n. The slice-typed core is out of its reach (it does not currently model Rust slices).

Each tool’s trusted base is what the other proves. The connecting step — that &[u8]::as_ptr() yields a pointer valid for len() initialized bytes — is the standard-library contract for slices. The simdutf8-feature delegation path, the cfg(avx2) prefix scan, and the core::str::from_utf8 delegation on 32-bit targets are not covered by these proofs. from_utf8’s call to from_utf8_unchecked is justified by the functional-correctness proof of verify, on the assumption that spec::is_valid_utf8 coincides with Rust’s str invariant — both are Unicode §3.9, but neither tool checks that equivalence.

This crate is bool-only by design. Callers needing the byte position of a validation error should use core::str::from_utf8.

Structs§

SlackBuf
A borrowed byte buffer whose final SLACK bytes are padding.

Constants§

SLACK
Number of readable bytes that must follow the logical end of the input passed to verify_with_slack.

Functions§

from_utf8
Returns Some(b as &str) if b is well-formed UTF-8.
to_strDeprecated
Renamed to from_utf8.
verify
Returns true if b is well-formed UTF-8.
verify_with_slack
Returns true if buf[range] is well-formed UTF-8, using the slack-buffer fast path.