utf8-zero

Zero-copy, incremental UTF-8 decoding with error handling.

Unlike std::str::from_utf8(), which requires the entire input up front, this crate is designed for streaming: bytes can arrive in arbitrary chunks (from a network socket, file reader, etc.) and the decoder correctly handles multi-byte code points split across chunk boundaries.
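To see why validating each chunk in isolation is not enough, the following sketch classifies a single chunk using only the standard library. The `Chunk` enum and the `classify` function are hypothetical names invented for this illustration and are not part of this crate's API; they merely mirror the three outcomes (valid, invalid, incomplete) that a streaming decoder has to tell apart.

```rust
use std::str;

/// Outcome of validating one chunk in isolation (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// The whole chunk is valid UTF-8.
    Complete(&'a str),
    /// The chunk ends inside a code point; `pending` must be joined with later bytes.
    Incomplete { valid: &'a str, pending: &'a [u8] },
    /// The chunk contains bytes that can never be part of valid UTF-8.
    Invalid { valid: &'a str, bad: &'a [u8] },
}

fn classify(chunk: &[u8]) -> Chunk<'_> {
    match str::from_utf8(chunk) {
        Ok(s) => Chunk::Complete(s),
        Err(e) => {
            let (valid, rest) = chunk.split_at(e.valid_up_to());
            // Everything before `valid_up_to()` is guaranteed valid, so this cannot fail.
            let valid = str::from_utf8(valid).unwrap();
            match e.error_len() {
                // `None` means the input ended in the middle of a code point.
                None => Chunk::Incomplete { valid, pending: rest },
                // `Some(n)` means the next `n` bytes are an invalid sequence.
                Some(n) => Chunk::Invalid { valid, bad: &rest[..n] },
            }
        }
    }
}

fn main() {
    let bytes = "héllo".as_bytes(); // 'é' is encoded as 0xC3 0xA9
    let (first, second) = bytes.split_at(2); // cut between the two bytes of 'é'
    println!("{:?}", classify(first));
    println!("{:?}", classify(second));
    println!("{:?}", classify(b"Hello\xC0World"));
}
```

Validated on its own, the first chunk is reported as incomplete and the second as invalid, even though the two chunks together are valid UTF-8. A streaming decoder avoids this by carrying the pending bytes over to the next chunk.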

The crate provides three levels of API:

utf8::decode() — low-level, single-shot decode of a byte slice. Returns the input as a &str if it is entirely valid; otherwise returns the valid prefix together with either an invalid sequence or an incomplete suffix that can be completed with more input.
LossyDecoder — a push-based streaming decoder. Feed it chunks of bytes and it calls back with &str slices, replacing errors with U+FFFD.
BufReadDecoder — a pull-based streaming decoder wrapping any BufRead, with both strict and lossy modes.
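The push-based model can be illustrated with the standard library alone. This is a minimal sketch and not the crate's LossyDecoder: `SketchLossy`, `feed`, and `finish` are names invented here, and the sketch copies pending bytes into a `Vec`, which a zero-copy decoder would avoid.

```rust
use std::str;

/// Hypothetical push-based lossy decoder, written against std only.
struct SketchLossy<F: FnMut(&str)> {
    push: F,
    /// Bytes of a code point that was cut off by a chunk boundary.
    carry: Vec<u8>,
}

impl<F: FnMut(&str)> SketchLossy<F> {
    fn new(push: F) -> Self {
        Self { push, carry: Vec::new() }
    }

    /// Decode one chunk, calling back with valid text and U+FFFD for errors.
    fn feed(&mut self, chunk: &[u8]) {
        // Prepend whatever was left over from the previous chunk (this copies).
        let mut buf = std::mem::take(&mut self.carry);
        buf.extend_from_slice(chunk);
        let mut rest = &buf[..];
        loop {
            match str::from_utf8(rest) {
                Ok(s) => {
                    if !s.is_empty() {
                        (self.push)(s);
                    }
                    break;
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    if !valid.is_empty() {
                        (self.push)(str::from_utf8(valid).unwrap());
                    }
                    match e.error_len() {
                        // Invalid sequence: emit one replacement character and skip it.
                        Some(n) => {
                            (self.push)("\u{FFFD}");
                            rest = &after[n..];
                        }
                        // Incomplete code point: keep it until more input arrives.
                        None => {
                            self.carry = after.to_vec();
                            break;
                        }
                    }
                }
            }
        }
    }

    /// Signal end of input; a dangling partial code point becomes U+FFFD.
    fn finish(mut self) {
        if !self.carry.is_empty() {
            (self.push)("\u{FFFD}");
        }
    }
}

fn main() {
    let mut out = String::new();
    let mut decoder = SketchLossy::new(|s: &str| out.push_str(s));
    decoder.feed(b"h\xC3"); // ends in the middle of 'é'
    decoder.feed(b"\xA9llo\xC0"); // completes 'é', then an invalid byte
    decoder.finish();
    println!("{out}");
}
```

The caller never sees the chunk boundaries: text arrives through the callback in order, and a code point split across two `feed` calls is delivered whole once its last byte arrives.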

Example

use utf8::{decode, DecodeError};

let bytes = b"Hello\xC0World";
match decode(bytes) {
    Ok(s) => println!("valid: {s}"),
    Err(DecodeError::Invalid { valid_prefix, invalid_sequence, remaining_input }) => {
        // valid_prefix = "Hello", invalid_sequence = [0xC0], remaining_input = b"World"
        println!("got {valid_prefix:?} before invalid bytes {invalid_sequence:?}");
        println!("{} bytes left to decode", remaining_input.len());
    }
    Err(DecodeError::Incomplete { valid_prefix, incomplete_suffix: _ }) => {
        // Input ended mid-code-point: bind incomplete_suffix and call its
        // try_complete() with the next chunk to finish decoding.
        println!("need more input after {valid_prefix:?}");
    }
}

History

Originally written by Simon Sapin as SimonSapin/rust-utf8, published as the utf-8 crate.
The upstream repo was archived and is no longer maintained.
Used by ureq among others. Simon Sapin suggested inlining the code into crates that need it rather than republishing.
Forked here as a standalone repo (not a GitHub fork) to allow continued maintenance.
Added fuzz testing.
Modernized code: set Rust edition to 2021, ran cargo fmt, fixed lifetime syntax and clippy warnings.
Added GitHub Actions CI (lint, clippy, tests, Miri on every push/PR; nightly fuzzing).
Removed defunct bench setup (missing shared modules from upstream).
Added #![deny(missing_docs)] and documented all public items.
Added no_std support for everything except BufReadDecoder, which depends on std::io::BufRead.

Fuzzing

Fuzz tests use cargo-fuzz (libFuzzer). Three targets cover the main API surface:

fuzz_decode — utf8::decode(), validated against std::str::from_utf8()
fuzz_lossy_decoder — LossyDecoder with random chunk splits, validated against String::from_utf8_lossy()
fuzz_bufread_decoder — BufReadDecoder::read_to_string_lossy(), validated against String::from_utf8_lossy()
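The idea behind validating a chunked decoder against a whole-buffer oracle can be sketched with std only. The fuzz targets' source is not shown here, so this is an assumed shape rather than the actual harness: `chunks_from`, `agrees_with_oracle`, and `naive` are hypothetical names, and `naive` is a deliberately buggy stand-in for a decoder under test.

```rust
/// Split `data` into chunks whose lengths are driven by `splits`
/// (in a fuzz target these bytes would come from the fuzzer input).
fn chunks_from<'a>(data: &'a [u8], splits: &[u8]) -> Vec<&'a [u8]> {
    let mut out = Vec::new();
    let mut rest = data;
    for &len in splits {
        if rest.is_empty() {
            break;
        }
        let n = (len as usize).min(rest.len());
        let (head, tail) = rest.split_at(n);
        out.push(head); // may be empty when len == 0; empty chunks are legal input
        rest = tail;
    }
    if !rest.is_empty() {
        out.push(rest);
    }
    out
}

/// Compare a chunked decoder against `String::from_utf8_lossy` on the whole input.
fn agrees_with_oracle(
    data: &[u8],
    splits: &[u8],
    decode: impl Fn(&[&[u8]]) -> String,
) -> bool {
    let chunks = chunks_from(data, splits);
    decode(chunks.as_slice()) == String::from_utf8_lossy(data)
}

/// Buggy stand-in decoder: treats every chunk as independent input.
fn naive(chunks: &[&[u8]]) -> String {
    chunks.iter().map(|c| String::from_utf8_lossy(c)).collect()
}

fn main() {
    let data = "é".as_bytes(); // two bytes: 0xC3 0xA9
    // A split after the first byte exposes the bug in `naive`.
    println!("split inside code point: {}", agrees_with_oracle(data, &[1], naive));
    println!("no split: {}", agrees_with_oracle(data, &[2], naive));
}
```

The naive decoder passes when the input arrives as one chunk and fails as soon as a chunk boundary falls inside a code point, which is why the split positions need to vary along with the payload.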

To run locally:

cargo install cargo-fuzz
cargo +nightly fuzz run fuzz_decode
cargo +nightly fuzz run fuzz_lossy_decoder
cargo +nightly fuzz run fuzz_bufread_decoder

A GitHub Actions workflow runs all targets nightly.

Miri

Miri runs on every push/PR to validate the unsafe code (three str::from_utf8_unchecked() calls). The test suite feeds each input through every possible partitioning into chunks, and the number of partitionings grows exponentially with input length, so inputs longer than 10 bytes are skipped under Miri to keep CI fast.

cargo +nightly miri test
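The exponential cost is easy to see in a small sketch. It assumes that "exhaustive input partitioning" means every way of cutting an input into contiguous, non-empty chunks; `all_partitions` and `should_skip` are hypothetical helpers, not functions from this crate's test suite. The `cfg!(miri)` check relies on the `miri` cfg flag that Miri sets when it runs code.

```rust
/// Enumerate every way to cut `input` into contiguous, non-empty chunks.
/// Each of the `n - 1` gaps between bytes is either a cut or not,
/// so there are 2^(n-1) partitions. Intended for short inputs only.
fn all_partitions(input: &[u8]) -> Vec<Vec<&[u8]>> {
    if input.is_empty() {
        return vec![Vec::new()];
    }
    let gaps = input.len() - 1;
    let mut result = Vec::with_capacity(1 << gaps);
    for mask in 0u32..(1u32 << gaps) {
        let mut parts = Vec::new();
        let mut start = 0;
        for gap in 0..gaps {
            // Bit `gap` set means: cut between byte `gap` and byte `gap + 1`.
            if mask & (1 << gap) != 0 {
                parts.push(&input[start..=gap]);
                start = gap + 1;
            }
        }
        parts.push(&input[start..]);
        result.push(parts);
    }
    result
}

/// Hypothetical guard mirroring the rule described above.
fn should_skip(input: &[u8]) -> bool {
    cfg!(miri) && input.len() > 10
}

fn main() {
    for n in [1usize, 4, 10] {
        let input = vec![b'a'; n];
        if should_skip(&input) {
            continue;
        }
        println!("{n} bytes: {} partitions", all_partitions(&input).len());
    }
}
```

Each extra input byte doubles the number of partitions, and Miri interprets code far more slowly than native execution, so a length cap keeps the run time bounded.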

License