utf8-zero 0.8.1

utf8-zero

Zero-copy, incremental UTF-8 decoding with error handling.

Unlike std::str::from_utf8(), which requires the entire input up front, this crate is designed for streaming: bytes can arrive in arbitrary chunks (from a network socket, file reader, etc.) and the decoder correctly handles multi-byte code points split across chunk boundaries.
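To make the chunk-boundary problem concrete, here is a minimal std-only sketch. It does not use this crate and is not how the crate is implemented: `valid_prefix_len` and `decode_chunks` are hypothetical helpers, the input is assumed to be valid UTF-8 overall, and the sketch copies bytes into a buffer, which this crate is designed to avoid.

```rust
/// Length of the longest valid UTF-8 prefix of `buf`.
fn valid_prefix_len(buf: &[u8]) -> usize {
    match std::str::from_utf8(buf) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    }
}

/// Hypothetical helper: decodes chunks in arrival order and carries the
/// undecoded tail of one chunk over to the next.
/// Assumes the stream as a whole is valid UTF-8. Invalid bytes would stay
/// in `pending` forever, so a real decoder must handle them separately.
fn decode_chunks(chunks: &[&[u8]]) -> String {
    let mut out = String::new();
    let mut pending: Vec<u8> = Vec::new();
    for chunk in chunks {
        pending.extend_from_slice(chunk);
        let n = valid_prefix_len(&pending);
        // The prefix was validated just above, so this cannot fail.
        out.push_str(std::str::from_utf8(&pending[..n]).expect("validated prefix"));
        pending.drain(..n);
    }
    out
}
```

With "é" (bytes 0xC3 0xA9) split across two chunks, `std::str::from_utf8(b"caf\xC3")` fails on the first chunk alone, while `decode_chunks(&[&b"caf\xC3"[..], &b"\xA9!"[..]])` yields "café!".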

The crate provides three levels of API:

  • utf8::decode() — low-level, single-shot decode of a byte slice. Returns the valid prefix and either an invalid sequence or an incomplete suffix that can be completed with more input.
  • LossyDecoder — a push-based streaming decoder. Feed it chunks of bytes and it calls back with &str slices, replacing errors with U+FFFD.
  • BufReadDecoder — a pull-based streaming decoder wrapping any BufRead, with both strict and lossy modes.
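The push-based, lossy style described above can be sketched with std only. `LossySketch` is a hypothetical stand-in written for illustration, not the crate's `LossyDecoder`. It buffers bytes, so it is not zero-copy, and it relies on `Utf8Error::error_len()` to tell an invalid sequence from one that is merely incomplete.

```rust
/// Hypothetical std-only stand-in for a push-based lossy decoder.
struct LossySketch<F: FnMut(&str)> {
    push_str: F,
    /// Bytes of a trailing sequence that may be completed by the next chunk.
    pending: Vec<u8>,
}

impl<F: FnMut(&str)> LossySketch<F> {
    fn new(push_str: F) -> Self {
        LossySketch { push_str, pending: Vec::new() }
    }

    /// Feeds one chunk and calls back with every piece that can be decoded.
    fn feed(&mut self, chunk: &[u8]) {
        self.pending.extend_from_slice(chunk);
        let mut start = 0;
        loop {
            match std::str::from_utf8(&self.pending[start..]) {
                Ok(s) => {
                    if !s.is_empty() {
                        (self.push_str)(s);
                    }
                    start = self.pending.len();
                    break;
                }
                Err(e) => {
                    let valid_end = start + e.valid_up_to();
                    let valid = std::str::from_utf8(&self.pending[start..valid_end])
                        .expect("validated prefix");
                    if !valid.is_empty() {
                        (self.push_str)(valid);
                    }
                    match e.error_len() {
                        // Invalid sequence: replace it and keep decoding.
                        Some(n) => {
                            (self.push_str)("\u{FFFD}");
                            start = valid_end + n;
                        }
                        // Input ended mid-code-point: wait for more bytes.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
    }

    /// Signals end of input; a dangling partial sequence becomes U+FFFD.
    fn finish(mut self) {
        if !self.pending.is_empty() {
            (self.push_str)("\u{FFFD}");
        }
    }
}
```

Feeding `b"caf\xC3"` and then `b"\xA9 \xFF"` into a `LossySketch` whose callback appends to a `String` produces "café" followed by a space and one U+FFFD.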

Example

use utf8::{decode, DecodeError};

let bytes = b"Hello\xC0World";
match decode(bytes) {
    Ok(s) => println!("valid: {s}"),
    Err(DecodeError::Invalid { valid_prefix, invalid_sequence, remaining_input }) => {
        // valid_prefix = "Hello", invalid_sequence = [0xC0], remaining_input = b"World"
        println!("got {:?} before error", valid_prefix);
    }
    Err(DecodeError::Incomplete { valid_prefix, incomplete_suffix }) => {
        // Input ended mid-codepoint — feed more bytes via incomplete_suffix.try_complete()
        println!("need more input after {:?}", valid_prefix);
    }
}
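
For comparison, the same three outcomes can be recovered from the standard library's `Utf8Error`. This is a sketch, not the crate's implementation: `Outcome` and `classify` are hypothetical names chosen to echo the fields in the example above.

```rust
/// Hypothetical three-way result, loosely mirroring the cases shown above.
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    Valid(&'a str),
    Invalid {
        valid_prefix: &'a str,
        invalid_sequence: &'a [u8],
        remaining_input: &'a [u8],
    },
    Incomplete {
        valid_prefix: &'a str,
        incomplete_suffix: &'a [u8],
    },
}

/// Classifies `input` using only `std::str::from_utf8` and `Utf8Error`.
fn classify(input: &[u8]) -> Outcome<'_> {
    match std::str::from_utf8(input) {
        Ok(s) => Outcome::Valid(s),
        Err(e) => {
            let (valid, rest) = input.split_at(e.valid_up_to());
            let valid_prefix = std::str::from_utf8(valid).expect("validated prefix");
            match e.error_len() {
                // `Some(n)`: the next n bytes can never be valid UTF-8.
                Some(n) => {
                    let (invalid_sequence, remaining_input) = rest.split_at(n);
                    Outcome::Invalid { valid_prefix, invalid_sequence, remaining_input }
                }
                // `None`: the input ended in the middle of a code point.
                None => Outcome::Incomplete { valid_prefix, incomplete_suffix: rest },
            }
        }
    }
}
```

Here `classify(b"Hello\xC0World")` returns `Invalid` with prefix "Hello", invalid sequence `[0xC0]` and remaining input `b"World"`, while `classify(b"caf\xC3")` returns `Incomplete` with prefix "caf" and suffix `[0xC3]`.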

History

  • Originally written by Simon Sapin as SimonSapin/rust-utf8, published as the utf-8 crate.
  • The upstream repo was archived and is no longer maintained.
  • Used by ureq among others. Simon Sapin suggested inlining the code into crates that need it rather than republishing.
  • Forked here as a standalone repo (not a GitHub fork) to allow continued maintenance.
  • Added fuzz testing.
  • Modernized code: set Rust edition to 2021, ran cargo fmt, fixed lifetime syntax and clippy warnings.
  • Added GitHub Actions CI (lint, clippy, tests, Miri on every push/PR; nightly fuzzing).
  • Removed defunct bench setup (missing shared modules from upstream).
  • Added #![deny(missing_docs)] and documented all public items.
  • Added no_std support for all but BufReadDecoder.

Fuzzing

Fuzz tests use cargo-fuzz (libFuzzer). Three targets cover the main API surface:

  • fuzz_decode: utf8::decode(), validated against std::str::from_utf8()
  • fuzz_lossy_decoder: LossyDecoder with random chunk splits, validated against String::from_utf8_lossy()
  • fuzz_bufread_decoder: BufReadDecoder::read_to_string_lossy(), validated against String::from_utf8_lossy()

To run locally:

cargo install cargo-fuzz
cargo +nightly fuzz run fuzz_decode
cargo +nightly fuzz run fuzz_lossy_decoder
cargo +nightly fuzz run fuzz_bufread_decoder

A GitHub Actions workflow runs all targets nightly.

Miri

Miri runs on every push/PR to validate the unsafe code (three str::from_utf8_unchecked() calls). The test suite uses exhaustive input partitioning, whose cost grows exponentially with input length, so inputs longer than 10 bytes are skipped under Miri to keep CI fast.

cargo +nightly miri test
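
To show why exhaustive partitioning is exponential, here is a rough sketch of enumerating every way to split an input into consecutive non-empty chunks. `all_partitions` is a hypothetical helper, not code from the test suite, and the suite's actual partitioning scheme may differ. An input of n bytes has n - 1 possible cut points, hence 2^(n-1) partitions.

```rust
/// Hypothetical helper: returns every split of `input` into consecutive,
/// non-empty chunks. Bit i of `mask` set means "cut after byte i".
fn all_partitions(input: &[u8]) -> Vec<Vec<&[u8]>> {
    if input.is_empty() {
        return vec![vec![]];
    }
    let cuts = input.len() - 1;
    assert!(cuts < 32, "input too long to enumerate with a u32 mask");
    let mut result = Vec::new();
    for mask in 0u32..(1u32 << cuts) {
        let mut parts = Vec::new();
        let mut start = 0;
        for i in 0..cuts {
            if mask & (1 << i) != 0 {
                parts.push(&input[start..=i]);
                start = i + 1;
            }
        }
        // The final chunk runs from the last cut to the end of the input.
        parts.push(&input[start..]);
        result.push(parts);
    }
    result
}
```

A 3-byte input yields 4 partitions and a 10-byte input yields 512, with the count doubling for each additional byte.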

License

MIT OR Apache-2.0