utf8-zero
Zero-copy, incremental UTF-8 decoding with error handling.
Unlike std::str::from_utf8(), which requires the entire input up front, this crate is designed
for streaming: bytes can arrive in arbitrary chunks (from a network socket, file reader, etc.)
and the decoder correctly handles multi-byte code points split across chunk boundaries.
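To make the chunk-boundary problem concrete, here is a minimal sketch built only on the standard library. It is not this crate's implementation: `SketchDecoder` and its methods are hypothetical names, and the sketch copies bytes into a buffer, whereas the crate is zero-copy. It only illustrates the idea of carrying an incomplete code point over to the next chunk.

```rust
use std::str;

/// Hypothetical helper (not part of this crate): keeps the bytes of a
/// code point that was cut off at the end of the previous chunk.
struct SketchDecoder {
    pending: Vec<u8>,
}

impl SketchDecoder {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Decode one chunk and append the text to `out`.
    /// Invalid sequences are replaced with U+FFFD.
    fn push(&mut self, chunk: &[u8], out: &mut String) {
        // Start with whatever was left over from the previous chunk.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);

        let mut rest: &[u8] = &buf;
        loop {
            match str::from_utf8(rest) {
                Ok(s) => {
                    out.push_str(s);
                    break;
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    // `valid` was just validated by `from_utf8`, so this cannot fail.
                    out.push_str(str::from_utf8(valid).expect("validated prefix"));
                    match e.error_len() {
                        // A definitely invalid sequence: replace it and continue.
                        Some(n) => {
                            out.push('\u{FFFD}');
                            rest = &after[n..];
                        }
                        // The input ended mid code point: keep the tail for later.
                        None => {
                            self.pending = after.to_vec();
                            break;
                        }
                    }
                }
            }
        }
    }

    /// Signal end of input. A dangling incomplete sequence becomes U+FFFD.
    fn finish(&mut self, out: &mut String) {
        if !self.pending.is_empty() {
            self.pending.clear();
            out.push('\u{FFFD}');
        }
    }
}
```

Feeding `b"h\xC3"` and then `b"\xA9llo"` to this sketch yields the same text as decoding the five bytes of "héllo" in one go, because the lone `0xC3` is held back until its continuation byte arrives.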
The crate provides three levels of API:
- utf8::decode(): low-level, single-shot decode of a byte slice. Returns the valid prefix and either an invalid sequence or an incomplete suffix that can be completed with more input.
- LossyDecoder: a push-based streaming decoder. Feed it chunks of bytes and it calls back with &str slices, replacing errors with U+FFFD.
- BufReadDecoder: a pull-based streaming decoder wrapping any BufRead, with both strict and lossy modes.
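The three outcomes of a single-shot decode (fully valid, invalid sequence, incomplete suffix) can be illustrated with the standard library alone. The `Outcome` enum and `classify` function below are hypothetical stand-ins written for this sketch; they are not the crate's types and their field names are assumptions. The distinction itself comes from `Utf8Error::error_len()`, which returns `None` when the input simply ended too early.

```rust
use std::str;

/// Hypothetical result type for this sketch (not the crate's own error type).
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    /// The whole input is valid UTF-8.
    Valid(&'a str),
    /// A byte sequence that can never be valid, with the text before it
    /// and the bytes after it.
    Invalid {
        valid_prefix: &'a str,
        invalid_sequence: &'a [u8],
        remaining: &'a [u8],
    },
    /// The input ends in the middle of a code point; more input may complete it.
    Incomplete {
        valid_prefix: &'a str,
        incomplete_suffix: &'a [u8],
    },
}

/// Classify a byte slice without copying: every field borrows from `input`.
fn classify(input: &[u8]) -> Outcome<'_> {
    match str::from_utf8(input) {
        Ok(s) => Outcome::Valid(s),
        Err(e) => {
            let (valid, rest) = input.split_at(e.valid_up_to());
            // `valid` was already checked by `from_utf8`, so this cannot fail.
            let valid_prefix = str::from_utf8(valid).expect("validated prefix");
            match e.error_len() {
                Some(n) => Outcome::Invalid {
                    valid_prefix,
                    invalid_sequence: &rest[..n],
                    remaining: &rest[n..],
                },
                None => Outcome::Incomplete {
                    valid_prefix,
                    incomplete_suffix: rest,
                },
            }
        }
    }
}
```

With this sketch, `classify(b"Hello\xC0World")` reports the valid prefix "Hello", the invalid byte `0xC0`, and the remaining bytes of "World", while `classify(b"caf\xC3")` reports an incomplete suffix, since `0xC3` could still be completed by a continuation byte in a later chunk.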
Example
use ;
let bytes = b"Hello\xC0World";
match decode
History
- Originally written by Simon Sapin as SimonSapin/rust-utf8, published as the utf-8 crate.
- The upstream repo was archived and is no longer maintained.
- Used by ureq among others. Simon Sapin suggested inlining the code into crates that need it rather than republishing.
- Forked here as a standalone repo (not a GitHub fork) to allow continued maintenance.
- Added fuzz testing.
- Modernized code: set Rust edition to 2021, ran cargo fmt, fixed lifetime syntax and clippy warnings.
- Added GitHub Actions CI (lint, clippy, tests, Miri on every push/PR; nightly fuzzing).
- Removed defunct bench setup (missing shared modules from upstream).
- Added #![deny(missing_docs)] and documented all public items.
- Added no_std support for all but BufReadDecoder.
Fuzzing
Fuzz tests use cargo-fuzz (libFuzzer). Three targets cover the main API surface:
- fuzz_decode: utf8::decode(), validated against std::str::from_utf8()
- fuzz_lossy_decoder: LossyDecoder with random chunk splits, validated against String::from_utf8_lossy()
- fuzz_bufread_decoder: BufReadDecoder::read_to_string_lossy(), validated against String::from_utf8_lossy()
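To run a target locally, pass its name to cargo fuzz run, for example cargo fuzz run fuzz_decode (cargo-fuzz needs a nightly toolchain).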
A GitHub Actions workflow runs all targets nightly.
Miri
Miri runs on every push/PR to validate the unsafe code
(three str::from_utf8_unchecked() calls). The test suite uses exhaustive input partitioning,
which is exponential, so inputs longer than 10 bytes are skipped under Miri to keep CI fast.
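The exponential cost is easy to see in a sketch. An n-byte input has n - 1 gaps between bytes, and each gap either is or is not a chunk boundary, giving 2^(n-1) partitions. The helper names below are hypothetical and this is not the crate's test code; the `cfg!(miri)` check is one plausible way to express the skip, shown here as an assumption.

```rust
/// Hypothetical helper: every way to cut `input` into contiguous,
/// non-empty chunks. Bit `i` of `mask` decides whether there is a
/// boundary after byte `i`.
fn all_partitions(input: &[u8]) -> Vec<Vec<&[u8]>> {
    if input.is_empty() {
        // One partition: no chunks at all.
        return vec![Vec::new()];
    }
    let gaps = input.len() - 1;
    // Keep the sketch safe from shift overflow and runaway memory use.
    assert!(gaps < 20, "input too long for exhaustive partitioning");

    let mut result = Vec::with_capacity(1usize << gaps);
    for mask in 0u32..(1u32 << gaps) {
        let mut chunks = Vec::new();
        let mut start = 0;
        for gap in 0..gaps {
            if mask & (1u32 << gap) != 0 {
                // Close the current chunk right after byte `gap`.
                chunks.push(&input[start..=gap]);
                start = gap + 1;
            }
        }
        // The final chunk runs to the end of the input.
        chunks.push(&input[start..]);
        result.push(chunks);
    }
    result
}

/// Assumed form of the skip: only long inputs, and only when running under Miri.
fn skip_under_miri(input_len: usize) -> bool {
    cfg!(miri) && input_len > 10
}
```

A 3-byte input has 4 partitions and a 10-byte input has 512, so each extra byte doubles the work. Under an interpreter as slow as Miri, a length cap keeps the total run time bounded.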
License
MIT OR Apache-2.0