Qubit Text Codec
Buffer-oriented UTF codec primitives and Unicode/ASCII support utilities for Rust.
Overview
Qubit Text Codec is a low-level codec core for Rust code that needs explicit control below ordinary str, String, and char APIs. Its current built-in codecs focus on Unicode transfer formats: UTF-8, UTF-16, and UTF-32, with both code-unit and byte-oriented variants where that distinction matters.
The crate also provides the small shared surface that codec adapters need: charset identity metadata, the Coder trait and progress types re-exported from qubit-io, low-level CharsetCodec implementations (Unit identifies the storage unit), policy-aware CharsetEncoder / CharsetDecoder / CharsetConverter wrappers, decode status values, byte order and BOM helpers, and concrete encoding/decoding error types. ASCII and Unicode namespace helpers are included because UTF codecs and text parsers often need these checks close to the buffer boundary.
Use this crate when you need:
- ASCII classification, case conversion, digit conversion, and ASCII folding;
- Unicode code point and scalar value checks, surrogate checks, plane calculation, and noncharacter/control classification;
- UTF-8, UTF-16, and UTF-32 namespace helpers for byte or code-unit classification and length calculation;
- buffer-level CharsetCodec implementations for ASCII, ISO-8859-1 (Latin-1), UTF-8, UTF-16, and UTF-32;
- policy-aware charset encoders, decoders, and converters with malformed/unmappable replacement, ignore, and report actions;
- byte-order and BOM handling for UTF-16 and UTF-32 byte streams;
- a small trait and error vocabulary that future non-Unicode encoding adapters can reuse without making this crate a text I/O framework.
Prefer Rust's standard text APIs for ordinary text handling. Use this crate when a parser, binary format, or text I/O adapter needs strict buffer-level UTF codec behavior and precise error positions.
API reference documentation is available on docs.rs.
Installation
[]
= "0.1"
= "0.2"
Quick Example
use ;
assert!;
assert_eq!;
assert_eq!;
assert_eq!;
let codec = Utf8Codec;
let decoded = codec.decode_one?;
assert_eq!;
let mut encoder = new;
let mut utf8 = ;
let progress = encoder.convert?;
assert_eq!;
assert_eq!;
let mut decoder = new;
let mut chars = ;
let progress = decoder.convert?;
assert_eq!;
assert_eq!;
let mut utf16 = new;
let mut bytes = ;
let progress = utf16.convert?;
assert_eq!;
# Ok::
Main Capabilities
Standards
UTF-8 decoding follows the well-formed byte sequence rules in the Unicode Standard, Table 3-7, and the equivalent RFC 3629 syntax. Malformed byte sequences include overlong encodings, UTF-8 encodings of surrogate code points, invalid continuation bytes, and sequences that would encode values above U+10FFFF.
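The four malformed classes listed above can be illustrated with the standard library alone, since std::str::from_utf8 applies the same well-formedness rules. This is a minimal sketch, not this crate's API: first_invalid_utf8_index and MALFORMED_SAMPLES are hypothetical names introduced for the example.

```rust
use std::str;

/// One sample per malformed class (hypothetical test data for this sketch).
const MALFORMED_SAMPLES: [(&str, &[u8]); 4] = [
    ("overlong encoding of U+0000", &[0xC0, 0x80]),
    ("encoded surrogate U+D800", &[0xED, 0xA0, 0x80]),
    ("invalid continuation byte", &[0xE2, 0x28, 0xA1]),
    ("value above U+10FFFF (U+110000)", &[0xF4, 0x90, 0x80, 0x80]),
];

/// Returns the byte index of the first malformed or incomplete UTF-8
/// sequence, or `None` when the whole slice is well-formed.
fn first_invalid_utf8_index(bytes: &[u8]) -> Option<usize> {
    match str::from_utf8(bytes) {
        Ok(_) => None,
        // `valid_up_to` is the length of the longest well-formed prefix,
        // which is also the index where the bad sequence starts.
        Err(err) => Some(err.valid_up_to()),
    }
}
```

Every sample is rejected at index 0, and a well-formed prefix shifts the reported index accordingly.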
Namespace Enums
qubit-text-codec exposes stateless namespace enums for constants, classification, conversion, and sizing. Encoding and decoding behavior lives in dedicated codec types.
| Namespace | Purpose |
|---|---|
| Ascii | ASCII constants, classification, case conversion, digit conversion, case-insensitive comparison, and ASCII folding |
| Unicode | Unicode code point range checks, scalar value checks, surrogate checks, plane calculation, noncharacter checks, control checks, and u32 to char conversion |
| Utf8 | UTF-8 byte classification and byte length calculation |
| Utf16 | UTF-16 surrogate classification, surrogate-pair composition/decomposition, code-unit length calculation, and UTF-16 BOM detection |
| Utf32 | UTF-32 scalar unit validation, unit length calculation, and UTF-32 BOM detection |
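To make the surrogate-pair and plane helpers concrete, here is a std-only sketch of the underlying arithmetic. The function names are hypothetical and chosen for this example; they are not taken from the Utf16 or Unicode namespaces, whose exact method names are not shown here.

```rust
/// Splits a supplementary-plane scalar value into a (high, low) surrogate
/// pair. Returns `None` for BMP values, which need one UTF-16 code unit.
fn decompose_surrogates(c: char) -> Option<(u16, u16)> {
    let cp = c as u32;
    if cp < 0x1_0000 {
        return None;
    }
    let offset = cp - 0x1_0000; // at most 20 significant bits
    let high = 0xD800 + (offset >> 10) as u16; // top 10 bits
    let low = 0xDC00 + (offset & 0x3FF) as u16; // bottom 10 bits
    Some((high, low))
}

/// Combines a high and a low surrogate into one scalar value.
/// Returns `None` if either unit is outside its surrogate range.
fn compose_surrogates(high: u16, low: u16) -> Option<char> {
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    let cp = 0x1_0000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00);
    char::from_u32(cp)
}

/// Unicode plane number (0 is the BMP, 1 through 16 are supplementary).
fn plane(c: char) -> u32 {
    (c as u32) >> 16
}
```

For U+1F600 the pair is (0xD83D, 0xDE00) and the plane is 1, which matches what char::encode_utf16 produces.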
Codec Layers
Encoding and decoding are split into three layers over caller-provided buffers.
| Layer | Type | Purpose |
| --- | --- | --- |
| Generic conversion | Coder<Input, Output> | Converts one sequence of code units into another and reports CoderProgress |
| Low-level charset algorithm | CharsetCodec | Encodes or decodes one Unicode char using its associated storage unit |
| Policy decoder | CharsetDecoder<C> | Converts source units into char values and applies MalformedAction |
| Policy encoder | CharsetEncoder<C> | Converts char values into target units and applies UnmappableAction |
| Charset conversion | CharsetConverter<D, E> | Combines one decoder and one encoder to convert between charsets |
The associated Unit type is the buffer storage unit, not always the Unicode code unit. UTF-8 uses u8, UTF-16 code-unit codecs use u16, byte-serialized UTF-16 uses u8, UTF-32 code-unit codecs use u32, and byte-serialized UTF-32 uses u8.
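The difference between a code unit and a storage unit is easiest to see by encoding one character several ways. The sketch below uses only the standard library; the helper names are hypothetical and do not correspond to this crate's codec types.

```rust
/// UTF-8: the storage unit is `u8`.
fn utf8_units(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// UTF-16 code-unit form: the storage unit is `u16`.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// Byte-serialized UTF-16: the storage unit is `u8`, and a byte order
/// must be chosen for each 16-bit code unit.
fn utf16_bytes(c: char, big_endian: bool) -> Vec<u8> {
    utf16_units(c)
        .into_iter()
        .flat_map(|unit| if big_endian { unit.to_be_bytes() } else { unit.to_le_bytes() })
        .collect()
}

/// UTF-32 code-unit form: one `u32` per scalar value.
fn utf32_unit(c: char) -> u32 {
    c as u32
}

/// Byte-serialized UTF-32: four `u8` storage units per scalar value.
fn utf32_bytes(c: char, big_endian: bool) -> [u8; 4] {
    if big_endian { utf32_unit(c).to_be_bytes() } else { utf32_unit(c).to_le_bytes() }
}
```

For '€' (U+20AC) this gives three u8 units (E2 82 AC) in UTF-8, one u16 unit (0x20AC) in UTF-16, and the bytes 20 AC or AC 20 in byte-serialized UTF-16, depending on byte order.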
Coder is intentionally generic over unit types and has a small stateful interface:
- Use convert when you are processing a potentially bounded stream and want to pause on NeedInput / NeedOutput without converting all data at once.
- Use CharsetDecoder and CharsetEncoder when you want charset-specific policy (replace/ignore/report, replacement chars, unmappable behavior).
- Use CharsetConverter when your pipeline is explicitly source charset -> Unicode -> target charset, so malformed/unmappable decisions are centralized in one place.
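The replace/ignore/report policy mentioned in the list above can be sketched with the standard library. This is an illustration of the idea only: MalformedActionSketch and decode_utf8_with_policy are hypothetical names, and the crate's own MalformedAction and CharsetDecoder may differ in shape and behavior. The sketch also assumes the input is complete, so a truncated tail is treated as malformed.

```rust
/// Hypothetical policy enum for this sketch.
#[derive(Clone, Copy)]
enum MalformedActionSketch {
    /// Emit this character in place of each malformed sequence.
    Replace(char),
    /// Drop malformed sequences silently.
    Ignore,
    /// Stop and report the absolute byte index of the first malformed sequence.
    Report,
}

/// Decodes UTF-8 into a `String`, applying `action` to malformed sequences.
/// On `Report`, the error is the absolute index into the original slice.
fn decode_utf8_with_policy(
    mut input: &[u8],
    action: MalformedActionSketch,
) -> Result<String, usize> {
    let mut out = String::new();
    let mut base = 0; // absolute index of `input[0]` in the original slice
    loop {
        match std::str::from_utf8(input) {
            Ok(text) => {
                out.push_str(text);
                return Ok(out);
            }
            Err(err) => {
                let valid = err.valid_up_to();
                // The prefix before the error is known to be well-formed.
                out.push_str(std::str::from_utf8(&input[..valid]).expect("validated prefix"));
                // `error_len` is `None` for a truncated tail; skip the rest then.
                let skip = err.error_len().unwrap_or(input.len() - valid);
                match action {
                    MalformedActionSketch::Report => return Err(base + valid),
                    MalformedActionSketch::Replace(c) => out.push(c),
                    MalformedActionSketch::Ignore => {}
                }
                base += valid + skip;
                input = &input[valid + skip..];
            }
        }
    }
}
```

Keeping the policy in one function like this is what lets callers swap actions without touching the decoding algorithm itself.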
All conversion APIs receive the full input/output slice plus an absolute start index. Progress counters are relative to those start indices, while errors carry absolute indices in the supplied buffers.
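The index convention can be sketched as follows. The signature, Progress, and DecodeError below are assumptions made for illustration; the actual Coder trait and CoderProgress type come from qubit-io and are not reproduced here.

```rust
/// Hypothetical progress counters, relative to the start indices.
#[derive(Debug, PartialEq)]
struct Progress {
    read: usize,
    written: usize,
}

/// Hypothetical error carrying an absolute index into the input slice.
#[derive(Debug, PartialEq)]
struct DecodeError {
    input_index: usize,
    value: u8,
}

/// Decodes ASCII bytes into chars. Both slices are passed whole, with an
/// absolute start index for each. Stops when input or output runs out.
fn decode_ascii(
    input: &[u8],
    input_start: usize,
    output: &mut [char],
    output_start: usize,
) -> Result<Progress, DecodeError> {
    let mut read = 0;
    let mut written = 0;
    while input_start + read < input.len() && output_start + written < output.len() {
        let byte = input[input_start + read];
        if !byte.is_ascii() {
            // Errors use the absolute position in the supplied buffer.
            return Err(DecodeError { input_index: input_start + read, value: byte });
        }
        output[output_start + written] = byte as char;
        read += 1;
        written += 1;
    }
    // Progress is relative: add it to the start indices to resume.
    Ok(Progress { read, written })
}
```

A caller resumes by passing input_start + read and output_start + written on the next call, while an error index can be used directly against the original buffer.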
Charset is a lightweight charset identity descriptor with a stable id,
display name, and accepted aliases. Built-in descriptors are available as
Charset::ASCII, Charset::ISO_8859_1, Charset::UTF_8, Charset::UTF_16,
Charset::UTF_16LE, Charset::UTF_16BE, Charset::UTF_32,
Charset::UTF_32LE, and Charset::UTF_32BE. Generic UTF_16 and UTF_32
represent the Unicode code-unit form or a BOM-aware stream label, while the
LE and BE variants identify fixed-byte-order byte streams. External codec
crates can define their own static descriptors, for example
Charset::new("gbk", "GBK", &["cp936"]). Equality and hashing use only the
id, while matches_label accepts the id, display name, or aliases with ASCII
case-insensitive comparison.
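The identity rules described above (equality and hashing by id, label matching across id, name, and aliases) can be sketched like this. CharsetSketch is a hypothetical stand-in written for this example, not the crate's Charset type; only the constructor argument order mirrors the Charset::new call quoted above.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical descriptor mirroring the rules described in the text.
#[derive(Debug, Clone, Copy)]
struct CharsetSketch {
    id: &'static str,
    name: &'static str,
    aliases: &'static [&'static str],
}

impl CharsetSketch {
    const fn new(
        id: &'static str,
        name: &'static str,
        aliases: &'static [&'static str],
    ) -> Self {
        Self { id, name, aliases }
    }

    /// Accepts the id, display name, or any alias, ignoring ASCII case.
    fn matches_label(&self, label: &str) -> bool {
        self.id.eq_ignore_ascii_case(label)
            || self.name.eq_ignore_ascii_case(label)
            || self.aliases.iter().any(|alias| alias.eq_ignore_ascii_case(label))
    }
}

// Equality and hashing look only at the id, so two descriptors with the
// same id but different names or aliases compare equal.
impl PartialEq for CharsetSketch {
    fn eq(&self, other: &Self) -> bool {
        self.id == other.id
    }
}

impl Eq for CharsetSketch {}

impl Hash for CharsetSketch {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.id.hash(state);
    }
}
```

Hashing only the id keeps Hash consistent with Eq, which is what makes such descriptors safe to use as HashMap keys.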
Built-in Codecs
| Codec family | Storage unit | Low-level codec |
|---|---|---|
| ASCII bytes | u8 | AsciiCodec |
| ISO-8859-1 bytes | u8 | Latin1Codec |
| UTF-8 bytes | u8 | Utf8Codec |
| UTF-16 code units | u16 | Utf16U16Codec |
| UTF-16 bytes | u8 | Utf16ByteCodec |
| UTF-32 code units | u32 | Utf32U32Codec |
| UTF-32 bytes | u8 | Utf32ByteCodec |
Byte codecs carry a ByteOrder value. Use UnicodeBom::detect, Utf16::detect_bom, or Utf32::detect_bom when a byte stream may include a BOM. The byte codecs do not detect, skip, or emit BOM bytes automatically. Streaming callers should buffer up to four bytes, or read until EOF, before deciding the BOM because UTF-32 little-endian (FF FE 00 00) overlaps the UTF-16 little-endian prefix (FF FE).
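The overlap between the UTF-32 and UTF-16 little-endian BOMs is why detection order matters. The following std-only sketch checks the four-byte patterns first; BomSketch and detect_bom are hypothetical names and do not reproduce UnicodeBom::detect, Utf16::detect_bom, or Utf32::detect_bom.

```rust
/// Hypothetical BOM kinds for UTF-16 and UTF-32 byte streams.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BomSketch {
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
}

impl BomSketch {
    /// Number of bytes the caller should skip after detection.
    fn len(self) -> usize {
        match self {
            BomSketch::Utf16Le | BomSketch::Utf16Be => 2,
            BomSketch::Utf32Le | BomSketch::Utf32Be => 4,
        }
    }
}

/// Inspects up to the first four bytes of a stream.
/// The 4-byte UTF-32 patterns are tested before the 2-byte UTF-16 ones,
/// because `FF FE 00 00` also starts with the UTF-16LE BOM `FF FE`.
/// Note the inherent ambiguity: `FF FE 00 00` could also be a UTF-16LE
/// BOM followed by U+0000; this sketch resolves it in favor of UTF-32LE.
fn detect_bom(prefix: &[u8]) -> Option<BomSketch> {
    if prefix.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some(BomSketch::Utf32Le)
    } else if prefix.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some(BomSketch::Utf32Be)
    } else if prefix.starts_with(&[0xFF, 0xFE]) {
        Some(BomSketch::Utf16Le)
    } else if prefix.starts_with(&[0xFE, 0xFF]) {
        Some(BomSketch::Utf16Be)
    } else {
        None
    }
}
```

Calling this with only two buffered bytes (FF FE) reports UTF-16LE, which is only safe at EOF; with more data pending, the caller should wait for four bytes before deciding.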
Decode Status and Errors
CharsetCodec::decode_one and the policy decoders distinguish incomplete input from malformed input:
| Type | Purpose |
|---|---|
| DecodeStatus::Complete { value, consumed } | A complete scalar value and consumed unit count |
| DecodeStatus::NeedMore { required, available } | The prefix is valid so far but more units are required |
| CharsetDecodeError | Charset, decoding error kind, input unit index, and optional raw value |
| CharsetEncodeError | Charset, encoding error kind, output/input index, and optional raw value |
DecodeStatus::NeedMore is not an error. A streaming text reader should read more input when possible and, only at EOF, convert the pending NeedMore status into an incomplete-sequence error or an appropriate std::io::Error.
Errors tied to a raw value, such as an invalid UTF-32 unit or an unmappable character, expose that value through value().
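The incomplete-versus-malformed distinction has a direct analogue in the standard library, where Utf8Error::error_len returns None for a truncated sequence and Some(len) for a malformed one. The sketch below uses that to classify a chunk; ChunkStatus and classify_utf8_chunk are hypothetical names, not this crate's DecodeStatus.

```rust
use std::str;

/// Hypothetical three-way status for one chunk of a UTF-8 byte stream.
#[derive(Debug, PartialEq)]
enum ChunkStatus {
    /// The whole chunk is well-formed.
    Complete,
    /// The chunk ends inside a sequence that is valid so far.
    NeedMore { valid_up_to: usize },
    /// A sequence at `index` can never become valid; `len` bytes are bad.
    Malformed { index: usize, len: usize },
}

fn classify_utf8_chunk(chunk: &[u8]) -> ChunkStatus {
    match str::from_utf8(chunk) {
        Ok(_) => ChunkStatus::Complete,
        Err(err) => match err.error_len() {
            // `None` means the input ended unexpectedly: read more bytes,
            // or turn this into an error if the stream is at EOF.
            None => ChunkStatus::NeedMore { valid_up_to: err.valid_up_to() },
            Some(len) => ChunkStatus::Malformed { index: err.valid_up_to(), len },
        },
    }
}
```

A reader that sees NeedMore keeps the bytes from valid_up_to onward, appends the next read, and classifies again.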
ASCII Helpers
Ascii keeps ASCII-only behavior explicit and predictable:
| Method group | Examples |
|---|---|
| Range checks | is_ascii_byte, is_ascii_char, is_ascii_code_point |
| Classification | is_whitespace_byte, is_letter_char, is_digit_code_point, is_hex_digit_char, is_printable_byte, is_control_code_point |
| Conversion | byte_to_uppercase, char_to_lowercase, char_to_digit, code_point_to_hex_digit |
| Comparison and folding | equals_ignore_case_char, equals_ignore_case_code_point, fold, fold_to_string |
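What "ASCII-only" means in practice can be shown with standard-library equivalents. The wrappers below are hypothetical and only approximate the method groups in the table; they are not the Ascii namespace's implementations.

```rust
/// ASCII case-insensitive comparison: only A-Z and a-z are folded.
fn equals_ignore_ascii_case(a: char, b: char) -> bool {
    a.eq_ignore_ascii_case(&b)
}

/// Value of an ASCII hexadecimal digit, or `None` for anything else.
fn hex_digit_value(c: char) -> Option<u32> {
    c.to_digit(16)
}

/// Uppercases ASCII letters and leaves every other character untouched.
fn ascii_uppercase(text: &str) -> String {
    text.to_ascii_uppercase()
}
```

Under these rules 'K' equals 'k', but the Kelvin sign U+212A does not equal 'k', and "straße" uppercases to "STRAßE" because ß is outside ASCII. Full Unicode case mapping would treat both cases differently, which is why the distinction is kept explicit.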
Prelude
qubit_text_codec::prelude re-exports the core namespace enums, coder and charset codec traits, policy wrappers, built-in codec types, charset descriptors, byte-order/BOM helpers, decode-status types, actions, and charset encode/decode errors.
use qubit_text_codec::prelude::*;
Crate Boundary
qubit-text-codec is not a general text processing library. It intentionally stays below grapheme-cluster segmentation, normalization, collation, locale-aware case mapping, transliteration, automatic encoding detection, display-width calculation, and std::io reader/writer adapters.
Its built-in codecs currently cover ASCII, ISO-8859-1 (Latin-1), UTF-8, UTF-16, and UTF-32. It does not replace encoding_rs for legacy or web-compatible encodings such as GBK, Big5, Shift_JIS, or Windows code pages. Future crates can add those encodings on top of the shared traits and error model, or delegate their tables and compatibility rules to specialized libraries.
Use specialized crates such as unicode-segmentation, unicode-normalization, unicode-width, or ICU4X for higher-level Unicode semantics.
Dependencies
This crate uses thiserror for error Display and Error implementations.
Testing & Code Coverage
This project maintains test coverage for ASCII classification and folding, Unicode code point helpers, BOM and byte-order handling, charset descriptors, ASCII/Latin-1/UTF-8/UTF-16/UTF-32 namespace helpers, buffer-level codecs, and charset encode/decode errors.
Running Tests
# Run all tests
# Run with coverage report
# Generate text format report
# Align code style with CI
# Run CI checks (format, clippy, test, coverage, audit)
./ci-check.sh
License
Copyright (c) 2026. Haixing Hu.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
See LICENSE for the full license text.
Contributing
Contributions are welcome. Please feel free to submit a Pull Request.
Development Guidelines
- Follow the Rust API guidelines.
- Prefer standard Rust text APIs unless low-level buffer-oriented codec control is required.
- Keep namespace enums focused on constants, classification, and sizing helpers.
- Keep charset-specific algorithms in concrete CharsetCodec types and keep malformed/unmappable policy in CharsetEncoder, CharsetDecoder, or CharsetConverter.
- Use specialized Unicode crates or ICU4X for normalization, segmentation, collation, display width, and locale-aware behavior.
- Maintain comprehensive test coverage.
- Document public APIs with examples when they clarify behavior.
- Ensure ./ci-check.sh passes before submitting a PR.
Author
Haixing Hu
Related Projects
- qubit-io: stream and byte I/O utilities for Rust.
- More Rust libraries from Qubit are published under the qubit-ltd organization on GitHub.
Repository: https://github.com/qubit-ltd/rs-text-codec