async-read-super-ext
A Rust library that provides extended functionality for async readers, specifically focusing on UTF-8 boundary-aware reading operations.
Overview
This library extends Tokio's AsyncBufRead trait with additional methods for reading data while respecting UTF-8 character boundaries. The main feature is read_utf8_boundaries_lossy, which reads data from an async source and ensures that the output contains only valid UTF-8, replacing invalid sequences with Unicode replacement characters.
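To illustrate why boundary awareness matters, here is a minimal standard-library sketch (it does not use this crate). The helper name decode_chunks_naively is hypothetical; it decodes each chunk on its own, the way a boundary-unaware reader might, and shows how a character split across two reads gets corrupted.

```rust
/// Hypothetical helper: decode every chunk independently, ignoring the
/// possibility that a multi-byte character straddles two chunks.
fn decode_chunks_naively(chunks: &[&[u8]]) -> String {
    chunks
        .iter()
        .map(|chunk| String::from_utf8_lossy(chunk).into_owned())
        .collect()
}
```

With the two chunks b"caf\xC3" and b"\xA9" (the character é split in the middle), this naive approach yields "caf" followed by two replacement characters, whereas decoding the concatenated bytes yields "café". Carrying the incomplete tail over to the next read avoids that loss.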
Features
- UTF-8 Boundary Awareness: Reads data while respecting UTF-8 character boundaries
- Lossy Conversion: Invalid UTF-8 sequences are replaced with replacement characters (�)
- Async/Await Support: Built on top of Tokio's async I/O primitives
- Buffer Management: Handles incomplete UTF-8 sequences across read boundaries
- Zero-Copy When Possible: Efficient handling of valid UTF-8 data
Installation
Add this to your Cargo.toml:
[dependencies]
async-read-super-ext = "0.1.0"
Usage
use AsyncReadSuperExt;
use ;
use Cursor;
async
Handling Invalid UTF-8
The library gracefully handles invalid UTF-8 sequences by replacing them with Unicode replacement characters:
use AsyncReadSuperExt;
use BufReader;
use Cursor;
async
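The replacement behaviour described in this section can be previewed with the standard library alone. The sketch below does not call this crate; decode_lossy is a hypothetical helper built on String::from_utf8_lossy, which applies the same kind of U+FFFD substitution to an in-memory byte slice.

```rust
use std::borrow::Cow;

/// Hypothetical helper: lossily decode `bytes` and report whether any
/// invalid sequence had to be replaced with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> (String, bool) {
    match String::from_utf8_lossy(bytes) {
        // Borrowed means the input was already valid UTF-8.
        Cow::Borrowed(text) => (text.to_owned(), false),
        // Owned means at least one replacement character was inserted.
        Cow::Owned(text) => (text, true),
    }
}
```

For example, decode_lossy(b"Hello\xFFWorld") returns the text "Hello�World" together with true, while fully valid input comes back unchanged together with false.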
Reading Large Files
The library efficiently handles large files and streams:
use AsyncReadSuperExt;
use File;
use BufReader;
async
How It Works
The library implements a state machine that:
- Reads data from the underlying async reader using poll_fill_buf
- Validates UTF-8 sequences using Rust's built-in UTF-8 validation
- Handles incomplete sequences by buffering partial UTF-8 characters across read boundaries
- Replaces invalid bytes with Unicode replacement characters (U+FFFD)
- Outputs valid UTF-8 data to the provided buffer
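The steps above can be sketched synchronously with the standard library. This is an illustration under assumptions, not the crate's actual implementation: LossyChunkDecoder, feed, and finish are hypothetical names, chunks are passed in by hand instead of coming from poll_fill_buf, and emitting one U+FFFD for a dangling sequence at end of input is an assumed policy.

```rust
use std::str;

/// Hypothetical decoder that carries an incomplete trailing sequence
/// from one chunk to the next.
struct LossyChunkDecoder {
    carry: Vec<u8>, // at most 3 bytes of an unfinished character
}

impl LossyChunkDecoder {
    fn new() -> Self {
        Self { carry: Vec::new() }
    }

    /// Decode one chunk, appending the resulting text to `out`.
    fn feed(&mut self, chunk: &[u8], out: &mut String) {
        // Start with any bytes left over from the previous chunk.
        let mut input = std::mem::take(&mut self.carry);
        input.extend_from_slice(chunk);
        let mut rest: &[u8] = &input;
        loop {
            match str::from_utf8(rest) {
                Ok(valid) => {
                    out.push_str(valid);
                    break;
                }
                Err(err) => {
                    let (valid, after) = rest.split_at(err.valid_up_to());
                    // The prefix up to `valid_up_to` is guaranteed valid.
                    out.push_str(str::from_utf8(valid).expect("validated prefix"));
                    match err.error_len() {
                        // Definitely invalid bytes: replace and continue.
                        Some(len) => {
                            out.push('\u{FFFD}');
                            rest = &after[len..];
                        }
                        // Input ended mid-character: keep it for later.
                        None => {
                            self.carry = after.to_vec();
                            break;
                        }
                    }
                }
            }
        }
    }

    /// Call at end of input: a dangling partial sequence becomes U+FFFD.
    fn finish(&mut self, out: &mut String) {
        if !self.carry.is_empty() {
            self.carry.clear();
            out.push('\u{FFFD}');
        }
    }
}
```

Feeding b"caf\xC3" and then b"\xA9!" produces "café!", because the lone 0xC3 byte is held back until its continuation byte arrives.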
Key Components
- AsyncReadSuperExt: Extension trait that adds the read_utf8_boundaries_lossy method to any AsyncBufRead
- Utf8BoundariesLossy: Future that implements the async reading logic
- Internal state management: Handles incomplete UTF-8 sequences and invalid byte replacement
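The extension-trait pattern named above can be shown with a synchronous analogue. This sketch is an assumption-laden illustration rather than the crate's code: BufReadLossyExt and read_to_string_lossy are hypothetical names, and std::io::BufRead stands in for Tokio's AsyncBufRead so the example needs no runtime.

```rust
use std::io::{self, BufRead};

/// Hypothetical extension trait: adds a lossy-decoding method on top of
/// the standard BufRead trait.
trait BufReadLossyExt: BufRead {
    /// Read everything that remains and decode it lossily.
    /// I/O errors from the underlying reader are propagated with `?`.
    fn read_to_string_lossy(&mut self) -> io::Result<String> {
        let mut bytes = Vec::new();
        self.read_to_end(&mut bytes)?;
        Ok(String::from_utf8_lossy(&bytes).into_owned())
    }
}

// Blanket implementation: every BufRead type gains the method once the
// trait is in scope.
impl<R: BufRead + ?Sized> BufReadLossyExt for R {}
```

With the trait in scope, any buffered reader, such as a std::io::Cursor over a byte vector, can call read_to_string_lossy directly. Unlike the streaming design described in this README, this simplified version reads the whole input before decoding.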
Performance Characteristics
- Memory Efficient: Uses a small fixed-size buffer (4 bytes) for handling incomplete UTF-8 sequences
- Streaming: Processes data incrementally without requiring the entire input in memory
- Zero-Copy: Valid UTF-8 data is copied directly to the output buffer without additional processing
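A 4-byte buffer is enough because no UTF-8 encoded character is longer than four bytes, so an incomplete sequence is at most three bytes long. The sketch below shows how the expected length follows from the leading byte; utf8_sequence_len is a hypothetical helper written for illustration, not a function from this crate.

```rust
/// Hypothetical helper: expected total length of a UTF-8 sequence,
/// judged from its first byte. Returns None for bytes that can never
/// start a valid sequence.
fn utf8_sequence_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation byte or invalid lead byte
    }
}
```

A reader that sees a lead byte near the end of its buffer can compare this length with the number of bytes actually available to decide whether the character is complete or must wait for the next read.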
Error Handling
The library follows Rust's standard error handling patterns:
- I/O errors from the underlying reader are propagated
- Invalid UTF-8 sequences are handled gracefully with replacement characters
- The output is always valid UTF-8
Dependencies
- tokio: Async runtime and I/O utilities
- pin-project-lite: For safe pin projection in async contexts
- tracing: For logging and debugging support
Compatibility
- Rust Edition: 2024
- Minimum Rust Version: Rust 1.85 or later (the first release that supports the 2024 edition)
- Tokio Version: Compatible with Tokio 1.x
Testing
The library includes comprehensive tests covering:
- Valid UTF-8 input
- Invalid UTF-8 sequences
- Incomplete UTF-8 at buffer boundaries
- Large file handling
- Mixed valid/invalid content
- Edge cases (empty input, leading/trailing invalid bytes)
Run tests with:
cargo test
License
This project is licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.