async-read-super-ext 0.1.0

A super extension for tokio::io::AsyncRead

MIT/Apache-2 licensed

A Rust library that provides extended functionality for async readers, specifically focusing on UTF-8 boundary-aware reading operations.

Overview

This library extends Tokio's AsyncBufRead trait with additional methods for reading data while respecting UTF-8 character boundaries. The main feature is read_utf8_boundaries_lossy, which reads data from an async source and ensures that the output contains only valid UTF-8, replacing invalid sequences with Unicode replacement characters.

Features

  • UTF-8 Boundary Awareness: Reads data while respecting UTF-8 character boundaries
  • Lossy Conversion: Invalid UTF-8 sequences are replaced with replacement characters (�)
  • Async/Await Support: Built on top of Tokio's async I/O primitives
  • Buffer Management: Handles incomplete UTF-8 sequences across read boundaries
  • Zero-Copy When Possible: Efficient handling of valid UTF-8 data

Installation

Add this to your Cargo.toml:

[dependencies]
async-read-super-ext = "0.1.0"

Usage

use async_read_super_ext::AsyncReadSuperExt;
use tokio::io::BufReader;
use std::io::Cursor;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Example with valid UTF-8 data
    let data = "Hello, 🦀 World!";
    let mut reader = BufReader::new(Cursor::new(data.as_bytes()));
    let mut output = Vec::new();
    
    let bytes_read = reader.read_utf8_boundaries_lossy(&mut output).await?;
    
    let result = String::from_utf8(output)?;
    println!("Read {} bytes: {}", bytes_read, result);
    
    Ok(())
}

Handling Invalid UTF-8

The library gracefully handles invalid UTF-8 sequences by replacing them with Unicode replacement characters:

use async_read_super_ext::AsyncReadSuperExt;
use tokio::io::BufReader;
use std::io::Cursor;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create data with invalid UTF-8 bytes
    let mut data = Vec::new();
    data.extend_from_slice("Hello ".as_bytes());
    data.push(0xFF); // Invalid UTF-8 byte
    data.push(0xFE); // Invalid UTF-8 byte
    data.extend_from_slice(" World".as_bytes());
    
    let mut reader = BufReader::new(Cursor::new(data));
    let mut output = Vec::new();
    
    let bytes_read = reader.read_utf8_boundaries_lossy(&mut output).await?;
    
    let result = String::from_utf8(output)?;
    println!("Read {} bytes: {}", bytes_read, result);
    // Output: "Hello �� World" (with replacement characters)
    
    Ok(())
}

Reading Large Files

The library efficiently handles large files and streams:

use async_read_super_ext::AsyncReadSuperExt;
use tokio::fs::File;
use tokio::io::BufReader;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_file.txt").await?;
    let mut reader = BufReader::new(file);
    let mut all_data = Vec::new();
    let mut buffer = Vec::new();
    
    loop {
        buffer.clear();
        let bytes_read = reader.read_utf8_boundaries_lossy(&mut buffer).await?;
        
        if bytes_read == 0 {
            break; // EOF
        }
        
        all_data.extend_from_slice(&buffer);
    }
    
    let content = String::from_utf8(all_data)?;
    println!("Total content length: {} characters", content.chars().count());
    
    Ok(())
}

How It Works

The library implements a state machine that:

  1. Reads data from the underlying async reader using poll_fill_buf
  2. Validates UTF-8 sequences using Rust's built-in UTF-8 validation
  3. Handles incomplete sequences by buffering partial UTF-8 characters across read boundaries
  4. Replaces invalid bytes with Unicode replacement characters (U+FFFD)
  5. Outputs valid UTF-8 data to the provided buffer
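
Steps 2 to 5 can be illustrated for a single in-memory buffer using only the standard library. This is a minimal sketch, not the crate's implementation: the function name push_lossy is hypothetical, and it relies on std::str::from_utf8 together with Utf8Error::valid_up_to and Utf8Error::error_len to tell invalid bytes apart from a character that was merely cut off at the end of the buffer.

```rust
use std::str;

/// Hypothetical helper: appends the lossy-decoded form of `input` to `out`.
/// Returns how many trailing bytes were left unconsumed because they look
/// like the start of a multi-byte character that was cut off.
fn push_lossy(mut input: &[u8], out: &mut Vec<u8>) -> usize {
    const REPLACEMENT: &str = "\u{FFFD}";
    loop {
        match str::from_utf8(input) {
            // Everything that remains is valid: copy it and stop.
            Ok(valid) => {
                out.extend_from_slice(valid.as_bytes());
                return 0;
            }
            Err(err) => {
                // The prefix before the error is known to be valid UTF-8.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.extend_from_slice(valid);
                match err.error_len() {
                    // Definitely invalid: emit one U+FFFD and skip the bytes.
                    Some(len) => {
                        out.extend_from_slice(REPLACEMENT.as_bytes());
                        input = &rest[len..];
                    }
                    // The input ended mid-character: the caller should retry
                    // these bytes once more data has arrived.
                    None => return rest.len(),
                }
            }
        }
    }
}

fn main() {
    let mut out = Vec::new();
    // 0xFF is never valid; 0xC3 starts a two-byte character that is cut off.
    let pending = push_lossy(b"ab\xFFc\xC3", &mut out);
    assert_eq!(out, "ab\u{FFFD}c".as_bytes());
    assert_eq!(pending, 1);
}
```

The returned count is what a streaming reader would need to carry over to the next read; the library itself does this internally.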

Key Components

  • AsyncReadSuperExt: Extension trait that adds the read_utf8_boundaries_lossy method to any AsyncBufRead
  • Utf8BoundariesLossy: Future that implements the async reading logic
  • Internal state management: Handles incomplete UTF-8 sequences and invalid byte replacement
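
The extension-trait pattern itself can be sketched with a synchronous analogue built on std::io::BufRead. Everything below is an assumption made for illustration: BufReadLossyExt and read_chunk_lossy are hypothetical names, the method is blocking rather than returning a future such as Utf8BoundariesLossy, and it decodes each chunk independently, so it does not handle characters split across reads.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical synchronous analogue of an extension trait.
trait BufReadLossyExt: BufRead {
    /// Reads one buffered chunk, appends it to `out` with invalid bytes
    /// replaced by U+FFFD, and returns the number of input bytes consumed.
    fn read_chunk_lossy(&mut self, out: &mut String) -> io::Result<usize> {
        let chunk = self.fill_buf()?;
        let n = chunk.len();
        out.push_str(&String::from_utf8_lossy(chunk));
        self.consume(n);
        Ok(n)
    }
}

// Blanket impl: every type that implements BufRead gains the method.
impl<R: BufRead + ?Sized> BufReadLossyExt for R {}

fn main() -> io::Result<()> {
    let mut reader = Cursor::new(b"Hello \xFF World".to_vec());
    let mut out = String::new();
    let n = reader.read_chunk_lossy(&mut out)?;
    assert_eq!(n, 13);
    assert_eq!(out, "Hello \u{FFFD} World");
    Ok(())
}
```

The blanket impl is what lets callers use the new method on any reader just by importing the trait, as the Usage examples do with AsyncReadSuperExt.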

Performance Characteristics

  • Memory Efficient: Uses a small fixed-size buffer (4 bytes) for handling incomplete UTF-8 sequences
  • Streaming: Processes data incrementally without requiring the entire input in memory
  • Zero-Copy: Valid UTF-8 data goes straight from the reader's buffer to the output buffer, with no intermediate copy or extra processing
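
To show why a 4-byte carry-over buffer is enough, here is a sketch of a chunk-by-chunk decoder written against the standard library only. LossyDecoder, feed, and finish are hypothetical names, and the crate's internal state handling may differ; the point is that an incomplete UTF-8 sequence is at most 3 bytes, so adding one more byte always fits in 4.

```rust
use std::str;

const REPLACEMENT: &[u8] = "\u{FFFD}".as_bytes();

/// Hypothetical decoder for illustration; not the crate's actual type.
#[derive(Default)]
struct LossyDecoder {
    carry: [u8; 4], // a UTF-8 character is at most 4 bytes long
    carry_len: usize,
}

impl LossyDecoder {
    /// Decodes one chunk, appending only valid UTF-8 to `out`.
    fn feed(&mut self, chunk: &[u8], out: &mut Vec<u8>) {
        let mut input = chunk;

        // First, try to finish a character left over from the previous chunk.
        while self.carry_len > 0 {
            let Some((&byte, rest)) = input.split_first() else { return };
            self.carry[self.carry_len] = byte;
            let candidate = &self.carry[..self.carry_len + 1];
            match str::from_utf8(candidate) {
                // The character is now complete.
                Ok(_) => {
                    out.extend_from_slice(candidate);
                    self.carry_len = 0;
                    input = rest;
                }
                // Still incomplete: keep the byte and look at the next one.
                Err(e) if e.error_len().is_none() => {
                    self.carry_len += 1;
                    input = rest;
                }
                // The carried bytes can never become valid. Replace them and
                // leave `byte` in `input` so it is re-examined below.
                Err(_) => {
                    out.extend_from_slice(REPLACEMENT);
                    self.carry_len = 0;
                }
            }
        }

        // Then decode the rest of the chunk.
        loop {
            match str::from_utf8(input) {
                Ok(_) => {
                    out.extend_from_slice(input);
                    return;
                }
                Err(e) => {
                    let (valid, rest) = input.split_at(e.valid_up_to());
                    out.extend_from_slice(valid);
                    match e.error_len() {
                        Some(len) => {
                            out.extend_from_slice(REPLACEMENT);
                            input = &rest[len..];
                        }
                        None => {
                            // Incomplete tail (at most 3 bytes): carry it over.
                            self.carry[..rest.len()].copy_from_slice(rest);
                            self.carry_len = rest.len();
                            return;
                        }
                    }
                }
            }
        }
    }

    /// Call at end of input: a dangling partial character becomes U+FFFD.
    fn finish(&mut self, out: &mut Vec<u8>) {
        if self.carry_len > 0 {
            out.extend_from_slice(REPLACEMENT);
            self.carry_len = 0;
        }
    }
}

fn main() {
    let mut decoder = LossyDecoder::default();
    let mut out = Vec::new();
    // "€" is E2 82 AC; here it is split across three chunks.
    decoder.feed(b"a\xE2", &mut out);
    decoder.feed(b"\x82", &mut out);
    decoder.feed(b"\xACb", &mut out);
    decoder.finish(&mut out);
    assert_eq!(out, "a€b".as_bytes());
}
```

Only the carry array persists between chunks, so memory use stays constant regardless of input size.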

Error Handling

The library follows Rust's standard error handling patterns:

  • I/O errors from the underlying reader are propagated
  • Invalid UTF-8 sequences are handled gracefully with replacement characters
  • The output is always valid UTF-8

Dependencies

  • tokio: Async runtime and I/O utilities
  • pin-project-lite: For safe pin projection in async contexts
  • tracing: For logging and debugging support

Compatibility

  • Rust Edition: 2024
  • Minimum Rust Version: 1.85 or later, the first stable release that supports the 2024 edition
  • Tokio Version: Compatible with Tokio 1.x

Testing

The library includes comprehensive tests covering:

  • Valid UTF-8 input
  • Invalid UTF-8 sequences
  • Incomplete UTF-8 at buffer boundaries
  • Large file handling
  • Mixed valid/invalid content
  • Edge cases (empty input, leading/trailing invalid bytes)

Run tests with:

cargo test

License

This project is licensed under either of the MIT license or the Apache License, Version 2.0, at your option.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.