utf8-bufread 0.1.2

A BufReader that doesn't stop on newlines

UTF-8 Buffered Reader

Provides a read_utf8 function for all types implementing BufRead, letting you read text files without loading huge amounts of data into memory when they have very long lines or no newline delimiters at all.

Usage

Add this crate as a dependency in your Cargo.toml:

[dependencies]
utf8-bufread = "0.1.1"

This lets you call read_utf8 on any object implementing std::io::BufRead. The function reads from the stream and returns a UTF-8 String:

use utf8_bufread::BufRead;
use std::io::BufReader;

assert_eq!(
    "💖", 
    BufReader::<&[u8]>::new("💖".as_ref())
        .read_utf8()
        .unwrap()
);

A common issue when reading large text files with the Rust standard library is that they may have extremely long lines or no newline delimiters at all. BufRead::read_line and BufRead::lines then load a large amount of data into memory, which may not be desirable.

The read_utf8 function, on the other hand, reads only until the reader's buffer is full, so a single call never loads more than the buffer's capacity.
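The difference can be observed with the standard library alone. The sketch below is only an illustration: the two helper functions are hypothetical names, not part of this crate, and the second one uses fill_buf to stand in for a read that stops once the buffer is full.

```rust
use std::io::{BufRead, BufReader};

/// Hypothetical helper: number of bytes a single `read_line` call
/// loads into the destination `String`.
fn bytes_loaded_by_read_line(data: &[u8], capacity: usize) -> usize {
    let mut reader = BufReader::with_capacity(capacity, data);
    let mut line = String::new();
    // Without a newline in `data`, this keeps reading until end of input.
    reader.read_line(&mut line).expect("input should be valid UTF-8");
    line.len()
}

/// Hypothetical helper: number of bytes exposed by a single buffer fill.
fn bytes_loaded_by_one_fill(data: &[u8], capacity: usize) -> usize {
    let mut reader = BufReader::with_capacity(capacity, data);
    // `fill_buf` never returns more than the buffer can hold.
    let filled = reader
        .fill_buf()
        .expect("reading from a slice should not fail")
        .len();
    filled
}
```

With 1000 bytes of input containing no newline and a buffer capacity of 8, the first helper loads all 1000 bytes at once, while the second exposes only 8.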

If valid UTF-8 is read, it is always returned. If an invalid or incomplete codepoint is encountered, the function first returns all the valid bytes read before it, and the next call returns an InvalidData error:

use utf8_bufread::BufRead;
use std::io::{BufReader, ErrorKind};

fn main() {
    // "foo\nbar" + some invalid bytes
    // We give the buffer more than enough capacity to be 
    // able to read all the bytes in one call
    let mut reader = BufReader::with_capacity(
        16,
        [0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96].as_ref(),
    );
   
    // The first read_utf8() call reads everything before
    // the first invalid byte (i.e. "foo\nbar")
    let read_str = reader
        .read_utf8()
        .expect("We will get all the valid bytes");
    assert_eq!("foo\nbar", read_str);
   
    // The second call returns an InvalidData error wrapping
    // the Utf8Error, as there are no bytes left that form
    // valid codepoints
    let read_err = reader
        .read_utf8()
        .expect_err("We will get an error");
    assert_eq!(ErrorKind::InvalidData, read_err.kind());
}
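For readers curious how this "valid prefix first, error next" behavior can arise, here is a minimal sketch built only on the standard library. It is not this crate's actual implementation: read_valid_prefix is a hypothetical name, and the sketch assumes the logic can be expressed with fill_buf, consume and Utf8Error::valid_up_to.

```rust
use std::io::{self, BufRead, Error, ErrorKind};

/// Hypothetical stand-in for `read_utf8`, written against std only.
fn read_valid_prefix<R: BufRead>(reader: &mut R) -> io::Result<String> {
    let (text, used) = {
        // Look at whatever the reader currently has buffered.
        let buf = reader.fill_buf()?;
        match std::str::from_utf8(buf) {
            // The whole buffer is valid UTF-8.
            Ok(s) => (s.to_owned(), buf.len()),
            // Some valid bytes come before the problem: return only those.
            Err(e) if e.valid_up_to() > 0 => {
                let n = e.valid_up_to();
                let prefix = std::str::from_utf8(&buf[..n])
                    .expect("bytes before valid_up_to are valid UTF-8");
                (prefix.to_owned(), n)
            }
            // The buffer starts with an invalid or incomplete codepoint.
            Err(e) => return Err(Error::new(ErrorKind::InvalidData, e)),
        }
    };
    // Mark only the returned bytes as read; the offending bytes stay
    // in the buffer and trigger the error on the next call.
    reader.consume(used);
    Ok(text)
}
```

Called twice on the reader from the example above, this sketch yields "foo\nbar" and then an InvalidData error. It also reports an error when a valid codepoint happens to be split across two buffer fills, a case a more complete implementation would need to handle.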

Work in progress

This crate is fairly new, and for now only provides the read_utf8 function, with a rather simple implementation. In the near future these features should be added:

  • Lossy and unchecked versions of read_utf8 (see from_utf8_lossy & from_utf8_unchecked).
  • A chars iterator from the buffer, and its lossy version.
  • I'm open to suggestions, if you have ideas 😉

I am also looking for a way for read_utf8 to return a &str instead of a String. The reader would stay borrowed until the returned reference goes out of scope, and the user could choose whether or not to clone the data. For the moment, the codepoints read are always copied into a new String.
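One possible shape for such a borrowing API is sketched below, using only the standard library. This is an assumption about how it could look, not a planned design: peek_utf8 is a hypothetical name, and in this sketch the caller is responsible for calling consume once the borrowed string is no longer needed.

```rust
use std::io::{self, BufRead, Error, ErrorKind};

/// Hypothetical borrowing variant: the returned `&str` points directly
/// into the reader's internal buffer, so nothing is copied.
fn peek_utf8<R: BufRead>(reader: &mut R) -> io::Result<&str> {
    let buf = reader.fill_buf()?;
    match std::str::from_utf8(buf) {
        Ok(s) => Ok(s),
        // Return only the valid prefix when the buffer ends badly.
        Err(e) if e.valid_up_to() > 0 => {
            Ok(std::str::from_utf8(&buf[..e.valid_up_to()])
                .expect("bytes before valid_up_to are valid UTF-8"))
        }
        Err(e) => Err(Error::new(ErrorKind::InvalidData, e)),
    }
}
```

Because the returned reference keeps the reader mutably borrowed, the caller has to finish with the string (or clone it) and record its length before calling reader.consume(len) to advance.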

Finally, I want to test and benchmark this crate.

Since I'm far from the most experienced developer, you are very welcome to submit pull requests.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0, see the LICENSE file in the root directory of this repository.