UTF-8 Buffered Reader

Provides alternatives to BufRead::read_line and BufRead::lines that return valid UTF-8 strings but do not depend on newline delimiters, which avoids loading large amounts of data into memory when reading files with few newlines.

Usage

Add this crate as a dependency in your Cargo.toml:

[dependencies]
utf8-bufread = "0.1.5"

This will allow you to use the BufRead trait provided by this crate, which is automatically implemented for any type implementing std::io::BufRead.

This trait provides functions to read UTF-8 strings from a stream, but none of those functions guarantees that the chunk of data read will end on a newline delimiter (unlike BufRead::read_line or BufRead::lines). This allows you to use buffered readers and std::io::BufRead's API on a large stream without worrying about loading a huge amount of data into memory if there is no newline delimiter.
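To make the memory concern concrete, here is a minimal sketch that uses only the standard library. The helper name first_line_len is hypothetical and exists only for this illustration. It shows that std's read_line keeps growing the target string until it meets a newline or EOF, regardless of how small the reader's internal buffer is.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Hypothetical helper: reads one "line" with std's `read_line` and
/// returns how many bytes ended up in the `String`.
fn first_line_len(data: &str) -> std::io::Result<usize> {
  // A deliberately tiny 16-byte internal buffer
  let mut reader = BufReader::with_capacity(16, Cursor::new(data.as_bytes()));
  let mut line = String::new();
  reader.read_line(&mut line)?;
  Ok(line.len())
}

fn main() -> std::io::Result<()> {
  // 10 000 bytes without a single newline
  let data = "a".repeat(10_000);
  // `read_line` accumulates the whole stream before returning
  assert_eq!(10_000, first_line_len(&data)?);
  Ok(())
}
```

With a multi-gigabyte file that contains no newline, the same call would try to hold the entire file in memory, which is the situation this crate is meant to avoid.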

The functions of this trait are centered around BufRead::with_utf8_chunk, which takes a closure that is passed the string slice of UTF-8 data read from the inner reader, and returns an io::Result holding the number of bytes read, in the same fashion as most functions of std::io's traits and structs. The string slice may be of arbitrary length and may stop at any point in the stream, but will always contain valid UTF-8.

fn main() {
  use std::io::Cursor;
  use utf8_bufread::BufRead;

  // Cursor implements BufRead when wrapping a string slice
  let mut reader = Cursor::new(
    "The quick fox jumps over the lazy dog"
  );
  let mut o_counter = 0;

  // Counts the number of "o"s in the stream
  loop {
    match reader.with_utf8_chunk(|s| {
      o_counter += s.matches('o').count()
    }) {
      Ok(0) | Err(_) => break,
      Ok(_) => continue,
    }
  }
  assert_eq!(3, o_counter);
}

The trait also provides functions to append to a provided buffer and to iterate over read chunks.

fn main() {
  use std::fs::File;
  use std::io::BufReader;
  use utf8_bufread::BufRead;

  // Open our file
  let mut reader = BufReader::new(
    File::open("my_file.txt").unwrap()
  );
  // The string we'll use to store the text of the file
  let mut text = String::new();
  loop {
    // Loop until EOF
    match reader.read_utf8(&mut text) {
      Ok(0) => break, // EOF
      Ok(_) => {
        continue
      }
      Err(e) => std::panic::panic_any(e),
    }
  }
  
  // Do something with `text` ...
}

If a valid UTF-8 codepoint is read, it will always be processed, whether it is passed to a closure or appended to the provided buffer. If an invalid or incomplete codepoint is read, the functions of this crate will first process all the valid bytes read, and a relevant io::Error will be returned on the next call:

fn main() {
  use std::io::{Cursor, ErrorKind};
  use std::str::Utf8Error;
  use utf8_bufread::BufRead;

  // Cursor implements BufRead when wrapping a u8 slice
  // "foo\nbar" + some invalid bytes
  let mut reader = Cursor::new([
    0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
  ]);
  let mut n_read = 0;
  let mut buf = String::new();

  // First read all the valid bytes until EOF or error
  // (in this case, an error)
  let err = loop {
    match reader.read_utf8(&mut buf) {
      Ok(0) => break Ok(()),
      Ok(n) => {
        n_read += n;
        continue;
      }
      Err(e) => break Err(e),
    };
  };
  // We did get all our valid bytes
  assert_eq!("foo\nbar", buf.as_str());
  assert_eq!(7, n_read);

  // And our last call gave us an `io::Error` caused by a
  // `std::str::Utf8Error`
  assert!(err.is_err());
  let err = err.unwrap_err();
  assert_eq!(ErrorKind::InvalidData, err.kind());
  let err = err.into_inner();
  assert!(err.is_some());
  assert!(err.unwrap().is::<Utf8Error>());
}
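The "valid bytes first, error on the next call" behavior can be approximated with the standard library alone. The sketch below is not this crate's implementation; read_valid_chunk is a hypothetical helper written for illustration, built on std::io::BufRead::fill_buf, std::str::from_utf8 and Utf8Error::valid_up_to. It assumes the whole invalid sequence is visible in the reader's buffer, and it does not handle a codepoint split across two buffer refills.

```rust
use std::io::{self, BufRead, Cursor, ErrorKind};

/// Hypothetical helper (not part of utf8-bufread): appends the next valid
/// UTF-8 chunk to `buf` and returns the number of bytes consumed.
fn read_valid_chunk<R: BufRead>(reader: &mut R, buf: &mut String) -> io::Result<usize> {
  let bytes = reader.fill_buf()?;
  let valid = match std::str::from_utf8(bytes) {
    // The whole buffer is valid UTF-8 (an empty buffer means EOF)
    Ok(s) => s,
    // Some leading bytes are valid: keep them, report the error next time
    Err(e) if e.valid_up_to() > 0 => {
      std::str::from_utf8(&bytes[..e.valid_up_to()]).expect("prefix was validated")
    }
    // Nothing valid at the start: surface the error now
    Err(e) => return Err(io::Error::new(ErrorKind::InvalidData, e)),
  };
  buf.push_str(valid);
  let n = valid.len();
  // Only the valid bytes are marked as consumed
  reader.consume(n);
  Ok(n)
}

fn main() {
  // "foo\nbar" followed by some invalid bytes
  let mut reader = Cursor::new(&b"foo\nbar\x9f\x92\x96\x00"[..]);
  let mut buf = String::new();

  // The first call returns the valid prefix
  assert_eq!(7, read_valid_chunk(&mut reader, &mut buf).unwrap());
  assert_eq!("foo\nbar", buf);

  // The second call reports the invalid data
  let err = read_valid_chunk(&mut reader, &mut buf).unwrap_err();
  assert_eq!(ErrorKind::InvalidData, err.kind());
}
```

Deferring the error this way means a caller never loses text that was decoded successfully, at the cost of one extra call to observe the failure.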

Work in progress

This crate is fairly new, and for now only provides a limited API with a rather simple implementation. The following features should be added in the near future:

A lossy and unchecked version of read_utf8 (see from_utf8_lossy & from_utf8_unchecked).
A chars iterator from the buffer, and its lossy version.
I'm open to suggestions, if you have ideas 😉
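As a point of reference for the planned lossy variant, this is how the standard library's String::from_utf8_lossy treats the invalid bytes from the earlier example. The helper decode_lossy is hypothetical and only wraps the std call; it says nothing about how a future lossy function of this crate would be named or shaped.

```rust
/// Hypothetical helper: decodes `bytes`, replacing each invalid sequence
/// with U+FFFD (the Unicode replacement character).
fn decode_lossy(bytes: &[u8]) -> String {
  String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
  // "foo\nbar" + three stray continuation bytes + a NUL byte
  let bytes = [
    0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
  ];
  let text = decode_lossy(&bytes);
  // Each stray byte becomes one replacement character; NUL is valid UTF-8
  assert_eq!("foo\nbar\u{FFFD}\u{FFFD}\u{FFFD}\0", text);
}
```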

Being new also means the crate may have a pretty unstable API.

Given that I'm not the most experienced developer, you are very welcome to submit pull requests here.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.
