utf8-bufread 0.1.5

Provides alternatives to BufRead's read_line & lines that do not stop on newlines

UTF-8 Buffered Reader

Provides alternatives to BufRead::read_line and BufRead::lines that return UTF-8 strings but do not stop on newline delimiters, to avoid loading large amounts of data into memory when reading files with few newlines.

Usage

Add this crate as a dependency in your Cargo.toml:

[dependencies]
utf8-bufread = "0.1.5"

This lets you use the BufRead trait provided by this crate, which is automatically implemented for any type implementing std::io::BufRead.

This trait provides functions to read UTF-8 strings from a stream, but none of those functions guarantee that the chunk of data read will end on a newline delimiter (unlike BufRead::read_line or BufRead::lines). This allows you to use buffered readers and std::io::BufRead's API on a large stream without worrying about loading a huge amount of data into memory if there is no newline delimiter.
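
To make the memory concern concrete, here is a minimal sketch that uses only the standard library, not this crate. The helper names peak_read_line and peak_fill_buf and the 16-byte buffer capacity are assumptions chosen for illustration. It compares the largest amount of data held at once when a stream is read with read_line versus in raw buffer-sized chunks from fill_buf. The raw chunks are plain bytes with no UTF-8 validation, which is the part this crate adds on top.

```rust
use std::io::{BufRead, BufReader, Cursor};

// Hypothetical helper: size of the largest String held at once
// when the data is read with `read_line`.
fn peak_read_line(data: &[u8]) -> usize {
  // The 16-byte capacity is an arbitrary choice for illustration.
  let mut reader = BufReader::with_capacity(16, Cursor::new(data));
  let mut line = String::new();
  let mut peak = 0;
  // `read_line` keeps appending to `line` until it sees b'\n' or EOF.
  while reader.read_line(&mut line).expect("input should be valid UTF-8") > 0 {
    peak = peak.max(line.len());
    line.clear();
  }
  peak
}

// Hypothetical helper: size of the largest chunk returned by `fill_buf`
// when the data is consumed one buffer at a time.
fn peak_fill_buf(data: &[u8]) -> usize {
  let mut reader = BufReader::with_capacity(16, Cursor::new(data));
  let mut peak = 0;
  loop {
    let n = reader.fill_buf().expect("in-memory reads should not fail").len();
    if n == 0 {
      break; // EOF
    }
    peak = peak.max(n);
    reader.consume(n);
  }
  peak
}
```

For a 1000-byte input without any newline, peak_read_line reports the whole input, while peak_fill_buf never exceeds the buffer capacity.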

The functions of this trait are centered around BufRead::with_utf8_chunk, which takes a closure that is passed the string slice of UTF-8 data read from the inner reader and returns an io::Result containing the number of bytes read, in the same fashion as most functions of std::io's traits and structs. The string slice may be of arbitrary length and may stop at any point in the stream, but will always contain valid UTF-8.

fn main() {
  use std::io::Cursor;
  use utf8_bufread::BufRead;

  // Cursor implements BufRead when wrapping a string slice
  let mut reader = Cursor::new(
    "The quick fox jumps over the lazy dog"
  );
  let mut o_counter = 0;

  // Counts the number of "o"s in the stream
  loop {
    match reader.with_utf8_chunk(|s| {
      o_counter += s.matches('o').count()
    }) {
      Ok(0) | Err(_) => break,
      Ok(_) => continue,
    }
  }
  assert_eq!(3, o_counter);
}
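
For readers curious how a closure-based chunk reader of this shape could be built, here is a rough sketch on top of std::io::BufRead alone. It is not this crate's implementation: the function name with_valid_chunk is hypothetical, and the sketch makes a simplifying assumption by treating a codepoint that is split across two buffer fills as an error instead of waiting for more data.

```rust
use std::io::{self, BufRead, ErrorKind};
use std::str;

// Hypothetical helper, not part of utf8-bufread: passes the longest valid
// UTF-8 prefix of the reader's current buffer to `f` and returns the number
// of bytes consumed (0 at EOF).
fn with_valid_chunk<R, F>(reader: &mut R, mut f: F) -> io::Result<usize>
where
  R: BufRead,
  F: FnMut(&str),
{
  let buf = reader.fill_buf()?;
  if buf.is_empty() {
    return Ok(0); // EOF
  }
  let valid_len = match str::from_utf8(buf) {
    // The whole buffer is valid UTF-8
    Ok(s) => s.len(),
    // Some valid bytes come first: process them now, fail on the next call
    Err(e) if e.valid_up_to() > 0 => e.valid_up_to(),
    // Simplification: no valid byte at all is always reported as an error
    Err(e) => return Err(io::Error::new(ErrorKind::InvalidData, e)),
  };
  let s = str::from_utf8(&buf[..valid_len]).expect("prefix was just validated");
  f(s);
  reader.consume(valid_len);
  Ok(valid_len)
}
```

It can be driven with the same kind of loop as the example above, calling it until it returns Ok(0) or an error.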

The trait also provides functions to append to a provided buffer and to iterate over read chunks.

fn main() {
  use std::fs::File;
  use std::io::BufReader;
  use utf8_bufread::BufRead;

  // Open our file
  let mut reader = BufReader::new(
    File::open("my_file.txt").unwrap()
  );
  // The string we'll use to store the text of the file
  let mut text = String::new();
  // Loop until EOF
  loop {
    match reader.read_utf8(&mut text) {
      Ok(0) => break, // EOF
      Ok(_) => continue,
      Err(e) => panic!("failed to read file: {}", e),
    }
  }

  // Do something with `text` ...
}

If a valid UTF-8 codepoint is read, it will always be processed, whether it is passed to a closure or appended to a provided buffer. If an invalid or incomplete codepoint is read, the functions of this crate will first process all the valid bytes read, and a relevant io::Error will be returned on the next call:

fn main() {
  use std::io::{Cursor, ErrorKind};
  use std::str::Utf8Error;
  use utf8_bufread::BufRead;

  // Cursor implements BufRead when wrapping a u8 slice
  // "foo\nbar" + some invalid bytes
  let mut reader = Cursor::new([
    0x66u8, 0x6f, 0x6f, 0xa, 0x62, 0x61, 0x72, 0x9f, 0x92, 0x96, 0x0,
  ]);
  let mut n_read = 0;
  let mut buf = String::new();

  // First read all the valid bytes until EOF or error
  // (in this case, an error)
  let err = loop {
    match reader.read_utf8(&mut buf) {
      Ok(0) => break Ok(()),
      Ok(n) => {
        n_read += n;
        continue;
      }
      Err(e) => break Err(e),
    };
  };
  // We did get all our valid bytes
  assert_eq!("foo\nbar", buf.as_str());
  assert_eq!(7, n_read);

  // And our last call gave us an `io::Error` caused by an
  // std::str::Utf8Error
  assert!(err.is_err());
  let err = err.unwrap_err();
  assert_eq!(ErrorKind::InvalidData, err.kind());
  let err = err.into_inner();
  assert!(err.is_some());
  assert!(err.unwrap().is::<Utf8Error>());
}
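
The "valid bytes first, error afterwards" behaviour can be reproduced on a plain byte slice with the standard library. The sketch below is only an illustration under that assumption: the helper name split_valid is hypothetical and does not exist in this crate. It relies on Utf8Error::valid_up_to to find where the valid prefix ends and wraps the Utf8Error in an io::Error of kind InvalidData, mirroring the error inspected in the example above.

```rust
use std::io::{self, ErrorKind};
use std::str;

// Hypothetical helper, not part of utf8-bufread: returns the longest valid
// UTF-8 prefix of `bytes`, plus an `io::Error` if invalid data follows it.
fn split_valid(bytes: &[u8]) -> (&str, Option<io::Error>) {
  match str::from_utf8(bytes) {
    Ok(s) => (s, None),
    Err(e) => {
      // `valid_up_to` is the index of the first byte that failed validation
      let valid = str::from_utf8(&bytes[..e.valid_up_to()])
        .expect("prefix was just validated");
      (valid, Some(io::Error::new(ErrorKind::InvalidData, e)))
    }
  }
}
```

Applied to the bytes used in the example above, this yields "foo\nbar" together with an error whose inner value is a std::str::Utf8Error.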

Work in progress

This crate is fairly new and for now only provides a limited API with a rather simple implementation. In the near future, these features should be added:

  • A lossy and unchecked version of read_utf8 (see from_utf8_lossy & from_utf8_unchecked).
  • A chars iterator from the buffer, and its lossy version.
  • I'm open to suggestions if you have ideas 😉

This also means the API may be fairly unstable.

Given that I'm not the most experienced developer, you are very welcome to submit pull requests here.

License

Utf8-BufRead is distributed under the terms of the Apache License 2.0; see the LICENSE file in the root directory of this repository.