# UTF-8 Buffered Reader
This crate provides functions to read UTF-8 text from any
type implementing [`io::BufRead`][io::BufRead] through a
trait, [`BufRead`][BufRead], without waiting for newline
delimiters. These functions take advantage of buffering and
return either a `&`[`str`][str] or [`char`][char]s. Each has
an associated iterator, and some also have an equivalent of
a [`Map`][Map] iterator that avoids allocation and cloning.
[![crates.io](http://img.shields.io/crates/v/utf8-bufread.svg)](https://crates.io/crates/utf8_bufread)
[![docs.rs](https://docs.rs/utf8-bufread/badge.svg)](https://docs.rs/utf8-bufread/latest/utf8_bufread)
[![build status](https://gitlab.com/Austreelis/utf8-bufread/badges/main/pipeline.svg)](https://gitlab.com/Austreelis/utf8-bufread/-/commits/main)
# Usage
Add this crate as a dependency in your `Cargo.toml`:
```toml
[dependencies]
utf8-bufread = "1.0.0"
```
The simplest way to read a file using this crate may be
something along the lines of the following:
```rust
use std::io::{Cursor, ErrorKind};
use utf8_bufread::BufRead;

// Reader may be any type implementing io::BufRead
// We'll just use a cursor wrapping a slice for this example
let mut reader = Cursor::new("Löwe 老虎 Léopard");
loop { // Loop until EOF
    match reader.read_str() {
        Ok(s) => {
            if s.is_empty() {
                break; // EOF
            }
            // Do something with `s` ...
            print!("{}", s);
        }
        Err(e) => {
            // We should try again if we get interrupted
            if e.kind() != ErrorKind::Interrupted {
                break;
            }
        }
    }
}
```
## Reading arbitrary-length string slices
The [`read_str`][read_str] function returns a
`&`[`str`][str] of arbitrary length (up to the reader's
buffer capacity) read from the inner reader. It does not
clone data unless a valid codepoint ends up cut at the end
of the reader's buffer. Its associated iterator can be
obtained by calling [`str_iter`][str_iter]. Since that
iterator clones the data at each iteration,
[`str_map`][str_map] is also provided.
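
To illustrate what "a codepoint cut at the end of the
reader's buffer" means, here is a minimal sketch using only
the standard library. It is not this crate's implementation,
and the helper name `valid_prefix` is hypothetical: it only
inspects the current buffer and reports how much of it is
valid UTF-8.

```rust
use std::io::{self, BufRead};
use std::str;

/// Hypothetical helper (not part of this crate): returns how many bytes at
/// the start of the reader's current buffer are valid UTF-8, and whether the
/// remaining bytes look like a codepoint cut short by the end of the buffer.
fn valid_prefix<R: BufRead>(reader: &mut R) -> io::Result<(usize, bool)> {
    // `fill_buf` exposes the buffered bytes without consuming them.
    let buf = reader.fill_buf()?;
    match str::from_utf8(buf) {
        Ok(s) => Ok((s.len(), false)),
        // `error_len() == None` means the input ended in the middle of a
        // codepoint, as opposed to containing an invalid byte sequence.
        Err(e) => Ok((e.valid_up_to(), e.error_len().is_none())),
    }
}
```

For instance, with a `BufReader::with_capacity(4, ...)`
wrapping the bytes of `"老虎"` (3 bytes per character), the
buffer holds the first character plus the first byte of the
second one, so the helper reports 3 valid bytes and a cut
codepoint. The cut bytes have to be copied somewhere before
the buffer can be refilled, which is why cloning is needed
in that case.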
## Reading codepoints
The [`read_char`][read_char] function returns a
[`char`][char] read from the inner reader. Its associated
iterator can be obtained by calling
[`char_iter`][char_iter].
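
For readers curious about what reading a single codepoint
involves, the following is a rough sketch written against
the standard library only. It is an assumption-laden
illustration, not how [`read_char`][read_char] is
implemented: the helper name `read_one_char` is
hypothetical, and unlike this crate it does not retry on
interruption or preserve leftover bytes on error.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper (not part of this crate): reads exactly one UTF-8
/// encoded `char` from `reader`, returning `Ok(None)` at EOF.
fn read_one_char<R: Read>(reader: &mut R) -> io::Result<Option<char>> {
    let mut bytes = [0u8; 4];
    // Read the leading byte; reading zero bytes means EOF.
    if reader.read(&mut bytes[..1])? == 0 {
        return Ok(None);
    }
    // The leading byte tells how many bytes the encoded codepoint spans.
    let len = match bytes[0] {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => return Err(ErrorKind::InvalidData.into()),
    };
    // Read the continuation bytes, if any.
    reader.read_exact(&mut bytes[1..len])?;
    // `from_utf8` rejects overlong forms, surrogates and bad continuation bytes.
    let s = std::str::from_utf8(&bytes[..len])
        .map_err(|e| io::Error::new(ErrorKind::InvalidData, e))?;
    Ok(s.chars().next())
}
```

Reading byte by byte like this is only reasonable on a
buffered reader, which is one reason this crate builds on
[`io::BufRead`][io::BufRead].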
## Iterator types
This crate provides several structs offering different ways
of iterating over the inner reader's data:
- [`StrIter`][StrIter] and
[`CodepointIter`][CodepointIter] clone the data on each
iteration, but use an [`Rc`][Rc] to check whether the
returned [`String`][String] buffer is still in use. If not,
it is re-used to avoid re-allocating.
```rust
use std::io::Cursor;
use utf8_bufread::BufRead;

let mut reader = Cursor::new("Löwe 老虎 Léopard");
for s in reader.str_iter().filter_map(|r| r.ok()) {
    print!("{}", s);
}
```
- [`StrMap`][StrMap] and [`CodepointMap`][CodepointMap]
allow access to the read data without allocating or
copying, but that data cannot then be passed to further
iterator adapters.
```rust
use std::io::Cursor;
use utf8_bufread::BufRead;

let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count: usize = reader
    .str_map(|s| s.len())
    .filter_map(Result::ok)
    .sum();
println!("There are {} valid UTF-8 bytes in {}", count, s);
```
- [`CharIter`][CharIter] is similar to [`StrIter`][StrIter]
and others, except it relies on [`char`][char]s
implementing [`Copy`][Copy] and thus needs neither a buffer
nor the "`Rc` trick".
```rust
use std::io::Cursor;
use utf8_bufread::BufRead;

let s = "Löwe 老虎 Léopard";
let mut reader = Cursor::new(s);
let count = reader
    .char_iter()
    .filter_map(Result::ok)
    .filter(|c| c.is_lowercase())
    .count();
assert_eq!(count, 9);
```
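
The "`Rc` trick" mentioned in the list above can be sketched
with the standard library alone. The following is only an
illustration of the idea, assuming it relies on something
like `Rc::get_mut`; the `Words` iterator is hypothetical and
is not one of this crate's types.

```rust
use std::rc::Rc;

/// Hypothetical iterator (not part of this crate) yielding each word of a
/// text as an `Rc<String>`, re-using its buffer whenever every previously
/// returned `Rc` has been dropped.
struct Words<'a> {
    words: std::str::SplitWhitespace<'a>,
    buf: Rc<String>,
}

impl<'a> Words<'a> {
    fn new(text: &'a str) -> Self {
        Words { words: text.split_whitespace(), buf: Rc::new(String::new()) }
    }
}

impl<'a> Iterator for Words<'a> {
    type Item = Rc<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let word = self.words.next()?;
        match Rc::get_mut(&mut self.buf) {
            // We hold the only reference: overwrite the buffer in place.
            Some(buf) => {
                buf.clear();
                buf.push_str(word);
            }
            // The caller kept a previous item alive: allocate a fresh buffer.
            None => self.buf = Rc::new(word.to_owned()),
        }
        Some(Rc::clone(&self.buf))
    }
}
```

In a plain `for` loop each item is dropped before the next
one is requested, so a single allocation is re-used
throughout. Collecting the items into a `Vec` keeps them
alive, and each one then gets its own buffer.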
All these iterators read data until EOF is reached or an
invalid codepoint is found. If valid codepoints are read
from the inner reader, they *will* be returned before an
error is reported. After encountering an error or EOF, they
always return [`None`][option]. They always ignore any
[`Interrupted`][Interrupted] error.
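
The "valid data first, then the error, then `None`" contract
described above can be sketched over a plain byte slice with
the standard library. This is a simplified illustration
under that assumption, not this crate's code; the
`ValidThenError` type is hypothetical.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical iterator (not part of this crate) over a byte slice that
/// yields the valid UTF-8 prefix first, then the error, then `None` forever.
struct ValidThenError<'a> {
    rest: &'a [u8],
    done: bool,
}

impl<'a> ValidThenError<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        ValidThenError { rest: bytes, done: false }
    }
}

impl<'a> Iterator for ValidThenError<'a> {
    type Item = Result<&'a str, Utf8Error>;

    fn next(&mut self) -> Option<Self::Item> {
        let bytes = self.rest;
        if self.done || bytes.is_empty() {
            return None;
        }
        match str::from_utf8(bytes) {
            // Everything left is valid: return it and stop.
            Ok(s) => {
                self.rest = &[];
                Some(Ok(s))
            }
            // Some valid data precedes the error: return that data first.
            Err(e) if e.valid_up_to() > 0 => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                self.rest = rest;
                // This prefix was just validated, so decoding cannot fail.
                Some(Ok(str::from_utf8(valid).expect("validated prefix")))
            }
            // The error sits at the very start: report it, then stay finished.
            Err(e) => {
                self.done = true;
                Some(Err(e))
            }
        }
    }
}
```

Given the bytes `b"ab\xFFcd"`, this yields `Ok("ab")`, then
an error, then `None`, so no valid data read before the
invalid byte is lost.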
# Work in progress
This crate is still a work in progress. Part of its API can
be considered stable:
- [`read_str`][read_str], [`read_codepoint`][read_codepoint] and [`read_char`][read_char]'s behavior and signature.
- [`str_iter`][str_iter], [`str_map`][str_map], [`codepoints_iter`][codepoints_iter], [`codepoints_map`][codepoints_map]
and [`char_iter`][char_iter]'s behavior and signature.
- [`StrIter`][StrIter], [`StrMap`][StrMap], [`CodepointIter`][CodepointIter], [`CodepointMap`][CodepointMap] and
[`CharIter`][CharIter]'s API.
However some features are still considered unstable:
- [`Error`](Error)'s behavior, particularly regarding its [`kind`](kind) and how it avoids
data loss (see [`leftovers`](leftovers)).
And some features still have to be added:
- Lossy and unchecked versions of `read_*` (see
[`from_utf8_lossy`][from_utf8_lossy] &
[`from_utf8_unchecked`][from_utf8_unchecked]).
- (Optional) Support for grapheme clusters using the [`unicode-segmentation`][unicode-segmentation]
crate, in the same fashion as [`read_codepoint`][read_codepoint].
- I'm open to suggestions if you have ideas 😉
Given I'm not the most experienced developer, you are
very welcome to submit issues and pull requests
[here](https://gitlab.com/Austreelis/utf8-bufread).
# License
Utf8-BufRead is distributed under the terms of the Apache
License 2.0. See the
[LICENSE](https://gitlab.com/Austreelis/utf8-bufread/-/blob/main/LICENSE)
file in the root directory of this repository.
[io::BufRead]: https://doc.rust-lang.org/std/io/trait.BufRead.html
[str]: https://doc.rust-lang.org/std/primitive.str.html
[char]: https://doc.rust-lang.org/std/primitive.char.html
[Map]: https://doc.rust-lang.org/std/iter/struct.Map.html
[Rc]: https://doc.rust-lang.org/std/rc/struct.Rc.html
[String]: https://doc.rust-lang.org/std/string/struct.String.html
[Copy]: https://doc.rust-lang.org/std/marker/trait.Copy.html
[option]: https://doc.rust-lang.org/std/option/index.html
[Interrupted]: https://doc.rust-lang.org/std/io/enum.ErrorKind.html#variant.Interrupted
[from_utf8_lossy]: https://doc.rust-lang.org/nightly/alloc/string/struct.String.html#method.from_utf8_lossy
[from_utf8_unchecked]: https://doc.rust-lang.org/nightly/alloc/string/struct.String.html#method.from_utf8_unchecked
[unicode-segmentation]: https://docs.rs/unicode-segmentation/latest/unicode_segmentation/index.html
[BufRead]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html
[read_str]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.read_str
[str_iter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.str_iter
[str_map]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.str_map
[read_codepoint]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.read_codepoint
[codepoints_iter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.codepoints_iter
[codepoints_map]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.codepoints_map
[read_char]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.read_char
[char_iter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/trait.BufRead.html#method.char_iter
[StrIter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.StrIter.html
[StrMap]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.StrMap.html
[CodepointIter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.CodepointIter.html
[CodepointMap]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.CodepointMap.html
[CharIter]: https://docs.rs/utf8-bufread/1.0.0/utf8_bufread/struct.CharIter.html