Crate regex_chunker


The centerpiece of this crate is the ByteChunker, which takes a regular expression and wraps a Read type. The result is an iterator over the bytes read from the wrapped type that yields chunks delimited by matches of the supplied regular expression.

The example program below uses a ByteChunker to do a crude word tally on text coming in on the standard input.

use std::{collections::BTreeMap, error::Error};
use regex_chunker::ByteChunker;
  
fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();
    
    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}

Enabling the async feature also exposes the stream module, which features an async version of the ByteChunker, wrapping an AsyncRead and implementing Stream.

(This also pulls in several crates of tokio machinery, which is why it’s behind a feature flag.)
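To make the "chunks from a stream rather than from an in-memory slice" idea concrete, here is a minimal standard-library-only sketch. It is not this crate's implementation: `DelimChunker` and `split_stream` are hypothetical names, and the sketch splits on runs of fixed delimiter bytes instead of regex matches. What it does show is the buffering a streaming splitter needs, since a delimiter run may straddle two reads.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for illustration only: splits a byte stream on runs
/// of delimiter bytes (not regex matches) and drops the delimiters.
struct DelimChunker<R: Read> {
    source: R,
    delims: Vec<u8>,
    buffer: Vec<u8>,
    at_eof: bool,
}

impl<R: Read> DelimChunker<R> {
    fn new(source: R, delims: &[u8]) -> Self {
        Self { source, delims: delims.to_vec(), buffer: Vec::new(), at_eof: false }
    }
}

impl<R: Read> Iterator for DelimChunker<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // Index of the first buffered delimiter byte, if any.
            let start = self.buffer.iter().position(|b| self.delims.contains(b));
            // A run is only known to be complete once a non-delimiter follows it.
            let run_end = start.and_then(|s| {
                self.buffer[s..]
                    .iter()
                    .position(|b| !self.delims.contains(b))
                    .map(|offset| s + offset)
            });
            match (start, run_end) {
                (Some(s), Some(e)) => {
                    let chunk = self.buffer[..s].to_vec();
                    self.buffer.drain(..e);
                    return Some(Ok(chunk));
                }
                _ if self.at_eof => {
                    if self.buffer.is_empty() {
                        return None;
                    }
                    // Flush what is left, minus any trailing delimiter run.
                    let mut chunk = std::mem::take(&mut self.buffer);
                    if let Some(s) = start {
                        chunk.truncate(s);
                    }
                    return Some(Ok(chunk));
                }
                _ => {
                    // Deliberately tiny reads so chunks cross read boundaries.
                    let mut block = [0u8; 8];
                    match self.source.read(&mut block) {
                        Ok(0) => self.at_eof = true,
                        Ok(n) => self.buffer.extend_from_slice(&block[..n]),
                        Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                        Err(e) => return Some(Err(e)),
                    }
                }
            }
        }
    }
}

/// Collects every chunk from `source` as a lossily decoded String.
fn split_stream<R: Read>(source: R, delims: &[u8]) -> io::Result<Vec<String>> {
    DelimChunker::new(source, delims)
        .map(|chunk| chunk.map(|bytes| String::from_utf8_lossy(&bytes).into_owned()))
        .collect()
}

fn main() -> io::Result<()> {
    // A byte slice implements Read, so it can stand in for stdin or a file.
    let words = split_stream(&b"the quick,  brown\nfox"[..], b" ,\n")?;
    println!("{:?}", words);
    Ok(())
}
```

The sketch waits for a non-delimiter byte (or end of input) before emitting a chunk, so a run of delimiters split across two reads is still treated as one separator. How the real ByteChunker handles such boundaries, and what it does with leading or trailing matches, is not covered here.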

Modules

stream (requires the async feature)
Asynchronous analogs to the base *Chunker types that wrap Tokio’s AsyncRead types and implement Stream.

Structs

ByteChunker
The ByteChunker takes a bytes::Regex, wraps a byte source (that is, a type that implements std::io::Read) and iterates over chunks of bytes from that source that are delimited by the regular expression. It operates very much like bytes::Regex::split, except that it works on an incoming stream of bytes instead of a necessarily-already-in-memory slice.
CustomChunker
A chunker that has additionally been supplied with an Adapter, so it can produce arbitrary types. The CustomChunker does not have a separate constructor; it is built by combining a ByteChunker with an Adapter using ByteChunker::with_adapter.
SimpleCustomChunker
A version of CustomChunker that takes a SimpleAdapter type.
StringAdapter
An example Adapter type for producing a chunker that yields Strings.

Enums

ErrorResponse
Type for specifying a Chunker’s behavior upon encountering an error.
MatchDisposition
Specify what the chunker should do with the matched text.
RcErr
Wraps various types of errors that can happen in the internals of a Chunker. The way Chunkers respond to and report these errors can be controlled through builder-pattern methods that take the ErrorResponse and Utf8FailureMode types.
Utf8FailureMode
Type for specifying a StringAdapter’s behavior upon encountering non-UTF-8 data.

Traits

Adapter
Trait used to implement a CustomChunker by transforming the output of a ByteChunker.
SimpleAdapter
Simpler, less flexible version of the Adapter trait.
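
The adapter idea, transforming each byte chunk into some other type, can be sketched with the standard library alone. Everything named below (`ChunkAdapter`, `Adapted`, `Numbered`) is hypothetical and chosen for illustration; the signatures of this crate's actual Adapter and SimpleAdapter traits are documented on their own pages and may well differ.

```rust
/// Hypothetical trait for illustration only: turns one chunk of bytes
/// into a value of an arbitrary type.
trait ChunkAdapter {
    type Item;
    fn adapt(&mut self, chunk: Vec<u8>) -> Self::Item;
}

/// Wraps any iterator of byte chunks and passes each chunk through an adapter.
struct Adapted<I, A> {
    inner: I,
    adapter: A,
}

impl<I, A> Iterator for Adapted<I, A>
where
    I: Iterator<Item = Vec<u8>>,
    A: ChunkAdapter,
{
    type Item = A::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let chunk = self.inner.next()?;
        Some(self.adapter.adapt(chunk))
    }
}

/// Example adapter with internal state: pairs each chunk, decoded lossily
/// as UTF-8, with its position in the stream.
struct Numbered {
    next_index: usize,
}

impl ChunkAdapter for Numbered {
    type Item = (usize, String);

    fn adapt(&mut self, chunk: Vec<u8>) -> Self::Item {
        let index = self.next_index;
        self.next_index += 1;
        (index, String::from_utf8_lossy(&chunk).into_owned())
    }
}

/// Convenience helper: run a list of chunks through the Numbered adapter.
fn number_chunks(chunks: Vec<Vec<u8>>) -> Vec<(usize, String)> {
    let adapted = Adapted {
        inner: chunks.into_iter(),
        adapter: Numbered { next_index: 0 },
    };
    adapted.collect()
}

fn main() {
    let chunks = vec![b"alpha".to_vec(), b"beta".to_vec()];
    for (index, text) in number_chunks(chunks) {
        println!("{index}: {text}");
    }
}
```

Taking `&mut self` in the sketch lets an adapter keep state between chunks, such as the running index here. That is a design choice of this illustration, not a statement about the crate's traits.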