Crate regex_chunker


The centerpiece of this crate is the ByteChunker, which takes a regular expression and wraps a Read type. It is an iterator over the bytes read from the wrapped type, yielding chunks delimited by matches of the supplied regular expression.
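To make the "iterator over delimited chunks" idea concrete, here is a minimal std-only sketch of the same shape: a type that wraps a Read and yields byte chunks. It is not the crate's implementation. `DelimChunker` and `is_delim` are hypothetical names, a fixed set of delimiter bytes stands in for the regular expression, and silently skipping empty chunks is an assumption of this sketch rather than documented ByteChunker behavior.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a regex: true for bytes treated as delimiters.
fn is_delim(b: u8) -> bool {
    matches!(
        b,
        b' ' | b'"' | b'\r' | b'\n' | b'.' | b',' | b'!' | b'?' | b':' | b';' | b'/'
    )
}

/// Hypothetical illustration type; wraps any `Read` and yields chunks.
pub struct DelimChunker<R: Read> {
    bytes: io::Bytes<R>,
    done: bool,
}

impl<R: Read> DelimChunker<R> {
    pub fn new(source: R) -> Self {
        Self { bytes: source.bytes(), done: false }
    }
}

impl<R: Read> Iterator for DelimChunker<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let mut chunk = Vec::new();
        loop {
            match self.bytes.next() {
                Some(Ok(b)) if is_delim(b) => {
                    // A delimiter ends a non-empty chunk; otherwise we are
                    // still inside a run of delimiters and keep skipping.
                    if !chunk.is_empty() {
                        return Some(Ok(chunk));
                    }
                }
                Some(Ok(b)) => chunk.push(b),
                Some(Err(e)) => {
                    // Report the read error once, then stop iterating.
                    self.done = true;
                    return Some(Err(e));
                }
                None => {
                    // End of input: emit whatever is left, if anything.
                    self.done = true;
                    return if chunk.is_empty() { None } else { Some(Ok(chunk)) };
                }
            }
        }
    }
}
```

Wrapping `std::io::Cursor::new(&b"One, two... three!"[..])` in `DelimChunker::new` and collecting the results gives the three chunks `One`, `two`, and `three`. Reading one byte at a time keeps the sketch short; a real implementation would buffer its reads.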

The example program below uses a ByteChunker to do a crude word tally on text coming in on the standard input.

use std::{collections::BTreeMap, error::Error};
use regex_chunker::ByteChunker;
  
fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();
    
    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}

Enabling the async feature also exposes the stream module, which provides an async version of the ByteChunker that wraps an AsyncRead and implements Stream.

(This also pulls in several crates of tokio machinery, which is why it’s behind a feature flag.)

Modules

Structs

Enums

  • ErrorResponse: Type for specifying a Chunker’s behavior upon encountering an error.
  • MatchDisposition: Specify what the chunker should do with the matched text.
  • RcErr: Wraps various types of errors that can happen in the internals of a Chunker. The way Chunkers respond to and report these errors can be controlled through builder-pattern methods that take the ErrorResponse and Utf8FailureMode types.
  • Utf8FailureMode: Type for specifying a StringAdapter’s behavior upon encountering non-UTF-8 data.
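The choice of "what the chunker should do with the matched text" is easier to see with a small example. The std-only sketch below assumes three plausible policies: discard the delimiter, attach it to the end of the preceding chunk, or attach it to the start of the following chunk. The `Disposition` enum, its variant names, and `split_with` are hypothetical and used only for illustration; they are not the crate's API, and a fixed byte set again stands in for the regular expression.

```rust
/// Hypothetical policy enum for this sketch only.
#[derive(Clone, Copy)]
pub enum Disposition {
    /// Discard the matched delimiter text.
    Drop,
    /// Keep the delimiter at the end of the preceding chunk.
    Append,
    /// Keep the delimiter at the start of the following chunk.
    Prepend,
}

/// Hypothetical stand-in for a regex match: a run of these bytes.
fn is_delim(b: u8) -> bool {
    matches!(b, b' ' | b',' | b'.' | b'?')
}

/// Split `input` on runs of delimiter bytes, handling each run per `mode`.
pub fn split_with(input: &[u8], mode: Disposition) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    let mut current: Vec<u8> = Vec::new();
    let mut i = 0;
    while i < input.len() {
        if is_delim(input[i]) {
            // Find the full extent of this delimiter run.
            let start = i;
            while i < input.len() && is_delim(input[i]) {
                i += 1;
            }
            let matched = &input[start..i];
            match mode {
                Disposition::Drop => chunks.push(std::mem::take(&mut current)),
                Disposition::Append => {
                    current.extend_from_slice(matched);
                    chunks.push(std::mem::take(&mut current));
                }
                Disposition::Prepend => {
                    chunks.push(std::mem::take(&mut current));
                    current.extend_from_slice(matched);
                }
            }
        } else {
            current.push(input[i]);
            i += 1;
        }
    }
    // Emit any trailing bytes after the last delimiter run.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

For the input `One, two. three`, the drop policy yields `One`, `two`, `three`; the append policy yields `One, `, `two. `, `three`; and the prepend policy yields `One`, `, two`, `. three`. Input that begins with a delimiter produces an empty first chunk in this sketch, which it does not try to handle specially.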

Traits