Crate word_tally

A tally of words with a count of the number of times each appears.

WordTally tallies the number of times words appear in source input. By default, the I/O strategy is automatically selected based on input type:

  • Files: Parallel memory-mapped I/O
  • Character & block devices: Sequential streaming I/O
  • Pipes, sockets, stdin, and other inputs: Parallel streaming I/O

You can override this automatic selection with specific modes: sequential streaming minimizes memory usage, parallel streaming provides balanced performance and memory usage, and memory-mapped I/O offers the fastest processing for seekable files.
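
The automatic selection described above can be pictured as a mapping from the kind of input to an I/O strategy. The following is a minimal sketch using only the standard library. The `InputKind` and `IoChoice` enums and the `classify` and `choose_io` functions are hypothetical names made up for illustration; they are not the crate's `FileType` or `Io` types, and the crate's real detection logic may differ.

```rust
use std::fs::FileType;
#[cfg(unix)]
use std::os::unix::fs::FileTypeExt;

/// Hypothetical classification of an input source (not the crate's `FileType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputKind {
    RegularFile,
    CharDevice,
    BlockDevice,
    Other, // pipes, sockets, stdin, and anything else
}

/// Hypothetical I/O strategies (not the crate's `Io` enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IoChoice {
    ParallelMmap,
    SequentialStream,
    ParallelStream,
}

/// Classify a `std::fs::FileType`; device detection is Unix-only in this sketch.
pub fn classify(file_type: &FileType) -> InputKind {
    if file_type.is_file() {
        return InputKind::RegularFile;
    }
    #[cfg(unix)]
    {
        if file_type.is_char_device() {
            return InputKind::CharDevice;
        }
        if file_type.is_block_device() {
            return InputKind::BlockDevice;
        }
    }
    InputKind::Other
}

/// Map an input kind to the default strategy listed in the documentation.
pub fn choose_io(kind: InputKind) -> IoChoice {
    match kind {
        InputKind::RegularFile => IoChoice::ParallelMmap,
        InputKind::CharDevice | InputKind::BlockDevice => IoChoice::SequentialStream,
        InputKind::Other => IoChoice::ParallelStream,
    }
}
```

A caller would obtain the `FileType` from `std::fs::metadata(path)?.file_type()` and pass it through `classify` and `choose_io`. An explicit override would simply bypass this mapping.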

Word boundaries are determined using the icu_segmenter crate from ICU4X, which provides Unicode text segmentation following the Unicode Standard Annex #29 specification. The memchr crate provides SIMD-accelerated newline detection for efficient parallel chunk processing.
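
To illustrate why newline detection matters for parallel processing, here is a minimal sketch of splitting a byte buffer into chunks that each end just after a newline, so that no line is divided between workers. It is an assumption about the general technique, not the crate's implementation: the `newline_chunks` function and its `target` parameter are hypothetical, and the standard library's `position` stands in for the SIMD-accelerated search that `memchr` provides.

```rust
/// Split `input` into chunks of at least `target` bytes, each extended so it
/// ends just after a newline (or at the end of the input).
pub fn newline_chunks(input: &[u8], target: usize) -> Vec<&[u8]> {
    let target = target.max(1); // guard against a zero-sized target
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < input.len() {
        // Tentative end of this chunk, clamped to the input length.
        let tentative = (start + target).min(input.len());

        // Extend to just past the next newline, if there is one.
        // A real implementation could use `memchr::memchr(b'\n', ..)` here.
        let end = match input[tentative..].iter().position(|&b| b == b'\n') {
            Some(offset) => tentative + offset + 1,
            None => input.len(),
        };

        chunks.push(&input[start..end]);
        start = end;
    }

    chunks
}
```

Each resulting chunk could then be segmented and tallied independently, with the per-chunk tallies merged at the end.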

§Configuration

The Options struct provides a unified interface for configuring all aspects of word tallying. See the options module for detailed configuration documentation.

§Examples

use anyhow::Result;
use word_tally::{Options, TallyMap, WordTally};

fn main() -> Result<()> {
    // Tally a file with the default options
    let options = Options::default();
    let tally_map = TallyMap::from_path("example.txt", &options)?;
    let words = WordTally::from_tally_map(tally_map, &options);
    println!("Total words: {}", words.count());
    Ok(())
}

use anyhow::Result;
use word_tally::{Case, Io, Options, TallyMap, WordTally};

fn main() -> Result<()> {
    // Memory-mapped file with lowercase normalization
    let options = Options::default()
        .with_io(Io::ParallelMmap)
        .with_case(Case::Lower);
    let tally_map = TallyMap::from_path("large-file.txt", &options)?;
    let words = WordTally::from_tally_map(tally_map, &options);
    Ok(())
}

Re-exports§

pub use input::Buffered;
pub use input::FileType;
pub use input::Mapped;
pub use input::Metadata;
pub use options::Options;
pub use options::case::Case;
pub use options::delimiters::Delimiters;
pub use options::filters::ExcludeWords;
pub use options::filters::Filters;
pub use options::filters::MinChars;
pub use options::filters::MinCount;
pub use options::io::Io;
pub use options::patterns::ExcludeSet;
pub use options::patterns::IncludeSet;
pub use options::patterns::PatternList;
pub use options::performance::Performance;
pub use options::serialization::Serialization;
pub use options::sort::Sort;
pub use options::threads::Threads;
pub use output::Output;
pub use tally_map::TallyMap;

Modules§

exit_code
Exit codes following Unix sysexits.h conventions.
input
Input handling with buffered and mapped sources.
options
Configuration options for word tallying.
output
Output writing for stdout and file serialization.
tally_map
A collection for tallying word counts using HashMap.
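
As background for the exit_code module listed above, the sketch below shows how I/O errors might be mapped to sysexits.h-style exit codes. The numeric values are the standard ones from sysexits.h. The `exit_code_for` function and the particular mapping are assumptions for illustration only, not the module's actual API.

```rust
use std::io;

// Standard values from sysexits.h.
pub const EX_OK: i32 = 0; // successful termination
pub const EX_USAGE: i32 = 64; // command line usage error
pub const EX_DATAERR: i32 = 65; // data format error
pub const EX_NOINPUT: i32 = 66; // cannot open input
pub const EX_IOERR: i32 = 74; // input/output error
pub const EX_NOPERM: i32 = 77; // permission denied

/// Hypothetical mapping from an I/O error to an exit code.
pub fn exit_code_for(err: &io::Error) -> i32 {
    match err.kind() {
        io::ErrorKind::NotFound => EX_NOINPUT,
        io::ErrorKind::PermissionDenied => EX_NOPERM,
        io::ErrorKind::InvalidData => EX_DATAERR,
        io::ErrorKind::InvalidInput => EX_USAGE,
        _ => EX_IOERR,
    }
}
```

A binary could pass the result to `std::process::exit` after reporting the error, so that shell scripts can distinguish a missing input file from, say, a permissions problem.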

Structs§

WordTally
A tally of word frequencies and counts, along with processing options.

Enums§

WordTallyError
Structured error types for word-tally.

Type Aliases§

Count
The count of occurrences for a word.
Tally
A collection of word-count pairs.
Word
A word represented as a boxed string.
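
To make the relationship between these aliases concrete, here is a minimal sketch of a HashMap-based tally. Only `Word` as a boxed string comes from the description above. The definitions of `Count` as `usize` and `Tally` as a vector of pairs are assumptions, as is the hypothetical `tally_words` function, which splits on whitespace as a stand-in for the Unicode segmentation the crate performs with icu_segmenter.

```rust
use std::collections::HashMap;

/// A word represented as a boxed string (as documented).
pub type Word = Box<str>;
/// Assumed for this sketch: counts are plain `usize` values.
pub type Count = usize;
/// Assumed for this sketch: a tally is a vector of word-count pairs.
pub type Tally = Vec<(Word, Count)>;

/// Count whitespace-separated tokens and return them sorted by descending
/// count, breaking ties alphabetically so the order is deterministic.
pub fn tally_words(text: &str) -> Tally {
    let mut counts: HashMap<Word, Count> = HashMap::new();

    for token in text.split_whitespace() {
        *counts.entry(Word::from(token)).or_insert(0) += 1;
    }

    let mut tally: Tally = counts.into_iter().collect();
    tally.sort_unstable_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    tally
}
```

Using `Box<str>` rather than `String` for keys drops the spare capacity field, which can reduce memory use when many distinct words are stored.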