A tally of words with a count of the number of times each appears.
WordTally tallies the number of times words appear in source input. By default, the
I/O strategy is automatically selected based on input type:
- Files: Parallel memory-mapped I/O
- Character & block devices: Sequential streaming I/O
- Pipes, sockets, stdin, et cetera: Parallel streaming I/O
You can override this automatic selection with specific modes: sequential streaming minimizes memory usage, parallel streaming provides balanced performance and memory usage, and memory-mapped I/O offers the fastest processing for seekable files.
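The default selection and the override described above can be pictured as a small mapping from input kind to strategy. This is a minimal sketch using only the standard library: `InputKind`, `Strategy`, `default_strategy`, and `resolve` are hypothetical names chosen for illustration, not the crate's own types (the crate exposes its own `Io` and `FileType`).

```rust
/// Hypothetical classification of an input source (illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InputKind {
    RegularFile,
    CharDevice,
    BlockDevice,
    Pipe,
    Socket,
    Stdin,
}

/// Hypothetical I/O strategies mirroring the three modes described above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    ParallelMmap,
    SequentialStream,
    ParallelStream,
}

/// Picks the default strategy for an input kind, following the list above.
fn default_strategy(kind: InputKind) -> Strategy {
    match kind {
        // Seekable files can be memory-mapped and split across threads.
        InputKind::RegularFile => Strategy::ParallelMmap,
        // Devices are read sequentially as a stream.
        InputKind::CharDevice | InputKind::BlockDevice => Strategy::SequentialStream,
        // Everything else is streamed and processed in parallel.
        InputKind::Pipe | InputKind::Socket | InputKind::Stdin => Strategy::ParallelStream,
    }
}

/// An explicit choice, when given, takes precedence over the default.
fn resolve(kind: InputKind, explicit: Option<Strategy>) -> Strategy {
    explicit.unwrap_or_else(|| default_strategy(kind))
}
```

For example, `resolve(InputKind::RegularFile, Some(Strategy::SequentialStream))` would favor low memory usage over the memory-mapped default.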
Word boundaries are determined using the icu_segmenter
crate from ICU4X, which provides Unicode text segmentation following
the Unicode Standard Annex #29 specification. The memchr
crate provides SIMD-accelerated newline detection for efficient parallel chunk processing.
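To see why newline detection matters for parallel chunk processing, consider cutting a buffer into pieces that always end on a line boundary, so no word is split between workers. The sketch below is an assumption about the general technique, not the crate's implementation: `chunk_at_newlines` is a hypothetical helper, and it uses a plain standard-library byte search where the crate is described as using memchr.

```rust
/// Splits `data` into chunks of at least `target` bytes, extending each cut
/// forward to just past the next newline. The final chunk holds whatever remains.
fn chunk_at_newlines(data: &[u8], target: usize) -> Vec<&[u8]> {
    let target = target.max(1); // guard against a zero-sized target
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < data.len() {
        // Tentative cut point, clamped to the end of the buffer.
        let tentative = start.saturating_add(target).min(data.len());

        // Search forward from the tentative cut for a newline.
        // (A SIMD-accelerated search such as memchr would replace this scan.)
        let end = match data[tentative..].iter().position(|&b| b == b'\n') {
            Some(offset) => tentative + offset + 1, // include the newline
            None => data.len(),                     // no newline left: take the rest
        };

        chunks.push(&data[start..end]);
        start = end;
    }

    chunks
}
```

Each returned slice can then be tallied independently and the per-chunk counts merged, since every cut falls on a line boundary.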
§Configuration
The Options struct provides a unified interface for configuring all aspects of word
tallying. See the options module for detailed configuration documentation.
§Examples
use anyhow::Result;
use word_tally::{Options, TallyMap, WordTally};

fn main() -> Result<()> {
    // Tally with default options; the I/O strategy is selected automatically
    let options = Options::default();
    let tally_map = TallyMap::from_path("example.txt", &options)?;
    let words = WordTally::from_tally_map(tally_map, &options);
    println!("Total words: {}", words.count());
    Ok(())
}

use anyhow::Result;
use word_tally::{Case, Io, Options, TallyMap, WordTally};

fn main() -> Result<()> {
    // Memory-mapped file with lowercase normalization
    let options = Options::default()
        .with_io(Io::ParallelMmap)
        .with_case(Case::Lower);
    let tally_map = TallyMap::from_path("large-file.txt", &options)?;
    let words = WordTally::from_tally_map(tally_map, &options);
    Ok(())
}

Re-exports§
pub use input::Buffered;
pub use input::FileType;
pub use input::Mapped;
pub use input::Metadata;
pub use options::Options;
pub use options::case::Case;
pub use options::delimiters::Delimiters;
pub use options::filters::ExcludeWords;
pub use options::filters::Filters;
pub use options::filters::MinChars;
pub use options::filters::MinCount;
pub use options::io::Io;
pub use options::patterns::ExcludeSet;
pub use options::patterns::IncludeSet;
pub use options::patterns::PatternList;
pub use options::performance::Performance;
pub use options::serialization::Serialization;
pub use options::sort::Sort;
pub use options::threads::Threads;
pub use output::Output;
pub use tally_map::TallyMap;
Modules§
- exit_code
- Exit codes following Unix sysexits.h conventions.
- input
- Input handling with buffered and mapped sources.
- options
- Configuration options for word tallying.
- output
- Output writing for stdout and file serialization.
- tally_map
- A collection for tallying word counts using HashMap.
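The idea of a HashMap-backed tally can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `tally` is a hypothetical function, it splits on whitespace rather than using Unicode word segmentation, and it always lowercases and sorts by descending count, choices the crate makes configurable through its options.

```rust
use std::collections::HashMap;

/// Counts whitespace-separated words (lowercased) and returns them sorted
/// by descending count, with ties broken alphabetically.
fn tally(text: &str) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();

    for word in text.split_whitespace() {
        // Insert the word with a count of 0 if unseen, then increment.
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }

    let mut entries: Vec<(String, usize)> = counts.into_iter().collect();
    // HashMap iteration order is unspecified, so sort for a stable result.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries
}
```

Because per-chunk maps of this shape can be merged by summing counts for matching keys, the same structure suits both sequential and parallel processing.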
Structs§
- WordTally
- A tally of word frequencies and counts, along with processing options.
Enums§
- WordTallyError
- Structured error types for word-tally.