# word-tally
Output a tally of the number of times unique Unicode words appear in a source. Provides a command-line and Rust library interface. I/O is streamed by default, but buffered and memory-mapped I/O are also supported to optimize for different file sizes and workloads. Memory-mapping is only supported for files with seekable file descriptors. Parallel processing can be enabled to take advantage of multiple CPU cores by parallelizing I/O, processing, and sorting.
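The core approach can be sketched in plain Rust: normalize case, split the input into words, and count occurrences in a hash map. This sketch uses a simple alphanumeric split rather than the crate's Unicode word segmentation, and is an illustration, not the library's implementation:

```rust
use std::collections::HashMap;

/// Minimal sketch: lowercase the input, split on non-alphanumeric
/// boundaries, and tally occurrences of each word. The real crate
/// handles Unicode word segmentation and streamed I/O; this does not.
fn tally(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = tally("One two two three three three");
    assert_eq!(counts["one"], 1);
    assert_eq!(counts["two"], 2);
    assert_eq!(counts["three"], 3);
    println!("{counts:?}");
}
```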
## Installation
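Assuming the crate is published to crates.io (the version pinned under Library Usage suggests it is), installation follows the standard Cargo pattern:

```sh
cargo install word-tally
```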
## Usage

```
Usage: word-tally [OPTIONS] [PATH]

Arguments:
  [PATH]  File path to use as input rather than stdin ("-") [default: -]

Options:
  -I, --io <STRATEGY>          I/O strategy [default: streamed] [possible values: mmap, streamed, buffered]
  -p, --parallel               Use threads for parallel processing
  -c, --case <FORMAT>          Case normalization [default: lower] [possible values: original, upper, lower]
  -s, --sort <ORDER>           Sort order [default: desc] [possible values: desc, asc, unsorted]
  -m, --min-chars <COUNT>      Exclude words containing fewer than min chars
  -M, --min-count <COUNT>      Exclude words appearing fewer than min times
  -E, --exclude-words <WORDS>  Exclude words from a comma-delimited list
  -i, --include <PATTERN>      Include only words matching a regex pattern
  -x, --exclude <PATTERN>      Exclude words matching a regex pattern
  -f, --format <FORMAT>        Output format [default: text] [possible values: text, json, csv]
  -d, --delimiter <VALUE>      Delimiter between keys and values [default: " "]
  -o, --output <PATH>          Write output to file rather than stdout
  -v, --verbose                Print verbose details
  -h, --help                   Print help (see more with '--help')
  -V, --version                Print version
```
## Examples

### Basic usage
```sh
# Tally words from a file (invocation reconstructed; counts are from the original example)
word-tally README.md | head -n3
#>> word 48
#>> tally 47
#>> default 24
```
```sh
# Tally words from stdin (input shown is inferred from the output)
echo "one two two three three three" | word-tally
#>> three 3
#>> two 2
#>> one 1
```
```sh
# Verbose output (input shown is inferred from the tallies)
echo "one two two three three three" | word-tally --verbose
#>> source -
#>> total-words 6
#>> unique-words 3
#>> delimiter " "
#>> case lower
#>> order desc
#>> processing sequential
#>> io streamed
#>> min-chars none
#>> min-count none
#>> exclude-words none
#>> exclude-patterns none
#>>
#>> three 3
#>> two 2
#>> one 1
```
### I/O & parallelization
word-tally supports various combinations of I/O modes and parallel processing (file names below are illustrative):
```sh
# Streamed I/O from stdin (`--io=streamed` is default unless another I/O is specified)
word-tally < file.txt

# Streamed I/O from file
word-tally file.txt

# Streamed I/O with parallel processing
word-tally --parallel file.txt

# Buffered I/O with parallel processing
word-tally --io=buffered --parallel file.txt

# Memory-mapped I/O with efficient parallel processing (requires a file rather than stdin)
word-tally --io=mmap --parallel file.txt
```
The `--io=mmap` memory-mapped mode requires a file with a seekable file descriptor, so it works only with regular files and cannot be used with piped input. Memory mapping pairs well with parallel processing and can be very efficient.
### Output formats

#### Text (default)
```sh
# Write to file instead of stdout
word-tally --output=tally.txt file.txt

# Custom delimiter between word and count
word-tally --delimiter=": " file.txt

# Pipe to other tools
word-tally file.txt | head -n10
```
#### CSV
```sh
# Using a comma delimiter (unescaped without headers)
word-tally --delimiter="," file.txt

# Using proper CSV format (escaped with headers)
word-tally --format=csv file.txt
```
#### JSON
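No example survived here; based on the `--format` option documented above, a JSON invocation would look like this (file name illustrative):

```sh
word-tally --format=json file.txt
```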
### Visualization
Convert JSON output for visualization with d3-cloud:
```sh
# Illustrative sketch: map the JSON output to d3-cloud's {text, size}
# shape (the JSON field names are assumed and may differ by version)
word-tally --format=json file.txt | jq 'map({text: .word, size: .count})' > words.json
```
Format and pipe the JSON output to the wordcloud_cli to produce an image:
```sh
# Illustrative sketch: repeat each word by its count so frequencies
# carry weight (JSON field names are assumed, not verified)
word-tally --format=json file.txt \
  | jq -r '.[] | . as $e | range($e.count) | $e.word' \
  | wordcloud_cli --imagefile wordcloud.png
```
### Case normalization
```sh
# Default lowercase normalization
word-tally file.txt

# Preserve original case
word-tally --case=original file.txt

# Convert all to uppercase
word-tally --case=upper file.txt
```
### Sorting options
```sh
# Sort by frequency (descending, default)
word-tally file.txt

# Sort alphabetically (ascending)
word-tally --sort=asc file.txt

# No sorting (words appear in the order seen)
word-tally --sort=unsorted file.txt
```
### Filtering words
```sh
# Only include words that appear at least 10 times
word-tally --min-count=10 file.txt

# Exclude words with fewer than 5 characters
word-tally --min-chars=5 file.txt

# Exclude words by pattern (pattern is illustrative)
word-tally --exclude="ing$" file.txt

# Combining include and exclude patterns
word-tally --include="^th" --exclude="ing$" file.txt

# Exclude specific words
word-tally --exclude-words="the,a,an" file.txt
```
## Environment Variables
The following environment variables configure various aspects of the library:
I/O and processing strategy configuration:
- `WORD_TALLY_IO`: I/O strategy (default: `streamed`, options: `streamed`, `buffered`, `memory-mapped`)
- `WORD_TALLY_PROCESSING`: Processing strategy (default: `sequential`, options: `sequential`, `parallel`)
- `WORD_TALLY_VERBOSE`: Enable verbose mode (default: `false`, options: `true`/`1`/`yes`/`on`)
Memory allocation and performance:
- `WORD_TALLY_UNIQUENESS_RATIO`: Divisor for estimating unique words from input size (default: `10`)
- `WORD_TALLY_DEFAULT_CAPACITY`: Default initial capacity when there is no size hint (default: `1024`)
- `WORD_TALLY_WORD_DENSITY`: Multiplier for estimating unique words per chunk (default: `15`)
Parallel processing configuration:
- `WORD_TALLY_THREADS`: Number of threads for parallel processing (default: all available cores)
- `WORD_TALLY_CHUNK_SIZE`: Size of chunks for parallel processing in bytes (default: `65536`)
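These variables can be set per invocation. A sketch, assuming `word-tally` is on `PATH` (file name illustrative):

```sh
# Use 4 threads with buffered I/O and parallel processing for this run only
WORD_TALLY_THREADS=4 WORD_TALLY_IO=buffered WORD_TALLY_PROCESSING=parallel \
  word-tally file.txt
```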
## Library Usage
Add `word-tally` to your `Cargo.toml`:

```toml
[dependencies]
word-tally = "0.24.0"
```

```rust
use std::fs::File;
use word_tally::WordTally;
```
The library supports customization including case normalization, sorting, filtering, and I/O and processing strategies.
## Stability Notice
Pre-release stability: this is pre-release software, so expect breaking interface changes at MINOR version (0.x.0) bumps until there is a stable release.
## Tests & Benchmarks
Clone the repository, then:

```sh
# Run all tests
cargo test

# Run specific test modules
cargo test --test filters_tests

# Run individual tests
cargo test --test filters_tests -- test_min_chars
cargo test --test io_tests -- test_memory_mapped
```
### Benchmarks
```sh
# Run all benchmarks
cargo bench

# Run specific benchmark groups (group name is illustrative)
cargo bench -- io

# Run specific individual benchmarks (name is illustrative)
cargo bench -- io/streamed
```