# word-tally
Tallies the number of times each word appears in one or more unicode input sources. Use word-tally as a command-line tool or WordTally via the Rust library interface.
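At its core, a word tally maps each token to its occurrence count. A minimal sequential sketch in plain Rust (illustrative only; word-tally itself adds Unicode segmentation, parallelism, sorting, and filtering):

```rust
use std::collections::HashMap;

// Minimal sequential word tally: split on whitespace and count occurrences.
fn tally(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = tally("fe fi fi fo fo fo");
    assert_eq!(counts["fo"], 3);
    assert_eq!(counts["fi"], 2);
    assert_eq!(counts["fe"], 1);
    println!("ok");
}
```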
Four I/O strategies are available:
- stream: Sequential single-threaded streaming with minimal memory usage
- parallel-stream (default): Parallel streaming with balanced performance and memory usage
- parallel-in-memory: Load entire input into memory for parallel processing
- parallel-mmap: Memory-mapped I/O for best performance with large files
All parallel modes use SIMD-accelerated chunk boundary detection. Memory mapping requires seekable file descriptors and won't work with stdin or pipes.
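The boundary-detection idea can be sketched without SIMD: extend each fixed-size chunk to the next whitespace byte so no word straddles two chunks. This is a scalar illustration of the concept, not word-tally's actual code:

```rust
// Find a split point at or after `target` that lands just past a
// whitespace byte, so parallel chunks never cut a word in half.
// word-tally accelerates this byte scan with SIMD.
fn chunk_boundary(buf: &[u8], target: usize) -> usize {
    if target >= buf.len() {
        return buf.len();
    }
    buf[target..]
        .iter()
        .position(|b| b.is_ascii_whitespace())
        .map_or(buf.len(), |i| target + i + 1)
}

fn main() {
    let text = b"fee fi fo fum";
    // A naive split at byte 5 would cut "fi"; extend past the space after it.
    assert_eq!(chunk_boundary(text, 5), 7);
    assert_eq!(&text[..7], b"fee fi ");
    // No whitespace after the target: the final chunk runs to the end.
    assert_eq!(chunk_boundary(text, 12), 13);
    println!("ok");
}
```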
## Usage

```
Usage: word-tally [OPTIONS] [PATHS]...

Arguments:
  [PATHS]...  File paths to use as input (use "-" for stdin) [default: -]

Options:
  -I, --io <STRATEGY>          I/O strategy [default: parallel-stream] [possible values: parallel-stream, stream, parallel-mmap, parallel-in-memory]
  -e, --encoding <ENCODING>    Word boundary detection encoding [default: unicode] [possible values: unicode, ascii]
  -c, --case <FORMAT>          Case normalization [default: original] [possible values: original, upper, lower]
  -s, --sort <ORDER>           Sort order [default: desc] [possible values: desc, asc, unsorted]
  -m, --min-chars <COUNT>      Exclude words containing fewer than min chars
  -n, --min-count <COUNT>      Exclude words appearing fewer than min times
  -w, --exclude-words <WORDS>  Exclude words from a comma-delimited list
  -i, --include <PATTERNS>     Include only words matching a regex pattern
  -x, --exclude <PATTERNS>     Exclude words matching a regex pattern
  -f, --format <FORMAT>        Output format [default: text] [possible values: text, json, csv]
  -d, --delimiter <VALUE>      Delimiter between keys and values [default: " "]
  -o, --output <PATH>          Write output to file rather than stdout
  -v, --verbose                Print verbose details
  -h, --help                   Print help (see more with '--help')
  -V, --version                Print version
```
## Installation
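Assuming the crate is published on crates.io under the same name, the CLI installs with Cargo:

```sh
cargo install word-tally
```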
## Examples
### I/O strategies
Choose an I/O strategy based on your performance and memory requirements:
```sh
# Default: parallel streaming - balanced performance and memory
word-tally file.txt

# Sequential streaming - minimize memory usage
word-tally --io=stream file.txt

# Parallel in-memory - fastest for small inputs and stdin
word-tally --io=parallel-in-memory file.txt

# Parallel memory-mapped - fastest for large files
word-tally --io=parallel-mmap file.txt
```
Additional features:

```sh
# Process multiple files
word-tally file1.txt file2.txt

# Mix stdin and files
cat notes.txt | word-tally - file.txt
```
Note: Memory mapping (parallel-mmap) requires seekable files and cannot be used with stdin or pipes.
### Output formats

#### Text (default)

```sh
# Write to file instead of stdout
word-tally --output=tally.txt file.txt

# Custom delimiter between word and count
word-tally --delimiter=": " file.txt

# Pipe to other tools
word-tally file.txt | head -n3
```
#### CSV

```sh
# Using a comma delimiter (unescaped, without headers)
word-tally --delimiter="," file.txt

# Using proper CSV format (escaped, with headers)
word-tally --format=csv file.txt
```

#### JSON

```sh
word-tally --format=json file.txt
```
### Visualization

The JSON output can be converted for visualization with d3-cloud, or formatted and piped to wordcloud_cli to produce an image.
### Case normalization

```sh
# Convert to lowercase
word-tally --case=lower file.txt

# Preserve original case (default)
word-tally --case=original file.txt

# Convert all to uppercase
word-tally --case=upper file.txt
```
### Sorting options

```sh
# Sort by frequency (descending, default)
word-tally --sort=desc file.txt

# Sort alphabetically (ascending)
word-tally --sort=asc file.txt

# No sorting (words appear in the order seen)
word-tally --sort=unsorted file.txt
```
### Filtering words

```sh
# Only include words that appear at least 10 times
word-tally --min-count=10 file.txt

# Exclude words with fewer than 5 characters
word-tally --min-chars=5 file.txt

# Exclude words matching a pattern
word-tally --exclude="^the" file.txt

# Combine include and exclude patterns
word-tally --include="^w" --exclude="ing$" file.txt

# Exclude specific words
word-tally --exclude-words="the,a,an" file.txt
```
### Verbose output

```sh
echo "fe fi fi fo fo fo" | word-tally --verbose
#>> source -
#>> total-words 6
#>> unique-words 3
#>> delimiter " "
#>> case original
#>> order desc
#>> processing parallel
#>> io streamed
#>> min-chars none
#>> min-count none
#>> exclude-words none
#>> exclude-patterns none
#>>
#>> fo 3
#>> fi 2
#>> fe 1
```
## Environment variables

The following environment variables configure various aspects of the library.

I/O and processing strategy configuration:

- `WORD_TALLY_IO` - I/O strategy (default: `parallel-stream`; options: `stream`, `parallel-stream`, `parallel-in-memory`, `parallel-mmap`)

Memory allocation and performance:

- `WORD_TALLY_UNIQUENESS_RATIO` - Ratio of total words to unique words for capacity estimation. Higher values allocate less initial memory. Books tend to have a 10:1 ratio, but a more conservative 256:1 is used as the default to reduce unnecessary memory overhead (default: 256)
- `WORD_TALLY_WORDS_PER_KB` - Estimated words per KB of text for capacity calculation (default: 128, max: 512)
- `WORD_TALLY_STDIN_BUFFER_SIZE` - Buffer size for stdin when the size cannot be determined (default: 262144)
- `WORD_TALLY_DEFAULT_CAPACITY` - Default initial capacity when there is no size hint (default: 1024)
- `WORD_TALLY_WORD_DENSITY` - Multiplier for estimating unique words per chunk (default: 15)

Parallel processing configuration:

- `WORD_TALLY_THREADS` - Number of threads for parallel processing (default: all available cores)
- `WORD_TALLY_CHUNK_SIZE` - Size of chunks for parallel processing in bytes (default: 65536)
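To illustrate how the capacity knobs combine, here is a hypothetical calculation mirroring the documented defaults; this is illustrative arithmetic, not the library's actual internals:

```rust
// Estimate an initial map capacity for an input of known size, using the
// documented defaults for WORD_TALLY_WORDS_PER_KB, WORD_TALLY_UNIQUENESS_RATIO,
// and WORD_TALLY_DEFAULT_CAPACITY. Illustrative only.
fn estimated_capacity(input_bytes: u64) -> u64 {
    let words_per_kb: u64 = 128; // WORD_TALLY_WORDS_PER_KB default
    let uniqueness_ratio: u64 = 256; // WORD_TALLY_UNIQUENESS_RATIO default
    let default_capacity: u64 = 1024; // WORD_TALLY_DEFAULT_CAPACITY default

    let total_words = input_bytes / 1024 * words_per_kb;
    let unique_words = total_words / uniqueness_ratio;
    unique_words.max(default_capacity)
}

fn main() {
    // A 10 MiB input: 10240 KiB * 128 words/KiB = ~1.3M words,
    // divided by the 256:1 uniqueness ratio => 5120 unique words.
    assert_eq!(estimated_capacity(10 * 1024 * 1024), 5120);
    // Tiny inputs fall back to the default capacity.
    assert_eq!(estimated_capacity(4096), 1024);
    println!("ok");
}
```

A higher `WORD_TALLY_UNIQUENESS_RATIO` shrinks the estimate, trading possible rehashing for a smaller up-front allocation.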
## Exit codes

word-tally uses standard Unix exit codes to indicate success or the type of failure:

- `0`: Success
- `1`: General error
- `64`: Command line usage error
- `65`: Data format error
- `66`: Cannot open input
- `73`: Cannot create output file
- `74`: I/O error
- `77`: Permission denied
## Library usage

Add `word-tally` as a dependency in `Cargo.toml`:

```toml
[dependencies]
word-tally = "0.26.0"
```

The library supports case normalization, sorting, filtering, and configurable I/O and processing strategies; see the crate documentation for the `WordTally` API.
## Stability notice

Pre-release stability: this is pre-release software. Expect breaking interface changes at MINOR version (0.x.0) bumps until a stable release.
## Tests & benchmarks

### Tests

Clone the repository, then run all tests:

```sh
cargo test
```

Run specific test modules or individual tests:

```sh
cargo test --test filters_tests -- test_min_chars
cargo test --test io_tests -- test_memory_mapped
```

### Benchmarks

Run all benchmarks:

```sh
cargo bench
```

Run a specific benchmark group or individual benchmark by passing its name as a filter to `cargo bench`.