word-tally 0.19.0

Output a tally of the number of times unique words appear in source input.

Usage

Usage: word-tally [OPTIONS] [PATH]

Arguments:
  [PATH]  File path to use as input rather than stdin ("-") [default: -]

Options:
  -s, --sort <ORDER>       Sort order [default: desc] [possible values: desc, asc, unsorted]
  -c, --case <FORMAT>      Case normalization [default: lower] [possible values: original, upper, lower]
  -m, --min-chars <COUNT>  Exclude words containing fewer than min chars
  -M, --min-count <COUNT>  Exclude words appearing fewer than min times
  -e, --exclude <WORDS>    Exclude words from a comma-delimited list
  -d, --delimiter <VALUE>  Delimiter between keys and values [default: " "]
  -o, --output <PATH>      Write output to file rather than stdout
  -f, --format <FORMAT>    Output format [default: text] [possible values: text, json, csv]
  -v, --verbose            Print verbose details
  -p, --parallel           Use parallel processing for word counting
  -h, --help               Print help
  -V, --version            Print version

Examples

word-tally README.md | head -n3
#>> tally 22
#>> word 20
#>> https 11
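For comparison, a rough equivalent of the default tally can be built from classic Unix tools. This is only an approximation: word-tally's tokenization and case handling may differ, so counts will not always match.

```shell
# Lowercase, split on non-letter characters, drop empty lines,
# then tally and sort by descending count -- roughly what
# `word-tally` does by default. "the" (count 2) sorts first.
printf 'The cat and the hat\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs '[:alpha:]' '\n' \
  | grep -v '^$' \
  | sort | uniq -c | sort -rn
```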

CSV output:

# Using delimiter (manual CSV)
word-tally --delimiter="," --output="tally.csv" README.md

# Using CSV format (with header)
word-tally --format=csv --output="tally.csv" README.md

JSON output:

word-tally --format=json --output="tally.json" README.md
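The JSON output can then be post-processed with standard tools. The schema used below (a JSON array of [word, count] pairs) is an assumption for illustration, not verified against this version, so adjust the field access to match the actual output:

```shell
# HYPOTHETICAL schema: a JSON array of [word, count] pairs.
# Prints only words that appear at least 10 times.
printf '[["tally",22],["word",20],["https",11],["a",1]]' \
  | python3 -c '
import json, sys
for word, count in json.load(sys.stdin):
    if count >= 10:
        print(word, count)
'
```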

Parallel processing can be much faster for large files:

word-tally --parallel README.md

# Tune with environment variables
WORD_TALLY_THREADS=4 WORD_TALLY_CHUNK_SIZE=32768 word-tally --parallel huge-file.txt

Environment Variables

These variables affect capacity estimation in all modes:

  • WORD_TALLY_UNIQUENESS_RATIO - Divisor for estimating unique words from input size (default: 10)
  • WORD_TALLY_DEFAULT_CAPACITY - Default initial capacity when there is no size hint (default: 1024)

These variables only affect the program when using the --parallel flag:

  • WORD_TALLY_THREADS - Number of threads for parallel processing (default: number of cores)
  • WORD_TALLY_CHUNK_SIZE - Size of chunks for parallel processing in bytes (default: 16384)
  • WORD_TALLY_WORD_DENSITY - Multiplier for estimating unique words per chunk (default: 15)
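As a rough way to reason about WORD_TALLY_CHUNK_SIZE, the number of chunks available to spread across threads is approximately the input size divided by the chunk size. This is illustrative arithmetic, not word-tally's internal logic:

```shell
# With the default 16384-byte chunks, a 1 MiB input yields about
# 64 chunks (ceiling division), so more than 64 threads would
# leave some idle at this chunk size.
size=$((1024 * 1024))
chunk=16384
echo $(( (size + chunk - 1) / chunk ))
```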

Installation

cargo install word-tally

Cargo.toml

Add word-tally as a dependency.

[dependencies]
word-tally = "0.19.0"

Documentation

https://docs.rs/word-tally

Tests & benchmarks

Clone the repository.

git clone https://github.com/havenwood/word-tally
cd word-tally

Run the tests.

cargo test

And run the benchmarks.

cargo bench