corpa

Blazing-fast text analysis for the command line, Python, and the browser.

A unified tool for corpus-level NLP statistics — n-gram frequencies, readability scores, entropy analysis, language detection, BPE token counting, and more — written in Rust for performance, with bindings for Python and JavaScript/WASM.

Installation · Quick Start · Commands · Documentation · Contributing


Highlights

  • High performance — Parallel processing via rayon. Analyzes multi-GB corpora in seconds.
  • Composable — Unix-friendly design with structured output (JSON, CSV, table). Pipes seamlessly with jq, awk, and standard tooling.
  • Comprehensive — Nine commands spanning vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, language model perplexity, language detection, and BPE tokenization, plus shell completion generation.
  • Multi-platform — Available as a native CLI binary, a Python package via PyO3, and an npm/WASM module for browser and Node.js environments.
  • Streaming — Process unbounded stdin streams with incremental chunk-based output for stats, ngrams, and entropy.

Installation

CLI

cargo install corpa

Or build from source:

git clone https://github.com/Flurry13/corpa
cd corpa
cargo build --release

Python

pip install corpa

JavaScript / WASM

npm install corpa

Quick Start

CLI

corpa stats corpus.txt
corpa ngrams -n 2 --top 20 corpus.txt
corpa readability essay.txt
corpa entropy corpus.txt
corpa perplexity corpus.txt --smoothing laplace
corpa lang mystery.txt
corpa tokens corpus.txt --model gpt4
corpa zipf corpus.txt --top 10

All commands accept file paths, directories (with --recursive), or stdin. Output format is controlled with --format (table, json, csv).
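For intuition about what the perplexity command computes, here is a minimal sketch of bigram perplexity with Laplace (add-one) smoothing in plain Python. It illustrates the statistic itself, not corpa's implementation; corpa's tokenization and smoothing details may differ.

```python
import math
from collections import Counter

def bigram_perplexity(tokens):
    """Perplexity of `tokens` under an add-one-smoothed bigram model."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    log_prob_sum = 0.0
    for pair in zip(tokens, tokens[1:]):
        # Laplace smoothing: add 1 to every bigram count,
        # and the vocabulary size to the denominator.
        prob = (bigrams[pair] + 1) / (unigrams[pair[0]] + vocab_size)
        log_prob_sum += math.log2(prob)
    n_bigrams = len(tokens) - 1
    return 2 ** (-log_prob_sum / n_bigrams)

tokens = "the cat sat on the mat the cat ran".split()
print(round(bigram_perplexity(tokens), 2))
```

Lower perplexity means the model finds the text more predictable; smoothing keeps unseen bigrams from driving the score to infinity.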

Python

import corpa

corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, 'type_token_ratio': 0.8889, ...}

corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]

corpa.lang(text="Bonjour le monde")
# {'language': 'Français', 'code': 'fra', 'script': 'Latin', 'confidence': 0.99}

All functions accept a file path as the first argument or a text= keyword argument for direct string input.
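The numbers corpa.stats returns can be reproduced by hand. The sketch below recomputes them in plain Python as a reference; it assumes a simple lowercased whitespace tokenization, which may differ from corpa's tokenizer.

```python
from collections import Counter

text = "The quick brown fox jumps over the lazy dog."
tokens = text.lower().rstrip(".").split()   # simplistic tokenization
counts = Counter(tokens)

stats = {
    "tokens": len(tokens),                    # running words: 9
    "types": len(counts),                     # distinct words: 8 ("the" repeats)
    "type_token_ratio": round(len(counts) / len(tokens), 4),
    "hapax_legomena": sum(1 for c in counts.values() if c == 1),
}
print(stats)
```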

JavaScript / WASM

import { stats, lang, entropy } from 'corpa';

const result = stats("The quick brown fox jumps over the lazy dog.");
// { tokens: 9, types: 8, sentences: 1, type_token_ratio: 0.8889, ... }

const detected = lang("Bonjour le monde");
// { language: 'Français', code: 'fra', script: 'Latin', confidence: 0.99 }

All functions accept text strings directly and return plain JavaScript objects.


Commands

| Command | Description |
| --- | --- |
| stats | Token, type, and sentence counts; type-token ratio; hapax legomena; average sentence length |
| ngrams | N-gram frequency analysis with configurable n, top-K, minimum frequency, case folding, and stopword filtering |
| tokens | Whitespace, sentence, and character tokenization; BPE token counts for GPT-3, GPT-4, and GPT-4o |
| readability | Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index |
| entropy | Unigram, bigram, and trigram Shannon entropy; entropy rate; vocabulary redundancy |
| perplexity | N-gram language model perplexity with Laplace smoothing and Stupid Backoff |
| lang | Language and script detection with confidence scoring |
| zipf | Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting |
| completions | Shell completion generation for bash, zsh, and fish |
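Several of these commands reduce to classical formulas. For example, the unigram figure reported by entropy is the Shannon entropy of the token frequency distribution, in bits per token. A reference sketch (not corpa's implementation):

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy H = -sum p(w) * log2 p(w), in bits per token."""
    total = len(tokens)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(tokens).values()
    )

print(round(unigram_entropy("the cat sat on the mat".split()), 4))
```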

Example Output

$ corpa stats prose.txt

  corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              ┆      Value │
╞═════════════════════╪════════════╡
│ Tokens (words)      ┆        175 │
│ Types (unique)      ┆         95 │
│ Characters          ┆        805 │
│ Sentences           ┆          6 │
│ Type-Token Ratio    ┆     0.5429 │
│ Hapax Legomena      ┆ 70 (73.7%) │
│ Avg Sentence Length ┆ 29.2 words │
└─────────────────────┴────────────┘
$ corpa readability prose.txt

  corpa · prose.txt
┌──────────────────────┬───────┬─────────────┐
│ Metric               ┆ Score ┆       Grade │
╞══════════════════════╪═══════╪═════════════╡
│ Flesch-Kincaid Grade ┆ 12.73 ┆ High School │
│ Flesch Reading Ease  ┆ 41.16 ┆   Difficult │
│ Coleman-Liau Index   ┆ 13.82 ┆     College │
│ Gunning Fog Index    ┆ 16.97 ┆     College │
│ SMOG Index           ┆ 14.62 ┆     College │
└──────────────────────┴───────┴─────────────┘
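These scores come from standard published formulas. For instance, Flesch-Kincaid Grade Level is 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The sketch below evaluates it on illustrative counts (not the actual prose.txt counts, since its syllable total is not shown above):

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts only, not measured from prose.txt:
print(round(fk_grade(words=120, sentences=6, syllables=180), 2))
```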
$ corpa tokens prose.txt --model all

  corpa · prose.txt
┌──────────────┬────────┐
│ Tokenizer    ┆ Tokens │
╞══════════════╪════════╡
│ Whitespace   ┆    126 │
│ Sentences    ┆      6 │
│ Characters   ┆    805 │
│ BPE (GPT-4)  ┆    150 │
│ BPE (GPT-4o) ┆    148 │
│ BPE (GPT-3)  ┆    151 │
└──────────────┴────────┘
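The zipf command's exponent fitting can be pictured as a least-squares line fit in log-log space: Zipf's law predicts frequency proportional to 1 / rank^s with s near 1 for natural language. A sketch of that fit (an illustration, not corpa's fitting code):

```python
import math

def zipf_exponent(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank), negated.

    `frequencies` must be sorted in descending order.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return -slope

# Perfectly Zipfian frequencies f(r) = 1000 / r recover s = 1.0.
print(round(zipf_exponent([1000 / r for r in range(1, 51)]), 2))
```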

Streaming

The --stream flag enables incremental processing of unbounded stdin, emitting cumulative results after each chunk. Chunk size is configurable with --chunk-lines (default: 1000).

cat huge_corpus.txt | corpa stats --stream --chunk-lines 500 --format json

Supported commands: stats, ngrams, entropy.

| Format | Behavior |
| --- | --- |
| json | JSON Lines — one object per chunk |
| csv | Header row once, data rows per chunk |
| table | Table per chunk with chunk number in title |
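The streaming model itself is simple: buffer a fixed number of input lines, update running counts, and emit a cumulative result after each chunk. A Python sketch of that loop (illustrative only; corpa's streaming is implemented in Rust):

```python
import json
from collections import Counter

def stream_stats(lines, chunk_lines=1000):
    """Yield cumulative {tokens, types} after every `chunk_lines` lines."""
    counts = Counter()
    total_tokens = 0
    buffered = 0
    for line in lines:
        words = line.split()
        counts.update(words)
        total_tokens += len(words)
        buffered += 1
        if buffered == chunk_lines:
            yield {"tokens": total_tokens, "types": len(counts)}
            buffered = 0
    if buffered:  # flush the final partial chunk
        yield {"tokens": total_tokens, "types": len(counts)}

# One JSON object per chunk, mirroring `--stream --format json` output:
for chunk in stream_stats(["the cat sat", "on the mat", "the end"], chunk_lines=2):
    print(json.dumps(chunk))
```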

Global Options

| Flag | Description |
| --- | --- |
| --format <fmt> | Output format: table (default), json, csv |
| --recursive | Process directories recursively |
| --stream | Process stdin incrementally, emitting results per chunk |
| --chunk-lines <N> | Lines per chunk in streaming mode (default: 1000) |

Performance

Benchmarks on a 1GB English text corpus (Apple M2, 8 cores):

| Command | corpa | Python (NLTK) | Speedup |
| --- | --- | --- | --- |
| Word count | 0.8s | 34s | 42x |
| Bigram frequency | 1.2s | 89s | 74x |
| Readability | 0.9s | 41s | 45x |

These figures are targets; they will be validated once formal benchmarking infrastructure is in place.


Documentation

| Resource | Description |
| --- | --- |
| CLI Commands | Full command reference with options and examples |
| Streaming | Incremental stdin processing for large-scale analysis |
| Python API | PyO3 bindings — all commands as native Python functions |
| JavaScript API | WASM bindings for browser and Node.js environments |

Roadmap

Completed

  • v0.1.0 — Core CLI: stats, ngrams, tokens, JSON/CSV/table output, stdin and file input, recursive directories
  • v0.2.0 — Analysis: readability, entropy, zipf, stopword filtering, case folding, parallel processing
  • v0.3.0 — Language Models: perplexity with Laplace/Stupid Backoff, lang detection, BPE token counting
  • v0.4.0 — Ecosystem: Python bindings (PyO3), WASM/npm package, streaming mode, shell completions

Planned

  • Custom vocabulary and dictionary support
  • Concordance / KWIC (keyword in context) search
  • Collocation analysis (PMI, chi-squared)
  • Sentiment lexicon scoring
  • Diff mode for comparing two corpora

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a pull request.

cargo test            # Run test suite
cargo clippy          # Lint
cargo bench           # Run benchmarks

License

This project is licensed under the MIT License.


Acknowledgments