# text_analysis
A fast, pragmatic CLI & library for multi-language text analysis across .txt, .pdf, .docx, and .odt files.
## Highlights
- Unicode-aware tokenization
- Optional stopword filtering (custom list)
- Optional stemming (auto-detected or forced language)
- N‑gram counts
- Word frequencies
- Context stats (±N) & direct neighbors (±1)
- Collocation analysis with Pointwise Mutual Information (PMI) for all word pairs in the context window
- Named‑Entity extraction (simple capitalization heuristic)
- Parallel per‑file compute (safe, serialized writes)
- Combined (Map‑Reduce) mode to aggregate multiple files
- Deterministic, sorted exports (CSV/TSV/JSON/TXT)
- Robust I/O: errors are reported, never panic
## Installation
- With cargo: `cargo install text_analysis`
- Download a prebuilt binary from the Releases page
- Clone the repository and build from source (`cargo build --release`)
## Quick start

```sh
# Default TXT summary (one file)
text_analysis notes.txt

# CSV exports (multiple files: ngrams, wordfreq, context, neighbors, pmi, namedentities)
text_analysis corpus/ --export-format csv

# Combine all files into one corpus (Map-Reduce) and export as JSON
text_analysis corpus/ --combine --export-format json
```
Path can be a file or a directory (recursively scanned). Supported: .txt, .pdf, .docx, .odt.
## CLI

```
text_analysis <path> [--stopwords <FILE>] [--ngram N] [--context N]
              [--export-format {txt|csv|tsv|json}] [--entities-only]
              [--combine]
              [--stem] [--stem-lang <CODE>] [--stem-strict]
```
- `--stopwords <FILE>` – optional stopword list (one token per line; see the example below).
- `--ngram N` – n‑gram size (default: 2).
- `--context N` – context window size for context & PMI (default: 5).
- `--export-format` – `txt` (default), `csv`, `tsv`, `json`.
- `--entities-only` – only export Named Entities (skips other tables).
- `--combine` – analyze all files as one corpus (Map‑Reduce) and write a single set of outputs.
- `--stem` – enable stemming with auto language detection.
- `--stem-lang <CODE>` – force stemming language (e.g., `en`, `de`, `fr`, `es`, `it`, `pt`, `nl`, `ru`, `sv`, `fi`, `no`, `ro`, `hu`, `da`, `tr`).
- `--stem-strict` – in auto mode, require a detectable & supported language:
  - Per‑file mode: files without a detectable/supported language are skipped (reported).
  - Combined mode: the whole run aborts (prevents mixed stemming).
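A stopword file is plain text with one token per line, for example (file name and contents are illustrative):

```text
the
a
and
of
```

Pass it with `--stopwords stopwords.txt`; matching tokens are filtered out of the analysis.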
## STDOUT summary (human-readable)
When the CLI finishes, it prints a concise summary to stdout. The order is tuned for usefulness:
- Top 20 N‑grams (count ↓, lexicographic tie‑break)
- Top 20 PMI pairs (count ↓, then PMI ↓, then words)
- Top 20 words (count ↓, lexicographic tie‑break)
This surfaces phrases and salient collocations before common function words.
## Outputs

### TXT (default)
- Exactly one file per run: `<stem>_<timestamp>_summary.txt`
- Contains the three sorted blocks (Top 20 N‑grams → Top 20 PMI → Top 20 words).
### CSV / TSV / JSON
- Multiple files per run (one per analysis):
  - `<stem>_<timestamp>_ngrams.<ext>`
  - `<stem>_<timestamp>_wordfreq.<ext>`
  - `<stem>_<timestamp>_context.<ext>`
  - `<stem>_<timestamp>_neighbors.<ext>`
  - `<stem>_<timestamp>_pmi.<ext>`
  - `<stem>_<timestamp>_namedentities.<ext>`
### Output file overview

| File suffix | Contents | Notes |
|---|---|---|
| `_ngrams.<ext>` | List of all observed n-grams and their counts | Sorted by count ↓, then lexicographically ↑ |
| `_wordfreq.<ext>` | Word frequency table (unigrams only) | Sorted by count ↓, then lexicographically ↑ |
| `_context.<ext>` | Directed co-occurrence counts for all tokens in a ±N window around each center token | Window size set by `--context` (default 5); includes all words except the center word |
| `_neighbors.<ext>` | Directed co-occurrence counts for immediate left/right neighbors (±1 distance) | Always exactly one left and one right position per center token |
| `_pmi.<ext>` | Word pairs within the context window with their counts, distances, and Pointwise Mutual Information | Pairs are unordered in storage; sorted by count ↓, then PMI ↓ in export |
| `_namedentities.<ext>` | Named entities detected via capitalization heuristic and their counts | Case-sensitive; ignores acronyms and common articles/determiners |
Sorting rules applied to all tabular exports:
- N‑grams & Wordfreq: by count desc, then key asc.
- Context & Neighbors (flattened): by count desc, then keys.
- PMI: by count desc, then PMI desc, then words.
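For reference, PMI for a word pair (x, y) is log2(p(x, y) / (p(x) · p(y))): positive values mean the pair co-occurs more often than its individual frequencies predict. A self-contained sketch of that computation (illustrative, not the crate's internal code):

```rust
/// Pointwise Mutual Information from raw counts.
/// `pair_count` is how often (x, y) co-occurred within the window;
/// `x_count`/`y_count` are unigram counts; the totals normalize to probabilities.
fn pmi(pair_count: u64, x_count: u64, y_count: u64, total_words: u64, total_pairs: u64) -> f64 {
    let p_xy = pair_count as f64 / total_pairs as f64;
    let p_x = x_count as f64 / total_words as f64;
    let p_y = y_count as f64 / total_words as f64;
    (p_xy / (p_x * p_y)).log2()
}
```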
### Combined mode
With `--combine`, all inputs are processed as one corpus and exported once with stem `combined`:
`combined_<timestamp>_wordfreq.<ext>`, `combined_<timestamp>_ngrams.<ext>`, …
## File naming
`<stem>` is collision‑safe: derived from the file name plus a short path hash. In per‑file mode each input gets its own stem; in combined mode the stem is literally `combined`.
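One way such a stem could be built, as a sketch with the standard library's hasher (the crate's actual scheme may differ):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// File stem plus a short hash of the full path, so two inputs both named
/// "notes.txt" in different directories get distinct output stems.
fn output_stem(path: &Path) -> String {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    let name = path.file_stem().and_then(|s| s.to_str()).unwrap_or("file");
    format!("{}_{:08x}", name, hasher.finish() as u32)
}
```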
## Library usage
Add to `Cargo.toml`:

```toml
[dependencies]
text_analysis = "0.4.7"
```
Basic example:

```rust
use std::collections::HashSet;
use text_analysis::*;
```
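From there, a call might look roughly like the sketch below; the function and type names are placeholders, not the crate's confirmed API (see the docs.rs documentation for the real signatures):

```rust
use std::collections::HashSet;
use text_analysis::*;

fn main() {
    // Hypothetical flow – the names below are illustrative placeholders.
    let stopwords: HashSet<String> =
        ["the", "and", "of"].iter().map(|s| s.to_string()).collect();
    // e.g. something along these lines:
    // let report = analyze_path("corpus/", &stopwords, /* ngram */ 2, /* context */ 5);
    // report.export("out/", ExportFormat::Csv);
    let _ = stopwords;
}
```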
## Named‑Entity heuristic
- Token starts with an uppercase letter
- Token is not all uppercase (filters acronyms)
- Filters very common determiners/articles across DE/EN/FR/ES/IT
Counts are case‑sensitive and computed on original tokens (not stemmed).
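The filter is easy to reproduce; a sketch (with an abbreviated determiner list for illustration):

```rust
/// Capitalization heuristic: initial uppercase, not all-caps,
/// and not a common article/determiner. The list here is abbreviated.
fn looks_like_entity(token: &str) -> bool {
    const DETERMINERS: &[&str] = &["The", "Der", "Die", "Das", "Le", "La", "El", "Il"];
    let starts_upper = token.chars().next().map_or(false, |c| c.is_uppercase());
    let all_upper = token.chars().all(|c| !c.is_alphabetic() || c.is_uppercase());
    starts_upper && !all_upper && !DETERMINERS.contains(&token)
}
```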
## Stemming
- `StemMode::Off` – no stemming
- `StemMode::Auto` – language detection via `whatlang`; stem if supported
- `StemMode::Force(lang)` – use a specific stemmer

`stem_require_detected` controls strictness in auto mode (see CLI).
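In Auto mode the decision reduces to a detection step; a sketch using `whatlang` (named above), with the strictness handling around it left illustrative:

```rust
/// Resolve a stemming language for StemMode::Auto. `None` means detection
/// failed or was unreliable; with --stem-strict the caller then skips the
/// file (per-file mode) or aborts the run (combined mode).
fn resolve_auto_lang(text: &str) -> Option<whatlang::Lang> {
    whatlang::detect(text)
        .filter(|info| info.is_reliable())
        .map(|info| info.lang())
}
```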
## PDF support
Uses `pdf-extract`. Files that fail to parse are listed in the warnings and don't abort the run.
## Office document support
- DOCX: parsed natively (pure Rust) by reading `word/document.xml` and extracting text content.
- ODT: parsed natively (pure Rust) by reading `content.xml` and extracting text content.
Notes:
- Extraction focuses on plaintext content for analysis; complex formatting, headers/footers, and footnotes may be ignored.
- Files that fail to parse are listed in the warnings and don't abort the run.
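The DOCX path, as a sketch (the `zip` crate and the naive tag stripping are illustrative; the crate's real extraction may differ):

```rust
use std::fs::File;
use std::io::Read;

/// Pull word/document.xml out of the DOCX zip container and keep only
/// the text content. A real parser would also handle XML entities and
/// paragraph boundaries.
fn docx_text(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut archive = zip::ZipArchive::new(File::open(path)?)?;
    let mut xml = String::new();
    archive.by_name("word/document.xml")?.read_to_string(&mut xml)?;

    let mut text = String::new();
    let mut in_tag = false;
    for c in xml.chars() {
        match c {
            '<' => in_tag = true,
            '>' => { in_tag = false; text.push(' '); }
            _ if !in_tag => text.push(c),
            _ => {}
        }
    }
    Ok(text)
}
```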
## Best practices
- Use `--export-format csv` (or `tsv`/`json`) for downstream analysis in pandas/R/Excel.
- In noisy corpora, prefer `--ngram 2` or `--ngram 3` and check PMI first.
- For mixed‑language corpora, consider `--stem-strict` to avoid inconsistent stemming.
## License
MIT
## Security: CSV/TSV safety
If you open exports in Excel/LibreOffice, cells that begin with `=`, `+`, `-`, or `@` can be interpreted as formulas. The recommended approach is:
- Use a proper CSV library (this project uses `csv::Writer`) for escaping.
- Prefix a `'` for any text cell that starts with one of those characters.
This prevents spreadsheet software from executing user-provided content.
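A sketch of that prefixing step:

```rust
/// Neutralize spreadsheet formula triggers by prefixing a single quote,
/// as recommended above. Quoting/escaping itself is left to csv::Writer.
fn harden_cell(cell: &str) -> String {
    match cell.chars().next() {
        Some('=' | '+' | '-' | '@') => format!("'{cell}"),
        _ => cell.to_string(),
    }
}
```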