# Text_Analysis
A robust, fast, modern CLI tool for linguistic text analysis of `.txt` and `.pdf` files, supporting:
- Automatic language detection (English, German, French, Spanish, Italian, Arabic)
- Optional stopword filtering (user-provided custom list; no automatic removal)
- Optional stemming (via `rust-stemmers` for supported languages)
- N-gram analysis (user-defined N)
- Word frequency and context statistics
- Sliding-window co-occurrence and direct neighbors
- Named Entity recognition (simple capitalization heuristic)
- Collocation analysis with Pointwise Mutual Information (PMI)
- Export as TXT, CSV, TSV, or JSON
- Recursively scans directories
- Per-file analysis is parallelized (Rayon); output writing is serialized
- Combined Mode uses Map‑Reduce (see below)
- Never panics: all file errors are reported rather than treated as fatal
Note: PDF parsing is built in via `pdf-extract`; no feature flag required.
## Features
- Automatic language detection (`whatlang`)
- Custom stopword lists (plain `.txt`, one word per line)
- Stemming (optional): auto by detected language or forced via CLI
- N-grams (size configurable), word frequencies
- Context statistics (±N window) & direct neighbors (±1); see the window sketch after this list
- PMI collocations for word pairs within the window
- Named Entities via capitalization heuristic (see below)
- All errors (unreadable files, PDF issues) are reported at the end
- CLI built with `clap`
- Results written to timestamped files; a concise run summary is printed to the terminal
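
As a rough illustration of the window terminology above (illustrative only, not the crate's API or exact pairing rules), collecting context pairs within ±N and direct neighbors within ±1 could look like this:

```rust
/// Illustrative sketch only: collect (center, neighbor) pairs within a
/// ±window and direct neighbors (±1) from a token sequence.
fn window_pairs(tokens: &[&str], window: usize) -> (Vec<(String, String)>, Vec<(String, String)>) {
    let mut context_pairs = Vec::new();
    let mut neighbor_pairs = Vec::new();
    for (i, center) in tokens.iter().enumerate() {
        let start = i.saturating_sub(window);
        let end = (i + window + 1).min(tokens.len());
        for j in start..end {
            if j == i {
                continue;
            }
            // Every token inside the ±window counts as a context pair.
            context_pairs.push((center.to_string(), tokens[j].to_string()));
            // Only immediately adjacent tokens count as direct neighbors.
            if j + 1 == i || j == i + 1 {
                neighbor_pairs.push((center.to_string(), tokens[j].to_string()));
            }
        }
    }
    (context_pairs, neighbor_pairs)
}

fn main() {
    let tokens = ["the", "river", "crosses", "berlin"];
    let (ctx, nbr) = window_pairs(&tokens, 2);
    println!("context pairs: {ctx:?}");
    println!("direct neighbors: {nbr:?}");
}
```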
## Installation
- Install with cargo
- Download a binary from the Releases page
- Clone the repository and build from source
Use in your own Rust project:
```toml
[dependencies]
text_analysis = { path = "path/to/text_analysis" }
```
## Usage
- `<path>`: file or directory (recursively scans for `.txt` and `.pdf`)
- `--stopwords <file>`: (optional) stopword list (one word per line; if not provided, no filtering)
- `--ngram N`: (optional, default: 2) N-gram size (2 = bigrams, 3 = trigrams, …)
- `--context N`: (optional, default: 5) context window size (±N words)
- `--export-format FORMAT`: `txt` (default), `csv`, `tsv`, or `json`
- `--entities-only`: only export named entities (not all statistics)
- `--combine`: analyze all files together as one corpus (Map‑Reduce, see below)
- `--stem`: enable stemming (based on detected language)
- `--stem-lang LANG`: force stemming language (`en`, `de`, `fr`, `es`, `it`, `pt`, `nl`, `ru`, `sv`, `fi`, `no`, `ro`, `hu`, `da`, `tr`); only effective with `--stem`
By default, each file is analyzed and exported individually. With `--combine`, all files are analyzed as a single corpus using Map‑Reduce.
Example (binary name assumed from the crate name; adjust to your build): `text_analysis ./corpus --stopwords stopwords.txt --ngram 3 --export-format csv --combine`
## Output files & naming
Output files use a collision-safe stem, an 8-char path hash, a timestamp, and the analysis type:

`<stem[.ext]>_<hash8>_<timestamp>_<analysis-type>.<ext>`

Examples:

- `cli.txt_f3a9c2b1_20250810_155411_wordfreq.csv`
- `cli.txt_f3a9c2b1_20250810_155411_ngrams.csv`
- `combined_20250810_155411_wordfreq.csv` (combined mode has no hash)

The short hash prevents filename collisions (e.g., same stem across different files), especially with parallel runs.
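
As an illustrative sketch only (the tool's actual hash function and timestamp formatting may differ), the documented pattern can be reproduced like this:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// Illustrative only: build a name in the pattern
/// <stem[.ext]>_<hash8>_<timestamp>_<analysis-type>.<ext>.
fn output_name(input_path: &str, timestamp: &str, analysis: &str, ext: &str) -> String {
    // Hash the full input path so equal stems from different directories differ.
    let mut hasher = DefaultHasher::new();
    input_path.hash(&mut hasher);
    let digest = format!("{:016x}", hasher.finish());
    let hash8 = &digest[..8];

    let stem = Path::new(input_path)
        .file_name()
        .and_then(|s| s.to_str())
        .unwrap_or("input");

    format!("{stem}_{hash8}_{timestamp}_{analysis}.{ext}")
}

fn main() {
    // e.g. cli.txt_<8 hex chars>_20250810_155411_wordfreq.csv
    println!("{}", output_name("docs/cli.txt", "20250810_155411", "wordfreq", "csv"));
}
```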
## Combined Mode (Map‑Reduce)
When `--combine` is set, the corpus is processed via Map‑Reduce for scalability and consistency:
**Map (parallel):** for each file, build partial counts from normalized tokens (lowercased, optional stopwords removed, optional stemming):

- `ngrams: HashMap<String, usize>`
- `wordfreq: HashMap<String, usize>`
- `context_pairs: HashMap<(String, String), usize>` (center, neighbor in ±window)
- `neighbor_pairs: HashMap<(String, String), usize>` (direct neighbors ±1)
- `cooc_by_dist: HashMap<(String, String, usize), usize>` (canonical pair, distance)
- `named_entities: HashMap<String, usize>` from the original (non‑stemmed) tokens
- `n_tokens: usize`
**Reduce (serial):** merge all partial counts into global totals.
**Finalize (serial):**

- Construct the final `AnalysisResult` from totals
- Compute PMI from global pair counts & unigram counts (single global pass)
- Write one set of combined outputs, e.g. `combined_<timestamp>_wordfreq.csv`
Benefits: avoids holding a giant concatenated string in memory, maximizes parallelism, and ensures PMI/frequencies are consistent across the whole corpus.
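
For illustration, here is a minimal sketch of the reduce step and the global PMI pass (assumed function names, not the crate's code; PMI uses the standard definition log2(p(x, y) / (p(x) * p(y)))):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Reduce step sketch: fold per-file partial counts (word frequencies,
/// n-grams, pair counts, ...) into one global total.
fn merge_counts<K: Eq + Hash>(partials: Vec<HashMap<K, usize>>) -> HashMap<K, usize> {
    let mut totals = HashMap::new();
    for partial in partials {
        for (key, count) in partial {
            *totals.entry(key).or_insert(0) += count;
        }
    }
    totals
}

/// Standard PMI from global counts: log2(p(x, y) / (p(x) * p(y))).
/// `pair_count`: joint occurrences within the window, `n_pairs`: total pairs,
/// `count_x` / `count_y` / `n_tokens`: unigram counts and corpus size.
fn pmi(pair_count: usize, count_x: usize, count_y: usize, n_pairs: usize, n_tokens: usize) -> f64 {
    let p_xy = pair_count as f64 / n_pairs as f64;
    let p_x = count_x as f64 / n_tokens as f64;
    let p_y = count_y as f64 / n_tokens as f64;
    (p_xy / (p_x * p_y)).log2()
}

fn main() {
    // Two per-file word-frequency maps standing in for the map phase output.
    let file_a = HashMap::from([("berlin".to_string(), 2), ("river".to_string(), 1)]);
    let file_b = HashMap::from([("berlin".to_string(), 1), ("bridge".to_string(), 3)]);

    let totals = merge_counts(vec![file_a, file_b]);
    assert_eq!(totals["berlin"], 3);

    // PMI is computed once, from merged global counts (values here are made up).
    println!("{:.3}", pmi(2, 3, 4, 50, 200));
}
```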
## Using as a Library
You get n-gram extraction, frequency analysis, PMI collocations, optional stemming, and custom stopword support.
### Example 1: English bigrams (no stopwords, no stemming)

```rust
use std::collections::HashSet;
use text_analysis::*;
```
### Example 2: German unigrams with auto stemming

```rust
use std::collections::HashSet;
use text_analysis::*;
```
### Example 3: PMI with custom stopwords (no stemming)

```rust
use std::collections::HashSet;
use text_analysis::*;
```
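
The crate builds on `whatlang` (detection) and `rust-stemmers` (stemming). As a standalone sketch of those two dependencies only (not the `text_analysis` API), assuming their usual interfaces:

```rust
use rust_stemmers::{Algorithm, Stemmer};
use whatlang::detect;

fn main() {
    let text = "Die Flüsse und Brücken Berlins prägen das Bild der ganzen Stadt.";

    // Language detection with whatlang (returns None if detection fails).
    let detected = detect(text).map(|info| info.lang());
    println!("detected language: {detected:?}"); // e.g. Some(Deu)

    // Stemming with rust-stemmers; the matching Snowball algorithm is chosen
    // from the detected (or forced) language.
    let stemmer = Stemmer::create(Algorithm::German);
    println!("{}", stemmer.stem("kreuzen")); // prints the suffix-stripped stem
}
```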
## Named-Entity Heuristic (how it works)
The current NER is a simple capitalization heuristic:
- Tokenize the original (non-stemmed) text.
- Count a token as an entity candidate if it:
  - starts with an uppercase letter (Unicode-aware),
  - is not fully uppercase (filters acronyms like “NASA”),
  - is not a common function word at sentence start (basic list).
- Counts are case-sensitive (so “Berlin” ≠ “BERLIN”).
This heuristic is fast and deterministic, but it will overgenerate in some cases (e.g., sentence-initial words). For higher quality, post-filter the results with custom lists or integrate a proper NER model. NER always uses the original tokens; stemming affects only the statistics.
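
A minimal sketch of the idea (not the tool's exact code; the sentence-start handling is simplified to a plain function-word filter):

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of the capitalization heuristic described above.
fn entity_candidates(text: &str, function_words: &HashSet<&str>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        // Keep only alphabetic characters so trailing punctuation doesn't matter.
        let word: String = token.chars().filter(|c| c.is_alphabetic()).collect();
        if word.is_empty() {
            continue;
        }
        let starts_upper = word.chars().next().is_some_and(|c| c.is_uppercase());
        let all_upper = word.chars().all(|c| c.is_uppercase());
        let is_function_word = function_words.contains(word.to_lowercase().as_str());

        if starts_upper && !all_upper && !is_function_word {
            // Case-sensitive counting: "Berlin" and "BERLIN" are separate keys
            // (the latter is filtered out above as fully uppercase anyway).
            *counts.entry(word).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let function_words: HashSet<&str> = ["the", "a", "and", "der", "die", "das"].into();
    let counts = entity_candidates("The river Spree crosses Berlin. NASA is filtered.", &function_words);
    println!("{counts:?}"); // e.g. {"Spree": 1, "Berlin": 1}
}
```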
## Performance Notes
- Per-file analysis runs in parallel using Rayon (compute); output writing is serialized to avoid I/O contention.
- Combined mode uses Map‑Reduce: files are mapped in parallel to partial counts, then reduced. PMI is computed once globally from aggregated counts.
- The short hash in filenames avoids collisions across files with the same stem when running in parallel.
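
A minimal sketch of this pattern under assumed names (`analyze_file` is a hypothetical placeholder, not the crate's API): files are processed in parallel with Rayon, and results go into a mutex-guarded sink so writing stays serialized.

```rust
use rayon::prelude::*;
use std::sync::Mutex;

// Placeholder for the real per-file analysis (hypothetical helper).
fn analyze_file(path: &str) -> String {
    format!("word counts for {path}")
}

fn main() {
    let files = vec!["a.txt", "b.txt", "notes.pdf"];
    // Output sink guarded by a Mutex: compute in parallel, write under the lock.
    let output = Mutex::new(Vec::new());

    files.par_iter().for_each(|path| {
        let result = analyze_file(path); // CPU-bound analysis runs in parallel
        output.lock().unwrap().push(result); // writing is serialized
    });

    for line in output.into_inner().unwrap() {
        println!("{line}");
    }
}
```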
## To Do / Ideas
- Multi-language support
- Custom stopword list from file
- N-gram statistics
- Direct neighbor analysis
- Named Entity detection (heuristic)
- Collocation/PMI output
- CSV/JSON/TSV export
- Context window size (CLI flag)
- Parallel per-file analysis (Rayon)
- Map‑Reduce combined mode
- Lemmatization for more languages
- Richer reporting (collocation metrics, word clouds)
Contributions welcome — especially for more languages, better PDF parsing, or improved output!
## License
MIT
## Feedback & Issues
Feedback, bug reports, and pull requests are highly appreciated! Open an Issue or start a Discussion.