text_analysis 0.4.0

A robust multilingual text analysis CLI with context statistics, N-grams, named entities, and CSV/JSON export.

Text_Analysis

A robust, modern CLI tool for linguistic analysis of text in .txt and .pdf files, supporting:

Automatic language detection (English, German, French, Spanish, Italian, Arabic)
Stemming (where a stemmer exists for the language; none for Arabic, for example)
Stopword removal (per language, and optionally custom stoplist)
N-gram analysis (user-defined N)
Word frequency and context statistics
Sliding-window co-occurrence and direct neighbors
Named Entity recognition (simple heuristic)
Collocation analysis with Pointwise Mutual Information (PMI) for all word pairs in the context window
Export as TXT, CSV, TSV, or JSON for further processing
Recursively scans directories
Live progress bar for file reading
Never panics on file errors: they are reported rather than treated as fatal
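
The stopword and N-gram steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function names `tokenize` and `count_ngrams` are hypothetical, stemming is left out, and removing stopwords before forming N-grams is an assumption about the order of the steps.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase the text, split it on non-alphanumeric characters,
/// and drop every token that appears in `stopwords`.
/// (Hypothetical helper, not part of the crate's documented API.)
fn tokenize(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !stopwords.contains(t.as_str()))
        .collect()
}

/// Count every run of `n` consecutive tokens, joined by a single space.
fn count_ngrams(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so an N of zero yields no N-grams.
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let stopwords: HashSet<&str> = ["the", "over"].iter().copied().collect();
    let text = "The quick brown fox jumps over the lazy dog. The quick brown cat.";
    let tokens = tokenize(text, &stopwords);
    let bigrams = count_ngrams(&tokens, 2);
    println!("quick brown: {:?}", bigrams.get("quick brown"));
}
```

With this ordering, words separated only by stopwords become adjacent, so "jumps over the lazy" contributes the bigram "jumps lazy".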

Features

Automatic language detection (whatlang)
Per-language stemming (via rust-stemmers), or none (e.g. for Arabic)
Built-in stopword lists for English, German, French, Spanish, Italian, Arabic
Custom stopword lists supported (plain .txt, one word per line)
Counts and outputs N-grams (size configurable via CLI, e.g. bigrams, trigrams)
Context statistics: for every word, which words appear nearby most frequently (±N window)
Direct neighbors (±1) are reported separately
Collocation analysis with PMI, for all word pairs in the context window (CSV/JSON export)
Named Entity recognition via capitalization heuristic
Progress bar and current file output during analysis (indicatif)
All errors (unreadable files, PDF problems) are reported at the end instead of causing a panic
CLI built with clap
Results are written to timestamped files in the working directory
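
The list above only says that named entities are found by a capitalization heuristic. One plausible reading, sketched here with the hypothetical function `capitalized_candidates`, is to count capitalized words that do not open a sentence. The sentence-boundary rule is an assumption and may differ from what the tool actually does.

```rust
use std::collections::HashMap;

/// Count capitalized words that are not the first word of a sentence.
/// (Hypothetical helper; the boundary rule is an assumption.)
fn capitalized_candidates(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    let mut sentence_start = true;
    for raw in text.split_whitespace() {
        // Strip surrounding punctuation such as quotes, commas, and periods.
        let word = raw.trim_matches(|c: char| !c.is_alphanumeric());
        let is_capitalized = word.chars().next().map_or(false, |c| c.is_uppercase());
        if is_capitalized && !sentence_start {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
        // The next token opens a sentence if this one ends in '.', '!' or '?'.
        sentence_start = raw.ends_with(|c: char| c == '.' || c == '!' || c == '?');
    }
    counts
}

fn main() {
    let counts = capitalized_candidates("Yesterday Alice met Bob in Paris. Then Alice left.");
    println!("Alice: {:?}", counts.get("Alice"));
    println!("Then: {:?}", counts.get("Then"));
}
```

A heuristic of this kind over-reports in German, where all nouns are capitalized, and finds nothing in Arabic, which has no letter case.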

Installation

With cargo:
```
cargo install text_analysis
```
Download a binary from Releases
Clone the repository and build from source

Installing with cargo or building from source requires the Rust toolchain (rustup.rs)

Usage

text_analysis <path> [--stopwords stoplist.txt] [--ngram N] [--context N] [--export-format FORMAT] [--entities-only]

<path>: file or directory (recursively scans for .txt and .pdf)
--stopwords <file>: (optional) additional stopword list (one word per line)
--ngram N: (optional, default: 2) N-gram size (e.g. 2 = bigrams, 3 = trigrams)
--context N: (optional, default: 5) context window size (N words on each side of a word)
--export-format FORMAT: txt (default), csv, tsv, or json (exports results as separate files)
--entities-only: only export named entities (names), not full statistics
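
The --context option sets a symmetric window of N words on each side of a word. A minimal sketch of how such counts could be gathered follows; `context_counts` is a hypothetical helper, not part of the crate's documented API. Calling it with a window of 1 gives the direct neighbors.

```rust
use std::collections::HashMap;

/// For every token, count the tokens found within `window` positions
/// to its left and right. (Hypothetical helper for illustration.)
fn context_counts(tokens: &[&str], window: usize) -> HashMap<String, HashMap<String, usize>> {
    let mut contexts: HashMap<String, HashMap<String, usize>> = HashMap::new();
    for (i, word) in tokens.iter().enumerate() {
        // Clamp the window to the bounds of the token list.
        let start = i.saturating_sub(window);
        let end = i.saturating_add(window).saturating_add(1).min(tokens.len());
        for j in start..end {
            if j != i {
                *contexts
                    .entry((*word).to_string())
                    .or_default()
                    .entry(tokens[j].to_string())
                    .or_insert(0) += 1;
            }
        }
    }
    contexts
}

fn main() {
    let tokens = ["the", "quick", "brown", "fox", "jumps"];
    let nearby = context_counts(&tokens, 2);
    let direct = context_counts(&tokens, 1);
    println!("words near 'brown': {}", nearby["brown"].len());
    println!("direct neighbors of 'fox': {}", direct["fox"].len());
}
```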

During analysis, a progress bar and the current file being read are shown in the terminal.

Example:

text_analysis ./my_corpus/ --stopwords my_stoplist.txt --ngram 3 --context 4 --export-format csv
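
The file passed to --stopwords is plain text with one word per line. A sketch of how such a file could be parsed and merged with a built-in list is shown below. `parse_stoplist` is a hypothetical name, and trimming whitespace, skipping blank lines, and lowercasing are assumptions rather than documented behavior.

```rust
use std::collections::HashSet;

/// Parse stoplist contents: one word per line, blank lines ignored.
/// (Hypothetical helper; trimming and lowercasing are assumptions.)
fn parse_stoplist(contents: &str) -> HashSet<String> {
    contents
        .lines()
        .map(|line| line.trim().to_lowercase())
        .filter(|word| !word.is_empty())
        .collect()
}

fn main() {
    // In practice the contents would come from
    // std::fs::read_to_string("my_stoplist.txt").
    let custom = parse_stoplist("corpus\n\n  Chapter \nfigure\n");

    // Stand-in for a built-in list; the custom words are added on top.
    let mut stopwords: HashSet<String> =
        ["the", "and"].iter().map(|w| w.to_string()).collect();
    stopwords.extend(custom);
    println!("{} stopwords in total", stopwords.len());
}
```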

Output Example

Both the output file and stdout list N-gram statistics first, followed by per-word frequency and context, then named entities, then PMI collocations.

=== N-gram Analysis (N=3) ===
Ngram: "the quick brown" — Count: 18
Ngram: "quick brown fox" — Count: 18
Ngram: "brown fox jumps" — Count: 17
...

=== Word Frequencies and Context (window ±5) ===
Word: "fox" — Frequency: 25
    Words near: [("the", 22), ("quick", 18), ("brown", 15), ...]
    Direct neighbors: [("quick", 10), ("jumps", 9), ...]

Word: "dog" — Frequency: 19
    Words near: [("lazy", 14), ("brown", 9), ...]
    Direct neighbors: [("lazy", 7), ...]

=== Named Entities ===
  Fox                    — Count: 8
  Dog                    — Count: 5

=== PMI Collocations (min_count=5, top 20) ===
(      fox,      quick) @ d= 1  PMI= 4.13  count=19
(     lazy,        dog) @ d= 1  PMI= 4.02  count=18
...

# At the end of the run (stderr):
Warning: The following files could not be read:
  ./broken.pdf: PDF error: ...
  ./unreadable.txt: ...

Exported files:

20250803_191010_wordfreq.csv (or .json)
20250803_191010_ngrams.csv
20250803_191010_namedentities.csv
20250803_191010_pmi.csv

Library Example

use text_analysis::{analyze_text, english_stopwords};

fn main() {
    let text = "The quick brown fox jumps over the lazy dog.";
    // Built-in English stopword list.
    let stopwords = english_stopwords();
    let out = analyze_text(text, None, &stopwords, 2);
    println!("{}", out);
}

Scientific Features & Best Practices

Language-aware stemming and stopwords for English, German, French, Spanish, Italian, Arabic
Optional additional stoplist (e.g. for project-specific terms)
N-gram and co-occurrence analysis for computational linguistics or stylometry
Collocation statistics with mutual information (PMI)
Configurable context window size (±N words)
All outputs can be processed as CSV/TSV/JSON, e.g. in R, Python, Excel, pandas, etc.
Named Entities exported for further annotation or statistics
Errors and skipped files are always listed at the end
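
For readers unfamiliar with PMI, the standard definition is PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A positive score means the pair co-occurs more often than independence would predict. The sketch below computes it from raw counts. The function `pmi` and the way probabilities are normalized (pair counts over total pairs, word counts over total tokens) are assumptions, and the tool may normalize differently.

```rust
/// PMI in bits from raw counts. Returns None when any count is zero,
/// because the logarithm would be undefined.
/// (Hypothetical helper; the normalization is an assumption.)
fn pmi(
    pair_count: usize,
    x_count: usize,
    y_count: usize,
    total_tokens: usize,
    total_pairs: usize,
) -> Option<f64> {
    if pair_count == 0 || x_count == 0 || y_count == 0 || total_tokens == 0 || total_pairs == 0 {
        return None;
    }
    let p_xy = pair_count as f64 / total_pairs as f64;
    let p_x = x_count as f64 / total_tokens as f64;
    let p_y = y_count as f64 / total_tokens as f64;
    Some((p_xy / (p_x * p_y)).log2())
}

fn main() {
    // Toy counts: 1024 tokens, 1024 observed pairs,
    // each word seen 32 times, the pair seen 16 times.
    if let Some(score) = pmi(16, 32, 32, 1024, 1024) {
        println!("PMI = {:.2}", score);
    }
}
```

In the toy example the pair occurs 16 times as often as independence would predict, which corresponds to 4 bits. PMI tends to inflate scores for rare pairs, which is why a minimum count threshold such as the min_count shown in the output header is useful.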

To Do / Ideas

Multi-language support
Custom stopword list from file
N-gram statistics
Direct neighbor analysis
Named Entity detection (heuristic)
Collocation/PMI output
CSV/JSON export
Robust error handling & test coverage
Lemmatization for more languages (if crates become available)
Option for context window size (CLI flag)
Parallel file analysis for very large corpora
More advanced reporting (collocation metrics, word clouds)

Contributions welcome, especially for more languages, better PDF/docx parsing, or improved output!

License

MIT

Feedback & Issues

Feedback, bug reports, and pull requests are highly appreciated! Open an Issue or start a discussion.