# Text_Analysis
A robust, fast, modern CLI tool for linguistic text analysis of `.txt` and `.pdf` files, supporting:
- Automatic language detection (English, German, French, Spanish, Italian, Arabic)
- Optional stopword filtering (user-provided custom list; no automatic removal)
- Optional stemming (via `rust-stemmers` for supported languages)
- N-gram analysis (user-defined N)
- Word frequency and context statistics
- Sliding-window co-occurrence and direct neighbors
- Named Entity recognition (simple capitalization heuristic)
- Collocation analysis with Pointwise Mutual Information (PMI)
- Export as TXT, CSV, TSV, or JSON
- Recursively scans directories
- Per-file analysis is parallelized (Rayon); output writing is serialized
- Combined Mode uses Map‑Reduce (see below)
- Never panics: all file errors are reported rather than treated as fatal
Note: PDF parsing is built in via `pdf-extract`; no feature flag required.
## Features
- Automatic language detection (`whatlang`)
- Custom stopword lists (plain `.txt`, one word per line)
- Stemming (optional): auto by detected language or forced via CLI
- N-grams (size configurable), word frequencies
- Context statistics (±N window) & direct neighbors (±1); see the window sketch after this list
- PMI collocations for word pairs within the window
- Named Entities via capitalization heuristic (see below)
- All errors (unreadable files, PDF issues) are reported at the end
- CLI built with `clap`
- Results written to timestamped files; a concise run summary is printed to the terminal
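
As a rough illustration of the window terminology above (illustrative only, not the crate's API or exact pairing rules), collecting context pairs within ±N and direct neighbors within ±1 could look like this:

```rust
/// Illustrative sketch only: collect (center, neighbor) pairs within a
/// ±window and direct neighbors (±1) from a token sequence.
fn window_pairs(tokens: &[&str], window: usize) -> (Vec<(String, String)>, Vec<(String, String)>) {
    let mut context_pairs = Vec::new();
    let mut neighbor_pairs = Vec::new();
    for (i, center) in tokens.iter().enumerate() {
        let start = i.saturating_sub(window);
        let end = (i + window + 1).min(tokens.len());
        for j in start..end {
            if j == i {
                continue;
            }
            // Every token inside the ±window counts as a context pair.
            context_pairs.push((center.to_string(), tokens[j].to_string()));
            // Only immediately adjacent tokens count as direct neighbors.
            if j + 1 == i || j == i + 1 {
                neighbor_pairs.push((center.to_string(), tokens[j].to_string()));
            }
        }
    }
    (context_pairs, neighbor_pairs)
}

fn main() {
    let tokens = ["the", "river", "crosses", "berlin"];
    let (ctx, nbr) = window_pairs(&tokens, 2);
    println!("context pairs: {ctx:?}");
    println!("direct neighbors: {nbr:?}");
}
```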
## Installation
- Install with cargo
- Download a binary from the Releases page
- Clone the repository and build from source
Use in your own Rust project:
```toml
[dependencies]
text_analysis = { path = "path/to/text_analysis" }
```
## Usage
- `<path>`: file or directory (recursively scans for `.txt` and `.pdf`)
- `--stopwords <file>`: (optional) stopword list (one word per line; if not provided, no filtering)
- `--ngram N`: (optional, default: 2) N-gram size (2 = bigrams, 3 = trigrams, …)
- `--context N`: (optional, default: 5) context window size (±N words)
- `--export-format FORMAT`: `txt` (default), `csv`, `tsv`, or `json`
- `--entities-only`: only export named entities (not all statistics)
- `--combine`: analyze all files together as one corpus (Map‑Reduce, see below)
- `--stem`: enable stemming (based on detected language)
- `--stem-lang LANG`: force stemming language (`en`, `de`, `fr`, `es`, `it`, `pt`, `nl`, `ru`, `sv`, `fi`, `no`, `ro`, `hu`, `da`, `tr`); only effective with `--stem`
By default, each file is analyzed and exported individually. With `--combine`, all files are analyzed as a single corpus using Map‑Reduce.
Example (binary name assumed from the crate name; adjust to your build): `text_analysis ./corpus --stopwords stopwords.txt --ngram 3 --export-format csv --combine`
## Output files & naming
Output files use a collision-safe stem, an 8-char path hash, a timestamp, and the analysis type:

`<stem[.ext]>_<hash8>_<timestamp>_<analysis-type>.<ext>`

Examples:

- `cli.txt_f3a9c2b1_20250810_155411_wordfreq.csv`
- `cli.txt_f3a9c2b1_20250810_155411_ngrams.csv`
- `combined_20250810_155411_wordfreq.csv` (combined mode has no hash)

The short hash prevents filename collisions (e.g., same stem across different files), especially with parallel runs.
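
As an illustrative sketch only (the tool's actual hash function and timestamp formatting may differ), the documented pattern can be reproduced like this:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// Illustrative only: build a name in the pattern
/// <stem[.ext]>_<hash8>_<timestamp>_<analysis-type>.<ext>.
fn output_name(input_path: &str, timestamp: &str, analysis: &str, ext: &str) -> String {
    // Hash the full input path so equal stems from different directories differ.
    let mut hasher = DefaultHasher::new();
    input_path.hash(&mut hasher);
    let digest = format!("{:016x}", hasher.finish());
    let hash8 = &digest[..8];

    let stem = Path::new(input_path)
        .file_name()
        .and_then(|s| s.to_str())
        .unwrap_or("input");

    format!("{stem}_{hash8}_{timestamp}_{analysis}.{ext}")
}

fn main() {
    // e.g. cli.txt_<8 hex chars>_20250810_155411_wordfreq.csv
    println!("{}", output_name("docs/cli.txt", "20250810_155411", "wordfreq", "csv"));
}
```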
## Combined Mode (Map‑Reduce)
When `--combine` is set, the corpus is processed via Map‑Reduce for scalability and consistency:
**Map (parallel):** for each file, build partial counts from normalized tokens (lowercased, optional stopwords removed, optional stemming):

- `ngrams: HashMap<String, usize>`
- `wordfreq: HashMap<String, usize>`
- `context_pairs: HashMap<(String, String), usize>` (center, neighbor in ±window)
- `neighbor_pairs: HashMap<(String, String), usize>` (direct neighbors ±1)
- `cooc_by_dist: HashMap<(String, String, usize), usize>` (canonical pair, distance)
- `named_entities: HashMap<String, usize>` from the original (non‑stemmed) tokens
- `n_tokens: usize`
**Reduce (serial):** merge all partial counts into global totals.
**Finalize (serial):**

- Construct the final `AnalysisResult` from totals
- Compute PMI from global pair counts & unigram counts (single global pass)
- Write one set of combined outputs, e.g. `combined_<timestamp>_wordfreq.csv`
Benefits: avoids holding a giant concatenated string in memory, maximizes parallelism, and ensures PMI/frequencies are consistent across the whole corpus.
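
For illustration, here is a minimal sketch of the reduce step and the global PMI pass (assumed function names, not the crate's code; PMI uses the standard definition log2(p(x, y) / (p(x) * p(y)))):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Reduce step sketch: fold per-file partial counts (word frequencies,
/// n-grams, pair counts, ...) into one global total.
fn merge_counts<K: Eq + Hash>(partials: Vec<HashMap<K, usize>>) -> HashMap<K, usize> {
    let mut totals = HashMap::new();
    for partial in partials {
        for (key, count) in partial {
            *totals.entry(key).or_insert(0) += count;
        }
    }
    totals
}

/// Standard PMI from global counts: log2(p(x, y) / (p(x) * p(y))).
/// `pair_count`: joint occurrences within the window, `n_pairs`: total pairs,
/// `count_x` / `count_y` / `n_tokens`: unigram counts and corpus size.
fn pmi(pair_count: usize, count_x: usize, count_y: usize, n_pairs: usize, n_tokens: usize) -> f64 {
    let p_xy = pair_count as f64 / n_pairs as f64;
    let p_x = count_x as f64 / n_tokens as f64;
    let p_y = count_y as f64 / n_tokens as f64;
    (p_xy / (p_x * p_y)).log2()
}

fn main() {
    // Two per-file word-frequency maps standing in for the map phase output.
    let file_a = HashMap::from([("berlin".to_string(), 2), ("river".to_string(), 1)]);
    let file_b = HashMap::from([("berlin".to_string(), 1), ("bridge".to_string(), 3)]);

    let totals = merge_counts(vec![file_a, file_b]);
    assert_eq!(totals["berlin"], 3);

    // PMI is computed once, from merged global counts (values here are made up).
    println!("{:.3}", pmi(2, 3, 4, 50, 200));
}
```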
## Using as a Library
You get n-gram extraction, frequency analysis, PMI collocations, optional stemming, and custom stopword support.
### Example 1: English bigrams (no stopwords, no stemming)

```rust
use std::collections::HashSet;
use text_analysis::*;
```
### Example 2: German unigrams with auto stemming

```rust
use std::collections::HashSet;
use text_analysis::*;
```
### Example 3: PMI with custom stopwords (no stemming)

```rust
use std::collections::HashSet;
use text_analysis::*;
```
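
The crate builds on `whatlang` (detection) and `rust-stemmers` (stemming). As a standalone sketch of those two dependencies only (not the `text_analysis` API), assuming their usual interfaces:

```rust
use rust_stemmers::{Algorithm, Stemmer};
use whatlang::detect;

fn main() {
    let text = "Die Flüsse und Brücken Berlins prägen das Bild der ganzen Stadt.";

    // Language detection with whatlang (returns None if detection fails).
    let detected = detect(text).map(|info| info.lang());
    println!("detected language: {detected:?}"); // e.g. Some(Deu)

    // Stemming with rust-stemmers; the matching Snowball algorithm is chosen
    // from the detected (or forced) language.
    let stemmer = Stemmer::create(Algorithm::German);
    println!("{}", stemmer.stem("kreuzen")); // prints the suffix-stripped stem
}
```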
## Named-Entity Heuristic (how it works)
The current NER is a simple capitalization heuristic:
- Tokenize the original (non-stemmed) text.
- Count a token as an entity candidate if it:
  - starts with an uppercase letter (Unicode-aware),
  - is not fully uppercase (filters acronyms like “NASA”),
  - is not a common function word at sentence start (basic list).
- Counts are case-sensitive (so “Berlin” ≠ “BERLIN”).
This heuristic is fast and deterministic, but it will overgenerate in some cases (e.g., sentence-initial words). For higher quality, post-filter the results with custom lists or integrate a proper NER model. NER always uses the original tokens; stemming affects only the statistics.
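
A minimal sketch of the idea (not the tool's exact code; the sentence-start handling is simplified to a plain function-word filter):

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of the capitalization heuristic described above.
fn entity_candidates(text: &str, function_words: &HashSet<&str>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        // Keep only alphabetic characters so trailing punctuation doesn't matter.
        let word: String = token.chars().filter(|c| c.is_alphabetic()).collect();
        if word.is_empty() {
            continue;
        }
        let starts_upper = word.chars().next().is_some_and(|c| c.is_uppercase());
        let all_upper = word.chars().all(|c| c.is_uppercase());
        let is_function_word = function_words.contains(word.to_lowercase().as_str());

        if starts_upper && !all_upper && !is_function_word {
            // Case-sensitive counting: "Berlin" and "BERLIN" are separate keys
            // (the latter is filtered out above as fully uppercase anyway).
            *counts.entry(word).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let function_words: HashSet<&str> = ["the", "a", "and", "der", "die", "das"].into();
    let counts = entity_candidates("The river Spree crosses Berlin. NASA is filtered.", &function_words);
    println!("{counts:?}"); // e.g. {"Spree": 1, "Berlin": 1}
}
```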
## Performance Notes
- Per-file analysis runs in parallel using Rayon (compute); output writing is serialized to avoid I/O contention.
- Combined mode uses Map‑Reduce: files are mapped in parallel to partial counts, then reduced. PMI is computed once globally from aggregated counts.
- The short hash in filenames avoids collisions across files with the same stem when running in parallel.
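
A minimal sketch of this pattern under assumed names (`analyze_file` is a hypothetical placeholder, not the crate's API): files are processed in parallel with Rayon, and results go into a mutex-guarded sink so writing stays serialized.

```rust
use rayon::prelude::*;
use std::sync::Mutex;

// Placeholder for the real per-file analysis (hypothetical helper).
fn analyze_file(path: &str) -> String {
    format!("word counts for {path}")
}

fn main() {
    let files = vec!["a.txt", "b.txt", "notes.pdf"];
    // Output sink guarded by a Mutex: compute in parallel, write under the lock.
    let output = Mutex::new(Vec::new());

    files.par_iter().for_each(|path| {
        let result = analyze_file(path); // CPU-bound analysis runs in parallel
        output.lock().unwrap().push(result); // writing is serialized
    });

    for line in output.into_inner().unwrap() {
        println!("{line}");
    }
}
```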
## To Do / Ideas
- Multi-language support
- Custom stopword list from file
- N-gram statistics
- Direct neighbor analysis
- Named Entity detection (heuristic)
- Collocation/PMI output
- CSV/JSON/TSV export
- Context window size (CLI flag)
- Parallel per-file analysis (Rayon)
- Map‑Reduce combined mode
- Lemmatization for more languages
- Richer reporting (collocation metrics, word clouds)
Contributions welcome — especially for more languages, better PDF parsing, or improved output!
## License
MIT
## Feedback & Issues
Feedback, bug reports, and pull requests are highly appreciated! Open an Issue or start a Discussion.