text_analysis 0.4.1

A robust multilingual text analysis CLI with context, N-grams, named entities, and CSV/JSON export.

A robust, modern CLI tool for linguistic text analysis in .txt and .pdf files, supporting:

  • Automatic language detection (English, German, French, Spanish, Italian, Arabic)
  • Stemming (where possible)
  • Optional stopword removal (via custom stoplist)
  • N-gram analysis (user-defined N)
  • Word frequency and context statistics
  • Sliding-window co-occurrence and direct neighbors
  • Named Entity recognition (simple heuristic)
  • Collocation analysis with Pointwise Mutual Information (PMI) for all word pairs in the context window
  • Export as TXT, CSV, TSV, or JSON for further processing
  • Recursive directory scanning
  • Live progress bar for file reading
  • Never panics: all file errors are reported, not fatal

Features

  • Automatic language detection (whatlang)
  • Per-language stemming (via rust-stemmers), or none (e.g. for Arabic)
  • Custom stopword lists supported (plain .txt, one word per line)
  • Counts and outputs N-grams (size configurable via CLI, e.g. bigrams, trigrams)
  • Context statistics: for every word, which words appear nearby most frequently (±N window)
  • Direct neighbors (±1) are reported separately
  • Collocation analysis with PMI for all word pairs in the context window (CSV/JSON export); a windowed-PMI sketch follows this list
  • Named Entity recognition via capitalization heuristic
  • Progress bar and current file output during analysis (indicatif)
  • All errors (unreadable files, PDF problems) are reported at the end; the tool never panics
  • CLI built with clap
  • Results output to timestamped files in the working directory
  • Failsafe: Always outputs a .txt file containing the whole analysis
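
The PMI used for collocations is standard pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from counts in the ±N window. Below is a minimal, self-contained sketch of windowed pair counting and scoring; it is illustrative only, not the crate's internal code:

use std::collections::HashMap;

/// Count co-occurring pairs within a forward window and score them with PMI.
fn pmi_pairs(tokens: &[&str], window: usize) -> Vec<((String, String), f64)> {
    if tokens.is_empty() {
        return Vec::new();
    }
    let mut word_counts: HashMap<&str, u64> = HashMap::new();
    let mut pair_counts: HashMap<(String, String), u64> = HashMap::new();
    for (i, &w) in tokens.iter().enumerate() {
        *word_counts.entry(w).or_insert(0) += 1;
        // Pair w with every token up to `window` positions to its right.
        for j in i + 1..=(i + window).min(tokens.len() - 1) {
            *pair_counts
                .entry((w.to_string(), tokens[j].to_string()))
                .or_insert(0) += 1;
        }
    }
    let n = tokens.len() as f64;
    let total_pairs: u64 = pair_counts.values().sum();
    let mut scored: Vec<_> = pair_counts
        .into_iter()
        .map(|((a, b), count)| {
            // PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
            let p_xy = count as f64 / total_pairs as f64;
            let p_x = word_counts[a.as_str()] as f64 / n;
            let p_y = word_counts[b.as_str()] as f64 / n;
            ((a, b), (p_xy / (p_x * p_y)).log2())
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}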

Installation

  • With cargo:

    cargo install text_analysis
    
  • Download a binary from the Releases page

  • Clone the repository and build from source
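
    The usual cargo workflow applies for a source build (the clone URL here
    is a placeholder for the project's actual repository):

    git clone <repository-url>
    cd text_analysis
    cargo build --release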


Usage

text_analysis <path> [--stopwords stoplist.txt] [--ngram N] [--context N] [--export-format FORMAT] [--entities-only]
  • <path>: file or directory (recursively scans for .txt and .pdf)
  • --stopwords <file>: (optional) additional stopword list, plain text with one word per line (see the example after this option list)
  • --ngram N: (optional, default: 2) N-gram size (e.g. 2 = bigrams, 3 = trigrams)
  • --context N: (optional, default: 5) context window size (N = ±N words)
  • --export-format FORMAT: txt (default), csv, tsv, or json (exports results as separate files)
  • --entities-only: only export named entities (names), not full statistics
  • --combine: (optional) analyze all files together and output combined result files
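
A stopword file is plain UTF-8 text with one word per line, e.g. a hypothetical my_stoplist.txt:

    the
    and
    of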

By default, each file is analyzed and exported individually. With --combine, all files are analyzed as a single corpus and combined result files are exported.
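
For example, to analyze a whole directory as a single corpus and export JSON:

text_analysis ./my_corpus/ --combine --export-format json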

During analysis, a progress bar and the current file being read are shown in the terminal.

Example:

text_analysis ./my_corpus/ --stopwords my_stoplist.txt --ngram 3 --context 4 --export-format csv

Output Example

Both stdout and the output file list N-gram statistics first, then per-word frequency and context, then named entities, then PMI collocations.

=== N-gram Analysis (N=3) ===
Ngram: "the quick brown" — Count: 18
Ngram: "quick brown fox" — Count: 18
Ngram: "brown fox jumps" — Count: 17
...

=== Word Frequencies and Context (window ±5) ===
Word: "fox" — Frequency: 25
    Words near: [("the", 22), ("quick", 18), ("brown", 15), ...]
    Direct neighbors: [("quick", 10), ("jumps", 9), ...]

Word: "dog" — Frequency: 19
    Words near: [("lazy", 14), ("brown", 9), ...]
    Direct neighbors: [("lazy", 7), ...]

=== Named Entities ===
  Fox                    — Count: 8
  Dog                    — Count: 5

=== PMI Collocations (min_count=5, top 20) ===
(      fox,      quick) @ d= 1  PMI= 4.13  count=19
(     lazy,        dog) @ d= 1  PMI= 4.02  count=18
...

# At the end of the run (stderr):
Warning: The following files could not be read:
  ./broken.pdf: PDF error: ...
  ./unreadable.txt: ...

Exported Files

The output files start with the analyzed filename, followed by the analysis type and a timestamp. For example:

  • mytext_wordfreq_20250803_191010.csv
  • mytext_ngrams_20250803_191010.csv
  • mytext_namedentities_20250803_191010.csv
  • mytext_pmi_20250803_191010.csv

When using combined analysis (--combine):

  • combined_wordfreq_20250803_191010.csv
  • combined_ngrams_20250803_191010.csv
  • etc.

The exact file naming scheme is:
<filename>_<analysis-type>_<timestamp>.<ext>
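
For downstream processing, the exports can be read with any CSV tooling. A minimal Rust sketch using the csv crate (add csv = "1" to your dependencies; the filename and column layout depend on the export you pick):

use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Adjust the path to one of the actually exported files.
    let mut reader = csv::Reader::from_path("mytext_wordfreq_20250803_191010.csv")?;
    for record in reader.records().take(10) {
        println!("{:?}", record?);
    }
    Ok(())
}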


Using as a Library

This crate can be used directly in your own Rust projects for fast multi-language text analysis.
You get all core functions, including n-gram extraction, frequency analysis, collocation statistics (PMI), and automatic or custom stopword support.

Add to your Cargo.toml:

[dependencies]
text_analysis = "0.4"
# or, for a local checkout:
# text_analysis = { path = "path/to/your/text_analysis" }

Example 1: Analyze a Text for English Bigrams

use text_analysis::*;

fn main() {
    let text = "The quick brown fox jumps over the lazy dog. The fox was very quick!";
    let stopwords = default_stopwords_for_language("en");
    let result = analyze_text(text, &stopwords, 2, 2); // bigrams, window = 2

    println!("Top 3 bigrams:");
    for (ngram, count) in result.ngrams.iter().take(3) {
        println!("{}: {}", ngram, count);
    }
}

Example 2: Frequency and Named Entity Extraction for German

use text_analysis::*;

fn main() {
    let text = "Goethe schrieb den Faust. Faust ist ein Klassiker der deutschen Literatur.";
    let stopwords = default_stopwords_for_language("de");
    let result = analyze_text(text, &stopwords, 1, 2); // unigrams, window = 2

    println!("Most frequent words:");
    for (word, count) in result.wordfreq.iter().take(5) {
        println!("{}: {}", word, count);
    }
    println!("
Named entities:");
    for (entity, count) in result.named_entities.iter() {
        println!("{}: {}", entity, count);
    }
}

Example 3: PMI Collocation Extraction with Custom Stopwords

use std::collections::HashSet;
use text_analysis::*;

fn main() {
    let text = "Alice loves Bob. Bob loves Alice. Alice and Bob are friends.";
    let mut stopwords = HashSet::new();
    for w in ["and", "are", "loves"] { stopwords.insert(w.to_string()); }
    let result = analyze_text(text, &stopwords, 1, 2); // unigrams, window = 2

    println!("PMI pairs (min_count=5):");
    for entry in result.pmi.iter().take(5) {
        println!("({}, {})  PMI: {:.2}", entry.word1, entry.word2, entry.pmi);
    }
}

Tip:
All functions work with any Unicode text.


Scientific Features & Best Practices

  • Language-aware stemming and stopwords for English, German, French, Spanish, Italian, Arabic
  • Optional additional stoplist (e.g. for project-specific terms)
  • N-gram and co-occurrence analysis for computational linguistics or stylometry
  • Collocation statistics with mutual information (PMI)
  • Configurable context window size (±N words)
  • All outputs can be processed as CSV/TSV/JSON, e.g. in R, Python, Excel, pandas, etc.
  • Named Entities exported for further annotation or statistics (see the heuristic sketch after this list)
  • Errors and skipped files are always listed at the end
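
The capitalization heuristic for named entities can be pictured roughly as follows (a toy sketch, not the crate's exact rules):

// A token "looks like" an entity if it starts uppercase and is not in
// sentence-initial position, where any word would be capitalized.
fn looks_like_entity(prev: Option<&str>, word: &str) -> bool {
    let starts_upper = word.chars().next().map_or(false, char::is_uppercase);
    let sentence_start =
        prev.map_or(true, |p| matches!(p.chars().last(), Some('.' | '!' | '?')));
    starts_upper && !sentence_start
}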

To Do / Ideas

Done:

  • Multi-language support
  • Custom stopword list from file
  • N-gram statistics
  • Direct neighbor analysis
  • Named Entity detection (heuristic)
  • Collocation/PMI output
  • CSV/JSON export
  • Robust error handling & test coverage
  • Option for context window size (CLI flag)

Open:

  • Lemmatization for more languages (if crates become available)
  • Parallel file analysis for very large corpora
  • More advanced reporting (collocation metrics, word clouds)

Contributions welcome, especially for more languages, better PDF/docx parsing, or improved output!


License

MIT


Feedback & Issues

Feedback, bug reports, and pull requests are highly appreciated! Open an Issue or start a discussion.