Crate text_analysis


Text Analysis Library

This crate provides a fast, pragmatic toolkit for linguistic text analysis over .txt, .pdf, .docx, and .odt files. It supports:

Tokenization (Unicode-aware, simple alphanumeric rules)
Optional stopword filtering (user-supplied list)
Optional stemming (auto-detected or forced language)
N-gram counting
Word frequency counting
Context statistics (±N window) and direct neighbors (±1)
PMI (Pointwise Mutual Information) collocations
Simple Named-Entity extraction (capitalization heuristic)
Parallel per-file analysis (the compute phase runs in parallel; writes are serialized)
Combined (Map-Reduce) mode that aggregates counts across files
Deterministic, sorted outputs in CSV/TSV/JSON/TXT
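
To make the first few items concrete, here is a minimal sketch of tokenization, n-gram counting, and a PMI score. It is not the crate's implementation: the function names `tokenize`, `ngram_counts`, and `pmi` are hypothetical, the tokenizer simply splits on non-alphanumeric characters and lowercases, and the PMI helper uses the textbook definition log2(p(x, y) / (p(x) p(y))) without the per-distance bookkeeping that `PmiEntry` suggests.

```rust
use std::collections::HashMap;

/// Hypothetical tokenizer: split on any non-alphanumeric character
/// (Unicode-aware via `char::is_alphanumeric`) and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Count contiguous n-grams; each n-gram is keyed by its tokens joined with a space.
fn ngram_counts(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so an n of zero yields no n-grams.
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}

/// Textbook PMI in bits: log2( p(x, y) / (p(x) * p(y)) ),
/// with probabilities estimated from raw counts.
fn pmi(
    pair_count: usize,
    x_count: usize,
    y_count: usize,
    total_pairs: usize,
    total_tokens: usize,
) -> f64 {
    let p_xy = pair_count as f64 / total_pairs as f64;
    let p_x = x_count as f64 / total_tokens as f64;
    let p_y = y_count as f64 / total_tokens as f64;
    (p_xy / (p_x * p_y)).log2()
}
```

A positive score means the pair co-occurs more often than its individual frequencies would predict; a score near zero suggests independence.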

§Security & CSV/TSV export safety

If you open CSV/TSV in spreadsheet software (Excel/LibreOffice), cells that start with one of =, +, -, or @ may be interpreted as formulas (e.g., =HYPERLINK(...)). To prevent this, always:

Write CSV/TSV using a proper CSV library (this project uses csv::Writer) so commas, tabs, quotes, and newlines are escaped correctly.
Sanitize text cells by prefixing a single quote when they begin with one of the dangerous characters.
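
The sanitization rule above can be sketched as follows. This is an illustration under stated assumptions, not the crate's `csv_safe_cell` (whose signature is not shown here); the name `sanitize_cell` is hypothetical, and quoting and escaping of delimiters is still left to the CSV writer.

```rust
/// Hypothetical helper: prefix a single quote when the cell begins with
/// one of the characters a spreadsheet may treat as the start of a formula.
fn sanitize_cell(cell: &str) -> String {
    match cell.chars().next() {
        Some('=') | Some('+') | Some('-') | Some('@') => format!("'{}", cell),
        // Empty cells and ordinary text pass through unchanged.
        _ => cell.to_string(),
    }
}
```

Apply such a helper to every text cell before handing the record to the writer, so that the escaping done by the CSV library operates on the already-neutralized value.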

Structs§

AnalysisOptions: Parameters controlling analysis and export behavior.
AnalysisReport: Summary of a completed run.
AnalysisResult: Full analysis result for a single text/corpus.
PmiEntry: PMI entry for a pair of words at a given distance.

Enums§

ExportFormat: Export format for analysis outputs.
StemLang: Supported stemming languages (subset of rust-stemmers).
StemMode: Stemming behavior selector.

Functions§

analyze_path: Analyze a path (file or directory).
analyze_text_with: Analyze a single text buffer with the given stopwords and options. This is the core pipeline used by both per-file and combined modes.
collect_files: Collect all supported files (.txt, .pdf, .docx, .odt) recursively from path.
csv_safe_cell
extract_text_from_docx
extract_text_from_odt
stem_for: Collision-safe stem used in output filenames: <stem[.ext]>_<hash8>. The hash is a stable hash of the full path to avoid collisions across parallel runs.
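
The naming scheme described for `stem_for` can be sketched roughly as below. This is an assumption-laden illustration rather than the crate's code: the function name `output_stem`, the choice of 64-bit FNV-1a as the stable hash, and the use of the file name (including extension) as the stem are all hypothetical. FNV-1a is used here only because `std`'s `DefaultHasher` does not guarantee the same output across Rust releases.

```rust
use std::path::Path;

/// 64-bit FNV-1a: a simple hash whose output does not depend on the Rust version.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical helper producing `<file name>_<hash8>`, where `hash8` is the
/// low 32 bits of the hash of the full path, written as 8 hex digits.
fn output_stem(path: &Path) -> String {
    let name = path
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_else(|| "input".to_string()); // fallback for paths without a file name
    let hash = fnv1a64(path.to_string_lossy().as_bytes());
    format!("{}_{:08x}", name, hash & 0xffff_ffff)
}
```

Because the hash covers the full path rather than just the file name, two files named `report.txt` in different directories map to different output names, while the same path always maps to the same name.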
