Text Analysis Library
This crate provides a fast, pragmatic toolkit for linguistic text analysis over .txt, .pdf, .docx, and .odt files. It supports:
- Tokenization (Unicode-aware, simple alphanumeric rules)
- Optional stopword filtering (user-supplied list)
- Optional stemming (auto-detected or forced language)
- N-gram counting
- Word frequency counting
- Context statistics (±N window) and direct neighbors (±1)
- PMI (Pointwise Mutual Information) collocations
- Simple Named-Entity extraction (capitalization heuristic)
- Parallel per-file analysis (compute) with serialized writes
- Combined (Map-Reduce) mode that aggregates counts across files
- Deterministic, sorted outputs in CSV/TSV/JSON/TXT
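As an illustration of how tokenization, counting, and PMI fit together, here is a minimal sketch rather than the crate's implementation. It assumes the common definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))), restricted to adjacent word pairs (distance 1). The function names tokenize and bigram_pmi are hypothetical, and the crate's actual tokenization rules, window handling, and log base may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical tokenizer: split on non-alphanumeric characters and lowercase.
/// (Assumed rule; the crate's exact tokenization may differ.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Hypothetical PMI (in bits) for adjacent word pairs.
/// BTreeMap keeps the result sorted by key, so output order is deterministic.
fn bigram_pmi(tokens: &[String]) -> BTreeMap<(String, String), f64> {
    let mut unigrams: BTreeMap<&str, f64> = BTreeMap::new();
    let mut bigrams: BTreeMap<(&str, &str), f64> = BTreeMap::new();

    // Word frequency counts.
    for t in tokens {
        *unigrams.entry(t.as_str()).or_insert(0.0) += 1.0;
    }
    // Adjacent-pair (bigram) counts.
    for w in tokens.windows(2) {
        *bigrams.entry((w[0].as_str(), w[1].as_str())).or_insert(0.0) += 1.0;
    }

    let n_words = tokens.len() as f64;
    let n_pairs: f64 = bigrams.values().sum();

    bigrams
        .into_iter()
        .map(|((a, b), count)| {
            let p_ab = count / n_pairs;
            let p_a = unigrams[a] / n_words;
            let p_b = unigrams[b] / n_words;
            ((a.to_string(), b.to_string()), (p_ab / (p_a * p_b)).log2())
        })
        .collect()
}
```

Calling bigram_pmi(&tokenize(text)) yields one entry per distinct adjacent pair; pairs that co-occur more often than their individual frequencies predict receive a positive score.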
§Security & CSV/TSV export safety
If you open CSV/TSV in spreadsheet software (Excel/LibreOffice), cells that start with one of =, +, -, or @ may be interpreted as formulas (e.g., =HYPERLINK(...)). To prevent this, always:
- Write CSV/TSV using a proper CSV library (this project uses csv::Writer) so commas, tabs, quotes, and newlines are escaped correctly.
- Sanitize text cells by prefixing a single quote when they begin with one of the dangerous characters.
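The sanitization step can be sketched as follows. This is an illustrative helper, not the crate's csv_safe_cell (whose signature is not documented on this page). The name sanitize_cell is hypothetical, and the sketch assumes only the four leading characters listed above need neutralizing.

```rust
/// Hypothetical helper: neutralize cells that a spreadsheet might
/// interpret as a formula by prefixing a single quote.
fn sanitize_cell(cell: &str) -> String {
    match cell.chars().next() {
        // The cell starts with one of the dangerous characters: force text mode.
        Some('=') | Some('+') | Some('-') | Some('@') => format!("'{}", cell),
        // Anything else (including the empty string) passes through unchanged.
        _ => cell.to_string(),
    }
}
```

Escaping of commas, tabs, quotes, and newlines is a separate concern and is left to the CSV writer; the sanitized string would be passed to it as an ordinary field.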
Structs§
- AnalysisOptions - Parameters controlling analysis and export behavior.
- AnalysisReport - Summary of a completed run.
- AnalysisResult - Full analysis result for a single text/corpus.
- PmiEntry - PMI entry for a pair of words at a given distance.
Enums§
- ExportFormat - Export format for analysis outputs.
- StemLang - Supported stemming languages (subset of rust-stemmers).
- StemMode - Stemming behavior selector.
Functions§
- analyze_path - Analyze a path (file or directory).
- analyze_text_with - Analyze a single text buffer with the given stopwords and options. This is the core pipeline used by both per-file and combined modes.
- collect_files - Collect all supported files (.txt, .pdf, .docx, .odt) recursively from path.
- csv_safe_cell
- extract_text_from_docx
- extract_text_from_odt
- stem_for - Collision-safe stem used in output filenames: “<stem[.ext]>_”. The hash is a stable hash of the full path to avoid collisions across parallel runs.
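The idea behind a collision-safe stem can be sketched as below. This is not the crate's stem_for: the names fnv1a64 and hashed_stem are hypothetical, the choice of FNV-1a as the stable hash is an assumption, and so is the 16-digit hexadecimal suffix, since the exact suffix format is not shown on this page.

```rust
use std::path::Path;

/// 64-bit FNV-1a. Chosen here (as an assumption) because its output is fixed
/// by the algorithm, unlike std's DefaultHasher, whose algorithm may change
/// between Rust releases.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: file name (with extension) plus a hash of the full path.
fn hashed_stem(path: &Path) -> String {
    let name = path
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_default();
    // Hash the whole path, so equally named files in different
    // directories get different suffixes.
    let hash = fnv1a64(path.to_string_lossy().as_bytes());
    format!("{}_{:016x}", name, hash)
}
```

With this scheme, a/report.txt and b/report.txt share the prefix report.txt_ but end in different suffixes, so their output files do not overwrite each other, and the same path always maps to the same name across runs.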