Crate text_analysis

Text Analysis Library

This crate provides a fast, pragmatic toolkit for linguistic text analysis over .txt, .pdf, .docx, and .odt files. It supports:

  • Tokenization (Unicode-aware, simple alphanumeric rules)
  • Optional stopword filtering (user-supplied list)
  • Optional stemming (auto-detected or forced language)
  • N-gram counting
  • Word frequency counting
  • Context statistics (±N window) and direct neighbors (±1)
  • PMI (Pointwise Mutual Information) collocations
  • Simple Named-Entity extraction (capitalization heuristic)
  • Parallel per-file analysis: computation runs in parallel, writes are serialized
  • Combined (Map-Reduce) mode that aggregates counts across files
  • Deterministic, sorted outputs in CSV/TSV/JSON/TXT
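The tokenization and n-gram counting steps listed above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual implementation: the names tokenize and count_ngrams are hypothetical, and lowercasing tokens is an assumption made here for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: split on any non-alphanumeric character
/// (char::is_alphanumeric is Unicode-aware) and lowercase each token.
/// Lowercasing is an assumption of this sketch.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Hypothetical helper: count n-grams, joining tokens with a single space.
/// A BTreeMap keeps keys sorted, so iteration order is deterministic.
fn count_ngrams(tokens: &[String], n: usize) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    if n == 0 {
        // slice::windows panics on a window size of 0, so return early.
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}
```

Word frequency counting is the n = 1 case of the same function. Using a sorted map is one simple way to get deterministic output order; sorting by count before export would be a separate step.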

Security & CSV/TSV export safety

If you open CSV/TSV in spreadsheet software (Excel/LibreOffice), cells that start with one of =, +, -, or @ may be interpreted as formulas (e.g., =HYPERLINK(...)). To prevent this, always:

  1. Write CSV/TSV using a proper CSV library (this project uses csv::Writer) so commas, tabs, quotes, and newlines are escaped correctly.
  2. Sanitize text cells by prefixing a single quote when they begin with one of the dangerous characters.
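Step 2 above might look like the sketch below. The function name sanitize_cell is hypothetical and chosen for illustration; the crate exports its own csv_safe_cell, whose exact signature and behavior are not shown on this page.

```rust
/// Hypothetical sketch: prefix a single quote when the cell starts with a
/// character that spreadsheet software may treat as the start of a formula.
fn sanitize_cell(cell: &str) -> String {
    match cell.chars().next() {
        Some('=' | '+' | '-' | '@') => format!("'{cell}"),
        _ => cell.to_string(),
    }
}
```

Note that this rule also prefixes cells such as -5, so under this sketch it is best applied to text columns only and not to numeric ones. Escaping of commas, tabs, quotes, and newlines is still left to the CSV writer.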

Structs

AnalysisOptions
Parameters controlling analysis and export behavior.
AnalysisReport
Summary of a completed run.
AnalysisResult
Full analysis result for a single text/corpus.
PmiEntry
PMI entry for a pair of words at a given distance.
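For orientation, PMI for a word pair is commonly defined as log2(p(x, y) / (p(x) · p(y))). The sketch below estimates it from raw counts. It is an assumption-laden illustration: the function name pmi, the base-2 logarithm, and the use of one shared total for all three probabilities are choices made here, not confirmed details of how this crate fills a PmiEntry.

```rust
/// Hypothetical sketch: PMI in bits from raw counts.
/// Probabilities are estimated as count / total, which gives
/// PMI = log2( pair_count * total / (x_count * y_count) ).
/// Returns None when any count is zero, since PMI is undefined there.
fn pmi(pair_count: u64, x_count: u64, y_count: u64, total: u64) -> Option<f64> {
    if pair_count == 0 || x_count == 0 || y_count == 0 || total == 0 {
        return None;
    }
    let numerator = pair_count as f64 * total as f64;
    let denominator = x_count as f64 * y_count as f64;
    Some((numerator / denominator).log2())
}
```

A positive value means the pair co-occurs more often than independence would predict, and a negative value means less often. Low-frequency pairs tend to get inflated scores, so a minimum-count threshold is usually applied before ranking.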

Enums

ExportFormat
Export format for analysis outputs.
StemLang
Supported stemming languages (subset of rust-stemmers).
StemMode
Stemming behavior selector.

Functions

analyze_path
Analyze a path (file or directory).
analyze_text_with
Analyze a single text buffer with the given stopwords and options. This is the core pipeline used by both per-file and combined modes.
collect_files
Collect all supported files (.txt, .pdf, .docx, .odt) recursively from path.
csv_safe_cell
extract_text_from_docx
extract_text_from_odt
stem_for
Collision-safe stem used in output filenames: “<stem[.ext]>_”. The hash is a stable hash of the full path to avoid collisions across parallel runs.
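The idea behind a collision-safe stem can be sketched as follows. Everything here is an assumption for illustration: the names fnv1a64 and output_stem are hypothetical, FNV-1a is swapped in as an example of a hash that stays stable across runs and compiler versions, and the eight-hex-digit suffix is an arbitrary choice. The crate's real hash function and suffix format are not documented on this page.

```rust
use std::path::Path;

/// 64-bit FNV-1a, used here only as an example of a stable hash.
/// (std's DefaultHasher is not guaranteed to be stable across Rust releases.)
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical sketch: file name (stem plus extension) followed by an
/// underscore and a short hash of the full path.
fn output_stem(path: &Path) -> String {
    let name = path
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_default();
    let hash = fnv1a64(path.to_string_lossy().as_bytes());
    // Keep the low 32 bits so the suffix is always eight hex digits.
    format!("{name}_{:08x}", hash & 0xffff_ffff)
}
```

Because the hash covers the full path and not just the file name, two files called report.txt in different directories would be expected to map to different output names, while the same path always maps to the same name.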