Crate text_analysis

§Text Analysis Library

This crate provides core functionality for analyzing plain text (.txt) and PDF documents (.pdf). It can be used as a library or via the CLI in main.rs.

§Features

  • File collection: Recursively search for .txt and .pdf files.
  • N-gram extraction: Count contiguous word sequences of length N.
  • Word frequency analysis: Count individual word occurrences.
  • Context mapping: Identify co-occurring words within a configurable window.
  • Direct neighbor detection: Identify words immediately before and after a given word.
  • Named entity detection: Heuristic extraction of capitalized words.
  • PMI (Pointwise Mutual Information) calculation: Measure statistical association strength between word pairs.
  • Export: Save results in TXT, CSV, TSV, or JSON formats.
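
The n-gram and PMI features above can be illustrated with a small standard-library sketch. This is not the crate's implementation: `tokenize`, `count_ngrams`, and `pmi` are hypothetical helper names, the tokenizer is a simplification, and PMI is computed here only for directly adjacent word pairs as log2(p(a,b) / (p(a) p(b))).

```rust
use std::collections::HashMap;

/// Split text into lowercase alphanumeric tokens.
/// (Assumption: the crate's own tokenizer may behave differently.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Count contiguous word sequences of length `n`.
/// With `n == 1` this reduces to a plain word frequency count.
fn count_ngrams(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        return counts; // `windows(0)` would panic, so bail out early
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}

/// PMI of the adjacent pair (`a`, `b`) in bits.
/// Returns `None` when the pair never occurs side by side.
fn pmi(tokens: &[String], a: &str, b: &str) -> Option<f64> {
    if tokens.len() < 2 {
        return None;
    }
    let total_words = tokens.len() as f64;
    let total_pairs = (tokens.len() - 1) as f64;
    let count_a = tokens.iter().filter(|t| t.as_str() == a).count() as f64;
    let count_b = tokens.iter().filter(|t| t.as_str() == b).count() as f64;
    let count_ab = tokens
        .windows(2)
        .filter(|w| w[0] == a && w[1] == b)
        .count() as f64;
    if count_ab == 0.0 {
        return None;
    }
    let p_ab = count_ab / total_pairs;
    let p_a = count_a / total_words;
    let p_b = count_b / total_words;
    Some((p_ab / (p_a * p_b)).log2())
}

fn main() {
    let tokens = tokenize("New York is big, and New York is old.");
    let bigrams = count_ngrams(&tokens, 2);
    println!("\"new york\" occurs {} times", bigrams["new york"]);
    if let Some(score) = pmi(&tokens, "new", "york") {
        println!("PMI(new, york) = {:.3} bits", score);
    }
}
```

A positive score means the pair occurs together more often than independent word frequencies would predict. Very rare pairs tend to get inflated scores, so a minimum count threshold is a common addition.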

§Typical Usage

use std::collections::HashSet;
use text_analysis::analyze_text;

let text = "This is a sample text for analysis.";
// An empty stopword set; load_stopwords can supply one from a file instead.
let stopwords: HashSet<String> = HashSet::new();
let result = analyze_text(text, &stopwords, 2, 5);
println!("{}", result.summary());

All file I/O, PDF extraction, and export helpers are included.
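
The numeric arguments in the call above are not named on this page; they may be an n-gram size and a context window size, but that is an assumption. Under that reading, the following standard-library sketch shows what a window-based context map could look like. The function name `context_map` and its signature are hypothetical and are not part of the crate's API.

```rust
use std::collections::{HashMap, HashSet};

/// For every non-stopword, count the other non-stopwords that appear
/// within `window` positions on either side of it.
fn context_map(
    tokens: &[&str],
    stopwords: &HashSet<String>,
    window: usize,
) -> HashMap<String, HashMap<String, usize>> {
    let mut map: HashMap<String, HashMap<String, usize>> = HashMap::new();
    for (i, word) in tokens.iter().enumerate() {
        if stopwords.contains(*word) {
            continue;
        }
        // Clamp the window to the bounds of the token slice.
        let start = i.saturating_sub(window);
        let end = (i + window + 1).min(tokens.len());
        for j in start..end {
            if j == i || stopwords.contains(tokens[j]) {
                continue;
            }
            *map.entry((*word).to_string())
                .or_default()
                .entry(tokens[j].to_string())
                .or_insert(0) += 1;
        }
    }
    map
}

fn main() {
    let tokens = ["the", "quick", "brown", "fox"];
    let stopwords: HashSet<String> = ["the".to_string()].into_iter().collect();
    let map = context_map(&tokens, &stopwords, 1);
    for (word, neighbors) in &map {
        println!("{word}: {neighbors:?}");
    }
}
```

With `window` set to 1 the map contains only direct neighbors, which is one way to relate the context mapping and direct neighbor features to each other.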

Structs§

AnalysisReport
Analysis report for a (single or combined) analysis run
AnalysisResult
Struct with all statistics for a text
PmiEntry
Struct for PMI collocations (for test compatibility and export)

Enums§

ExportFormat
Supported export formats

Functions§

analyze_path
Analyze all files as separate documents (default mode). Each file is processed independently and results are exported separately.
analyze_path_combined
Analyze all found files as a single corpus (“--combine” mode).
analyze_text
Main text analysis function.
collect_files
Recursively collect all .txt and .pdf files from a given path.
export_results
Export results to TXT/CSV/TSV/JSON with correct file naming.
load_stopwords
Load stopwords from a file if one is given; otherwise return an empty set.
print_failed_files
Print any files that failed to be processed.
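
To make the recursive collection step concrete, here is a minimal sketch using only the standard library. It is an illustration rather than the crate's `collect_files`: the names `collect_text_files` and `has_supported_extension` are hypothetical, matching extensions case-insensitively is an assumption, and symlink loops are not handled.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if the path ends in `.txt` or `.pdf` (case-insensitive).
fn has_supported_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("txt") || ext.eq_ignore_ascii_case("pdf"))
        .unwrap_or(false)
}

/// Walk `root` recursively and return all supported files in sorted order.
/// A `root` that is itself a file is returned alone if it is supported.
fn collect_text_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    if root.is_file() {
        if has_supported_extension(root) {
            found.push(root.to_path_buf());
        }
        return Ok(found);
    }
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into subdirectories and merge their results.
            found.extend(collect_text_files(&path)?);
        } else if has_supported_extension(&path) {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}

fn main() {
    match collect_text_files(Path::new(".")) {
        Ok(files) => {
            for file in files {
                println!("{}", file.display());
            }
        }
        Err(err) => eprintln!("could not read directory: {err}"),
    }
}
```

This sketch stops at the first I/O error. A tool that reports failed files separately, as `print_failed_files` suggests, would more likely record the failing path and continue.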