§Text Analysis Library
This crate provides core functionality for analyzing plain text (.txt) and PDF documents (.pdf).
It can be used as a library or via the CLI in main.rs.
§Features
- File collection: Recursively search for .txt and .pdf files.
- N-gram extraction: Count contiguous word sequences of length N.
- Word frequency analysis: Count individual word occurrences.
- Context mapping: Identify co-occurring words within a configurable window.
- Direct neighbor detection: Identify words immediately before and after a given word.
- Named entity detection: Heuristic extraction of capitalized words.
- PMI (Pointwise Mutual Information) calculation: Measure statistical association strength between word pairs.
- Export: Save results in TXT, CSV, TSV, or JSON formats.
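To make the N-gram and PMI features above more concrete, here is a minimal sketch using only the standard library. The names `tokenize`, `count_ngrams`, and `pmi` are hypothetical and are not part of this crate's API; the tokenization rule (lowercase, split on non-alphanumeric characters) and the base-2 logarithm are assumptions, and the crate's own implementation may differ.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it into alphanumeric tokens.
/// Hypothetical tokenizer: the crate's own rules may differ.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Count contiguous word sequences of length `n`, keyed by the
/// space-joined sequence. Returns an empty map for `n == 0`.
fn count_ngrams(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so handle it explicitly.
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}

/// Pointwise mutual information of a word pair:
/// PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ),
/// with probabilities estimated from raw counts.
/// Returns `None` when any count is zero, since PMI is undefined there.
fn pmi(
    pair_count: usize,
    x_count: usize,
    y_count: usize,
    total_tokens: usize,
    total_pairs: usize,
) -> Option<f64> {
    if pair_count == 0 || x_count == 0 || y_count == 0 || total_tokens == 0 || total_pairs == 0 {
        return None;
    }
    let p_xy = pair_count as f64 / total_pairs as f64;
    let p_x = x_count as f64 / total_tokens as f64;
    let p_y = y_count as f64 / total_tokens as f64;
    Some((p_xy / (p_x * p_y)).log2())
}
```

A positive PMI means the pair occurs together more often than its individual word frequencies would predict; a value near zero suggests the words are roughly independent.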
§Typical Usage
use text_analysis::{analyze_text, load_stopwords};
use std::collections::HashSet;
let text = "This is a sample text for analysis.";
let stopwords: HashSet<String> = HashSet::new();
let result = analyze_text(text, &stopwords, 2, 5);
println!("{}", result.summary());
All file I/O, PDF extraction, and export helpers are included.
Structs§
- AnalysisReport - Analysis report for a (single or combined) analysis run
- AnalysisResult - Struct with all statistics for a text
- PmiEntry - Struct for PMI Collocations (for test compatibility and export)
Enums§
- ExportFormat - Supported export formats
Functions§
- analyze_path - Analyze all files as separate documents (default mode). Each file is processed independently and results are exported separately.
- analyze_path_combined - Analyze all found files as a single corpus ("--combine" mode).
- analyze_text - Main text analysis function.
- collect_files - Recursively collect all .txt and .pdf files from a given path.
- export_results - Export results to TXT/CSV/TSV/JSON with correct file naming.
- load_stopwords - Load stopwords from file if given; else empty set.
- print_failed_files - Print any files that failed to be processed.