Crate text_analysis


§Text Analysis Library

This crate provides core functionality for analyzing plain text (.txt) and PDF documents (.pdf). It can be used as a library or via the CLI in main.rs.

§Features

File collection: Recursively search for .txt and .pdf files.
N-gram extraction: Count contiguous word sequences of length N.
Word frequency analysis: Count individual word occurrences.
Context mapping: Identify co-occurring words within a configurable window.
Direct neighbor detection: Identify words immediately before and after a given word.
Named entity detection: Heuristic extraction of capitalized words.
PMI (Pointwise Mutual Information) calculation: Measure statistical association strength between word pairs.
Export: Save results in TXT, CSV, TSV, or JSON formats.
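
To make the n-gram and PMI features above more concrete, here is a minimal, standard-library-only sketch. It is not the crate's implementation: the function names `tokenize`, `count_ngrams`, and `pmi` are hypothetical, the tokenizer is a simplification, and PMI is computed here for adjacent word pairs as log2(p(a, b) / (p(a) p(b))), which may differ from how the crate defines its pairs or window.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it into alphanumeric word tokens.
/// (Hypothetical helper; the crate's own tokenization may differ.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Count contiguous word sequences of length `n`, keyed by the
/// space-joined words. Returns an empty map when `n` is 0.
fn count_ngrams(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so bail out early.
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}

/// Pointwise mutual information of the adjacent pair (`a`, `b`):
/// log2( p(a, b) / (p(a) * p(b)) ).
/// Returns `None` when the pair never occurs (PMI is undefined there).
fn pmi(tokens: &[String], a: &str, b: &str) -> Option<f64> {
    let total = tokens.len();
    if total < 2 {
        return None;
    }
    let count_a = tokens.iter().filter(|t| t.as_str() == a).count();
    let count_b = tokens.iter().filter(|t| t.as_str() == b).count();
    let count_ab = tokens
        .windows(2)
        .filter(|w| w[0].as_str() == a && w[1].as_str() == b)
        .count();
    if count_ab == 0 {
        return None;
    }
    // Unigram probabilities over all tokens; pair probability over all
    // adjacent positions (there are `total - 1` of them).
    let p_a = count_a as f64 / total as f64;
    let p_b = count_b as f64 / total as f64;
    let p_ab = count_ab as f64 / (total - 1) as f64;
    Some((p_ab / (p_a * p_b)).log2())
}
```

A positive PMI means the pair occurs together more often than independent word frequencies would predict. Pairs that never occur yield `None` rather than negative infinity, which keeps the result easy to export.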

§Typical Usage

use text_analysis::analyze_text;
use std::collections::HashSet;
let text = "This is a sample text for analysis.";
let stopwords: HashSet<String> = HashSet::new();
let result = analyze_text(text, &stopwords, 2, 5);
println!("{}", result.summary());

The crate also provides helpers for file I/O, PDF extraction, and export.

Structs§

AnalysisReport: Analysis report for a (single or combined) analysis run
AnalysisResult: Struct with all statistics for a text
PmiEntry: Struct for PMI collocations (for test compatibility and export)

Enums§

ExportFormat: Supported export formats

Functions§

analyze_path: Analyze all files as separate documents (default mode). Each file is processed independently and results are exported separately.
analyze_path_combined: Analyze all found files as a single corpus ("--combine" mode).
analyze_text: Main text analysis function.
collect_files: Recursively collect all .txt and .pdf files from a given path.
export_results: Export results to TXT/CSV/TSV/JSON with correct file naming.
load_stopwords: Load stopwords from a file if one is given; otherwise return an empty set.
print_failed_files: Print any files that failed to be processed.
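
The recursive collection that collect_files describes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `collect_supported_files` and `has_supported_extension` are hypothetical names, the extension match is assumed to be case-insensitive, and symlink cycles are not handled.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if the path ends in `.txt` or `.pdf`.
/// (Assumption: the match ignores ASCII case.)
fn has_supported_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("txt") || ext.eq_ignore_ascii_case("pdf"))
        .unwrap_or(false)
}

/// Walk `root` depth-first and return every supported file below it.
/// If `root` is itself a file, it is returned when its extension matches.
fn collect_supported_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    if root.is_file() {
        if has_supported_extension(root) {
            found.push(root.to_path_buf());
        }
        return Ok(found);
    }
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            // Recurse into subdirectories and merge their results.
            found.extend(collect_supported_files(&path)?);
        } else if has_supported_extension(&path) {
            found.push(path);
        }
    }
    // Sort for a deterministic order, since `read_dir` gives no guarantee.
    found.sort();
    Ok(found)
}
```

Returning `io::Result` lets the caller decide whether an unreadable directory aborts the run or is only reported, which is the kind of information print_failed_files would need.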