§Text Analysis Library
Core functions for analyzing .txt and .pdf documents.
§NER Heuristic (documentation)
Named-entity recognition uses a simple capitalization heuristic:
- Tokenize the original (non‑stemmed) text.
- A token is counted as a candidate entity if it
  - starts with an uppercase letter (Unicode-aware), and
  - is not fully uppercase (to avoid acronyms), and
  - is not a common function word at a sentence start (basic list check).
- Counts are case-sensitive (so “Berlin” ≠ “BERLIN”).
Note: This heuristic is fast and deterministic but will overgenerate in some cases (e.g., sentence-initial words). For higher quality, apply a custom post-filter or integrate a proper NER model.
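The rules above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name count_candidate_entities, the whitespace tokenizer, the sentence-boundary check, and the short function-word list are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: counts candidate entities with the capitalization
/// heuristic described above. Tokenization here is plain whitespace
/// splitting plus punctuation trimming, which is an assumption.
pub fn count_candidate_entities(text: &str) -> HashMap<String, usize> {
    // Illustrative function-word list; the library's actual list is not shown.
    let function_words: HashSet<&str> =
        ["The", "A", "An", "This", "That", "It", "In", "On"]
            .iter()
            .copied()
            .collect();

    let mut counts: HashMap<String, usize> = HashMap::new();
    let mut sentence_start = true;

    for raw in text.split_whitespace() {
        // Strip surrounding punctuation but keep the original casing.
        let token = raw.trim_matches(|c: char| !c.is_alphanumeric());
        if token.is_empty() {
            continue;
        }

        // Unicode-aware checks via the standard char methods.
        let starts_upper = token.chars().next().map_or(false, char::is_uppercase);
        let fully_upper = token
            .chars()
            .filter(|c| c.is_alphabetic())
            .all(char::is_uppercase);
        let skipped_function_word = sentence_start && function_words.contains(token);

        if starts_upper && !fully_upper && !skipped_function_word {
            // Case-sensitive key: "Berlin" and "BERLIN" are distinct.
            *counts.entry(token.to_owned()).or_insert(0) += 1;
        }

        // The next token starts a sentence if this one ended with ., ! or ?
        sentence_start = raw.ends_with(|c: char| matches!(c, '.' | '!' | '?'));
    }
    counts
}
```

Calling it on "Yesterday Anna visited Berlin. The NASA team met Anna in BERLIN." counts Anna twice and Berlin once, skips The, NASA and BERLIN, and still counts the sentence-initial Yesterday, which is the kind of overgeneration the note refers to.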
§clone() avoidance (key places)
- Use HashMap::entry to avoid double lookups and to allocate keys only on insertion.
- For context maps, allocate strings only on first insertion (entry(key.to_owned())).
- Serialization writes directly to files to avoid unnecessary intermediate allocations.
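As a rough illustration of the counting pattern, here is a sketch with two hypothetical helpers (count_with_entry and count_alloc_on_insert are names invented for this example, not part of the crate). One caveat worth keeping in mind: the standard entry method takes an owned key, so entry(key.to_owned()) builds the String before the lookup. If allocating strictly on first insertion matters, a get_mut check followed by insert is a common alternative, at the cost of a second lookup for new keys only.

```rust
use std::collections::HashMap;

/// Single hash lookup per token via the entry API.
/// The owned key is created on every call.
pub fn count_with_entry<'a, I>(tokens: I) -> HashMap<String, usize>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts = HashMap::new();
    for token in tokens {
        *counts.entry(token.to_owned()).or_insert(0) += 1;
    }
    counts
}

/// Allocates a String only when the key is seen for the first time.
/// Existing keys are updated in place through a borrowed lookup.
pub fn count_alloc_on_insert<'a, I>(tokens: I) -> HashMap<String, usize>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<String, usize> = HashMap::new();
    for token in tokens {
        if let Some(n) = counts.get_mut(token) {
            *n += 1;
        } else {
            counts.insert(token.to_owned(), 1);
        }
    }
    counts
}
```

Both functions return the same map; which one is preferable depends on whether repeated keys or new keys dominate the input.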
§No double scanning of files
analyze_path collects files once and then processes them either combined or per file.
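The collect-once idea might look roughly like the sketch below. The names collect_text_and_pdf and plan_batches are hypothetical, the case-insensitive extension match is an assumption, and the real signatures of analyze_path and collect_files may differ. The point is only that the directory tree is walked a single time and both processing modes reuse the same list.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch: walk `root` once and return all .txt/.pdf files.
/// `root` may be a single file or a directory.
pub fn collect_text_and_pdf(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(path) = pending.pop() {
        if path.is_dir() {
            for entry in fs::read_dir(&path)? {
                pending.push(entry?.path());
            }
        } else if has_supported_extension(&path) {
            found.push(path);
        }
    }
    found.sort(); // deterministic order for later processing
    Ok(found)
}

/// Assumption: extensions are matched case-insensitively.
fn has_supported_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| {
            ext.eq_ignore_ascii_case("txt") || ext.eq_ignore_ascii_case("pdf")
        })
}

/// Reuse the already collected list for either mode:
/// one batch with every file (combined) or one batch per file.
pub fn plan_batches(files: &[PathBuf], combined: bool) -> Vec<Vec<&Path>> {
    if combined {
        vec![files.iter().map(PathBuf::as_path).collect()]
    } else {
        files.iter().map(|f| vec![f.as_path()]).collect()
    }
}
```

A caller would run collect_text_and_pdf once and pass the resulting slice to plan_batches with the desired mode, so the file system is never scanned a second time.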
Structs§
- AnalysisOptions - Analysis options
- AnalysisReport - Compact report (large structures are written to files)
- AnalysisResult - Detailed analysis result
- PmiEntry
Enums§
- ExportFormat - Export format
- StemLang - Supported stemming languages (subset of rust-stemmers algorithms)
- StemMode - Stemming control
Functions§
- analyze_path - Entry point: analyze a path (file or directory). Files are collected once; then either combined or per-file processing happens.
- analyze_text_with - Analyze a text with options
- collect_files - Recursively collect .txt/.pdf files