Crate text_analysis

§Text Analysis Library

Core functions for analyzing .txt and .pdf documents.

§NER Heuristic (documentation)

Named-entity recognition (NER) uses a simple capitalization heuristic:

  1. Tokenize the original (non‑stemmed) text.
  2. A token is counted as a candidate entity if it
    • starts with an uppercase letter (Unicode-aware), and
    • is not fully uppercase (to avoid acronyms), and
    • is not a common function word at a sentence start (basic list check).
  3. Counts are case-sensitive (so “Berlin” ≠ “BERLIN”).

Note: This heuristic is fast and deterministic but will overgenerate in some cases (e.g., sentence-initial words). For higher quality, apply a custom post-filter or integrate a proper NER model.
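The steps above can be sketched roughly as follows. This is an illustration of the heuristic as described, not the crate's actual code: the function name `count_candidates`, the `FUNCTION_WORDS` list, the whitespace tokenizer, and the sentence-boundary check are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical function-word list; the crate's real list is not shown in these docs.
const FUNCTION_WORDS: &[&str] = &["The", "A", "An", "In", "On", "It", "This", "And", "But"];

/// Count candidate entities in `text` using the capitalization heuristic.
/// Keys borrow from `text`, so counting allocates no strings.
fn count_candidates(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    let mut sentence_start = true;
    for raw in text.split_whitespace() {
        // Assumed tokenizer: split on whitespace, strip surrounding punctuation.
        let token = raw.trim_matches(|c: char| !c.is_alphanumeric());
        if !token.is_empty() {
            // Unicode-aware check on the first character.
            let starts_upper = token.chars().next().map_or(false, char::is_uppercase);
            // Fully uppercase tokens are treated as acronyms and skipped.
            let all_upper = token
                .chars()
                .filter(|c| c.is_alphabetic())
                .all(char::is_uppercase);
            // Function words are only filtered at a sentence start.
            let function_word = sentence_start && FUNCTION_WORDS.contains(&token);
            if starts_upper && !all_upper && !function_word {
                // Case-sensitive: the token is used as the key unchanged.
                *counts.entry(token).or_insert(0) += 1;
            }
        }
        // Assumed sentence boundary: the raw token ends with '.', '!' or '?'.
        sentence_start = raw.ends_with(|c: char| matches!(c, '.' | '!' | '?'));
    }
    counts
}
```

With this sketch, a sentence-initial word that is not on the list (for example "Yesterday") is still counted, which is the overgeneration the note warns about.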

§clone() avoidance (key places)

  • Use HashMap::entry to avoid double lookups and to allocate keys only on insertion.
  • For context maps, allocate strings only on first insertion (entry(key.to_owned())).
  • Serialization writes directly to files to avoid unnecessary intermediate allocations.
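A minimal sketch of the first two points, using only the standard library. The function names `count_tokens` and `add_context` and the map shapes are hypothetical and chosen for illustration. One caveat worth knowing: std's `HashMap::entry` takes the key by value, so `entry(key.to_owned())` builds the `String` before the lookup. The second function shows one way to defer that allocation to the first insertion, at the cost of a second lookup on a miss.

```rust
use std::collections::HashMap;

/// Count tokens with borrowed keys: `entry` performs a single lookup
/// and no key is allocated, because `&str` keys borrow from the input.
fn count_tokens<'a>(tokens: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &tok in tokens {
        *counts.entry(tok).or_insert(0) += 1;
    }
    counts
}

/// Record a context word under an owned `String` key.
/// The key is allocated only when it is not yet present in the map.
fn add_context(map: &mut HashMap<String, Vec<String>>, key: &str, context: &str) {
    if let Some(list) = map.get_mut(key) {
        // Hit: look up by `&str`, no key allocation.
        list.push(context.to_owned());
    } else {
        // Miss: allocate the key once, on first insertion.
        map.insert(key.to_owned(), vec![context.to_owned()]);
    }
}
```

Which variant is preferable depends on the hit rate: when most keys already exist, the lookup-first form saves one allocation per call.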

§No double scanning of files

analyze_path collects the files once and then processes them either combined or per file.
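The collect-once pattern might look like the sketch below. It does not reproduce the signatures of `analyze_path` or `collect_files`, which are not shown here: `collect`, `word_counts`, and the `extract` callback (standing in for text/PDF extraction) are hypothetical, and a plain word count stands in for the real analysis.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively gather `.txt` and `.pdf` paths under `root` (a file or a directory).
fn collect(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if root.is_dir() {
        for entry in fs::read_dir(root)? {
            collect(&entry?.path(), out)?;
        }
    } else if matches!(root.extension().and_then(|e| e.to_str()), Some("txt" | "pdf")) {
        out.push(root.to_path_buf());
    }
    Ok(())
}

/// Walk the directory tree exactly once, then report word counts either
/// as one combined total or as one entry per file (in sorted path order).
fn word_counts<F>(root: &Path, combine: bool, extract: F) -> io::Result<Vec<usize>>
where
    F: Fn(&Path) -> io::Result<String>,
{
    let mut files = Vec::new();
    collect(root, &mut files)?; // the only scan of the file system
    files.sort();

    let mut per_file = Vec::with_capacity(files.len());
    for file in &files {
        per_file.push(extract(file)?.split_whitespace().count());
    }

    // Both modes reuse the same collected list; nothing is rescanned.
    Ok(if combine {
        vec![per_file.iter().sum::<usize>()]
    } else {
        per_file
    })
}
```

For plain-text input, the callback can be as simple as `|p: &Path| fs::read_to_string(p)`; PDF extraction would need an external crate and is left out of the sketch.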

Structs§

AnalysisOptions
Analysis options
AnalysisReport
Compact report (large structures are written to files)
AnalysisResult
Detailed analysis result
PmiEntry

Enums§

ExportFormat
Export format
StemLang
Supported stemming languages (subset of rust-stemmers algorithms)
StemMode
Stemming control

Functions§

analyze_path
Entry point: analyze a path (file or directory). Files are collected once, then processed either combined or per file.
analyze_text_with
Analyze a text with options
collect_files
Recursively collect .txt/.pdf files