Preprocessing for mention detection: utilities for text normalization and morphological analysis.
This module provides preprocessing support for:
- Morphologically complex languages (polysynthetic, agglutinative)
- Text normalization and cleaning
- Script-specific handling (Cherokee syllabary, etc.)
- Parenthetical analysis: aliases, abbreviations, temporal bounds
- Reference resolution: URLs, citations, cross-references
§Morphological Preprocessing
For polysynthetic languages like Cherokee, Navajo, and Mohawk, standard
word-level tokenization fails because a single word can encode an entire
sentence. The morphology module provides segmentation strategies and
the MorphologicalAnalyzer trait for integrating external analyzers.
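To make the tokenization problem concrete, here is a minimal, self-contained sketch that does not use this crate. `ToySegmenter` and `BoundarySplitter` are hypothetical names invented for illustration; the actual `MorphologicalAnalyzer` trait may have a different shape. The input is assumed to already carry explicit morpheme boundary characters, as in the Quechua example further down this page.

```rust
/// Hypothetical stand-in for a pluggable analyzer interface.
/// This is NOT the crate's `MorphologicalAnalyzer` trait.
trait ToySegmenter {
    /// Split one word into morpheme strings.
    fn segment(&self, word: &str) -> Vec<String>;
}

/// Splits a word wherever one of the configured boundary characters occurs.
struct BoundarySplitter {
    boundary_chars: Vec<char>,
}

impl ToySegmenter for BoundarySplitter {
    fn segment(&self, word: &str) -> Vec<String> {
        word.split(|c: char| self.boundary_chars.contains(&c))
            .filter(|m| !m.is_empty()) // skip empty pieces from doubled boundaries
            .map(str::to_string)
            .collect()
    }
}

fn main() {
    let text = "wasi-kuna-y-ki";

    // Whitespace tokenization sees a single token.
    let words: Vec<&str> = text.split_whitespace().collect();
    assert_eq!(words.len(), 1);

    // Boundary-based segmentation recovers the individual morphemes.
    let splitter = BoundarySplitter {
        boundary_chars: vec!['-', '='],
    };
    let morphemes = splitter.segment(text);
    assert_eq!(morphemes, vec!["wasi", "kuna", "y", "ki"]);
    println!("{:?}", morphemes);
}
```

Putting the splitting logic behind a trait lets an external analyzer replace the rule-based splitter without changing the calling code, which is presumably the role the real trait plays.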
§Parenthetical Analysis
The parenthetical module extracts valuable entity information from
parenthetical text:
use anno::preprocess::parenthetical::{ParentheticalExtractor, ParentheticalType};
let extractor = ParentheticalExtractor::new();
let results = extractor.extract("The World Health Organization (WHO) announced guidelines.");
assert_eq!(results[0].content, "WHO");
assert_eq!(results[0].parenthetical_type, ParentheticalType::Abbreviation);
§Reference Resolution
The reference module detects URLs, citations, and cross-references
that can be resolved to additional entity information:
use anno::preprocess::reference::{ReferenceExtractor, ReferenceType};
let extractor = ReferenceExtractor::new();
let refs = extractor.extract("See https://en.wikipedia.org/wiki/Einstein");
assert_eq!(refs[0].reference_type, ReferenceType::WikipediaUrl);
§Integration with Coalesce and Tier
These modules integrate with the entity resolution pipeline:
- Coalesce: Parenthetical aliases help link “WHO” ↔ “World Health Organization”
- Tier: Reference graphs create hierarchical entity relationships
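As a rough illustration of the Coalesce point, the sketch below shows one way an abbreviation in parentheses could be matched to the words before it and stored as an alias. It is standalone and does not call this crate. `abbreviation_alias` is a hypothetical helper, and the initial-letter heuristic is an assumption made for the example, not a description of how `ParentheticalExtractor` or `extract_aliases` work internally.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns (long form, abbreviation) when the letters of
/// the first parenthetical match the initials of the words right before it.
fn abbreviation_alias(text: &str) -> Option<(String, String)> {
    let open = text.find('(')?;
    let close = open + text[open..].find(')')?;
    let abbr = text[open + 1..close].trim();

    // Only consider all-uppercase ASCII candidates such as "WHO".
    if abbr.is_empty() || !abbr.chars().all(|c| c.is_ascii_uppercase()) {
        return None;
    }

    // Take as many preceding words as the abbreviation has letters.
    let before: Vec<&str> = text[..open].split_whitespace().collect();
    let n = abbr.chars().count();
    if before.len() < n {
        return None;
    }
    let candidate = &before[before.len() - n..];

    // Compare the initials of those words with the abbreviation.
    let initials: String = candidate.iter().filter_map(|w| w.chars().next()).collect();
    if initials == abbr {
        Some((candidate.join(" "), abbr.to_string()))
    } else {
        None
    }
}

fn main() {
    let text = "The World Health Organization (WHO) announced guidelines.";

    // Alias table mapping the short form to the long form.
    let mut aliases: HashMap<String, String> = HashMap::new();
    if let Some((long, short)) = abbreviation_alias(text) {
        aliases.insert(short, long);
    }

    assert_eq!(
        aliases.get("WHO").map(String::as_str),
        Some("World Health Organization")
    );
}
```

A downstream resolver could consult a table like this to treat a later mention of "WHO" as the same entity as "World Health Organization".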
§Example
use anno::preprocess::morphology::{MorphologicalPreprocessor, SegmentationStrategy};
// For Quechua with hyphenated morpheme boundaries
let preprocessor = MorphologicalPreprocessor::new()
    .with_strategy(SegmentationStrategy::RuleBased {
        boundary_chars: vec!['-', '='],
    });
let result = preprocessor.segment("wasi-kuna-y-ki").unwrap();
assert_eq!(result.morphemes.len(), 4); // wasi, kuna, y, ki
Re-exports§
pub use morphology::cherokee_syllable_inventory;
pub use morphology::quechua_boundary_chars;
pub use morphology::Morpheme;
pub use morphology::MorphemeType;
pub use morphology::MorphologicalAnalyzer;
pub use morphology::MorphologicalPreprocessor;
pub use morphology::ProdropConfig;
pub use morphology::SegmentationResult;
pub use morphology::SegmentationStrategy;
pub use parenthetical::extract_aliases;
pub use parenthetical::AliasPair;
pub use parenthetical::Parenthetical;
pub use parenthetical::ParentheticalExtractor;
pub use parenthetical::ParentheticalType;
pub use reference::ExtractedEntity;
pub use reference::Reference;
pub use reference::ReferenceExtractor;
pub use reference::ReferenceGraph;
pub use reference::ReferenceType;
pub use reference::ResolvedReference;
pub use apposition::extract_all_aliases;
pub use apposition::Apposition;
pub use apposition::AppositionExtractor;
pub use apposition::AppositionType;
Modules§
- apposition: Apposition and Alias Pattern Extraction.
- morphology: Morphological preprocessing for polysynthetic and agglutinative languages.
- parenthetical: Parenthetical text analysis and entity extraction.
- reference: Reference Resolution for Entity Extraction.