Module preprocess

Preprocessing utilities for mention detection: text normalization and morphological analysis.

This module provides preprocessing support for:

  • Morphologically complex languages (polysynthetic, agglutinative)
  • Text normalization and cleaning
  • Script-specific handling (Cherokee syllabary, etc.)
  • Parenthetical analysis: aliases, abbreviations, temporal bounds
  • Reference resolution: URLs, citations, cross-references

§Morphological Preprocessing

For polysynthetic languages like Cherokee, Navajo, and Mohawk, standard word-level tokenization fails because a single word can encode an entire sentence. The morphology module provides segmentation strategies and the MorphologicalAnalyzer trait for integrating external analyzers.

§Parenthetical Analysis

The parenthetical module extracts entity information (aliases, abbreviations, temporal bounds) from parenthetical text:

use anno::preprocess::parenthetical::{ParentheticalExtractor, ParentheticalType};

let extractor = ParentheticalExtractor::new();
let results = extractor.extract("The World Health Organization (WHO) announced guidelines.");

assert_eq!(results[0].content, "WHO");
assert_eq!(results[0].parenthetical_type, ParentheticalType::Abbreviation);
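To make the idea concrete, here is a minimal, standalone sketch of parenthetical detection. It does not use or reproduce the crate's implementation: the `Paren` type, `find_parentheticals`, and `looks_like_abbreviation` are hypothetical names invented for illustration, and the abbreviation test (2 to 10 characters, uppercase letters plus `.` or `&`) is an assumed heuristic.

```rust
/// A parenthetical span found in text (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct Paren {
    pub content: String,
    pub is_abbreviation: bool,
}

/// Assumed heuristic: short, contains an uppercase letter, and consists
/// only of uppercase letters, '.' or '&'.
fn looks_like_abbreviation(s: &str) -> bool {
    let n = s.chars().count();
    (2..=10).contains(&n)
        && s.chars().any(|c| c.is_uppercase())
        && s.chars().all(|c| c.is_uppercase() || c == '.' || c == '&')
}

/// Collect innermost `(...)` spans and flag those that look like abbreviations.
pub fn find_parentheticals(text: &str) -> Vec<Paren> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        match ch {
            // '(' is a single byte, so i + 1 is the first byte of the content.
            '(' => start = Some(i + 1),
            ')' => {
                if let Some(s) = start.take() {
                    let content = text[s..i].trim();
                    if !content.is_empty() {
                        out.push(Paren {
                            content: content.to_string(),
                            is_abbreviation: looks_like_abbreviation(content),
                        });
                    }
                }
            }
            _ => {}
        }
    }
    out
}
```

Calling `find_parentheticals` on the sentence from the example above yields a single span with content `WHO`, flagged as an abbreviation. A real extractor would also need to handle nesting, temporal bounds such as date ranges, and non-Latin scripts.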

§Reference Resolution

The reference module detects URLs, citations, and cross-references that can be resolved to additional entity information:

use anno::preprocess::reference::{ReferenceExtractor, ReferenceType};

let extractor = ReferenceExtractor::new();
let refs = extractor.extract("See https://en.wikipedia.org/wiki/Einstein");

assert_eq!(refs[0].reference_type, ReferenceType::WikipediaUrl);
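As a rough illustration of what URL detection and classification involves, the following standalone sketch scans whitespace-separated tokens for URLs and labels Wikipedia links by host name. It is not the crate's implementation: `RefKind` and `find_urls` are hypothetical names, and both the tokenization and the host check are simplifying assumptions.

```rust
/// Kind of reference detected (hypothetical enum for this sketch).
#[derive(Debug, PartialEq)]
pub enum RefKind {
    WikipediaUrl,
    OtherUrl,
}

/// Find http(s) URLs among whitespace-separated tokens and classify them.
pub fn find_urls(text: &str) -> Vec<(String, RefKind)> {
    text.split_whitespace()
        .filter(|tok| tok.starts_with("http://") || tok.starts_with("https://"))
        .map(|tok| {
            // Drop sentence punctuation that trails the URL.
            let url = tok.trim_end_matches(|c: char| matches!(c, '.' | ',' | ';'));
            // The host is everything between "://" and the next '/'.
            let host = url
                .split("://")
                .nth(1)
                .unwrap_or("")
                .split('/')
                .next()
                .unwrap_or("");
            let kind = if host == "wikipedia.org" || host.ends_with(".wikipedia.org") {
                RefKind::WikipediaUrl
            } else {
                RefKind::OtherUrl
            };
            (url.to_string(), kind)
        })
        .collect()
}
```

Matching on the host rather than on a substring of the whole URL avoids misclassifying links that merely mention "wikipedia" in their path or query string.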

§Integration with Coalesce and Tier

These modules integrate with the entity resolution pipeline:

  • Coalesce: Parenthetical aliases help link “WHO” ↔ “World Health Organization”
  • Tier: Reference graphs create hierarchical entity relationships
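The alias link between “WHO” and “World Health Organization” can be pictured with a small standalone sketch that recovers a long form by matching the abbreviation's letters against the initials of the words before the parenthesis. This is an assumed heuristic for illustration only, not how Coalesce or the parenthetical module work internally; `long_form_for` is a hypothetical helper.

```rust
/// Try to recover the long form of `abbr` from the text that precedes it,
/// by matching each letter against the initial of one of the last words.
/// Returns `None` when the initials do not line up.
pub fn long_form_for(preceding: &str, abbr: &str) -> Option<String> {
    let letters: Vec<char> = abbr.chars().filter(|c| c.is_alphabetic()).collect();
    let words: Vec<&str> = preceding.split_whitespace().collect();
    if letters.is_empty() || words.len() < letters.len() {
        return None;
    }
    // Candidate long form: the last N words, where N is the letter count.
    let candidate = &words[words.len() - letters.len()..];
    let initials_match = candidate.iter().zip(&letters).all(|(word, letter)| {
        word.chars()
            .next()
            .map_or(false, |c| c.to_uppercase().eq(letter.to_uppercase()))
    });
    if initials_match {
        Some(candidate.join(" "))
    } else {
        None
    }
}
```

For the preceding text `The World Health Organization` and the abbreviation `WHO`, the sketch returns `World Health Organization`. It deliberately ignores cases such as skipped function words ("of", "for") or abbreviations that take several letters from one word.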

§Example

use anno::preprocess::morphology::{MorphologicalPreprocessor, SegmentationStrategy};

// For Quechua with hyphenated morpheme boundaries
let preprocessor = MorphologicalPreprocessor::new()
    .with_strategy(SegmentationStrategy::RuleBased {
        boundary_chars: vec!['-', '='],
    });

let result = preprocessor.segment("wasi-kuna-y-ki").unwrap();
assert_eq!(result.morphemes.len(), 4); // wasi, kuna, y, ki
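For readers who want to see what boundary-character segmentation amounts to, here is a minimal standalone sketch. It is independent of the crate: `split_morphemes` is a hypothetical function, and it returns plain string slices rather than anything resembling the crate's result types.

```rust
/// Split a pre-annotated word into morphemes at any of the given boundary
/// characters, discarding empty pieces left by doubled or trailing boundaries.
pub fn split_morphemes<'a>(word: &'a str, boundary_chars: &[char]) -> Vec<&'a str> {
    word.split(|c: char| boundary_chars.contains(&c))
        .filter(|piece| !piece.is_empty())
        .collect()
}
```

With the boundary characters `'-'` and `'='`, the input `wasi-kuna-y-ki` splits into the four pieces `wasi`, `kuna`, `y`, and `ki`. This only works when the input already marks morpheme boundaries; unannotated text needs a real morphological analyzer.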

§Re-exports

pub use morphology::cherokee_syllable_inventory;
pub use morphology::navajo_prefix_inventory;
pub use morphology::quechua_boundary_chars;
pub use morphology::Morpheme;
pub use morphology::MorphemeType;
pub use morphology::MorphologicalAnalyzer;
pub use morphology::MorphologicalPreprocessor;
pub use morphology::ProdropConfig;
pub use morphology::SegmentationResult;
pub use morphology::SegmentationStrategy;
pub use parenthetical::extract_aliases;
pub use parenthetical::AliasPair;
pub use parenthetical::Parenthetical;
pub use parenthetical::ParentheticalExtractor;
pub use parenthetical::ParentheticalType;
pub use reference::ExtractedEntity;
pub use reference::Reference;
pub use reference::ReferenceExtractor;
pub use reference::ReferenceGraph;
pub use reference::ReferenceType;
pub use reference::ResolvedReference;
pub use apposition::extract_all_aliases;
pub use apposition::Apposition;
pub use apposition::AppositionExtractor;
pub use apposition::AppositionType;

§Modules

  • apposition: Apposition and alias pattern extraction.
  • morphology: Morphological preprocessing for polysynthetic and agglutinative languages.
  • parenthetical: Parenthetical text analysis and entity extraction.
  • reference: Reference resolution for entity extraction.