Reference Resolution for Entity Extraction.
§Overview
Documents often contain references to external content:
- URLs: Links to web pages with additional entity information
- Citations: Academic references, e.g. "(Smith et al., 2020)"
- Cross-references: Internal document references, e.g. "see Section 3"
- Footnotes/Endnotes: Additional contextual information
- Entity Links: Wikipedia, Wikidata, or other KB references
This module provides infrastructure for:
- Detecting references in text
- Resolving them to content
- Extracting entities from resolved content
- Linking back to the source document
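The detection step can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `SketchKind`, `SketchRef`, and `detect_references` are hypothetical names, and the heuristics (an `http(s)://` prefix for URLs, a parenthesized group ending in a four-digit year for citations) are assumptions chosen for brevity.

```rust
/// Hypothetical reference kinds for this sketch (not the crate's `ReferenceType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchKind {
    Url,
    WikipediaUrl,
    Citation,
}

/// A detected reference together with its byte span in the source text,
/// so results can be linked back to the source document.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchRef {
    pub kind: SketchKind,
    pub start: usize,
    pub end: usize,
    pub text: String,
}

/// Detect URLs and parenthetical citations such as "(Smith et al., 2020)".
pub fn detect_references(text: &str) -> Vec<SketchRef> {
    let mut found = Vec::new();

    // URLs: start at a scheme, run to the next whitespace, drop trailing punctuation.
    for scheme in ["https://", "http://"] {
        for (start, _) in text.match_indices(scheme) {
            let rest = &text[start..];
            let len = rest.find(char::is_whitespace).unwrap_or(rest.len());
            let url = rest[..len]
                .trim_end_matches(|c: char| matches!(c, '.' | ',' | ';' | ')' | ']'));
            let kind = if url.contains("wikipedia.org/wiki/") {
                SketchKind::WikipediaUrl
            } else {
                SketchKind::Url
            };
            found.push(SketchRef { kind, start, end: start + url.len(), text: url.to_string() });
        }
    }

    // Citations: "(Authors, YYYY)" where Authors starts with an uppercase letter.
    for (open, _) in text.match_indices('(') {
        if let Some(rel_close) = text[open..].find(')') {
            let close = open + rel_close;
            let inner = &text[open + 1..close];
            if let Some((authors, year)) = inner.rsplit_once(", ") {
                let year_ok = year.len() == 4 && year.chars().all(|c| c.is_ascii_digit());
                let authors_ok = authors.chars().next().map_or(false, |c| c.is_uppercase());
                if year_ok && authors_ok {
                    found.push(SketchRef {
                        kind: SketchKind::Citation,
                        start: open,
                        end: close + 1,
                        text: text[open..=close].to_string(),
                    });
                }
            }
        }
    }

    // Report references in document order.
    found.sort_by_key(|r| r.start);
    found
}
```

Keeping byte spans alongside each match is what allows entities extracted from resolved content to be tied back to the position of the reference in the source document. Note that stripping a trailing `)` will truncate URLs that legitimately end in a parenthesis; a production detector would need a more careful rule.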
§Integration with Coalesce
Resolved references provide additional evidence for entity coalescing:
- A URL pointing to a Wikipedia page confirms entity identity
- Citations can link entities mentioned in different contexts
- Resolved content may contain canonical names or aliases
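As a rough illustration of how a resolved link can serve as coalescing evidence, the sketch below groups mention surface forms by the knowledge-base URL they resolve to. `Mention` and `group_by_kb_link` are hypothetical names introduced here, and the URL normalization is deliberately naive; how the coalesce stage actually consumes this evidence is not specified on this page.

```rust
use std::collections::BTreeMap;

/// Hypothetical mention record: a surface form plus an optional resolved KB link.
pub struct Mention<'a> {
    pub surface: &'a str,
    pub kb_url: Option<&'a str>,
}

/// Group surface forms by the knowledge-base URL they resolve to.
/// Mentions without a link are skipped; they need other evidence to be merged.
pub fn group_by_kb_link<'a>(mentions: &[Mention<'a>]) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for m in mentions {
        if let Some(url) = m.kb_url {
            // Naive normalization (an assumption of this sketch): ignore a
            // trailing slash and treat http and https as the same resource.
            let trimmed = url.trim_end_matches('/');
            let key = match trimmed.strip_prefix("http://") {
                Some(rest) => format!("https://{}", rest),
                None => trimmed.to_string(),
            };
            groups.entry(key).or_default().push(m.surface);
        }
    }
    groups
}
```

Each resulting group is a set of candidate aliases for one entity: "Einstein" and "A. Einstein" end up together if both link to the same Wikipedia page, regardless of how different their surface forms are.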
§Integration with Tier
References create hierarchical relationships:
- Level 0: Entities in source document
- Level 1: Entities in directly referenced documents
- Level 2+: Entities in transitively referenced documents
This creates a “citation graph” that tier can cluster.
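The level assignment above amounts to a breadth-first traversal of the reference graph starting from the source document. The sketch below shows that idea under simple assumptions: documents are identified by string keys, and `reference_levels` is a hypothetical helper, not part of the crate's `ReferenceGraph` API.

```rust
use std::collections::{HashMap, VecDeque};

/// Assign each reachable document its reference depth from `source`:
/// 0 for the source itself, 1 for directly referenced documents, and so on.
pub fn reference_levels<'a>(
    edges: &HashMap<&'a str, Vec<&'a str>>,
    source: &'a str,
) -> HashMap<&'a str, usize> {
    let mut levels: HashMap<&'a str, usize> = HashMap::new();
    let mut queue = VecDeque::new();

    levels.insert(source, 0);
    queue.push_back(source);

    // Breadth-first search: the first time a document is reached is also
    // its shortest reference distance from the source.
    while let Some(doc) = queue.pop_front() {
        let next = levels[doc] + 1;
        if let Some(targets) = edges.get(doc) {
            for &target in targets {
                // Skip documents already seen, which also handles cycles.
                if !levels.contains_key(target) {
                    levels.insert(target, next);
                    queue.push_back(target);
                }
            }
        }
    }
    levels
}
```

Because visited documents are never re-queued, mutual citations (A cites B, B cites A) do not cause the traversal to loop, and each document keeps the smallest level at which it was reached.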
§Example
use anno::preprocess::reference::{ReferenceExtractor, ReferenceType};
let extractor = ReferenceExtractor::new();
let text = "See https://en.wikipedia.org/wiki/Albert_Einstein for more info.";
let refs = extractor.extract(text);
assert_eq!(refs.len(), 1);
assert_eq!(refs[0].reference_type, ReferenceType::WikipediaUrl);
Structs§
- ExtractedEntity: An entity extracted from resolved reference content.
- Reference: A detected reference in text.
- ReferenceExtractor: Extractor for references in text.
- ReferenceGraph: Reference graph for tracking relationships between documents.
- ResolvedReference: Resolved content from a reference.
Enums§
- ReferenceType: Type of reference detected in text.