§anno
Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.
This is the published facade crate for the anno workspace. It re-exports
the internal anno-lib library API. The full CLI lives in crates/anno-cli/.
- NER: variable-length spans with character offsets (Unicode scalar values).
- Coreference: mention clusters (“tracks”) within a single document.
- Patterns: dates, monetary amounts, emails, URLs, phone numbers.
Internal crates (anno-lib, anno-core, anno-metrics, anno-eval,
anno-cli, anno-graph) are workspace-private and not separately published.
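NER spans are reported as character offsets counted in Unicode scalar values, so they differ from Rust byte indices whenever the text contains non-ASCII characters. The following is a minimal standard-library sketch of that conversion. `char_span_to_byte_span` is a hypothetical helper written for illustration. It is not the crate's `chars_to_bytes` function, whose signature is not shown on this page.

```rust
/// Convert a half-open character range (counted in Unicode scalar values)
/// into the equivalent byte range of `text`.
/// Hypothetical helper for illustration; not part of anno's API.
fn char_span_to_byte_span(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte index of the n-th char, or text.len() when n equals the char count.
    // Returns None when n is past the end of the text.
    let byte_at = |n: usize| {
        text.char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(text.len()))
            .nth(n)
    };
    Some((byte_at(start)?, byte_at(end)?))
}
```

The returned byte range can be used to slice the original `&str` directly, which a raw character range cannot do safely.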
Modules§
- backends
- NER backend implementations.
- core
- anno-core’s stable types under a namespaced module.
- edit_distance
- Edit distance algorithms and utilities for fuzzy string matching.
- env
- Environment variable utilities.
- error
- Error types for anno.
- export
- Export entity extraction results to annotation and interchange formats (brat, CoNLL, JSONL, RDF, JSON-LD, CSV).
- heuristics
- Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
- ingest
- Document ingestion and text preparation: lightweight URL/file ingestion helpers (not a crawling/pipeline product).
- lang
- Language detection and classification utilities.
- models
- Default model identifiers for backend construction.
- offset
- Unified byte/character/token offset handling.
- pii
- PII (personally identifiable information) detection and redaction (library-level privacy functions).
- prelude
- Common imports for working with anno.
- rag
- Coreference preprocessing for RAG: rewrite pronouns for self-contained chunks.
- schema
- Schema harmonization for multi-dataset NER training.
- similarity
- Text similarity utilities for entity matching and coreference resolution.
- types
- Type-level programming patterns for compile-time safety.
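To illustrate the kind of fuzzy string matching the edit_distance module is described as supporting, here is a minimal sketch that assumes Levenshtein distance as the representative algorithm. This page does not state which algorithms the module actually implements, and `levenshtein` below is a hypothetical standalone function, not the crate's API.

```rust
/// Levenshtein distance over Unicode scalar values, using two rolling rows.
/// Hypothetical illustration; not taken from anno's edit_distance module.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    let mut curr: Vec<usize> = vec![0; b_chars.len() + 1];
    for (i, ca) in a.chars().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b_chars.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)      // deletion from `a`
                .min(curr[j] + 1);         // insertion into `a`
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b_chars.len()]
}
```

A small distance between two surface forms, such as "Obama" and "Obamma", is one signal that they may name the same entity.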
Structs§
- AnnotatedDoc
- Text paired with its extraction outputs (entities, relations, coreference chains).
- AnyModel
- A wrapper that turns an extractor closure into a Model.
- BertNEROnnx
- BERT-based NER using ONNX Runtime.
- CanonicalId
- Unique identifier for a coreference cluster.
- Confidence
- A confidence score guaranteed to be in [0.0, 1.0].
- CorefChain
- A coreference chain: mentions that all refer to the same entity.
- CorefDocument
- A document with coreference annotations.
- CoreferenceConfig
- Configuration for coreference resolution.
- Corpus
- A corpus of grounded documents for cross-document operations.
- CrfNER
- CRF-based NER model.
- DiscontinuousEntity
- An entity that may span multiple non-contiguous regions.
- DiscontinuousSpan
- A discontinuous span representing non-contiguous entity mentions.
- EnsembleNER
- Weighted ensemble of NER backends.
- Entity
- A recognized named entity or relation trigger.
- EntityBuilder
- Fluent builder for constructing entities with optional fields.
- ExtractionWithRelations
- Output from joint entity-relation extraction.
- FCoref
- F-coref neural coreference resolver.
- FCorefConfig
- Configuration for f-coref model loading.
- GLiNEROnnx
- GroundedDocument
- A document with grounded entity annotations using the three-level hierarchy.
- HashMapLexicon
- Simple HashMap-based lexicon implementation.
- HeuristicNER
- Heuristic NER model.
- HierarchicalConfidence
- Hierarchical confidence scores for coarse-to-fine extraction.
- Identity
- A global identity: a real-world entity linked to a knowledge base.
- IdentityId
- Unique identifier for an identity within a corpus.
- InformationLoss
- Documents information lost during schema mapping.
- LexiconNER
- NER backend that uses exact-match lexicon lookup.
- Mention
- A single mention (text span) that may corefer with other mentions.
- MentionCluster
- Coreference cluster from mention ranking.
- MentionRankingConfig
- Configuration for mention-ranking coref.
- MentionRankingCoref
- Mention-ranking coreference resolver.
- ModelCapabilities
- Runtime discovery mechanism for model capabilities behind Box<dyn Model>.
- NuNER
- NuNER Zero-shot NER model.
- OffsetMapping
- Offset mapping from tokenizer.
- Provenance
- Provenance information for an extracted entity.
- RaggedBatch
- A ragged (unpadded) batch for efficient ModernBERT inference.
- RankedMention
- A detected mention with phi-features for coreference resolution.
- RegexNER
- Regex-based NER - extracts entities with recognizable formats using regex patterns.
- Relation
- A relation between two entities, forming a knowledge graph triple.
- RelationExtractionConfig
- Configuration for relation extraction.
- RelationTriple
- A relation triple linking two entities.
- SchemaMapper
- Maps dataset-specific labels to canonical types.
- Signal
- A raw detection signal: the atomic unit of entity extraction.
- SignalId
- Unique identifier for a signal within a document.
- SignalRef
- A reference to a signal within a track.
- SpanCandidate
- A candidate span for entity extraction.
- SpanConverter
- Converter for efficiently handling many spans from the same text.
- StackedNER
- Composable NER that combines multiple backends.
- TPLinker
- TPLinker backend for joint entity-relation extraction.
- TextSpan
- A text span with both byte and character offsets.
- TokenSpan
- Span in subword token space.
- Track
- A track: a cluster of signals referring to the same entity within a document.
- TrackId
- Unique identifier for a track within a document.
- TrackRef
- A reference to a track in a specific document.
- TrackStats
- Aggregate statistics for a track (coreference chain).
- TypeMapper
- Maps domain-specific entity types to standard NER types.
- W2NER
- W2NER model for unified named entity recognition.
- W2NERConfig
- Configuration for W2NER decoding.
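The Confidence struct above is described as a score guaranteed to be in [0.0, 1.0]. One common way to enforce such a guarantee is a newtype with a validating constructor, sketched below. `Score` and its methods are hypothetical names chosen for illustration. The actual constructor and accessors of Confidence are not shown on this page and may differ (for example, by clamping instead of rejecting).

```rust
/// Hypothetical newtype illustrating a range-checked score.
/// It mirrors the idea behind anno's Confidence, not its API.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Score(f64);

impl Score {
    /// Returns None for NaN and for values outside [0.0, 1.0].
    fn new(value: f64) -> Option<Self> {
        if (0.0..=1.0).contains(&value) {
            Some(Score(value))
        } else {
            None
        }
    }

    /// Read the validated value back out.
    fn get(self) -> f64 {
        self.0
    }
}
```

Because the inner field is private to the defining module, code elsewhere can only obtain a `Score` through `new`, so the range check cannot be bypassed.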
Enums§
- CanonicalType
- Canonical entity type in the unified schema.
- ClusteringStrategy
- Clustering strategy for mention linking.
- CoarseType
- Coarse-grained schema for multi-dataset training.
- ConflictStrategy
- Strategy for resolving overlapping entity spans.
- DatasetSchema
- Known dataset schemas for automatic mapping.
- EntityCategory
- Category of entity based on detection characteristics and semantics.
- EntityType
- Entity type classification.
- Error
- Error type for anno operations.
- ExtractionMethod
- Extraction method used to identify an entity.
- Gender
- Gender classification for NLP tasks.
- IdentitySource
- Source of identity formation.
- Language
- Supported languages for text analysis.
- Location
- A location in text.
- MentionType
- Type of referring expression in coreference.
- Modality
- The semiotic modality of a signal source.
- Number
- Grammatical number (singular, dual, plural).
- Person
- Grammatical person (1st, 2nd, 3rd).
- Quantifier
- Quantification type for symbolic signals.
- Span
- A span locator for text and visual modalities.
- TypeLabel
- A unified type label supporting both core and custom entity types.
- ValidationIssue
- Validation issue found during entity validation.
- W2NERRelation
- W2NER word-word relation types.
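ConflictStrategy above covers how overlapping entity spans are resolved. This page does not list its variants, so the sketch below assumes one plausible strategy: greedily keep the highest-scoring span and drop anything that overlaps it. `Candidate` and `resolve_overlaps` are hypothetical names, not anno types.

```rust
/// Hypothetical span candidate with half-open offsets [start, end).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    start: usize,
    end: usize,
    score: f64,
}

/// Greedy resolution: visit candidates from highest to lowest score and keep
/// each one that does not overlap an already kept span.
/// The result is returned in document order.
fn resolve_overlaps(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<Candidate> = Vec::new();
    for c in candidates {
        // Two half-open ranges overlap when each starts before the other ends.
        let overlaps = kept.iter().any(|k| c.start < k.end && k.start < c.end);
        if !overlaps {
            kept.push(c);
        }
    }
    kept.sort_by_key(|k| k.start);
    kept
}
```

Other reasonable strategies, such as preferring the longest span or a fixed backend priority, would change only the sort key in the first line of the function.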
Traits§
- CorefBackend
- Unified interface for within-document coreference resolution.
- CoreferenceResolver
- Trait for coreference resolution algorithms.
- DiscontinuousNER
- Support for discontinuous entity spans.
- EntitySliceExt
- Extension methods for slices of entities.
- Lexicon
- Exact-match lexicon/gazetteer for entity lookup.
- Model
- Trait for NER model backends.
- RelationCapable
- Trait for models that can extract relations between entities.
- RelationExtractor
- Joint entity and relation extraction.
- ZeroShotNER
- Zero-shot NER for open entity types.
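The Lexicon trait and the HashMapLexicon struct are described as exact-match gazetteer lookup. The sketch below shows that idea in isolation with hypothetical names (`Gazetteer`, `MapGazetteer`, `tag_tokens`). The real trait's method names and signatures are not shown on this page, and whitespace tokenization is an assumption made to keep the example short.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an exact-match lexicon interface.
trait Gazetteer {
    /// Return the entity label for `surface` if it is a known entry.
    fn lookup(&self, surface: &str) -> Option<&str>;
}

/// HashMap-backed implementation: surface form -> label.
struct MapGazetteer {
    entries: HashMap<String, String>,
}

impl MapGazetteer {
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let entries = pairs
            .iter()
            .map(|(surface, label)| (surface.to_string(), label.to_string()))
            .collect();
        MapGazetteer { entries }
    }
}

impl Gazetteer for MapGazetteer {
    fn lookup(&self, surface: &str) -> Option<&str> {
        self.entries.get(surface).map(String::as_str)
    }
}

/// Tag every whitespace-separated token that appears in the gazetteer.
fn tag_tokens<'a>(text: &'a str, lex: &'a dyn Gazetteer) -> Vec<(&'a str, &'a str)> {
    text.split_whitespace()
        .filter_map(|tok| lex.lookup(tok).map(|label| (tok, label)))
        .collect()
}
```

Exact matching is case-sensitive and misses multi-word names in this form, which is why lexicon lookup is typically combined with other backends.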
Functions§
- annotate
- Extract entities from text using the default backend and return an AnnotatedDoc.
- auto
- Automatically select the best available NER backend.
- available_backends
- Check which backends are currently available.
- bytes_to_chars
- Convert byte offsets to character offsets.
- chars_to_bytes
- Convert character offsets to byte offsets.
- detect_language
- Simple heuristic language detection based on Unicode scripts.
- extract
- Extract entities from text using the best available backend.
- extract_batch
- Extract entities from multiple texts using the best available backend.
- extract_relation_triples
- Extract relations as index-based triples (for joint extraction backends).
- extract_relation_triples_simple
- Extract relation triples using heuristics only – no SemanticRegistry needed.
- extract_relations
- Extract relations between entities.
- generate_span_candidates
- Generate all valid span candidates for a ragged batch.
- is_ascii
- Fast check if text is ASCII-only.
- jaccard_word_similarity
- Compute Jaccard similarity on word sets.
- jaccard_word_similarity_f32
- Compute Jaccard similarity on word sets (f32 version).
- map_to_canonical
- Unified label mapping - THE SINGLE SOURCE OF TRUTH.
- string_similarity
- Compute string similarity using multiple strategies.
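Jaccard similarity on word sets, as named by jaccard_word_similarity above, is the size of the intersection of two word sets divided by the size of their union. The sketch below is a standalone illustration, not the crate's implementation. Lowercasing, whitespace tokenization, and returning 1.0 for two empty inputs are assumptions that the real function may not share.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over lowercased whitespace tokens.
/// Hypothetical illustration; not anno's jaccard_word_similarity.
fn jaccard_words(a: &str, b: &str) -> f64 {
    let set_a: HashSet<String> = a.split_whitespace().map(str::to_lowercase).collect();
    let set_b: HashSet<String> = b.split_whitespace().map(str::to_lowercase).collect();
    // Assumption: two empty inputs are treated as identical.
    if set_a.is_empty() && set_b.is_empty() {
        return 1.0;
    }
    let shared = set_a.intersection(&set_b).count() as f64;
    let total = set_a.union(&set_b).count() as f64;
    shared / total
}
```

For "Barack Obama" and "President Barack Obama" the sets share two of three distinct words, which gives a similarity of 2/3.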
Type Aliases§
- Result
- Result type for anno operations.