
Crate anno


§anno

Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.

This is the published facade crate for the anno workspace. It re-exports the internal anno-lib library API. The full CLI lives in crates/anno-cli/.

  • NER: variable-length spans with character offsets (Unicode scalar values).
  • Coreference: mention clusters (“tracks”) within a single document.
  • Patterns: dates, monetary amounts, emails, URLs, phone numbers.
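Because spans are reported in characters (Unicode scalar values) rather than bytes, slicing a Rust `&str` with them requires a conversion first. The sketch below shows one way to do that with the standard library only. The function name `char_range_to_byte_range` is hypothetical and is not part of anno; the crate lists its own `chars_to_bytes` and `bytes_to_chars` functions, whose signatures are not shown on this page.

```rust
/// Convert a half-open character range (Unicode scalar values) into a byte
/// range suitable for slicing a `&str`. Hypothetical helper, not anno's API.
/// Returns `None` if the range is inverted or extends past the end of the text.
fn char_range_to_byte_range(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte index of every character boundary, plus the end of the string.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .collect();
    let byte_start = *boundaries.get(start)?;
    let byte_end = *boundaries.get(end)?;
    Some((byte_start, byte_end))
}

fn main() {
    // Escapes keep the example independent of source-file normalization.
    let text = "Zo\u{eb} met S\u{f8}ren.";
    // The second name occupies characters 8..13 but bytes 9..15,
    // because the earlier two-byte character shifts every later byte index.
    if let Some((start, end)) = char_range_to_byte_range(text, 8, 13) {
        println!("chars 8..13 -> bytes {}..{}: {}", start, end, &text[start..end]);
    }
}
```

For many spans over the same text, building the boundary table once and reusing it avoids rescanning the string for every span.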

Internal crates (anno-lib, anno-core, anno-metrics, anno-eval, anno-cli, anno-graph) are workspace-private and not separately published.

Modules§

backends
NER backend implementations.
core
anno-core’s stable types under a namespaced module.
edit_distance
Edit distance algorithms and utilities for fuzzy string matching.
env
Environment variable utilities.
error
Error types for anno.
export
Export entity extraction results to annotation and interchange formats (brat, CoNLL, JSONL, RDF, JSON-LD, CSV).
heuristics
Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
ingest
Lightweight URL/file ingestion and text preparation helpers (not a crawling/pipeline product).
lang
Language detection and classification utilities.
models
Default model identifiers for backend construction.
offset
Unified byte/character/token offset handling.
pii
PII (personally identifiable information) detection and redaction (library-level privacy functions).
prelude
Common imports for working with anno.
rag
Coreference preprocessing for RAG: rewrite pronouns for self-contained chunks.
schema
Schema harmonization for multi-dataset NER training.
similarity
Text similarity utilities for entity matching and coreference resolution.
types
Type-level programming patterns for compile-time safety.
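As an illustration of the kind of computation the edit_distance module refers to, here is a minimal Levenshtein distance over Unicode scalar values. This is a generic textbook sketch, not the crate's implementation; the function name `levenshtein` and its signature are assumptions made for this example.

```rust
/// Classic Levenshtein distance (insertions, deletions, substitutions),
/// computed over Unicode scalar values. Illustrative sketch only.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        // Distance from the first i + 1 characters of `a` to the empty string.
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

fn main() {
    println!("{}", levenshtein("kitten", "sitting"));
}
```

Keeping only two rows of the dynamic-programming table bounds memory by the length of the second string, which matters when matching many candidate mentions.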

Structs§

AnnotatedDoc
Text paired with its extraction outputs (entities, relations, coreference chains).
AnyModel
A wrapper that turns an extractor closure into a Model.
BertNEROnnx
BERT-based NER using ONNX Runtime.
CanonicalId
Unique identifier for a coreference cluster.
Confidence
A confidence score guaranteed to be in [0.0, 1.0].
CorefChain
A coreference chain: mentions that all refer to the same entity.
CorefDocument
A document with coreference annotations.
CoreferenceConfig
Configuration for coreference resolution.
Corpus
A corpus of grounded documents for cross-document operations.
CrfNER
CRF-based NER model.
DiscontinuousEntity
An entity that may span multiple non-contiguous regions.
DiscontinuousSpan
A discontinuous span representing non-contiguous entity mentions.
EnsembleNER
Weighted ensemble of NER backends.
Entity
A recognized named entity or relation trigger.
EntityBuilder
Fluent builder for constructing entities with optional fields.
ExtractionWithRelations
Output from joint entity-relation extraction.
FCoref
F-coref neural coreference resolver.
FCorefConfig
Configuration for f-coref model loading.
GLiNEROnnx
GroundedDocument
A document with grounded entity annotations using the three-level hierarchy.
HashMapLexicon
Simple HashMap-based lexicon implementation.
HeuristicNER
Heuristic NER model.
HierarchicalConfidence
Hierarchical confidence scores for coarse-to-fine extraction.
Identity
A global identity: a real-world entity linked to a knowledge base.
IdentityId
Unique identifier for an identity within a corpus.
InformationLoss
Documents information lost during schema mapping.
LexiconNER
NER backend that uses exact-match lexicon lookup.
Mention
A single mention (text span) that may corefer with other mentions.
MentionCluster
Coreference cluster from mention ranking.
MentionRankingConfig
Configuration for mention-ranking coref.
MentionRankingCoref
Mention-ranking coreference resolver.
ModelCapabilities
Runtime discovery mechanism for model capabilities behind Box<dyn Model>.
NuNER
NuNER zero-shot NER model.
OffsetMapping
Offset mapping from tokenizer.
Provenance
Provenance information for an extracted entity.
RaggedBatch
A ragged (unpadded) batch for efficient ModernBERT inference.
RankedMention
A detected mention with phi-features for coreference resolution.
RegexNER
Regex-based NER: extracts entities with recognizable formats using regex patterns.
Relation
A relation between two entities, forming a knowledge graph triple.
RelationExtractionConfig
Configuration for relation extraction.
RelationTriple
A relation triple linking two entities.
SchemaMapper
Maps dataset-specific labels to canonical types.
Signal
A raw detection signal: the atomic unit of entity extraction.
SignalId
Unique identifier for a signal within a document.
SignalRef
A reference to a signal within a track.
SpanCandidate
A candidate span for entity extraction.
SpanConverter
Converter for efficiently handling many spans from the same text.
StackedNER
Composable NER that combines multiple backends.
TPLinker
TPLinker backend for joint entity-relation extraction.
TextSpan
A text span with both byte and character offsets.
TokenSpan
Span in subword token space.
Track
A track: a cluster of signals referring to the same entity within a document.
TrackId
Unique identifier for a track within a document.
TrackRef
A reference to a track in a specific document.
TrackStats
Aggregate statistics for a track (coreference chain).
TypeMapper
Maps domain-specific entity types to standard NER types.
W2NER
W2NER model for unified named entity recognition.
W2NERConfig
Configuration for W2NER decoding.
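Several of the structs above encode invariants in the type, for example Confidence, which is described as guaranteed to be in [0.0, 1.0]. The following sketch shows how such a guarantee can be enforced with a validated newtype. The type `UnitScore` and its methods are hypothetical and chosen for illustration; they do not describe how Confidence is actually constructed.

```rust
/// A score that can only hold values in [0.0, 1.0].
/// Hypothetical type for illustration; not anno's `Confidence`.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct UnitScore(f64);

impl UnitScore {
    /// Strict constructor: rejects out-of-range values and NaN.
    fn new(value: f64) -> Option<Self> {
        if (0.0..=1.0).contains(&value) {
            Some(Self(value))
        } else {
            None
        }
    }

    /// Lenient constructor: clamps into range and maps NaN to 0.0.
    fn saturating(value: f64) -> Self {
        if value.is_nan() {
            Self(0.0)
        } else {
            Self(value.clamp(0.0, 1.0))
        }
    }

    /// Read the inner value back out.
    fn get(self) -> f64 {
        self.0
    }
}

fn main() {
    println!("{:?}", UnitScore::new(0.85));
    println!("{:?}", UnitScore::new(1.2));
    println!("{:?}", UnitScore::saturating(1.2));
}
```

Because the inner field is private to the type's module, every value that reaches downstream code has passed through one of the two constructors, so consumers never need to re-check the range.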

Enums§

CanonicalType
Canonical entity type in the unified schema.
ClusteringStrategy
Clustering strategy for mention linking.
CoarseType
Coarse-grained schema for multi-dataset training.
ConflictStrategy
Strategy for resolving overlapping entity spans.
DatasetSchema
Known dataset schemas for automatic mapping.
EntityCategory
Category of entity based on detection characteristics and semantics.
EntityType
Entity type classification.
Error
Error type for anno operations.
ExtractionMethod
Extraction method used to identify an entity.
Gender
Gender classification for NLP tasks.
IdentitySource
Source of identity formation.
Language
Supported languages for text analysis.
Location
A location in text.
MentionType
Type of referring expression in coreference.
Modality
The semiotic modality of a signal source.
Number
Grammatical number (singular, dual, plural).
Person
Grammatical person (1st, 2nd, 3rd).
Quantifier
Quantification type for symbolic signals.
Span
A span locator for text and visual modalities.
TypeLabel
A unified type label supporting both core and custom entity types.
ValidationIssue
Validation issue found during entity validation.
W2NERRelation
W2NER word-word relation types.

Traits§

CorefBackend
Unified interface for within-document coreference resolution.
CoreferenceResolver
Trait for coreference resolution algorithms.
DiscontinuousNER
Support for discontinuous entity spans.
EntitySliceExt
Extension methods for slices of entities.
Lexicon
Exact-match lexicon/gazetteer for entity lookup.
Model
Trait for NER model backends.
RelationCapable
Trait for models that can extract relations between entities.
RelationExtractor
Joint entity and relation extraction.
ZeroShotNER
Zero-shot NER for open entity types.

Functions§

annotate
Extract entities from text using the default backend and return an AnnotatedDoc.
auto
Automatically select the best available NER backend.
available_backends
Check which backends are currently available.
bytes_to_chars
Convert byte offsets to character offsets.
chars_to_bytes
Convert character offsets to byte offsets.
detect_language
Simple heuristic language detection based on Unicode scripts.
extract
Extract entities from text using the best available backend.
extract_batch
Extract entities from multiple texts using the best available backend.
extract_relation_triples
Extract relations as index-based triples (for joint extraction backends).
extract_relation_triples_simple
Extract relation triples using heuristics only – no SemanticRegistry needed.
extract_relations
Extract relations between entities.
generate_span_candidates
Generate all valid span candidates for a ragged batch.
is_ascii
Fast check if text is ASCII-only.
jaccard_word_similarity
Compute Jaccard similarity on word sets.
jaccard_word_similarity_f32
Compute Jaccard similarity on word sets (f32 version).
map_to_canonical
Unified label mapping: the single source of truth for mapping labels to canonical types.
string_similarity
Compute string similarity using multiple strategies.
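To make the similarity functions above concrete, here is a minimal sketch of Jaccard similarity on word sets: the size of the intersection divided by the size of the union. The function name `jaccard_words`, the whitespace tokenization, the lowercasing, and the convention that two empty inputs score 1.0 are all assumptions made for this example; the behavior of `jaccard_word_similarity` itself is not documented on this page.

```rust
use std::collections::HashSet;

/// Jaccard similarity over lowercased, whitespace-separated words.
/// Illustrative sketch; tokenization and empty-input handling are assumptions.
fn jaccard_words(a: &str, b: &str) -> f64 {
    let set_a: HashSet<String> = a.split_whitespace().map(str::to_lowercase).collect();
    let set_b: HashSet<String> = b.split_whitespace().map(str::to_lowercase).collect();
    // Assumed convention: two empty word sets are treated as identical.
    if set_a.is_empty() && set_b.is_empty() {
        return 1.0;
    }
    let intersection = set_a.intersection(&set_b).count();
    let union = set_a.len() + set_b.len() - intersection;
    intersection as f64 / union as f64
}

fn main() {
    // Two shared words out of three distinct words overall.
    let score = jaccard_words("Barack Obama", "President Barack Obama");
    println!("{score:.3}");
}
```

A measure like this is order-insensitive, which suits matching mentions such as a full name against a title plus name, but it ignores spelling variation inside individual words.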

Type Aliases§

Result
Result type for anno operations.