§anno
Information extraction: named entity recognition (NER) and within-document coreference.
- NER output: variable-length spans with character offsets (Unicode scalar values), not byte offsets.
- Coreference output: clusters (“tracks”) of mentions within one document.
This crate focuses on inference-time extraction. Dataset loaders, benchmarking, and matrix
evaluation tooling live in anno-eval (and the anno CLI lives in anno-cli).
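Because span offsets count Unicode scalar values rather than bytes, they cannot be used directly to slice a Rust `&str`. The following is a minimal std-only sketch of the conversion; the helper names `char_to_byte_offset` and `slice_by_chars` are hypothetical and are not the crate's API (the crate re-exports its own `chars_to_bytes` and `bytes_to_chars`, whose signatures are not shown here).

```rust
/// Map a character offset (Unicode scalar values) to a byte offset.
/// Returns None when the offset lies past the end of the text.
/// Hypothetical helper for illustration; not part of anno.
fn char_to_byte_offset(text: &str, char_offset: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        // Allow an offset equal to the character count (exclusive end).
        .chain(std::iter::once(text.len()))
        .nth(char_offset)
}

/// Slice `text` using a half-open character range [start, end).
fn slice_by_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    let b0 = char_to_byte_offset(text, start)?;
    let b1 = char_to_byte_offset(text, end)?;
    text.get(b0..b1)
}

fn main() {
    // "Zoë met José." written with escapes so the encoding is unambiguous.
    let text = "Zo\u{eb} met Jos\u{e9}.";
    // Character range 8..12 covers "José"; byte range 8..12 does not.
    println!("by chars: {:?}", slice_by_chars(text, 8, 12));
    println!("by bytes: {:?}", text.get(8..12));
}
```

For ASCII-only input the two offset systems coincide, which is why mismatches tend to surface only on text with accented or non-Latin characters.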
§Quickstart
use anno::{Model, StackedNER};
let m = StackedNER::default();
let ents = m.extract_entities("Lynn Conway worked at IBM and Xerox PARC.", None)?;
assert!(!ents.is_empty());
§Zero-shot custom entity types
Zero-shot custom entity types are provided by GLiNER backends when the onnx feature is
enabled. See the repo docs for the CLI flag (--extract-types) and the library API.
§Offline / downloads
By default, ML weights may download on first use. To force cached-only behavior, set
ANNO_NO_DOWNLOADS=1 (after prefetching models).
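An application can also check the variable itself, for example to print a clear message before loading models. This is a sketch only: `downloads_disabled` is a hypothetical helper, and treating just the literal value `1` as "disabled" is an assumption based on the documented setting, not a description of how the crate parses the variable.

```rust
use std::env;

/// Interpret a raw variable value as "downloads disabled".
/// Assumption: only "1" (ignoring surrounding whitespace) counts.
fn downloads_disabled(raw: Option<&str>) -> bool {
    matches!(raw.map(str::trim), Some("1"))
}

fn main() {
    // Read the variable once at startup; an unset variable yields None.
    let raw = env::var("ANNO_NO_DOWNLOADS").ok();
    if downloads_disabled(raw.as_deref()) {
        println!("cached-only mode: models must already be prefetched");
    } else {
        println!("downloads allowed: weights may be fetched on first use");
    }
}
```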
Re-exports§
pub use error::Error;
pub use error::Result;
pub use lang::detect_language;
pub use lang::Language;
pub use offset::bytes_to_chars;
pub use offset::chars_to_bytes;
pub use offset::is_ascii;
pub use offset::OffsetMapping;
pub use offset::SpanConverter;
pub use offset::TextSpan;
pub use offset::TokenSpan;
pub use backends::label_prompt::LabelNormalizer;
pub use backends::label_prompt::StandardNormalizer;
pub use backends::AutoNER;
pub use backends::BackendType;
pub use backends::ConflictStrategy;
pub use backends::CrfNER;
pub use backends::EnsembleNER;
pub use backends::HeuristicNER;
pub use backends::LexiconNER;
pub use backends::NERExtractor;
pub use backends::NuNER;
pub use backends::RegexNER;
pub use backends::StackedNER;
pub use backends::TPLinker;
pub use backends::W2NERConfig;
pub use backends::W2NERRelation;
pub use backends::W2NER;
pub use backends::mention_ranking::ClusteringStrategy;
pub use backends::mention_ranking::MentionCluster;
pub use backends::mention_ranking::MentionRankingConfig;
pub use backends::mention_ranking::MentionRankingCoref;
pub use backends::mention_ranking::RankedMention;
pub use backends::BertNEROnnx;
pub use backends::GLiNEROnnx;
pub use schema::*;
pub use similarity::*;
pub use sync::*;
pub use types::*;
pub use backends::inference::*;
Modules§
- backends
- NER backend implementations.
- core
- anno-core’s stable types under a namespaced module.
- edit_distance
- Edit distance algorithms. Edit distance utilities for fuzzy string matching.
- env
- Environment variable utilities.
- error
- Error types for anno.
- features
- Entity feature extraction for downstream ML and analysis.
- heuristics
- Small, dependency-light heuristics (negation, quantifiers, etc.). Small, dependency-light heuristics shared across the repo.
- ingest
- Lightweight URL/file ingestion helpers (not a crawling/pipeline product). Document ingestion and preparation.
- joint
- Joint inference experiments (optional; not the primary API surface). Joint Entity Analysis: Coreference + NER + Entity Linking
- lang
- Language detection and classification utilities.
- linking
- Knowledge-base linking helpers (experimental). Entity Linking (NEL/NED) Module.
- offset
- Unified byte/character/token offset handling.
- preprocess
- Preprocessing for mention detection. Preprocessing utilities for text normalization and morphological analysis.
- schema
- Schema harmonization for multi-dataset NER training.
- similarity
- Text similarity utilities for entity matching and coreference resolution.
- sync
- Synchronization primitives with conditional compilation.
- temporal
- Temporal entity tracking, parsing, and diachronic NER.
- tokenizer
- Language-specific tokenization for multilingual NLP.
- types
- Type-level programming patterns for compile-time safety.
Structs§
- AnyModel
- A wrapper that turns an extractor closure into a Model.
- CorefChain
- A coreference chain: mentions that all refer to the same entity.
- CorefDocument
- A document with coreference annotations.
- Corpus
- A corpus of grounded documents for cross-document operations.
- DiscontinuousSpan
- A discontinuous span representing non-contiguous entity mentions.
- Entity
- A recognized named entity or relation trigger.
- EntityBuilder
- Fluent builder for constructing entities with optional fields.
- GraphDocument
- A complete graph document ready for export.
- GraphEdge
- An edge in the knowledge graph.
- GraphNode
- A node in the knowledge graph.
- GroundedDocument
- A document with grounded entity annotations using the three-level hierarchy.
- HashMapLexicon
- Simple HashMap-based lexicon implementation.
- HierarchicalConfidence
- Hierarchical confidence scores for coarse-to-fine extraction.
- Identity
- A global identity: a real-world entity linked to a knowledge base.
- IdentityId
- Unique identifier for an identity within a corpus.
- Mention
- A single mention (text span) that may corefer with other mentions.
- MockModel
- A mock NER model for testing purposes.
- ModelCapabilities
- Summary of a model’s capabilities, useful when working with Box<dyn Model>.
- PhiFeatures
- A bundle of phi-features (person, number, gender) for morphological agreement.
- Provenance
- Provenance information for an extracted entity.
- RaggedBatch
- A ragged (unpadded) batch for efficient ModernBERT inference.
- Relation
- A relation between two entities, forming a knowledge graph triple.
- Signal
- A raw detection signal: the atomic unit of entity extraction.
- SignalId
- Unique identifier for a signal within a document.
- SignalRef
- A reference to a signal within a track.
- SpanCandidate
- A candidate span for entity extraction.
- Track
- A track: a cluster of signals referring to the same entity within a document.
- TrackId
- Unique identifier for a track within a document.
- TrackRef
- A reference to a track in a specific document.
- TrackStats
- Aggregate statistics for a track (coreference chain).
- TypeMapper
- Maps domain-specific entity types to standard NER types.
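The descriptions of Signal and Track above suggest a layering in which raw detections are clustered per document. The sketch below illustrates that grouping with std-only stand-in types; `SignalSketch`, `TrackSketch`, and `group_into_tracks` are hypothetical names, not the crate's types, and case-insensitive string matching is only a toy substitute for real coreference resolution.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a raw detection: a character span plus its text.
#[derive(Debug, Clone, PartialEq)]
struct SignalSketch {
    id: usize,
    start: usize, // character offset, inclusive
    end: usize,   // character offset, exclusive
    surface: String,
}

/// Hypothetical stand-in for a within-document cluster of signals.
#[derive(Debug, Default, PartialEq)]
struct TrackSketch {
    signal_ids: Vec<usize>,
}

/// Cluster signals whose surface forms match case-insensitively.
/// Tracks come back ordered by their lowercased surface form.
fn group_into_tracks(signals: &[SignalSketch]) -> Vec<TrackSketch> {
    let mut by_key: BTreeMap<String, TrackSketch> = BTreeMap::new();
    for s in signals {
        by_key
            .entry(s.surface.to_lowercase())
            .or_default()
            .signal_ids
            .push(s.id);
    }
    by_key.into_values().collect()
}

fn main() {
    let signals = vec![
        SignalSketch { id: 0, start: 22, end: 25, surface: "IBM".into() },
        SignalSketch { id: 1, start: 30, end: 40, surface: "Xerox PARC".into() },
        SignalSketch { id: 2, start: 50, end: 53, surface: "ibm".into() },
    ];
    for (i, track) in group_into_tracks(&signals).iter().enumerate() {
        println!("track {i}: signals {:?}", track.signal_ids);
    }
}
```

Storing signal identifiers in the track, rather than copies of the signals, mirrors the reference-style entries in the list (SignalRef, TrackRef), though the crate's actual layout may differ.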
Enums§
- EntityCategory
- Category of entity based on detection characteristics and semantics.
- EntityType
- Entity type classification.
- EntityViewport
- Viewport context for multi-faceted entity representation.
- ExtractionMethod
- Extraction method used to identify an entity.
- Gender
- Gender classification for NLP tasks.
- GraphExportFormat
- Supported graph export formats.
- IdentitySource
- Source of identity formation.
- Location
- A location in some source medium.
- MentionType
- Type of referring expression in coreference.
- Modality
- The semiotic modality of a signal source.
- Number
- Grammatical number (singular, dual, plural).
- Person
- Grammatical person (1st, 2nd, 3rd).
- Quantifier
- Quantification type for symbolic signals.
- Span
- A span locator for text and visual modalities.
- TypeLabel
- A unified type label supporting both core and custom entity types.
- ValidationIssue
- Validation issue found during entity validation.
Constants§
- DEFAULT_BERT_ONNX_MODEL
- Default BERT ONNX model identifier (HuggingFace model ID).
- DEFAULT_CANDLE_MODEL
- Default Candle model identifier (HuggingFace model ID). Uses dbmdz’s model, which has both tokenizer.json and safetensors.
- DEFAULT_GLINER2_MODEL
- Default GLiNER2 ONNX model identifier (HuggingFace model ID).
- DEFAULT_GLINER_CANDLE_MODEL
- Default GLiNER Candle model identifier (HuggingFace model ID). Uses a model with tokenizer.json and pytorch_model.bin for Candle compatibility. The backend converts pytorch_model.bin to safetensors automatically.
- DEFAULT_GLINER_MODEL
- Default GLiNER ONNX model identifier (HuggingFace model ID).
- DEFAULT_NUNER_MODEL
- Default NuNER ONNX model identifier (HuggingFace model ID).
- DEFAULT_W2NER_MODEL
- Default W2NER ONNX model identifier (HuggingFace model ID).
Traits§
- BatchCapable
- Trait for models that support batch processing.
- CoreferenceResolver
- Trait for coreference resolution algorithms.
- DynamicLabels
- Trait for models that support dynamic/zero-shot entity type specification.
- GpuCapable
- Trait for models that support GPU acceleration.
- Lexicon
- Exact-match lexicon/gazetteer for entity lookup.
- Model
- Trait for NER model backends.
- NamedEntityCapable
- Marker trait for models that extract named entities (persons, organizations, locations).
- RelationCapable
- Trait for models that can extract relations between entities.
- StreamingCapable
- Trait for models that support streaming/chunked extraction.
- StructuredEntityCapable
- Marker trait for models that extract structured entities (dates, times, money, etc.).
Functions§
- auto
- Automatically select the best available NER backend.
- available_backends
- Check which backends are currently available.
- generate_span_candidates
- Generate all valid span candidates for a ragged batch.
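To make the idea of span candidates over a ragged batch concrete, here is a std-only sketch of exhaustive span enumeration. It is not the crate's `generate_span_candidates`: the `Candidate` type, the `span_candidates` function, and the `max_width` cap are all assumptions chosen for illustration, and what the crate considers a "valid" span is not specified here.

```rust
/// Hypothetical candidate: a half-open token range [start, end) in one document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    doc: usize,
    start: usize,
    end: usize,
}

/// Enumerate every span of 1..=max_width tokens in each document.
/// `doc_lens` holds per-document token counts, so no padding is needed.
fn span_candidates(doc_lens: &[usize], max_width: usize) -> Vec<Candidate> {
    let mut out = Vec::new();
    for (doc, &len) in doc_lens.iter().enumerate() {
        for start in 0..len {
            // Clamp the widest span so it never runs past the document end.
            let last_end = (start + max_width).min(len);
            for end in (start + 1)..=last_end {
                out.push(Candidate { doc, start, end });
            }
        }
    }
    out
}

fn main() {
    // Two documents of 3 tokens and 1 token, spans up to 2 tokens wide.
    let candidates = span_candidates(&[3, 1], 2);
    println!("{} candidates", candidates.len());
    for c in &candidates {
        println!("doc {} tokens {}..{}", c.doc, c.start, c.end);
    }
}
```

Capping the width keeps the candidate count roughly linear in document length (at most `len * max_width` per document) instead of quadratic.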