§anno
Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.
This is the published facade crate for the anno workspace. It re-exports
the internal anno-lib library API. The full CLI lives in crates/anno-cli/.
- NER: variable-length spans with character offsets (Unicode scalar values).
- Coreference: mention clusters (“tracks”) within a single document.
- Patterns: dates, monetary amounts, emails, URLs, phone numbers.
Internal crates (anno-lib, anno-core, anno-metrics, anno-eval,
anno-cli, anno-graph) are workspace-private and not separately published.
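The NER spans described above use character offsets (Unicode scalar values), which differ from Rust's native byte offsets as soon as the text contains non-ASCII characters. The following is a minimal sketch of the conversion using only the standard library. The function names `byte_to_char_offset` and `char_to_byte_offset` are hypothetical and are not anno's API; the crate lists its own `bytes_to_chars` and `chars_to_bytes` functions, whose signatures are not shown on this page.

```rust
/// Convert a byte offset into a character (Unicode scalar value) offset.
/// Returns `None` if `byte_idx` is past the end of `text` or falls inside
/// a multi-byte character.
/// Hypothetical helper for illustration; not part of anno.
fn byte_to_char_offset(text: &str, byte_idx: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_idx) {
        return None;
    }
    // Count the scalar values that precede the byte position.
    Some(text[..byte_idx].chars().count())
}

/// Convert a character offset into a byte offset.
/// An offset equal to the number of characters maps to `text.len()`.
/// Hypothetical helper for illustration; not part of anno.
fn char_to_byte_offset(text: &str, char_idx: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_pos, _)| byte_pos)
        // Allow the one-past-the-end position so exclusive span ends work.
        .chain(std::iter::once(text.len()))
        .nth(char_idx)
}
```

In `"café au"`, the space after `café` sits at byte offset 5 but character offset 4, because `é` occupies two bytes in UTF-8. Both helpers walk the string, so code converting many spans over the same text would normally build an index once instead.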
Modules§
- backends
- NER backend implementations.
- core
- anno-core’s stable types under a namespaced module.
- edit_distance
- Edit distance utilities for fuzzy string matching.
- env
- Environment variable utilities.
- error
- Error types for anno.
- features
- Entity feature extraction for downstream ML and analysis.
- heuristics
- Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
- ingest
- Lightweight URL/file ingestion helpers (not a crawling/pipeline product). Document ingestion and preparation.
- joint
- Joint inference experiments (optional; not the primary API surface). Joint entity analysis: coreference + NER + entity linking.
- lang
- Language detection and classification utilities.
- linking
- Knowledge-base linking helpers (experimental). Entity Linking (NEL/NED) Module.
- offset
- Unified byte/character/token offset handling.
- preprocess
- Preprocessing for mention detection. Preprocessing utilities for text normalization and morphological analysis.
- schema
- Schema harmonization for multi-dataset NER training.
- similarity
- Text similarity utilities for entity matching and coreference resolution.
- sync
- Synchronization primitives with conditional compilation.
- temporal
- Temporal entity tracking, parsing, and diachronic NER.
- tokenizer
- Language-specific tokenization for multilingual NLP.
- types
- Type-level programming patterns for compile-time safety.
Structs§
- AnyModel
- A wrapper that turns an extractor closure into a Model.
- AutoNER
- Automatic model selection - routes to the default model.
- BertNEROnnx
- BERT-based NER using ONNX Runtime.
- BinaryBlocker
- Blocker using binary embeddings for fast candidate filtering.
- BinaryHash
- Binary hash for fast approximate nearest neighbor search.
- Confidence
- A confidence score guaranteed to be in the range [0.0, 1.0].
- ConfidenceError
- Error when trying to create a Confidence from an invalid value.
- CorefChain
- A coreference chain: mentions that all refer to the same entity.
- CorefDocument
- A document with coreference annotations.
- CoreferenceCluster
- A coreference cluster (mentions referring to same entity).
- CoreferenceConfig
- Configuration for coreference resolution.
- Corpus
- A corpus of grounded documents for cross-document operations.
- CrfNER
- CRF-based NER model.
- DiscontinuousEntity
- An entity that may span multiple non-contiguous regions.
- DiscontinuousSpan
- A discontinuous span representing non-contiguous entity mentions.
- DotProductInteraction
- Dot product interaction (default, fast).
- EncoderOutput
- Output from text encoding.
- EnsembleNER
- Weighted ensemble of NER backends.
- Entity
- A recognized named entity or relation trigger.
- EntityBuilder
- Fluent builder for constructing entities with optional fields.
- ExtractionWithRelations
- Output from joint entity-relation extraction.
- GLiNEROnnx
- GroundedDocument
- A document with grounded entity annotations using the three-level hierarchy.
- HandshakingCell
- Result cell in a handshaking matrix.
- HandshakingMatrix
- Handshaking matrix for joint entity-relation extraction.
- HashMapLexicon
- Simple HashMap-based lexicon implementation.
- HeuristicNER
- Heuristic NER model.
- HierarchicalConfidence
- Hierarchical confidence scores for coarse-to-fine extraction.
- Identity
- A global identity: a real-world entity linked to a knowledge base.
- IdentityId
- Unique identifier for an identity within a corpus.
- InformationLoss
- Documents information lost during schema mapping.
- LabelDefinition
- Definition of a semantic label (entity type or relation type).
- LexiconNER
- NER backend that uses exact-match lexicon lookup.
- MaxSimInteraction
- MaxSim interaction (ColBERT-style, better for phrases).
- Mention
- A single mention (text span) that may corefer with other mentions.
- MentionCluster
- Coreference cluster from mention ranking.
- MentionRankingConfig
- Configuration for mention-ranking coref.
- MentionRankingCoref
- Mention-ranking coreference resolver.
- MockModel
- A mock NER model for testing purposes.
- ModelCapabilities
- Summary of a model’s capabilities, useful when working with Box<dyn Model>.
- NERExtractor
- NER extractor with fallback support.
- NuNER
- NuNER Zero-shot NER model.
- OffsetMapping
- Offset mapping from tokenizer.
- PhiFeatures
- A bundle of phi-features (person, number, gender) for morphological agreement.
- Provenance
- Provenance information for an extracted entity.
- RaggedBatch
- A ragged (unpadded) batch for efficient ModernBERT inference.
- RankedMention
- A detected mention with phi-features for coreference resolution.
- RegexNER
- Regex-based NER - extracts entities with recognizable formats using regex patterns.
- Relation
- A relation between two entities, forming a knowledge graph triple.
- RelationExtractionConfig
- Configuration for relation extraction.
- RelationTriple
- A relation triple linking two entities.
- SchemaMapper
- Maps dataset-specific labels to canonical types.
- Score
- A score guaranteed to be in the range [0.0, 1.0] (f32 precision).
- SemanticRegistry
- A frozen, pre-computed registry of entity and relation types.
- SemanticRegistryBuilder
- Builder for SemanticRegistry.
- Signal
- A raw detection signal: the atomic unit of entity extraction.
- SignalId
- Unique identifier for a signal within a document.
- SignalRef
- A reference to a signal within a track.
- SpanCandidate
- A candidate span for entity extraction.
- SpanConverter
- Converter for efficiently handling many spans from the same text.
- SpanLabelScore
- Score for a (span, label) match.
- SpanRepConfig
- Configuration for span representation.
- SpanRepresentationLayer
- Computes span representations from token embeddings.
- StackedNER
- Composable NER that combines multiple backends.
- StandardNormalizer
- Standard label normalizer with common NER ontology mappings.
- TPLinker
- TPLinker backend for joint entity-relation extraction.
- TextSpan
- A text span with both byte and character offsets.
- TokenSpan
- Span in subword token space.
- Track
- A track: a cluster of signals referring to the same entity within a document.
- TrackId
- Unique identifier for a track within a document.
- TrackRef
- A reference to a track in a specific document.
- TrackStats
- Aggregate statistics for a track (coreference chain).
- TypeMapper
- Maps domain-specific entity types to standard NER types.
- VisualPosition
- Visual position of a text token in an image.
- W2NER
- W2NER model for unified named entity recognition.
- W2NERConfig
- Configuration for W2NER decoding.
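Several structs above (Confidence, Score) promise a value in the range [0.0, 1.0]. One common way to enforce such a guarantee is a newtype with a checked constructor. The sketch below illustrates that pattern only; `UnitScore`, `OutOfRange`, and their methods are hypothetical names, and nothing here describes how anno's own Confidence or ConfidenceError are implemented.

```rust
/// A value known to lie in [0.0, 1.0].
/// Hypothetical type for illustration; not anno's `Confidence`.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct UnitScore(f64);

/// Error carrying the rejected input value (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
struct OutOfRange(f64);

impl UnitScore {
    /// Checked constructor: rejects NaN, infinities, and values outside [0, 1].
    fn new(value: f64) -> Result<Self, OutOfRange> {
        if value.is_finite() && (0.0..=1.0).contains(&value) {
            Ok(UnitScore(value))
        } else {
            Err(OutOfRange(value))
        }
    }

    /// Lossy constructor: clamps into [0, 1] and maps NaN to 0.0.
    fn saturating(value: f64) -> Self {
        if value.is_nan() {
            UnitScore(0.0)
        } else {
            UnitScore(value.clamp(0.0, 1.0))
        }
    }

    /// Read the inner value back out.
    fn get(self) -> f64 {
        self.0
    }
}
```

Because the field is private to the defining module in real code, every `UnitScore` that exists has passed through one of the two constructors, so downstream code does not need to re-check the range.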
Enums§
- BackendType
- Backend type identifier.
- CanonicalType
- Canonical entity type in the unified schema.
- ClusteringStrategy
- Clustering strategy for mention linking.
- CoarseType
- Coarse-grained schema for multi-dataset training.
- ConflictStrategy
- Strategy for resolving overlapping entity spans.
- DatasetSchema
- Known dataset schemas for automatic mapping.
- EntityCategory
- Category of entity based on detection characteristics and semantics.
- EntityType
- Entity type classification.
- EntityViewport
- Viewport context for multi-faceted entity representation.
- Error
- Error type for anno operations.
- ExtractionMethod
- Extraction method used to identify an entity.
- Gender
- Gender classification for NLP tasks.
- IdentitySource
- Source of identity formation.
- ImageFormat
- Image format hint for decoding.
- LabelCategory
- Category of semantic label.
- Language
- Supported languages for text analysis.
- Location
- A location in some source medium.
- MentionType
- Type of referring expression in coreference.
- Modality
- The semiotic modality of a signal source.
- ModalityHint
- Hint for which modality this label applies to.
- ModalityInput
- Input modality for the encoder.
- Number
- Grammatical number (singular, dual, plural).
- Person
- Grammatical person (1st, 2nd, 3rd).
- Quantifier
- Quantification type for symbolic signals.
- Span
- A span locator for text and visual modalities.
- TypeLabel
- A unified type label supporting both core and custom entity types.
- ValidationIssue
- Validation issue found during entity validation.
- W2NERRelation
- W2NER word-word relation types.
Constants§
- DEFAULT_BERT_ONNX_MODEL
- Default BERT ONNX model identifier (HuggingFace model ID).
- DEFAULT_CANDLE_MODEL
- Default Candle model identifier (HuggingFace model ID). Uses dbmdz’s model which has both tokenizer.json and safetensors.
- DEFAULT_GLINER2_MODEL
- Default GLiNER2 ONNX model identifier (HuggingFace model ID).
- DEFAULT_GLINER_CANDLE_MODEL
- Default GLiNER Candle model identifier (HuggingFace model ID). Uses a model with tokenizer.json and pytorch_model.bin for Candle compatibility. The backend converts pytorch_model.bin to safetensors automatically.
- DEFAULT_GLINER_MODEL
- Default GLiNER ONNX model identifier (HuggingFace model ID).
- DEFAULT_NUNER_MODEL
- Default NuNER ONNX model identifier (HuggingFace model ID).
- DEFAULT_W2NER_MODEL
- Default W2NER ONNX model identifier (HuggingFace model ID).
Traits§
- BatchCapable
- Trait for models that support batch processing.
- BiEncoder
- Bi-encoder architecture combining text and label encoders.
- CoreferenceResolver
- Trait for coreference resolution algorithms.
- DiscontinuousNER
- Support for discontinuous entity spans.
- DynamicLabels
- Trait for models that support dynamic/zero-shot entity type specification.
- EntitySliceExt
- Extension methods for slices of entities.
- GpuCapable
- Trait for models that support GPU acceleration.
- LabelEncoder
- Label encoder trait for encoding entity type descriptions.
- LabelNormalizer
- Trait for label prompt normalization.
- LateInteraction
- | DotProduct | s·l | Fast | Good | General purpose |
  | MaxSim | max(s·l) | Medium | Better | Multi-token labels |
  | Bilinear | s·W·l | Slow | Best | When accuracy critical |
- Lexicon
- Exact-match lexicon/gazetteer for entity lookup.
- Model
- Trait for NER model backends.
- NamedEntityCapable
- Marker trait for models that extract named entities (persons, organizations, locations).
- RelationCapable
- Trait for models that can extract relations between entities.
- RelationExtractor
- Joint entity and relation extraction.
- StreamingCapable
- Trait for models that support streaming/chunked extraction.
- StructuredEntityCapable
- Marker trait for models that extract structured entities (dates, times, money, etc.).
- TextEncoder
- Text encoder trait for transformer-based encoders.
- ZeroShotNER
- Zero-shot NER for open entity types.
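The LateInteraction entry above contrasts a plain dot product (s·l) with MaxSim (max(s·l)) for scoring a span against a label. A rough sketch of the two scoring rules follows, assuming a span and a label are each represented either by one pooled vector or by a list of token vectors. The names `dot_score` and `max_sim_score` are hypothetical, and summing the per-label-token maxima follows the usual ColBERT convention, which is an assumption here since the page only gives the shorthand formula. The bilinear variant is omitted because it needs a learned weight matrix W.

```rust
/// Dot product between two equally sized vectors (s·l).
/// Hypothetical helper for illustration; not anno's `DotProductInteraction`.
fn dot_score(span: &[f32], label: &[f32]) -> f32 {
    span.iter().zip(label).map(|(s, l)| s * l).sum()
}

/// ColBERT-style MaxSim: for every label token, take the best-matching
/// span token by dot product, then sum those maxima.
/// Returns 0.0 when either side has no tokens.
/// Hypothetical helper for illustration; not anno's `MaxSimInteraction`.
fn max_sim_score(span_tokens: &[Vec<f32>], label_tokens: &[Vec<f32>]) -> f32 {
    if span_tokens.is_empty() || label_tokens.is_empty() {
        return 0.0;
    }
    label_tokens
        .iter()
        .map(|label_tok| {
            span_tokens
                .iter()
                .map(|span_tok| dot_score(span_tok, label_tok))
                // Keep the highest similarity for this label token.
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

The dot product costs one pass over a single pair of vectors, while MaxSim compares every label token with every span token. That extra work is what lets a multi-token label such as "sports team" match on each of its words separately.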
Functions§
- auto
- Automatically select the best available NER backend.
- available_backends
- Check which backends are currently available.
- bytes_to_chars
- Convert byte offsets to character offsets.
- chars_to_bytes
- Convert character offsets to byte offsets.
- cosine_similarity
- Compute cosine similarity between two vectors.
- cosine_similarity_f32
- Compute cosine similarity between two f32 vectors.
- detect_language
- Simple heuristic language detection based on Unicode scripts.
- extract_relation_triples
- Extract relations as index-based triples (for joint extraction backends).
- extract_relations
- Extract relations between entities.
- generate_span_candidates
- Generate all valid span candidates for a ragged batch.
- is_ascii
- Fast check if text is ASCII-only.
- jaccard_word_similarity
- Compute Jaccard similarity on word sets.
- jaccard_word_similarity_f32
- Compute Jaccard similarity on word sets (f32 version).
- lock
- Lock a mutex using std::sync::Mutex, recovering from poisoning.
- map_to_canonical
- Unified label mapping - THE SINGLE SOURCE OF TRUTH.
- resolve_coreferences
- Resolve coreferences between entities using embedding similarity.
- string_similarity
- Compute string similarity using multiple strategies.
- try_lock
- Try to lock a mutex using std::sync::Mutex without blocking.
- two_stage_retrieval
- Recommended two-stage retrieval using binary blocking + dense reranking.
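Two of the similarity measures named in the function list above, cosine similarity and Jaccard similarity on word sets, have standard textbook definitions. The sketch below implements those definitions with the standard library only. The names `cosine` and `jaccard_words` are hypothetical, and the edge-case choices (returning 0.0 for mismatched lengths, zero vectors, or two empty strings, and lowercasing words before comparison) are assumptions, not a description of what anno's functions do.

```rust
use std::collections::HashSet;

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Returns 0.0 for empty input, mismatched lengths, or a zero vector.
/// Hypothetical helper for illustration; not anno's `cosine_similarity`.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Jaccard similarity on lowercased, whitespace-separated word sets:
/// |A ∩ B| / |A ∪ B|. Returns 0.0 when both inputs have no words.
/// Hypothetical helper for illustration; not anno's `jaccard_word_similarity`.
fn jaccard_words(a: &str, b: &str) -> f64 {
    let words_a: HashSet<String> = a.split_whitespace().map(str::to_lowercase).collect();
    let words_b: HashSet<String> = b.split_whitespace().map(str::to_lowercase).collect();
    let union = words_a.union(&words_b).count();
    if union == 0 {
        return 0.0;
    }
    let intersection = words_a.intersection(&words_b).count();
    intersection as f64 / union as f64
}
```

For entity matching, the two measures are complementary: cosine similarity compares dense embeddings, while word-level Jaccard compares surface forms, so "New York City" and "new york" share two of three distinct words.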
Type Aliases§
- Mutex
- Mutex type using std::sync::Mutex (default, no production feature).
- Probability
- Type alias for Confidence when used in probabilistic contexts.
- Result
- Result type for anno operations.
- UnitInterval
- Type alias for generic unit interval values.
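The Mutex alias and the lock and try_lock functions listed on this page are described as wrapping std::sync::Mutex and recovering from poisoning. With the standard library, that recovery is usually done through `PoisonError::into_inner`. The following is a minimal sketch of the idea; `lock_recover` and `try_lock_recover` are hypothetical names, and the actual signatures of anno's functions are not shown on this page.

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};

/// Lock a mutex, taking the guard even if another thread panicked
/// while holding the lock (a "poisoned" mutex).
/// Hypothetical helper for illustration; not anno's `lock`.
fn lock_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Try to lock without blocking. Returns `None` only when the lock is
/// currently held elsewhere; a poisoned mutex still yields its guard.
/// Hypothetical helper for illustration; not anno's `try_lock`.
fn try_lock_recover<T>(mutex: &Mutex<T>) -> Option<MutexGuard<'_, T>> {
    match mutex.try_lock() {
        Ok(guard) => Some(guard),
        Err(TryLockError::Poisoned(poisoned)) => Some(poisoned.into_inner()),
        Err(TryLockError::WouldBlock) => None,
    }
}
```

Recovering this way trades safety for availability: the protected data may have been left half-updated by the panicking thread, so the approach fits caches and counters better than data with strict invariants.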