Crate anno

§anno

Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.

This is the published facade crate for the anno workspace. It re-exports the internal anno-lib library API. The full CLI lives in crates/anno-cli/.

  • NER: variable-length spans with character offsets (Unicode scalar values).
  • Coreference: mention clusters (“tracks”) within a single document.
  • Patterns: dates, monetary amounts, emails, URLs, phone numbers.
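The NER bullet above reports spans as character offsets counted in Unicode scalar values, which differ from UTF-8 byte offsets as soon as the text contains non-ASCII characters. The following is a minimal std-only sketch of that distinction. `char_to_byte_offset` and `slice_by_chars` are hypothetical helpers written for illustration; they are not part of the anno API, which lists its own `bytes_to_chars` and `chars_to_bytes` functions whose signatures are not shown on this page.

```rust
/// Convert a character offset (counted in Unicode scalar values) into a
/// UTF-8 byte offset. Hypothetical helper for illustration only.
/// Returns `None` if `char_offset` lies past the end of `text`.
fn char_to_byte_offset(text: &str, char_offset: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        // Also allow the one-past-the-end offset, so that a span may end
        // exactly at the end of the text.
        .chain(std::iter::once(text.len()))
        .nth(char_offset)
}

/// Slice `text` by a half-open character range `[start, end)`.
/// Returns `None` for out-of-bounds or reversed ranges.
fn slice_by_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    let byte_start = char_to_byte_offset(text, start)?;
    let byte_end = char_to_byte_offset(text, end)?;
    text.get(byte_start..byte_end)
}
```

With the text "Zoë met Søren", the name "Søren" occupies characters 8 to 13 but bytes 9 to 15, because "ë" and "ø" each take two bytes in UTF-8. Slicing a Rust `&str` directly with character offsets would therefore return the wrong text or panic on a non-boundary index.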

Internal crates (anno-lib, anno-core, anno-metrics, anno-eval, anno-cli, anno-graph) are workspace-private and not separately published.

Modules§

backends
NER backend implementations.
core
anno-core’s stable types under a namespaced module.
edit_distance
Edit distance algorithms and utilities for fuzzy string matching.
env
Environment variable utilities.
error
Error types for anno.
features
Entity feature extraction for downstream ML and analysis.
heuristics
Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
ingest
Lightweight URL/file helpers for document ingestion and preparation (not a crawling/pipeline product).
joint
Joint entity analysis: coreference, NER, and entity linking. Experimental and optional; not the primary API surface.
lang
Language detection and classification utilities.
linking
Entity linking (NEL/NED) helpers for knowledge-base linking (experimental).
offset
Unified byte/character/token offset handling.
preprocess
Preprocessing for mention detection: text normalization and morphological analysis.
schema
Schema harmonization for multi-dataset NER training.
similarity
Text similarity utilities for entity matching and coreference resolution.
sync
Synchronization primitives with conditional compilation.
temporal
Temporal entity tracking, parsing, and diachronic NER.
tokenizer
Language-specific tokenization for multilingual NLP.
types
Type-level programming patterns for compile-time safety.

Structs§

AnyModel
A wrapper that turns an extractor closure into a Model.
AutoNER
Automatic model selection - routes to the default model.
BertNEROnnx
BERT-based NER using ONNX Runtime.
BinaryBlocker
Blocker using binary embeddings for fast candidate filtering.
BinaryHash
Binary hash for fast approximate nearest neighbor search.
Confidence
A confidence score guaranteed to be in the range [0.0, 1.0].
ConfidenceError
Error when trying to create a Confidence from an invalid value.
CorefChain
A coreference chain: mentions that all refer to the same entity.
CorefDocument
A document with coreference annotations.
CoreferenceCluster
A coreference cluster (mentions referring to same entity).
CoreferenceConfig
Configuration for coreference resolution.
Corpus
A corpus of grounded documents for cross-document operations.
CrfNER
CRF-based NER model.
DiscontinuousEntity
An entity that may span multiple non-contiguous regions.
DiscontinuousSpan
A discontinuous span representing non-contiguous entity mentions.
DotProductInteraction
Dot product interaction (default, fast).
EncoderOutput
Output from text encoding.
EnsembleNER
Weighted ensemble of NER backends.
Entity
A recognized named entity or relation trigger.
EntityBuilder
Fluent builder for constructing entities with optional fields.
ExtractionWithRelations
Output from joint entity-relation extraction.
GLiNEROnnx
GroundedDocument
A document with grounded entity annotations using the three-level hierarchy.
HandshakingCell
Result cell in a handshaking matrix.
HandshakingMatrix
Handshaking matrix for joint entity-relation extraction.
HashMapLexicon
Simple HashMap-based lexicon implementation.
HeuristicNER
Heuristic NER model.
HierarchicalConfidence
Hierarchical confidence scores for coarse-to-fine extraction.
Identity
A global identity: a real-world entity linked to a knowledge base.
IdentityId
Unique identifier for an identity within a corpus.
InformationLoss
Documents information lost during schema mapping.
LabelDefinition
Definition of a semantic label (entity type or relation type).
LexiconNER
NER backend that uses exact-match lexicon lookup.
MaxSimInteraction
MaxSim interaction (ColBERT-style, better for phrases).
Mention
A single mention (text span) that may corefer with other mentions.
MentionCluster
Coreference cluster from mention ranking.
MentionRankingConfig
Configuration for mention-ranking coref.
MentionRankingCoref
Mention-ranking coreference resolver.
MockModel
A mock NER model for testing purposes.
ModelCapabilities
Summary of a model’s capabilities, useful when working with Box<dyn Model>.
NERExtractor
NER extractor with fallback support.
NuNER
NuNER zero-shot NER model.
OffsetMapping
Offset mapping from tokenizer.
PhiFeatures
A bundle of phi-features (person, number, gender) for morphological agreement.
Provenance
Provenance information for an extracted entity.
RaggedBatch
A ragged (unpadded) batch for efficient ModernBERT inference.
RankedMention
A detected mention with phi-features for coreference resolution.
RegexNER
Regex-based NER - extracts entities with recognizable formats using regex patterns.
Relation
A relation between two entities, forming a knowledge graph triple.
RelationExtractionConfig
Configuration for relation extraction.
RelationTriple
A relation triple linking two entities.
SchemaMapper
Maps dataset-specific labels to canonical types.
Score
A score guaranteed to be in the range [0.0, 1.0] (f32 precision).
SemanticRegistry
A frozen, pre-computed registry of entity and relation types.
SemanticRegistryBuilder
Builder for SemanticRegistry.
Signal
A raw detection signal: the atomic unit of entity extraction.
SignalId
Unique identifier for a signal within a document.
SignalRef
A reference to a signal within a track.
SpanCandidate
A candidate span for entity extraction.
SpanConverter
Converter for efficiently handling many spans from the same text.
SpanLabelScore
Score for a (span, label) match.
SpanRepConfig
Configuration for span representation.
SpanRepresentationLayer
Computes span representations from token embeddings.
StackedNER
Composable NER that combines multiple backends.
StandardNormalizer
Standard label normalizer with common NER ontology mappings.
TPLinker
TPLinker backend for joint entity-relation extraction.
TextSpan
A text span with both byte and character offsets.
TokenSpan
Span in subword token space.
Track
A track: a cluster of signals referring to the same entity within a document.
TrackId
Unique identifier for a track within a document.
TrackRef
A reference to a track in a specific document.
TrackStats
Aggregate statistics for a track (coreference chain).
TypeMapper
Maps domain-specific entity types to standard NER types.
VisualPosition
Visual position of a text token in an image.
W2NER
W2NER model for unified named entity recognition.
W2NERConfig
Configuration for W2NER decoding.
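Several entries above (`Confidence`, `ConfidenceError`, `Score`) describe values guaranteed to lie in [0.0, 1.0]. One common way to enforce such a guarantee in Rust is a newtype with a validating constructor. The sketch below illustrates that pattern only: `UnitScore` and `UnitScoreError` are hypothetical names, and nothing here is taken from the crate's actual implementation.

```rust
/// Hypothetical bounded score; a stand-in for illustration, not the
/// crate's `Confidence` type.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct UnitScore(f64);

/// Reasons a raw value can be rejected (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum UnitScoreError {
    NotANumber,
    OutOfRange(f64),
}

impl UnitScore {
    /// Validating constructor: the only way to obtain a `UnitScore`
    /// from an arbitrary float, so the invariant holds everywhere.
    fn new(value: f64) -> Result<Self, UnitScoreError> {
        if value.is_nan() {
            Err(UnitScoreError::NotANumber)
        } else if !(0.0..=1.0).contains(&value) {
            Err(UnitScoreError::OutOfRange(value))
        } else {
            Ok(UnitScore(value))
        }
    }

    /// Lossy constructor: clamps into [0.0, 1.0] and maps NaN to 0.0.
    fn saturating(value: f64) -> Self {
        if value.is_nan() {
            UnitScore(0.0)
        } else {
            UnitScore(value.clamp(0.0, 1.0))
        }
    }

    /// Read the inner value back out.
    fn get(self) -> f64 {
        self.0
    }
}
```

Because the inner field is private to the defining module, downstream code cannot construct an out-of-range score, so consumers never need to re-check the bound.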

Enums§

BackendType
Backend type identifier.
CanonicalType
Canonical entity type in the unified schema.
ClusteringStrategy
Clustering strategy for mention linking.
CoarseType
Coarse-grained schema for multi-dataset training.
ConflictStrategy
Strategy for resolving overlapping entity spans.
DatasetSchema
Known dataset schemas for automatic mapping.
EntityCategory
Category of entity based on detection characteristics and semantics.
EntityType
Entity type classification.
EntityViewport
Viewport context for multi-faceted entity representation.
Error
Error type for anno operations.
ExtractionMethod
Extraction method used to identify an entity.
Gender
Gender classification for NLP tasks.
IdentitySource
Source of identity formation.
ImageFormat
Image format hint for decoding.
LabelCategory
Category of semantic label.
Language
Supported languages for text analysis.
Location
A location in some source medium.
MentionType
Type of referring expression in coreference.
Modality
The semiotic modality of a signal source.
ModalityHint
Hint for which modality this label applies to.
ModalityInput
Input modality for the encoder.
Number
Grammatical number (singular, dual, plural).
Person
Grammatical person (1st, 2nd, 3rd).
Quantifier
Quantification type for symbolic signals.
Span
A span locator for text and visual modalities.
TypeLabel
A unified type label supporting both core and custom entity types.
ValidationIssue
Validation issue found during entity validation.
W2NERRelation
W2NER word-word relation types.

Constants§

DEFAULT_BERT_ONNX_MODEL
Default BERT ONNX model identifier (HuggingFace model ID).
DEFAULT_CANDLE_MODEL
Default Candle model identifier (HuggingFace model ID). Uses dbmdz’s model which has both tokenizer.json and safetensors.
DEFAULT_GLINER2_MODEL
Default GLiNER2 ONNX model identifier (HuggingFace model ID).
DEFAULT_GLINER_CANDLE_MODEL
Default GLiNER Candle model identifier (HuggingFace model ID). Uses a model with tokenizer.json and pytorch_model.bin for Candle compatibility. The backend converts pytorch_model.bin to safetensors automatically.
DEFAULT_GLINER_MODEL
Default GLiNER ONNX model identifier (HuggingFace model ID).
DEFAULT_NUNER_MODEL
Default NuNER ONNX model identifier (HuggingFace model ID).
DEFAULT_W2NER_MODEL
Default W2NER ONNX model identifier (HuggingFace model ID).

Traits§

BatchCapable
Trait for models that support batch processing.
BiEncoder
Bi-encoder architecture combining text and label encoders.
CoreferenceResolver
Trait for coreference resolution algorithms.
DiscontinuousNER
Support for discontinuous entity spans.
DynamicLabels
Trait for models that support dynamic/zero-shot entity type specification.
EntitySliceExt
Extension methods for slices of entities.
GpuCapable
Trait for models that support GPU acceleration.
LabelEncoder
Label encoder trait for encoding entity type descriptions.
LabelNormalizer
Trait for label prompt normalization.
LateInteraction
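| Interaction | Formula | Speed | Accuracy | Use case |
|-------------|----------|--------|----------|---------------------------|
| DotProduct | s·l | Fast | Good | General purpose |
| MaxSim | max(s·l) | Medium | Better | Multi-token labels |
| Bilinear | s·W·l | Slow | Best | When accuracy is critical |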
Lexicon
Exact-match lexicon/gazetteer for entity lookup.
Model
Trait for NER model backends.
NamedEntityCapable
Marker trait for models that extract named entities (persons, organizations, locations).
RelationCapable
Trait for models that can extract relations between entities.
RelationExtractor
Joint entity and relation extraction.
StreamingCapable
Trait for models that support streaming/chunked extraction.
StructuredEntityCapable
Marker trait for models that extract structured entities (dates, times, money, etc.).
TextEncoder
Text encoder trait for transformer-based encoders.
ZeroShotNER
Zero-shot NER for open entity types.
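The `LateInteraction` entry above lists DotProduct (s·l) and MaxSim (max(s·l)) scoring between a span embedding s and a label embedding l. The sketch below shows one plausible reading of those two formulas, assuming a span is a single vector and a multi-token label is a list of per-token vectors. The function names `dot` and `max_sim` are hypothetical, and the crate's own trait methods may be defined differently (ColBERT's original MaxSim, for instance, sums per-token maxima over query tokens).

```rust
/// Dot product of two equal-length embeddings (the "s·l" formula).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim under the assumption stated above: score the span against each
/// label token embedding and keep the best match ("max(s·l)").
/// Returns `None` when the label has no token embeddings.
fn max_sim(span: &[f32], label_tokens: &[Vec<f32>]) -> Option<f32> {
    label_tokens
        .iter()
        .map(|token| dot(span, token))
        .reduce(f32::max)
}
```

The trade-off matches the listing: a single dot product is one pass over the vector, while MaxSim costs one dot product per label token but lets a phrase-like label match on its most relevant token.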

Functions§

auto
Automatically select the best available NER backend.
available_backends
Check which backends are currently available.
bytes_to_chars
Convert byte offsets to character offsets.
chars_to_bytes
Convert character offsets to byte offsets.
cosine_similarity
Compute cosine similarity between two vectors.
cosine_similarity_f32
Compute cosine similarity between two f32 vectors.
detect_language
Simple heuristic language detection based on Unicode scripts.
extract_relation_triples
Extract relations as index-based triples (for joint extraction backends).
extract_relations
Extract relations between entities.
generate_span_candidates
Generate all valid span candidates for a ragged batch.
is_ascii
Fast check if text is ASCII-only.
jaccard_word_similarity
Compute Jaccard similarity on word sets.
jaccard_word_similarity_f32
Compute Jaccard similarity on word sets (f32 version).
lock
Lock a mutex using std::sync::Mutex, recovering from poisoning.
map_to_canonical
Unified label mapping: the single source of truth for mapping labels to canonical types.
resolve_coreferences
Resolve coreferences between entities using embedding similarity.
string_similarity
Compute string similarity using multiple strategies.
try_lock
Try to lock a mutex using std::sync::Mutex without blocking.
two_stage_retrieval
Recommended two-stage retrieval using binary blocking + dense reranking.
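The function list above includes cosine similarity over vectors and Jaccard similarity over word sets. As a rough illustration of what those two measures compute, here is a std-only sketch. The names `cosine` and `jaccard_words` are hypothetical, and several behaviors are assumptions made for the example rather than the crate's documented behavior: mismatched lengths and zero vectors score 0.0, words are split on whitespace and lowercased, and two empty inputs score 0.0.

```rust
use std::collections::HashSet;

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Assumption: returns 0.0 for mismatched lengths or zero-norm vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|.
/// Assumption: whitespace tokenization, case-insensitive comparison.
fn jaccard_words(a: &str, b: &str) -> f64 {
    let set_a: HashSet<String> = a.split_whitespace().map(str::to_lowercase).collect();
    let set_b: HashSet<String> = b.split_whitespace().map(str::to_lowercase).collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 0.0;
    }
    let intersection = set_a.intersection(&set_b).count();
    intersection as f64 / union as f64
}
```

For entity matching, the two measures are complementary: Jaccard compares surface forms ("Barack Obama" against "barack obama jr" shares two of three distinct words), while cosine compares embedding vectors and can match mentions with no words in common.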

Type Aliases§

Mutex
Mutex type using std::sync::Mutex (the default when the production feature is not enabled).
Probability
Type alias for Confidence when used in probabilistic contexts.
Result
Result type for anno operations.
UnitInterval
Type alias for generic unit interval values.
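The `Mutex` alias and the `lock` function are described as wrapping `std::sync::Mutex` and recovering from poisoning. With the standard library, poisoning recovery is usually written by taking the guard out of the `PoisonError`. The sketch below shows that idiom; `lock_recover` is a hypothetical name, and the crate's own `lock` may differ in signature and behavior.

```rust
use std::sync::{Mutex, MutexGuard, PoisonError};

/// Lock a mutex, recovering the guard even if another thread panicked
/// while holding it (which marks the mutex as "poisoned").
/// Hypothetical helper illustrating the std idiom.
fn lock_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    // `lock()` returns Err(PoisonError) on a poisoned mutex; the error
    // still carries the guard, which `into_inner` hands back.
    mutex.lock().unwrap_or_else(PoisonError::into_inner)
}
```

Recovering this way trades safety for availability: the protected data may be in a partially updated state after a panic, so the idiom suits caches and counters better than data with strict invariants.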