Crate anno


§anno

Information extraction: named entity recognition (NER) and within-document coreference.

  • NER output: variable-length spans with character offsets (Unicode scalar values), not byte offsets.
  • Coreference output: clusters (“tracks”) of mentions within one document.
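Because offsets count Unicode scalar values, they cannot be used directly to slice a Rust `&str`, which is indexed by bytes. The sketch below shows one way to map a character range to a byte range using only the standard library. `char_range_to_byte_range` is a hypothetical helper written for this illustration; it is not the crate's own `chars_to_bytes` or `SpanConverter`, whose signatures are not shown on this page.

```rust
/// Convert a half-open character range (Unicode scalar values) into the
/// matching byte range of `text`. Returns `None` if the range is out of bounds.
/// Hypothetical helper for illustration; not part of anno's API.
fn char_range_to_byte_range(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte index of every character boundary, plus the end-of-string boundary.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .collect();
    Some((*boundaries.get(start)?, *boundaries.get(end)?))
}
```

For the text "Zoë Quinn met José.", the character range 14..18 ("José") maps to the byte range 15..20, because "ë" and "é" each occupy two bytes in UTF-8.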

This crate focuses on inference-time extraction. Dataset loaders, benchmarking, and matrix evaluation tooling live in anno-eval; the anno CLI lives in anno-cli.

§Quickstart

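use anno::{Model, StackedNER};

// Assumes anno's error type implements std::error::Error, so `?` can box it.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the default stacked extractor.
    let m = StackedNER::default();
    let ents = m.extract_entities("Lynn Conway worked at IBM and Xerox PARC.", None)?;
    assert!(!ents.is_empty());
    Ok(())
}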

§Zero-shot custom entity types

Zero-shot custom entity types are provided by GLiNER backends when the onnx feature is enabled. See the repo docs for the CLI flag (--extract-types) and the library API.

§Offline / downloads

By default, ML model weights may be downloaded on first use. To force cached-only behavior, prefetch the models and then set ANNO_NO_DOWNLOADS=1.
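One way to apply the switch without touching the current process environment is to set it on a child process. The sketch below uses only `std::process::Command`; `offline_command` and the `program` argument are assumptions made for this illustration, standing in for whatever binary embeds anno.

```rust
use std::process::Command;

/// Build a command that runs `program` with model downloads disabled.
/// `program` is a placeholder for any binary that uses anno (an assumption).
fn offline_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    // Documented switch: cached-only behavior, so models must be prefetched first.
    cmd.env("ANNO_NO_DOWNLOADS", "1");
    cmd
}
```

The returned `Command` can then be given arguments and spawned as usual; the variable applies only to that child process.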

Re-exports§

pub use error::Error;
pub use error::Result;
pub use lang::detect_language;
pub use lang::Language;
pub use offset::bytes_to_chars;
pub use offset::chars_to_bytes;
pub use offset::is_ascii;
pub use offset::OffsetMapping;
pub use offset::SpanConverter;
pub use offset::TextSpan;
pub use offset::TokenSpan;
pub use backends::label_prompt::LabelNormalizer;
pub use backends::label_prompt::StandardNormalizer;
pub use backends::AutoNER;
pub use backends::BackendType;
pub use backends::ConflictStrategy;
pub use backends::CrfNER;
pub use backends::EnsembleNER;
pub use backends::HeuristicNER;
pub use backends::LexiconNER;
pub use backends::NERExtractor;
pub use backends::NuNER;
pub use backends::RegexNER;
pub use backends::StackedNER;
pub use backends::TPLinker;
pub use backends::W2NERConfig;
pub use backends::W2NERRelation;
pub use backends::W2NER;
pub use backends::mention_ranking::ClusteringStrategy;
pub use backends::mention_ranking::MentionCluster;
pub use backends::mention_ranking::MentionRankingConfig;
pub use backends::mention_ranking::MentionRankingCoref;
pub use backends::mention_ranking::RankedMention;
pub use backends::BertNEROnnx;
pub use backends::GLiNEROnnx;
pub use schema::*;
pub use similarity::*;
pub use sync::*;
pub use types::*;
pub use backends::inference::*;

Modules§

backends
NER backend implementations.
core
anno-core’s stable types under a namespaced module.
edit_distance
Edit distance algorithms and utilities for fuzzy string matching.
env
Environment variable utilities.
error
Error types for anno.
features
Entity feature extraction for downstream ML and analysis.
heuristics
Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
ingest
Lightweight URL/file helpers for document ingestion and preparation (not a crawling/pipeline product).
joint
Joint entity analysis experiments combining coreference, NER, and entity linking (optional; not the primary API surface).
lang
Language detection and classification utilities.
linking
Experimental knowledge-base entity linking (NEL/NED) helpers.
offset
Unified byte/character/token offset handling.
preprocess
Preprocessing for mention detection: utilities for text normalization and morphological analysis.
schema
Schema harmonization for multi-dataset NER training.
similarity
Text similarity utilities for entity matching and coreference resolution.
sync
Synchronization primitives with conditional compilation.
temporal
Temporal entity tracking, parsing, and diachronic NER.
tokenizer
Language-specific tokenization for multilingual NLP.
types
Type-level programming patterns for compile-time safety.

Structs§

AnyModel
A wrapper that turns an extractor closure into a Model.
CorefChain
A coreference chain: mentions that all refer to the same entity.
CorefDocument
A document with coreference annotations.
Corpus
A corpus of grounded documents for cross-document operations.
DiscontinuousSpan
A discontinuous span representing non-contiguous entity mentions.
Entity
A recognized named entity or relation trigger.
EntityBuilder
Fluent builder for constructing entities with optional fields.
GraphDocument
A complete graph document ready for export.
GraphEdge
An edge in the knowledge graph.
GraphNode
A node in the knowledge graph.
GroundedDocument
A document with grounded entity annotations using the three-level hierarchy.
HashMapLexicon
Simple HashMap-based lexicon implementation.
HierarchicalConfidence
Hierarchical confidence scores for coarse-to-fine extraction.
Identity
A global identity: a real-world entity linked to a knowledge base.
IdentityId
Unique identifier for an identity within a corpus.
Mention
A single mention (text span) that may corefer with other mentions.
MockModel
A mock NER model for testing purposes.
ModelCapabilities
Summary of a model’s capabilities, useful when working with Box<dyn Model>.
PhiFeatures
A bundle of phi-features (person, number, gender) for morphological agreement.
Provenance
Provenance information for an extracted entity.
RaggedBatch
A ragged (unpadded) batch for efficient ModernBERT inference.
Relation
A relation between two entities, forming a knowledge graph triple.
Signal
A raw detection signal: the atomic unit of entity extraction.
SignalId
Unique identifier for a signal within a document.
SignalRef
A reference to a signal within a track.
SpanCandidate
A candidate span for entity extraction.
Track
A track: a cluster of signals referring to the same entity within a document.
TrackId
Unique identifier for a track within a document.
TrackRef
A reference to a track in a specific document.
TrackStats
Aggregate statistics for a track (coreference chain).
TypeMapper
Maps domain-specific entity types to standard NER types.
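The Signal, Track, and Identity descriptions above read as three levels: raw detections, within-document clusters, and corpus-level identities. The sketch below illustrates that reading with simplified stand-in types. Every name prefixed with `Sketch`, the field layout, and the `surfaces` helper are assumptions for illustration and do not reflect the actual definitions in anno.

```rust
// Simplified stand-ins for illustration; anno's real types differ.
#[allow(dead_code)]
struct SketchSignal {
    start: usize, // character offset, inclusive
    end: usize,   // character offset, exclusive
    surface: String,
}

/// Level 2: a within-document cluster, stored as indices into the signals.
struct SketchTrack {
    signals: Vec<usize>,
}

struct SketchDocument {
    signals: Vec<SketchSignal>,
    tracks: Vec<SketchTrack>,
}

/// Level 3: a corpus-level identity pointing at (document index, track index).
#[allow(dead_code)]
struct SketchIdentity {
    kb_id: String, // placeholder knowledge-base key
    tracks: Vec<(usize, usize)>,
}

/// Collect every surface form of an identity by walking
/// identity -> tracks -> signals across documents.
fn surfaces<'a>(identity: &SketchIdentity, docs: &'a [SketchDocument]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for &(doc_idx, track_idx) in &identity.tracks {
        let doc = &docs[doc_idx];
        for &sig_idx in &doc.tracks[track_idx].signals {
            out.push(doc.signals[sig_idx].surface.as_str());
        }
    }
    out
}
```

Storing indices rather than copies keeps each level a thin view over the one below it, which is one plausible reason for separate reference types such as SignalRef and TrackRef.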

Enums§

EntityCategory
Category of entity based on detection characteristics and semantics.
EntityType
Entity type classification.
EntityViewport
Viewport context for multi-faceted entity representation.
ExtractionMethod
Extraction method used to identify an entity.
Gender
Gender classification for NLP tasks.
GraphExportFormat
Supported graph export formats.
IdentitySource
Source of identity formation.
Location
A location in some source medium.
MentionType
Type of referring expression in coreference.
Modality
The semiotic modality of a signal source.
Number
Grammatical number (singular, dual, plural).
Person
Grammatical person (1st, 2nd, 3rd).
Quantifier
Quantification type for symbolic signals.
Span
A span locator for text and visual modalities.
TypeLabel
A unified type label supporting both core and custom entity types.
ValidationIssue
Validation issue found during entity validation.

Constants§

DEFAULT_BERT_ONNX_MODEL
Default BERT ONNX model identifier (HuggingFace model ID).
DEFAULT_CANDLE_MODEL
Default Candle model identifier (HuggingFace model ID). Uses dbmdz’s model, which ships both tokenizer.json and safetensors weights.
DEFAULT_GLINER2_MODEL
Default GLiNER2 ONNX model identifier (HuggingFace model ID).
DEFAULT_GLINER_CANDLE_MODEL
Default GLiNER Candle model identifier (HuggingFace model ID). Uses a model with tokenizer.json and pytorch_model.bin for Candle compatibility. The backend converts pytorch_model.bin to safetensors automatically.
DEFAULT_GLINER_MODEL
Default GLiNER ONNX model identifier (HuggingFace model ID).
DEFAULT_NUNER_MODEL
Default NuNER ONNX model identifier (HuggingFace model ID).
DEFAULT_W2NER_MODEL
Default W2NER ONNX model identifier (HuggingFace model ID).

Traits§

BatchCapable
Trait for models that support batch processing.
CoreferenceResolver
Trait for coreference resolution algorithms.
DynamicLabels
Trait for models that support dynamic/zero-shot entity type specification.
GpuCapable
Trait for models that support GPU acceleration.
Lexicon
Exact-match lexicon/gazetteer for entity lookup.
Model
Trait for NER model backends.
NamedEntityCapable
Marker trait for models that extract named entities (persons, organizations, locations).
RelationCapable
Trait for models that can extract relations between entities.
StreamingCapable
Trait for models that support streaming/chunked extraction.
StructuredEntityCapable
Marker trait for models that extract structured entities (dates, times, money, etc.).

Functions§

auto
Automatically select the best available NER backend.
available_backends
Check which backends are currently available.
generate_span_candidates
Generate all valid span candidates for a ragged batch.
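Span-based NER models typically score every token span up to some maximum width, which is what span candidate generation over a ragged batch amounts to. The sketch below shows that enumeration under stated assumptions: the batch is described by cumulative sequence boundaries in a flattened token array, and `SketchSpan`, `enumerate_spans`, and `max_width` are hypothetical names. The real `generate_span_candidates`, `RaggedBatch`, and `SpanCandidate` may be shaped differently.

```rust
/// A candidate span: sequence index plus a half-open token range
/// relative to the start of that sequence. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchSpan {
    seq: usize,
    start: usize,
    end: usize,
}

/// Enumerate every span of width 1..=max_width inside each sequence.
/// `cumulative` holds non-decreasing sequence boundaries in the flattened
/// token array, e.g. [0, 3, 5] for two sequences of lengths 3 and 2.
fn enumerate_spans(cumulative: &[usize], max_width: usize) -> Vec<SketchSpan> {
    let mut spans = Vec::new();
    for (seq, bounds) in cumulative.windows(2).enumerate() {
        let len = bounds[1] - bounds[0];
        for start in 0..len {
            // Never let a span run past the end of its own sequence.
            let longest = max_width.min(len - start);
            for width in 1..=longest {
                spans.push(SketchSpan { seq, start, end: start + width });
            }
        }
    }
    spans
}
```

Because boundaries are tracked per sequence, no candidate crosses from one document into the next, and no padding tokens are ever enumerated.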