Crate anno

§anno

Information extraction: named entity recognition (NER) and within-document coreference.

NER output: variable-length spans with character offsets (Unicode scalar values), not byte offsets.
Coreference output: clusters (“tracks”) of mentions within one document.
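Because spans are reported in characters rather than bytes, slicing a Rust &str with them directly is only safe for ASCII text. The following is a minimal sketch, using only the standard library, of how a character range can be mapped to a byte range. The function name char_range_to_byte_range is hypothetical and is not part of anno; the crate re-exports its own helpers (chars_to_bytes, bytes_to_chars, SpanConverter) from the offset module, whose signatures are not shown on this page.

```rust
/// Map a half-open character range (Unicode scalar values) to the
/// equivalent byte range in `text`. Returns None if the range is
/// reversed or extends past the end of the string.
/// Hypothetical helper for illustration; not anno's implementation.
fn char_range_to_byte_range(text: &str, start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    // Byte index of every character boundary, plus the end of the string,
    // so that boundaries[i] is the byte offset of the i-th character.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .collect();
    let byte_start = *boundaries.get(start)?;
    let byte_end = *boundaries.get(end)?;
    Some((byte_start, byte_end))
}
```

For the text "Café Noir opened in Zürich.", the span covering "Zürich" is characters 20..26 but bytes 21..28, because "é" occupies two bytes in UTF-8. Collecting all boundaries is O(n) per call; for many spans over one document it is usually cheaper to build the table once and reuse it.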

This crate focuses on inference-time extraction. Dataset loaders, benchmarking, and matrix evaluation tooling live in anno-eval; the anno CLI lives in anno-cli.

§Quickstart

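use anno::{Model, StackedNER};

// `?` needs a Result-returning context, so the snippet is wrapped in main.
fn main() -> Result<(), anno::Error> {
    let m = StackedNER::default();
    let ents = m.extract_entities("Lynn Conway worked at IBM and Xerox PARC.", None)?;
    assert!(!ents.is_empty());
    Ok(())
}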

§Zero-shot custom entity types

Zero-shot custom entity types are provided by GLiNER backends when the onnx feature is enabled. See the repo docs for the CLI flag (--extract-types) and the library API.

§Offline / downloads

By default, ML weights may download on first use. To force cached-only behavior, set ANNO_NO_DOWNLOADS=1 (after prefetching models).
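One way to apply this setting without changing the current process's environment is to set the variable only on a child process. The sketch below uses std::process::Command for that; the function name offline_command and the program name passed to it are hypothetical, and only the variable name and value come from the text above.

```rust
use std::process::Command;

/// Build a command for `program` that will run with downloads disabled.
/// `program` is a placeholder for whatever binary uses anno; this sketch
/// only configures the environment and does not spawn anything.
fn offline_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    // Variable name and value as documented above; models are assumed
    // to have been prefetched into the cache beforehand.
    cmd.env("ANNO_NO_DOWNLOADS", "1");
    cmd
}
```

Calling .status() or .output() on the returned Command would then launch the program with cached-only behavior requested.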

Re-exports§

pub use error::Error;
pub use error::Result;
pub use lang::detect_language;
pub use lang::Language;
pub use offset::bytes_to_chars;
pub use offset::chars_to_bytes;
pub use offset::is_ascii;
pub use offset::OffsetMapping;
pub use offset::SpanConverter;
pub use offset::TextSpan;
pub use offset::TokenSpan;
pub use backends::label_prompt::LabelNormalizer;
pub use backends::label_prompt::StandardNormalizer;
pub use backends::AutoNER;
pub use backends::BackendType;
pub use backends::ConflictStrategy;
pub use backends::CrfNER;
pub use backends::EnsembleNER;
pub use backends::HeuristicNER;
pub use backends::LexiconNER;
pub use backends::NERExtractor;
pub use backends::NuNER;
pub use backends::RegexNER;
pub use backends::StackedNER;
pub use backends::TPLinker;
pub use backends::W2NERConfig;
pub use backends::W2NERRelation;
pub use backends::W2NER;
pub use backends::mention_ranking::ClusteringStrategy;
pub use backends::mention_ranking::MentionCluster;
pub use backends::mention_ranking::MentionRankingConfig;
pub use backends::mention_ranking::MentionRankingCoref;
pub use backends::mention_ranking::RankedMention;
pub use backends::BertNEROnnx;
pub use backends::GLiNEROnnx;
pub use schema::*;
pub use similarity::*;
pub use sync::*;
pub use types::*;
pub use backends::inference::*;

Modules§

backends: NER backend implementations.
core: anno-core’s stable types under a namespaced module.
edit_distance: Edit distance utilities for fuzzy string matching.
env: Environment variable utilities.
error: Error types for anno.
features: Entity feature extraction for downstream ML and analysis.
heuristics: Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
ingest: Lightweight URL/file helpers for document ingestion and preparation (not a crawling/pipeline product).
joint: Joint inference experiments combining coreference, NER, and entity linking (optional; not the primary API surface).
lang: Language detection and classification utilities.
linking: Knowledge-base entity linking (NEL/NED) helpers (experimental).
offset: Unified byte/character/token offset handling.
preprocess: Preprocessing for mention detection: text normalization and morphological analysis.
schema: Schema harmonization for multi-dataset NER training.
similarity: Text similarity utilities for entity matching and coreference resolution.
sync: Synchronization primitives with conditional compilation.
temporal: Temporal entity tracking, parsing, and diachronic NER.
tokenizer: Language-specific tokenization for multilingual NLP.
types: Type-level programming patterns for compile-time safety.

Structs§

AnyModel: A wrapper that turns an extractor closure into a Model.
CorefChain: A coreference chain: mentions that all refer to the same entity.
CorefDocument: A document with coreference annotations.
Corpus: A corpus of grounded documents for cross-document operations.
DiscontinuousSpan: A discontinuous span representing non-contiguous entity mentions.
Entity: A recognized named entity or relation trigger.
EntityBuilder: Fluent builder for constructing entities with optional fields.
GraphDocument: A complete graph document ready for export.
GraphEdge: An edge in the knowledge graph.
GraphNode: A node in the knowledge graph.
GroundedDocument: A document with grounded entity annotations using the three-level hierarchy.
HashMapLexicon: Simple HashMap-based lexicon implementation.
HierarchicalConfidence: Hierarchical confidence scores for coarse-to-fine extraction.
Identity: A global identity: a real-world entity linked to a knowledge base.
IdentityId: Unique identifier for an identity within a corpus.
Mention: A single mention (text span) that may corefer with other mentions.
MockModel: A mock NER model for testing purposes.
ModelCapabilities: Summary of a model’s capabilities, useful when working with Box<dyn Model>.
PhiFeatures: A bundle of phi-features (person, number, gender) for morphological agreement.
Provenance: Provenance information for an extracted entity.
RaggedBatch: A ragged (unpadded) batch for efficient ModernBERT inference.
Relation: A relation between two entities, forming a knowledge graph triple.
Signal: A raw detection signal: the atomic unit of entity extraction.
SignalId: Unique identifier for a signal within a document.
SignalRef: A reference to a signal within a track.
SpanCandidate: A candidate span for entity extraction.
Track: A track: a cluster of signals referring to the same entity within a document.
TrackId: Unique identifier for a track within a document.
TrackRef: A reference to a track in a specific document.
TrackStats: Aggregate statistics for a track (coreference chain).
TypeMapper: Maps domain-specific entity types to standard NER types.

Enums§

EntityCategory: Category of entity based on detection characteristics and semantics.
EntityType: Entity type classification.
EntityViewport: Viewport context for multi-faceted entity representation.
ExtractionMethod: Extraction method used to identify an entity.
Gender: Gender classification for NLP tasks.
GraphExportFormat: Supported graph export formats.
IdentitySource: Source of identity formation.
Location: A location in some source medium.
MentionType: Type of referring expression in coreference.
Modality: The semiotic modality of a signal source.
Number: Grammatical number (singular, dual, plural).
Person: Grammatical person (1st, 2nd, 3rd).
Quantifier: Quantification type for symbolic signals.
Span: A span locator for text and visual modalities.
TypeLabel: A unified type label supporting both core and custom entity types.
ValidationIssue: Validation issue found during entity validation.

Constants§

DEFAULT_BERT_ONNX_MODEL: Default BERT ONNX model identifier (HuggingFace model ID).
DEFAULT_CANDLE_MODEL: Default Candle model identifier (HuggingFace model ID). Uses dbmdz’s model which has both tokenizer.json and safetensors.
DEFAULT_GLINER2_MODEL: Default GLiNER2 ONNX model identifier (HuggingFace model ID).
DEFAULT_GLINER_CANDLE_MODEL: Default GLiNER Candle model identifier (HuggingFace model ID). Uses a model with tokenizer.json and pytorch_model.bin for Candle compatibility. The backend converts pytorch_model.bin to safetensors automatically.
DEFAULT_GLINER_MODEL: Default GLiNER ONNX model identifier (HuggingFace model ID).
DEFAULT_NUNER_MODEL: Default NuNER ONNX model identifier (HuggingFace model ID).
DEFAULT_W2NER_MODEL: Default W2NER ONNX model identifier (HuggingFace model ID).

Traits§

BatchCapable: Trait for models that support batch processing.
CoreferenceResolver: Trait for coreference resolution algorithms.
DynamicLabels: Trait for models that support dynamic/zero-shot entity type specification.
GpuCapable: Trait for models that support GPU acceleration.
Lexicon: Exact-match lexicon/gazetteer for entity lookup.
Model: Trait for NER model backends.
NamedEntityCapable: Marker trait for models that extract named entities (persons, organizations, locations).
RelationCapable: Trait for models that can extract relations between entities.
StreamingCapable: Trait for models that support streaming/chunked extraction.
StructuredEntityCapable: Marker trait for models that extract structured entities (dates, times, money, etc.).

Functions§

auto: Automatically select the best available NER backend.
available_backends: Check which backends are currently available.
generate_span_candidates: Generate all valid span candidates for a ragged batch.
