Crate anno

§anno

Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.

This is the published facade crate for the anno workspace. It re-exports the internal anno-lib library API. The full CLI lives in crates/anno-cli/.

NER: variable-length spans with character offsets (Unicode scalar values).
Coreference: mention clusters (“tracks”) within a single document.
Patterns: dates, monetary amounts, emails, URLs, phone numbers.
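Because NER spans are reported in character offsets (Unicode scalar values) while Rust string slicing works on byte offsets, the two need converting whenever a span is used to slice a `&str`. The following is a minimal sketch of that conversion using only the standard library. The function names `byte_to_char_offset` and `char_to_byte_offset` are hypothetical and are not the crate's own `bytes_to_chars` / `chars_to_bytes`, whose signatures are not shown on this page.

```rust
/// Convert a byte offset into `text` to a character offset, i.e. the
/// number of Unicode scalar values that precede that byte.
/// Returns `None` if `byte_idx` is past the end of `text` or falls
/// inside a multi-byte character.
fn byte_to_char_offset(text: &str, byte_idx: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_idx) {
        return None;
    }
    Some(text[..byte_idx].chars().count())
}

/// Convert a character offset back to a byte offset.
/// An offset equal to the character count maps to `text.len()`;
/// anything larger returns `None`.
fn char_to_byte_offset(text: &str, char_idx: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .nth(char_idx)
}
```

For ASCII-only text the two offsets coincide, which is presumably why a fast ASCII check is useful as a shortcut before doing any conversion.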

Internal crates (anno-lib, anno-core, anno-metrics, anno-eval, anno-cli, anno-graph) are workspace-private and not separately published.

Modules§

backends: NER backend implementations.
core: anno-core’s stable types under a namespaced module.
edit_distance: Edit distance algorithms and utilities for fuzzy string matching.
env: Environment variable utilities.
error: Error types for anno.
features: Entity feature extraction for downstream ML and analysis.
heuristics: Small, dependency-light heuristics (negation, quantifiers, etc.) shared across the repo.
ingest: Lightweight URL/file helpers for document ingestion and preparation (not a crawling/pipeline product).
joint: Joint entity analysis (coreference + NER + entity linking). Experimental and optional; not the primary API surface.
lang: Language detection and classification utilities.
linking: Knowledge-base entity linking (NEL/NED) helpers (experimental).
offset: Unified byte/character/token offset handling.
preprocess: Preprocessing for mention detection: text normalization and morphological analysis.
schema: Schema harmonization for multi-dataset NER training.
similarity: Text similarity utilities for entity matching and coreference resolution.
sync: Synchronization primitives with conditional compilation.
temporal: Temporal entity tracking, parsing, and diachronic NER.
tokenizer: Language-specific tokenization for multilingual NLP.
types: Type-level programming patterns for compile-time safety.
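To illustrate the kind of routine the edit_distance module describes, here is a minimal sketch of Levenshtein distance computed over Unicode scalar values. Choosing Levenshtein is an assumption: this page does not say which algorithms the module implements, and the function name `levenshtein` is hypothetical.

```rust
/// Levenshtein distance between `a` and `b`, counted in Unicode scalar
/// values: the minimum number of single-character insertions, deletions,
/// and substitutions needed to turn `a` into `b`.
/// Uses two rolling rows, so memory is proportional to the length of `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];

    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1; // deleting i + 1 characters reaches the empty prefix
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

For fuzzy matching of entity names, a distance like this is usually normalized by the longer string's length so that thresholds are comparable across short and long mentions.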

Structs§

AnyModel: A wrapper that turns an extractor closure into a Model.
AutoNER: Automatic model selection; routes to the default model.
BertNEROnnx: BERT-based NER using ONNX Runtime.
BinaryBlocker: Blocker using binary embeddings for fast candidate filtering.
BinaryHash: Binary hash for fast approximate nearest neighbor search.
Confidence: A confidence score guaranteed to be in the range [0.0, 1.0].
ConfidenceError: Error when trying to create a Confidence from an invalid value.
CorefChain: A coreference chain: mentions that all refer to the same entity.
CorefDocument: A document with coreference annotations.
CoreferenceCluster: A coreference cluster (mentions referring to same entity).
CoreferenceConfig: Configuration for coreference resolution.
Corpus: A corpus of grounded documents for cross-document operations.
CrfNER: CRF-based NER model.
DiscontinuousEntity: An entity that may span multiple non-contiguous regions.
DiscontinuousSpan: A discontinuous span representing non-contiguous entity mentions.
DotProductInteraction: Dot product interaction (default, fast).
EncoderOutput: Output from text encoding.
EnsembleNER: Weighted ensemble of NER backends.
Entity: A recognized named entity or relation trigger.
EntityBuilder: Fluent builder for constructing entities with optional fields.
ExtractionWithRelations: Output from joint entity-relation extraction.
GLiNEROnnx
GroundedDocument: A document with grounded entity annotations using the three-level hierarchy.
HandshakingCell: Result cell in a handshaking matrix.
HandshakingMatrix: Handshaking matrix for joint entity-relation extraction.
HashMapLexicon: Simple HashMap-based lexicon implementation.
HeuristicNER: Heuristic NER model.
HierarchicalConfidence: Hierarchical confidence scores for coarse-to-fine extraction.
Identity: A global identity: a real-world entity linked to a knowledge base.
IdentityId: Unique identifier for an identity within a corpus.
InformationLoss: Documents information lost during schema mapping.
LabelDefinition: Definition of a semantic label (entity type or relation type).
LexiconNER: NER backend that uses exact-match lexicon lookup.
MaxSimInteraction: MaxSim interaction (ColBERT-style, better for phrases).
Mention: A single mention (text span) that may corefer with other mentions.
MentionCluster: Coreference cluster from mention ranking.
MentionRankingConfig: Configuration for mention-ranking coref.
MentionRankingCoref: Mention-ranking coreference resolver.
MockModel: A mock NER model for testing purposes.
ModelCapabilities: Summary of a model’s capabilities, useful when working with Box<dyn Model>.
NERExtractor: NER extractor with fallback support.
NuNER: NuNER Zero-shot NER model.
OffsetMapping: Offset mapping from tokenizer.
PhiFeatures: A bundle of phi-features (person, number, gender) for morphological agreement.
Provenance: Provenance information for an extracted entity.
RaggedBatch: A ragged (unpadded) batch for efficient ModernBERT inference.
RankedMention: A detected mention with phi-features for coreference resolution.
RegexNER: Regex-based NER; extracts entities with recognizable formats using regex patterns.
Relation: A relation between two entities, forming a knowledge graph triple.
RelationExtractionConfig: Configuration for relation extraction.
RelationTriple: A relation triple linking two entities.
SchemaMapper: Maps dataset-specific labels to canonical types.
Score: A score guaranteed to be in the range [0.0, 1.0] (f32 precision).
SemanticRegistry: A frozen, pre-computed registry of entity and relation types.
SemanticRegistryBuilder: Builder for SemanticRegistry.
Signal: A raw detection signal: the atomic unit of entity extraction.
SignalId: Unique identifier for a signal within a document.
SignalRef: A reference to a signal within a track.
SpanCandidate: A candidate span for entity extraction.
SpanConverter: Converter for efficiently handling many spans from the same text.
SpanLabelScore: Score for a (span, label) match.
SpanRepConfig: Configuration for span representation.
SpanRepresentationLayer: Computes span representations from token embeddings.
StackedNER: Composable NER that combines multiple backends.
StandardNormalizer: Standard label normalizer with common NER ontology mappings.
TPLinker: TPLinker backend for joint entity-relation extraction.
TextSpan: A text span with both byte and character offsets.
TokenSpan: Span in subword token space.
Track: A track: a cluster of signals referring to the same entity within a document.
TrackId: Unique identifier for a track within a document.
TrackRef: A reference to a track in a specific document.
TrackStats: Aggregate statistics for a track (coreference chain).
TypeMapper: Maps domain-specific entity types to standard NER types.
VisualPosition: Visual position of a text token in an image.
W2NER: W2NER model for unified named entity recognition.
W2NERConfig: Configuration for W2NER decoding.
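Several of the types above (Confidence, Score, ConfidenceError) describe a value guaranteed to lie in [0.0, 1.0]. A common way to get such a guarantee in Rust is a newtype with a validating constructor. The sketch below shows the general pattern only; `UnitScore` and `OutOfRange` are hypothetical names, and the crate's actual constructors and error fields may differ.

```rust
/// A value known to lie in the closed interval [0.0, 1.0].
/// The inner field is private to the type's module, so the only way to
/// build one is through `new`, which enforces the range.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct UnitScore(f64);

/// Error returned when a value is outside [0.0, 1.0] or is NaN.
#[derive(Debug, Clone, Copy)]
struct OutOfRange(f64);

impl UnitScore {
    /// Validate `value` and wrap it. NaN is rejected because it fails
    /// every range comparison.
    fn new(value: f64) -> Result<Self, OutOfRange> {
        if (0.0..=1.0).contains(&value) {
            Ok(Self(value))
        } else {
            Err(OutOfRange(value))
        }
    }

    /// Read the validated value back out.
    fn get(self) -> f64 {
        self.0
    }
}
```

Once a value has passed through `new`, downstream code can combine or compare scores without re-checking the range at every call site.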

Enums§

BackendType: Backend type identifier.
CanonicalType: Canonical entity type in the unified schema.
ClusteringStrategy: Clustering strategy for mention linking.
CoarseType: Coarse-grained schema for multi-dataset training.
ConflictStrategy: Strategy for resolving overlapping entity spans.
DatasetSchema: Known dataset schemas for automatic mapping.
EntityCategory: Category of entity based on detection characteristics and semantics.
EntityType: Entity type classification.
EntityViewport: Viewport context for multi-faceted entity representation.
Error: Error type for anno operations.
ExtractionMethod: Extraction method used to identify an entity.
Gender: Gender classification for NLP tasks.
IdentitySource: Source of identity formation.
ImageFormat: Image format hint for decoding.
LabelCategory: Category of semantic label.
Language: Supported languages for text analysis.
Location: A location in some source medium.
MentionType: Type of referring expression in coreference.
Modality: The semiotic modality of a signal source.
ModalityHint: Hint for which modality this label applies to.
ModalityInput: Input modality for the encoder.
Number: Grammatical number (singular, dual, plural).
Person: Grammatical person (1st, 2nd, 3rd).
Quantifier: Quantification type for symbolic signals.
Span: A span locator for text and visual modalities.
TypeLabel: A unified type label supporting both core and custom entity types.
ValidationIssue: Validation issue found during entity validation.
W2NERRelation: W2NER word-word relation types.

Constants§

DEFAULT_BERT_ONNX_MODEL: Default BERT ONNX model identifier (HuggingFace model ID).
DEFAULT_CANDLE_MODEL: Default Candle model identifier (HuggingFace model ID). Uses dbmdz’s model which has both tokenizer.json and safetensors.
DEFAULT_GLINER2_MODEL: Default GLiNER2 ONNX model identifier (HuggingFace model ID).
DEFAULT_GLINER_CANDLE_MODEL: Default GLiNER Candle model identifier (HuggingFace model ID). Uses a model with tokenizer.json and pytorch_model.bin for Candle compatibility. The backend converts pytorch_model.bin to safetensors automatically.
DEFAULT_GLINER_MODEL: Default GLiNER ONNX model identifier (HuggingFace model ID).
DEFAULT_NUNER_MODEL: Default NuNER ONNX model identifier (HuggingFace model ID).
DEFAULT_W2NER_MODEL: Default W2NER ONNX model identifier (HuggingFace model ID).

Traits§

BatchCapable: Trait for models that support batch processing.
BiEncoder: Bi-encoder architecture combining text and label encoders.
CoreferenceResolver: Trait for coreference resolution algorithms.
DiscontinuousNER: Support for discontinuous entity spans.
DynamicLabels: Trait for models that support dynamic/zero-shot entity type specification.
EntitySliceExt: Extension methods for slices of entities.
GpuCapable: Trait for models that support GPU acceleration.
LabelEncoder: Label encoder trait for encoding entity type descriptions.
LabelNormalizer: Trait for label prompt normalization.
LateInteraction: Late interaction strategies, listed as formula; speed; quality; use case. DotProduct: s·l; fast; good; general purpose. MaxSim: max(s·l); medium; better; multi-token labels. Bilinear: s·W·l; slow; best; when accuracy is critical.
Lexicon: Exact-match lexicon/gazetteer for entity lookup.
Model: Trait for NER model backends.
NamedEntityCapable: Marker trait for models that extract named entities (persons, organizations, locations).
RelationCapable: Trait for models that can extract relations between entities.
RelationExtractor: Joint entity and relation extraction.
StreamingCapable: Trait for models that support streaming/chunked extraction.
StructuredEntityCapable: Marker trait for models that extract structured entities (dates, times, money, etc.).
TextEncoder: Text encoder trait for transformer-based encoders.
ZeroShotNER: Zero-shot NER for open entity types.
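The LateInteraction entry above contrasts a plain dot product (s·l) with MaxSim (max(s·l)). The sketch below shows one reading of those two formulas, assuming `s` is a single span embedding and the label is represented by one embedding per label token. The function names `dot` and `max_sim` are hypothetical and do not reflect the trait's actual methods.

```rust
/// Dot product of two equal-length vectors (s·l).
/// If the lengths differ, the extra elements of the longer one are ignored.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// MaxSim-style score: the best dot product between the span embedding
/// and any one of the label's token embeddings (max over j of s·l_j).
/// Returns `None` when the label has no token embeddings.
fn max_sim(span: &[f32], label_tokens: &[Vec<f32>]) -> Option<f32> {
    label_tokens
        .iter()
        .map(|token| dot(span, token))
        .reduce(f32::max)
}
```

With a single label vector the two scores coincide; MaxSim only differs when a label is split into several token embeddings, which matches the "multi-token labels" use case in the listing.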

Functions§

auto: Automatically select the best available NER backend.
available_backends: Check which backends are currently available.
bytes_to_chars: Convert byte offsets to character offsets.
chars_to_bytes: Convert character offsets to byte offsets.
cosine_similarity: Compute cosine similarity between two vectors.
cosine_similarity_f32: Compute cosine similarity between two f32 vectors.
detect_language: Simple heuristic language detection based on Unicode scripts.
extract_relation_triples: Extract relations as index-based triples (for joint extraction backends).
extract_relations: Extract relations between entities.
generate_span_candidates: Generate all valid span candidates for a ragged batch.
is_ascii: Fast check if text is ASCII-only.
jaccard_word_similarity: Compute Jaccard similarity on word sets.
jaccard_word_similarity_f32: Compute Jaccard similarity on word sets (f32 version).
lock: Lock a mutex using std::sync::Mutex, recovering from poisoning.
map_to_canonical: Unified label mapping; the single source of truth.
resolve_coreferences: Resolve coreferences between entities using embedding similarity.
string_similarity: Compute string similarity using multiple strategies.
try_lock: Try to lock a mutex using std::sync::Mutex without blocking.
two_stage_retrieval: Recommended two-stage retrieval using binary blocking + dense reranking.
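Two of the similarity measures named above have standard definitions that are easy to state in code. The following is a minimal sketch of cosine similarity and word-level Jaccard similarity. The names `cosine` and `jaccard_words` are hypothetical, and the handling of zero vectors, empty inputs, and tokenization (whitespace splitting, case-sensitive) are assumptions rather than the crate's documented behavior.

```rust
use std::collections::HashSet;

/// Cosine similarity: dot(a, b) / (|a| * |b|).
/// Assumption: returns 0.0 when either vector has zero norm.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Jaccard similarity on whitespace-separated word sets:
/// |A ∩ B| / |A ∪ B|.
/// Assumption: returns 0.0 when both inputs contain no words.
fn jaccard_words(a: &str, b: &str) -> f64 {
    let words_a: HashSet<&str> = a.split_whitespace().collect();
    let words_b: HashSet<&str> = b.split_whitespace().collect();
    let union = words_a.union(&words_b).count();
    if union == 0 {
        return 0.0;
    }
    words_a.intersection(&words_b).count() as f64 / union as f64
}
```

For example, "new york city" and "york city hall" share two of four distinct words, giving a Jaccard similarity of 0.5.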

Type Aliases§

Mutex: Mutex type using std::sync::Mutex (default, no production feature).
Probability: Type alias for Confidence when used in probabilistic contexts.
Result: Result type for anno operations.
UnitInterval: Type alias for generic unit interval values.
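The `lock` function is described as recovering from poisoning. With `std::sync::Mutex`, a lock becomes poisoned when a thread panics while holding it, and the usual recovery is to take the guard out of the `PoisonError`. The sketch below shows that standard-library pattern; `lock_recover` is a hypothetical name and the crate's own `lock` signature is not shown on this page.

```rust
use std::sync::{Mutex, MutexGuard};

/// Lock `mutex`, and if a previous holder panicked (poisoning the lock),
/// recover the guard anyway instead of propagating the error.
/// The protected data may be in a partially updated state after a panic,
/// so this is only appropriate when that is acceptable to the caller.
fn lock_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

This trades strictness for availability: a single panicking worker no longer makes shared state permanently inaccessible to every other thread.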
