# anno
Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.
Dual-licensed under MIT or Apache-2.0.
## Tasks
Named entity recognition. Given input text, identify spans (start, end, type, confidence) where each span denotes a named entity [1, 2]. Entity types follow standard taxonomies (PER, ORG, LOC, MISC for CoNLL-style [2]) or caller-defined labels for zero-shot extraction. Offsets are character offsets (Unicode scalar values), not byte offsets; see CONTRACT.md.
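The character-versus-byte distinction matters for any non-ASCII text; a quick self-contained illustration (not the anno API, just standard Rust string handling):

```rust
fn main() {
    let text = "Café in Paris";
    // `find` returns a byte offset; 'é' occupies two bytes in UTF-8.
    let byte_start = text.find("Paris").unwrap();
    // Character offset = number of Unicode scalar values before the match.
    let char_start = text[..byte_start].chars().count();
    assert_eq!(byte_start, 9);
    assert_eq!(char_start, 8);
    println!("char offset {char_start}, byte offset {byte_start}");
}
```

A span reported as `(8, 13)` in anno's character-offset convention therefore does not index directly into the UTF-8 bytes; convert before slicing.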
Coreference resolution. Identify mention spans and group them into clusters, where each cluster tracks a consistent discourse referent within the document [3, 4]. Referents may be concrete or abstract, singular or plural, real-world or fictional — the constraint is within-document consistency, not ontological uniqueness. Coreferring mentions span proper names, definite descriptions, and pronouns: "Sophie Wilson", "the designer", and "She" form one cluster.
Relation extraction. Extract typed (head, relation, tail) triples from text. Available on RelationCapable backends; others fall back to co-occurrence edges for graph export.
Structured pattern extraction. Dates, monetary amounts, email addresses, URLs, phone numbers via deterministic regex grammars.
## Backends
All backends produce the same output type: variable-length spans with character offsets.
| Backend | Architecture | Labels | Zero-shot | Relations | Weights | Reference |
|---|---|---|---|---|---|---|
| stacked (default) | Selector/fallback | Best available | — | — | HuggingFace (when ML enabled) | -- |
| gliner | Bi-encoder span classifier | Custom | Yes | — | gliner_small-v2.1 | Zaratiana et al. [5] |
| gliner2 | Multi-task span classifier | Custom | Yes | Heuristic | gliner-multitask-large-v0.5 | [11] |
| nuner | Token classifier (BIO) | Custom | Yes | — | NuNerZero_onnx | Bogdanov et al. [6] |
| w2ner | Word-word relation grids | Trained (nested) | No | — | w2ner-bert-base | Li et al. [7] |
| bert-onnx | Sequence labeling (BERT) | PER/ORG/LOC/MISC | No | — | bert-base-NER-onnx | Devlin et al. [8] |
| tplinker | Joint entity-relation (heuristic) | Custom | — | Heuristic | None | [10] |
| crf | Conditional Random Field | Trained | No | — | Bundled (bundled-crf-weights) | Lafferty et al. [9] |
| hmm | Hidden Markov Model | Trained | No | — | Bundled (bundled-hmm-params) | Rabiner [12] |
| pattern | Regex grammars | DATE/MONEY/EMAIL/URL/PHONE | N/A | — | None | -- |
| heuristic | Capitalization + context | PER/ORG/LOC | N/A | — | None | -- |
| ensemble | Weighted voting combiner | Mixed | Varies | — | Varies | -- |
ML backends are feature-gated (onnx or candle). Weights download from HuggingFace on first use. All backends expose model.capabilities() for runtime discovery. See BACKENDS.md for selection guidance and feature-flag details.
## Install
From a local clone:
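A conventional cargo invocation (the crate path is an assumption; adjust to wherever the CLI crate lives in the checkout):

```shell
cargo install --path .
```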
Setting `ANNO_NO_DOWNLOADS=1` or `HF_HUB_OFFLINE=1` forces cached-only behavior (no network fetches).
## Examples
```text
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"
```
JSON output (schema-stable; uses the `pattern` backend for offline reproducibility):
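A sketch of the output shape, with field names as assumptions (the `schema` feature can generate the authoritative JSON Schema):

```json
{
  "entities": [
    { "text": "Lynn Conway", "entity_type": "PER", "start": 0, "end": 11, "confidence": 1.0 },
    { "text": "IBM", "entity_type": "ORG", "start": 22, "end": 25, "confidence": 1.0 }
  ]
}
```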
Zero-shot custom entity types (via GLiNER [5]):
```text
drug:1 "Aspirin" symptom:2 "headaches" "fever"
```
Coreference:
```text
Coreference: "Sophie Wilson" → "She"
```
## Downstream
Filter and pipe JSON output:

- all person entities, text only
- unique organizations, sorted

Batch a directory (parallel, cached), or stream stdin JSONL with one `{"id":"…","text":"…"}` object per line.
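As a sketch of such a pipeline, assuming the JSON output exposes an `entities` array with `text` and `entity_type` fields (hypothetical field names; check the actual schema):

```shell
# Stand-in for anno's JSON output; field names are assumptions.
json='{"entities":[{"text":"Lynn Conway","entity_type":"PER"},{"text":"IBM","entity_type":"ORG"},{"text":"Xerox PARC","entity_type":"ORG"}]}'

# All person entities, text only
echo "$json" | jq -r '.entities[] | select(.entity_type == "PER") | .text'
# → Lynn Conway

# Unique organizations, sorted
echo "$json" | jq -r '.entities[] | select(.entity_type == "ORG") | .text' | sort -u
```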
## Knowledge Graph (RDF)
`anno export` emits standard N-Triples or JSON-LD, loadable into any RDF store (Oxigraph, Jena, Blazegraph, etc.). `--base-uri` sets the IRI namespace:
- own namespace (recommended for private corpora)
- alignment to DBpedia (widely used LOD namespace)
Default (`urn:anno:`) produces stable URNs suitable for local use without a registered domain. The output is standard W3C N-Triples, so any SPARQL-capable store can load and query it.
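Because N-Triples is line-oriented, plain Unix tools also work on the export before it ever reaches a store; a toy sketch with illustrative URIs:

```shell
# Count distinct subjects in an N-Triples stream (URIs are illustrative).
printf '%s\n' \
  '<urn:anno:entity/person/0> <urn:anno:rel/works_for> <urn:anno:entity/org/1> .' \
  '<urn:anno:entity/org/1> <urn:anno:rel/located_in> <urn:anno:entity/loc/2> .' |
  cut -d' ' -f1 | sort -u | wc -l
# → 2
```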
Semantic relation triples — RelationCapable backends (tplinker, gliner2) produce typed (head, relation, tail) triples instead of co-occurrence edges. Use --format graph-ntriples (--features graph) for routing through the internal graph substrate:
```text
<entity/person/0_Lynn_Conway_0> <rel/works_for> <entity/org/1_IBM_13> .
```
## Property Graph CSV
Export node and edge tables as CSV for import into any property-graph database:
- semantic edges (gliner2 or tplinker)
- co-occurrence edges (any other backend)
Each input file produces `{stem}-nodes.csv` + `{stem}-edges.csv` with columns:

- Nodes: `id,entity_type,text,start,end,source`
- Edges: `from,to,rel_type,confidence`
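As a consumption sketch (rows invented to match the documented column layout), the edge table can be joined back to node labels with awk:

```shell
# Illustrative tables matching the documented columns.
cat > /tmp/anno-demo-nodes.csv <<'EOF'
id,entity_type,text,start,end,source
0,PER,Lynn Conway,0,11,gliner2
1,ORG,IBM,22,25,gliner2
EOF
cat > /tmp/anno-demo-edges.csv <<'EOF'
from,to,rel_type,confidence
0,1,works_for,0.91
EOF

# First pass stores node labels by id; second pass prints readable edges.
awk -F, 'NR==FNR { name[$1] = $3; next } FNR > 1 { print name[$1], "-[" $3 "]->", name[$2] }' \
  /tmp/anno-demo-nodes.csv /tmp/anno-demo-edges.csv
# → Lynn Conway -[works_for]-> IBM
```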
## Library
```toml
[dependencies]
anno-lib = "0.3"
```

The crate is published as `anno-lib` on crates.io; the Rust import name is `anno`.
### Feature flags
| Feature | Default | Description |
|---|---|---|
| onnx | Yes | ONNX Runtime backends (GLiNER, NuNER, BERT, W2NER) via ort |
| candle | No | Pure-Rust Candle backends (no C++ runtime needed) |
| eval | No | Evaluation harnesses, dataset loaders, and matrix sampling |
| graph | No | Knowledge-graph export adapters (N-Triples, JSON-LD, CSV) |
| schema | No | JSON Schema generation for output types via schemars |
The onnx feature (default) pulls in ort (ONNX Runtime bindings), which requires a C++ runtime. For minimal builds, use default-features = false.
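For instance, a Candle-only dependency declaration might look like this (feature names taken from the table above; treat as a sketch):

```toml
[dependencies]
anno-lib = { version = "0.3", default-features = false, features = ["candle"] }
```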
Basic extraction:

```rust
use anno::Model;

// Reconstructed sketch: constructor and error type are assumptions; see the API docs.
let m = anno::StackedModel::default();
let ents = m.extract_entities("Lynn Conway worked at IBM in California.")?;
assert!(!ents.is_empty());
# Ok::<(), Box<dyn std::error::Error>>(())
```
Zero-shot custom types via DynamicLabels (GLiNER, GLiNER2, NuNER):

```rust
use anno::{DynamicLabels, Model};

// Reconstructed sketch: names and signatures are assumptions; see the API docs.
let m = anno::GlinerModel::new()?;
let labels = DynamicLabels::new(&["drug", "symptom"]);
let _ents = m.extract_with_labels("Aspirin treats headaches and fever.", &labels)?;
# Ok::<(), Box<dyn std::error::Error>>(())
```
Runtime capability discovery; every backend reports what it supports:

```rust
use anno::Model;

// Reconstructed sketch around the documented `capabilities()` discovery call.
let m = anno::StackedModel::default();
println!("{:?}", m.capabilities());
```
## Architecture
| Crate | Purpose |
|---|---|
| anno (root) | Published facade; re-exports anno-lib |
| anno-lib | Backends, Model / RelationCapable traits, extraction pipeline |
| anno-core | Stable data model (Entity, Relation, Signal, Track, Identity, Corpus) |
| anno-graph | Graph/KG export adapters: converts extraction output to lattix::KnowledgeGraph; owns triple construction and --format graph-ntriples |
| anno-eval | Evaluation harnesses, dataset loaders, matrix sampling (uses muxer) |
| anno-cli | Full CLI; graph feature routes through anno-graph |
| anno-metrics | Shared evaluation primitives (CoNLL F1, encoders) |
Pipeline: Text → Extract → Coalesce → structured output. See ARCHITECTURE.md.
## Evaluation
anno-eval provides dataset loading, backend-vs-dataset compatibility gating, and CoNLL-style [2] span-level evaluation (precision, recall, F1) with label mapping between backend and dataset taxonomies.
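Span-level CoNLL-style scoring reduces to exact-match set intersection; a toy sketch of the arithmetic (not the anno-eval API):

```rust
use std::collections::HashSet;

fn main() {
    // Exact-match (start, end, label) spans: 3 gold, 2 predicted, 1 matching.
    let gold: HashSet<(usize, usize, &str)> =
        HashSet::from([(0, 11, "PER"), (22, 25, "ORG"), (30, 40, "LOC")]);
    let pred: HashSet<(usize, usize, &str)> =
        HashSet::from([(0, 11, "PER"), (22, 25, "MISC")]);

    let tp = gold.intersection(&pred).count() as f64;
    let precision = tp / pred.len() as f64; // 0.50
    let recall = tp / gold.len() as f64;    // 0.33
    let f1 = 2.0 * precision * recall / (precision + recall);
    println!("P={precision:.2} R={recall:.2} F1={f1:.2}");
    // → P=0.50 R=0.33 F1=0.40
}
```

Note that the mislabeled span (ORG vs MISC) counts as both a false positive and a false negative, which is why label mapping between backend and dataset taxonomies matters.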
anno sampler (--features eval) provides a randomized matrix sampler with two modes: triage (worst-first) and measure (ML-only stable measurement).
### Sampler examples
- deterministic offline decision-loop smoke example (also run in CI)
- the CI sampler harness run locally (uses `~/.anno_cache` for history/cache)
## Scope
Inference-time extraction only. Training is out of scope — use upstream frameworks (Hugging Face Transformers, Flair, etc.) and export ONNX weights for consumption.
## Documentation
- QUICKSTART
- CONTRACT — offset semantics, scope, feature gating
- BACKENDS — backend selection, architecture, feature flags
- ARCHITECTURE — crate layout, dependency flow
- REFERENCES — full bibliography (NER, coref, relation extraction, software)
- API docs
- Changelog
## References
Key citations; see docs/REFERENCES.md for the full list with links.
1. Grishman & Sundheim, COLING 1996 (MUC-6 NER). [PDF]
2. Tjong Kim Sang & De Meulder, CoNLL 2003 (CoNLL benchmark). [PDF]
3. Lee et al., EMNLP 2017 (end-to-end coreference). [arXiv]
4. Jurafsky & Martin, SLP3 2024 (coreference fundamentals). [Online]
5. Zaratiana et al., NAACL 2024 (GLiNER). [arXiv]
6. Bogdanov et al., 2024 (NuNER). [arXiv]
7. Li et al., AAAI 2022 (W2NER). [arXiv]
8. Devlin et al., NAACL 2019 (BERT). [arXiv]
9. Lafferty et al., ICML 2001 (CRF).
10. Wang et al., COLING 2020 (TPLinker). [arXiv]
11. Zaratiana et al., 2025 (GLiNER2). [arXiv]
12. Rabiner, Proceedings of the IEEE 1989 (HMM tutorial). [PDF]
Citeable via CITATION.cff.
## License
Dual-licensed under MIT or Apache-2.0.