# anno
Information extraction for unstructured text: named entity recognition (NER), coreference resolution, entity linking, relation extraction, and zero-shot entity types.
Dual-licensed under MIT or Apache-2.0. MSRV: 1.85.
## Library

The crate is published as anno on crates.io.

```toml
[dependencies]
anno = "0.3.9"
```

The default feature (onnx) pulls in ort (ONNX Runtime C++ bindings). For builds without a C++ runtime, use default-features = false.

```rust
use anno::StackedNER;

let m = StackedNER::default();
let ents = m.extract_entities("Lynn Conway worked at IBM and Xerox PARC.")?;
assert!(!ents.is_empty());
# Ok::<(), Box<dyn std::error::Error>>(())
```
StackedNER::default() selects the best available backend at runtime: GLiNER (if onnx enabled and model cached), falling back to heuristic + pattern extraction. Set ANNO_NO_DOWNLOADS=1 or HF_HUB_OFFLINE=1 to force cached-only behavior.
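The selector behavior amounts to a preference chain: try each backend in order and take the first one that is usable. A minimal std-only sketch of that pattern (the trait and struct names here are invented for illustration, not anno's internal API):

```rust
// Illustrative sketch of the selector/fallback pattern behind StackedNER:
// walk backends in preference order, pick the first that reports available.
trait Backend {
    fn available(&self) -> bool;
    fn name(&self) -> &'static str;
}

struct Gliner {
    cached: bool, // stands in for "onnx enabled and model cached"
}
struct Heuristic;

impl Backend for Gliner {
    fn available(&self) -> bool {
        self.cached
    }
    fn name(&self) -> &'static str {
        "gliner"
    }
}

impl Backend for Heuristic {
    fn available(&self) -> bool {
        true // heuristic + pattern extraction always works
    }
    fn name(&self) -> &'static str {
        "heuristic+pattern"
    }
}

fn select(backends: &[&dyn Backend]) -> &'static str {
    backends
        .iter()
        .find(|b| b.available())
        .map(|b| b.name())
        .expect("at least one backend is always available")
}

fn main() {
    // With no cached GLiNER model, selection falls through to the heuristic.
    let gliner = Gliner { cached: false };
    let heuristic = Heuristic;
    let chain: [&dyn Backend; 2] = [&gliner, &heuristic];
    println!("{}", select(&chain)); // prints "heuristic+pattern"
}
```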
Zero-shot custom types via DynamicLabels (GLiNER, GLiNER2, NuNER):

```rust
use anno::GLiNER;

// Constructor and label arguments were elided in the source;
// see the API docs for exact signatures.
let m = GLiNER::new()?;
let ents = m.extract_with_labels("Aspirin relieves headaches and fever.", &["drug", "symptom"])?;
# Ok::<(), Box<dyn std::error::Error>>(())
```
## Feature flags

| Feature | Default | Description |
|---|---|---|
| onnx | Yes | ONNX Runtime backends (GLiNER, NuNER, BERT, W2NER, FCoref) via ort |
| candle | No | Pure-Rust Candle backends (no C++ runtime needed) |
| analysis | No | Evaluation helpers, coref metrics, RAG rewriting |
| eval | No | Superset of analysis; adds network support for dataset fetching |
| graph | No | Keywords, summarization, salience (TextRank/LexRank/YAKE) |
| schema | No | JSON Schema generation for output types via schemars |
| discourse | No | Centering theory, abstract anaphora, shell nouns |
## Tasks
Named entity recognition. Given input text, identify spans (start, end, type, confidence) where each span denotes a named entity [1, 2]. Entity types follow standard taxonomies (PER, ORG, LOC, MISC for CoNLL-style [2]) or caller-defined labels for zero-shot extraction. Offsets are character offsets (Unicode scalar values), not byte offsets; see CONTRACT.md.
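Because spans are character offsets while Rust's &str is byte-indexed, slicing a span out of the source text needs a small conversion. A std-only sketch (the helper name is ours, not anno's):

```rust
// Convert a (start, end) span in characters (Unicode scalar values) into the
// byte range needed to slice a Rust &str. Helper name is illustrative.
fn char_span_to_byte_span(text: &str, start: usize, end: usize) -> (usize, usize) {
    let byte_at = |ci: usize| {
        text.char_indices()
            .nth(ci)
            .map(|(b, _)| b)
            .unwrap_or_else(|| text.len())
    };
    (byte_at(start), byte_at(end))
}

fn main() {
    let text = "Héloïse joined IBM.";
    // "Héloïse" is chars 0..7 but bytes 0..9 ('é' and 'ï' are 2 bytes each).
    let (bs, be) = char_span_to_byte_span(text, 0, 7);
    assert_eq!(&text[bs..be], "Héloïse");
}
```

For pure-ASCII text the two offset systems coincide, which is why the mismatch often only surfaces on accented names or non-Latin scripts.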
Coreference resolution. Identify mention spans and group them into clusters, where each cluster tracks a consistent discourse referent within the document [3, 4]. Coreferring mentions span proper names, definite descriptions, and pronouns: "Sophie Wilson", "the designer", and "She" form one cluster. Multiple backends: rule-based sieves (SimpleCorefResolver), neural (FCoref, 78.5 F1 on CoNLL-2012), and mention-ranking (MentionRankingCoref).
Relation extraction. Extract typed (head, relation, tail) triples from text. Available on RelationCapable backends (gliner2, tplinker); others fall back to co-occurrence edges for graph export.
Structured pattern extraction. Dates, monetary amounts, email addresses, URLs, phone numbers via deterministic regex grammars.
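To show the deterministic flavor of this backend, here is a toy PERCENT matcher (std only; the real backend uses fuller regex grammars and covers more types):

```rust
// Toy deterministic matcher for PERCENT spans, in the spirit of the pattern
// backend. Returns (start, end) byte offsets; on ASCII input these coincide
// with character offsets.
fn find_percents(text: &str) -> Vec<(usize, usize)> {
    let b = text.as_bytes();
    let mut spans = Vec::new();
    let mut i = 0;
    while i < b.len() {
        if b[i].is_ascii_digit() {
            let start = i;
            // consume a digit run, allowing a decimal point
            while i < b.len() && (b[i].is_ascii_digit() || b[i] == b'.') {
                i += 1;
            }
            // only emit a span when the run is immediately followed by '%'
            if i < b.len() && b[i] == b'%' {
                i += 1;
                spans.push((start, i));
            }
        } else {
            i += 1;
        }
    }
    spans
}

fn main() {
    let text = "Revenue grew 3.5% in Q1 and 12% in Q2.";
    for (s, e) in find_percents(text) {
        println!("PERCENT {:?} {}..{}", &text[s..e], s, e);
    }
}
```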
## Backends
All backends produce the same output type: variable-length spans with character offsets.
| Backend | Architecture | Labels | Zero-shot | Relations | Weights | Reference |
|---|---|---|---|---|---|---|
| stacked (default) | Selector/fallback | GLiNER > BERT > heuristic+pattern | -- | -- | HuggingFace (when ML enabled) | -- |
| gliner | Bi-encoder span classifier | Custom | Yes | -- | gliner_small-v2.1 | Zaratiana et al. [5] |
| gliner2 | Multi-task span classifier | Custom | Yes | Heuristic | gliner-multitask-large-v0.5 | [11] |
| nuner | Token classifier (BIO) | Custom | Yes | -- | NuNerZero_onnx | Bogdanov et al. [6] |
| bert-onnx | Sequence labeling (BERT) | PER/ORG/LOC/MISC | No | -- | bert-base-NER-onnx | Devlin et al. [8] |
| pattern | Regex grammars | DATE/MONEY/EMAIL/URL/PHONE/PERCENT | N/A | -- | None | -- |
| tplinker | Joint entity-relation (heuristic) | Custom | -- | Heuristic | None | [10] |
ML backends are feature-gated (onnx or candle). Weights download from HuggingFace on first use. See BACKENDS.md for the full list (including experimental backends) and feature-flag details.
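For the "Token classifier (BIO)" row: BIO decoding collapses per-token tags into spans, and the standard algorithm fits in a few lines (std only; our own helper, not anno's):

```rust
// Standard BIO decoding, as used by token-classifier backends such as nuner:
// a B-X tag opens a span, matching I-X tags extend it, anything else closes
// it. Spans are (start_token, end_token_exclusive, label).
fn bio_to_spans(tags: &[&str]) -> Vec<(usize, usize, String)> {
    let mut spans = Vec::new();
    let mut open: Option<(usize, String)> = None;
    for (i, tag) in tags.iter().enumerate() {
        match tag.split_once('-') {
            Some(("B", label)) => {
                if let Some((s, l)) = open.take() {
                    spans.push((s, i, l));
                }
                open = Some((i, label.to_string()));
            }
            Some(("I", label)) if open.as_ref().map_or(false, |(_, l)| l.as_str() == label) => {
                // same label continues the open span
            }
            _ => {
                // "O", a label mismatch, or a malformed tag closes the span
                if let Some((s, l)) = open.take() {
                    spans.push((s, i, l));
                }
            }
        }
    }
    if let Some((s, l)) = open {
        spans.push((s, tags.len(), l));
    }
    spans
}

fn main() {
    let tags = ["B-PER", "I-PER", "O", "B-ORG"];
    println!("{:?}", bio_to_spans(&tags));
}
```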
## CLI

### Examples

```text
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"
```

JSON output (--format json):

Zero-shot custom entity types (via GLiNER [5]):

```text
drug:1 "Aspirin"
symptom:2 "headaches" "fever"
```

Coreference:

```text
Coreference: "Sophie Wilson" → "She"
```
## Coreference

Three coref backends with different tradeoffs:

| Backend | Type | Quality | Speed | Weights |
|---|---|---|---|---|
| SimpleCorefResolver | Rule-based (6 sieves) | Low | Fast | None |
| FCoref | Neural (DistilRoBERTa) | 78.5 F1 | Medium | ONNX export |
| MentionRankingCoref | Mention-ranking | Medium | Medium | None |
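As a toy version of what one sieve in the rule-based stack does, here is an exact-string-match sieve: mentions with identical text join the same cluster (SimpleCorefResolver layers six such sieves; this sketch is ours, not its code):

```rust
// Toy exact-match sieve in the spirit of SimpleCorefResolver's rule-based
// stack: mentions with identical surface text merge into one cluster.
use std::collections::HashMap;

fn exact_match_sieve(mentions: &[&str]) -> Vec<Vec<usize>> {
    let mut by_text: HashMap<&str, usize> = HashMap::new(); // text -> cluster index
    let mut clusters: Vec<Vec<usize>> = Vec::new();
    for (i, m) in mentions.iter().enumerate() {
        match by_text.get(m) {
            Some(&c) => clusters[c].push(i),
            None => {
                by_text.insert(*m, clusters.len());
                clusters.push(vec![i]);
            }
        }
    }
    clusters
}

fn main() {
    let mentions = ["Sophie Wilson", "She", "Sophie Wilson"];
    // "She" stays a singleton here; later, looser sieves would attach it.
    println!("{:?}", exact_match_sieve(&mentions));
}
```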
FCoref (Otmazgin et al., AACL 2022) requires a one-time model export. The export script lives in the repo at scripts/export_fcoref.py (not shipped in the crates.io package -- clone the repo to run it):

```shell
# From a repo clone:
uv run scripts/export_fcoref.py
```

```rust
let coref = FCoref::from_path("path/to/exported/model")?;
let clusters = coref.resolve("Marie Curie pioneered radioactivity research. She won two Nobel Prizes.")?;
// clusters[0] = { mentions: ["Marie Curie", "She"], canonical: "Marie Curie" }
# Ok::<(), Box<dyn std::error::Error>>(())
```
RAG preprocessing (rag::resolve_for_rag(), requires analysis feature): rewrites pronouns with their antecedents so document chunks remain self-contained after splitting. Supports English, French, Spanish, German. Includes pleonastic "it" filter and overlap guard.
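The core substitution step behind that rewriting can be sketched in a few lines (this is not resolve_for_rag's real signature, just the splice it performs once a cluster's canonical mention is known):

```rust
// Splice a cluster's canonical text over a pronoun mention so the chunk
// stays self-contained. Sketch only: resolve_for_rag's real signature
// differs, and the span here is a byte range for simplicity (anno itself
// reports character offsets).
fn rewrite_mention(text: &str, span: (usize, usize), canonical: &str) -> String {
    format!("{}{}{}", &text[..span.0], canonical, &text[span.1..])
}

fn main() {
    let chunk = "She designed the ARM instruction set.";
    // span (0, 3) covers the pronoun "She"
    let rewritten = rewrite_mention(chunk, (0, 3), "Sophie Wilson");
    println!("{}", rewritten);
}
```

In the real pipeline this runs only after the pleonastic-"it" filter and overlap guard mentioned above, so non-referential pronouns are left alone.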
## Downstream

Filter and pipe JSON output:

```shell
# All person entities, text only
# Unique organizations, sorted
```

Batch a directory (parallel, cached):

```shell
# Stream stdin JSONL: {"id":"…","text":"…"} per line
```
### Knowledge Graph (RDF)

The following export commands require --features graph at install time (add graph to the features list in the cargo install command above).

anno export emits standard N-Triples or JSON-LD -- loadable into any RDF store (Oxigraph, Jena, Blazegraph, etc.). --base-uri sets the IRI namespace:

```shell
# Own namespace (recommended for private corpora)
# Align to DBpedia (widely used LOD namespace)
```

The default (urn:anno:) produces stable URNs suitable for local use without a registered domain. The output is standard W3C N-Triples, so any SPARQL-capable store can query it:

```shell
# Example: load into a SPARQL store and query
```

Semantic relation triples -- RelationCapable backends (tplinker, gliner2) produce typed (head, relation, tail) triples instead of co-occurrence edges. Use --format graph-ntriples (--features graph) to route through the internal graph substrate:

```text
# → <entity/person/0_Lynn_Conway_0> <rel/works_for> <entity/org/1_IBM_13> .
```
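The triple line above has the standard N-Triples shape (three angle-bracketed IRIs and a terminating dot); emitting one is nearly a one-liner. A sketch with illustrative helper names:

```rust
// Emit one N-Triples line for a (head, relation, tail) triple, matching the
// shape shown above. Helper names are ours; local names replace spaces with
// '_' as in the example IRIs.
fn localname(text: &str) -> String {
    text.replace(' ', "_")
}

fn ntriple(subject: &str, predicate: &str, object: &str) -> String {
    format!("<{}> <{}> <{}> .", subject, predicate, object)
}

fn main() {
    let line = ntriple(
        &format!("urn:anno:entity/person/{}", localname("Lynn Conway")),
        "urn:anno:rel/works_for",
        &format!("urn:anno:entity/org/{}", localname("IBM")),
    );
    println!("{}", line);
}
```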
### Property Graph CSV

Export node and edge tables as CSV for import into any property-graph database:

```shell
# Semantic edges (gliner2 or tplinker)
# Co-occurrence edges (any other backend)
```

Each export produces {stem}-nodes.csv + {stem}-edges.csv with columns:

- Nodes: id,entity_type,text,start,end,source
- Edges: from,to,rel_type,confidence
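A sketch of emitting the documented node-table layout from in-memory tuples (std only; real CSV output must also quote fields containing commas or newlines):

```rust
// Build the documented {stem}-nodes.csv layout from plain tuples of
// (id, entity_type, text, start, end, source). Sketch only: no quoting
// of fields that contain commas.
fn nodes_csv(nodes: &[(u32, &str, &str, usize, usize, &str)]) -> String {
    let mut out = String::from("id,entity_type,text,start,end,source\n");
    for (id, ty, text, start, end, source) in nodes {
        out.push_str(&format!("{},{},{},{},{},{}\n", id, ty, text, start, end, source));
    }
    out
}

fn main() {
    let csv = nodes_csv(&[(0, "PER", "Lynn Conway", 0, 11, "doc1.txt")]);
    print!("{}", csv);
}
```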
## Architecture

| Crate | Published | Purpose |
|---|---|---|
| anno (root) | Yes | Backends, Model trait, extraction pipeline, coref resolvers |
| anno-core | No (yanked) | Data model (Entity, Relation, Mention, CorefChain, Signal, Track); re-exported by anno |
| anno-graph | No | Graph/KG export adapters (N-Triples, JSON-LD, CSV) |
| anno-eval | No | Evaluation harnesses, dataset loaders, matrix sampling |
| anno-cli | No | CLI binary |
| anno-metrics | No | Shared evaluation primitives (CoNLL F1, encoders) |
anno-core exists so that crates needing only data types (Entity, Relation, etc.) don't pull in ML dependencies. The root anno crate depends on anno-core and re-exports all its types at the crate root.
Pipeline: Text -> Extract -> Coalesce -> structured output. See ARCHITECTURE.md.
## Evaluation
anno-eval provides dataset loading, backend-vs-dataset compatibility gating, and CoNLL-style [2] span-level evaluation (precision, recall, F1) with label mapping between backend and dataset taxonomies.
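Span-level scoring is simple to state: a predicted span counts as correct only if start, end, and label all match a gold span. A std-only sketch of that scoring rule (our helper, not anno-eval's API):

```rust
// CoNLL-style exact-span scoring: a predicted (start, end, label) span is a
// true positive only if it appears verbatim in the gold set.
fn span_prf(
    gold: &[(usize, usize, &str)],
    pred: &[(usize, usize, &str)],
) -> (f64, f64, f64) {
    let tp = pred.iter().filter(|p| gold.contains(p)).count() as f64;
    let precision = if pred.is_empty() { 0.0 } else { tp / pred.len() as f64 };
    let recall = if gold.is_empty() { 0.0 } else { tp / gold.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}

fn main() {
    let gold = [(0, 11, "PER"), (22, 25, "ORG")];
    let pred = [(0, 11, "PER"), (22, 25, "LOC")]; // right span, wrong label
    println!("{:?}", span_prf(&gold, &pred));
}
```

Label mapping between backend and dataset taxonomies (as described above) happens before this comparison, so e.g. a backend's "person" can be scored against a dataset's PER.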
anno sampler (--features eval) provides a randomized matrix sampler with two modes: triage (worst-first) and measure (ML-only stable measurement).
### Sampler examples

```shell
# Deterministic offline decision-loop smoke example (also run in CI).
# Run the CI sampler harness locally (uses ~/.anno_cache for history/cache).
```
## Scope
Inference-time extraction. Training pipelines are out of scope -- use upstream frameworks (Hugging Face Transformers, Flair, etc.) and export ONNX weights for consumption. (The repo contains some experimental training code for box embeddings; this is research-only and not part of the public API.)
## Troubleshooting
ONNX Runtime linking errors. The onnx feature pulls in ort, which needs the ONNX Runtime C++ library. If linking fails, use default-features = false for a minimal build without native deps. If you need ONNX backends, ensure ort can download the prebuilt runtime -- check proxy/firewall settings and the ORT_DYLIB_PATH env var.
Model download failures. First use of ML backends downloads weights from HuggingFace to ~/.cache/huggingface/hub/. Behind firewalls, pre-fetch models on a connected machine, then set HF_HUB_OFFLINE=1 (or ANNO_NO_DOWNLOADS=1) to force cached-only mode.
FCoref export script not found. The export script (scripts/export_fcoref.py) is only present in the git repository, not in the crates.io package. Clone the repo to access it: uv run scripts/export_fcoref.py.
"Feature not available" errors. Most ML backends are feature-gated behind onnx or candle. If a backend constructor returns an error about a missing feature, check which features are enabled. The feature flags table above lists what each feature unlocks.
Character offset mismatches. All spans use character offsets (Unicode scalar values), not byte offsets. If offsets look wrong, ensure you are indexing by chars() count, not by byte position. See CONTRACT.md.
## Documentation
- QUICKSTART
- CONTRACT — offset semantics, scope, feature gating
- BACKENDS — backend selection, architecture, feature flags
- ARCHITECTURE — crate layout, dependency flow
- REFERENCES — full bibliography (NER, coref, relation extraction, software)
- API docs
- Changelog
## References
[1] Grishman & Sundheim, COLING 1996 (MUC-6 NER). [2] Tjong Kim Sang & De Meulder, CoNLL 2003. [3] Lee et al., EMNLP 2017 (end-to-end coref). [4] Jurafsky & Martin, SLP3 2024. [5] Zaratiana et al., NAACL 2024 (GLiNER). [6] Bogdanov et al., 2024 (NuNER). [7] Li et al., AAAI 2022 (W2NER). [8] Devlin et al., NAACL 2019 (BERT). [9] Lafferty et al., ICML 2001 (CRF). [10] Wang et al., COLING 2020 (TPLinker). [11] Zaratiana et al., 2025 (GLiNER2). [12] Rabiner, Proc. IEEE 1989 (HMM).
Full list with links: docs/REFERENCES.md. Citeable via CITATION.cff.
## License
Dual-licensed under MIT or Apache-2.0.