# anno
[crates.io](https://crates.io/crates/anno) · [docs.rs](https://docs.rs/anno) · [CI](https://github.com/arclabs561/anno/actions/workflows/ci.yml)
Information extraction for unstructured text: named entity recognition (NER), coreference resolution, entity linking, and zero-shot entity types.
Dual-licensed under MIT or Apache-2.0. MSRV: 1.85.
## Library
The crate is published as **`anno`** on crates.io.
```toml
[dependencies]
anno = "0.3.7"
```
The default feature (`onnx`) pulls in `ort` (ONNX Runtime C++ bindings). For builds without a C++ runtime, use `default-features = false`.
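For example:

```toml
[dependencies]
anno = { version = "0.3.7", default-features = false }
```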
```rust
use anno::{Model, StackedNER};
let m = StackedNER::default();
let ents = m.extract_entities("Sophie Wilson designed the ARM processor.", None)?;
assert!(!ents.is_empty());
# Ok::<(), anno::Error>(())
```
`StackedNER::default()` selects the best available backend at runtime: GLiNER (if `onnx` enabled and model cached), falling back to heuristic + pattern extraction. Set `ANNO_NO_DOWNLOADS=1` or `HF_HUB_OFFLINE=1` to force cached-only behavior.
Zero-shot custom types via `DynamicLabels` (GLiNER, GLiNER2, NuNER):
```rust
use anno::{DynamicLabels, GLiNEROnnx};
let m = GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?;
let ents = m.extract_with_labels(
    "Aspirin treats headaches.",
    &["drug", "symptom"],
    None,
)?;
# Ok::<(), anno::Error>(())
```
### Feature flags
| Feature | Default | Enables |
|---|---|---|
| `onnx` | Yes | ONNX Runtime backends (GLiNER, NuNER, BERT, W2NER, FCoref) via `ort` |
| `candle` | No | Pure-Rust Candle backends (no C++ runtime needed) |
| `analysis` | No | Evaluation helpers, coref metrics, RAG rewriting |
| `eval` | No | Superset of `analysis`: adds network support for dataset fetching |
| `graph` | No | Keywords, summarization, salience (TextRank/LexRank/YAKE) |
| `schema` | No | JSON Schema generation for output types via `schemars` |
| `discourse` | No | Centering theory, abstract anaphora, shell nouns |
## Tasks
**Named entity recognition.** Given input text, identify spans `(start, end, type, confidence)` where each span denotes a named entity [1, 2]. Entity types follow standard taxonomies (PER, ORG, LOC, MISC for CoNLL-style [2]) or caller-defined labels for zero-shot extraction. Offsets are **character offsets** (Unicode scalar values), not byte offsets; see [CONTRACT.md](docs/CONTRACT.md).
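The character/byte distinction matters as soon as text leaves ASCII. A std-only sketch (independent of the `anno` API) of how the same span numbers diverge:

```rust
// Extract a span given *character* offsets, as the crate's contract specifies.
// "café au lait": 'é' is one character but two bytes in UTF-8, so character
// and byte indices disagree for everything after it.
fn char_span(text: &str, start: usize, end: usize) -> String {
    text.chars().skip(start).take(end - start).collect()
}

fn main() {
    let text = "café au lait";
    // Character span (5, 7) selects "au"...
    assert_eq!(char_span(text, 5, 7), "au");
    // ...but byte slicing with the same numbers lands one position earlier.
    assert_eq!(&text[5..7], " a");
}
```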
**Coreference resolution.** Identify mention spans and group them into clusters, where each cluster tracks a consistent discourse referent within the document [3, 4]. Coreferring mentions span proper names, definite descriptions, and pronouns: "Sophie Wilson", "the designer", and "She" form one cluster. Multiple backends: rule-based sieves (`SimpleCorefResolver`), neural (`FCoref`, 78.5 F1 on CoNLL-2012), and mention-ranking (`MentionRankingCoref`).
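As a sketch of what "canonical mention" means in a cluster, a common heuristic is to prefer the longest non-pronoun mention. This is illustrative only; the crate's resolvers may use a different rule:

```rust
// Pick a canonical mention for a coreference cluster: longest mention
// that is not a pronoun (illustrative heuristic, not the crate's API).
fn canonical<'a>(mentions: &[&'a str]) -> &'a str {
    const PRONOUNS: [&str; 6] = ["he", "she", "it", "they", "him", "her"];
    mentions
        .iter()
        .filter(|m| !PRONOUNS.contains(&m.to_lowercase().as_str()))
        .max_by_key(|m| m.len())
        .copied()
        .unwrap_or(mentions[0])
}

fn main() {
    let cluster = ["Sophie Wilson", "the designer", "She"];
    assert_eq!(canonical(&cluster), "Sophie Wilson");
}
```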
**Relation extraction.** Extract typed `(head, relation, tail)` triples from text. Available on `RelationCapable` backends (`gliner2`, `tplinker`); others fall back to co-occurrence edges for graph export.
**Structured pattern extraction.** Dates, monetary amounts, email addresses, URLs, phone numbers via deterministic regex grammars.
## Backends
All backends produce the same output type: variable-length spans with character offsets.
| Backend | Architecture | Types | Zero-shot | Relations | Default model | Ref |
|---|---|---|---|---|---|---|
| `stacked` (default) | Selector/fallback | GLiNER > BERT > heuristic+pattern | -- | -- | HuggingFace (when ML enabled) | -- |
| `gliner` | Bi-encoder span classifier | Custom | Yes | -- | [gliner_small-v2.1](https://huggingface.co/onnx-community/gliner_small-v2.1) | Zaratiana et al. [5] |
| `gliner2` | Multi-task span classifier | Custom | Yes | Heuristic | [gliner-multitask-large-v0.5](https://huggingface.co/onnx-community/gliner-multitask-large-v0.5) | [11] |
| `nuner` | Token classifier (BIO) | Custom | Yes | -- | [NuNerZero_onnx](https://huggingface.co/deepanwa/NuNerZero_onnx) | Bogdanov et al. [6] |
| `bert-onnx` | Sequence labeling (BERT) | PER/ORG/LOC/MISC | No | -- | [bert-base-NER-onnx](https://huggingface.co/protectai/bert-base-NER-onnx) | Devlin et al. [8] |
| `pattern` | Regex grammars | DATE/MONEY/EMAIL/URL/PHONE/PERCENT | N/A | -- | None | -- |
| `tplinker` | Joint entity-relation (heuristic) | Custom | -- | Heuristic | None | [10] |
ML backends are feature-gated (`onnx` or `candle`). Weights download from HuggingFace on first use. See [BACKENDS.md](docs/BACKENDS.md) for the full list (including experimental backends) and feature-flag details.
## CLI
```sh
cargo install --git https://github.com/arclabs561/anno --package anno-cli --bin anno --features "onnx eval"
```
### Examples
```sh
anno extract --text "Lynn Conway worked at IBM and Xerox PARC in California."
```
```text
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"
```
JSON output (`--format json`):
```sh
anno extract --model pattern --format json --text "Contact jobs@acme.com by March 15 for the \$50K role."
```
Zero-shot custom entity types (via GLiNER [5]):
```sh
anno extract --model gliner --extract-types "DRUG,SYMPTOM" \
--text "Aspirin can treat headaches and reduce fever."
```
```text
drug:1 "Aspirin" symptom:2 "headaches" "fever"
```
Coreference:
```sh
anno debug --coref -t "Sophie Wilson designed the ARM processor. She revolutionized mobile computing."
```
```text
Coreference: "Sophie Wilson" → "She"
```
## Coreference
Three coref backends with different tradeoffs:
| Backend | Approach | Accuracy | Speed | Extra setup |
|---|---|---|---|---|
| `SimpleCorefResolver` | Rule-based (6 sieves) | Low | Fast | None |
| `FCoref` | Neural (DistilRoBERTa) | 78.5 F1 | Medium | ONNX export |
| `MentionRankingCoref` | Mention-ranking | Medium | Medium | None |
**FCoref** (Otmazgin et al., AACL 2022) requires a one-time model export. The export script is in the repo at `scripts/export_fcoref.py` (not shipped in the crates.io package -- clone the repo to run it):
```sh
# From a repo clone:
uv run scripts/export_fcoref.py
```
```rust
use anno::FCoref;

let coref = FCoref::from_path("fcoref_onnx")?;
let clusters = coref.resolve("Marie Curie was born in Warsaw. She discovered radium.")?;
// clusters[0] = { mentions: ["Marie Curie", "She"], canonical: "Marie Curie" }
```
**RAG preprocessing** (`rag::resolve_for_rag()`, requires `analysis` feature): rewrites pronouns with their antecedents so document chunks remain self-contained after splitting. Supports English, French, Spanish, German. Includes pleonastic "it" filter and overlap guard.
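Illustratively, the rewrite substitutes a pronoun with its resolved antecedent so a chunk stands alone. This std-only sketch shows the effect only; the real `rag::resolve_for_rag()` additionally handles agreement, the pleonastic-"it" filter, and the overlap guard, and its signature may differ:

```rust
// Rewrite the first occurrence of a pronoun with its antecedent
// (effect-level sketch of RAG preprocessing, not the crate's API).
fn rewrite_pronoun(sentence: &str, pronoun: &str, antecedent: &str) -> String {
    sentence.replacen(pronoun, antecedent, 1)
}

fn main() {
    // A chunk split after the first sentence would otherwise lose "She".
    let chunk = "She revolutionized mobile computing.";
    let resolved = rewrite_pronoun(chunk, "She", "Sophie Wilson");
    assert_eq!(resolved, "Sophie Wilson revolutionized mobile computing.");
}
```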
## Downstream
Filter and pipe JSON output:
```sh
# All person entities, text only
anno extract --format json --file article.txt \
| jq '[.[] | select(.entity_type == "PER") | .text]'
# Unique organizations, sorted
anno extract --format json --file article.txt \
| jq '[.[] | select(.entity_type == "ORG") | .text] | unique | sort[]'
```
Batch a directory (parallel, cached):
```sh
anno batch --dir docs/ --parallel 4 --cache --output results/
# Stream stdin JSONL: {"id":"…","text":"…"} per line
```
### Knowledge Graph (RDF)
> The following export commands require `--features graph` at install time (add `graph` to the features list in the `cargo install` command above).
`anno export` emits standard **N-Triples** or **JSON-LD** -- loadable into any RDF store (Oxigraph, Jena, Blazegraph, etc.). `--base-uri` sets the IRI namespace:
```sh
# Own namespace (recommended for private corpora)
anno export --input docs/ --output /tmp/kg/ --format ntriples \
--base-uri https://myproject.example.com/kg/
# Align to DBpedia (widely used LOD namespace)
anno export --input docs/ --output /tmp/kg/ --format ntriples \
--base-uri https://dbpedia.org/resource/
```
Default (`urn:anno:`) produces stable URNs suitable for local use without a registered domain. The output is standard W3C N-Triples, so any SPARQL-capable store can query it:
```sh
# Example: load into a SPARQL store and query
cat /tmp/kg/*.nt | your-rdf-store load --format ntriples
your-rdf-store query 'SELECT ?label WHERE {
?e a <https://myproject.example.com/kg/vocab#PERType> ;
<http://www.w3.org/2000/01/rdf-schema#label> ?label .
}'
```
**Semantic relation triples** — `RelationCapable` backends (`tplinker`, `gliner2`) produce typed `(head, relation, tail)` triples instead of co-occurrence edges. Use `--format graph-ntriples` (`--features graph`) for routing through the internal graph substrate:
```sh
anno export --input docs/ --output /tmp/kg/ \
--format graph-ntriples --model gliner2 \
--base-uri https://myproject.example.com/kg/
# → <entity/person/0_Lynn_Conway_0> <rel/works_for> <entity/org/1_IBM_13> .
```
### Property Graph CSV
Export node and edge tables as CSV for import into any property-graph database:
```sh
# Semantic edges (gliner2 or tplinker)
anno export --input docs/ --output /tmp/kg/ --format kuzu --model gliner2
# Co-occurrence edges (any other backend)
anno export --input docs/ --output /tmp/kg/ --format kuzu
```
Each file produces `{stem}-nodes.csv` + `{stem}-edges.csv` with columns:
- **Nodes**: `id`, `entity_type`, `text`, `start`, `end`, `source`
- **Edges**: `from`, `to`, `rel_type`, `confidence`
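For orientation, rows in the two tables might look like this (values are illustrative, not actual tool output):

```csv
id,entity_type,text,start,end,source
0,PER,Lynn Conway,0,11,article.txt
1,ORG,IBM,22,25,article.txt
```

```csv
from,to,rel_type,confidence
0,1,works_for,0.87
```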
## Architecture
| Crate | Published | Contents |
|---|---|---|
| `anno` (root) | Yes | Backends, `Model` trait, extraction pipeline, coref resolvers. |
| `anno-core` | No (yanked) | Data model (`Entity`, `Relation`, `Mention`, `CorefChain`, `Signal`, `Track`). Re-exported by `anno`. |
| `anno-graph` | No | Graph/KG export adapters (N-Triples, JSON-LD, CSV) |
| `anno-eval` | No | Evaluation harnesses, dataset loaders, matrix sampling |
| `anno-cli` | No | CLI binary |
| `anno-metrics` | No | Shared evaluation primitives (CoNLL F1, encoders) |
`anno-core` exists so that crates needing only data types (`Entity`, `Relation`, etc.) don't pull in ML dependencies. The root crate `anno` depends on `anno-core` and re-exports all of its types at the crate root.
Pipeline: Text -> Extract -> Coalesce -> structured output. See [ARCHITECTURE.md](docs/ARCHITECTURE.md).
## Evaluation
`anno-eval` provides dataset loading, backend-vs-dataset compatibility gating, and CoNLL-style [2] span-level evaluation (precision, recall, F1) with label mapping between backend and dataset taxonomies.
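Span-level scoring reduces to exact-match set comparison over `(start, end, type)` triples. A std-only sketch of the metric (illustrative, not the `anno-eval` API):

```rust
use std::collections::HashSet;

// CoNLL-style span-level precision/recall/F1: a prediction counts as a
// true positive only if start, end, and type all match a gold span exactly.
fn prf(
    gold: &HashSet<(usize, usize, &str)>,
    pred: &HashSet<(usize, usize, &str)>,
) -> (f64, f64, f64) {
    let tp = gold.intersection(pred).count() as f64;
    let p = if pred.is_empty() { 0.0 } else { tp / pred.len() as f64 };
    let r = if gold.is_empty() { 0.0 } else { tp / gold.len() as f64 };
    let f = if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) };
    (p, r, f)
}

fn main() {
    let gold = HashSet::from([(0usize, 11usize, "PER"), (22, 25, "ORG")]);
    let pred = HashSet::from([(0usize, 11usize, "PER"), (30, 40, "LOC")]);
    let (p, r, f) = prf(&gold, &pred);
    // One of two predictions is correct; one of two gold spans is found.
    assert_eq!((p, r), (0.5, 0.5));
    assert!((f - 0.5).abs() < 1e-12);
}
```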
```sh
anno benchmark --model gliner --dataset conll2003
```
`anno sampler` (`--features eval`) provides a randomized matrix sampler with two modes: **triage** (worst-first) and **measure** (ML-only stable measurement).
### Sampler examples
```sh
# Deterministic offline decision-loop smoke example (also run in CI).
cargo run -p anno-eval --example muxer_decision_loop --features eval
# Run the CI sampler harness locally (uses ~/.anno_cache for history/cache).
just ci-matrix-local 42 ner
```
## Scope
Inference-time extraction. Training pipelines are out of scope -- use upstream frameworks (Hugging Face Transformers, Flair, etc.) and export ONNX weights for consumption. (The repo contains some experimental training code for box embeddings; this is research-only and not part of the public API.)
## Troubleshooting
**ONNX Runtime linking errors.** The `onnx` feature pulls in `ort`, which needs the ONNX Runtime C++ library. If linking fails, use `default-features = false` for a minimal build without native deps. If you need ONNX backends, ensure `ort` can download the prebuilt runtime -- check proxy/firewall settings and the `ORT_DYLIB_PATH` env var.
**Model download failures.** First use of ML backends downloads weights from HuggingFace to `~/.cache/huggingface/hub/`. Behind firewalls, pre-fetch models on a connected machine, then set `HF_HUB_OFFLINE=1` (or `ANNO_NO_DOWNLOADS=1`) to force cached-only mode.
**FCoref export script not found.** The export script (`scripts/export_fcoref.py`) is only present in the git repository, not in the crates.io package. Clone the repo to access it: `uv run scripts/export_fcoref.py`.
**"Feature not available" errors.** Most ML backends are feature-gated behind `onnx` or `candle`. If a backend constructor returns an error about a missing feature, check which features are enabled. The feature flags table above lists what each feature unlocks.
**Character offset mismatches.** All spans use character offsets (Unicode scalar values), not byte offsets. If offsets look wrong, ensure you are indexing by `chars()` count, not by byte position. See [CONTRACT.md](docs/CONTRACT.md).
## Documentation
- [QUICKSTART](docs/QUICKSTART.md)
- [CONTRACT](docs/CONTRACT.md) — offset semantics, scope, feature gating
- [BACKENDS](docs/BACKENDS.md) — backend selection, architecture, feature flags
- [ARCHITECTURE](docs/ARCHITECTURE.md) — crate layout, dependency flow
- [REFERENCES](docs/REFERENCES.md) — full bibliography (NER, coref, relation extraction, software)
- [API docs](https://docs.rs/anno)
- [Changelog](CHANGELOG.md)
## References
[1] Grishman & Sundheim, *COLING* 1996 (MUC-6 NER).
[2] Tjong Kim Sang & De Meulder, *CoNLL* 2003.
[3] Lee et al., *EMNLP* 2017 (end-to-end coref).
[4] Jurafsky & Martin, *SLP3* 2024.
[5] Zaratiana et al., *NAACL* 2024 (GLiNER).
[6] Bogdanov et al., 2024 (NuNER).
[7] Li et al., *AAAI* 2022 (W2NER).
[8] Devlin et al., *NAACL* 2019 (BERT).
[9] Lafferty et al., *ICML* 2001 (CRF).
[10] Wang et al., *COLING* 2020 (TPLinker).
[11] Zaratiana et al., 2025 (GLiNER2).
[12] Rabiner, *Proc. IEEE* 1989 (HMM).
Full list with links: [docs/REFERENCES.md](docs/REFERENCES.md). Citeable via [CITATION.cff](CITATION.cff).
## License
Dual-licensed under MIT or Apache-2.0.