
anno


Information extraction for unstructured text: named entity recognition (NER), within-document coreference resolution, and structured pattern extraction.

Dual-licensed under MIT or Apache-2.0.

Tasks

Named entity recognition. Given input text, identify spans (start, end, type, confidence) where each span denotes a named entity [1, 2]. Entity types follow standard taxonomies (PER, ORG, LOC, MISC for CoNLL-style [2]) or caller-defined labels for zero-shot extraction. Offsets are character offsets (Unicode scalar values), not byte offsets; see CONTRACT.md.
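The character-offset contract matters as soon as text leaves ASCII. A std-only sketch (not the crate's API) of what a `(start, end)` span in Unicode scalar values implies:

```rust
// Character offsets vs. byte offsets: multi-byte UTF-8 characters make the
// two diverge. A span (start, end) in char offsets is sliced by counting
// scalar values, never by indexing bytes.
fn slice_by_chars(text: &str, start: usize, end: usize) -> String {
    text.chars().skip(start).take(end - start).collect()
}

fn main() {
    let text = "Zoë met Renée in Köln.";
    // 22 chars, but 25 bytes: ë, é, ö are two bytes each in UTF-8.
    assert_eq!(text.chars().count(), 22);
    assert_eq!(text.len(), 25);
    // "Köln" as a char-offset span:
    assert_eq!(slice_by_chars(text, 17, 21), "Köln");
}
```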

Coreference resolution. Identify mention spans and group them into clusters, where each cluster tracks a consistent discourse referent within the document [3, 4]. Referents may be concrete or abstract, singular or plural, real-world or fictional — the constraint is within-document consistency, not ontological uniqueness. Coreferring mentions span proper names, definite descriptions, and pronouns: "Sophie Wilson", "the designer", and "She" form one cluster.
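The cluster-of-mentions shape can be illustrated with a minimal sketch (the struct and function names here are illustrative, not the crate's data model):

```rust
// A coreference cluster is a set of mention spans (char offsets) that share
// one discourse referent within the document.
#[derive(Debug)]
struct Mention { start: usize, end: usize }

// Resolve each mention span back to its surface text.
fn mention_texts(text: &str, cluster: &[Mention]) -> Vec<String> {
    cluster.iter()
        .map(|m| text.chars().skip(m.start).take(m.end - m.start).collect())
        .collect()
}

fn main() {
    let text = "Sophie Wilson designed the ARM processor. She revolutionized mobile computing.";
    // One cluster: the proper name and the pronoun corefer.
    let cluster = [Mention { start: 0, end: 13 }, Mention { start: 42, end: 45 }];
    assert_eq!(mention_texts(text, &cluster), vec!["Sophie Wilson", "She"]);
}
```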

Relation extraction. Extract typed (head, relation, tail) triples from text. Available on RelationCapable backends; others fall back to co-occurrence edges for graph export.

Structured pattern extraction. Dates, monetary amounts, email addresses, URLs, phone numbers via deterministic regex grammars.
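The real backend uses compiled regex grammars; this std-only toy scanner for one email-shaped token shows the deterministic, span-based output contract under that assumption:

```rust
// Toy deterministic extractor: find the first email-shaped token and return
// its (start, end) char offsets plus surface text. Not the crate's grammar,
// just the shape of its output.
fn find_email(text: &str) -> Option<(usize, usize, String)> {
    let chars: Vec<char> = text.chars().collect();
    let at = chars.iter().position(|&c| c == '@')?;
    let is_part =
        |c: char| c.is_ascii_alphanumeric() || matches!(c, '.' | '-' | '_' | '+');
    let mut start = at;
    while start > 0 && is_part(chars[start - 1]) { start -= 1; }
    let mut end = at + 1;
    while end < chars.len() && is_part(chars[end]) { end += 1; }
    // Trim trailing dots so sentence punctuation is not captured.
    while end > at + 1 && chars[end - 1] == '.' { end -= 1; }
    if start == at || end == at + 1 { return None; }
    Some((start, end, chars[start..end].iter().collect()))
}

fn main() {
    let (s, e, m) = find_email("Contact jobs@acme.com by March 15.").unwrap();
    // Same offsets as the documented JSON example for this text.
    assert_eq!((s, e, m.as_str()), (8, 21, "jobs@acme.com"));
    assert_eq!(find_email("no address here"), None);
}
```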

Backends

All backends produce the same output type: variable-length spans with character offsets.

| Backend | Architecture | Labels | Zero-shot | Relations | Weights | Reference |
|---|---|---|---|---|---|---|
| stacked (default) | Selector/fallback | Best available | -- | -- | HuggingFace (when ML enabled) | -- |
| gliner | Bi-encoder span classifier | Custom | Yes | -- | gliner_small-v2.1 | Zaratiana et al. [5] |
| gliner2 | Multi-task span classifier | Custom | Yes | Heuristic | gliner-multitask-large-v0.5 | [11] |
| nuner | Token classifier (BIO) | Custom | Yes | -- | NuNerZero_onnx | Bogdanov et al. [6] |
| w2ner | Word-word relation grids | Trained (nested) | No | -- | w2ner-bert-base | Li et al. [7] |
| bert-onnx | Sequence labeling (BERT) | PER/ORG/LOC/MISC | No | -- | bert-base-NER-onnx | Devlin et al. [8] |
| tplinker | Joint entity-relation (heuristic) | Custom | -- | Heuristic | None | [10] |
| crf | Conditional Random Field | Trained | No | -- | Bundled (bundled-crf-weights) | Lafferty et al. [9] |
| hmm | Hidden Markov Model | Trained | No | -- | Bundled (bundled-hmm-params) | Rabiner [12] |
| pattern | Regex grammars | DATE/MONEY/EMAIL/URL/PHONE | N/A | -- | None | -- |
| heuristic | Capitalization + context | PER/ORG/LOC | N/A | -- | None | -- |
| ensemble | Weighted voting combiner | Mixed | Varies | -- | Varies | -- |
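The ensemble row combines per-backend candidates by weighted voting. A minimal sketch of that idea (illustrative only, not the crate's actual combiner):

```rust
// Hypothetical weighted vote: several backends label the same span; sum the
// confidence mass per label and pick the winner, normalizing to a score.
use std::collections::HashMap;

fn vote(candidates: &[(&str, f64)]) -> Option<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for (label, weight) in candidates {
        *scores.entry(label).or_insert(0.0) += weight;
    }
    let total: f64 = scores.values().sum();
    scores
        .into_iter()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(label, mass)| (label.to_string(), mass / total))
}

fn main() {
    // Two backends say PER (0.9 + 0.6), one says ORG (0.8): PER wins.
    let (label, conf) = vote(&[("PER", 0.9), ("ORG", 0.8), ("PER", 0.6)]).unwrap();
    assert_eq!(label, "PER");
    assert!(conf > 0.6);
}
```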

ML backends are feature-gated (onnx or candle). Weights download from HuggingFace on first use. All backends expose model.capabilities() for runtime discovery. See BACKENDS.md for selection guidance and feature-flag details.

Install

cargo install --git https://github.com/arclabs561/anno --package anno-cli --bin anno --features "onnx eval"

From a local clone:

cargo install --path crates/anno-cli --bin anno --features "onnx eval"

ANNO_NO_DOWNLOADS=1 or HF_HUB_OFFLINE=1 forces cached-only behavior.

Examples

anno extract --text "Lynn Conway worked at IBM and Xerox PARC in California."
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"

JSON output (schema-stable; uses pattern for offline reproducibility):

anno extract --model pattern --format json --text "Contact jobs@acme.com by March 15 for the \$50K role."
[
  {"text": "jobs@acme.com", "entity_type": "EMAIL", "start": 8, "end": 21, "confidence": 0.98},
  {"text": "March 15", "entity_type": "DATE", "start": 25, "end": 33, "confidence": 0.95},
  {"text": "$50K", "entity_type": "MONEY", "start": 42, "end": 46, "confidence": 0.95}
]

Zero-shot custom entity types (via GLiNER [5]):

anno extract --model gliner --extract-types "DRUG,SYMPTOM" \
  --text "Aspirin can treat headaches and reduce fever."
drug:1 "Aspirin"
symptom:2 "headaches" "fever"

Coreference:

anno debug --coref -t "Sophie Wilson designed the ARM processor. She revolutionized mobile computing."
Coreference: "Sophie Wilson" → "She"

Downstream

Filter and pipe JSON output:

# All person entities, text only
anno extract --format json --file article.txt \
  | jq '[.[] | select(.entity_type == "PER") | .text]'

# Unique organizations, sorted
anno extract --format json --file article.txt \
  | jq '[.[] | select(.entity_type == "ORG") | .text] | unique | sort[]'

Batch a directory (parallel, cached):

anno batch --dir docs/ --parallel 4 --cache --output results/

# Stream stdin JSONL: {"id":"…","text":"…"} per line
cat corpus.jsonl | anno batch --stdin --parallel 4 --cache --output results/

Knowledge Graph (RDF / Oxigraph)

anno export emits N-Triples or JSON-LD. --base-uri sets the IRI namespace — use a URI you own or one that aligns to a known linked-data vocabulary:

# Own namespace (recommended for private corpora)
anno export --input docs/ --output /tmp/kg/ --format ntriples \
  --base-uri https://myproject.example.com/kg/

# Align to DBpedia (widely used LOD namespace)
anno export --input docs/ --output /tmp/kg/ --format ntriples \
  --base-uri https://dbpedia.org/resource/

Default (urn:anno:) produces stable URNs suitable for local use without a registered domain.

Load into Oxigraph (pure-Rust RDF store, SPARQL 1.1):

cat /tmp/kg/*.nt | oxigraph load --location /tmp/anno-store --format ntriples

oxigraph query --location /tmp/anno-store \
  --query 'SELECT ?label WHERE {
    ?e a <https://myproject.example.com/kg/vocab#PERType> ;
       <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  }'

Semantic relation triples

RelationCapable backends (tplinker, gliner2) produce typed (head, relation, tail) triples instead of co-occurrence edges. Use --format graph-ntriples (--features graph) to route through the internal graph substrate:

anno export --input docs/ --output /tmp/kg/ \
  --format graph-ntriples --model gliner2 \
  --base-uri https://myproject.example.com/kg/
# → <entity/person/0_Lynn_Conway_0> <rel/works_for> <entity/org/1_IBM_13> .
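Serializing a triple is a string-formatting step; a minimal sketch under the documented base-URI scheme (the `ntriple` helper and identifier layout are illustrative, not the exporter's code):

```rust
// Emit one N-Triples line for an extracted relation: the base URI is
// prepended to entity and relation identifiers, and the line ends in " .".
fn ntriple(base: &str, head: &str, rel: &str, tail: &str) -> String {
    format!("<{base}{head}> <{base}{rel}> <{base}{tail}> .")
}

fn main() {
    let line = ntriple(
        "https://myproject.example.com/kg/",
        "entity/person/0_Lynn_Conway_0",
        "rel/works_for",
        "entity/org/1_IBM_13",
    );
    assert!(line.starts_with("<https://myproject.example.com/kg/entity/person/"));
    assert!(line.ends_with(" ."));
}
```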

Property Graph (Kuzu)

Export node and edge tables for Kuzu:

# Semantic edges (gliner2 or tplinker)
anno export --input docs/ --output /tmp/kg/ --format kuzu --model gliner2

# Co-occurrence edges (any other backend)
anno export --input docs/ --output /tmp/kg/ --format kuzu

Each file produces {stem}-nodes.csv + {stem}-edges.csv. Schema:

CREATE NODE TABLE Entity(
  id STRING, entity_type STRING, text STRING,
  start INT64, end INT64, source STRING,
  PRIMARY KEY(id)
);
CREATE REL TABLE Relation(FROM Entity TO Entity, rel_type STRING, confidence DOUBLE);

COPY Entity   FROM '/tmp/kg/doc-nodes.csv' (HEADER=TRUE);
COPY Relation FROM '/tmp/kg/doc-edges.csv' (HEADER=TRUE);

Library

[dependencies]
anno-lib = "0.3"

The crate is published as anno-lib on crates.io; the Rust import name is anno.

Feature flags

| Feature | Default | Description |
|---------|---------|-------------|
| onnx | Yes | ONNX Runtime backends (GLiNER, NuNER, BERT, W2NER) via ort |
| candle | No | Pure-Rust Candle backends (no C++ runtime needed) |
| eval | No | Evaluation harnesses, dataset loaders, and matrix sampling |
| graph | No | Knowledge-graph export adapters (N-Triples, JSON-LD, Kuzu) |
| schema | No | JSON Schema generation for output types via schemars |

The onnx feature (default) pulls in ort (ONNX Runtime bindings), which requires a C++ runtime. For minimal builds, use default-features = false.
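A minimal-build manifest then looks like this (the candle feature shown is one option from the table above; pick whichever backends you need):

```toml
[dependencies]
anno-lib = { version = "0.3", default-features = false, features = ["candle"] }
```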

Basic extraction:

use anno::{Model, StackedNER};

let m = StackedNER::default();
let ents = m.extract_entities("Sophie Wilson designed the ARM processor.", None)?;
assert!(!ents.is_empty());
# Ok::<(), anno::Error>(())

Zero-shot custom types via DynamicLabels (GLiNER, GLiNER2, NuNER):

use anno::{DynamicLabels, GLiNEROnnx};

let m = GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?;
let ents = m.extract_with_labels(
    "Aspirin treats headaches.",
    &["drug", "symptom"],
    None,
)?;
# Ok::<(), anno::Error>(())

Runtime capability discovery (every backend reports its capabilities):

use anno::Model;

fn print_caps(m: &dyn Model) {
    let c = m.capabilities();
    println!(
        "batch={} streaming={} gpu={} relations={} zero-shot={}",
        c.batch_capable, c.streaming_capable, c.gpu_capable,
        c.relation_capable, c.dynamic_labels,
    );
}

Architecture

| Crate | Purpose |
|-------|---------|
| anno (root) | Published facade; re-exports anno-lib |
| anno-lib | Backends, Model / RelationCapable traits, extraction pipeline |
| anno-core | Stable data model (Entity, Relation, Signal, Track, Identity, Corpus) |
| anno-graph | Graph/KG export adapters: converts extraction output to lattix::KnowledgeGraph; owns triple construction and --format graph-ntriples |
| anno-eval | Evaluation harnesses, dataset loaders, matrix sampling (uses muxer) |
| anno-cli | Full CLI; graph feature routes through anno-graph |
| anno-metrics | Shared evaluation primitives (CoNLL F1, encoders) |

Pipeline: Text → Extract → Coalesce → structured output. See ARCHITECTURE.md.
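One plausible coalesce policy, sketched std-only (an assumption for illustration; the real pipeline's merge rules live in the crate): when extracted spans overlap, keep the higher-confidence one.

```rust
// Coalesce sketch: greedily keep spans in descending confidence order,
// dropping any span that overlaps an already-kept span.
#[derive(Clone, Debug)]
struct Span { start: usize, end: usize, conf: f64 }

fn coalesce(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by(|a, b| b.conf.partial_cmp(&a.conf).unwrap());
    let mut kept: Vec<Span> = Vec::new();
    for s in spans {
        // Keep s only if it is disjoint from every span kept so far.
        if kept.iter().all(|k| s.end <= k.start || s.start >= k.end) {
            kept.push(s);
        }
    }
    kept.sort_by_key(|s| s.start);
    kept
}

fn main() {
    let out = coalesce(vec![
        Span { start: 0, end: 13, conf: 0.9 },  // "Sophie Wilson"
        Span { start: 7, end: 13, conf: 0.4 },  // overlapping "Wilson": dropped
        Span { start: 27, end: 30, conf: 0.8 }, // "ARM": disjoint, kept
    ]);
    assert_eq!(out.len(), 2);
    assert_eq!((out[0].start, out[0].end), (0, 13));
}
```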

Evaluation

anno-eval provides dataset loading, backend-vs-dataset compatibility gating, and CoNLL-style [2] span-level evaluation (precision, recall, F1) with label mapping between backend and dataset taxonomies.

anno benchmark --model gliner --dataset conll2003
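Span-level scoring as described above reduces to set intersection over exact (start, end, type) matches; a sketch of that arithmetic (the harness additionally handles label mapping, which this omits):

```rust
// CoNLL-style span-level precision, recall, and F1: a prediction counts as a
// true positive only if start, end, AND type all match a gold span exactly.
use std::collections::HashSet;

type Span = (usize, usize, &'static str);

fn prf1(gold: &[Span], pred: &[Span]) -> (f64, f64, f64) {
    let g: HashSet<_> = gold.iter().collect();
    let p: HashSet<_> = pred.iter().collect();
    let tp = g.intersection(&p).count() as f64;
    let precision = if pred.is_empty() { 0.0 } else { tp / pred.len() as f64 };
    let recall = if gold.is_empty() { 0.0 } else { tp / gold.len() as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}

fn main() {
    let gold = [(0, 13, "PER"), (27, 30, "ORG")];
    let pred = [(0, 13, "PER"), (27, 30, "LOC")]; // type mismatch on the second
    let (p, r, f1) = prf1(&gold, &pred);
    assert_eq!((p, r), (0.5, 0.5));
    assert!((f1 - 0.5).abs() < 1e-9);
}
```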

anno sampler (--features eval) provides a randomized matrix sampler with two modes: triage (worst-first) and measure (ML-only stable measurement).

Sampler examples

# Deterministic offline decision-loop smoke example (also run in CI).
cargo run -p anno-eval --example muxer_decision_loop --features eval

# Run the CI sampler harness locally (uses ~/.anno_cache for history/cache).
just ci-matrix-local 42 ner

Scope

Inference-time extraction only. Training is out of scope — use upstream frameworks (Hugging Face Transformers, Flair, etc.) and export ONNX weights for consumption.


References

Key citations; see docs/REFERENCES.md for the full list with links.

  1. Grishman & Sundheim, COLING 1996 (MUC-6 NER). [PDF]
  2. Tjong Kim Sang & De Meulder, CoNLL 2003 (CoNLL benchmark). [PDF]
  3. Lee et al., EMNLP 2017 (end-to-end coreference). [arXiv]
  4. Jurafsky & Martin, SLP3 2024 (coreference fundamentals). [Online]
  5. Zaratiana et al., NAACL 2024 (GLiNER). [arXiv]
  6. Bogdanov et al., 2024 (NuNER). [arXiv]
  7. Li et al., AAAI 2022 (W2NER). [arXiv]
  8. Devlin et al., NAACL 2019 (BERT). [arXiv]
  9. Lafferty et al., ICML 2001 (CRF).
  10. Wang et al., COLING 2020 (TPLinker). [arXiv]
  11. Zaratiana et al., 2025 (GLiNER2). [arXiv]
  12. Rabiner, Proceedings of the IEEE 1989 (HMM tutorial). [PDF]

Citeable via CITATION.cff.

License

Dual-licensed under MIT or Apache-2.0.