anno-lib 0.3.0

Information extraction for Rust: Named Entity Recognition (NER), coreference resolution, and structured extraction
Documentation
anno-lib-0.3.0 has been yanked.

anno

Information extraction: named entity recognition and coreference.

Dual-licensed under MIT or Apache-2.0.

API docs (may lag behind main while crates.io publishing is paused): docs.rs/anno

Install

This repo is not using crates.io as its primary distribution path right now; see docs/PUBLISH_STATUS.md.

Full CLI (crates/anno-cli)

This is the recommended anno binary (many commands: extract, debug, benchmark, models, etc.):

# From this repo:
cargo install --path crates/anno-cli --bin anno --features "onnx eval"

Or without cloning manually:

cargo install --git https://github.com/arclabs561/anno --package anno-cli --bin anno --features "onnx eval"

Minimal facade CLI (package anno)

This binary is intentionally small and currently supports anno extract only:

# From this repo:
cargo install --path . --bin anno

Optional (enable ONNX-backed ML selection for stacked at build time):

cargo install --path . --bin anno --features onnx

What it does

  • Named entity recognition (NER): people, orgs, locations, etc.
  • Coreference: resolve mentions like "Sophie Wilson" → "She".
  • Structured pattern extraction (dates, money, emails).

Backends

Backend Label surface Structure Weights Notes
stacked (default) Mixed (best available) Flat spans HuggingFace (when ML enabled) Selector/fallback: chooses an ML backend when available, otherwise regex+heuristic
gliner Custom (zero-shot) Flat spans onnx-community/gliner_small-v2.1 Span classifier; --extract-types
gliner2 Custom (zero-shot) Flat spans onnx-community/gliner-multitask-large-v0.5 Multi-task (NER + classification)
nuner Custom (zero-shot) Flat spans deepanwa/NuNerZero_onnx Token classifier (BIO), arbitrary-length entities
w2ner Fixed (trained labels) Nested/discont. ljynlp/w2ner-bert-base Word-word grids; supports nested spans
bert-onnx Fixed (PER/ORG/LOC/MISC) Flat spans protectai/bert-base-NER-onnx Classic CoNLL-style NER
pattern Fixed (patterns) Flat spans None Regex for dates, emails, money
heuristic Fixed (heuristics) Flat spans None Capitalization + context baseline
crf Fixed (trained labels) Flat spans Bundled (bundled-crf-weights) Classical baseline; can load custom weights
hmm Fixed (trained labels) Flat spans Bundled (bundled-hmm-params) Classical baseline/education
ensemble Mixed Flat spans Varies Parallel combiner: weighted voting across backends

Notes:

  • All NER backends return variable-length spans (start/end offsets). Some are token-labeling models internally.
  • Offsets are character offsets (Unicode scalar values), not byte offsets; see docs/CONTRACT.md.
  • ML backends are feature-gated behind onnx or candle. The published anno crate keeps defaults minimal; enable onnx explicitly when you want ML backends.
  • ML model weights download from HuggingFace on first use (see “Offline / downloads” below).
  • The table is the NER backend surface; for a fuller capability/provenance discussion see docs/BACKENDS.md.
  • Evaluation tooling adds additional notions (dataset/backend compatibility gates, label-shift “true zero-shot” accounting); those live in anno-eval, not in the runtime Model trait surface.

Offline / downloads

  • The facade CLI (package anno) does not enable ML backends by default.
  • If you enable ML backends (--features onnx / candle), weights may download on first use.
  • Force cached-only / offline behavior:
    • ANNO_NO_DOWNLOADS=1 (preferred), or
    • HF_HUB_OFFLINE=1

Examples

Note: Most examples below assume the full CLI (package anno-cli). If you installed the minimal facade CLI, only anno extract is available.

Named entities (human output is compact and may vary by backend/build):

anno extract --text "Lynn Conway worked at IBM and Xerox PARC in California."
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"

Machine-readable output (schema-stable; values vary). This example uses pattern to be offline and reproducible; other models use the same JSON shape:

anno extract --model pattern --format json --text "Contact jobs@acme.com by March 15 for the \$50K role."
[
  {
    "text": "jobs@acme.com",
    "entity_type": "EMAIL",
    "start": 8,
    "end": 21,
    "confidence": 0.98
  },
  {
    "text": "March 15",
    "entity_type": "DATE",
    "start": 25,
    "end": 33,
    "confidence": 0.95
  },
  {
    "text": "$50K",
    "entity_type": "MONEY",
    "start": 42,
    "end": 46,
    "confidence": 0.95
  }
]

Structured entities (dates, money, emails):

anno extract --model pattern --text "Contact jobs@acme.com by March 15 for the \$50K role."
EMAIL:1 "jobs@acme.com" DATE:1 "March 15" MONEY:1 "$50K"

Zero-shot extraction (full CLI; define your own entity types):

anno extract --model gliner --extract-types "DRUG,SYMPTOM" \
  --text "Aspirin can treat headaches and reduce fever."
drug:1 "Aspirin" symptom:2 "headaches" "fever"

Coreference resolution (full CLI):

anno debug --coref -t "Sophie Wilson designed the ARM processor. She revolutionized mobile computing."
Coreference: "Sophie Wilson" → "She"

Library (Rust)

Add the library:

[dependencies]
# Git (recommended; matches this repo while publishing is paused):
anno = { git = "https://github.com/arclabs561/anno", rev = "<commit>" }

# crates.io (may lag behind `main`):
# anno = "0.2"
use anno::{Model, StackedNER};

let m = StackedNER::default();
let ents = m.extract_entities("Sophie Wilson designed the ARM processor.", None)?;
assert!(!ents.is_empty());
# Ok::<(), anno::Error>(())

More examples: docs/QUICKSTART.md.

Advanced: sampler (muxer)

If you build with eval, anno sampler (alias anno muxer) exposes the same randomized matrix sampler used in CI, with two modes:

  • triage: quick regression-hunting defaults (worst-first)
  • measure: stable measurement defaults (ml-only)

This command is hidden from the top-level help; use anno help sampler.

Docs

  • docs/QUICKSTART.md
  • docs/CONTRACT.md — interface contract
  • docs/BACKENDS.md — backend selection