anno
Information extraction: named entity recognition and coreference.
Dual-licensed under MIT or Apache-2.0.
API docs (may lag behind main while crates.io publishing is paused): docs.rs/anno
Install
This repo is not using crates.io as its primary distribution path right now; see docs/PUBLISH_STATUS.md.
Full CLI (crates/anno-cli)
This is the recommended anno binary (many commands: extract, debug, benchmark, models, etc.):
# From this repo:
Or without cloning manually:
Minimal facade CLI (package anno)
This binary is intentionally small and currently supports anno extract only:
# From this repo:
Optional (enable ONNX-backed ML selection for stacked at build time):
What it does
- Named entity recognition (NER): people, orgs, locations, etc.
- Coreference: resolve mentions like "Sophie Wilson" → "She".
- Structured pattern extraction (dates, money, emails).
Backends
| Backend | Label surface | Structure | Weights | Notes |
|---|---|---|---|---|
stacked (default) |
Mixed (best available) | Flat spans | HuggingFace (when ML enabled) | Selector/fallback: chooses an ML backend when available, otherwise regex+heuristic |
gliner |
Custom (zero-shot) | Flat spans | onnx-community/gliner_small-v2.1 | Span classifier; --extract-types |
gliner2 |
Custom (zero-shot) | Flat spans | onnx-community/gliner-multitask-large-v0.5 | Multi-task (NER + classification) |
nuner |
Custom (zero-shot) | Flat spans | deepanwa/NuNerZero_onnx | Token classifier (BIO), arbitrary-length entities |
w2ner |
Fixed (trained labels) | Nested/discont. | ljynlp/w2ner-bert-base | Word-word grids; supports nested spans |
bert-onnx |
Fixed (PER/ORG/LOC/MISC) | Flat spans | protectai/bert-base-NER-onnx | Classic CoNLL-style NER |
pattern |
Fixed (patterns) | Flat spans | None | Regex for dates, emails, money |
heuristic |
Fixed (heuristics) | Flat spans | None | Capitalization + context baseline |
crf |
Fixed (trained labels) | Flat spans | Bundled (bundled-crf-weights) |
Classical baseline; can load custom weights |
hmm |
Fixed (trained labels) | Flat spans | Bundled (bundled-hmm-params) |
Classical baseline/education |
ensemble |
Mixed | Flat spans | Varies | Parallel combiner: weighted voting across backends |
Notes:
- All NER backends return variable-length spans (start/end offsets). Some are token-labeling models internally.
- Offsets are character offsets (Unicode scalar values), not byte offsets; see
docs/CONTRACT.md. - ML backends are feature-gated behind
onnxorcandle. The publishedannocrate keeps defaults minimal; enableonnxexplicitly when you want ML backends. - ML model weights download from HuggingFace on first use (see “Offline / downloads” below).
- The table is the NER backend surface; for a fuller capability/provenance discussion see
docs/BACKENDS.md. - Evaluation tooling adds additional notions (dataset/backend compatibility gates, label-shift “true zero-shot” accounting); those live in
anno-eval, not in the runtimeModeltrait surface.
Offline / downloads
- The facade CLI (package
anno) does not enable ML backends by default. - If you enable ML backends (
--features onnx/candle), weights may download on first use. - Force cached-only / offline behavior:
ANNO_NO_DOWNLOADS=1(preferred), orHF_HUB_OFFLINE=1
Examples
Note: Most examples below assume the full CLI (package anno-cli). If you installed the
minimal facade CLI, only anno extract is available.
Named entities (human output is compact and may vary by backend/build):
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"
Machine-readable output (schema-stable; values vary). This example uses pattern to be offline and reproducible; other models use the same JSON shape:
Structured entities (dates, money, emails):
EMAIL:1 "jobs@acme.com" DATE:1 "March 15" MONEY:1 "$50K"
Zero-shot extraction (full CLI; define your own entity types):
drug:1 "Aspirin" symptom:2 "headaches" "fever"
Coreference resolution (full CLI):
Coreference: "Sophie Wilson" → "She"
Library (Rust)
Add the library:
[]
# Git (recommended; matches this repo while publishing is paused):
= { = "https://github.com/arclabs561/anno", = "<commit>" }
# crates.io (may lag behind `main`):
# anno = "0.2"
use ;
let m = default;
let ents = m.extract_entities?;
assert!;
# Ok::
More examples: docs/QUICKSTART.md.
Advanced: sampler (muxer)
If you build with eval, anno sampler (alias anno muxer) exposes the same randomized
matrix sampler used in CI, with two modes:
- triage: quick regression-hunting defaults (worst-first)
- measure: stable measurement defaults (ml-only)
This command is hidden from the top-level help; use anno help sampler.
Docs
docs/QUICKSTART.mddocs/CONTRACT.md— interface contractdocs/BACKENDS.md— backend selection