// anno/backends/crf/mod.rs

//! Conditional Random Field (CRF) NER backend.
//!
//! Implements classical statistical NER using CRF sequence labeling.
//! This provides a lightweight, interpretable baseline that requires no
//! external dependencies or GPU acceleration.
//!
//! # History
//!
//! CRF-based NER was a common baseline throughout the 2000s (pre-neural sequence labeling):
//! - Lafferty et al. 2001: Introduced CRFs for sequence labeling (ICML)
//! - McCallum & Li 2003: Applied CRFs to NER
//! - Stanford NER (2003-2014): CRF-based, still widely used
//! - State of the art until neural methods (BiLSTM-CRF, 2015+)
//!
//! # Why CRF Beat Previous Methods
//!
//! CRFs solved the **label bias problem** that plagued MEMMs (Maximum Entropy
//! Markov Models, McCallum et al. 2000):
//!
//! ```text
//! Label bias: in MEMMs, states with few successors effectively ignore
//! observations. Transition scores are normalized locally per state,
//! so low-entropy states "absorb" probability mass regardless of input.
//!
//! HMM:   Generative model      P(x,y) = P(y) × P(x|y)
//! MEMM:  Local discriminative  P(y_t | y_{t-1}, x)  ← label bias here
//! CRF:   Global discriminative P(y|x) = (1/Z) exp(∑ features × weights)
//!                                        ↑ normalizes over entire sequence
//! ```
//!
//! A CRF models the conditional probability of the entire label sequence given
//! the observation sequence, using global normalization:
//!
//! ```text
//! P(y|x) = (1/Z(x)) × exp( ∑_t ∑_k λ_k × f_k(y_t, y_{t-1}, x, t) )
//!
//! where:
//!   - Z(x) is the partition function (normalizer)
//!   - f_k are feature functions
//!   - λ_k are learned weights
//! ```
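//!
//! For intuition, a single indicator feature in this family might look like the
//! following (an illustrative sketch, not necessarily a feature this backend uses):
//!
//! ```text
//! f_k(y_t, y_{t-1}, x, t) = 1  if x_t is titlecased and y_t = B-PER
//!                           0  otherwise
//! ```
//!
//! A positive weight λ_k favors starting a PERSON span on capitalized words,
//! while Z(x) trades this off against every other labeling of the sentence.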
//!
//! # References
//!
//! - Lafferty, McCallum, Pereira (2001): "Conditional Random Fields:
//!   Probabilistic Models for Segmenting and Labeling Sequence Data" (ICML)
//! - McCallum & Li (2003): "Early Results for Named Entity Recognition with
//!   Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons"
//! - Finkel, Grenager, Manning (2005): "Incorporating Non-local Information
//!   into Information Extraction Systems by Gibbs Sampling" (ACL)
//!
//! # See Also
//!
//! - Historical NER baselines (HMM/CRF-era sequence models)
//!
//! # Feature Templates
//!
//! The CRF uses the following feature templates (matching `train_crf_weights.py`):
//!
//! ```text
//! - bias           : Always-on feature for label-specific bias
//! - word.lower     : Lowercased current word
//! - word.shape     : Word shape pattern (Xx, X, x, 0, etc.)
//! - word.isdigit   : Whether word is all digits
//! - word.istitle   : Whether word is titlecased
//! - word.isupper   : Whether word is all uppercase
//! - prefix{2,3}    : First 2-3 characters
//! - suffix{2,3}    : Last 2-3 characters
//! - -1:word.*      : Previous word features
//! - +1:word.*      : Next word features
//! - BOS/EOS        : Sentence boundary markers
//! ```
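//!
//! For example, for the token "Google" in "John works at Google", the fired
//! features would look roughly like this (illustrative; the exact feature
//! strings are defined by `train_crf_weights.py`):
//!
//! ```text
//! bias, word.lower=google, word.shape=Xx, word.istitle=true,
//! word.isupper=false, word.isdigit=false,
//! prefix2=Go, prefix3=Goo, suffix2=le, suffix3=gle,
//! -1:word.lower=at, EOS
//! ```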
//!
//! # Performance
//!
//! Performance depends on weights, tokenization, and dataset; use the eval harness
//! for quantitative results. Roughly:
//!
//! | Weights   | Relative accuracy | Notes                         |
//! |-----------|-------------------|-------------------------------|
//! | Heuristic | Lower             | Hand-tuned, always available  |
//! | Trained   | Higher            | From `train_crf_weights.py`   |
//! | Neural    | Highest           | For comparison (GLiNER, BERT) |
//!
//! # Usage
//!
//! ```rust
//! use anno::CrfNER;
//! use anno::Model;
//!
//! // Use with default heuristic weights
//! let ner = CrfNER::new();
//! let entities = ner.extract_entities("John Smith works at Google", None)?;
//!
//! // Or load trained weights for better accuracy
//! // let ner = CrfNER::with_weights("crf_weights.json")?;
//! # Ok::<(), anno::Error>(())
//! ```
//!
//! # Training Weights
//!
//! To train weights on CoNLL-2003:
//!
//! ```bash
//! uv run scripts/train_crf_weights.py
//! ```
//!
//! This produces `crf_weights.json`, which can be loaded with `CrfNER::with_weights()`.
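//!
//! Loading the trained weights then looks like this (the path is wherever the
//! script wrote the file):
//!
//! ```rust,no_run
//! use anno::CrfNER;
//!
//! let ner = CrfNER::with_weights("crf_weights.json")?;
//! # Ok::<(), anno::Error>(())
//! ```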
//!
//! Nuance: CoNLL-2003’s English text is derived from Reuters/RCV1 and is commonly treated as
//! redistribution-restricted. The CoNLL site notes that, “because of copyright reasons we only
//! make available the annotations”, and that you need separate access to the Reuters corpus to
//! build the full dataset: `http://www.clips.uantwerpen.be/conll2003/ner/`.
//!
//! Practical consequence: `anno` includes a training script, but it does not ship a CoNLL-trained
//! `crf_weights.json` out of the box.
//!
//! # Advantages Over Neural Methods
//!
//! - **Interpretable**: Features and weights are human-readable
//! - **Fast training**: CPU-only training; typically faster to iterate than neural training loops
//! - **No dependencies**: Pure Rust, no ONNX/Candle required
//! - **Deterministic**: Same input always produces the same output
//! - **Small footprint**: Small weights file compared to ML model artifacts

use crate::{Entity, EntityType, Model, Result};
use std::collections::HashMap;
#[cfg(feature = "bundled-crf-weights")]
use std::sync::OnceLock;

/// CRF-based NER model.
///
/// Uses hand-crafted features and sequence labeling for named entity recognition.
/// This is a pure-Rust implementation that doesn't require external libraries.
pub struct CrfNER {
    /// Feature weights learned during training (or loaded from file)
    weights: HashMap<String, f64>,
    /// Entity type gazetteer lists
    gazetteers: HashMap<EntityType, Vec<String>>,
    /// Label set (BIO tagging)
    labels: Vec<String>,
    /// Feature templates
    templates: Vec<FeatureTemplate>,
}

/// Feature template for CRF
#[derive(Debug, Clone)]
pub enum FeatureTemplate {
    /// Current word
    Word,
    /// Word at offset
    WordAt(i32),
    /// Word shape (Xx, XX, x, 0)
    Shape,
    /// Shape at offset
    ShapeAt(i32),
    /// Prefix of length n
    Prefix(usize),
    /// Suffix of length n
    Suffix(usize),
    /// Is in gazetteer for entity type
    InGazetteer(EntityType),
    /// Previous label
    PrevLabel,
    /// Bigram: current + previous label
    LabelBigram,
    /// Word + Label combination
    WordLabel,
}

impl Default for CrfNER {
    fn default() -> Self {
        Self::new()
    }
}

// CRF algorithm: feature extraction, Viterbi decoding, weight loading (see algorithm.rs).
mod algorithm;

impl Model for CrfNER {
    fn extract_entities(&self, text: &str, _language: Option<&str>) -> Result<Vec<Entity>> {
        if text.trim().is_empty() {
            return Ok(vec![]);
        }

        let tokens = Self::tokenize(text);
        if tokens.is_empty() {
            return Ok(vec![]);
        }

        let labels = self.viterbi_decode(&tokens);
        let entities = self.labels_to_entities(text, &tokens, &labels);

        Ok(entities)
    }

    fn supported_types(&self) -> Vec<EntityType> {
        vec![
            EntityType::Person,
            EntityType::Organization,
            EntityType::Location,
            EntityType::Other("MISC".to_string()),
        ]
    }

    fn is_available(&self) -> bool {
        true // Always available (no external dependencies)
    }

    fn name(&self) -> &'static str {
        "crf"
    }

    fn description(&self) -> &'static str {
        "CRF-based NER (classical statistical method)"
    }

    fn capabilities(&self) -> crate::ModelCapabilities {
        crate::ModelCapabilities {
            batch_capable: true,
            optimal_batch_size: Some(32),
            streaming_capable: true,
            ..Default::default()
        }
    }
}

impl crate::NamedEntityCapable for CrfNER {}

impl crate::BatchCapable for CrfNER {
    fn optimal_batch_size(&self) -> Option<usize> {
        Some(32) // CRF is fast, can handle batches
    }
}

impl crate::StreamingCapable for CrfNER {
    fn recommended_chunk_size(&self) -> usize {
        4096 // Smaller chunks since CRF is token-based
    }
}

#[cfg(test)]
mod tests;