anno/backends/crf/mod.rs
//! Conditional Random Field (CRF) NER backend.
//!
//! Implements classical statistical NER using CRF sequence labeling.
//! This provides a lightweight, interpretable baseline that requires no
//! external dependencies or GPU acceleration.
//!
//! # History
//!
//! CRF-based NER was a common baseline throughout the 2000s (pre-neural sequence labeling):
//! - Lafferty et al. 2001: Introduced CRFs for sequence labeling (ICML)
//! - McCallum & Li 2003: Applied CRFs to NER
//! - Stanford NER (2003-2014): CRF-based, still widely used
//! - State of the art until neural methods (BiLSTM-CRF, 2015+)
//!
//! # Why CRF Beat Previous Methods
//!
//! CRFs solved the **label bias problem** that plagued MEMMs (Maximum Entropy
//! Markov Models, McCallum et al. 2000):
//!
//! ```text
//! Label Bias: In MEMMs, states with few successors effectively ignore
//! observations. Transition scores are normalized per state,
//! so low-entropy states "absorb" probability mass regardless of input.
//!
//! HMM:  Generative model       P(x,y) = P(y) × P(x|y)
//! MEMM: Local discriminative   P(y_t|y_{t-1}, x)   ← label bias here
//! CRF:  Global discriminative  P(y|x) = (1/Z) exp(∑ features × weights)
//!                                        ↑ normalizes over entire sequence
//! ```
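//!
//! A minimal numeric sketch of the label bias effect (toy scores, not taken
//! from this crate): under an MEMM's per-state normalization, a state with a
//! single successor hands that successor probability 1 no matter what the
//! observation scores, while a state with several successors can still be
//! swayed by the input:
//!
//! ```rust
//! /// Locally normalize raw scores into a probability distribution (softmax).
//! fn local_softmax(scores: &[f64]) -> Vec<f64> {
//!     let z: f64 = scores.iter().map(|s| s.exp()).sum();
//!     scores.iter().map(|s| s.exp() / z).collect()
//! }
//!
//! // State A has two successors; the observation shifts its distribution.
//! let from_a = local_softmax(&[2.0, -1.0]);
//! assert!(from_a[0] > 0.9);
//!
//! // State B has one successor: even a terrible observation score of -5.0
//! // normalizes to probability 1.0, so the observation is ignored.
//! let from_b = local_softmax(&[-5.0]);
//! assert!((from_b[0] - 1.0).abs() < 1e-12);
//! ```
//!
//! Global normalization avoids this: a CRF normalizes only once, over whole
//! label sequences, via the partition function `Z(x)`.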
//!
//! CRF models the conditional probability of the entire label sequence given
//! the observation sequence, using global normalization:
//!
//! ```text
//! P(y|x) = (1/Z(x)) × exp( ∑_t ∑_k λ_k × f_k(y_t, y_{t-1}, x, t) )
//!
//! where:
//! - Z(x) is the partition function (normalizer)
//! - f_k are feature functions
//! - λ_k are learned weights
//! ```
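//!
//! A self-contained sketch of that formula, using toy weights and a
//! hypothetical string encoding of features (not this crate's actual feature
//! set): score each candidate label sequence, exponentiate, and divide by the
//! partition function `Z(x)`:
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Sum the weights of the features that fire for a candidate labeling.
//! fn score(weights: &HashMap<&str, f64>, feats: &[&str]) -> f64 {
//!     feats.iter().map(|f| weights.get(f).copied().unwrap_or(0.0)).sum()
//! }
//!
//! let mut w = HashMap::new();
//! w.insert("word=John|label=B-PER", 2.0);
//!
//! // Two candidate labelings of a one-token "sentence".
//! let s_per = score(&w, &["word=John|label=B-PER"]).exp();
//! let s_o = score(&w, &["word=John|label=O"]).exp();
//!
//! let z = s_per + s_o; // partition function Z(x) over all candidates
//! let p_per = s_per / z;
//! assert!((p_per - 0.8808).abs() < 1e-3); // e^2 / (e^2 + e^0)
//! ```
//!
//! Real implementations never enumerate all `|labels|^T` sequences: the
//! forward algorithm computes `Z(x)` and Viterbi finds the best sequence in
//! O(T × |labels|²).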
//!
//! # References
//!
//! - Lafferty, McCallum, Pereira (2001): "Conditional Random Fields:
//!   Probabilistic Models for Segmenting and Labeling Sequence Data" (ICML)
//! - McCallum & Li (2003): "Early Results for Named Entity Recognition with
//!   Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons"
//! - Finkel, Grenager, Manning (2005): "Incorporating Non-local Information
//!   into Information Extraction Systems by Gibbs Sampling" (ACL)
//!
//! # See Also
//!
//! - Historical NER baselines (HMM/CRF-era sequence models)
//!
//! # Feature Templates
//!
//! The CRF uses the following feature templates (matching `train_crf_weights.py`):
//!
//! ```text
//! - bias          : Always-on feature for label-specific bias
//! - word.lower    : Lowercased current word
//! - word.shape    : Word shape pattern (Xx, X, x, 0, etc.)
//! - word.isdigit  : Whether the word is all digits
//! - word.istitle  : Whether the word is titlecased
//! - word.isupper  : Whether the word is all uppercase
//! - prefix{2,3}   : First 2-3 characters
//! - suffix{2,3}   : Last 2-3 characters
//! - -1:word.*     : Previous-word features
//! - +1:word.*     : Next-word features
//! - BOS/EOS       : Sentence boundary markers
//! ```
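//!
//! As an illustration, the `word.shape` template above can be computed like
//! this (a sketch; whether the real template collapses repeated characters is
//! an implementation detail of `train_crf_weights.py`):
//!
//! ```rust
//! /// Map uppercase→'X', lowercase→'x', digit→'0', then collapse runs.
//! fn word_shape(word: &str) -> String {
//!     let mut out = String::new();
//!     for c in word.chars() {
//!         let mapped = if c.is_uppercase() {
//!             'X'
//!         } else if c.is_lowercase() {
//!             'x'
//!         } else if c.is_ascii_digit() {
//!             '0'
//!         } else {
//!             c
//!         };
//!         // Collapse consecutive duplicates: "Xxxxxx" → "Xx".
//!         if out.chars().last() != Some(mapped) {
//!             out.push(mapped);
//!         }
//!     }
//!     out
//! }
//!
//! assert_eq!(word_shape("Google"), "Xx");
//! assert_eq!(word_shape("NASA"), "X");
//! assert_eq!(word_shape("2023"), "0");
//! assert_eq!(word_shape("iPhone"), "xXx");
//! ```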
//!
//! # Performance
//!
//! Performance depends on weights, tokenization, and dataset; use the eval harness
//! for quantitative results.
//!
//! | Weights   | Relative accuracy | Notes                         |
//! |-----------|-------------------|-------------------------------|
//! | Heuristic | Lower             | Hand-tuned, always available  |
//! | Trained   | Higher            | From `train_crf_weights.py`   |
//! | Neural    | Highest           | For comparison (GLiNER, BERT) |
//!
//! # Usage
//!
//! ```rust
//! use anno::CrfNER;
//! use anno::Model;
//!
//! // Use with default heuristic weights
//! let ner = CrfNER::new();
//! let entities = ner.extract_entities("John Smith works at Google", None)?;
//!
//! // Or load trained weights for better accuracy
//! // let ner = CrfNER::with_weights("crf_weights.json")?;
//! # Ok::<(), anno::Error>(())
//! ```
//!
//! # Training Weights
//!
//! To train weights on CoNLL-2003:
//!
//! ```bash
//! uv run scripts/train_crf_weights.py
//! ```
//!
//! This produces `crf_weights.json`, which can be loaded with `CrfNER::with_weights()`.
//!
//! Nuance: CoNLL-2003’s English text is derived from Reuters/RCV1 and is commonly treated as
//! redistribution-restricted. The CoNLL site notes that, “because of copyright reasons we only
//! make available the annotations” and that you need separate access to the Reuters corpus to
//! build the full dataset: `http://www.clips.uantwerpen.be/conll2003/ner/`.
//!
//! Practical consequence: `anno` includes a training script, but it does not ship a CoNLL-trained
//! `crf_weights.json` out of the box.
//!
//! # Advantages Over Neural Methods
//!
//! - **Interpretable**: Features and weights are human-readable
//! - **Fast training**: CPU-only training; typically faster to iterate than neural training loops
//! - **No dependencies**: Pure Rust, no ONNX/Candle required
//! - **Deterministic**: Same input always produces the same output
//! - **Small footprint**: Small weights file compared to ML model artifacts

use crate::{Entity, EntityType, Model, Result};
use std::collections::HashMap;
#[cfg(feature = "bundled-crf-weights")]
use std::sync::OnceLock;

/// CRF-based NER model.
///
/// Uses hand-crafted features and sequence labeling for named entity recognition.
/// This is a pure-Rust implementation that doesn't require external libraries.
pub struct CrfNER {
    /// Feature weights learned during training (or loaded from file)
    weights: HashMap<String, f64>,
    /// Entity type gazetteer lists
    gazetteers: HashMap<EntityType, Vec<String>>,
    /// Label set (BIO tagging)
    labels: Vec<String>,
    /// Feature templates
    templates: Vec<FeatureTemplate>,
}

/// Feature template for the CRF.
#[derive(Debug, Clone)]
pub enum FeatureTemplate {
    /// Current word
    Word,
    /// Word at offset
    WordAt(i32),
    /// Word shape (Xx, XX, x, 0)
    Shape,
    /// Shape at offset
    ShapeAt(i32),
    /// Prefix of length n
    Prefix(usize),
    /// Suffix of length n
    Suffix(usize),
    /// Is in gazetteer for entity type
    InGazetteer(EntityType),
    /// Previous label
    PrevLabel,
    /// Bigram: current + previous label
    LabelBigram,
    /// Word + label combination
    WordLabel,
}

impl Default for CrfNER {
    fn default() -> Self {
        Self::new()
    }
}

mod algorithm;
// CRF algorithm: feature extraction, Viterbi decoding, weight loading (see algorithm.rs).

impl Model for CrfNER {
    fn extract_entities(&self, text: &str, _language: Option<&str>) -> Result<Vec<Entity>> {
        if text.trim().is_empty() {
            return Ok(vec![]);
        }

        let tokens = Self::tokenize(text);
        if tokens.is_empty() {
            return Ok(vec![]);
        }

        let labels = self.viterbi_decode(&tokens);
        let entities = self.labels_to_entities(text, &tokens, &labels);

        Ok(entities)
    }

    fn supported_types(&self) -> Vec<EntityType> {
        vec![
            EntityType::Person,
            EntityType::Organization,
            EntityType::Location,
            EntityType::Other("MISC".to_string()),
        ]
    }

    fn is_available(&self) -> bool {
        true // Always available (no external dependencies)
    }

    fn name(&self) -> &'static str {
        "crf"
    }

    fn description(&self) -> &'static str {
        "CRF-based NER (classical statistical method)"
    }

    fn capabilities(&self) -> crate::ModelCapabilities {
        crate::ModelCapabilities {
            batch_capable: true,
            optimal_batch_size: Some(32),
            streaming_capable: true,
            ..Default::default()
        }
    }
}

impl crate::NamedEntityCapable for CrfNER {}

impl crate::BatchCapable for CrfNER {
    fn optimal_batch_size(&self) -> Option<usize> {
        Some(32) // CRF is fast and can handle batches
    }
}

impl crate::StreamingCapable for CrfNER {
    fn recommended_chunk_size(&self) -> usize {
        4096 // Smaller chunks since CRF is token-based
    }
}

#[cfg(test)]
mod tests;