//! Conditional Random Field (CRF) NER backend.
//!
//! Implements classical statistical NER using CRF sequence labeling.
//! This provides a lightweight, interpretable baseline that requires no
//! external dependencies or GPU acceleration.
//!
//! # History
//!
//! CRF-based NER was a common baseline throughout the 2000s (pre-neural sequence labeling):
//! - Lafferty et al. 2001: Introduced CRFs for sequence labeling (ICML)
//! - McCallum & Li 2003: Applied CRFs to NER
//! - Stanford NER (2003-2014): CRF-based, still widely used
//! - State of the art until neural methods took over (BiLSTM-CRF, 2015+)
//!
//! # Why CRF Beat Previous Methods
//!
//! CRFs solved the **label bias problem** that plagued MEMMs (Maximum Entropy
//! Markov Models, McCallum et al. 2000):
//!
//! ```text
//! Label Bias: In MEMMs, states with few outgoing transitions effectively
//! ignore observations. Each state's transition distribution is normalized
//! locally (per source state), so low-entropy states "absorb" probability
//! mass regardless of input.
//!
//! HMM: Generative model P(x,y) = P(y) × P(x|y)
//! MEMM: Local discriminative P(y_t|y_{t-1}, x) ← label bias here
//! CRF: Global discriminative P(y|x) = (1/Z) exp(∑ features × weights)
//! ↑ normalizes over entire sequence
//! ```
//!
//! CRF models the conditional probability of the entire label sequence given
//! the observation sequence, using global normalization:
//!
//! ```text
//! P(y|x) = (1/Z(x)) × exp( ∑_t ∑_k λ_k × f_k(y_t, y_{t-1}, x, t) )
//!
//! where:
//! - Z(x) is the partition function (normalizer)
//! - f_k are feature functions
//! - λ_k are learned weights
//! ```
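//!
//! Z(x) sums exponentiated scores over all |labels|^T label sequences, so it is
//! computed with the forward recursion in log space. A self-contained sketch
//! (illustrative only, not the crate's internal API; `emit[t][j]` is the summed
//! feature score for label `j` at position `t`, and `trans[i][j]` the transition
//! score from `i` to `j`):
//!
//! ```rust
//! fn log_partition(emit: &[Vec<f64>], trans: &[Vec<f64>]) -> f64 {
//!     // Numerically stable log-sum-exp.
//!     let lse = |v: &[f64]| {
//!         let m = v.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
//!         m + v.iter().map(|x| (x - m).exp()).sum::<f64>().ln()
//!     };
//!     let n = trans.len();
//!     // alpha[j] = log-sum of exp-scores of all label prefixes ending in j.
//!     let mut alpha = emit[0].clone();
//!     for t in 1..emit.len() {
//!         let next: Vec<f64> = (0..n)
//!             .map(|j| {
//!                 let terms: Vec<f64> = (0..n)
//!                     .map(|i| alpha[i] + trans[i][j] + emit[t][j])
//!                     .collect();
//!                 lse(&terms)
//!             })
//!             .collect();
//!         alpha = next;
//!     }
//!     lse(&alpha)
//! }
//! ```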
//!
//! # References
//!
//! - Lafferty, McCallum, Pereira (2001): "Conditional Random Fields:
//! Probabilistic Models for Segmenting and Labeling Sequence Data" (ICML)
//! - McCallum & Li (2003): "Early Results for Named Entity Recognition with
//! Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons"
//! - Finkel, Grenager, Manning (2005): "Incorporating Non-local Information
//! into Information Extraction Systems by Gibbs Sampling" (ACL)
//!
//! # See Also
//!
//! - Historical NER baselines (HMM/CRF-era sequence models)
//!
//! # Feature Templates
//!
//! The CRF uses the following feature templates (matching `train_crf_weights.py`):
//!
//! ```text
//! - bias : Always-on feature for label-specific bias
//! - word.lower : Lowercased current word
//! - word.shape : Word shape pattern (Xx, X, x, 0, etc.)
//! - word.isdigit : Whether word is all digits
//! - word.istitle : Whether word is titlecased
//! - word.isupper : Whether word is all uppercase
//! - prefix{2,3} : First 2-3 characters
//! - suffix{2,3} : Last 2-3 characters
//! - -1:word.* : Previous word features
//! - +1:word.* : Next word features
//! - BOS/EOS : Sentence boundary markers
//! ```
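//!
//! For instance, `word.shape` is typically computed by mapping character classes
//! and collapsing repeats. A sketch of one common variant (the exact rule is
//! defined by `train_crf_weights.py`; the collapsing behavior here is an
//! assumption):
//!
//! ```rust
//! // Map uppercase -> 'X', lowercase -> 'x', digit -> '0', collapse repeats:
//! // "John" -> "Xx", "NASA" -> "X", "2003" -> "0".
//! fn word_shape(word: &str) -> String {
//!     let mut shape = String::new();
//!     for c in word.chars() {
//!         let class = if c.is_uppercase() {
//!             'X'
//!         } else if c.is_lowercase() {
//!             'x'
//!         } else if c.is_ascii_digit() {
//!             '0'
//!         } else {
//!             c
//!         };
//!         if !shape.ends_with(class) {
//!             shape.push(class);
//!         }
//!     }
//!     shape
//! }
//! ```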
//!
//! # Trained Parameters
//!
//! Bundled weights in `crf_weights.json` (28k features) are trained via `python-crfsuite`
//! on WikiANN EN (15k sentences, L1=0.1, L2=0.1, 100 iterations). Includes shape, affix,
//! casing, context features, and transition weights. Word identity features are limited
//! to a top-2000 vocab to keep the file shippable. Labels: PER, ORG, LOC.
//!
//! To retrain on a different dataset:
//! ```sh
//! uv run scripts/train_crf_weights.py --dataset <hf_dataset> --config <config>
//! ```
//!
//! Requires the `bundled-crf-weights` feature to use trained weights; otherwise
//! falls back to hand-tuned heuristic weights.
//!
//! # Performance
//!
//! Performance depends on weights, tokenization, and dataset; use the eval harness
//! for quantitative results.
//!
//! | Variant   | Relative accuracy | Notes                          |
//! |-----------|-------------------|--------------------------------|
//! | Heuristic | Lower             | Hand-tuned, always available   |
//! | Trained   | Higher            | From `train_crf_weights.py`    |
//! | Neural    | Highest           | For comparison (GLiNER, BERT)  |
//!
//! # Usage
//!
//! ```rust
//! use anno::CrfNER;
//! use anno::Model;
//!
//! // Use with default heuristic weights
//! let ner = CrfNER::new();
//! let entities = ner.extract_entities("John Smith works at Google", None)?;
//!
//! // Or load trained weights for better accuracy
//! // let ner = CrfNER::with_weights("crf_weights.json")?;
//! # Ok::<(), anno::Error>(())
//! ```
//!
//! # Training Weights
//!
//! To train weights on CoNLL-2003:
//!
//! ```bash
//! uv run scripts/train_crf_weights.py
//! ```
//!
//! This produces `crf_weights.json` which can be loaded with `CrfNER::with_weights()`.
//!
//! Nuance: CoNLL-2003’s English text is derived from Reuters/RCV1 and is commonly treated as
//! redistribution-restricted. The CoNLL site notes that, “because of copyright reasons we only
//! make available the annotations” and that you need separate access to the Reuters corpus to
//! build the full dataset: `http://www.clips.uantwerpen.be/conll2003/ner/`.
//!
//! Practical consequence: `anno` includes a training script, but it does not ship a CoNLL-trained
//! `crf_weights.json` out of the box.
//!
//! # Advantages Over Neural Methods
//!
//! - **Interpretable**: Features and weights are human-readable
//! - **Fast training**: CPU-only training; typically faster to iterate than neural training loops
//! - **No dependencies**: Pure Rust, no ONNX/Candle required
//! - **Deterministic**: Same input always produces same output
//! - **Small footprint**: Small weights file compared to ML model artifacts
use crate::;
use std::collections::HashMap;
use std::sync::OnceLock;
/// CRF-based NER model.
///
/// Uses hand-crafted features and sequence labeling for named entity recognition.
/// This is a pure-Rust implementation that doesn't require external libraries.
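// Hedged sketch of the backing state, assuming a flat feature -> per-label
// weight table plus label-transition weights; the shipped definition lives
// with the algorithm code (see algorithm.rs) and may differ.
pub struct CrfNER {
    /// Feature name -> (label -> weight).
    state_weights: HashMap<String, HashMap<String, f64>>,
    /// (from-label, to-label) -> weight.
    transition_weights: HashMap<(String, String), f64>,
}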
/// Feature template for CRF.
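// Placeholder sketch; the actual template representation lives with the
// algorithm code (see algorithm.rs).
struct FeatureTemplate;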
/// Find sentence boundary character offsets in text.
///
/// A sentence boundary is `. `, `! `, or `? ` followed by an uppercase letter.
/// Returns the character offset of the punctuation mark (the split point).
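// Illustrative sketch of the finder described above; the shipped version
// lives with the algorithm code (see algorithm.rs).
fn find_sentence_boundaries(text: &str) -> Vec<usize> {
    let chars: Vec<char> = text.chars().collect();
    let mut boundaries = Vec::new();
    for (i, &c) in chars.iter().enumerate() {
        let is_split = matches!(c, '.' | '!' | '?')
            && chars.get(i + 1) == Some(&' ')
            && chars.get(i + 2).map_or(false, |n| n.is_uppercase());
        if is_split {
            // Offset of the punctuation mark, per the contract above.
            boundaries.push(i);
        }
    }
    boundaries
}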
/// Clip entities that cross sentence boundaries.
///
/// If an entity span contains a sentence boundary (`.` + whitespace + uppercase),
/// truncate the entity to end before the boundary. Removes entities that become
/// empty after clipping.
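// Illustrative sketch over bare (start, end) char spans; the shipped version
// clips the crate's entity type (see algorithm.rs).
fn clip_spans(spans: Vec<(usize, usize)>, boundaries: &[usize]) -> Vec<(usize, usize)> {
    spans
        .into_iter()
        .filter_map(|(start, end)| {
            // Truncate at the first boundary falling strictly inside the span.
            let clipped_end = boundaries
                .iter()
                .copied()
                .find(|&b| b > start && b < end)
                .unwrap_or(end);
            // Drop spans that are empty after clipping.
            (clipped_end > start).then_some((start, clipped_end))
        })
        .collect()
}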
// CRF algorithm: feature extraction, Viterbi decoding, weight loading (see algorithm.rs).