libgrammstein 0.1.0

Hybrid language model (N-gram + Embeddings) for WFST text correction
# Training on Large Corpora

This guide covers strategies for training models on large-scale text corpora.

## Memory Challenges

Large corpora present several challenges:

| Challenge | Cause | Solution |
|-----------|-------|----------|
| Corpus doesn't fit in RAM | Loading all text | Streaming readers |
| Model doesn't fit | Large vocabulary | Filtering, compression |
| Slow training | Sequential processing | Parallelization |
| Disk space | Storing corpus | Streaming from HTTP |

## Streaming Corpus Readers

### Wikipedia Streaming

Process Wikipedia dumps as they stream over HTTP, without first downloading them to disk:

```rust
use libgrammstein::corpus::{WikipediaReader, WikipediaConfig, LoadStrategy};

// HTTP streaming - processes on-the-fly
let config = WikipediaConfig {
    load_strategy: LoadStrategy::HttpStreaming,
    max_articles: Some(1_000_000),  // Limit for testing
    ..Default::default()
};

let reader = WikipediaReader::from_url(
    "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
)?;

// Train directly from stream
let model = TrainerBuilder::new(dictionary)
    .order(5)
    .train(&reader)?;
```

### Chunked Processing

Process large files in chunks:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Reads `path` line by line and hands `processor` at most `chunk_size` lines at a time.
fn process_large_file<F>(path: &str, chunk_size: usize, mut processor: F) -> io::Result<()>
where
    F: FnMut(&[String]) -> io::Result<()>,
{
    let file = File::open(path)?;
    let reader = BufReader::new(file);

    let mut chunk = Vec::with_capacity(chunk_size);

    for line in reader.lines() {
        chunk.push(line?);

        if chunk.len() >= chunk_size {
            processor(&chunk)?;
            chunk.clear();
        }
    }

    // Process the final, partially filled chunk
    if !chunk.is_empty() {
        processor(&chunk)?;
    }

    Ok(())
}
```

## Vocabulary Reduction

### Frequency Filtering

Remove rare words:

```rust
let model = TrainerBuilder::new(dictionary)
    .order(5)
    .min_word_freq(10)  // Words appearing < 10 times ignored
    .train(&reader)?;
```

Memory impact:

| min_word_freq | Vocabulary Reduction | Memory Savings |
|---------------|---------------------|----------------|
| 1 | 0% | 0% |
| 5 | ~50% | ~30% |
| 10 | ~70% | ~50% |
| 100 | ~90% | ~70% |
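
What a frequency threshold does can be illustrated outside the library. The following is a minimal, dependency-free sketch and not libgrammstein's implementation: `count_words` and `filter_rare` are hypothetical helpers, and whitespace tokenization is an assumption made for brevity.

```rust
use std::collections::HashMap;

/// Count whitespace-separated tokens across all lines.
fn count_words<'a, I>(lines: I) -> HashMap<String, u64>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts = HashMap::new();
    for line in lines {
        for word in line.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Drop every word seen fewer than `min_freq` times.
/// Returns how many distinct words were removed.
fn filter_rare(counts: &mut HashMap<String, u64>, min_freq: u64) -> usize {
    let before = counts.len();
    counts.retain(|_, count| *count >= min_freq);
    before - counts.len()
}
```

Because natural-language vocabularies have a long tail of words that occur only once or twice, even a small threshold tends to remove a large share of distinct words while touching few running tokens.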

### Vocabulary Capping

Limit vocabulary size:

```rust
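use std::collections::HashMap;

// Keeps the `max_vocab` most frequent words; ties at the cutoff are broken arbitrarily.
fn cap_vocabulary(word_counts: &mut HashMap<String, u64>, max_vocab: usize) {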
    if word_counts.len() <= max_vocab {
        return;
    }

    // Sort by frequency
    let mut sorted: Vec<_> = word_counts.iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(a.1));

    // Keep only top words
    let to_remove: Vec<_> = sorted[max_vocab..].iter()
        .map(|(w, _)| w.to_string())
        .collect();

    for word in to_remove {
        word_counts.remove(&word);
    }
}
```

## N-gram Pruning

### Count-based Pruning

Remove low-count n-grams:

```rust
fn prune_ngrams<D>(model: &mut NgramModel<D>, min_count: u64)
where
    D: IterableDictionary + MutableMappedDictionary<Value = NgramEntry>,
{
    let to_remove: Vec<String> = model.trie()
        .iter_entries()
        .filter(|(_, entry)| entry.count() < min_count)
        .map(|(key, _)| key)
        .collect();

    for key in to_remove {
        model.trie_mut().remove(&key);
    }
}
```

### Entropy-based Pruning

Keep only informative n-grams:

```rust
fn entropy_prune<D>(model: &mut NgramModel<D>, threshold: f64)
where
    D: IterableDictionary + MutableMappedDictionary<Value = NgramEntry>,
{
    // Remove n-grams that add little information beyond backoff
    // (Advanced technique - requires careful implementation)
}
```
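
One way to make "adds little information beyond backoff" concrete is to compare the probability an n-gram assigns to its last word with the probability the shorter-history estimate would assign. The sketch below is a simplified score in the spirit of relative-entropy pruning, under stated assumptions: it uses raw maximum-likelihood estimates and ignores the renormalization of backoff weights that a full implementation has to account for. `NgramCounts`, `pruning_score`, and `should_prune` are hypothetical names, not part of libgrammstein.

```rust
/// Counts needed to score one n-gram `h w` (history `h`, final word `w`),
/// where `h'` is `h` with its oldest word dropped.
struct NgramCounts {
    full: u64,            // count(h w)
    history: u64,         // count(h)
    backoff: u64,         // count(h' w)
    backoff_history: u64, // count(h')
    total: u64,           // total n-gram tokens of this order
}

/// Joint probability of the n-gram times the absolute log-ratio between the
/// full estimate p(w | h) and the backed-off estimate p(w | h').
/// Returns `None` when any count is zero, since the ratio is undefined then.
fn pruning_score(c: &NgramCounts) -> Option<f64> {
    if c.full == 0 || c.history == 0 || c.backoff == 0 || c.backoff_history == 0 || c.total == 0 {
        return None;
    }
    let p_full = c.full as f64 / c.history as f64;
    let p_backoff = c.backoff as f64 / c.backoff_history as f64;
    let p_joint = c.full as f64 / c.total as f64;
    Some(p_joint * (p_full.ln() - p_backoff.ln()).abs())
}

/// An n-gram is a pruning candidate when its score falls below `threshold`.
/// N-grams whose score is undefined are kept.
fn should_prune(c: &NgramCounts, threshold: f64) -> bool {
    pruning_score(c).map_or(false, |score| score < threshold)
}
```

An n-gram whose full and backed-off estimates agree scores zero and can be dropped without changing predictions, while a frequent n-gram that sharply disagrees with its backoff scores high and is kept.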

## Parallel Processing

### Multi-threaded Training

libgrammstein uses Rayon for parallel processing:

```rust
// Configure thread pool
rayon::ThreadPoolBuilder::new()
    .num_threads(16)  // Use 16 threads
    .build_global()
    .unwrap();

// Training automatically parallelizes
let model = TrainerBuilder::new(dictionary)
    .order(5)
    .batch_size(100_000)  // Larger batches = better parallelism
    .train(&reader)?;
```
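
Batching helps because each batch can be counted independently and the partial results summed afterwards. The sketch below illustrates that pattern with `std::thread::scope` instead of Rayon so that it has no dependencies. It is not the library's training loop, and `count_bigrams` and `parallel_bigram_counts` are hypothetical helpers.

```rust
use std::collections::HashMap;
use std::thread;

/// Count adjacent word pairs in one batch of sentences.
fn count_bigrams(batch: &[&str]) -> HashMap<(String, String), u64> {
    let mut counts = HashMap::new();
    for sentence in batch {
        let words: Vec<&str> = sentence.split_whitespace().collect();
        for pair in words.windows(2) {
            *counts
                .entry((pair[0].to_string(), pair[1].to_string()))
                .or_insert(0) += 1;
        }
    }
    counts
}

/// Count each batch on its own thread, then sum the per-batch maps.
/// One thread per batch keeps the sketch short; a thread pool bounds this.
fn parallel_bigram_counts(sentences: &[&str], batch_size: usize) -> HashMap<(String, String), u64> {
    let partials: Vec<_> = thread::scope(|s| {
        // Spawn all workers first so they run concurrently, then join them.
        let handles: Vec<_> = sentences
            .chunks(batch_size.max(1))
            .map(|batch| s.spawn(move || count_bigrams(batch)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    let mut merged = HashMap::new();
    for partial in partials {
        for (bigram, n) in partial {
            *merged.entry(bigram).or_insert(0) += n;
        }
    }
    merged
}
```

Larger batches amortize the cost of the final merge over more work per thread, at the price of larger temporary maps held in memory at once.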

### Distributed Training

For very large corpora, split across machines:

```rust
// Machine 1: Train on first half
let model1 = train_on_shard("shard1.txt")?;
model1.save("model_shard1.bin")?;

// Machine 2: Train on second half
let model2 = train_on_shard("shard2.txt")?;
model2.save("model_shard2.bin")?;

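// Coordinator: load the saved shard models, then merge.
// (`train_on_shard` and `merge_models` are placeholder helpers, not shown here.)
let model1 = NgramModel::load("model_shard1.bin")?;
let model2 = NgramModel::load("model_shard2.bin")?;
let merged = merge_models(&[model1, model2])?;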
```
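
Merging works because n-gram counts are additive across shards. One consequence is that count thresholds should be applied to the merged totals: an n-gram seen 6 times on each of two shards would be dropped by a per-shard threshold of 10 even though its true count is 12. A minimal sketch over plain count maps follows; `Counts`, `merge_counts`, and `merge_then_prune` are hypothetical and stand in for whatever the model's merge routine does internally.

```rust
use std::collections::HashMap;

/// N-gram (words joined by spaces) mapped to its count on one shard.
type Counts = HashMap<String, u64>;

/// Sum per-shard counts into one map.
fn merge_counts(shards: Vec<Counts>) -> Counts {
    let mut merged = Counts::new();
    for shard in shards {
        for (ngram, n) in shard {
            *merged.entry(ngram).or_insert(0) += n;
        }
    }
    merged
}

/// Apply the count threshold once, on the merged totals.
fn merge_then_prune(shards: Vec<Counts>, min_count: u64) -> Counts {
    let mut merged = merge_counts(shards);
    merged.retain(|_, n| *n >= min_count);
    merged
}
```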

## Checkpointing

### Periodic Saves

Save progress during long training:

```rust
use std::path::PathBuf;
use std::time::{Duration, Instant};

struct CheckpointManager {
    interval: Duration,
    last_checkpoint: Instant,
    path: PathBuf,
}

impl CheckpointManager {
    fn maybe_checkpoint<D>(&mut self, model: &NgramModel<D>) -> Result<()>
    where
        D: MutableMappedDictionary<Value = NgramEntry> + Serialize,
    {
        if self.last_checkpoint.elapsed() >= self.interval {
            let checkpoint_path = self.path.join(format!(
                "checkpoint_{}.bin",
                chrono::Utc::now().format("%Y%m%d_%H%M%S")
            ));
            model.save(&checkpoint_path)?;
            self.last_checkpoint = Instant::now();
            log::info!("Checkpoint saved: {:?}", checkpoint_path);
        }
        Ok(())
    }
}
```

### Resume from Checkpoint

```rust
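fn resume_training<D>(checkpoint_path: &str, corpus: &impl CorpusReader) -> Result<NgramModel<D>>
where
    // Bound mirrors the other examples in this guide; `load` may require more.
    D: MutableMappedDictionary<Value = NgramEntry>,
{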
    // Load checkpoint
    let mut model: NgramModel<D> = NgramModel::load(checkpoint_path)?;

    // Continue training (implementation depends on your needs)
    // For n-grams, you might track which sentences were processed
    // and resume from there

    Ok(model)
}
```
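
One simple way to track which sentences were processed is to store a line counter next to each checkpoint and skip that many lines on restart. The sketch below assumes a line-oriented corpus that is read in the same order on every run. `resume_from_line` is a hypothetical helper, and `train_line` stands in for whatever updates the model.

```rust
use std::io::{self, BufRead};

/// Feed every line after the first `already_processed` lines to `train_line`.
/// Returns the new number of processed lines, to be stored with the next checkpoint.
fn resume_from_line<R, F>(reader: R, already_processed: u64, mut train_line: F) -> io::Result<u64>
where
    R: BufRead,
    F: FnMut(&str),
{
    let mut processed = already_processed;
    // Skipped lines are still read from the source, just not trained on.
    for line in reader.lines().skip(already_processed as usize) {
        let line = line?;
        train_line(line.as_str());
        processed += 1;
    }
    Ok(processed)
}
```

Skipping still costs a pass over the already-seen prefix. For very large files, storing a byte offset and seeking to it avoids that cost, but only works for uncompressed input.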

## Memory-Mapped Files

For very large models:

```rust
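use memmap2::MmapMut;
use std::fs::OpenOptions;
use std::io::Result;

fn mmap_model_storage(path: &str, size: usize) -> Result<MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;

    // Grow (or truncate) the backing file to the requested size before mapping
    file.set_len(size as u64)?;

    // SAFETY: the caller must ensure no other process truncates or rewrites
    // the file while the mapping is alive.
    unsafe { MmapMut::map_mut(&file) }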
}
```

## Dictionary Backend Selection

| Backend | Memory | Build Time | Query Time |
|---------|--------|------------|------------|
| `PathMapDictionary` | High | Fast | Fast |
| `DynamicDawgChar` | Low | Medium | Medium |
| `DoubleArrayTrieChar` | Lowest | Slow | Fastest |

For large corpora, use compressed backends:

```rust
// Start with DynamicDawgChar for training (allows updates)
let training_dict = DynamicDawgChar::new();
let model = TrainerBuilder::new(training_dict)
    .order(5)
    .train(&reader)?;

// Save to portable format
model.save_portable("model.portable.bin")?;

// Load into DoubleArrayTrieChar for production (smallest, fastest)
let production_model = NgramModel::load_portable(
    "model.portable.bin",
    || DoubleArrayTrieChar::new()
)?;
```

## Memory Estimation

Estimate memory requirements before training:

```rust
/// Returns rough estimates in bytes: (n-gram memory, embedding memory).
fn estimate_memory(
    corpus_words: u64,
    vocab_size: u64,
    order: usize,
    dim: usize,  // For embeddings
) -> (u64, u64) {
    // N-gram estimate (rough upper bound: every corpus position contributes
    // up to `order` n-grams, and repeated n-grams are not deduplicated)
    // Assumes ~10 bytes per n-gram entry average
    let ngram_entries = corpus_words * order as u64;
    let ngram_memory = ngram_entries * 10;

    // Embedding estimate
    // vocab_size * dim * 4 bytes (f32)
    // + bucket_count * dim * 4 bytes
    let bucket_count = 2_000_000u64;
    let embedding_memory = (vocab_size + bucket_count) * dim as u64 * 4;

    (ngram_memory, embedding_memory)
}

// Both values are in bytes; divide by 1_000_000 to report MB.
let (ngram_bytes, embed_bytes) = estimate_memory(1_000_000_000, 1_000_000, 5, 100);
println!("Estimated N-gram memory: {} MB", ngram_bytes / 1_000_000);      // 50000 MB
println!("Estimated Embedding memory: {} MB", embed_bytes / 1_000_000);  // 1200 MB
```

## CLI for Large Corpora

```bash
# Train with checkpoints
grammstein train ngram enwiki.xml.bz2 model.bin \
    --order 5 \
    --min-count 10 \
    --checkpoint ./checkpoints \
    --checkpoint-interval 100000

# Resume from checkpoint
grammstein train ngram enwiki.xml.bz2 model.bin \
    --resume ./checkpoints/latest.ckpt

# HTTP streaming (no download required)
grammstein train ngram \
    "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2" \
    model.bin \
    --order 5
```

## Best Practices

1. **Profile first** - Understand where memory is going
2. **Stream when possible** - Don't load entire corpus into RAM
3. **Filter aggressively** - Remove rare words early
4. **Checkpoint frequently** - Long jobs can fail
5. **Use appropriate backends** - Compressed for storage, fast for serving
6. **Monitor progress** - Know when something is wrong

## Scaling Guidelines

| Corpus Size | Order | min_count | Approx. Memory |
|-------------|-------|-----------|----------------|
| 10M words | 5 | 5 | 2-4 GB |
| 100M words | 5 | 10 | 10-20 GB |
| 1B words | 5 | 20 | 50-100 GB |
| 10B+ words | 3-4 | 50+ | 200+ GB |

## Google Books N-gram Import: Cached-File Mode

For Google Books n-gram imports specifically, the `--cache-files` flag
decouples the download from the parse pipeline. Each worker downloads the
raw `.gz` to a local cache directory first, then imports from disk.

```bash
grammstein train import-google-books \
    --language en \
    --orders 1..=5 \
    --output english.artrie \
    --parallel 8 \
    --cache-files
```

### When to enable

- **Unstable upstream connection.** A failed HTTP stream mid-parse wastes
  hours of CPU; cached-file mode lets a retry resume from the local copy
  (or from a partial download via HTTP 206 Range).
- **Long-running imports.** Network blips ~6 hours into a 30-hour import
  are recoverable without restarting from scratch.
- **Debugging.** Reproducing parser/encoder issues against a fixed input
  is easier when the input lives on disk.

### Mechanics

| Property | Value |
|---|---|
| Cache location | `{output_path_parent}/grammstein-cache/` |
| Filename scheme | `googlebooks-{corpus_id}-all-{order}gram-{VERSION}-{prefix}.gz` |
| Atomicity | Downloads to `.gz.downloading` and atomically renames on completion |
| Resume on partial | HTTP Range request from existing byte offset |
| 416 recovery | Stale partial deleted, full re-download issued |
| Cleanup | Removed on successful import and on final failure (all retries exhausted) |
| Retained on retry | Cached file preserved across retryable errors so the retry reuses it |
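
The download-side rules in this table can be sketched with the standard library alone. The code below is an illustration of the pattern, not the importer's actual code: all four function names are hypothetical, and the HTTP transfer itself is left out.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// In-progress path for a cache entry (`x.gz` becomes `x.gz.downloading`).
fn partial_path(final_path: &Path) -> PathBuf {
    let mut name = final_path.as_os_str().to_owned();
    name.push(".downloading");
    PathBuf::from(name)
}

/// `Range` header value that resumes after the bytes already on disk, if any.
fn resume_range(final_path: &Path) -> Option<String> {
    match fs::metadata(partial_path(final_path)) {
        Ok(meta) if meta.len() > 0 => Some(format!("bytes={}-", meta.len())),
        _ => None,
    }
}

/// Publish a finished download under its final name with a single rename,
/// so readers never observe a half-written `.gz`.
fn finalize(final_path: &Path) -> io::Result<()> {
    fs::rename(partial_path(final_path), final_path)
}

/// After HTTP 416 the partial file is stale: delete it so the next attempt
/// starts from byte 0. A missing file is not treated as an error.
fn discard_stale_partial(final_path: &Path) -> io::Result<()> {
    match fs::remove_file(partial_path(final_path)) {
        Err(e) if e.kind() != io::ErrorKind::NotFound => Err(e),
        _ => Ok(()),
    }
}
```

Keeping the partial file in the same directory as the final file means the rename stays on one file system, which is what makes it atomic on POSIX systems.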

The cache layer is orthogonal to the chunked-transaction
(`--tx-chunk-size`) and lock-free-flush-threshold
(`--lockfree-flush-threshold`) memory bounds — those affect the *write*
path, while `--cache-files` affects the *download* path.

See `docs/cli/import-google-books.md` for the full flag reference and
`docs/architecture/memory-optimization.md` for the broader design context.

## See Also

- [N-gram Training](ngram.md) - Training workflow
- [Hyperparameters](hyperparameters.md) - Tuning for size/quality
- [CLI Reference](../cli/README.md) - Command-line options
- [Google Books Import Flags](../cli/import-google-books.md) - Memory + reliability tuning
- [Memory Optimization Architecture](../architecture/memory-optimization.md) - Importer internals