# Hybrid Model Training Guide
This guide covers training and using hybrid language models that combine n-grams with embeddings.
## Overview
Hybrid models leverage the strengths of both approaches:
| Approach | Strengths | Weaknesses |
|----------|-----------|------------|
| N-gram | Accurate for seen n-grams | Poor out-of-vocabulary (OOV) handling |
| Embedding | Semantic similarity, OOV handling | Less precise probabilities |
| **Hybrid** | Best of both | Slightly more complex |
## Training Workflow
### Step 1: Train N-gram Model
```rust
use libgrammstein::ngram::{TrainerBuilder, NgramEntry};
use libgrammstein::corpus::PlaintextReader;
use liblevenshtein::dictionary::dynamic_dawg_char::DynamicDawgChar;
let reader = PlaintextReader::from_file("corpus.txt")?;
let ngram_model = TrainerBuilder::new(DynamicDawgChar::new())
    .order(5)
    .min_word_freq(5)
    .train(&reader)?;
```
### Step 2: Train Embedding Model
```rust
use libgrammstein::embedding::EmbeddingTrainerBuilder;
// Re-read corpus (readers are consumed)
let reader2 = PlaintextReader::from_file("corpus.txt")?;
let embedding_model = EmbeddingTrainerBuilder::new()
    .dim(100)
    .window_size(5)
    .min_count(5)
    .epochs(5)
    .train(&reader2)?;
```
### Step 3: Create Hybrid Model
```rust
use libgrammstein::hybrid::{HybridLanguageModel, HybridConfig, InterpolationStrategy};
let config = HybridConfig {
    strategy: InterpolationStrategy::Linear { alpha: 0.7 },
    cache_size: 50_000,
    ..Default::default()
};
let hybrid = HybridLanguageModel::new(ngram_model, embedding_model, config);
```
## Interpolation Strategies
### Linear Interpolation
Combines probabilities in linear space: `P = alpha * P_ngram + (1 - alpha) * P_embedding`, where `alpha` is the n-gram weight.
```rust
InterpolationStrategy::Linear { alpha: 0.7 }
```
**Best for:** General use, balanced performance.
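The mixing rule can be sketched in a few lines. This is a minimal illustration, assuming `alpha` is the weight on the n-gram probability; `linear_interpolate` is a hypothetical helper, not part of libgrammstein.

```rust
/// Hypothetical helper (not a libgrammstein API): mixes two probabilities
/// in linear space. `alpha` is the weight given to the n-gram probability.
fn linear_interpolate(p_ngram: f64, p_embedding: f64, alpha: f64) -> f64 {
    alpha * p_ngram + (1.0 - alpha) * p_embedding
}
```

Because the terms are added, a word the n-gram model has never seen (`p_ngram = 0.0`) still receives the embedding model's share of the probability.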
### Log-Linear Interpolation
Combines log probabilities: the combined log score is `alpha * log P_ngram + (1 - alpha) * log P_embedding`.
```rust
InterpolationStrategy::LogLinear { alpha: 0.7 }
```
**Best for:** When both models are well-calibrated.
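A minimal sketch of log-space mixing, assuming natural logarithms and `alpha` as the n-gram weight. `log_linear_interpolate` is a hypothetical helper, and the result is an unnormalized log score rather than a proper probability.

```rust
/// Hypothetical helper (not a libgrammstein API): mixes two probabilities
/// in log space and returns an unnormalized log score.
fn log_linear_interpolate(p_ngram: f64, p_embedding: f64, alpha: f64) -> f64 {
    alpha * p_ngram.ln() + (1.0 - alpha) * p_embedding.ln()
}
```

Unlike the linear form, a zero probability from either model drives the whole score to negative infinity, which is one reason this strategy is sensitive to calibration.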
### N-gram with Fallback
Uses the n-gram model for known words and the embedding model for OOV words:
```rust
InterpolationStrategy::NgramWithEmbeddingFallback
```
**Best for:** When n-gram quality is high but OOV handling is still needed.
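The selection logic can be pictured as follows. This is a sketch under the assumption that the n-gram lookup reports "unknown" explicitly (modeled here as `None`); `ngram_with_fallback` is a hypothetical helper, not the library's implementation.

```rust
/// Hypothetical helper: prefer the n-gram probability when the word is known,
/// otherwise fall back to the embedding-based estimate.
fn ngram_with_fallback(p_ngram: Option<f64>, p_embedding: f64) -> f64 {
    p_ngram.unwrap_or(p_embedding)
}
```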
### Dynamic Weighting
Adjusts weight based on available context:
```rust
InterpolationStrategy::Dynamic {
    base_alpha: 0.5,        // Base n-gram weight
    alpha_per_context: 0.1, // +0.1 per context word
    max_alpha: 0.9,         // Maximum n-gram weight
}
```
With 3 context words: α = 0.5 + 0.1 × 3 = 0.8
**Best for:** Variable-length contexts where n-gram quality improves with more context.
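The arithmetic above can be written as a small function. This sketch assumes the weight grows linearly with the number of context words and is capped at `max_alpha`; `dynamic_alpha` is a hypothetical helper that mirrors the field names of the `Dynamic` variant.

```rust
/// Hypothetical helper: n-gram weight for a given number of context words,
/// capped at `max_alpha`.
fn dynamic_alpha(
    base_alpha: f64,
    alpha_per_context: f64,
    max_alpha: f64,
    context_len: usize,
) -> f64 {
    (base_alpha + alpha_per_context * context_len as f64).min(max_alpha)
}
```

With the values shown, three context words give a weight of about 0.8, and any context of four or more words is held at the 0.9 cap.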
## Choosing Alpha
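The `alpha` parameter sets the n-gram weight; the embedding model receives the remaining `1 - alpha`:
| Alpha | N-gram weight | Embedding weight | Recommended for |
|-------|---------------|------------------|-----------------|
| 0.9 | 90% | 10% | High-quality n-gram, rare OOV |
| 0.7 | 70% | 30% | Balanced (default) |
| 0.5 | 50% | 50% | Equal weighting |
| 0.3 | 30% | 70% | Small n-gram corpus |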
### Tuning Alpha
Find optimal alpha on held-out data:
```rust
// `D` is the dictionary backend; add any trait bounds that
// `HybridLanguageModel<D>` requires in your version of the library.
fn tune_alpha<D>(
    ngram: &NgramModel<D>,
    embedding: &SubwordEmbedding,
    dev_corpus: &impl CorpusReader,
) -> f64
where
    NgramModel<D>: Clone,
{
    let mut best_alpha = 0.5;
    let mut best_ppl = f64::INFINITY;
    // Grid search over alpha = 0.1, 0.2, ..., 0.9
    for alpha_int in 1..=9 {
        let alpha = alpha_int as f64 / 10.0;
        let config = HybridConfig {
            strategy: InterpolationStrategy::Linear { alpha },
            ..Default::default()
        };
        let hybrid = HybridLanguageModel::new(
            ngram.clone(),
            embedding.clone(),
            config,
        );
        // `evaluate_perplexity` is defined in the Evaluation section
        let ppl = evaluate_perplexity(&hybrid, dev_corpus);
        if ppl < best_ppl {
            best_ppl = ppl;
            best_alpha = alpha;
        }
        println!("α={:.1}: perplexity={:.2}", alpha, ppl);
    }
    println!("Best: α={:.1} (ppl={:.2})", best_alpha, best_ppl);
    best_alpha
}
```
## Temperature Parameter
Controls sharpness of embedding probabilities:
```rust
let config = HybridConfig {
    temperature: 1.0, // Default: neutral
    ..Default::default()
};
```
| Temperature | Effect |
|-------------|--------|
| < 1.0 | Sharper distribution, more confident |
| 1.0 | Neutral (default) |
| > 1.0 | Smoother distribution, less confident |
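The effect of the temperature can be illustrated with a temperature-scaled softmax. This sketch assumes scores are divided by the temperature before normalization, which mirrors the Sentence Generation example in this guide. How `HybridConfig::temperature` is applied internally is not shown here, and `softmax_with_temperature` is a hypothetical helper.

```rust
/// Hypothetical helper: softmax over raw scores with a temperature.
/// Returns probabilities that sum to 1 (or an empty vector for empty input).
fn softmax_with_temperature(scores: &[f64], temperature: f64) -> Vec<f64> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the maximum for numerical stability before exponentiating.
    let max = scores.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores
        .iter()
        .map(|s| ((s - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

For the same scores, a temperature below 1.0 concentrates more probability on the top-scoring word, while a temperature above 1.0 spreads it more evenly.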
## Caching
The hybrid model caches computed scores:
```rust
let config = HybridConfig {
    cache_size: 50_000, // Cache 50k (word, context) pairs
    ..Default::default()
};
// Clear cache when needed
hybrid.clear_cache();
```
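To make the idea of a `(word, context)` cache concrete, here is a minimal bounded memoization sketch. Everything in it is an assumption for illustration: `ScoreCache` is a hypothetical type, and the eviction policy (drop all entries when full) is the simplest possible choice, not necessarily what the library does.

```rust
use std::collections::HashMap;

/// Hypothetical bounded score cache keyed by (word, context).
struct ScoreCache {
    capacity: usize,
    entries: HashMap<(String, Vec<String>), f64>,
}

impl ScoreCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new() }
    }

    /// Returns the cached score, or computes and stores it on a miss.
    fn get_or_compute(
        &mut self,
        word: &str,
        context: &[&str],
        compute: impl FnOnce() -> f64,
    ) -> f64 {
        let key = (
            word.to_string(),
            context.iter().map(|c| c.to_string()).collect::<Vec<_>>(),
        );
        if let Some(&score) = self.entries.get(&key) {
            return score;
        }
        // Assumed eviction policy: drop everything once the cache is full.
        if self.entries.len() >= self.capacity {
            self.entries.clear();
        }
        let score = compute();
        self.entries.insert(key, score);
        score
    }

    /// Empties the cache, e.g. after the underlying models change.
    fn clear(&mut self) {
        self.entries.clear();
    }
}
```

A larger capacity trades memory for fewer recomputations when the same word and context are scored repeatedly, as happens when ranking many candidates against one context.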
## Evaluation
### Perplexity
```rust
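// `D` is the dictionary backend; add any trait bounds that
// `HybridLanguageModel<D>` requires in your version of the library.
fn evaluate_perplexity<D>(
    model: &HybridLanguageModel<D>,
    test_corpus: &impl CorpusReader,
) -> f64 {
    let mut total_log_prob = 0.0;
    let mut total_words = 0usize;
    for sentence in test_corpus.sentences() {
        let tokens: Vec<&str> = sentence.split_whitespace().collect();
        total_log_prob += model.sentence_log_prob(&tokens);
        total_words += tokens.len();
    }
    // An empty corpus would otherwise yield NaN (0.0 / 0.0)
    if total_words == 0 {
        return f64::INFINITY;
    }
    // Perplexity = exp(-average log probability per word)
    (-total_log_prob / total_words as f64).exp()
}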
```
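The perplexity formula itself can be checked in isolation. This sketch assumes per-token natural-log probabilities (consistent with the `.exp()` call above); `perplexity_from_log_probs` is a hypothetical helper.

```rust
/// Hypothetical helper: perplexity from per-token natural-log probabilities,
/// computed as exp(-mean log probability). Returns None for an empty input.
fn perplexity_from_log_probs(log_probs: &[f64]) -> Option<f64> {
    if log_probs.is_empty() {
        return None; // perplexity is undefined for an empty sequence
    }
    let mean = log_probs.iter().sum::<f64>() / log_probs.len() as f64;
    Some((-mean).exp())
}
```

As a sanity check, a model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as a uniform choice among four words.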
### OOV Performance
```rust
// `D` is the dictionary backend; add any trait bounds that
// `HybridLanguageModel<D>` requires in your version of the library.
fn evaluate_oov_handling<D>(
    model: &HybridLanguageModel<D>,
    oov_sentences: &[Vec<&str>],
) {
    for sentence in oov_sentences {
        let score = model.sentence_log_prob(sentence);
        let ppl = model.perplexity(sentence);
        println!("Sentence: {:?}", sentence);
        println!("  Log prob: {:.4}", score);
        println!("  Perplexity: {:.2}", ppl);
    }
}
// Test with sentences containing OOV words
let oov_test = vec![
    vec!["the", "xyzzy", "jumped"],
    vec!["qwertyuiop", "is", "a", "word"],
];
evaluate_oov_handling(&hybrid, &oov_test);
```
## Serialization
### Binary Format
```rust
// Save (requires serde-extras feature and D: Serialize)
hybrid.save("hybrid.bin")?;
// Load
let loaded: HybridLanguageModel<DynamicDawgChar<NgramEntry>> =
    HybridLanguageModel::load("hybrid.bin")?;
```
### Portable Format
```rust
// Save portable (works with any D)
hybrid.save_portable("hybrid.portable.bin")?;
// Load with different backend
let loaded = HybridLanguageModel::load_portable(
    "hybrid.portable.bin",
    || DoubleArrayTrieChar::new()
)?;
```
## CLI Training
```bash
# Train hybrid model
grammstein train hybrid corpus.txt hybrid.bin \
    --ngram-order 5 \
    --embed-dim 100 \
    --lambda 0.7
# This trains both components and saves the combined model
```
## Use Cases
### Spell Correction Ranking
```rust
// The returned slices borrow from `candidates`, hence the explicit lifetime.
// `D` is the dictionary backend; add any trait bounds the library requires.
fn rank_corrections<'a, D>(
    model: &HybridLanguageModel<D>,
    context: &[&str],
    candidates: &[&'a str],
) -> Vec<(&'a str, f64)> {
    let mut scored: Vec<_> = candidates.iter()
        .map(|&c| (c, model.score(c, context)))
        .collect();
    // Sort by descending score; total_cmp cannot panic, even if a score is NaN
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
let context = ["the", "quick", "brown"];
let candidates = ["fox", "fix", "fax", "fog"];
let ranked = rank_corrections(&hybrid, &context, &candidates);
println!("Best correction: {}", ranked[0].0);
```
### Language Detection Fallback
```rust
// `D` is the dictionary backend; add any trait bounds the library requires.
fn is_likely_target_language<D>(
    model: &HybridLanguageModel<D>,
    sentence: &[&str],
    threshold: f64,
) -> bool {
    let ppl = model.perplexity(sentence);
    ppl < threshold // Lower perplexity = more likely
}
```
### Sentence Generation
```rust
// `D` is the dictionary backend; add any trait bounds the library requires.
fn generate_next_word<D>(
    model: &HybridLanguageModel<D>,
    context: &[&str],
    vocabulary: &[&str],
    temperature: f64,
) -> String {
    // Score all vocabulary words
    let scores: Vec<(String, f64)> = vocabulary.iter()
        .map(|&w| (w.to_string(), model.score(w, context)))
        .collect();
    // Apply temperature (must be > 0); subtract the max score for numerical stability
    let max_score = scores.iter().map(|(_, s)| *s).fold(f64::NEG_INFINITY, f64::max);
    let probs: Vec<f64> = scores.iter()
        .map(|(_, s)| ((s - max_score) / temperature).exp())
        .collect();
    // Normalize so the probabilities sum to 1
    let sum: f64 = probs.iter().sum();
    let probs: Vec<f64> = probs.iter().map(|p| p / sum).collect();
    // Sample from distribution (`sample_from_distribution` is a user-supplied helper)
    sample_from_distribution(&scores, &probs)
}
```
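The `sample_from_distribution` helper is not defined in this guide. One possible std-only sketch uses inverse-CDF sampling: walk the cumulative probabilities until they exceed a uniform draw. Both functions below are hypothetical, and the `RandomState`-based random number is only a stand-in to avoid external dependencies; a real program would more likely use the `rand` crate.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Picks the word whose cumulative-probability interval contains `u`,
/// where `u` is expected to lie in [0, 1).
fn sample_with_uniform(scores: &[(String, f64)], probs: &[f64], u: f64) -> String {
    assert!(!scores.is_empty() && scores.len() == probs.len());
    let mut cumulative = 0.0;
    for ((word, _), p) in scores.iter().zip(probs) {
        cumulative += p;
        if u < cumulative {
            return word.clone();
        }
    }
    // Rounding can leave the total slightly below 1; fall back to the last word.
    scores[scores.len() - 1].0.clone()
}

/// Std-only stand-in for the helper called by `generate_next_word`.
fn sample_from_distribution(scores: &[(String, f64)], probs: &[f64]) -> String {
    // Each RandomState carries randomly seeded keys, so the hash of an
    // empty input differs from one instance to the next.
    let bits = RandomState::new().build_hasher().finish();
    // Keep the top 53 bits to form a uniform f64 in [0, 1).
    let u = (bits >> 11) as f64 / (1u64 << 53) as f64;
    sample_with_uniform(scores, probs, u)
}
```

Separating the deterministic part (`sample_with_uniform`) from the random draw keeps the sampling logic easy to test with fixed values of `u`.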
## Complete Example
```rust
use libgrammstein::hybrid::{HybridLanguageModel, HybridConfig, InterpolationStrategy};
use libgrammstein::ngram::{TrainerBuilder, NgramEntry};
use libgrammstein::embedding::EmbeddingTrainerBuilder;
use libgrammstein::corpus::PlaintextReader;
use liblevenshtein::dictionary::dynamic_dawg_char::DynamicDawgChar;
fn main() -> libgrammstein::Result<()> {
    // Load corpus
    let corpus_path = "corpus.txt";
    // Train n-gram
    println!("Training n-gram model...");
    let reader1 = PlaintextReader::from_file(corpus_path)?;
    let ngram = TrainerBuilder::new(DynamicDawgChar::new())
        .order(5)
        .train(&reader1)?;
    // Train embeddings
    println!("Training embeddings...");
    let reader2 = PlaintextReader::from_file(corpus_path)?;
    let embedding = EmbeddingTrainerBuilder::new()
        .dim(100)
        .epochs(5)
        .train(&reader2)?;
    // Create hybrid
    let config = HybridConfig {
        strategy: InterpolationStrategy::Linear { alpha: 0.7 },
        ..Default::default()
    };
    let hybrid = HybridLanguageModel::new(ngram, embedding, config);
    // Test
    let test_sentence = ["the", "quick", "brown", "fox"];
    println!("\nTest sentence: {:?}", test_sentence);
    println!("Log probability: {:.4}", hybrid.sentence_log_prob(&test_sentence));
    println!("Perplexity: {:.2}", hybrid.perplexity(&test_sentence));
    // OOV test
    let oov_sentence = ["the", "xyzzy", "jumped"];
    println!("\nOOV sentence: {:?}", oov_sentence);
    println!("Log probability: {:.4}", hybrid.sentence_log_prob(&oov_sentence));
    println!("Perplexity: {:.2}", hybrid.perplexity(&oov_sentence));
    // Save
    hybrid.save("hybrid-model.bin")?;
    println!("\nModel saved");
    Ok(())
}
```
## See Also
- [HybridLanguageModel API](../api/hybrid.md) - Complete API reference
- [N-gram Training](ngram.md) - N-gram component training
- [Embedding Training](embedding.md) - Embedding component training
- [Hyperparameters](hyperparameters.md) - Tuning guide