# Hybrid Language Model
This document explains how libgrammstein combines N-gram models with subword embeddings into a unified hybrid language model.
## Motivation: Best of Both Worlds
N-gram models and embedding models have complementary strengths:

| Model | Strengths | Weaknesses |
|-------|-----------|------------|
| N-gram | Precise local context, fast lookup, well-understood | OOV problem, sparse for long contexts |
| Embeddings | Semantic similarity, handles OOV via subwords | Ignores word order, weaker at local patterns |

The **hybrid model** combines both to get:
- **Local precision** from N-grams
- **Semantic coverage** from embeddings
- **OOV handling** through subword representations
## How the Hybrid Model Works
Given a token and its context, the hybrid model computes a weighted combination of scores:
```
score(word | context) = λ₁ × ngram_score + λ₂ × embedding_score

Where:
- ngram_score     = log P_MKN(word | context)
- embedding_score = log-space score derived from cosine_similarity(word, context_embedding)
- λ₁ + λ₂ = 1 (interpolation weights)
```
### Scoring Flow
```
Input: ("fox", context=["the", "quick", "brown"])
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       HybridLanguageModel                       │
│                                                                 │
│  ┌───────────────────────────┐   ┌───────────────────────────┐  │
│  │       N-gram Model        │   │      Embedding Model      │  │
│  │                           │   │                           │  │
│  │                           │   │  v_ctx = avg(embed(ctx))  │  │
│  │  Apply MKN smoothing      │   │                           │  │
│  │  with backoff chain       │   │  Compute similarity:      │  │
│  │                           │   │  sim = v_fox · v_ctx      │  │
│  │  Result: -3.2 (log prob)  │   │  Result: 0.75             │  │
│  └─────────────┬─────────────┘   └─────────────┬─────────────┘  │
│                │                               │                │
│                └───────────────┬───────────────┘                │
│                                ▼                                │
│                ┌───────────────────────────────┐                │
│                │      Interpolation Layer      │                │
│                │                               │                │
│                │  λ₁ = 0.8, λ₂ = 0.2           │                │
│                │  score = 0.8 × (-3.2)         │                │
│                │        + 0.2 × log(0.75)      │                │
│                │  score = -2.56 + (-0.058)     │                │
│                │  score = -2.618               │                │
│                └───────────────────────────────┘                │
│                                │                                │
└────────────────────────────────┼────────────────────────────────┘
                                 ▼
Output: log P(fox | the quick brown) = -2.618
```
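The arithmetic in the interpolation layer can be reproduced in a few lines of Rust. This is a minimal sketch rather than libgrammstein's code: `interpolate` is a hypothetical helper, and it follows the diagram in taking the natural log of the raw similarity. The `compute_embedding_score` method shown later in this document maps the similarity through `0.5 + 0.5 * sim` before taking the log, so its numbers differ from the diagram's.

```rust
/// Hypothetical helper (not part of libgrammstein): interpolate an N-gram
/// log probability with an embedding similarity, using weights that sum to 1.
fn interpolate(ngram_log_prob: f64, similarity: f64, ngram_weight: f64) -> f64 {
    // λ₂ = 1 - λ₁, so the two weights always sum to 1.
    let embedding_weight = 1.0 - ngram_weight;
    // As in the diagram, the similarity is moved into log space with ln().
    ngram_weight * ngram_log_prob + embedding_weight * similarity.ln()
}
```

With the diagram's inputs, `interpolate(-3.2, 0.75, 0.8)` evaluates to about -2.6175, which the diagram rounds to -2.618.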
## libgrammstein Implementation
### HybridLanguageModel Struct
```rust
pub struct HybridLanguageModel<D: MutableMappedDictionary<Value = NgramEntry>> {
    /// N-gram model with Modified Kneser-Ney smoothing
    ngram: NgramModel<D>,
    /// Subword embedding model
    embedding: SubwordEmbedding,
    /// Interpolation configuration
    config: HybridConfig,
    /// LRU cache for hot queries
    cache: Mutex<LruCache<CacheKey, f64>>,
}

#[derive(Clone, Debug)]
pub struct HybridConfig {
    /// Weight for N-gram score (default: 0.8)
    pub ngram_weight: f64,
    /// Weight for embedding score (default: 0.2)
    pub embedding_weight: f64,
    /// Cache size for frequently queried n-grams
    pub cache_size: usize,
    /// OOV handling strategy
    pub oov_strategy: OovStrategy,
}

#[derive(Clone, Debug)]
pub enum OovStrategy {
    /// Use only embedding score for OOV words
    EmbeddingOnly,
    /// Backoff to lower-order N-grams, supplement with embeddings
    BackoffWithEmbedding,
    /// Assign a fixed log probability to OOV
    FixedPenalty(f64),
}
```
### Core Scoring Methods
```rust
impl<D: MutableMappedDictionary<Value = NgramEntry>> HybridLanguageModel<D> {
    /// Score a word given its context
    pub fn score(&self, word: &str, context: &[&str]) -> f64 {
        // Check cache
        let cache_key = self.make_cache_key(word, context);
        if let Some(&cached) = self.cache.lock().unwrap().get(&cache_key) {
            return cached;
        }

        // Compute N-gram score
        let ngram_score = self.ngram.log_prob(word, context);

        // Compute embedding score
        let embedding_score = self.compute_embedding_score(word, context);

        // Interpolate
        let score = self.config.ngram_weight * ngram_score
            + self.config.embedding_weight * embedding_score;

        // Cache and return
        self.cache.lock().unwrap().put(cache_key, score);
        score
    }

    /// Compute embedding-based score
    fn compute_embedding_score(&self, word: &str, context: &[&str]) -> f64 {
        if context.is_empty() {
            return 0.0; // No context to compare against
        }

        // Get word embedding
        let word_emb = self.embedding.get_embedding(word);

        // Get context embedding (average of context word embeddings)
        let mut context_emb = Array1::zeros(self.embedding.dim());
        for ctx_word in context {
            context_emb += &self.embedding.get_embedding(ctx_word);
        }
        context_emb /= context.len() as f32;

        // Cosine similarity: the dot product must be divided by both norms,
        // because the averaged context vector is generally not unit length.
        let dot = word_emb.dot(&context_emb) as f64;
        let norms = (word_emb.dot(&word_emb) as f64).sqrt()
            * (context_emb.dot(&context_emb) as f64).sqrt();
        let similarity = if norms > 0.0 {
            (dot / norms).clamp(-1.0, 1.0)
        } else {
            0.0 // Zero vector: treat as neutral
        };

        // Convert similarity [-1, 1] to log probability
        // using log(0.5 + 0.5 * similarity). The small floor keeps
        // similarity = -1 from producing ln(0) = -inf.
        (0.5 + 0.5 * similarity).max(1e-10).ln()
    }

    /// Score a complete sentence
    pub fn sentence_log_prob(&self, tokens: &[&str]) -> f64 {
        if tokens.is_empty() {
            return 0.0;
        }

        let order = self.ngram.order();
        let mut total = 0.0;
        for i in 0..tokens.len() {
            // Use at most `order - 1` preceding tokens as context
            let context_start = i.saturating_sub(order.saturating_sub(1));
            let context = &tokens[context_start..i];
            let word = tokens[i];
            total += self.score(word, context);
        }
        total
    }
}
```
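The embedding side of the score (average the context vectors, take the cosine similarity, map it into log space) can be tried in isolation. The sketch below uses plain `Vec<f32>` slices instead of `ndarray` so that it has no dependencies. The function names, the zero-norm fallback and the `1e-10` floor are assumptions made for illustration, not part of libgrammstein.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 when either vector has zero norm (an assumption of this sketch).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| (*x as f64) * (*y as f64)).sum();
    let norm_a: f64 = a.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    let norm_b: f64 = b.iter().map(|x| (*x as f64).powi(2)).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a * norm_b)).clamp(-1.0, 1.0)
}

/// Element-wise average of the context vectors (all of length `dim`).
fn mean_vector(vectors: &[Vec<f32>], dim: usize) -> Vec<f32> {
    let mut mean = vec![0.0f32; dim];
    for v in vectors {
        for (m, x) in mean.iter_mut().zip(v) {
            *m += *x;
        }
    }
    // Guard against division by zero for an empty context.
    let n = vectors.len().max(1) as f32;
    mean.iter_mut().for_each(|m| *m /= n);
    mean
}

/// Map a similarity in [-1, 1] to a log score via ln(0.5 + 0.5 * sim).
/// The floor is an assumption that keeps sim = -1 from yielding -inf.
fn similarity_to_log_score(similarity: f64) -> f64 {
    (0.5 + 0.5 * similarity).max(1e-10).ln()
}
```

A perfect match (`similarity = 1.0`) maps to a log score of 0, and the score falls as the word points away from its context.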
## OOV Handling Strategies
When a word is out-of-vocabulary for the N-gram model, the hybrid model handles it in one of three ways, selected by `oov_strategy`:
### Strategy 1: Embedding Only
For OOV words, rely entirely on embedding similarity:
```rust
OovStrategy::EmbeddingOnly
// When "splendiferous" is OOV:
// - N-gram model backs off to uniform distribution (low score)
// - Embedding captures semantic similarity to known words
// - Final score weighted toward embedding
```
### Strategy 2: Backoff with Embedding
Use the N-gram backoff chain and supplement it with embedding similarity:
```rust
OovStrategy::BackoffWithEmbedding
// When "splendiferous" is OOV for 5-gram:
// 1. N-gram backs off: 5-gram → 4-gram → 3-gram → ...
// 2. Embedding provides semantic context
// 3. Both contribute to final score
```
### Strategy 3: Fixed Penalty
Assign a fixed log probability to OOV words:
```rust
OovStrategy::FixedPenalty(-10.0)
// When "splendiferous" is OOV:
// - ngram_score = -10.0 (fixed)
// - embedding_score computed normally
// - Interpolation applies
```
### OOV Detection
```rust
impl<D: MutableMappedDictionary<Value = NgramEntry>> HybridLanguageModel<D> {
    /// Check if a word is in the N-gram vocabulary
    pub fn is_known(&self, word: &str) -> bool {
        self.ngram.contains_unigram(word)
    }

    /// Get OOV rate for a token sequence (0.0 for an empty sequence)
    pub fn oov_rate(&self, tokens: &[&str]) -> f64 {
        if tokens.is_empty() {
            return 0.0; // Avoid 0/0 = NaN
        }
        let oov_count = tokens.iter().filter(|w| !self.is_known(w)).count();
        oov_count as f64 / tokens.len() as f64
    }
}
```
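The source does not show how `score` consults `oov_strategy`, so the following is only one plausible reading of the three strategies in this section. `combine_with_oov` and the free-standing `oov_rate` are hypothetical helpers, and the enum is a local copy of the one defined earlier in this document.

```rust
/// Local copy of the `OovStrategy` enum from this document.
#[derive(Clone, Debug)]
enum OovStrategy {
    EmbeddingOnly,
    BackoffWithEmbedding,
    FixedPenalty(f64),
}

/// Hypothetical dispatch: combine the two scores according to the strategy.
/// `ngram_score` is whatever the N-gram model returned after backoff.
fn combine_with_oov(
    strategy: &OovStrategy,
    is_known: bool,
    ngram_score: f64,
    embedding_score: f64,
    ngram_weight: f64,
) -> f64 {
    let embedding_weight = 1.0 - ngram_weight;
    let interpolate = |n: f64| ngram_weight * n + embedding_weight * embedding_score;
    if is_known {
        return interpolate(ngram_score);
    }
    match strategy {
        // Ignore the N-gram score entirely for OOV words.
        OovStrategy::EmbeddingOnly => embedding_score,
        // Keep the backed-off N-gram score and interpolate as usual.
        OovStrategy::BackoffWithEmbedding => interpolate(ngram_score),
        // Replace the N-gram score with the fixed penalty, then interpolate.
        OovStrategy::FixedPenalty(penalty) => interpolate(*penalty),
    }
}

/// Fraction of tokens the predicate reports as unknown; 0.0 for empty input.
fn oov_rate(tokens: &[&str], is_known: impl Fn(&str) -> bool) -> f64 {
    if tokens.is_empty() {
        return 0.0;
    }
    let oov = tokens.iter().filter(|t| !is_known(**t)).count();
    oov as f64 / tokens.len() as f64
}
```

Under this reading, `FixedPenalty(-10.0)` with weights 0.8 and 0.2 and an embedding score of -0.5 gives 0.8 × (-10.0) + 0.2 × (-0.5) = -8.1 for an OOV word.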
## Interpolation Strategies
libgrammstein supports multiple interpolation strategies:
### Linear Interpolation (Default)
Simple weighted average:
```rust
InterpolationStrategy::Linear { ngram_weight: 0.8, embedding_weight: 0.2 }
// score = 0.8 * ngram_score + 0.2 * embedding_score
```
### Log-Linear Interpolation
Multiply in probability space (add in log space):
```rust
InterpolationStrategy::LogLinear { ngram_weight: 0.8, embedding_weight: 0.2 }
// score = 0.8 * ngram_score + 0.2 * embedding_score
// (Same formula, but embedding_score is already in log space; a weighted sum
// of logs corresponds to the product P_ngram^0.8 * P_embedding^0.2.)
```
### Adaptive Interpolation
Adjust weights based on N-gram count:
```rust
InterpolationStrategy::Adaptive { base_ngram_weight: 0.9 }
// If N-gram has high count → trust N-gram more
// If N-gram has low count → trust embedding more
let ngram_count = self.ngram.get_count(word, context);
let confidence = (ngram_count as f64 / 100.0).min(1.0);
let ngram_weight = 0.5 + 0.4 * confidence; // 0.5 to 0.9
let embedding_weight = 1.0 - ngram_weight;
let score = ngram_weight * ngram_score + embedding_weight * embedding_score;
```
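The adaptive weighting rule can be isolated into a small pure function to see how the weights move with the count. This is a sketch: `adaptive_weights` is a hypothetical name, and the constants (0.5 floor, 0.4 range, saturation at a count of 100) are taken from the snippet in this section.

```rust
/// Adaptive interpolation weights as (ngram_weight, embedding_weight).
/// The N-gram weight grows linearly from 0.5 to 0.9 as the observed count
/// rises from 0 to 100, then saturates.
fn adaptive_weights(ngram_count: u64) -> (f64, f64) {
    let confidence = (ngram_count as f64 / 100.0).min(1.0);
    let ngram_weight = 0.5 + 0.4 * confidence;
    (ngram_weight, 1.0 - ngram_weight)
}
```

An unseen N-gram (count 0) splits the weight evenly at 0.5 / 0.5, a count of 50 gives roughly 0.7 / 0.3, and any count of 100 or more gives 0.9 / 0.1.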
## Implementing the LanguageModel Trait
The hybrid model implements lling-llang's `LanguageModel` trait:
```rust
impl<D> LanguageModel for HybridLanguageModel<D>
where
    D: MutableMappedDictionary<Value = NgramEntry> + Send + Sync,
{
    fn score_sequence(&self, tokens: &[&str]) -> f64 {
        self.sentence_log_prob(tokens)
    }

    fn score_continuation(&self, prefix: &[&str], next: &str) -> f64 {
        self.score(next, prefix)
    }
}
```
This enables seamless integration with lling-llang's correction pipelines.
## Creating a Hybrid Model
### From Trained Components
```rust
use libgrammstein::prelude::*;

// Load or train N-gram model
let ngram: NgramModel<DynamicDawgChar<NgramEntry>> = NgramModel::load("ngram.bin")?;

// Load or train embedding model
let embedding = SubwordEmbedding::load("embedding.bin")?;

// Create hybrid model
let config = HybridConfig {
    ngram_weight: 0.8,
    embedding_weight: 0.2,
    cache_size: 10_000,
    oov_strategy: OovStrategy::BackoffWithEmbedding,
};
let hybrid = HybridLanguageModel::new(ngram, embedding, config);
```
### Training Both Components
```rust
use libgrammstein::prelude::*;

// Prepare corpus reader
let reader = PlaintextReader::from_directory("./corpus")?;

// Train N-gram model
let ngram = TrainerBuilder::new()
    .order(5)
    .min_count(2)
    .train(&reader)?;

// Train embedding model
let embedding = EmbeddingTrainer::new()
    .dimension(100)
    .epochs(20)
    .window(5)
    .train(&reader)?;

// Combine
let hybrid = HybridLanguageModel::new(ngram, embedding, HybridConfig::default());
hybrid.save("hybrid_model.bin")?;
```
## Thread Safety
The hybrid model is designed for concurrent access:

| Component | Thread-safety mechanism |
|-----------|-------------------------|
| `ngram` | `Arc<D>` where `D: Send + Sync` |
| `embedding` | Immutable embeddings + `Arc<DashMap>` cache |
| `config` | Plain data (`Clone`) |
| `cache` | `Mutex<LruCache>` for interior mutability |

All components satisfy `Send + Sync` when `D: Send + Sync`, making `HybridLanguageModel` usable across threads.
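
The pattern described here, immutable data plus a mutex-guarded cache behind `&self`, can be demonstrated with the standard library alone. In the sketch below, `SharedScorer` is a hypothetical stand-in for the hybrid model, and its "score" (weight × word length) is a placeholder so the example stays self-contained.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Stand-in for the hybrid model: immutable data plus a mutex-guarded cache.
struct SharedScorer {
    weight: f64,
    cache: Mutex<HashMap<String, f64>>,
}

impl SharedScorer {
    fn new(weight: f64) -> Self {
        SharedScorer { weight, cache: Mutex::new(HashMap::new()) }
    }

    /// `&self` suffices: the mutex provides interior mutability for the cache.
    fn score(&self, word: &str) -> f64 {
        let mut cache = self.cache.lock().unwrap();
        // Placeholder score, computed only on a cache miss.
        *cache
            .entry(word.to_string())
            .or_insert_with(|| self.weight * word.len() as f64)
    }

    fn cached_entries(&self) -> usize {
        self.cache.lock().unwrap().len()
    }
}

/// Compile-time check that a type can be shared across threads.
fn assert_send_sync<T: Send + Sync>() {}

/// Score each word on its own scoped thread against one shared scorer.
fn score_in_parallel(scorer: &SharedScorer, words: &[&str]) -> Vec<f64> {
    thread::scope(|s| {
        let handles: Vec<_> = words
            .iter()
            .map(|w| s.spawn(move || scorer.score(w)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<f64>>()
    })
}
```

Calling `assert_send_sync::<SharedScorer>()` compiles only if the type is `Send + Sync`, which is a cheap way to pin that guarantee in a test. A single `Mutex` serializes all cache access, so under heavy contention it can become the bottleneck.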
## Performance Optimization
### Caching
The LRU cache stores recently computed scores:
```rust
cache_size: 10_000 // Default
// Cache hit: O(1) lookup
// Cache miss: Compute N-gram + embedding scores
```
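The hit and miss paths can be sketched with an owned cache key. The document does not define `CacheKey` or `make_cache_key`, so the versions below are hypothetical. The sketch also uses a plain `HashMap` where the real model uses an LRU cache, so it never evicts entries.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: the scored word plus its context, stored as owned
/// strings so the key can outlive the borrowed query.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    word: String,
    context: Vec<String>,
}

fn make_cache_key(word: &str, context: &[&str]) -> CacheKey {
    CacheKey {
        word: word.to_string(),
        context: context.iter().map(|c| c.to_string()).collect(),
    }
}

/// Look up a score, computing and storing it on a miss.
/// Returns the score and whether it was served from the cache.
fn cached_score(
    cache: &mut HashMap<CacheKey, f64>,
    word: &str,
    context: &[&str],
    compute: impl FnOnce() -> f64,
) -> (f64, bool) {
    let key = make_cache_key(word, context);
    if let Some(&hit) = cache.get(&key) {
        return (hit, true);
    }
    let score = compute();
    cache.insert(key, score);
    (score, false)
}
```

Because the context is part of the key, the same word scored under a different context is a separate entry. Building the key allocates one `String` per token, so a production key might store a hash of the tokens instead.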
### Batch Scoring
For efficiency, score multiple sequences in parallel:
```rust
use rayon::prelude::*;

let sequences: Vec<Vec<&str>> = ...;
let scores: Vec<f64> = sequences
    .par_iter()
    .map(|seq| hybrid.sentence_log_prob(seq))
    .collect();
```
### Pre-warming the Cache
For known high-frequency queries:
```rust
impl<D: MutableMappedDictionary<Value = NgramEntry>> HybridLanguageModel<D> {
    pub fn prewarm_cache(&self, common_contexts: &[(Vec<&str>, Vec<&str>)]) {
        for (context, words) in common_contexts {
            for word in words {
                self.score(word, context);
            }
        }
    }
}
```
## Memory Layout
```
HybridLanguageModel
├── ngram: NgramModel<D>
│   ├── dictionary: Arc<D>                   # Trie with NgramEntry values
│   ├── smoothing: KneserNeySmoothing
│   └── vocab_size: usize
│
├── embedding: SubwordEmbedding
│   ├── word_embeddings: Array2<f32>         # [200K × 100] = 80MB
│   ├── subword_embeddings: Array2<f32>      # [2M × 100] = 800MB
│   ├── word_to_idx: HashMap
│   ├── idx_to_word: Vec<String>
│   └── cache: Arc<DashMap>
│
├── config: HybridConfig
│   ├── ngram_weight: f64
│   ├── embedding_weight: f64
│   ├── cache_size: usize
│   └── oov_strategy: OovStrategy
│
└── cache: Mutex<LruCache<CacheKey, f64>>
    └── Capacity: ~10,000 entries
```
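The embedding sizes in the layout follow from rows × dimensions × 4 bytes per `f32`. A small helper (hypothetical, for estimation only) makes the arithmetic explicit and can be reused when choosing a vocabulary size or dimension:

```rust
/// Bytes needed for a dense `rows × dim` matrix of `f32` values.
/// Counts only the raw matrix data, not allocator or index overhead.
fn embedding_matrix_bytes(rows: usize, dim: usize) -> usize {
    rows * dim * std::mem::size_of::<f32>()
}
```

For the figures above, 200,000 × 100 × 4 = 80,000,000 bytes (80MB) for word embeddings and 2,000,000 × 100 × 4 = 800,000,000 bytes (800MB) for subword embeddings.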
## Comparison with Pure Models
| Aspect | N-gram | Embedding | Hybrid |
|--------|--------|-----------|--------|
| Perplexity | Lower for in-domain | Higher | Balanced |
| OOV handling | Poor | Excellent | Good |
| Query latency | ~100ns | ~1μs | ~1μs |
| Memory | 1-2GB | 1GB | 2-3GB |
| Training time | Hours | Days | Days |
## Hyperparameters
| Parameter | Typical value | Effect |
|-----------|---------------|--------|
| `ngram_weight` | 0.7-0.9 | Higher = more local context |
| `embedding_weight` | 0.1-0.3 | Higher = more semantic |
| `cache_size` | 10,000 | Larger = more memory, fewer recomputes |
| `oov_strategy` | BackoffWithEmbedding | How to handle unknown words |
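
Since the interpolation weights are meant to sum to 1, it can help to check them before building a model. The document does not say whether libgrammstein validates its configuration, so `validate_weights` below is a hypothetical helper; the `1e-9` tolerance is an assumption that absorbs floating-point rounding (0.7 + 0.3 is not exactly 1.0 in `f64`).

```rust
/// Hypothetical validation for the two interpolation weights.
/// Each weight must lie in [0, 1] and together they must sum to 1.
fn validate_weights(ngram_weight: f64, embedding_weight: f64) -> Result<(), String> {
    // NaN fails this range check, so it is rejected as well.
    let in_range = |w: f64| (0.0..=1.0).contains(&w);
    if !in_range(ngram_weight) || !in_range(embedding_weight) {
        return Err(format!(
            "weights must lie in [0, 1], got {ngram_weight} and {embedding_weight}"
        ));
    }
    let sum = ngram_weight + embedding_weight;
    if (sum - 1.0).abs() > 1e-9 {
        return Err(format!("weights must sum to 1, got {sum}"));
    }
    Ok(())
}
```

The defaults (0.8, 0.2) pass, while a pair such as (0.8, 0.3) is rejected because it would inflate every score.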
## Next Steps
- [Interpolation](interpolation.md): Detailed interpolation strategies
- [OOV Handling](oov-handling.md): Out-of-vocabulary strategies
- [lling-llang Integration](../../integration/lling-llang/overview.md): Using in WFST pipelines
- [Training](../../training/hyperparameters.md): Tuning guide