libgrammstein 0.1.0

Hybrid language model (N-gram + Embeddings) for WFST text correction

# N-gram Query API

This document details the query methods available on `NgramModel`.

## Core Query Methods

### log_prob

Compute log probability of a word given context.

```rust
fn log_prob(&self, word: &str, context: &[&str]) -> f64
```

**Parameters**:
- `word`: Target word to score
- `context`: Preceding words (up to `order - 1`)

**Returns**: Natural-log probability (base e); always ≤ 0 because a probability never exceeds 1

**Example**:
```rust
let prob = model.log_prob("fox", &["quick", "brown"]);
// prob ≈ -2.345 (log probability)

// Convert to probability
let p = prob.exp();  // p ≈ 0.096
```

**Behavior**:
- Uses Modified Kneser-Ney smoothing
- Backs off to shorter contexts if needed
- Returns `unk_log_prob()` for unknown words without context
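
To make the backoff behavior concrete, the sketch below lists the contexts a backoff lookup would try, longest first. It is an illustration only, not the library's implementation: `backoff_contexts` is a hypothetical helper, and it assumes the common convention that backing off drops the oldest (leftmost) word at each step. It does not compute any Kneser-Ney weights.

```rust
/// Lists the contexts a backoff lookup would try, longest first.
/// Hypothetical helper; assumes each backoff step drops the oldest word.
fn backoff_contexts<'a, 'b>(context: &'a [&'b str], order: usize) -> Vec<&'a [&'b str]> {
    // Only the last `order - 1` words can serve as context.
    let max_len = order.saturating_sub(1);
    let start = context.len().saturating_sub(max_len);
    let usable = &context[start..];

    // usable[0..], usable[1..], ..., ending with the empty (unigram) context.
    (0..=usable.len()).map(|i| &usable[i..]).collect()
}
```

For a trigram model and the context `["the", "quick", "brown"]`, this yields `["quick", "brown"]`, then `["brown"]`, then the empty context.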

### prob

Compute probability (not log) of a word given context.

```rust
fn prob(&self, word: &str, context: &[&str]) -> f64
```

Convenience wrapper: `self.log_prob(word, context).exp()`
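
When scoring long sequences it is usually safer to stay in log space. The sketch below is independent of the library and uses only the standard library; the two function names are hypothetical. It contrasts multiplying raw probabilities with summing their logarithms.

```rust
/// Multiplies raw probabilities; can underflow to 0.0 on long sequences.
fn product_of_probs(probs: &[f64]) -> f64 {
    probs.iter().product()
}

/// Sums natural-log probabilities; stays finite for the same inputs.
fn sum_of_log_probs(probs: &[f64]) -> f64 {
    probs.iter().map(|p| p.ln()).sum()
}
```

With one hundred probabilities of `1e-5`, the product underflows to `0.0` in `f64`, while the log-space sum is a finite value near `-1151.3`.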

### sentence_log_prob

Compute log probability of an entire sentence.

```rust
fn sentence_log_prob(&self, tokens: &[&str]) -> f64
```

**Example**:
```rust
let tokens = ["the", "quick", "brown", "fox"];
let log_prob = model.sentence_log_prob(&tokens);
// log_prob ≈ -12.456
```

**Computation**:
```rust
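// Equivalent to:
let max_context = model.order() - 1; // longest usable context
let mut total = 0.0;
for i in 0..tokens.len() {
    // Use at most `order - 1` preceding tokens as context
    let context_start = i.saturating_sub(max_context);
    total += model.log_prob(tokens[i], &tokens[context_start..i]);
}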
```
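
The same sliding-window computation can be tried without a trained model by passing the scorer in as a closure. This is a sketch under that assumption: `sentence_log_prob_with` and its `score` parameter are hypothetical and not part of `NgramModel`.

```rust
/// Sums per-token scores, giving each token at most `order - 1` preceding
/// tokens as context. `score` stands in for `NgramModel::log_prob`.
fn sentence_log_prob_with<F>(tokens: &[&str], order: usize, score: F) -> f64
where
    F: Fn(&str, &[&str]) -> f64,
{
    let max_context = order.saturating_sub(1);
    tokens
        .iter()
        .enumerate()
        .map(|(i, &word)| {
            // The window never reaches back further than `max_context` tokens.
            let start = i.saturating_sub(max_context);
            score(word, &tokens[start..i])
        })
        .sum()
}
```

For four tokens and `order = 3`, the scorer sees contexts of length 0, 1, 2, and 2.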

### perplexity

Compute perplexity of a sentence.

```rust
fn perplexity(&self, tokens: &[&str]) -> f64
```

**Formula**: `exp(-log_prob / N)`, where `log_prob` is the sentence log probability and N is the token count

**Interpretation**:
- Lower perplexity = better model fit
- A perplexity of k means the model is, on average, as uncertain as a uniform choice among k words
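
The formula can be checked by hand with a small standalone function. `perplexity_from_log_prob` is a hypothetical helper, not a library method; returning `None` for an empty sentence is a choice made for this sketch, since the source does not say how `perplexity` handles that case.

```rust
/// Applies exp(-log_prob / N). Returns None when there are no tokens,
/// because the average is undefined in that case.
fn perplexity_from_log_prob(log_prob: f64, token_count: usize) -> Option<f64> {
    if token_count == 0 {
        return None;
    }
    Some((-log_prob / token_count as f64).exp())
}
```

A four-token sentence in which every token has probability 1/50 has `log_prob = 4 * ln(1/50)` and therefore a perplexity of 50, matching the uniform-choice interpretation. A four-token sentence with `log_prob = -12.456` has a perplexity of about 22.5.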

## Vocabulary Methods

### in_vocabulary

Check if a word is in the model's vocabulary.

```rust
fn in_vocabulary(&self, word: &str) -> bool
```

**Example**:
```rust
if model.in_vocabulary("fox") {
    // Word seen during training
} else {
    // OOV word
}
```
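
One common use of a vocabulary check is measuring the out-of-vocabulary (OOV) rate of a text. The sketch below takes the membership test as a closure so that it runs without a model; `oov_rate` is a hypothetical helper, and with a real model the closure would presumably be `|w| model.in_vocabulary(w)`.

```rust
/// Fraction of tokens for which `in_vocabulary` returns false.
/// Returns None for empty input.
fn oov_rate<F>(tokens: &[&str], in_vocabulary: F) -> Option<f64>
where
    F: Fn(&str) -> bool,
{
    if tokens.is_empty() {
        return None;
    }
    let oov = tokens.iter().filter(|&&w| !in_vocabulary(w)).count();
    Some(oov as f64 / tokens.len() as f64)
}
```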

### vocab_size

Get vocabulary size.

```rust
fn vocab_size(&self) -> usize
```

Returns the number of unique unigrams.

### ngram_count

Get total n-gram count.

```rust
fn ngram_count(&self) -> usize
```

Returns the total number of n-grams across all orders.

## Model Properties

### order

Get the model order.

```rust
fn order(&self) -> usize
```

**Example**:
```rust
let order = model.order();  // 3 for trigram model
let context_len = order - 1;  // Maximum context length
```

### unk_log_prob

Get log probability for unknown words.

```rust
fn unk_log_prob(&self) -> f64
```

Returns the log probability assigned to unseen words (typically a large negative value).

## Iteration

### iter_vocabulary

Iterate over vocabulary words.

```rust
fn iter_vocabulary(&self) -> impl Iterator<Item = &str>
```

**Example**:
```rust
for word in model.iter_vocabulary() {
    println!("{}: {}", word, model.prob(word, &[]));
}
```

### iter_ngrams

Iterate over all n-grams.

```rust
fn iter_ngrams(&self) -> impl Iterator<Item = (&str, &NgramEntry)>
```

**Example**:
```rust
for (ngram, entry) in model.iter_ngrams() {
    println!("{}: count={}", ngram, entry.count());
}
```

## Prediction

### predict_next

Get most likely next words.

```rust
fn predict_next(&self, context: &[&str], k: usize) -> Vec<(String, f64)>
```

**Parameters**:
- `context`: Preceding words
- `k`: Number of predictions to return

**Returns**: Vector of `(word, log_prob)` pairs, sorted by log probability in descending order (most likely first)

**Example**:
```rust
let predictions = model.predict_next(&["the", "quick"], 5);
for (word, log_prob) in predictions {
    println!("{}: {:.3}", word, log_prob);
}
// Output:
// brown: -1.234
// fox: -2.345
// dog: -2.567
// ...
```
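
The ordering of the result can be reproduced on any list of scored candidates. The following is a minimal sketch of top-k selection by log probability; `top_k` is a hypothetical helper and says nothing about how `predict_next` enumerates its candidates internally.

```rust
/// Keeps the `k` highest-scoring (word, log_prob) pairs, best first.
fn top_k(mut scored: Vec<(String, f64)>, k: usize) -> Vec<(String, f64)> {
    // `total_cmp` gives a total order on f64, so NaN cannot cause a panic.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```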

### sample

Sample a word from the distribution.

```rust
fn sample(&self, context: &[&str], rng: &mut impl Rng) -> String
```

**Example**:
```rust
use rand::thread_rng;

let mut rng = thread_rng();
let word = model.sample(&["the", "quick"], &mut rng);
println!("Sampled: {}", word);
```
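
Sampling from a next-word distribution is commonly done by walking the cumulative distribution. The sketch below shows that technique (inverse-CDF sampling) over a list of `(word, log_prob)` pairs. It is an assumption about how such a sampler could work, not a description of `sample` itself, and it takes a uniform draw `u` in `[0, 1)` as a parameter so that it needs no random-number crate.

```rust
/// Picks a word by inverse-CDF sampling. `u` is a uniform draw in [0, 1).
/// The log probabilities need not sum to 1; they are renormalized.
fn sample_from_log_probs<'a>(candidates: &'a [(String, f64)], u: f64) -> Option<&'a str> {
    let total: f64 = candidates.iter().map(|(_, lp)| lp.exp()).sum();
    if candidates.is_empty() || !(total > 0.0) {
        return None;
    }
    let target = u * total;
    let mut cumulative = 0.0;
    for (word, lp) in candidates {
        cumulative += lp.exp();
        if target < cumulative {
            return Some(word.as_str());
        }
    }
    // Rounding can leave `target` just past the last bucket; use the last word.
    candidates.last().map(|(w, _)| w.as_str())
}
```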

### generate

Generate a sequence of words.

```rust
fn generate(&self, seed: &[&str], length: usize, rng: &mut impl Rng) -> Vec<String>
```

**Example**:
```rust
use rand::thread_rng;

let mut rng = thread_rng();
let text = model.generate(&["the"], 10, &mut rng);
println!("{}", text.join(" "));
// Possible output (sampling is random):
// "the quick brown fox jumps over the lazy dog and"
```

## Batch Operations

### batch_log_prob

Score multiple queries efficiently.

```rust
fn batch_log_prob(&self, queries: &[(String, Vec<String>)]) -> Vec<f64>
```

**Example**:
```rust
let queries = vec![
    ("fox".to_string(), vec!["quick".to_string(), "brown".to_string()]),
    ("dog".to_string(), vec!["lazy".to_string()]),
];

let scores = model.batch_log_prob(&queries);
```
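
Because `batch_log_prob` takes owned `String`s, callers holding borrowed tokens need a conversion step first. The helper below is one way to do that; `to_owned_queries` is a hypothetical name and not part of the library.

```rust
/// Converts borrowed (word, context) pairs into the owned form that
/// `batch_log_prob` accepts.
fn to_owned_queries(queries: &[(&str, &[&str])]) -> Vec<(String, Vec<String>)> {
    queries
        .iter()
        .map(|(word, context)| {
            let owned_context = context.iter().map(|w| w.to_string()).collect();
            (word.to_string(), owned_context)
        })
        .collect()
}
```

The result can then be passed by reference, for example `model.batch_log_prob(&to_owned_queries(&borrowed))`.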

### parallel_score_sentences

Score sentences in parallel.

```rust
fn parallel_score_sentences(&self, sentences: &[Vec<&str>]) -> Vec<f64>
```

Uses Rayon for parallel processing.
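
For readers unfamiliar with data-parallel scoring, the sketch below shows the general idea using only `std::thread::scope`: split the sentences into chunks, score each chunk on its own thread, and concatenate the results in order. It deliberately swaps Rayon for plain scoped threads and is not the library's implementation; `score_in_parallel`, `workers`, and `score` are hypothetical names.

```rust
use std::thread;

/// Scores sentences on up to `workers` threads, preserving input order.
/// `score` stands in for a call such as `model.sentence_log_prob`.
fn score_in_parallel<F>(sentences: &[Vec<&str>], workers: usize, score: F) -> Vec<f64>
where
    F: Fn(&[&str]) -> f64 + Sync,
{
    if sentences.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every sentence lands in some chunk.
    let chunk_size = (sentences.len() + workers - 1) / workers;
    let score = &score;

    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = sentences
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|s| score(s.as_slice())).collect::<Vec<f64>>()
                })
            })
            .collect();

        // Joining in spawn order keeps scores aligned with the input.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("scoring thread panicked"))
            .collect()
    })
}
```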

## Count Access

### get_count

Get raw count for an n-gram.

```rust
fn get_count(&self, ngram: &[&str]) -> Option<u64>
```

**Example**:
```rust
let count = model.get_count(&["the", "quick", "brown"]);
// count = Some(42) or None if not found
```

### get_context_count

Get count of context occurrences.

```rust
fn get_context_count(&self, context: &[&str]) -> u64
```

**Example**:
```rust
let count = model.get_context_count(&["the", "quick"]);
// How many times "the quick" was followed by any word
```
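
Raw counts are enough to compute an unsmoothed maximum-likelihood estimate, which is useful as a sanity check against the smoothed value from `log_prob`. The sketch below takes the two counts as plain values; `mle_prob` is a hypothetical helper, and its result will generally differ from the model's Kneser-Ney estimate.

```rust
/// Unsmoothed estimate: count(context + word) / count(context).
/// Returns None when the context was never seen.
fn mle_prob(ngram_count: Option<u64>, context_count: u64) -> Option<f64> {
    match (ngram_count, context_count) {
        (_, 0) => None,
        // Context seen, but never followed by this word.
        (None, _) => Some(0.0),
        (Some(count), total) => Some(count as f64 / total as f64),
    }
}
```

With a real model, the arguments would come from `get_count` on the full n-gram and `get_context_count` on its context.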

## Query Patterns

### Efficient Batch Scoring

```rust
use rayon::prelude::*;

// Parallel scoring
let scores: Vec<f64> = sentences.par_iter()
    .map(|s| model.sentence_log_prob(s))
    .collect();
```

### Finding Unusual N-grams

```rust
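// Find n-grams whose final word has log probability below -10.0
let unusual: Vec<_> = test_ngrams.iter()
    .filter(|ng| match ng.split_last() {
        // Last token is the target word; the rest is its context
        Some((word, context)) => model.log_prob(word, context) < -10.0,
        None => false, // skip empty n-grams instead of panicking
    })
    .collect();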
```

### Perplexity Calculation

```rust
// `D` is the model's type parameter; add any trait bounds `NgramModel` requires
fn corpus_perplexity<D>(model: &NgramModel<D>, sentences: &[Vec<&str>]) -> f64 {
    let mut total_log_prob = 0.0;
    let mut total_tokens = 0;

    for sentence in sentences {
        total_log_prob += model.sentence_log_prob(sentence);
        total_tokens += sentence.len();
    }

    (-total_log_prob / total_tokens as f64).exp()
}
```

## See Also

- [Trie Storage](trie-storage.md) - Backend details
- [NgramModel API](../../api/ngram.md) - Complete API reference
- [Training Guide](../../training/ngram.md) - Training workflow