# Grammar Corrector
The grammar corrector uses Probabilistic Context-Free Grammars (PCFG) and Earley parsing to suggest syntactically valid corrections based on grammar constraints.
## Overview
The `GrammarCorrector` provides:
- **Grammar validation**: Check token sequences against PCFG rules
- **Completion suggestions**: Valid next tokens from parse state
- **Syntax error detection**: Identify invalid token positions
- **Insertion/deletion suggestions**: Fix missing or extra tokens
## Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│ GrammarCorrector │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ WeightedCFG ││
│ │ ┌─────────────────────────────────────────────────────────┐││
│ │ │ Production Rules with Probabilities │││
│ │ │ │││
│ │ │ stmt → "if" "(" expr ")" stmt [0.30] │││
│ │ │ stmt → "while" "(" expr ")" stmt [0.20] │││
│ │ │ stmt → "return" expr ";" [0.30] │││
│ │ │ stmt → expr ";" [0.20] │││
│ │ │ expr → identifier [0.50] │││
│ │ │ expr → literal [0.50] │││
│ │ └─────────────────────────────────────────────────────────┘││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ EarleyParser ││
│ │ ││
│ │ • Incremental parsing ││
│ │ • Valid next token computation ││
│ │ • Error recovery ││
│ └─────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────┘
```
## GrammarCorrectorConfig
Configuration options for the grammar corrector:
```rust
pub struct GrammarCorrectorConfig {
/// Maximum candidates per error (default: 5)
pub max_candidates: usize,
/// Minimum rule probability to consider (default: 0.01)
pub min_rule_probability: f64,
/// Whether to suggest insertions (default: true)
pub suggest_insertions: bool,
/// Whether to suggest deletions (default: true)
pub suggest_deletions: bool,
/// Maximum lookahead for completions (default: 3)
pub max_lookahead: usize,
/// Base confidence for grammar corrections (default: 0.8)
pub base_confidence: f64,
}
```
### Configuration Parameters
| `max_candidates` | 5 | Maximum suggestions per error |
| `min_rule_probability` | 0.01 | Filter low-probability rules |
| `suggest_insertions` | true | Suggest missing tokens |
| `suggest_deletions` | true | Suggest removing tokens |
| `max_lookahead` | 3 | Tokens to look ahead |
| `base_confidence` | 0.8 | Starting confidence score |
## Creating a Grammar Corrector
### With a Trained Grammar
```rust
use libgrammstein::code::{
GrammarCorrector, GrammarCorrectorConfig, Python,
WeightedCFG, Production, Symbol
};
use std::sync::Arc;
// Create a grammar
let mut grammar = WeightedCFG::new("stmt");
grammar.add_rule(
Production::new("stmt", vec![
Symbol::Terminal("if".to_string()),
Symbol::Terminal("(".to_string()),
Symbol::NonTerminal("expr".to_string()),
Symbol::Terminal(")".to_string()),
Symbol::NonTerminal("stmt".to_string()),
]),
0.3, // 30% probability
);
grammar.add_rule(
Production::new("stmt", vec![
Symbol::Terminal("return".to_string()),
Symbol::NonTerminal("expr".to_string()),
Symbol::Terminal(";".to_string()),
]),
0.4,
);
// Create corrector
let python = Arc::new(Python::new());
let corrector = GrammarCorrector::with_defaults(python, grammar);
```
### With Custom Configuration
```rust
let config = GrammarCorrectorConfig {
max_candidates: 10,
min_rule_probability: 0.05,
suggest_insertions: true,
suggest_deletions: false, // Only insertions
max_lookahead: 5,
base_confidence: 0.9,
};
let corrector = GrammarCorrector::new(python, grammar, config);
```
## Grammar Constraint
The corrector creates a `GrammarConstraint` for validation:
```rust
// Create a constraint for checking validity
let constraint = corrector.create_constraint();
// Check if a token is valid at current position
if constraint.is_valid_token("if") {
println!("'if' is valid here");
}
// Advance the parser state
constraint.advance("if");
// Get all valid tokens at current position
let valid = constraint.valid_tokens();
println!("Valid next tokens: {:?}", valid);
```
## Valid Next Tokens
Query valid tokens given a token history:
```rust
// After "if", what tokens are valid?
let valid = corrector.valid_next_tokens(&["if"]);
// Returns: {"("} - an opening paren is expected
// After "if (", what's valid?
let valid = corrector.valid_next_tokens(&["if", "("]);
// Returns: expressions - identifiers, literals, etc.
// After "if ( x )", what's valid?
let valid = corrector.valid_next_tokens(&["if", "(", "x", ")"]);
// Returns: statement starters - "{", another if, etc.
```
## Suggesting Completions
Get ranked completion suggestions:
```rust
// Get top 5 completions after "return"
let completions = corrector.suggest_completions(&["return"], 5);
for (token, probability) in &completions {
println!(" {} (prob: {:.2})", token, probability);
}
// Output:
// x (prob: 0.50)
// y (prob: 0.50)
```
## Syntax Error Detection
Find syntax errors in a token sequence:
```rust
let tokens = vec!["fi", "(", "x", ")", "return", "x", ";"];
// ^^ "fi" is not a valid keyword
let errors = corrector.find_syntax_errors(&tokens);
for error in &errors {
println!("Position {}: {}", error.position, error.message());
}
// Output: Position 0: Invalid token 'fi', expected one of: 'if', 'while', 'return'
```
### SyntaxErrorType
Types of syntax errors detected:
```rust
pub enum SyntaxErrorType {
InvalidToken, // Token not valid at this position
UnexpectedToken, // No valid tokens possible
MissingToken, // Expected token not present
UnclosedDelimiter, // Missing closing bracket/paren
}
```
### SyntaxError Structure
```rust
pub struct SyntaxError {
/// Position in the token stream
pub position: usize,
/// The problematic token
pub token: String,
/// Valid tokens expected at this position
pub expected: HashSet<String>,
/// Type of error
pub error_type: SyntaxErrorType,
}
```
## Correction Types
The grammar corrector generates three types of corrections:
### Insertions
For missing tokens (e.g., missing semicolon):
```rust
// Source: "return x" (missing ";")
// Correction: Insert ";" after "x"
let correction = Correction::new(
CorrectionKind::Insertion,
byte_position,
byte_position, // Same start/end for insertion
"", // Nothing being replaced
";", // Token to insert
)
.with_source(CorrectionSource::Grammar)
.with_context("Expected token at position N");
```
### Deletions
For extra tokens:
```rust
// Source: "return return x;" (duplicate "return")
// Correction: Delete first or second "return"
let correction = Correction::new(
CorrectionKind::Deletion,
start_byte,
end_byte,
"return", // Token to remove
"", // Replace with nothing
)
.with_source(CorrectionSource::Grammar)
.with_context("Unexpected token");
```
### Replacements
For wrong tokens that should be something else:
```rust
// Source: "fi (x) return x;" ("fi" instead of "if")
// Correction: Replace "fi" with "if"
let correction = Correction::new(
CorrectionKind::Replacement,
start_byte,
end_byte,
"fi", // Original
"if", // Replacement
)
.with_source(CorrectionSource::Grammar)
.with_context("Grammar suggests 'if'");
```
## Confidence Calculation
Grammar corrections compute confidence based on:
1. **Base confidence**: Configurable starting point (default 0.8)
2. **Rule probability**: Higher probability rules score better
3. **String similarity**: For replacements, similar tokens score higher
```rust
// Insertion confidence
confidence = base_confidence × rule_probability
// Deletion confidence (lower due to destructive nature)
confidence = base_confidence × 0.7
// Replacement confidence
confidence = base_confidence × similarity × (0.5 + 0.5 × rule_probability)
```
## String Similarity
The corrector uses Jaccard similarity on character bigrams:
```rust
// Compare "fi" to "if"
// Bigrams of "fi": {('f', 'i')}
// Bigrams of "if": {('i', 'f')}
// Intersection: {} (empty)
// Union: {('f', 'i'), ('i', 'f')}
// Similarity: 0 / 2 = 0.0
// Compare "whiel" to "while"
// Bigrams of "whiel": {('w','h'), ('h','i'), ('i','e'), ('e','l')}
// Bigrams of "while": {('w','h'), ('h','i'), ('i','l'), ('l','e')}
// Intersection: {('w','h'), ('h','i')}
// Union: 6 unique bigrams
// Similarity: 2 / 6 ≈ 0.33
```
Replacements with similarity below 0.3 are filtered out.
## Integration Example
Complete example using the grammar corrector:
```rust
use libgrammstein::code::{
GrammarCorrector, WeightedCFG, Production, Symbol,
Python, CodeCorrector, CodeToken, TokenContext, TokenType
};
use std::sync::Arc;
// Build a simple expression grammar
fn build_expr_grammar() -> WeightedCFG {
let mut cfg = WeightedCFG::new("expr");
// expr -> expr "+" term
cfg.add_rule(
Production::new("expr", vec![
Symbol::NonTerminal("expr".to_string()),
Symbol::Terminal("+".to_string()),
Symbol::NonTerminal("term".to_string()),
]),
0.3,
);
// expr -> term
cfg.add_rule(
Production::new("expr", vec![
Symbol::NonTerminal("term".to_string()),
]),
0.7,
);
// term -> "(" expr ")"
cfg.add_rule(
Production::new("term", vec![
Symbol::Terminal("(".to_string()),
Symbol::NonTerminal("expr".to_string()),
Symbol::Terminal(")".to_string()),
]),
0.3,
);
// term -> NUMBER
cfg.add_rule(
Production::new("term", vec![
Symbol::Terminal("NUMBER".to_string()),
]),
0.7,
);
cfg
}
fn main() {
let python = Arc::new(Python::new());
let grammar = build_expr_grammar();
let corrector = GrammarCorrector::with_defaults(python, grammar);
// Check what's valid at start
let valid_start = corrector.valid_next_tokens(&[]);
println!("Valid start tokens: {:?}", valid_start);
// Check what's valid after "("
let valid_after_paren = corrector.valid_next_tokens(&["("]);
println!("After '(': {:?}", valid_after_paren);
// Find syntax errors
let tokens = vec!["(", "NUMBER", "+", ")"]; // Missing second operand
let errors = corrector.find_syntax_errors(&tokens);
for error in errors {
println!("Error: {}", error.message());
}
}
```
## Working with PCFG
### Accessing the Grammar
```rust
let grammar = corrector.grammar();
// Iterate over rules
for (production, normalized) in grammar.iter_rules() {
let prob = grammar.probability(production);
println!("{} -> {:?} (prob: {:.2})",
production.lhs,
production.rhs,
prob
);
}
```
### Token Probability
Estimate the probability of a token in context:
```rust
// Internal method used for ranking
let prob = corrector.token_probability("return", &["if", "(", "x", ")"]);
```
## Limitations
1. **Context window**: Grammar checking requires token history
2. **Ambiguous grammars**: May produce multiple valid suggestions
3. **Training data**: Quality depends on grammar training
4. **Performance**: Earley parsing is O(n³) worst case
## Performance
| Create constraint | O(1) | Initializes parser state |
| Valid next tokens | O(g) | g = grammar size |
| Find syntax errors | O(n × g) | n = token count |
| Suggest completions | O(g) | Filtered by probability |
## Thread Safety
`GrammarCorrector` is `Send + Sync` when its language type is:
```rust
use std::sync::Arc;
let corrector = Arc::new(GrammarCorrector::with_defaults(python, grammar));
// Safe to share across threads for reading
let corrector_clone = Arc::clone(&corrector);
});
```
## See Also
- [Correctors Overview](overview.md) - Architecture and comparison
- [Lexical Corrector](lexical.md) - Fuzzy matching
- [Semantic Corrector](semantic.md) - CPG/GNN analysis
- [Ensemble Corrector](ensemble.md) - Multi-source aggregation
- [PCFG](../pcfg.md) - Probabilistic grammars
- [Constrained Decoding](../constrained-decoding.md) - Grammar constraints