lling-llang 0.1.0

# Text Correction Applications

lling-llang provides a framework for building spelling and grammar correction applications.

## Overview

Text correction proceeds in stages, from tokenization through candidate generation and filtering to ranking:

```
Input Text: "Teh quikc brwon fox jmups ovre the lzay dog"
                │
                ▼
┌─────────────────────────────────────┐
│           Tokenization              │
└───────────────┬─────────────────────┘
                ▼
┌─────────────────────────────────────┐
│     Initial Lattice (1 path)        │
└───────────────┬─────────────────────┘
                ▼
┌─────────────────────────────────────┐
│       Spelling Candidates           │
│   (add edit-distance matches)       │
└───────────────┬─────────────────────┘
                ▼
┌─────────────────────────────────────┐
│        Grammar Filtering            │
│   (remove invalid sequences)        │
└───────────────┬─────────────────────┘
                ▼
┌─────────────────────────────────────┐
│      Language Model Ranking         │
│   (score by fluency)                │
└───────────────┬─────────────────────┘
                ▼
Output: "The quick brown fox jumps over the lazy dog"
```
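
As a rough illustration of the candidate and ranking stages, the sketch below corrects each token independently by picking the closest dictionary word. It uses only the standard library rather than lling-llang or liblevenshtein, so `osa_distance`, `closest_word`, `correct_tokens`, and the toy dictionary are hypothetical stand-ins, and it skips the lattice, grammar, and language-model stages entirely.

```rust
/// Optimal string alignment distance: insertions, deletions,
/// substitutions, and transpositions of adjacent characters each cost 1.
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut d = vec![vec![0usize; b.len() + 1]; a.len() + 1];

    for i in 0..=a.len() {
        d[i][0] = i;
    }
    for j in 0..=b.len() {
        d[0][j] = j;
    }

    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            d[i][j] = (d[i - 1][j] + 1)
                .min(d[i][j - 1] + 1)
                .min(d[i - 1][j - 1] + cost);

            // Adjacent transposition, e.g. "teh" -> "the"
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                d[i][j] = d[i][j].min(d[i - 2][j - 2] + 1);
            }
        }
    }

    d[a.len()][b.len()]
}

/// Returns the dictionary word closest to `token` within `max_distance`,
/// or `None` when nothing is close enough. Ties keep the earlier word.
fn closest_word<'a>(token: &str, dictionary: &[&'a str], max_distance: usize) -> Option<&'a str> {
    dictionary
        .iter()
        .map(|&word| (osa_distance(token, word), word))
        .filter(|&(distance, _)| distance <= max_distance)
        .min_by_key(|&(distance, _)| distance)
        .map(|(_, word)| word)
}

/// Corrects each whitespace-separated token independently (lowercased).
fn correct_tokens(text: &str, dictionary: &[&str], max_distance: usize) -> String {
    text.split_whitespace()
        .map(|token| {
            let lowered = token.to_lowercase();
            closest_word(&lowered, dictionary, max_distance)
                .map(str::to_string)
                .unwrap_or_else(|| token.to_string())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Choosing per token works here only because every candidate is scored in isolation. Once grammar or language-model scores depend on neighbouring words, the candidates have to be kept side by side in a lattice and ranked as whole paths.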

## Building a Spelling Corrector

### Basic Architecture

```rust
use lling_llang::prelude::*;
use liblevenshtein::prelude::*;

pub struct SpellingCorrector {
    dictionary: DoubleArrayTrie,
    transducer: Transducer<DoubleArrayTrie>,
    max_edit_distance: usize,
}

impl SpellingCorrector {
    pub fn new(words: Vec<&str>, max_distance: usize) -> Self {
        let dictionary = DoubleArrayTrie::from_terms(words);
        let transducer = Transducer::with_transposition(dictionary.clone());

        Self {
            dictionary,
            transducer,
            max_edit_distance: max_distance,
        }
    }

    pub fn correct(&self, text: &str) -> String {
        // Tokenize
        let tokens = self.tokenize(text);

        // Build lattice with candidates
        let lattice = self.build_lattice(&tokens);

        // Extract best path
        let best = viterbi(&mut lattice.clone());

        // Reconstruct text
        self.reconstruct(&best, &tokens)
    }

    fn tokenize(&self, text: &str) -> Vec<Token> {
        // Preserve whitespace and punctuation info
        tokenize_preserving_context(text)
    }

    fn build_lattice(&self, tokens: &[Token]) -> Lattice<TropicalWeight, HashMapBackend> {
        let mut builder = LatticeBuilder::new(HashMapBackend::new());

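        for (i, token) in tokens.iter().enumerate() {
            let lowered = token.text.to_lowercase();
            let in_dictionary = self.dictionary.contains(&lowered);

            // Add the original token. A known word costs nothing. An unknown
            // word costs more than any candidate (an assumption of this
            // sketch); otherwise Viterbi would always keep the misspelling.
            // `new(0.0)` is used instead of `zero()`, which in semiring
            // terminology usually denotes the additive identity (+infinity
            // for the tropical semiring), not a cost of 0.
            let original_cost = if in_dictionary {
                0.0
            } else {
                (self.max_edit_distance + 1) as f64
            };
            builder.add_correction(
                i, i + 1,
                &token.text,
                TropicalWeight::new(original_cost),
                EdgeMetadata::original(),
            );

            // Skip candidate generation for known words
            if in_dictionary {
                continue;
            }

            // Add spelling candidates, weighted by edit distance
            for candidate in self.transducer.query_with_distance(
                &lowered,
                self.max_edit_distance
            ) {
                let weight = TropicalWeight::new(candidate.distance as f64);
                builder.add_correction(
                    i, i + 1,
                    &candidate.term,
                    weight,
                    EdgeMetadata::spelling_correction(&token.text, &candidate.term),
                );
            }
        }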

        builder.build(tokens.len())
    }

    fn reconstruct(&self, path: &Path<TropicalWeight>, original: &[Token]) -> String {
        // Preserve original spacing and capitalization
        let mut result = String::new();

        for (i, (label, token)) in path.labels.iter().zip(original).enumerate() {
            if i > 0 {
                result.push_str(&token.leading_whitespace);
            }

            // Match original capitalization
            let corrected = match_case(label, &token.text);
            result.push_str(&corrected);
        }

        result
    }
}
```

### Preserving Context

```rust
#[derive(Clone)]
pub struct Token {
    pub text: String,
    pub leading_whitespace: String,
    pub is_capitalized: bool,
    pub is_all_caps: bool,
}

fn tokenize_preserving_context(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut current_whitespace = String::new();

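    for segment in text.split_inclusive(char::is_whitespace) {
        let trimmed = segment.trim_end();
        let whitespace = &segment[trimmed.len()..];

        if !trimmed.is_empty() {
            tokens.push(Token {
                text: trimmed.to_string(),
                // Take everything accumulated since the previous token,
                // so runs of spaces or newlines survive intact
                leading_whitespace: std::mem::take(&mut current_whitespace),
                is_capitalized: trimmed.chars().next().map(|c| c.is_uppercase()).unwrap_or(false),
                is_all_caps: trimmed.chars().all(|c| c.is_uppercase()),
            });
        }

        // Accumulate rather than overwrite: a run of whitespace arrives
        // as several segments, one per whitespace character
        current_whitespace.push_str(whitespace);
    }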

    tokens
}

fn match_case(corrected: &str, original: &str) -> String {
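    // A single capital letter ("A", "I") counts as capitalized, not all-caps,
    // so a longer correction is not upper-cased in full
    if original.chars().count() > 1 && original.chars().all(|c| c.is_uppercase()) {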
        corrected.to_uppercase()
    } else if original.chars().next().map(|c| c.is_uppercase()).unwrap_or(false) {
        let mut chars = corrected.chars();
        match chars.next() {
            Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
            None => String::new(),
        }
    } else {
        corrected.to_string()
    }
}
```

## Grammar Correction

### Grammar-Aware Corrector

```rust
use lling_llang::cfg::Grammar;
use lling_llang::layers::CfgFilterLayer;

pub struct GrammarCorrector {
    spelling: SpellingCorrector,
    grammar: Grammar,
}

impl GrammarCorrector {
    pub fn correct(&self, text: &str) -> Result<String, CorrectionError> {
        let tokens = self.spelling.tokenize(text);

        // Build spelling lattice
        let spelling_lattice = self.spelling.build_lattice(&tokens);

        // Filter by grammar
        let grammar_layer = CfgFilterLayer::new(&self.grammar);
        let filtered = grammar_layer.apply(&spelling_lattice)?;

        // Check if any valid paths remain
        if filtered.num_edges() == 0 {
            return Err(CorrectionError::NoValidCorrection);
        }

        // Extract best
        let best = viterbi(&mut filtered.clone());
        Ok(self.spelling.reconstruct(&best, &tokens))
    }
}
```

### Part-of-Speech Based Grammar

```rust
use lling_llang::cfg::{Grammar, GrammarBuilder};

fn build_english_grammar() -> Grammar {
    GrammarBuilder::new()
        .start("S")
        // Sentence structures
        .rule("S", &["NP", "VP"])
        .rule("S", &["NP", "VP", "PP"])

        // Noun phrases
        .rule("NP", &["Det", "N"])
        .rule("NP", &["Det", "Adj", "N"])
        .rule("NP", &["Det", "Adj", "Adj", "N"])
        .rule("NP", &["PropN"])

        // Verb phrases
        .rule("VP", &["V"])
        .rule("VP", &["V", "NP"])
        .rule("VP", &["V", "NP", "PP"])
        .rule("VP", &["Aux", "V"])
        .rule("VP", &["Aux", "V", "NP"])

        // Prepositional phrases
        .rule("PP", &["P", "NP"])

        // Terminals (simplified)
        .terminal("Det", &["the", "a", "an", "this", "that"])
        .terminal("N", &["dog", "cat", "fox", "man", "woman"])
        .terminal("Adj", &["quick", "brown", "lazy", "big", "small"])
        .terminal("V", &["runs", "jumps", "eats", "sees", "likes"])
        .terminal("Aux", &["is", "was", "has", "have", "will"])
        .terminal("P", &["over", "under", "with", "to", "from"])
        .terminal("PropN", &["John", "Mary", "London", "Paris"])

        .build()
        .expect("valid grammar")
}
```

## Language Model Integration

### N-gram Scoring

```rust
pub struct NgramScorer {
    model: NgramModel,
    unknown_penalty: f64,
}

impl NgramScorer {
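    /// Returns the total score of `path`, including the sentence boundaries.
    /// `LmScoringLayer` adds this value to tropical weights, so it is
    /// assumed here that `NgramModel::score` returns a cost such as a
    /// negative log-probability (lower is better).
    pub fn score_path(&self, path: &[String]) -> f64 {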
        let mut total_score = 0.0;

        // Add BOS token
        let mut context = vec!["<s>".to_string()];

        for word in path {
            let score = self.model.score(&context, word)
                .unwrap_or(self.unknown_penalty);
            total_score += score;
            context.push(word.clone());
        }

        // Add EOS token
        total_score += self.model.score(&context, "</s>")
            .unwrap_or(self.unknown_penalty);

        total_score
    }
}

pub struct LmScoringLayer {
    scorer: NgramScorer,
    weight: f64,
}

impl<B: LatticeBackend> CorrectionLayer<TropicalWeight, B> for LmScoringLayer {
    fn name(&self) -> &str {
        "lm-scoring"
    }

    fn apply(&self, lattice: &Lattice<TropicalWeight, B>)
        -> Result<Lattice<TropicalWeight, B>, LayerError>
    {
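        let mut new_lattice = lattice.clone();

        // Enumerate from the input lattice so that `new_lattice` can be
        // mutated inside the loop. Enumerating every path is exponential in
        // sentence length, so this suits short inputs or a pruned lattice.
        for path in enumerate_all_paths(lattice) {
            // Lower is better in the tropical semiring, so `score_path`
            // is expected to return a cost
            let lm_score = self.scorer.score_path(&path.labels);
            let combined = path.weight.value() + self.weight * lm_score;

            // Update path weight
            set_path_weight(&mut new_lattice, &path, TropicalWeight::new(combined));
        }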

        Ok(new_lattice)
    }

    fn estimated_reduction(&self) -> f64 {
        1.0
    }
}
```
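
The `NgramModel` type is not shown on this page. As a stand-in, the sketch below is a toy bigram model with add-one smoothing that returns a cost (negative log-probability), which is the convention a tropical-weight layer needs. `BigramModel`, `train`, `cost`, and `score_path` are hypothetical names for illustration, not part of lling-llang.

```rust
use std::collections::HashMap;

/// Toy bigram model with add-one smoothing. All scores are costs
/// (negative log-probabilities), so lower means more fluent.
struct BigramModel {
    bigrams: HashMap<(String, String), usize>,
    contexts: HashMap<String, usize>,
    vocabulary_size: usize,
}

impl BigramModel {
    /// Counts bigrams over whitespace-tokenized sentences, with `<s>` and
    /// `</s>` as sentence boundary markers.
    fn train(sentences: &[&str]) -> Self {
        let mut bigrams: HashMap<(String, String), usize> = HashMap::new();
        let mut contexts: HashMap<String, usize> = HashMap::new();

        for sentence in sentences {
            let mut previous = "<s>".to_string();
            for word in sentence.split_whitespace().chain(std::iter::once("</s>")) {
                *bigrams.entry((previous.clone(), word.to_string())).or_insert(0) += 1;
                *contexts.entry(previous).or_insert(0) += 1;
                previous = word.to_string();
            }
        }

        // The context keys are the distinct words plus `<s>`, which is the
        // same count as the predictable symbols (distinct words plus `</s>`)
        let vocabulary_size = contexts.len();
        Self { bigrams, contexts, vocabulary_size }
    }

    /// Cost of seeing `word` directly after `previous`.
    fn cost(&self, previous: &str, word: &str) -> f64 {
        let pair = self
            .bigrams
            .get(&(previous.to_string(), word.to_string()))
            .copied()
            .unwrap_or(0);
        let context = self.contexts.get(previous).copied().unwrap_or(0);
        let probability = (pair + 1) as f64 / (context + self.vocabulary_size) as f64;
        -probability.ln()
    }

    /// Total cost of a sentence, including both boundary markers.
    fn score_path(&self, path: &[&str]) -> f64 {
        let mut previous = "<s>";
        let mut total = 0.0;
        for &word in path.iter().chain(std::iter::once(&"</s>")) {
            total += self.cost(previous, word);
            previous = word;
        }
        total
    }
}
```

Because each step conditions on one previous word only, the context never grows with the sentence. A higher-order model would keep the last n-1 words instead of the whole history.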

## Real-Word Error Detection

### Context-Based Detection

```rust
use std::collections::HashMap;

pub struct ContextualCorrector {
    dictionary: DoubleArrayTrie,
    confusion_sets: HashMap<String, Vec<String>>,
    lm: NgramModel,
}

impl ContextualCorrector {
    /// Detect real-word errors using context
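    pub fn detect_errors(&self, tokens: &[String]) -> Vec<(usize, Vec<String>)> {
        let mut errors = Vec::new();

        for (i, token) in tokens.iter().enumerate() {
            // Get confusion set for this word
            let Some(confusions) = self.confusion_sets.get(token) else {
                continue;
            };

            // Local window around the word; `offset` is the word's index
            // inside that window (must match `get_context`)
            let context = self.get_context(tokens, i);
            let offset = i - i.saturating_sub(2);

            // Score the original together with each alternative. The
            // confusion sets omit the key itself, so scoring only the
            // alternatives would flag every confusable word as an error.
            let mut scores: Vec<_> = std::iter::once(token)
                .chain(confusions.iter())
                .map(|candidate| {
                    let mut test_tokens = context.clone();
                    test_tokens[offset] = candidate.clone();
                    (candidate.clone(), self.lm.score_sequence(&test_tokens))
                })
                .collect();

            // Highest score first (assumes `score_sequence` returns a value
            // where higher is better). The sort is stable, so the original
            // wins ties, and `total_cmp` cannot panic on NaN.
            scores.sort_by(|a, b| b.1.total_cmp(&a.1));

            // If best alternative is not the original, flag as error
            if scores[0].0 != *token {
                errors.push((i, scores.into_iter().map(|(w, _)| w).collect()));
            }
        }

        errors
    }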

    fn get_context(&self, tokens: &[String], pos: usize) -> Vec<String> {
        let start = pos.saturating_sub(2);
        let end = (pos + 3).min(tokens.len());
        tokens[start..end].to_vec()
    }
}
```

### Common Confusion Sets

```rust
use std::collections::HashMap;

fn build_confusion_sets() -> HashMap<String, Vec<String>> {
    let mut sets = HashMap::new();

    // Homophones
    sets.insert("there".into(), vec!["their".into(), "they're".into()]);
    sets.insert("their".into(), vec!["there".into(), "they're".into()]);
    sets.insert("they're".into(), vec!["there".into(), "their".into()]);

    sets.insert("your".into(), vec!["you're".into()]);
    sets.insert("you're".into(), vec!["your".into()]);

    sets.insert("its".into(), vec!["it's".into()]);
    sets.insert("it's".into(), vec!["its".into()]);

    // Common typos that are words
    sets.insert("form".into(), vec!["from".into()]);
    sets.insert("from".into(), vec!["form".into()]);

    sets.insert("then".into(), vec!["than".into()]);
    sets.insert("than".into(), vec!["then".into()]);

    sets
}
```
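
Writing every direction of every pair by hand is easy to get out of sync. One possible alternative, sketched below with a hypothetical helper `confusion_sets_from_groups`, is to list each group once and expand it into the same `HashMap<String, Vec<String>>` shape.

```rust
use std::collections::HashMap;

/// Expands each group into one entry per member, mapping that member to
/// every other word in its group. Order within a group is preserved.
fn confusion_sets_from_groups(groups: &[&[&str]]) -> HashMap<String, Vec<String>> {
    let mut sets: HashMap<String, Vec<String>> = HashMap::new();

    for group in groups {
        for &word in group.iter() {
            let alternatives = sets.entry(word.to_string()).or_default();
            for &other in group.iter() {
                // Skip the word itself and anything already recorded
                // (a word may appear in more than one group)
                if other != word && !alternatives.iter().any(|a| a.as_str() == other) {
                    alternatives.push(other.to_string());
                }
            }
        }
    }

    sets
}
```

With this helper, the sets above reduce to one line per group, for example `&["there", "their", "they're"]` and `&["form", "from"]`.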

## Complete Correction Pipeline

```rust
use lling_llang::layers::LayerPipelineBuilder;

pub struct TextCorrector {
    pipeline: LayerPipeline<TropicalWeight, HashMapBackend>,
    tokenizer: Tokenizer,
}

impl TextCorrector {
    pub fn new(config: CorrectorConfig) -> Result<Self, Error> {
        let spelling_layer = SpellingCorrectionLayer::new(
            config.dictionary.clone(),
            config.max_edit_distance,
        );

        let grammar_layer = CfgFilterLayer::new(&config.grammar);

        let lm_layer = LmScoringLayer::new(
            config.language_model,
            config.lm_weight,
        );

        let pipeline = LayerPipelineBuilder::new()
            .add_layer(spelling_layer)
            .add_layer(grammar_layer)
            .add_layer(lm_layer)
            .build();

        Ok(Self {
            pipeline,
            tokenizer: Tokenizer::new(),
        })
    }

    pub fn correct(&self, text: &str) -> CorrectionResult {
        let tokens = self.tokenizer.tokenize(text);

        // Build initial lattice
        let initial = tokens_to_lattice(&tokens);

        // Apply correction pipeline
        let corrected = match self.pipeline.apply(&initial) {
            Ok(lat) => lat,
            Err(e) => return CorrectionResult::error(e),
        };

        // Extract best path
        let best = viterbi(&mut corrected.clone());

        // Build result with change tracking
        CorrectionResult {
            original: text.to_string(),
            corrected: reconstruct_text(&best, &tokens),
            changes: extract_changes(&tokens, &best),
            confidence: calculate_confidence(&best),
        }
    }
}

// `Clone` lets callers such as a cache hand out copies of a result
#[derive(Debug, Clone)]
pub struct CorrectionResult {
    pub original: String,
    pub corrected: String,
    pub changes: Vec<Change>,
    pub confidence: f64,
}

#[derive(Debug, Clone)]
pub struct Change {
    pub position: usize,
    pub original: String,
    pub corrected: String,
    pub change_type: ChangeType,
}

#[derive(Debug, Clone)]
pub enum ChangeType {
    Spelling,
    Grammar,
    Punctuation,
    Capitalization,
}
```

## Interactive Correction

### Suggestions API

```rust
use itertools::Itertools; // provides `sorted_by` on iterators

impl TextCorrector {
    /// Get suggestions for a specific position
    pub fn get_suggestions(&self, text: &str, position: usize, limit: usize)
        -> Vec<Suggestion>
    {
        let tokens = self.tokenizer.tokenize(text);
        let lattice = self.build_lattice(&tokens);

        // Get edges at position
        let candidates: Vec<_> = lattice.edges_at_position(position)
            .map(|edge| Suggestion {
                text: edge.label().to_string(),
                score: edge.weight.value(),
                reason: edge.metadata().correction_type(),
            })
            // Lowest cost first; `total_cmp` cannot panic on NaN
            .sorted_by(|a, b| a.score.total_cmp(&b.score))
            .take(limit)
            .collect();

        candidates
    }

    /// Apply a specific suggestion
    pub fn apply_suggestion(&self, text: &str, position: usize, suggestion: &str)
        -> String
    {
        let mut tokens = self.tokenizer.tokenize(text);
        // Leave the text unchanged if `position` is out of range
        if let Some(token) = tokens.get_mut(position) {
            token.text = suggestion.to_string();
        }
        reconstruct_from_tokens(&tokens)
    }
}

#[derive(Debug)]
pub struct Suggestion {
    pub text: String,
    pub score: f64,
    pub reason: String,
}
```

## Performance Optimization

### Caching Corrections

```rust
use std::collections::HashMap;
use std::sync::RwLock;

pub struct CachedCorrector {
    inner: TextCorrector,
    cache: RwLock<HashMap<String, CorrectionResult>>,
    max_cache_size: usize,
}

impl CachedCorrector {
    pub fn correct(&self, text: &str) -> CorrectionResult {
        // Check cache
        if let Some(result) = self.cache.read().unwrap().get(text) {
            return result.clone();
        }

        // Compute correction
        let result = self.inner.correct(text);

        // Cache result
        let mut cache = self.cache.write().unwrap();
        if cache.len() < self.max_cache_size {
            cache.insert(text.to_string(), result.clone());
        }

        result
    }
}
```
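
The cache above stops accepting new entries once it is full. If eviction is preferred, one simple policy is first-in, first-out. The sketch below is a standalone illustration with hypothetical names (`FifoCache`, `get`, `insert`), not a lling-llang type, and it leaves locking to the caller.

```rust
use std::collections::{HashMap, VecDeque};

/// Bounded cache that evicts the oldest entry once `capacity` is reached.
struct FifoCache<V> {
    entries: HashMap<String, V>,
    order: VecDeque<String>,
    capacity: usize,
}

impl<V: Clone> FifoCache<V> {
    fn new(capacity: usize) -> Self {
        Self {
            entries: HashMap::new(),
            order: VecDeque::new(),
            capacity,
        }
    }

    /// Returns a copy of the cached value, if any.
    fn get(&self, key: &str) -> Option<V> {
        self.entries.get(key).cloned()
    }

    fn insert(&mut self, key: &str, value: V) {
        if self.capacity == 0 {
            return;
        }

        // Overwriting an existing key does not change its age
        if self.entries.contains_key(key) {
            self.entries.insert(key.to_string(), value);
            return;
        }

        // Make room by dropping the entry that was inserted first
        if self.entries.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }

        self.order.push_back(key.to_string());
        self.entries.insert(key.to_string(), value);
    }
}
```

FIFO ignores how often an entry is read. For text that repeats the same phrases, a least-recently-used policy may keep more useful entries at the cost of updating the order on every hit.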

### Incremental Correction

```rust
pub struct IncrementalCorrector {
    inner: TextCorrector,
    last_text: String,
    last_lattice: Option<Lattice<TropicalWeight, HashMapBackend>>,
}

impl IncrementalCorrector {
    pub fn correct(&mut self, text: &str) -> CorrectionResult {
        // Find common prefix with last text
        let common_len = self.find_common_prefix_len(&self.last_text, text);

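        if common_len > 0 && self.last_lattice.is_some() {
            // Reuse prefix of previous lattice
            let lattice = self.extend_lattice(text, common_len);
            self.last_lattice = Some(lattice.clone());
            self.last_text = text.to_string();
            // Process...
            // (placeholder so the branch type-checks; the remaining steps
            // are elided in this outline)
            todo!("build a CorrectionResult from the extended lattice")
        } else {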
            // Full recomputation
            let result = self.inner.correct(text);
            self.last_text = text.to_string();
            result
        }
    }
}
```
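
`find_common_prefix_len` is not defined on this page. One way it might work, sketched below as a free function with the hypothetical name `common_token_prefix_len`, is to measure the shared prefix in bytes and then back off to a whitespace boundary, so that only whole, unchanged tokens are reused.

```rust
/// Byte length of the longest prefix shared by `previous` and `current`
/// that ends on a token boundary in both strings.
fn common_token_prefix_len(previous: &str, current: &str) -> usize {
    // Byte length of the shared prefix, counted in whole characters so the
    // result is always a valid slice index in both strings
    let shared: usize = previous
        .chars()
        .zip(current.chars())
        .take_while(|(a, b)| a == b)
        .map(|(a, _)| a.len_utf8())
        .sum();

    // The prefix ends on a boundary if each string either stops here or
    // continues with whitespace
    let at_boundary = |s: &str| s[shared..].chars().next().map_or(true, char::is_whitespace);
    if at_boundary(previous) && at_boundary(current) {
        return shared;
    }

    // Otherwise the last token is only partly shared: back off to just
    // after the last whitespace inside the prefix
    current[..shared]
        .char_indices()
        .rev()
        .find(|(_, c)| c.is_whitespace())
        .map_or(0, |(index, c)| index + c.len_utf8())
}
```

For `"the quick"` followed by `"the quiet"` this returns 4, the length of `"the "`, because the second token changed even though its first three letters did not.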

## Next Steps

- [Speech/NLP](speech-nlp.md): Speech recognition integration
- [Library Usage](library-usage.md): Generic integration patterns
- [liblevenshtein](../liblevenshtein/overview.md): Fuzzy matching library
- [Layers](../../architecture/layers.md): Layer architecture