libgrammstein 0.1.0

Hybrid language model (N-gram + Embeddings) for WFST text correction
# libgrammstein Documentation

**libgrammstein** is a hybrid language model library combining N-gram models with subword embeddings. It is designed to integrate with [lling-llang](https://github.com/f1r3fly-io/lling-llang) for WFST-based text correction and normalization.

## What is libgrammstein?

libgrammstein provides:

- **N-gram Language Model**: Statistical word sequence prediction using Modified Kneser-Ney smoothing
- **Subword Embeddings**: FastText-style embeddings for handling out-of-vocabulary words
- **Hybrid Model**: Combines both approaches for robust scoring
- **WFST Integration**: Implements lling-llang's `LanguageModel` trait for lattice rescoring

```
┌─────────────────────────────────────────────────────────────────┐
│                        libgrammstein                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────┐   ┌─────────────────┐                     │
│  │   N-gram Model  │   │    Subword      │                     │
│  │                 │   │   Embeddings    │                     │
│  │  Modified KN    │   │   FastText-     │                     │
│  │  smoothing      │   │   style         │                     │
│  └────────┬────────┘   └────────┬────────┘                     │
│           │                     │                               │
│           └──────────┬──────────┘                               │
│                      ▼                                          │
│           ┌─────────────────────┐                               │
│           │   HybridLanguage    │                               │
│           │       Model         │                               │
│           │                     │                               │
│           │  Implements         │                               │
│           │  LanguageModel      │                               │
│           │  trait              │                               │
│           └─────────────────────┘                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
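
One common way to merge the two branches in the diagram above is linear interpolation in probability space. The sketch below illustrates that idea only and is not confirmed to be how `HybridLanguageModel` combines its scores. The function name `interpolate_log_probs`, the weight `lambda`, and the assumption that both components produce natural-log probabilities are hypothetical.

```rust
/// Hypothetical helper: mixes two natural-log probabilities as
/// `ln(lambda * exp(ngram_lp) + (1 - lambda) * exp(embed_lp))`.
///
/// `lambda` is the weight given to the N-gram score and must lie in [0, 1].
pub fn interpolate_log_probs(ngram_lp: f64, embed_lp: f64, lambda: f64) -> f64 {
    assert!((0.0..=1.0).contains(&lambda), "lambda must be in [0, 1]");

    // Move the weights into log space so the sum can be evaluated with the
    // log-sum-exp trick, which avoids underflow for very small probabilities.
    let a = lambda.ln() + ngram_lp;
    let b = (1.0 - lambda).ln() + embed_lp;
    let m = a.max(b);

    // Both terms have zero probability mass.
    if m == f64::NEG_INFINITY {
        return f64::NEG_INFINITY;
    }
    m + ((a - m).exp() + (b - m).exp()).ln()
}
```

With `lambda = 1.0` the result reduces to the N-gram score, and with `lambda = 0.0` to the embedding score, so the weight can be tuned on held-out data.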

## Quick Start

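```rust
use libgrammstein::prelude::*;

// `Box<dyn Error>` is an assumption; substitute the crate's own error type if it exports one.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a trained hybrid model
    let model = HybridLanguageModel::load("model.bin")?;

    // Score a token sequence (the result is a log probability)
    let log_prob = model.score_sequence(&["the", "quick", "brown", "fox"]);

    // Score a continuation (also a log probability, hence `.exp()` below)
    let next_log_prob = model.score_continuation(&["the", "quick"], "brown");

    println!("Sequence log probability: {}", log_prob);
    println!("P(brown | the quick): {}", next_log_prob.exp());
    Ok(())
}
```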

### Training a Model

```rust
use libgrammstein::prelude::*;
use libgrammstein::corpus::PlaintextReader;
use libgrammstein::ngram::TrainerBuilder;

// `Box<dyn Error>` is an assumption; substitute the crate's own error type if it exports one.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Stream corpus from directory
    let reader = PlaintextReader::from_directory("./corpus")?;

    // Train N-gram model
    let ngram = TrainerBuilder::new()
        .order(5)
        .train(&reader)?;

    // Train subword embeddings
    let embedding = EmbeddingTrainer::new()
        .dimension(100)
        .epochs(20)
        .train(&reader)?;

    // Combine into hybrid model and write it to disk
    let hybrid = HybridLanguageModel::new(ngram, embedding);
    hybrid.save("model.bin")?;
    Ok(())
}
```

## Documentation Structure

### Architecture

- [Overview](architecture/overview.md) - High-level design and principles
- [Data Flow](architecture/data-flow.md) - How data flows through the system
- [Threading Model](architecture/threading-model.md) - Concurrency and parallelism

### Components

#### N-gram Model

- [Overview](components/ngram/overview.md) - What N-gram models are and how they work
- [Modified Kneser-Ney](components/ngram/modified-kneser-ney.md) - The smoothing algorithm
- [Trie Storage](components/ngram/trie-storage.md) - Dictionary backend storage
- [Query API](components/ngram/query-api.md) - Probability computation
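
As a rough illustration of the smoothing step, Modified Kneser-Ney uses three discounts (for N-grams seen once, twice, and three or more times) that are commonly estimated from count-of-counts with the closed form of Chen and Goodman. The sketch below shows only that estimate. The `Discounts` struct and `estimate_discounts` function are hypothetical names, and libgrammstein's trainer may compute or store these values differently.

```rust
/// Hypothetical container for the three Modified Kneser-Ney discounts.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Discounts {
    pub d1: f64,
    pub d2: f64,
    pub d3_plus: f64,
}

/// Estimates the discounts from count-of-counts, where `n[k - 1]` is the
/// number of distinct N-grams observed exactly `k` times (k = 1..=4).
///
/// Returns `None` when n1, n2, or n3 is zero, because the closed form
/// divides by each of them.
pub fn estimate_discounts(n: [u64; 4]) -> Option<Discounts> {
    let [n1, n2, n3, n4] = n.map(|count| count as f64);
    if n1 == 0.0 || n2 == 0.0 || n3 == 0.0 {
        return None;
    }

    // Y = n1 / (n1 + 2 * n2)
    let y = n1 / (n1 + 2.0 * n2);
    Some(Discounts {
        d1: 1.0 - 2.0 * y * n2 / n1,
        d2: 2.0 - 3.0 * y * n3 / n2,
        d3_plus: 3.0 - 4.0 * y * n4 / n3,
    })
}
```

In practice these discounts are estimated separately for each N-gram order.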

#### Subword Embeddings

- [Overview](components/embedding/overview.md) - Word embeddings with subword enrichment
- [BPE Tokenizer](components/embedding/bpe-tokenizer.md) - Byte-Pair Encoding
- [Skip-gram](components/embedding/skip-gram.md) - Training with negative sampling
- [Similarity](components/embedding/similarity.md) - Cosine similarity and nearest neighbors
- [Phonetic Embeddings](components/embedding/phonetic.md) - Combining orthographic and phonetic similarity
- [Acoustic Word Embeddings](components/embedding/acoustic-word.md) - Fixed-dimensional audio representations
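
To make the subword idea concrete, the sketch below shows two generic building blocks: character N-gram extraction with boundary markers, as in the original FastText formulation, and cosine similarity between vectors. Both function names are hypothetical and this is not libgrammstein's API. The library lists a BPE tokenizer, so its actual segmentation may differ from the character N-grams used here.

```rust
/// Hypothetical helper: returns all character N-grams of `word` with lengths
/// in `min_n..=max_n`, after wrapping the word in `<` and `>` markers.
pub fn char_ngrams(word: &str, min_n: usize, max_n: usize) -> Vec<String> {
    let chars: Vec<char> = format!("<{word}>").chars().collect();
    let mut ngrams = Vec::new();
    for n in min_n..=max_n {
        // `windows(0)` panics, and windows longer than the input yield nothing.
        if n == 0 || n > chars.len() {
            continue;
        }
        for window in chars.windows(n) {
            ngrams.push(window.iter().collect());
        }
    }
    ngrams
}

/// Hypothetical helper: cosine similarity of two equal-length vectors.
/// Returns `None` for mismatched lengths or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

An out-of-vocabulary word can then be represented by combining the vectors of its subword units, which is what lets a subword model score words it never saw during training.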

#### Hybrid Model

- [Overview](components/hybrid/overview.md) - Combining N-gram and embeddings
- [Interpolation](components/hybrid/interpolation.md) - Score combination strategies
- [OOV Handling](components/hybrid/oov-handling.md) - Out-of-vocabulary word handling

#### Text Generation

- [Text Generation](components/generation/text-generation.md) - Autoregressive text generation with sampling strategies

#### Corpus Processing

- [Overview](components/corpus/overview.md) - Streaming corpus architecture
- [Streaming](components/corpus/streaming.md) - Memory-efficient processing
- [Formats](components/corpus/formats.md) - Wikipedia, Gutenberg, plaintext

#### Acoustic Processing

- [Overview](components/acoustic/overview.md) - Audio feature extraction module
- [Feature Extraction](components/acoustic/features.md) - MFCC, filterbank, streaming extraction
- [Acoustic Models](components/acoustic/models.md) - Candle-based neural acoustic models

#### Neural Module

- [Overview](components/neural/overview.md) - ModernBERT-based neural components
- [Model](components/neural/model.md) - ModernBERT model wrapper and inference
- [Embedder](components/neural/embedder.md) - Document and query embedding
- [Rescorer](components/neural/rescorer.md) - Neural rescoring for beam search
- [Summarizer](components/neural/summarizer.md) - Extractive summarization with MMR
- [Cache](components/neural/cache.md) - KV cache for efficient inference

#### RAG (Retrieval-Augmented Generation)

- [Overview](components/rag/overview.md) - Document indexing and retrieval
- [Document](components/rag/document.md) - Document structures and metadata
- [Backend](components/rag/backend.md) - Retrieval backends (Exact, HNSW)
- [Index](components/rag/index.md) - RagIndex with topic integration
- [Retriever](components/rag/retriever.md) - Query and retrieval pipeline
- [Builder](components/rag/builder.md) - Index construction workflow

#### Topic Extraction

- [Overview](components/topic/overview.md) - BERTopic-style topic modeling
- [Clustering](components/topic/clustering.md) - Hierarchical agglomerative clustering
- [c-TF-IDF](components/topic/ctfidf.md) - Keyword extraction algorithm
- [Dendrogram](components/topic/dendrogram.md) - Topic hierarchy navigation

#### Paradigm Detection

- [Overview](components/paradigm/overview.md) - Programming paradigm analysis
- [Detection](components/paradigm/detection.md) - ParadigmDetector usage and configuration
- [Indicators](components/paradigm/indicators.md) - OOP, FP, reactive, procedural indicators
- [API Patterns](components/paradigm/api-patterns.md) - PrefixSpan sequence mining
- [Domain Patterns](components/paradigm/domain-patterns.md) - Rholang and MeTTa pattern catalogs

#### Code Embeddings

- [Overview](components/code-embeddings/overview.md) - Neural code embedding models
- [CodeT5+](components/code-embeddings/codet5.md) - CodeT5+ model integration
- [UniXcoder](components/code-embeddings/unixcoder.md) - UniXcoder model integration
- [GraphCodeBERT](components/code-embeddings/graphcodebert.md) - GraphCodeBERT with data flow
- [Ensemble](components/code-embeddings/ensemble.md) - Multi-model ensemble strategies
- [Caching](components/code-embeddings/caching.md) - Embedding cache management

#### Subtree Mining

- [Overview](components/subtree/overview.md) - Frequent subtree pattern discovery
- [TreeminerD](components/subtree/treeminer-d.md) - TreeminerD algorithm details

#### Code Correction

- [Overview](components/code/overview.md) - Multi-language code correction module
- [Language Trait](components/code/language.md) - CodeLanguage trait and token types
- [Languages](components/code/languages.md) - Python, Rust, JavaScript, Rholang, MeTTa support
- [AST](components/code/ast.md) - Tree-sitter integration and parsing
- [Tokenizer](components/code/tokenizer.md) - Code tokenization system
- [CPG](components/code/cpg.md) - Code Property Graphs (AST + CFG + DFG)
- [Correction Framework](components/code/correction.md) - Correction types and traits
- [Correctors](components/code/correctors/overview.md) - Lexical, grammar, semantic correctors
  - [Lexical](components/code/correctors/lexical.md) - Fuzzy matching with Levenshtein distance
  - [Grammar](components/code/correctors/grammar.md) - PCFG + Earley parsing
  - [Semantic](components/code/correctors/semantic.md) - GNN + CPG analysis
  - [Ensemble](components/code/correctors/ensemble.md) - Multi-source aggregation
- [Pipeline](components/code/pipeline.md) - End-to-end correction workflow
- [PCFG](components/code/pcfg.md) - Probabilistic context-free grammars
- [GNN](components/code/gnn.md) - Graph neural networks for code
- [Embeddings](components/code/embeddings.md) - Code embeddings (UniXcoder, GraphCodeBERT)
- [Constrained Decoding](components/code/constrained-decoding.md) - Grammar-constrained generation
- [WFST Export](components/code/wfst-export.md) - PCFG to WFST approximation
- [Subtree Mining](components/code/subtree-mining.md) - TreeminerD frequent pattern mining
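
The lexical corrector listed above relies on fuzzy matching with Levenshtein distance. As a minimal sketch of that idea, the code below computes the distance with a plain two-row dynamic program and picks the closest vocabulary entry within a threshold. It is a textbook illustration, not the library's implementation, which presumably uses liblevenshtein's dictionary backends instead. The names `levenshtein` and `closest` are hypothetical.

```rust
/// Hypothetical helper: Levenshtein edit distance over Unicode scalar values.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // `prev[j]` holds the distance between the first `i` chars of `a`
    // and the first `j` chars of `b` for the previous row `i`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical helper: returns the vocabulary entry nearest to `token`,
/// provided its distance does not exceed `max_distance`.
pub fn closest<'a>(token: &str, vocabulary: &[&'a str], max_distance: usize) -> Option<&'a str> {
    vocabulary
        .iter()
        .copied()
        .map(|candidate| (levenshtein(token, candidate), candidate))
        .filter(|&(distance, _)| distance <= max_distance)
        .min_by_key(|&(distance, _)| distance)
        .map(|(_, candidate)| candidate)
}
```

This linear scan costs one full distance computation per vocabulary entry, which is why larger vocabularies are usually searched with an automaton over a trie dictionary.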

### Integration

#### lling-llang

- [Overview](integration/lling-llang/overview.md) - Integration architecture
- [LanguageModel Trait](integration/lling-llang/language-model-trait.md) - Implementing the trait
- [Pipeline Usage](integration/lling-llang/pipeline-usage.md) - Using in correction pipelines
- [PathMap Synergy](integration/lling-llang/pathmap-synergy.md) - Shared infrastructure

#### liblevenshtein

- [Overview](integration/liblevenshtein/overview.md) - Dictionary backend integration
- [Backend Selection](integration/liblevenshtein/backend-selection.md) - Choosing the right backend

### Training

- [N-gram Training](training/ngram-training.md) - Count collection and smoothing
- [Embedding Training](training/embedding-training.md) - Skip-gram training workflow
- [Hyperparameters](training/hyperparameters.md) - Tuning guide

### API Reference

- [NgramModel](api/ngram-reference.md) - N-gram model API
- [SubwordEmbedding](api/embedding-reference.md) - Embedding API
- [HybridLanguageModel](api/hybrid-reference.md) - Hybrid model API
- [Traits](api/trait-reference.md) - Key traits and interfaces

### Examples

- [Train and Evaluate](examples/train-and-evaluate.md) - End-to-end workflow
- [Perplexity Filter](examples/perplexity-filter.md) - Text quality filtering
- [Spell Correction](examples/spell-correction.md) - lling-llang integration
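
For orientation, a perplexity filter can be sketched independently of the library. Perplexity is `exp(-log_prob / N)` for a sequence of `N` tokens, assuming a natural-log sequence score. The functions `perplexity` and `keep_fluent` below are hypothetical, and the `score` closure stands in for a call such as `model.score_sequence`, so the block compiles without libgrammstein.

```rust
/// Hypothetical helper: perplexity from a natural-log sequence probability.
/// Returns `None` for an empty sequence.
pub fn perplexity(total_log_prob: f64, token_count: usize) -> Option<f64> {
    if token_count == 0 {
        return None;
    }
    Some((-total_log_prob / token_count as f64).exp())
}

/// Hypothetical helper: keeps the lines whose perplexity under `score`
/// is at most `max_perplexity`. Tokenization is plain whitespace splitting,
/// and empty lines are dropped.
pub fn keep_fluent<'a, F>(lines: &[&'a str], max_perplexity: f64, score: F) -> Vec<&'a str>
where
    F: Fn(&[&str]) -> f64,
{
    lines
        .iter()
        .copied()
        .filter(|line| {
            let tokens: Vec<&str> = line.split_whitespace().collect();
            perplexity(score(&tokens), tokens.len())
                .is_some_and(|ppl| ppl <= max_perplexity)
        })
        .collect()
}
```

Lower perplexity means the model finds the text more predictable, so the threshold is best chosen by inspecting scores on a sample of known-good text.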

## Prerequisites

- **Rust**: 1.85+ (2024 edition)
- **liblevenshtein-rust**: Dictionary backends
- **Corpus data**: Wikipedia dumps, Project Gutenberg, or custom text files

## Features

```toml
[dependencies]
libgrammstein = { version = "0.1", features = ["lling-llang-integration", "serde"] }
```

| Feature | Description |
|---------|-------------|
| `default` | Core N-gram and embedding functionality |
| `lling-llang-integration` | Implements lling-llang's `LanguageModel` trait |
| `serde` | Model serialization/deserialization |
| `async` | Async corpus streaming (Tokio) |
| `acoustic` | Audio feature extraction (MFCC, filterbank) |
| `candle-model` | Candle-based neural acoustic models |
| `phonetic` | Phonetic-aware embeddings with Zompist rules |
| `neural-rescore` | ModernBERT embeddings, rescoring, and summarization |
| `rag` | RAG indexing with topic extraction |
| `code` | Core code correction (lexical corrector only) |
| `code-python` | Python language support |
| `code-rust` | Rust language support |
| `code-javascript` | JavaScript language support |
| `code-rholang` | Rholang language support |
| `code-metta` | MeTTa language support |
| `code-neural` | Neural code embeddings (UniXcoder, GraphCodeBERT) |

## Related Projects

- [lling-llang](https://github.com/f1r3fly-io/lling-llang): WFST framework for text correction
- [liblevenshtein-rust](https://github.com/f1r3fly-io/liblevenshtein-rust): Fuzzy string matching and trie dictionaries
- [F1R3FLY.io](https://f1r3fly.io): Distributed computing platform

## License

MIT OR Apache-2.0