libgrammstein 0.1.0

Hybrid language model (N-gram + Embeddings) for WFST text correction
# grammstein CLI Reference

The `grammstein` CLI trains, inspects, evaluates, and queries language models from the command line.

## Installation

Build the CLI with the `cli` feature:

```bash
cargo build --release --features cli
```

The binary is located at `target/release/grammstein`.

## Global Options

```
-v, --verbose    Enable verbose output
-q, --quiet      Suppress progress bars and status messages
-h, --help       Print help
-V, --version    Print version
```

## Commands

### corpus - Corpus Processing

Process and analyze text corpora.

#### corpus stats

Display corpus statistics including word counts, vocabulary, and token distribution.

```bash
grammstein corpus stats <PATH> [OPTIONS]

Options:
  --top <N>       Show top N most frequent words (default: 10)
  --format <FMT>  Output format: text, json (default: text)
```

Example:
```bash
grammstein corpus stats corpus.txt
grammstein corpus stats wikipedia.xml.bz2 --top 20
```

#### corpus sample

Sample random sentences from a corpus.

```bash
grammstein corpus sample <PATH> [OPTIONS]

Options:
  -n, --count <N>    Number of sentences to sample (default: 5)
  --seed <SEED>      Random seed for reproducibility
```

Example:
```bash
grammstein corpus sample corpus.txt -n 10
```

#### corpus detect

Detect the language of a corpus.

```bash
grammstein corpus detect <PATH>
```

Example:
```bash
grammstein corpus detect corpus.txt
# Output:
# Detected language: en (English)
# Confidence: 99.8%
# Reliable: yes
```
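
For scripting, the language code can be pulled out of the report. The following is a minimal sketch, assuming the real output keeps the `Detected language: <code> (<name>)` layout shown above (without the leading `# `). `detected_lang_code` is a hypothetical helper, not part of the CLI.

```bash
# Hypothetical helper: reads a `corpus detect` report on stdin and prints
# the language code found on the "Detected language:" line.
detected_lang_code() {
  sed -n 's/^Detected language: \([^ ]*\).*/\1/p'
}

# Usage (runs only if the binary and the corpus are available):
if command -v grammstein >/dev/null 2>&1 && [ -f corpus.txt ]; then
  grammstein corpus detect corpus.txt | detected_lang_code
fi
```

If the line is missing, the helper prints nothing, so callers can treat an empty result as "not detected".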

---

### train - Model Training

Train n-gram, embedding, and hybrid models.

#### train ngram

Train an n-gram language model.

```bash
grammstein train ngram <CORPUS> <OUTPUT> [OPTIONS]

Options:
  --order <N>           N-gram order (default: 5)
  --min-count <N>       Minimum n-gram count (default: 1)
  --checkpoint <PATH>   Save checkpoints to path
  --checkpoint-interval <N>
                        Checkpoint every N sentences (default: 100000)
  --resume <PATH>       Resume from checkpoint
```

Example:
```bash
grammstein train ngram corpus.txt model.bin --order 5
grammstein train ngram large-corpus.txt model.bin --checkpoint ./checkpoints
```
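
A long training run can be wrapped so that it resumes when checkpoints already exist. This is only a sketch: it assumes `--resume` accepts the same path that was passed to `--checkpoint`, which the option list above does not confirm. `checkpoint_flags` is a hypothetical helper.

```bash
# Hypothetical helper: prints the checkpoint-related flags for `train ngram`.
# Assumption: --resume takes the same path that was given to --checkpoint.
checkpoint_flags() {
  dir="$1"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    # Directory exists and is not empty: continue the earlier run.
    echo "--checkpoint $dir --resume $dir"
  else
    # Nothing saved yet: start fresh and write checkpoints.
    echo "--checkpoint $dir"
  fi
}

# Usage (runs only if the binary and the corpus are available):
if command -v grammstein >/dev/null 2>&1 && [ -f large-corpus.txt ]; then
  # Unquoted on purpose so the flags split into separate arguments.
  grammstein train ngram large-corpus.txt model.bin $(checkpoint_flags ./checkpoints)
fi
```

Because the flags are expanded unquoted, this form only works for checkpoint paths without spaces.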

#### train embedding

Train subword embeddings.

```bash
grammstein train embedding <CORPUS> <OUTPUT> [OPTIONS]

Options:
  --dim <N>            Embedding dimension (default: 100)
  --window <N>         Context window size (default: 5)
  --min-count <N>      Minimum word count (default: 5)
  --epochs <N>         Training epochs (default: 5)
  --neg-samples <N>    Negative samples (default: 5)
  --learning-rate <F>  Initial learning rate (default: 0.05)
  --checkpoint <PATH>  Save checkpoints to path
```

Example:
```bash
grammstein train embedding corpus.txt embed.bin --dim 300 --epochs 10
```

#### train hybrid

Train a hybrid model (n-gram + embeddings).

```bash
grammstein train hybrid <CORPUS> <OUTPUT> [OPTIONS]

Options:
  --ngram-order <N>    N-gram order (default: 5)
  --embed-dim <N>      Embedding dimension (default: 100)
  --lambda <F>         Interpolation weight for the n-gram component (default: 0.5)
```

Example:
```bash
grammstein train hybrid corpus.txt hybrid.bin --lambda 0.7
```
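
This reference does not say how the two components are combined. One common reading of an interpolation weight is a linear mixture, `P = lambda * P_ngram + (1 - lambda) * P_embed`, under which `--lambda 0.7` would put 70% of the weight on the n-gram estimate. The sketch below works through that arithmetic only. It is an assumption about the model, and `interpolate` is a hypothetical helper, not part of grammstein.

```bash
# Hypothetical helper: linear interpolation of two probabilities.
# Arguments: lambda, n-gram probability, embedding probability.
interpolate() {
  awk -v l="$1" -v pn="$2" -v pe="$3" \
    'BEGIN { printf "%.4f\n", l * pn + (1 - l) * pe }'
}

interpolate 0.7 0.02 0.005   # 0.7*0.02 + 0.3*0.005 = 0.0155
```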

---

### models - Model Information

Inspect trained models and list their contents.

#### models info

Display model information.

```bash
grammstein models info <MODEL>
```

Example:
```bash
grammstein models info model.bin
# Output:
# Model Information
#
# Path: model.bin
# Type: NgramModel<DynamicDawgChar>
# Size: 10.42 KiB
#
# N-gram component:
#   Order:       3
#   Vocab size:  291
#   Smoothing:   Modified Kneser-Ney
```

#### models list

List n-grams in a model.

```bash
grammstein models list <MODEL> [OPTIONS]

Options:
  -n, --limit <N>     Maximum entries to show (default: 100)
  --prefix <PREFIX>   Filter by prefix
```

---

### query - Query Models

Score text, list completions, and find similar words.

#### query score

Score a sentence or continuation.

```bash
grammstein query score <MODEL> <TEXT> [OPTIONS]

Options:
  --mode <MODE>     Scoring mode: sentence, continuation (default: sentence)
```

Example:
```bash
grammstein query score model.bin "the quick brown fox"
# Output:
# Tokens: the quick brown fox
# Mode:   sentence
#
# Log probability: -5.6733
# Perplexity:      291.00
```
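
The two figures in the sample output appear to be related by `perplexity = exp(-log probability)`, with the log probability read as a natural-log, per-token average: `exp(5.6733)` is approximately 291. This is inferred from the sample numbers, not from documented behavior. A small sketch of the conversion follows; `ppl_from_logprob` is a hypothetical helper.

```bash
# Hypothetical helper: perplexity from an average per-token
# natural-log probability (assumed interpretation of the output).
ppl_from_logprob() {
  awk -v lp="$1" 'BEGIN { printf "%.1f\n", exp(-lp) }'
}

ppl_from_logprob -5.6733   # prints 291.0
```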

#### query completions

Get top completions for a context.

```bash
grammstein query completions <MODEL> <CONTEXT> [OPTIONS]

Options:
  -n, --count <N>    Number of completions (default: 10)
```

Example:
```bash
grammstein query completions model.bin "the quick"
# Output:
# Top completions for "the quick":
#   1. brown         -2.345  (P=0.0956)
#   2. fox           -3.012  (P=0.0492)
```

#### query similar

Find similar words (embedding models only).

```bash
grammstein query similar <MODEL> <WORD> [OPTIONS]

Options:
  -n, --count <N>    Number of similar words (default: 10)
```

---

### convert - Format Conversion

Convert between model formats.

```bash
grammstein convert <INPUT> <OUTPUT> [OPTIONS]

Options:
  --format <FMT>    Output format: binary, zstd, json
  --compress        Enable compression (zstd)
```

---

### eval - Model Evaluation

Evaluate model performance.

```bash
grammstein eval <MODEL> <TEST_CORPUS> [OPTIONS]

Options:
  --metric <METRIC>    Metric: perplexity, accuracy (default: perplexity)
  --batch-size <N>     Batch size (default: 1000)
```

Example:
```bash
grammstein eval model.bin test.txt
# Output:
# Evaluation Results
#
# Sentences:   1000
# Tokens:      25431
# Perplexity:  142.56
# Time:        1.23s
```
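
In a CI job, the report can be used to gate on model quality. The sketch below assumes the real output contains a `Perplexity:  <value>` line as shown above. `perplexity_at_most` is a hypothetical helper, and the threshold is an arbitrary example.

```bash
# Hypothetical helper: reads an `eval` report on stdin and succeeds only if
# a "Perplexity:" line is present and its value is at or below the threshold.
perplexity_at_most() {
  awk -v max="$1" '
    /^Perplexity:/ { found = 1; ok = ($2 + 0 <= max + 0) }
    END { exit (found && ok) ? 0 : 1 }
  '
}

# Usage (runs only if the binary and the inputs are available):
if command -v grammstein >/dev/null 2>&1 && [ -f model.bin ] && [ -f test.txt ]; then
  grammstein eval model.bin test.txt | perplexity_at_most 150 \
    || echo "perplexity above threshold or missing" >&2
fi
```

A report without a perplexity line counts as a failure, so a change in output format does not silently pass the check.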

---

### repl - Interactive Mode

Start an interactive REPL for model exploration.

```bash
grammstein repl <MODEL>
```

REPL Commands:
- `score <text>` - Score a sentence
- `complete <context>` - Get completions
- `similar <word>` - Find similar words (embedding models only)
- `info` - Show model info
- `help` - Show help
- `quit` / `exit` - Exit REPL

Example session:
```
grammstein> score the quick brown fox
Log probability: -5.6733
Perplexity:      291.00

grammstein> complete the quick
  1. brown         -2.345  (P=0.0956)
  2. fox           -3.012  (P=0.0492)

grammstein> quit
```

---

## Environment Variables

- `GRAMMSTEIN_CACHE_DIR` - Cache directory for downloaded corpora
- `GRAMMSTEIN_LOG_LEVEL` - Log level: debug, info, warn, error
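
Both variables can be exported for a shell session or set for a single invocation. A minimal sketch follows; the cache path is an arbitrary example, not a documented default.

```bash
# Session-wide settings (the cache path below is an arbitrary choice).
export GRAMMSTEIN_CACHE_DIR="$HOME/.cache/grammstein"
export GRAMMSTEIN_LOG_LEVEL=info

# One-off override for a single command (runs only if the binary
# and the corpus are available).
if command -v grammstein >/dev/null 2>&1 && [ -f corpus.txt ]; then
  GRAMMSTEIN_LOG_LEVEL=debug grammstein corpus stats corpus.txt
fi
```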

## Exit Codes

- `0` - Success
- `1` - General error
- `2` - Invalid arguments
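
Scripts can branch on these codes. The sketch below maps each documented code to a message; `describe_exit` is a hypothetical helper, and any code outside the list above is reported as unknown.

```bash
# Hypothetical helper: maps the documented exit codes to a short message.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error" ;;
    2) echo "invalid arguments" ;;
    *) echo "unknown exit code $1" ;;
  esac
}

# Usage: run a command, then report how it ended
# (runs only if the binary is on PATH).
if command -v grammstein >/dev/null 2>&1; then
  grammstein models info model.bin
  describe_exit "$?"
fi
```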

## Examples

### Complete Workflow

```bash
# 1. Analyze corpus
grammstein corpus stats wikipedia-en.txt

# 2. Train n-gram model
grammstein train ngram wikipedia-en.txt ngram.bin \
  --order 5 \
  --checkpoint ./checkpoints

# 3. Check model
grammstein models info ngram.bin

# 4. Query model
grammstein query score ngram.bin "artificial intelligence"

# 5. Interactive exploration
grammstein repl ngram.bin
```

### Training with HTTP Streaming

```bash
# Train directly from a Wikipedia dump URL
grammstein train ngram \
  "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2" \
  model.bin \
  --checkpoint ./checkpoints
```