# LLM Test Bench CLI

A production-grade CLI for testing and benchmarking Large Language Model (LLM) applications.

## Phase 1 Implementation Status

This is the **Phase 1 (Milestone 1.3)** implementation of the CLI scaffolding, focusing on command structure and configuration management.

### Completed Features

- Complete Clap-based CLI command structure with derive API
- `config init` command with interactive setup wizard
- Command stubs for `test`, `bench`, and `eval` (full implementation in later phases)
- Shell completion generation for bash, zsh, fish, PowerShell, and elvish
- Comprehensive integration tests with assert_cmd (30 tests, all passing)
- Error handling with anyhow
- Global flags for verbose output and color control

## Installation

### Building from Source

```bash
# Clone the repository
git clone https://github.com/llm-test-bench/llm-test-bench.git
cd llm-test-bench

# Build the CLI
cargo build --release --package llm-test-bench

# The binary will be at: target/release/llm-test-bench
```

### Development Build

```bash
cargo build --package llm-test-bench
```
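
During development you can also run the CLI directly through Cargo; arguments after `--` are passed to the binary:

```bash
cargo run --package llm-test-bench -- --help
```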

## Usage

### Basic Commands

```bash
# Show help
llm-test-bench --help

# Show version
llm-test-bench --version

# Enable verbose output (global flag)
llm-test-bench --verbose <COMMAND>

# Disable colored output
llm-test-bench --no-color <COMMAND>
```

### Configuration Management

#### Initialize Configuration

```bash
# Interactive setup wizard
llm-test-bench config init

# Non-interactive mode (uses defaults)
llm-test-bench config init --non-interactive

# Initialize specific provider only
llm-test-bench config init --provider openai
```

The `config init` command will:
1. Guide you through an interactive setup
2. Create a configuration file at `~/.config/llm-test-bench/config.toml`
3. Prompt for API key preferences (environment variable recommended; see the export example below)
4. Configure default models and parameters
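
If you choose the recommended environment-variable option, the config file stores only the variable name (see `api_key_env` in the format section below) and you export the actual key in your shell, for example:

```bash
# Assumes the default OPENAI_API_KEY variable name from the wizard
export OPENAI_API_KEY="sk-..."
```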

#### Show Configuration

```bash
# Display current configuration
llm-test-bench config show

# Show with full TOML content
llm-test-bench config show --verbose
```

#### Validate Configuration

```bash
# Validate default config file
llm-test-bench config validate

# Validate specific config file
llm-test-bench config validate --config /path/to/config.toml
```

### Test Command (Phase 2)

```bash
# Run a single test (stub - coming in Phase 2)
llm-test-bench test openai --prompt "Hello, world!" --model gpt-4

# With additional parameters
llm-test-bench test anthropic \
  --prompt "Explain quantum computing" \
  --model claude-sonnet-4 \
  --temperature 0.7 \
  --max-tokens 1000

# With expected output for validation
llm-test-bench test openai \
  --prompt "What is 2+2?" \
  --expect "4"

# Save results to file
llm-test-bench test openai \
  --prompt "Test prompt" \
  --save ./results.json
```

### Benchmark Command (Phase 3)

```bash
# Run benchmark (stub - coming in Phase 3)
llm-test-bench bench \
  --dataset ./prompts.json \
  --providers openai,anthropic

# With concurrency and iterations
llm-test-bench bench \
  --dataset ./prompts.json \
  --providers openai,anthropic \
  --concurrency 4 \
  --iterations 10

# With caching and output format
llm-test-bench bench \
  --dataset ./dataset.json \
  --providers openai \
  --cache \
  --format html \
  --output ./benchmark-results
```
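
Dataset loading ships with Phase 3, so the schema is not final. As a rough sketch, a `prompts.json` might be an array of prompt records along these lines (field names are purely illustrative, not the final format):

```json
[
  { "id": "math-1", "prompt": "What is 2+2?", "expected": "4" },
  { "id": "geo-1", "prompt": "What is the capital of France?", "expected": "Paris" }
]
```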

### Evaluation Command (Phase 4)

```bash
# Evaluate results (stub - coming in Phase 4)
llm-test-bench eval \
  --results ./results.json \
  --metrics accuracy,latency,cost

# With baseline comparison
llm-test-bench eval \
  --results ./results.json \
  --baseline ./baseline.json \
  --metrics all

# With visualizations
llm-test-bench eval \
  --results ./results.json \
  --visualize \
  --format html \
  --output ./evaluation-report
```

### Shell Completions

Generate shell completions for your shell:

```bash
# Bash
llm-test-bench completions bash > ~/.local/share/bash-completion/completions/llm-test-bench

# Zsh
llm-test-bench completions zsh > ~/.zfunc/_llm-test-bench

# Fish
llm-test-bench completions fish > ~/.config/fish/completions/llm-test-bench.fish

# PowerShell
llm-test-bench completions powershell > llm-test-bench.ps1

# Elvish
llm-test-bench completions elvish > ~/.elvish/lib/completions/llm-test-bench.elv
```
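
For zsh, the target directory must be on your `fpath` before `compinit` runs; if `~/.zfunc` is not already configured, add something like this to `~/.zshrc`:

```bash
fpath=(~/.zfunc $fpath)
autoload -Uz compinit && compinit
```

Open a new shell afterwards for the completions to take effect.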

## Command Aliases

All major commands have short aliases:

- `test` → `t`
- `bench` → `b`
- `eval` → `e`

```bash
# These are equivalent:
llm-test-bench test openai --prompt "test"
llm-test-bench t openai --prompt "test"
```

## Configuration File Format

The configuration file uses TOML format and is stored at `~/.config/llm-test-bench/config.toml`:

```toml
version = "1.0"

[providers.openai]
type = "openai"
api_key_env = "OPENAI_API_KEY"  # Recommended: use environment variable
base_url = "https://api.openai.com/v1"

[providers.openai.defaults]
model = "gpt-4"
temperature = 0.7
max_tokens = 4096

[providers.anthropic]
type = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"

[providers.anthropic.defaults]
model = "claude-sonnet-4-20250514"
temperature = 0.7

[defaults]
output_dir = "./test-results"
cache_enabled = true
```
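
For reference, this layout maps naturally onto serde-style structs. The sketch below is illustrative only (it assumes the `serde`, `toml`, and `anyhow` crates; the crate's actual config types may differ):

```rust
use std::collections::HashMap;
use serde::Deserialize;

// Illustrative config types; not the crate's actual definitions.
#[derive(Deserialize)]
struct Config {
    version: String,
    providers: HashMap<String, Provider>,
    defaults: Option<Defaults>,
}

#[derive(Deserialize)]
struct Provider {
    #[serde(rename = "type")]
    kind: String,
    api_key_env: Option<String>,
    base_url: Option<String>,
    defaults: Option<ModelDefaults>,
}

#[derive(Deserialize)]
struct ModelDefaults {
    model: Option<String>,
    temperature: Option<f64>,
    max_tokens: Option<u32>,
}

#[derive(Deserialize)]
struct Defaults {
    output_dir: Option<String>,
    cache_enabled: Option<bool>,
}

fn load_config(path: &std::path::Path) -> anyhow::Result<Config> {
    let text = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&text)?)
}
```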

## Testing

### Running Tests

```bash
# Run all CLI tests
cargo test --package llm-test-bench

# Run only unit tests
cargo test --package llm-test-bench --lib

# Run only integration tests
cargo test --package llm-test-bench --test integration

# Run with output
cargo test --package llm-test-bench -- --nocapture

# Run specific test
cargo test --package llm-test-bench test_help_command
```

### Test Coverage

Current test suite (Phase 1):
- **6 unit tests**: Argument parsing and configuration serialization
- **24 integration tests**: End-to-end CLI behavior testing

All 30 tests passing.

## Architecture

### CLI Structure

```
cli/
├── src/
│   ├── main.rs                      # Entry point, command routing
│   ├── commands/
│   │   ├── mod.rs                   # Command module declarations
│   │   ├── config.rs                # Config init/show/validate (COMPLETE)
│   │   ├── test.rs                  # Test command stub (Phase 2)
│   │   ├── bench.rs                 # Benchmark command stub (Phase 3)
│   │   └── eval.rs                  # Evaluation command stub (Phase 4)
│   └── lib.rs                       # (Future: shared utilities)
├── tests/
│   └── integration/
│       ├── main.rs                  # Test module entry
│       └── cli_tests.rs             # Integration tests
└── Cargo.toml
```

### Command Flow

```
User Input → Clap Parser → Command Router (main.rs) → Command Handler → Output
             │
             ├─ Argument Validation
             └─ Environment Setup
```
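
For orientation, here is a condensed sketch of what the derive-based parser might look like; the variant payloads and exact attributes are illustrative, not the crate's actual `main.rs`:

```rust
use clap::{Parser, Subcommand};

// Illustrative top-level CLI definition; the real main.rs may differ.
#[derive(Parser)]
#[command(name = "llm-test-bench", version)]
struct Cli {
    /// Enable verbose output
    #[arg(long, global = true)]
    verbose: bool,

    /// Disable colored output
    #[arg(long, global = true)]
    no_color: bool,

    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    /// Run a single test (alias: t)
    #[command(alias = "t")]
    Test {
        /// Provider to test (e.g. openai, anthropic)
        provider: String,
        /// Prompt to send to the provider
        #[arg(long)]
        prompt: String,
    },
    // ... Bench, Eval, Config, Completions ...
}

fn main() {
    let cli = Cli::parse();
    // Route on cli.command here (see "Adding a New Command" below).
}
```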

### Error Handling

The CLI uses `anyhow` for application-level error handling:

```rust
pub async fn execute(args: Args, verbose: bool) -> Result<()> {
    // Command implementation
    Ok(())
}
```

Errors are caught in `main.rs` and displayed with:
- Error message
- Chain of causes (in verbose mode)
- Appropriate exit code
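
A sketch of that pattern (illustrative; assumes tokio, the `Cli` type sketched above, and a `run` helper that dispatches to the command handlers):

```rust
use clap::Parser;

// `Cli` and `run` are the illustrative items from the sketches above.
#[tokio::main]
async fn main() {
    let cli = Cli::parse();
    let verbose = cli.verbose;
    if let Err(err) = run(cli).await {
        eprintln!("error: {err}");
        if verbose {
            // anyhow keeps the full cause chain; skip(1) drops the top-level message
            for cause in err.chain().skip(1) {
                eprintln!("  caused by: {cause}");
            }
        }
        std::process::exit(1);
    }
}
```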

## Development Guide

### Adding a New Command

1. Create a new file in `cli/src/commands/`:

```rust
// cli/src/commands/mynewcommand.rs
use anyhow::Result;
use clap::Args;

#[derive(Args, Debug)]
pub struct MyNewCommandArgs {
    /// Description of argument
    #[arg(short, long)]
    pub my_arg: String,
}

pub async fn execute(args: MyNewCommandArgs, verbose: bool) -> Result<()> {
    if verbose {
        println!("Executing my new command with --my-arg {}", args.my_arg);
    }
    // Implementation here
    Ok(())
}
```

2. Add to `cli/src/commands/mod.rs`:

```rust
pub mod mynewcommand;
```

3. Add to command enum in `cli/src/main.rs`:

```rust
#[derive(Subcommand)]
enum Commands {
    // ... existing commands ...

    /// Description of my new command
    MyNewCommand(mynewcommand::MyNewCommandArgs),
}
```

4. Add to command router:

```rust
let result = match cli.command {
    // ... existing matches ...
    Commands::MyNewCommand(args) => mynewcommand::execute(args, cli.verbose).await,
};
```

5. Write integration tests in `cli/tests/integration/cli_tests.rs`.

### Adding Integration Tests

Use the `assert_cmd` crate for testing:

```rust
use assert_cmd::Command;
use predicates::prelude::*;

/// Helper that returns a `Command` for the compiled binary.
fn cli() -> Command {
    Command::cargo_bin("llm-test-bench").unwrap()
}

#[test]
fn test_my_new_command() {
    cli()
        .arg("mynewcommand")
        .arg("--my-arg")
        .arg("value")
        .assert()
        .success()
        .stdout(predicate::str::contains("expected output"));
}
```

## Phase 2+ Roadmap

### Phase 2: Test Command (Weeks 2-3)
- Implement provider integration (OpenAI, Anthropic)
- Add response generation and streaming
- Implement assertion validation
- Add result formatting and export

### Phase 3: Bench Command (Weeks 4-5)
- Implement multi-provider parallel execution
- Add dataset loading (JSON, YAML, CSV)
- Collect performance metrics
- Generate comparison reports

### Phase 4: Eval Command (Weeks 6-7)
- Implement metrics calculation
- Add baseline comparison
- Generate visualizations
- Create HTML/Markdown reports

## Contributing

When contributing to the CLI:

1. Follow Rust conventions and use `cargo fmt`
2. Add tests for all new features
3. Update this README for new commands
4. Ensure all tests pass: `cargo test --package llm-test-bench`
5. Test the CLI manually with various inputs

## License

MIT License - See LICENSE file

## Related Documentation

- [Main README](../README.md)
- [Architecture Documentation](../docs/ARCHITECTURE.md)
- [Implementation Roadmap](../docs/IMPLEMENTATION_ROADMAP.md)
- [Quick Reference](../docs/QUICK_REFERENCE.md)