# LLM Test Bench CLI
A production-grade CLI for testing and benchmarking Large Language Model (LLM) applications.
## Phase 1 Implementation Status
This is the Phase 1 (Milestone 1.3) implementation of the CLI scaffolding, focusing on command structure and configuration management.
### Completed Features

- Complete Clap-based CLI command structure with derive API
- `config init` command with interactive setup wizard
- Command stubs for `test`, `bench`, and `eval` (full implementation in later phases)
- Shell completion generation for bash, zsh, fish, PowerShell, and elvish
- Comprehensive integration tests with `assert_cmd` (30 tests, all passing)
- Error handling with `anyhow`
- Global flags for verbose output and color control
## Installation

### Building from Source

```bash
# Clone the repository
git clone <repository-url>
cd llm-test-bench

# Build the CLI
cargo build --release

# The binary will be at: target/release/llm-test-bench
```
### Development Build

```bash
cargo build
cargo run -- --help
```
## Usage

### Basic Commands

```bash
# Show help
llm-test-bench --help

# Show version
llm-test-bench --version

# Enable verbose output (global flag)
llm-test-bench --verbose config show

# Disable colored output
llm-test-bench --no-color config show
```
### Configuration Management

#### Initialize Configuration

```bash
# Interactive setup wizard
llm-test-bench config init

# Non-interactive mode (uses defaults)
llm-test-bench config init --non-interactive

# Initialize specific provider only
llm-test-bench config init --provider openai
```
The `config init` command will:

- Guide you through an interactive setup
- Create a configuration file at `~/.config/llm-test-bench/config.toml`
- Prompt for API key preferences (environment variable recommended)
- Configure default models and parameters
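Resolving the configuration path above can be sketched with the standard library (`default_config_path` is a hypothetical helper, not the CLI's actual function; in practice `home` would come from the `HOME` environment variable):

```rust
use std::path::PathBuf;

// Build the default config path: ~/.config/llm-test-bench/config.toml
// (illustrative helper; the real implementation may differ).
fn default_config_path(home: &str) -> PathBuf {
    PathBuf::from(home)
        .join(".config")
        .join("llm-test-bench")
        .join("config.toml")
}
```

On Linux, `default_config_path("/home/alice")` yields `/home/alice/.config/llm-test-bench/config.toml`.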
#### Show Configuration

```bash
# Display current configuration
llm-test-bench config show

# Show with full TOML content
llm-test-bench config show --full
```
#### Validate Configuration

```bash
# Validate default config file
llm-test-bench config validate

# Validate specific config file
llm-test-bench config validate path/to/config.toml
```
### Test Command (Phase 2)

Planned usage (flag names are provisional until Phase 2 lands):

```bash
# Run a single test (stub - coming in Phase 2)
llm-test-bench test --provider openai --prompt "Summarize this text"

# With additional parameters
llm-test-bench test --provider openai --prompt "..." --temperature 0.2 --max-tokens 512

# With expected output for validation
llm-test-bench test --provider openai --prompt "..." --expected "..."

# Save results to file
llm-test-bench test --provider openai --prompt "..." --output results.json
```
### Benchmark Command (Phase 3)

Planned usage (flag names are provisional until Phase 3 lands):

```bash
# Run benchmark (stub - coming in Phase 3)
llm-test-bench bench --dataset prompts.json

# With concurrency and iterations
llm-test-bench bench --dataset prompts.json --concurrency 8 --iterations 5

# With caching and output format
llm-test-bench bench --dataset prompts.json --cache --format json
```
### Evaluation Command (Phase 4)

Planned usage (flag names are provisional until Phase 4 lands):

```bash
# Evaluate results (stub - coming in Phase 4)
llm-test-bench eval results.json

# With baseline comparison
llm-test-bench eval results.json --baseline baseline.json

# With visualizations
llm-test-bench eval results.json --visualize
```
### Shell Completions

Generate shell completions for your shell (the script prints to stdout; redirect it into your shell's completion directory):

```bash
# Bash
llm-test-bench completions bash

# Zsh
llm-test-bench completions zsh

# Fish
llm-test-bench completions fish

# PowerShell
llm-test-bench completions powershell

# Elvish
llm-test-bench completions elvish
```
### Command Aliases

All major commands have short aliases:

- `test` → `t`
- `bench` → `b`
- `eval` → `e`

```bash
# These are equivalent:
llm-test-bench test --help
llm-test-bench t --help
```
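Conceptually an alias is just a name mapping; in clap it is declared on the subcommand with `#[command(alias = "t")]`. A minimal sketch of the idea:

```rust
// Map a short alias back to its canonical command name (illustrative only;
// clap performs this resolution itself via `#[command(alias = "...")]`).
fn resolve_alias(name: &str) -> &str {
    match name {
        "t" => "test",
        "b" => "bench",
        "e" => "eval",
        other => other,
    }
}
```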
## Configuration File Format

The configuration file uses TOML format and is stored at `~/.config/llm-test-bench/config.toml`:

```toml
# Key and table names below are representative; run `config init` to
# generate the canonical file.
version = "1.0"

[providers.openai]
name = "openai"
api_key_env = "OPENAI_API_KEY" # Recommended: use environment variable
base_url = "https://api.openai.com/v1"

[providers.openai.defaults]
model = "gpt-4"
temperature = 0.7
max_tokens = 4096

[providers.anthropic]
name = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"

[providers.anthropic.defaults]
model = "claude-sonnet-4-20250514"
temperature = 0.7

[output]
directory = "./test-results"
color = true
```
## Testing

### Running Tests

```bash
# Run all CLI tests
cargo test --package llm-test-bench

# Run only unit tests
cargo test --package llm-test-bench --lib

# Run only integration tests
cargo test --package llm-test-bench --test integration

# Run with output
cargo test --package llm-test-bench -- --nocapture

# Run specific test
cargo test --package llm-test-bench <test_name>
```
### Test Coverage
Current test suite (Phase 1):
- 6 unit tests: Argument parsing and configuration serialization
- 24 integration tests: End-to-end CLI behavior testing
All 30 tests passing.
## Architecture

### CLI Structure

```
cli/
├── src/
│   ├── main.rs              # Entry point, command routing
│   ├── commands/
│   │   ├── mod.rs           # Command module declarations
│   │   ├── config.rs        # Config init/show/validate (COMPLETE)
│   │   ├── test.rs          # Test command stub (Phase 2)
│   │   ├── bench.rs         # Benchmark command stub (Phase 3)
│   │   └── eval.rs          # Evaluation command stub (Phase 4)
│   └── lib.rs               # (Future: shared utilities)
├── tests/
│   └── integration/
│       ├── main.rs          # Test module entry
│       └── cli_tests.rs     # Integration tests
└── Cargo.toml
```
### Command Flow

```
User Input → Clap Parser → Command Router (main.rs) → Command Handler → Output
                  ↓
         Argument Validation
                  ↓
         Environment Setup
```
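The flow above can be sketched with the standard library alone (the real CLI uses clap's derive parser; the enum, function names, and output strings here are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum Command {
    Test { prompt: String },
    Bench,
    Eval,
}

// Stand-in for the Clap parser: validate raw args into a typed Command.
fn parse(args: &[&str]) -> Result<Command, String> {
    match args {
        ["test", prompt] => Ok(Command::Test { prompt: prompt.to_string() }),
        ["bench"] => Ok(Command::Bench),
        ["eval"] => Ok(Command::Eval),
        other => Err(format!("unknown command: {:?}", other)),
    }
}

// Stand-in for the router in main.rs: dispatch to a handler, return output.
fn route(cmd: Command) -> String {
    match cmd {
        Command::Test { prompt } => format!("test stub: {prompt}"),
        Command::Bench => String::from("bench stub"),
        Command::Eval => String::from("eval stub"),
    }
}
```

In the real binary, `parse` is generated by clap from the derive annotations, and `route` is the `match cli.command` block in `main.rs`.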
### Error Handling

The CLI uses `anyhow` for application-level error handling; each command handler returns an `anyhow::Result`:

```rust
// Handler signature (argument type illustrative)
pub async fn execute(args: TestArgs) -> anyhow::Result<()> {
    // ...
}
```

Errors are caught in `main.rs` and displayed with:

- Error message
- Chain of causes (in verbose mode)
- Appropriate exit code
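A std-only sketch of that catch-and-report pattern (the real CLI uses `anyhow::Error` rather than `Box<dyn Error>`; `report` is an illustrative name):

```rust
use std::error::Error;
use std::process::ExitCode;

// Print the error and choose an exit code (sketch of main.rs's tail end).
fn report<T>(result: Result<T, Box<dyn Error>>) -> u8 {
    match result {
        Ok(_) => 0,
        Err(e) => {
            eprintln!("Error: {e}");
            // With anyhow, verbose mode would also walk `e.chain()` here
            // to print the chain of causes.
            1
        }
    }
}

fn main() -> ExitCode {
    let result: Result<(), Box<dyn Error>> = Err("config file not found".into());
    ExitCode::from(report(result))
}
```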
## Development Guide
### Adding a New Command

1. Create a new file in `cli/src/commands/`:

   ```rust
   // cli/src/commands/mynewcommand.rs
   use anyhow::Result;
   use clap::Args;

   #[derive(Args)]
   pub struct MyNewCommandArgs {
       // command-specific flags go here
   }

   pub async fn execute(args: MyNewCommandArgs) -> Result<()> {
       // command logic
       Ok(())
   }
   ```

2. Add to `cli/src/commands/mod.rs`:

   ```rust
   pub mod mynewcommand;
   ```

3. Add to the command enum in `cli/src/main.rs`:

   ```rust
   MyNewCommand(commands::mynewcommand::MyNewCommandArgs),
   ```

4. Add to the command router:

   ```rust
   let result = match cli.command {
       // ...existing arms...
       Commands::MyNewCommand(args) => commands::mynewcommand::execute(args).await,
   };
   ```

5. Write integration tests in `cli/tests/integration/cli_tests.rs`.
### Adding Integration Tests

Use the `assert_cmd` crate for testing:
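A typical test follows this shape (requires `assert_cmd` as a dev-dependency; the test name and asserted flag are illustrative):

```rust
use assert_cmd::Command;

#[test]
fn prints_help() {
    // Locate the compiled binary for this package and assert on its exit status.
    Command::cargo_bin("llm-test-bench")
        .unwrap()
        .arg("--help")
        .assert()
        .success();
}
```

`assert()` returns an `Assert` value, so further checks (e.g. on stdout) can be chained after `.success()`.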
## Phase 2+ Roadmap
### Phase 2: Test Command (Weeks 2-3)
- Implement provider integration (OpenAI, Anthropic)
- Add response generation and streaming
- Implement assertion validation
- Add result formatting and export
### Phase 3: Bench Command (Weeks 4-5)
- Implement multi-provider parallel execution
- Add dataset loading (JSON, YAML, CSV)
- Collect performance metrics
- Generate comparison reports
### Phase 4: Eval Command (Weeks 6-7)
- Implement metrics calculation
- Add baseline comparison
- Generate visualizations
- Create HTML/Markdown reports
## Contributing
When contributing to the CLI:

- Follow Rust conventions and use `cargo fmt`
- Add tests for all new features
- Update this README for new commands
- Ensure all tests pass: `cargo test --package llm-test-bench`
- Test the CLI manually with various inputs
## License
MIT License - See LICENSE file