<div align="center">
# 🧪 LLM Test Bench
**A comprehensive, production-ready framework for benchmarking, testing, and evaluating Large Language Models**
[Build Status](https://github.com/globalbusinessadvisors/llm-test-bench/actions)
[License: MIT](https://opensource.org/licenses/MIT)
[Rust](https://www.rust-lang.org)
[crates.io](https://crates.io)
[Features](#-features) • [Quick Start](#-quick-start) • [Documentation](#-documentation) • [Architecture](#-architecture) • [Contributing](#-contributing)
</div>
---
## 📋 Overview
LLM Test Bench is a powerful, enterprise-grade framework built in Rust for comprehensive testing, benchmarking, and evaluation of Large Language Models. It provides a unified interface to test multiple LLM providers, evaluate responses with sophisticated metrics, and visualize results through an intuitive dashboard.
### Why LLM Test Bench?
- **🌐 Multi-Provider Support**: Test 14+ LLM providers with 65 models through a single, unified interface
- **🆕 Latest Models**: Full support for GPT-5, Claude Opus 4, Gemini 2.5, and other 2025 releases
- **📊 Comprehensive Metrics**: Evaluate models with perplexity, coherence, relevance, faithfulness, and custom evaluators
- **⚡ High Performance**: Built in Rust for speed, safety, and scalability
- **🎨 Rich Visualization**: Interactive dashboards with real-time metrics and beautiful charts
- **🔌 Extensible**: Plugin system, custom evaluators, and distributed computing support
- **🐳 Production Ready**: Docker support, monitoring, REST/GraphQL APIs, and WebSocket streaming
---
## ✨ Features
### Core Capabilities
#### 🤖 Multi-Provider LLM Support
**OpenAI (27 models)**
```
gpt-5
gpt-4.5, gpt-4.5-2025-02-27
gpt-4.1, gpt-4.1-2025-04
gpt-4o, gpt-4o-2024-11-20, gpt-4o-2024-08-06, gpt-4o-2024-05-13
gpt-4o-mini, gpt-4o-mini-2024-07-18
o1, o1-preview, o1-preview-2024-09-12, o1-mini, o1-mini-2024-09-12, o3-mini
gpt-4-turbo, gpt-4-turbo-2024-04-09, gpt-4-turbo-preview
gpt-4-0125-preview, gpt-4-1106-preview
gpt-4, gpt-4-0613
gpt-3.5-turbo, gpt-3.5-turbo-0125, gpt-3.5-turbo-1106
```
**Anthropic (15 models)**
```
claude-opus-4, claude-opus-4-20250501
claude-sonnet-4.5, claude-sonnet-4.5-20250901
claude-sonnet-4, claude-sonnet-4-20250514
claude-3-5-sonnet-latest, claude-3-5-sonnet-20241022, claude-3-5-sonnet-20240620
claude-3-5-haiku-latest, claude-3-5-haiku-20241022
claude-3-opus-latest, claude-3-opus-20240229
claude-3-sonnet-20240229
claude-3-haiku-20240307
```
**Google Gemini (16 models)**
```
gemini-2.5-pro
gemini-2.5-computer-use, gemini-2.5-computer-use-20251007
gemini-2.0-flash-exp, gemini-2.0-flash-thinking-exp-1219
gemini-1.5-pro, gemini-1.5-pro-latest, gemini-1.5-pro-002, gemini-1.5-pro-001
gemini-1.5-flash, gemini-1.5-flash-latest, gemini-1.5-flash-002
gemini-1.5-flash-001, gemini-1.5-flash-8b
gemini-pro, gemini-pro-vision
```
**Mistral AI (7 models)**
```
mistral-code, mistral-code-20250604
magistral-large, magistral-medium, magistral-small
voxtral-small, voxtral-small-20250701
```
**Additional Providers**
- **Azure OpenAI**: All OpenAI models via Azure endpoints
- **AWS Bedrock**: Claude, Llama, Titan, and more
- **Cohere**: Command, Command R/R+
- **Open Source**: Ollama, Hugging Face, Together AI, Replicate
- **Specialized**: Groq, Perplexity AI
#### 📊 Advanced Evaluation Metrics
- **Perplexity Analysis**: Statistical language model evaluation
- **Coherence Scoring**: Semantic consistency and logical flow
- **Relevance Evaluation**: Context-aware response quality
- **Faithfulness Testing**: Source attribution and hallucination detection
- **LLM-as-Judge**: Use LLMs to evaluate other LLMs
- **Text Analysis**: Readability, sentiment, toxicity, PII detection
- **Custom Evaluators**: Build your own evaluation metrics (see the sketch below)
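To make the custom-evaluator idea concrete, here is a minimal, self-contained sketch. The `Evaluator` trait and the `KeywordCoverage` metric are hypothetical stand-ins for illustration only; the framework's real interface lives under `core/src/evaluators/` and may differ.

```rust
/// Hypothetical trait mirroring the idea of a pluggable evaluator.
pub trait Evaluator {
    fn name(&self) -> &str;
    /// Score a model response against a prompt, in [0.0, 1.0].
    fn evaluate(&self, prompt: &str, response: &str) -> f64;
}

/// Toy metric: fraction of required keywords present in the response.
pub struct KeywordCoverage {
    keywords: Vec<String>,
}

impl Evaluator for KeywordCoverage {
    fn name(&self) -> &str {
        "keyword_coverage"
    }

    fn evaluate(&self, _prompt: &str, response: &str) -> f64 {
        let lower = response.to_lowercase();
        let hits = self
            .keywords
            .iter()
            .filter(|k| lower.contains(&k.to_lowercase()))
            .count();
        hits as f64 / self.keywords.len().max(1) as f64
    }
}

fn main() {
    let eval = KeywordCoverage {
        keywords: vec!["qubit".into(), "superposition".into()],
    };
    let score = eval.evaluate(
        "Explain quantum computing",
        "A qubit can exist in a superposition of states.",
    );
    println!("{} = {score:.2}", eval.name()); // keyword_coverage = 1.00
}
```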
#### 🎯 Benchmarking & Testing
- **Systematic Testing**: Automated test suites with rich assertions
- **Comparative Analysis**: Side-by-side model comparison
- **Performance Profiling**: Latency, throughput, and cost tracking
- **A/B Testing**: Statistical significance testing for model selection (the underlying statistic is sketched after this list)
- **Optimization Tools**: Automatic parameter tuning and model recommendation
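Under the hood, A/B testing comes down to a significance test over per-request metric samples. The sketch below computes Welch's t-statistic for two latency samples; it illustrates the math rather than the framework's analytics module, and the latency numbers are made up. The 1.96 cutoff is the usual large-sample approximation for p < 0.05; small samples should use the t-distribution with Welch's degrees of freedom.

```rust
/// Welch's t-statistic for two independent samples with unequal variances.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let mean = |s: &[f64]| s.iter().sum::<f64>() / s.len() as f64;
    let var = |s: &[f64], m: f64| {
        s.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (s.len() as f64 - 1.0)
    };
    let (ma, mb) = (mean(a), mean(b));
    let (va, vb) = (var(a, ma), var(b, mb));
    (ma - mb) / (va / a.len() as f64 + vb / b.len() as f64).sqrt()
}

fn main() {
    // Hypothetical per-request latencies (ms) for two candidate models.
    let model_a = [812.0, 790.5, 845.2, 801.1, 833.9, 798.4];
    let model_b = [901.3, 876.0, 922.8, 889.5, 910.2, 895.7];
    let t = welch_t(&model_a, &model_b);
    println!("t = {t:.2}, significant at ~5%: {}", t.abs() > 1.96);
}
```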
#### 📈 Visualization & Reporting
- **Interactive Dashboard**: Real-time metrics with Chart.js
- **Rich Charts**: Performance graphs, cost analysis, trend visualization
- **Multiple Formats**: HTML reports, JSON exports, custom templates
- **Cost Analysis**: Track spending across providers and models
- **Historical Trends**: Long-term performance tracking
#### 🔌 API & Integration
- **REST API**: Complete HTTP API with authentication
- **GraphQL**: Flexible query interface for complex data needs
- **WebSocket**: Real-time streaming and live updates
- **Monitoring**: Prometheus metrics and health checks
- **Distributed Computing**: Scale benchmarks across multiple nodes
#### 🔗 Extensibility
- **Plugin System**: WASM-based sandboxed plugins (see the loading sketch after this list)
- **Custom Evaluators**: Implement domain-specific metrics
- **Multimodal Support**: Image, audio, and video evaluation
- **Database Backend**: PostgreSQL with repository pattern
- **Flexible Architecture**: Clean, modular design for easy extension
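To show what WASM sandboxing looks like in practice, the sketch below loads a module with the `wasmtime` crate and calls an exported function (it also assumes the `anyhow` crate for error handling). The `plugin.wasm` file and its `score` export are hypothetical; the framework's actual plugin ABI is described in [docs/PLUGINS.md](docs/PLUGINS.md).

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Compile and instantiate a sandboxed plugin module.
    let engine = Engine::default();
    let module = Module::from_file(&engine, "plugin.wasm")?; // hypothetical plugin
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Look up and call a hypothetical exported scoring function.
    let score = instance.get_typed_func::<(i32, i32), i32>(&mut store, "score")?;
    let result = score.call(&mut store, (128, 42))?;
    println!("plugin score: {result}");
    Ok(())
}
```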
---
## 🚀 Quick Start
### Prerequisites
- **Rust**: 1.75.0 or later ([Install Rust](https://rustup.rs/))
- **API Keys**: At least one LLM provider API key
### Installation
```bash
# Clone the repository
git clone https://github.com/globalbusinessadvisors/llm-test-bench.git
cd llm-test-bench
# Build the project
cargo build --release
# Install CLI globally (optional)
cargo install --path cli
```
### Configuration
Set up your API keys as environment variables:
```bash
# OpenAI
export OPENAI_API_KEY="sk-..."
# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
# Google
export GOOGLE_API_KEY="..."
# AWS Bedrock
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
```
Or create a `.env` file:
```bash
cp .env.example .env
# Edit .env with your API keys
```
### Basic Usage
```bash
# Run a simple benchmark with GPT-5
llm-test-bench bench --provider openai --model gpt-5 --prompt "Explain quantum computing"
# Test with Claude Opus 4
llm-test-bench bench --provider anthropic --model claude-opus-4 --prompt "Code review this function"
# Use Gemini 2.5 Computer Use
llm-test-bench bench --provider google --model gemini-2.5-computer-use --prompt "Automate this task"
# Compare multiple models across providers
llm-test-bench compare \
--models "openai:gpt-5,anthropic:claude-opus-4,google:gemini-2.5-pro" \
--prompt "Write a Python function to sort a list"
# Benchmark code models
llm-test-bench bench --provider mistral --model mistral-code --prompt "Implement binary search"
# Analyze results
llm-test-bench analyze --results benchmark_results.json
# Launch interactive dashboard
llm-test-bench dashboard --port 8080
# Optimize model selection
llm-test-bench optimize \
--metric latency \
--max-cost 0.01 \
--dataset prompts.json
```
### Docker Deployment
```bash
# Using Docker Compose (includes PostgreSQL, Redis, Prometheus)
docker-compose up -d
# Access the dashboard
open http://localhost:8080
# View metrics
open http://localhost:9090 # Prometheus
```
---
## 📚 Documentation
### Getting Started
- [Quick Start Guide](docs/QUICKSTART_PHASE4.md) - Get up and running in 5 minutes
- [CLI Reference](docs/CLI_REFERENCE.md) - Complete command-line documentation
- [Configuration Guide](docs/CONFIGURATION.md) - Advanced configuration options
### Architecture & Design
- [Architecture Overview](docs/ARCHITECTURE_REPORT.md) - System design and components
- [Workspace Structure](docs/WORKSPACE_STRUCTURE.md) - Project organization
- [Technical Architecture](plans/PHASE5_TECHNICAL_ARCHITECTURE.md) - Deep dive into design
### Features
- [Provider Support](docs/PROVIDERS.md) - All supported LLM providers
- [API Documentation](docs/API.md) - REST & GraphQL API reference
- [Monitoring](docs/MONITORING.md) - Observability and metrics
- [Distributed Computing](docs/DISTRIBUTED.md) - Scaling across nodes
- [Multimodal](docs/MULTIMODAL.md) - Image, audio, and video support
- [Plugins](docs/PLUGINS.md) - Extensibility and custom plugins
### Deployment
- [Docker Deployment](docs/DOCKER_DEPLOYMENT.md) - Containerized deployment guide
- [Database Setup](docs/DATABASE.md) - PostgreSQL configuration
### Development
- [Phase Implementation Reports](docs/) - Detailed implementation history
- [Contributing Guide](CONTRIBUTING.md) - How to contribute
- [Development Setup](docs/DEVELOPMENT.md) - Set up your dev environment
---
## 🏗️ Architecture
LLM Test Bench follows a clean, modular architecture:
```
┌───────────────────────────────────────────────────────────────┐
│                           CLI Layer                           │
│      bench │ compare │ analyze │ dashboard │ optimize         │
└───────────────────────────────────────────────────────────────┘
                                │
┌───────────────────────────────────────────────────────────────┐
│                      Core Library (core/)                     │
├───────────────────────────────────────────────────────────────┤
│  • Providers      • Evaluators       • Orchestration          │
│  • Analytics      • Visualization    • Monitoring             │
│  • Distributed    • Plugins          • Multimodal             │
│  • API Server     • Database         • Configuration          │
└───────────────────────────────────────────────────────────────┘
                                │
┌───────────────────────────────────────────────────────────────┐
│                       External Services                       │
│       LLM APIs │ PostgreSQL │ Redis │ Prometheus │ S3         │
└───────────────────────────────────────────────────────────────┘
```
### Key Components
- **Providers**: Unified interface for 14+ LLM providers (see the trait sketch after this list)
- **Evaluators**: Pluggable metrics for response quality assessment
- **Orchestration**: Intelligent routing, ranking, and comparison
- **Visualization**: Interactive dashboards and rich reporting
- **API Server**: REST, GraphQL, and WebSocket endpoints
- **Distributed**: Cluster coordination for large-scale benchmarks
- **Monitoring**: Prometheus metrics and health checks
- **Plugins**: WASM-based extensibility system
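Conceptually, the provider layer reduces every backend to one small async interface. The trait below is a simplified, hypothetical rendering of that idea, not the framework's actual API (the real trait in `core/src/providers/` would also cover streaming, token usage, and typed errors); it uses Rust 1.75's native async-fn-in-trait and assumes the `tokio` runtime.

```rust
/// Simplified stand-in for the unified provider interface.
trait Provider {
    fn id(&self) -> &str;
    async fn complete(&self, model: &str, prompt: &str) -> Result<String, String>;
}

/// Stub backend; real implementations call the vendor's HTTP API.
struct EchoProvider;

impl Provider for EchoProvider {
    fn id(&self) -> &str {
        "echo"
    }

    async fn complete(&self, model: &str, prompt: &str) -> Result<String, String> {
        Ok(format!("[{model}] {prompt}"))
    }
}

#[tokio::main]
async fn main() {
    let provider = EchoProvider;
    let out = provider.complete("gpt-5", "Explain quantum computing").await;
    println!("{}: {:?}", provider.id(), out);
}
```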
---
## 🛠️ Technology Stack
- **Language**: Rust π¦
- **CLI**: Clap (command-line parsing)
- **Async**: Tokio (async runtime)
- **HTTP**: Axum (web framework; see the server sketch after this list)
- **Database**: SQLx + PostgreSQL
- **Serialization**: Serde (JSON/YAML)
- **GraphQL**: Async-GraphQL
- **Monitoring**: Prometheus client
- **WebSocket**: Tokio-Tungstenite
- **Distributed**: Custom protocol over TCP
- **Plugins**: Wasmtime (WASM runtime)
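As a taste of how the stack fits together, here is a minimal Axum + Tokio server exposing a health-check route in the spirit of the monitoring endpoints above. It assumes `axum` 0.7, `tokio` with the `full` feature, and `serde_json`, and is an illustrative sketch rather than an excerpt of the actual API server.

```rust
use axum::{routing::get, Json, Router};
use serde_json::json;

// Health-check handler in the style of the framework's monitoring endpoints.
async fn health() -> Json<serde_json::Value> {
    Json(json!({ "status": "ok" }))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/health", get(health));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080")
        .await
        .expect("failed to bind");
    axum::serve(listener, app).await.expect("server error");
}
```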
---
## 📦 Project Structure
```
llm-test-bench/
├── cli/                    # Command-line interface
│   ├── src/
│   │   ├── commands/       # CLI commands (bench, compare, etc.)
│   │   └── main.rs
│   └── tests/              # Integration tests
├── core/                   # Core library
│   ├── src/
│   │   ├── providers/      # LLM provider implementations
│   │   ├── evaluators/     # Evaluation metrics
│   │   ├── orchestration/  # Model routing & comparison
│   │   ├── visualization/  # Dashboard & charts
│   │   ├── api/            # REST/GraphQL/WebSocket
│   │   ├── distributed/    # Cluster coordination
│   │   ├── monitoring/     # Metrics & health checks
│   │   ├── plugins/        # Plugin system
│   │   ├── multimodal/     # Image/audio/video
│   │   ├── analytics/      # Statistics & optimization
│   │   └── config/         # Configuration
│   └── tests/              # Unit & integration tests
├── docs/                   # Documentation
├── examples/               # Usage examples
├── plans/                  # Architecture & planning docs
└── docker-compose.yml      # Docker deployment
```
---
## 🎯 Use Cases
### 1. Model Selection
Compare multiple LLM providers to choose the best model for your use case based on quality, cost, and latency.
### 2. Quality Assurance
Systematic testing of LLM applications with rich assertions and automated evaluation metrics.
### 3. Performance Benchmarking
Measure and track latency, throughput, and cost across different models and configurations.
### 4. Regression Testing
Ensure model updates don't degrade quality with historical comparison and automated alerts.
### 5. Cost Optimization
Identify the most cost-effective model that meets your quality requirements.
### 6. Research & Experimentation
Rapid prototyping and comparison of different prompts, models, and parameters.
---
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
### Development Setup
```bash
# Clone and build
git clone https://github.com/globalbusinessadvisors/llm-test-bench.git
cd llm-test-bench
cargo build
# Run tests
cargo test
# Run with logging
RUST_LOG=debug cargo run -- bench --help
# Format code
cargo fmt
# Lint
cargo clippy -- -D warnings
```
### Areas for Contribution
- 🌐 New LLM provider integrations
- 📊 Additional evaluation metrics
- 🎨 Visualization improvements
- 📝 Documentation enhancements
- 🐛 Bug fixes and performance improvements
- ✨ New features and capabilities
---
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## 🙏 Acknowledgments
- Built with [Rust](https://www.rust-lang.org/) 🦀
- Inspired by the need for comprehensive LLM testing tools
- Thanks to all contributors and the open-source community
---
## 📞 Support
- **Issues**: [GitHub Issues](https://github.com/globalbusinessadvisors/llm-test-bench/issues)
- **Discussions**: [GitHub Discussions](https://github.com/globalbusinessadvisors/llm-test-bench/discussions)
- **Documentation**: [docs/](docs/)
---
## 🗺️ Roadmap
### Completed ✅
- ✅ Multi-provider LLM support (14+ providers)
- ✅ Advanced evaluation metrics
- ✅ Visualization dashboard
- ✅ REST/GraphQL/WebSocket APIs
- ✅ Distributed computing
- ✅ Monitoring & observability
- ✅ Plugin system
- ✅ Docker deployment
- ✅ PostgreSQL backend
### In Progress 🚧
- 🚧 Enhanced multimodal support
- 🚧 Advanced cost optimization
- 🚧 Plugin marketplace
- 🚧 Cloud deployment templates
### Planned 📋
- 📋 Real-time collaboration features
- 📋 Advanced A/B testing framework
- 📋 Integration with MLOps platforms
- 📋 Enterprise SSO and RBAC
---
<div align="center">
**⭐ Star us on GitHub: it motivates us a lot!**
[Report Bug](https://github.com/globalbusinessadvisors/llm-test-bench/issues) • [Request Feature](https://github.com/globalbusinessadvisors/llm-test-bench/issues) • [Documentation](docs/)
Made with ❤️ by the LLM Test Bench Team
</div>