# RAG Performance Regression Detection System
A comprehensive system for continuously benchmarking and monitoring RAG (Retrieval-Augmented Generation) operations to catch performance regressions immediately.
## 🎯 Overview
This system implements statistical significance testing with historical trend tracking to automatically detect when RAG operations degrade beyond a configurable threshold (default: 5% performance loss).
## 🚀 Key Features
- **Continuous Monitoring**: Automated benchmarking at configurable intervals
- **Statistical Analysis**: Z-test based regression detection with configurable confidence levels
- **Historical Tracking**: Rolling window analysis for trend identification
- **Multi-Metric Support**: Latency, throughput, memory usage, and accuracy tracking
- **Alerting System**: Immediate notifications when regressions are detected
- **CLI Integration**: Full command-line interface for manual and automated operations
## 📊 Monitored Metrics
### Retrieval Phase
- Retrieval time (BM25/sparse search)
- Chunks retrieved and used
- Relevance score distributions
### Generation Phase
- LLM response generation time
- Token usage and generation rates
- Context processing overhead
### End-to-End Performance
- Total query latency
- Throughput (queries/second)
- Memory usage tracking
- Success rate monitoring
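The metrics above can be pictured as one flat sample per benchmark run. The sketch below is illustrative only: the field and type names (`BenchmarkSample`, `generation_rate`) are assumptions, not the crate's actual `RagPerformanceMetrics` definition.

```rust
use std::time::Duration;

/// One benchmark sample covering all three phases.
/// Field names are hypothetical; the real struct may differ.
#[derive(Debug, Clone)]
pub struct BenchmarkSample {
    // Retrieval phase
    pub retrieval_time: Duration,
    pub chunks_retrieved: usize,
    pub chunks_used: usize,
    // Generation phase
    pub generation_time: Duration,
    pub tokens_generated: usize,
    // End-to-end
    pub total_latency: Duration,
    pub peak_memory_bytes: u64,
    pub success: bool,
}

impl BenchmarkSample {
    /// Tokens per second during the generation phase.
    pub fn generation_rate(&self) -> f64 {
        self.tokens_generated as f64 / self.generation_time.as_secs_f64()
    }
}
```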
## 🛠️ Usage
### CLI Commands
```bash
# Run a single performance benchmark
rk rag-perf benchmark --iterations 5 --format json

# Start continuous monitoring (Ctrl+C to stop)
rk rag-perf monitor --interval 300 --threshold 0.05

# Check for current regressions
rk rag-perf check

# View performance history and trends
rk rag-perf history --format text

# Configure monitoring settings
rk rag-perf config --threshold 0.03 --history-window 200
```
### Programmatic Usage
```rust
use reasonkit::rag::performance::{PerformanceConfig, RagPerformanceMonitor};
use reasonkit::rag::RagEngine;

async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let rag_engine = RagEngine::in_memory()?;
    let config = PerformanceConfig::default();
    let monitor = RagPerformanceMonitor::new(rag_engine, config);

    // Run a single benchmark pass
    let metrics = monitor.run_benchmark().await?;

    // Check for regressions against the rolling baseline
    let regressions = monitor.detect_regressions().await?;
    for regression in regressions {
        println!(
            "🚨 {} degraded by {:.1}%",
            regression.metric,
            regression.change_percent * 100.0
        );
    }

    // Start continuous monitoring (runs until cancelled)
    monitor.start_continuous_monitoring().await?;
    Ok(())
}
```
## 🔬 Statistical Methodology
### Regression Detection Algorithm
1. **Baseline Establishment**: Uses rolling window of recent measurements
2. **Change Detection**: Compares current performance to historical baseline
3. **Statistical Testing**: Z-test with configurable confidence levels
4. **Threshold Application**: Only alerts on changes exceeding threshold (default: 5%)
### Significance Testing
- **Test Type**: Two-tailed Z-test for proportions/ratios
- **Confidence Level**: Configurable (default: 95%)
- **Sample Size**: Rolling window with minimum size requirements
- **P-Value Threshold**: Automatic based on confidence level
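The detection logic described above can be sketched with only the standard library. This is a minimal illustration under stated assumptions: `RollingStats` and the function name `is_regression` are hypothetical, the z-score here uses the sample standard deviation directly, and the critical value (e.g. 1.96 for 95% confidence) is passed in rather than derived from a p-value.

```rust
/// Rolling window of recent measurements (sketch; not the crate's actual type).
pub struct RollingStats {
    window: Vec<f64>,
    capacity: usize,
}

impl RollingStats {
    pub fn new(capacity: usize) -> Self {
        Self { window: Vec::new(), capacity }
    }

    /// Push a measurement, evicting the oldest once the window is full.
    pub fn push(&mut self, v: f64) {
        if self.window.len() == self.capacity {
            self.window.remove(0);
        }
        self.window.push(v);
    }

    pub fn mean(&self) -> f64 {
        self.window.iter().sum::<f64>() / self.window.len() as f64
    }

    /// Sample standard deviation (Bessel-corrected).
    pub fn std_dev(&self) -> f64 {
        let m = self.mean();
        let var = self.window.iter().map(|v| (v - m).powi(2)).sum::<f64>()
            / (self.window.len() as f64 - 1.0);
        var.sqrt()
    }
}

/// Flags a regression when the relative change over the baseline exceeds
/// `threshold` AND the z-score clears the critical value for the chosen
/// confidence level (e.g. 1.96 for 95%, two-tailed).
pub fn is_regression(stats: &RollingStats, current: f64, threshold: f64, z_critical: f64) -> bool {
    let mean = stats.mean();
    let change = (current - mean) / mean;
    if change <= threshold {
        return false; // below the alert threshold, never flag
    }
    let z = (current - mean) / stats.std_dev();
    z > z_critical
}
```

Note the two-stage gate: the threshold check prevents alerts on statistically significant but practically negligible drifts, while the z-test prevents alerts on large but noisy single samples.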
## ⚙️ Configuration
```rust
use std::time::Duration;

let config = PerformanceConfig {
    alert_threshold: 0.05,        // 5% degradation threshold
    sample_size: 100,             // Statistical sample size
    confidence_level: 0.95,       // 95% confidence
    history_window: 1000,         // Rolling window size
    benchmark_queries: vec![...], // Test queries
    enable_memory_monitoring: true,
    monitoring_interval: Duration::from_secs(300),
};
```
## 📈 Example Output
```
🚨 PERFORMANCE REGRESSION DETECTED 🚨
Metric: Retrieval Time
Change: +12.3% (+45.2 ms absolute)
Baseline: 368.1 ms, Current: 413.3 ms
Confidence: 97.2%, p-value: 0.0021
Timestamp: 2026-01-18T14:30:15Z
```
## 🏗️ Architecture
### Core Components
- **`RagPerformanceMonitor`**: Main monitoring engine
- **`PerformanceConfig`**: Configuration management
- **`RagPerformanceMetrics`**: Structured metric collection
- **`PerformanceRegression`**: Regression detection results
- **`RollingStats`**: Historical data analysis
### Data Flow
1. **Benchmark Execution**: Run queries against RAG engine with timing
2. **Metric Collection**: Gather latency, throughput, and quality metrics
3. **Statistical Analysis**: Compare against historical baselines
4. **Regression Detection**: Identify significant performance changes
5. **Alert Generation**: Notify stakeholders of issues
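The five steps above compose into one monitoring cycle. The sketch below is deliberately simplified: the function name `monitoring_cycle` is hypothetical, the benchmark is injected as a closure, a plain mean stands in for the full statistical test, and the alert is just a string.

```rust
/// One simplified monitoring cycle: benchmark, collect, compare to the
/// historical baseline, and emit an alert on regression.
/// Assumes `history` is non-empty (the baseline would be undefined otherwise).
pub fn monitoring_cycle(
    history: &mut Vec<f64>,            // historical latencies in ms
    run_benchmark: impl Fn() -> f64,   // step 1: benchmark execution
    threshold: f64,                    // e.g. 0.05 for 5%
) -> Option<String> {
    let latency = run_benchmark();                                      // 1-2. execute + collect
    let baseline = history.iter().sum::<f64>() / history.len() as f64;  // 3. statistical analysis
    history.push(latency);
    let change = (latency - baseline) / baseline;                       // 4. regression detection
    if change > threshold {
        Some(format!("latency degraded by {:.1}%", change * 100.0))     // 5. alert generation
    } else {
        None
    }
}
```

In the real system, step 3 would consult `RollingStats` and a significance test rather than a bare mean, and step 5 would route through the alerting system instead of returning a string.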
## 🔧 Integration
### CI/CD Integration
```yaml
# Example GitHub Actions workflow steps
- name: Performance Regression Check
  run: |
    rk rag-perf benchmark --iterations 10
    rk rag-perf check
- name: Alert on Regression
  if: failure()
  run: |
    echo "Performance regression detected!"
    # Send notification to team
```
### Monitoring Dashboard
The system provides JSON output suitable for integration with monitoring dashboards:
```json
{
  "total_measurements": 150,
  "metrics_tracked": 6,
  "rolling_stats": {
    "retrieval_time_ms": {
      "mean": 245.3,
      "std_dev": 12.8,
      "trend_slope": 0.02,
      "count": 150
    }
  }
}
```
## 🎯 Benefits
- **Early Detection**: Catch performance issues before they impact users
- **Data-Driven**: Statistical significance prevents false positives
- **Automated**: Continuous monitoring with minimal manual intervention
- **Configurable**: Adaptable to different performance requirements
- **Integrated**: Works with existing RAG infrastructure
## 📚 API Reference
See the [Rust documentation](https://docs.rs/reasonkit-core) for detailed API reference and examples.
## 🤝 Contributing
Performance regression detection is critical for maintaining RAG system reliability. Contributions to improve statistical methods, add new metrics, or enhance alerting mechanisms are welcome.