# RAG Performance Regression Detection System
A comprehensive system for continuously benchmarking and monitoring RAG (Retrieval-Augmented Generation) operations to catch performance regressions immediately.
## 🎯 Overview
This system implements statistical significance testing with historical trend tracking to automatically detect when RAG operations degrade beyond a configurable threshold (default: 5% performance loss).
## 🚀 Key Features
- **Continuous Monitoring**: Automated benchmarking at configurable intervals
- **Statistical Analysis**: Z-test based regression detection with configurable confidence levels
- **Historical Tracking**: Rolling window analysis for trend identification
- **Multi-Metric Support**: Latency, throughput, memory usage, and accuracy tracking
- **Alerting System**: Immediate notifications when regressions are detected
- **CLI Integration**: Full command-line interface for manual and automated operations
## 📊 Monitored Metrics
### Retrieval Phase
- Retrieval time (BM25/sparse search)
- Chunks retrieved and used
- Relevance score distributions
### Generation Phase
- LLM response generation time
- Token usage and generation rates
- Context processing overhead
### End-to-End Performance
- Total query latency
- Throughput (queries/second)
- Memory usage tracking
- Success rate monitoring
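The metrics above can be pictured as one flat sample per benchmark run. The sketch below is illustrative only: the field and type names (`BenchmarkSample`, `generation_rate`) are assumptions, not the crate's actual `RagPerformanceMetrics` definition.

```rust
use std::time::Duration;

/// One benchmark sample covering all three phases.
/// Field names are hypothetical; the real struct may differ.
#[derive(Debug, Clone)]
pub struct BenchmarkSample {
    // Retrieval phase
    pub retrieval_time: Duration,
    pub chunks_retrieved: usize,
    pub chunks_used: usize,
    // Generation phase
    pub generation_time: Duration,
    pub tokens_generated: usize,
    // End-to-end
    pub total_latency: Duration,
    pub peak_memory_bytes: u64,
    pub success: bool,
}

impl BenchmarkSample {
    /// Tokens per second during the generation phase.
    pub fn generation_rate(&self) -> f64 {
        self.tokens_generated as f64 / self.generation_time.as_secs_f64()
    }
}
```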
## 🛠️ Usage
### CLI Commands
```bash
# Run a single performance benchmark
rk rag-perf benchmark --iterations 5 --format json

# Start continuous monitoring (Ctrl+C to stop)
rk rag-perf monitor --interval 300 --threshold 0.05

# Check for current regressions
rk rag-perf check

# View performance history and trends
rk rag-perf history --format text

# Configure monitoring settings
rk rag-perf config --threshold 0.03 --history-window 200
```
### Programmatic Usage
```rust
use reasonkit::rag::performance::{PerformanceConfig, RagPerformanceMonitor};
use reasonkit::rag::RagEngine;

async fn run() -> Result<(), Box<dyn std::error::Error>> {
    let rag_engine = RagEngine::in_memory()?;
    let config = PerformanceConfig::default();
    let monitor = RagPerformanceMonitor::new(rag_engine, config);

    // Run a single benchmark pass
    let metrics = monitor.run_benchmark().await?;

    // Check for regressions against the rolling baseline
    let regressions = monitor.detect_regressions().await?;
    for regression in regressions {
        println!(
            "🚨 {} degraded by {:.1}%",
            regression.metric,
            regression.change_percent * 100.0
        );
    }

    // Start continuous monitoring (runs until cancelled)
    monitor.start_continuous_monitoring().await?;
    Ok(())
}
```
## 🔬 Statistical Methodology
### Regression Detection Algorithm
1. **Baseline Establishment**: Uses rolling window of recent measurements
2. **Change Detection**: Compares current performance to historical baseline
3. **Statistical Testing**: Z-test with configurable confidence levels
4. **Threshold Application**: Only alerts on changes exceeding threshold (default: 5%)
### Significance Testing
- **Test Type**: Two-tailed Z-test for proportions/ratios
- **Confidence Level**: Configurable (default: 95%)
- **Sample Size**: Rolling window with minimum size requirements
- **P-Value Threshold**: Automatic based on confidence level
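The detection logic described above can be sketched with only the standard library. This is a minimal illustration under stated assumptions: `RollingStats` and the function name `is_regression` are hypothetical, the z-score here uses the sample standard deviation directly, and the critical value (e.g. 1.96 for 95% confidence) is passed in rather than derived from a p-value.

```rust
/// Rolling window of recent measurements (sketch; not the crate's actual type).
pub struct RollingStats {
    window: Vec<f64>,
    capacity: usize,
}

impl RollingStats {
    pub fn new(capacity: usize) -> Self {
        Self { window: Vec::new(), capacity }
    }

    /// Push a measurement, evicting the oldest once the window is full.
    pub fn push(&mut self, v: f64) {
        if self.window.len() == self.capacity {
            self.window.remove(0);
        }
        self.window.push(v);
    }

    pub fn mean(&self) -> f64 {
        self.window.iter().sum::<f64>() / self.window.len() as f64
    }

    /// Sample standard deviation (Bessel-corrected).
    pub fn std_dev(&self) -> f64 {
        let m = self.mean();
        let var = self.window.iter().map(|v| (v - m).powi(2)).sum::<f64>()
            / (self.window.len() as f64 - 1.0);
        var.sqrt()
    }
}

/// Flags a regression when the relative change over the baseline exceeds
/// `threshold` AND the z-score clears the critical value for the chosen
/// confidence level (e.g. 1.96 for 95%, two-tailed).
pub fn is_regression(stats: &RollingStats, current: f64, threshold: f64, z_critical: f64) -> bool {
    let mean = stats.mean();
    let change = (current - mean) / mean;
    if change <= threshold {
        return false; // below the alert threshold, never flag
    }
    let z = (current - mean) / stats.std_dev();
    z > z_critical
}
```

Note the two-stage gate: the threshold check prevents alerts on statistically significant but practically negligible drifts, while the z-test prevents alerts on large but noisy single samples.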
## ⚙️ Configuration
```rust
use std::time::Duration;

let config = PerformanceConfig {
    alert_threshold: 0.05,        // 5% degradation threshold
    sample_size: 100,             // Statistical sample size
    confidence_level: 0.95,       // 95% confidence
    history_window: 1000,         // Rolling window size
    benchmark_queries: vec![...], // Test queries
    enable_memory_monitoring: true,
    monitoring_interval: Duration::from_secs(300),
};
```
## 📈 Example Output
```
🚨 PERFORMANCE REGRESSION DETECTED 🚨
Metric: Retrieval Time
Change: +12.3% (+45.2 ms absolute)
Baseline: 368.1 ms, Current: 413.3 ms
Confidence: 97.2%, p-value: 0.0021
Timestamp: 2026-01-18T14:30:15Z
```
## 🏗️ Architecture
### Core Components
- **`RagPerformanceMonitor`**: Main monitoring engine
- **`PerformanceConfig`**: Configuration management
- **`RagPerformanceMetrics`**: Structured metric collection
- **`PerformanceRegression`**: Regression detection results
- **`RollingStats`**: Historical data analysis
### Data Flow
1. **Benchmark Execution**: Run queries against RAG engine with timing
2. **Metric Collection**: Gather latency, throughput, and quality metrics
3. **Statistical Analysis**: Compare against historical baselines
4. **Regression Detection**: Identify significant performance changes
5. **Alert Generation**: Notify stakeholders of issues
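The five steps above compose into one monitoring cycle. The sketch below is deliberately simplified: the function name `monitoring_cycle` is hypothetical, the benchmark is injected as a closure, a plain mean stands in for the full statistical test, and the alert is just a string.

```rust
/// One simplified monitoring cycle: benchmark, collect, compare to the
/// historical baseline, and emit an alert on regression.
/// Assumes `history` is non-empty (the baseline would be undefined otherwise).
pub fn monitoring_cycle(
    history: &mut Vec<f64>,            // historical latencies in ms
    run_benchmark: impl Fn() -> f64,   // step 1: benchmark execution
    threshold: f64,                    // e.g. 0.05 for 5%
) -> Option<String> {
    let latency = run_benchmark();                                      // 1-2. execute + collect
    let baseline = history.iter().sum::<f64>() / history.len() as f64;  // 3. statistical analysis
    history.push(latency);
    let change = (latency - baseline) / baseline;                       // 4. regression detection
    if change > threshold {
        Some(format!("latency degraded by {:.1}%", change * 100.0))     // 5. alert generation
    } else {
        None
    }
}
```

In the real system, step 3 would consult `RollingStats` and a significance test rather than a bare mean, and step 5 would route through the alerting system instead of returning a string.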
## 🔧 Integration
### CI/CD Integration
```yaml
# Example GitHub Actions workflow steps
- name: Performance Regression Check
  run: |
    rk rag-perf benchmark --iterations 10
    rk rag-perf check
- name: Alert on Regression
  if: failure()
  run: |
    echo "Performance regression detected!"
    # Send notification to team
```
### Monitoring Dashboard
The system provides JSON output suitable for integration with monitoring dashboards:
```json
{
  "total_measurements": 150,
  "metrics_tracked": 6,
  "rolling_stats": {
    "retrieval_time_ms": {
      "mean": 245.3,
      "std_dev": 12.8,
      "trend_slope": 0.02,
      "count": 150
    }
  }
}
```
## 🎯 Benefits
- **Early Detection**: Catch performance issues before they impact users
- **Data-Driven**: Statistical significance prevents false positives
- **Automated**: Continuous monitoring with minimal manual intervention
- **Configurable**: Adaptable to different performance requirements
- **Integrated**: Works with existing RAG infrastructure
## 📚 API Reference
See the [Rust documentation](https://docs.rs/reasonkit-core) for detailed API reference and examples.
## 🤝 Contributing
Performance regression detection is critical for maintaining RAG system reliability. Contributions to improve statistical methods, add new metrics, or enhance alerting mechanisms are welcome.