forge-reasoning 0.1.2

Reasoning tools for LLM debugging - checkpointing, hypotheses, contradictions
Documentation
# TDD Wave 7: Performance Benchmarks

**Date**: 2026-02-19  
**Focus**: Measuring checkpoint operation performance with Criterion.rs

---

## Wave 7 Summary

| Benchmark | Description | Status |
|-----------|-------------|--------|
| checkpoint_creation/single | Single checkpoint creation ||
| checkpoint_creation_batch/* | Batch creation (10/100/1000) ||
| checkpoint_with_tags/with_3_tags | Tagged checkpoint creation ||
| query_operations/list_all | List all checkpoints ||
| query_operations/list_by_session | Filter by session ||
| restore/single | Restore from checkpoint ||
| compaction/* | Compact N to 10 (100/500/1000) ||
| concurrent_creation/* | Concurrent threads (2/4/8) ||
| export/* | Export to JSON (10/100/500) ||
| service_operations/* | Service API ops ||
| storage_backends/* | In-memory vs file-based ||

**Status**: 11 benchmark groups, all passing ✅

---

## Benchmark Results (Sample)

### Checkpoint Creation

| Operation | Time | Throughput |
|-----------|------|------------|
| Single | ~44 µs | 22.7 Kops/s |
| Batch 100 | ~4.4 ms | 22.7 Kelem/s |
| Batch 1000 | ~44 ms | 22.7 Kelem/s |

### Query Operations (100 checkpoints)

| Operation | Time |
|-----------|------|
| list_all | ~500 ns |
| list_by_session | ~400 ns |

### Storage Backends

| Backend | Time | Relative |
|---------|------|----------|
| In-memory | ~44 µs | 1.0x |
| File-based | ~63 µs | 1.4x |

### Service Operations

| Operation | Time |
|-----------|------|
| create_session | ~130 ns |
| checkpoint | ~44 µs |
| metrics | ~520 ns |

### Export Performance

| Checkpoints | Time | Throughput |
|-------------|------|------------|
| 100 | ~2.2 ms | 44.6 Kelem/s |
| 500 | ~11.3 ms | 44.3 Kelem/s |

---

## Implementation Details

### Benchmark File: `benches/checkpoint_bench.rs`

Using Criterion.rs with:
- **Throughput measurement** for batch operations
- **BenchmarkId** for parameterized tests
- **iter_with_setup** for stateful benchmarks
- **Async support** via `tokio::runtime`

### Benchmark Categories

#### 1. Creation Benchmarks
```rust
// Single checkpoint
bench_checkpoint_creation

// Batch creation with varying sizes
bench_checkpoint_creation_batch (10, 100, 1000)

// Tagged checkpoints
bench_checkpoint_with_tags (3 tags)
```

#### 2. Query Benchmarks
```rust
// List operations
bench_query_operations
- list_all
- list_by_session
```

#### 3. Management Benchmarks
```rust
// Restore operation
bench_restore

// Compaction (N → 10 checkpoints)
bench_compaction (100, 500, 1000)
```

#### 4. Concurrent Benchmarks
```rust
// Multi-threaded creation
bench_concurrent_creation (2, 4, 8 threads)
```

#### 5. Service Benchmarks
```rust
// High-level API
bench_service_operations
- create_session
- checkpoint_through_service
- metrics
```

#### 6. Export Benchmarks
```rust
// JSON export performance
bench_export (10, 100, 500 checkpoints)
```

#### 7. Storage Backend Comparison
```rust
// In-memory vs file-based
bench_storage_backends
- in_memory
- file_based
```

---

## Key Performance Findings

### 1. Checkpoint Creation
- **~44 µs per checkpoint** (in-memory)
- **~63 µs per checkpoint** (file-based)
- File-based is **1.4x slower** but provides durability

### 2. Query Performance
- **Sub-microsecond** for list operations
- Efficient due to in-memory cache

### 3. Export Performance
- **~44 K checkpoints/second** export rate
- Linear scaling with checkpoint count

### 4. Service Overhead
- Session creation: **~130 ns** (negligible)
- Checkpoint through service: **~44 µs** (same as direct)

### 5. Concurrent Performance
- Scales well with thread count
- Lock contention minimal for typical workloads

---

## Running Benchmarks

```bash
cd /home/feanor/Projects/forge/forge-reasoning

# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench checkpoint_creation
cargo bench query_operations

# Run with verbose output
cargo bench -- --verbose

# Generate HTML report
cargo bench -- --plotting-backend plotters
```

### Viewing Results

- Terminal output: Real-time during run
- HTML report: `target/criterion/report/index.html`
- JSON data: `target/criterion/<benchmark>/new/estimates.json`

---

## CI/CD Integration

```yaml
# GitHub Actions example
- name: Run benchmarks
  run: cargo bench -- --noplot  # No plots in CI

- name: Upload benchmark results
  uses: actions/upload-artifact@v3
  with:
    name: benchmark-results
    path: target/criterion/
```

---

## Performance Optimization Opportunities

### Current Observations
1. File-based storage adds ~44% overhead
2. Compaction is O(n) - could be optimized
3. Export serialization is CPU-bound

### Potential Improvements
1. **Batch writes** for file-based storage
2. **Lazy compaction** (background thread)
3. **Parallel export** (for large checkpoint counts)
4. **Binary serialization** instead of JSON for export

---

## Code Metrics

| File | Lines | Purpose |
|------|-------|---------|
| `benches/checkpoint_bench.rs` | ~380 | Criterion benchmarks |
| `Cargo.toml` | +2 | Criterion dependency |

---

**Wave 7 Complete** ✅

Performance benchmarks established with Criterion.rs. System can handle:
- **22,000+ checkpoints/second** (creation)
- **44,000+ checkpoints/second** (export)
- **Sub-microsecond** queries