# TenRSo-Planner Integration Guide
> **Version:** 0.1.0-alpha.2
> **Last Updated:** 2025-12-09

This guide demonstrates how to integrate tenrso-planner's advanced features into your tensor computation workflow.

---
## Table of Contents
1. [Quick Start](#quick-start)
2. [Planning Algorithms](#planning-algorithms)
3. [Parallel Ensemble Planning](#parallel-ensemble-planning)
4. [ML-Based Cost Calibration](#ml-based-cost-calibration)
5. [Plan Caching](#plan-caching)
6. [Hardware Simulation](#hardware-simulation)
7. [Quality Tracking](#quality-tracking)
8. [Production Workflow](#production-workflow)
9. [Best Practices](#best-practices)
---
## Quick Start
```rust
use tenrso_planner::{greedy_planner, EinsumSpec, PlanHints};
// Parse Einstein summation notation
let spec = EinsumSpec::parse("ij,jk->ik")?;
// Define tensor shapes
let shapes = vec![vec![100, 200], vec![200, 300]];
// Create a plan
let hints = PlanHints::default();
let plan = greedy_planner(&spec, &shapes, &hints)?;
// Inspect results
println!("FLOPs: {:.2e}", plan.estimated_flops);
println!("Memory: {} bytes", plan.estimated_memory);
println!("Steps: {}", plan.nodes.len());
```
---
## Planning Algorithms
TenRSo-Planner provides 6 production-grade planning algorithms:
### 1. Greedy Planner (Fast, Good Quality)
```rust
use tenrso_planner::{greedy_planner, EinsumSpec, PlanHints};
let spec = EinsumSpec::parse("ij,jk,kl->il")?;
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
let plan = greedy_planner(&spec, &shapes, &hints)?;
// O(n³) time, good for most cases
```
**Use when:** You need fast planning (< 1ms for 10 tensors)
### 2. Dynamic Programming (Optimal, Expensive)
```rust
use tenrso_planner::{dp_planner, EinsumSpec, PlanHints};
let spec = EinsumSpec::parse("ij,jk,kl->il")?;
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
let plan = dp_planner(&spec, &shapes, &hints)?;
// O(3^n) time, guaranteed optimal
```
**Use when:** You need provably optimal plans (≤ 20 tensors)
### 3. Beam Search (Better Quality, Moderate Speed)
```rust
use tenrso_planner::{beam_search_planner, EinsumSpec, PlanHints};
let spec = EinsumSpec::parse("ij,jk,kl->il")?;
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
let plan = beam_search_planner(&spec, &shapes, &hints, 5)?; // beam width = 5
// O(n³ * k) time, better than greedy
```
**Use when:** You have a medium-sized network (8-20 tensors) and want better quality than greedy
### 4. Simulated Annealing (Stochastic, Escapes Local Minima)
```rust
use tenrso_planner::{SimulatedAnnealingPlanner, Planner, PlanHints};
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
let planner = SimulatedAnnealingPlanner::with_params(1000.0, 0.95, 1000);
let plan = planner.make_plan("ij,jk,kl->il", &shapes, &hints)?;
// Stochastic search, configurable iterations
```
**Use when:** You have a large network and plan quality matters more than planning speed
### 5. Genetic Algorithm (Population-Based, High Quality)
```rust
use tenrso_planner::{GeneticAlgorithmPlanner, Planner, PlanHints};
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
let planner = GeneticAlgorithmPlanner::fast(); // or ::high_quality()
let plan = planner.make_plan("ij,jk,kl->il", &shapes, &hints)?;
// Evolutionary search, best for complex topologies
```
**Use when:** You have a very large network (> 20 tensors) and need the best quality within a time budget
### 6. Adaptive Planner (⭐ Recommended)
```rust
use tenrso_planner::{AdaptivePlanner, Planner, PlanHints};
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
let planner = AdaptivePlanner::default();
let plan = planner.make_plan("ij,jk,kl->il", &shapes, &hints)?;
// Automatically selects best algorithm based on problem size
```
**Use when:** You want optimal results without manual algorithm selection

---
## Parallel Ensemble Planning
Run multiple planners concurrently and automatically select the best result:
```rust
use tenrso_planner::{EnsemblePlanner, PlanHints};
let shapes = vec![vec![10, 20], vec![20, 30], vec![30, 40]];
let hints = PlanHints::default();
// Create ensemble with multiple planners
let ensemble = EnsemblePlanner::new(vec!["greedy", "beam_search", "dp"]);
// Run all planners in parallel
let plan = ensemble.plan("ij,jk,kl->il", &shapes, &hints)?;
// Automatically selects best plan by FLOPs
println!("Best plan: {:.2e} FLOPs", plan.estimated_flops);
```
### Configuring Selection Metric
```rust
// Select by memory usage
let ensemble = EnsemblePlanner::new(vec!["greedy", "dp"])
.with_metric("memory");
// Select by combined FLOPs + memory
let ensemble = EnsemblePlanner::new(vec!["greedy", "dp"])
.with_metric("combined");
```
### Performance
- **Speedup:** Near-linear in the number of planners, up to the available core count
- **Example:** 3 planners run ~2.8x faster than sequential execution on a 4-core system (see the sketch below)
- **Overhead:** ~1-2 ms per planner for thread spawning
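
To sanity-check the speedup on your own machine, a wall-clock comparison like the following works (a minimal sketch using only APIs shown elsewhere in this guide; absolute numbers depend on your hardware):

```rust
use std::time::Instant;
use tenrso_planner::{
    beam_search_planner, dp_planner, greedy_planner, EinsumSpec, EnsemblePlanner, PlanHints,
};

let shapes = vec![vec![100, 200], vec![200, 300], vec![300, 400]];
let hints = PlanHints::default();

// Sequential baseline: run the same three planners one after another.
let spec = EinsumSpec::parse("ij,jk,kl->il")?;
let t0 = Instant::now();
let _ = greedy_planner(&spec, &shapes, &hints)?;
let _ = beam_search_planner(&spec, &shapes, &hints, 5)?;
let _ = dp_planner(&spec, &shapes, &hints)?;
let sequential = t0.elapsed();

// Parallel ensemble over the same planners.
let t1 = Instant::now();
let _ = EnsemblePlanner::new(vec!["greedy", "beam_search", "dp"])
    .plan("ij,jk,kl->il", &shapes, &hints)?;
let parallel = t1.elapsed();

println!("speedup: {:.2}x", sequential.as_secs_f64() / parallel.as_secs_f64());
```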
### Best Configurations
```rust
// Fast (2 planners): greedy + beam_search
let fast = EnsemblePlanner::new(vec!["greedy", "beam_search"]);
// Balanced (3 planners): add DP
let balanced = EnsemblePlanner::new(vec!["greedy", "beam_search", "dp"]);
// Best Quality (5 planners): add stochastic algorithms
let best = EnsemblePlanner::new(vec![
"greedy", "beam_search", "dp",
"simulated_annealing", "genetic_algorithm"
]);
```
---
## ML-Based Cost Calibration
Learn from execution history to improve cost predictions:
```rust
use tenrso_planner::{MLCostModel, ExecutionHistory, ExecutionRecord};
use std::time::SystemTime;
// Step 1: Create execution history
let mut history = ExecutionHistory::with_max_size(100);
// Step 2: Record actual execution results
history.record(ExecutionRecord {
id: "matmul_1000x2000x3000".to_string(),
predicted_flops: 12_000_000_000.0, // What planner predicted
actual_flops: 12_500_000_000.0, // What actually happened
predicted_time_ms: 100.0,
actual_time_ms: 105.0,
predicted_memory: 24_000_000,
actual_memory: 25_000_000,
timestamp: SystemTime::now(),
planner: "greedy".to_string(),
});
// Step 3: Train ML cost model (requires ≥ 3 records)
let mut ml_model = MLCostModel::new();
ml_model.train(&history);
// Step 4: Use calibrated predictions
let predicted_flops = 10_000_000_000.0;
let calibrated_flops = ml_model.calibrate_flops(predicted_flops);
let calibrated_time = ml_model.calibrate_time(100.0, calibrated_flops);
println!("Original: {:.2e} FLOPs", predicted_flops);
println!("Calibrated: {:.2e} FLOPs", calibrated_flops);
println!("Model R²: {:.4}", ml_model.flops_r_squared());
```
### Per-Planner Calibration
Different planners may have different biases:
```rust
// Calibrate for specific planner
let cal_greedy = ml_model.calibrate_flops_for_planner(1e9, "greedy");
let cal_dp = ml_model.calibrate_flops_for_planner(1e9, "dp");
// Falls back to the general model if no planner-specific data is available
let cal_unknown = ml_model.calibrate_flops_for_planner(1e9, "unknown");
```
---
## Plan Caching
Cache plans with LRU/LFU/ARC eviction policies:
```rust
use tenrso_planner::PlanCache;
// Create cache with LRU eviction (default)
let mut cache = PlanCache::new_lru(100);
// Or use LFU (frequency-based)
let mut cache = PlanCache::new_lfu(100);
// Or use ARC (adaptive, balances recency and frequency)
let mut cache = PlanCache::new_arc(100);
// Cache plans
let key = "ij,jk->ik:100x200:200x300";
cache.put(key.to_string(), plan.clone());
// Retrieve cached plan
if let Some(cached_plan) = cache.get(key) {
println!("Cache hit! {:.2e} FLOPs", cached_plan.estimated_flops);
} else {
// Cache miss - compute plan
let plan = greedy_planner(&spec, &shapes, &hints)?;
cache.put(key.to_string(), plan.clone());
}
// Check cache statistics
let stats = cache.stats();
println!("Hit rate: {:.1}%", stats.hit_rate() * 100.0);
```
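
The string key above is assembled by hand; if plans are cached from several call sites, a small helper keeps the format consistent (`plan_cache_key` is a hypothetical helper, not part of the library's API):

```rust
/// Hypothetical helper: build a canonical cache key from the einsum spec
/// and operand shapes, e.g. "ij,jk->ik:100x200:200x300".
fn plan_cache_key(spec: &str, shapes: &[Vec<usize>]) -> String {
    let shape_part = shapes
        .iter()
        .map(|s| s.iter().map(|d| d.to_string()).collect::<Vec<_>>().join("x"))
        .collect::<Vec<_>>()
        .join(":");
    format!("{}:{}", spec, shape_part)
}

// plan_cache_key("ij,jk->ik", &[vec![100, 200], vec![200, 300]])
// => "ij,jk->ik:100x200:200x300"
```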
---
## Hardware Simulation
Simulate plan execution on different hardware:
```rust
use tenrso_planner::{HardwareModel, PlanSimulator};
// Create simulator with hardware model
let cpu_low = HardwareModel::cpu_low_end();
let cpu_high = HardwareModel::cpu_high_end();
let gpu_v100 = HardwareModel::nvidia_volta();
let gpu_a100 = HardwareModel::nvidia_ampere();
let simulator_cpu = PlanSimulator::new(cpu_low);
let simulator_gpu = PlanSimulator::new(gpu_a100);
// Simulate plan execution
let sim_cpu = simulator_cpu.simulate(&plan)?;
let sim_gpu = simulator_gpu.simulate(&plan)?;
println!("CPU: {:.2} ms, {:.2} GB/s",
sim_cpu.total_time_ms, sim_cpu.effective_bandwidth_gbps);
println!("GPU: {:.2} ms, {:.2} GB/s",
sim_gpu.total_time_ms, sim_gpu.effective_bandwidth_gbps);
// Compare hardware for best choice
if sim_gpu.total_time_ms < sim_cpu.total_time_ms {
println!("GPU is {:.2}x faster", sim_cpu.total_time_ms / sim_gpu.total_time_ms);
}
```
### Available Hardware Models
- **CPUs:** `cpu_low_end()`, `cpu_high_end()`
- **NVIDIA:** `nvidia_pascal()`, `nvidia_volta()`, `nvidia_turing()`, `nvidia_ampere()`, `nvidia_hopper()`
- **AMD:** `amd_cdna2()`
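
To pick hardware for a given plan, the constructors above can be swept in a single loop (a sketch; `plan` is any plan produced by the planners earlier in this guide):

```rust
use tenrso_planner::{HardwareModel, PlanSimulator};

// Pair each model with a label so the output is readable.
let models = vec![
    ("cpu_low_end", HardwareModel::cpu_low_end()),
    ("cpu_high_end", HardwareModel::cpu_high_end()),
    ("nvidia_volta", HardwareModel::nvidia_volta()),
    ("nvidia_ampere", HardwareModel::nvidia_ampere()),
    ("nvidia_hopper", HardwareModel::nvidia_hopper()),
];

for (name, model) in models {
    let sim = PlanSimulator::new(model).simulate(&plan)?;
    println!("{:>14}: {:.2} ms", name, sim.total_time_ms);
}
```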
---
## Quality Tracking
Track plan quality over time:
```rust
use tenrso_planner::{ExecutionHistory, PlanQualityMetrics};
let mut history = ExecutionHistory::with_max_size(1000);
// Record executions...
// (see ML-Based Cost Calibration section)
// Compute quality metrics
let metrics = history.compute_metrics();
println!("Executions: {}", metrics.num_executions);
println!("Avg FLOPs error: {:.1}%", metrics.avg_flops_error * 100.0);
println!("Accuracy (10%): {:.1}%", metrics.accuracy_10pct * 100.0);
// Per-planner metrics
for (planner, planner_metrics) in &metrics.per_planner {
println!("{}: {:.1}% error", planner, planner_metrics.avg_flops_error * 100.0);
}
// Find best planner
if let Some(best) = history.best_planner() {
println!("Best planner: {}", best);
}
```
---
## Production Workflow
Recommended workflow integrating all features:
```rust
use tenrso_planner::*;
use std::sync::{Arc, Mutex};
// 1. Initialize components
let cache = Arc::new(Mutex::new(PlanCache::new_arc(1000)));
let history = Arc::new(Mutex::new(ExecutionHistory::with_max_size(10000)));
let ml_model = Arc::new(Mutex::new(MLCostModel::new()));
// 2. Plan with caching
fn plan_with_cache(
spec: &str,
shapes: &[Vec<usize>],
cache: &Arc<Mutex<PlanCache>>,
) -> anyhow::Result<Plan> {
let key = format!("{}:{:?}", spec, shapes);
// Try cache first
let mut cache_lock = cache.lock().unwrap();
if let Some(plan) = cache_lock.get(&key) {
return Ok(plan.clone());
}
drop(cache_lock);
// Cache miss - use ensemble planner
let ensemble = EnsemblePlanner::new(vec!["greedy", "beam_search", "dp"]);
let plan = ensemble.plan(spec, shapes, &PlanHints::default())?;
// Cache for future
let mut cache_lock = cache.lock().unwrap();
cache_lock.put(key, plan.clone());
Ok(plan)
}
// 3. Execute and record results
fn execute_and_record(
plan: &Plan,
actual_flops: f64,
actual_time_ms: f64,
actual_memory: usize,
history: &Arc<Mutex<ExecutionHistory>>,
) {
let record = ExecutionRecord {
id: "execution_id".to_string(),
predicted_flops: plan.estimated_flops,
actual_flops,
predicted_time_ms: 100.0, // from simulation
actual_time_ms,
predicted_memory: plan.estimated_memory,
actual_memory,
timestamp: std::time::SystemTime::now(),
planner: "ensemble".to_string(),
};
let mut history_lock = history.lock().unwrap();
history_lock.record(record);
}
// 4. Periodically retrain ML model
fn retrain_ml_model(
history: &Arc<Mutex<ExecutionHistory>>,
ml_model: &Arc<Mutex<MLCostModel>>,
) {
let history_lock = history.lock().unwrap();
if history_lock.len() >= 10 {
let mut model_lock = ml_model.lock().unwrap();
model_lock.train(&history_lock);
println!("ML model retrained with {} samples", history_lock.len());
println!("FLOPs R²: {:.4}", model_lock.flops_r_squared());
}
}
// 5. Use in production
let spec = "ij,jk,kl->il";
let shapes = vec![vec![100, 200], vec![200, 300], vec![300, 400]];
let plan = plan_with_cache(spec, &shapes, &cache)?;
// ... execute plan ...
execute_and_record(&plan, 1.2e9, 105.0, 25_000_000, &history);
// Retrain periodically (e.g., every 100 executions)
retrain_ml_model(&history, &ml_model);
```
---
## Best Practices
### 1. Algorithm Selection
- **Interactive/Development:** Use `AdaptivePlanner` for automatic selection
- **Production:** Use `EnsemblePlanner` with 2-3 fast planners for the best quality within a time budget
- **Batch/Offline:** Use `dp_planner` or full ensemble for optimal results
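
These rules can be folded into a small dispatch function keyed on problem size (an illustrative sketch; the threshold is arbitrary and worth tuning for your workloads):

```rust
use tenrso_planner::{AdaptivePlanner, EnsemblePlanner, Plan, PlanHints, Planner};

// Illustrative dispatch mirroring the guidance above.
fn plan_for(spec: &str, shapes: &[Vec<usize>], hints: &PlanHints) -> anyhow::Result<Plan> {
    let plan = if shapes.len() <= 8 {
        // Small/medium problems: let the adaptive planner choose.
        AdaptivePlanner::default().make_plan(spec, shapes, hints)?
    } else {
        // Larger problems: fast ensemble within a tight time budget.
        EnsemblePlanner::new(vec!["greedy", "beam_search"]).plan(spec, shapes, hints)?
    };
    Ok(plan)
}
```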
### 2. Caching Strategy
- Use **ARC** for general workloads (adapts to access patterns)
- Use **LFU** for workloads with hot patterns (repeated tensors)
- Use **LRU** for sequential/streaming workloads
- Set cache size to ~1000-10000 entries depending on memory constraints
### 3. ML Calibration
- Collect **≥ 100 execution records** before heavy reliance on ML model
- Check **R² scores** (aim for > 0.9 for good models)
- Retrain **periodically** (e.g., every 100-1000 executions)
- Use **per-planner calibration** for multi-algorithm workflows
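
A simple guard can enforce these thresholds before calibrated values are trusted (a sketch built on the `MLCostModel` and `ExecutionHistory` APIs from the calibration section):

```rust
use tenrso_planner::{ExecutionHistory, MLCostModel};

// Pass raw estimates through until the model has enough data and
// explains the variance well enough to rely on.
fn maybe_calibrate(model: &MLCostModel, history: &ExecutionHistory, raw_flops: f64) -> f64 {
    if history.len() >= 100 && model.flops_r_squared() > 0.9 {
        model.calibrate_flops(raw_flops)
    } else {
        raw_flops
    }
}
```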
### 4. Parallel Planning
- Use **2-3 planners** for fast results (greedy + beam_search)
- Use **5-6 planners** for best quality (includes SA, GA)
- Consider **thread overhead** for small problems (< 5 tensors)
- Set appropriate **beam widths** (3-10) for beam search
### 5. Quality Monitoring
- Track **accuracy percentages** (aim for > 80% within 10% tolerance)
- Monitor **per-planner metrics** to identify systematic biases
- Use **execution history** to detect performance regressions
- Set up **alerts** for prediction errors > 50%
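
The alerting rule in the last bullet can be a one-function check on each `ExecutionRecord` (a sketch; fields as defined in the calibration section):

```rust
use tenrso_planner::ExecutionRecord;

// Flag any execution whose FLOPs prediction was off by more than 50%.
fn check_prediction_error(record: &ExecutionRecord) {
    let rel_error = (record.actual_flops - record.predicted_flops).abs()
        / record.actual_flops.max(1.0);
    if rel_error > 0.5 {
        eprintln!(
            "ALERT: {} predicted {:.2e} FLOPs, actual {:.2e} ({:.0}% error)",
            record.id, record.predicted_flops, record.actual_flops, rel_error * 100.0
        );
    }
}
```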
---
## Examples
See the `examples/` directory for detailed usage:
- `ml_calibration.rs` - ML-based cost calibration walkthrough
- `parallel_ensemble.rs` - Parallel planning demonstration
- `basic_matmul.rs` - Simple matrix multiplication
- `matrix_chain.rs` - Greedy vs DP comparison
- `planning_hints.rs` - Advanced hint usage
- `genetic_algorithm.rs` - GA planner showcase
- `comprehensive_comparison.rs` - All planners side-by-side
- `plan_visualization.rs` - Visualization and debugging
- `advanced_features.rs` - Caching, simulation, profiling

Run with: `cargo run --example <name>`

---
## Benchmarks
Compare planning algorithms:
```bash
# Run all benchmarks
cargo bench
# Run specific benchmark suite
cargo bench --bench planner_benchmarks
cargo bench --bench comprehensive_comparison
cargo bench --bench parallel_planners
```
---
## Further Reading
- **API Documentation:** `cargo doc --open`
- **Source Code:** `src/` directory with extensive inline documentation
- **TODO.md:** Roadmap and implementation details
- **CLAUDE.md:** Integration guide for maintainers
---
## Support
For questions, issues, or contributions:
- **GitHub:** https://github.com/cool-japan/tenrso
- **Issues:** https://github.com/cool-japan/tenrso/issues
---
**Last Updated:** 2025-12-09
**Version:** 0.1.0-alpha.2
**Status:** Production-Ready + ML + Parallel