trueno 0.16.4

High-performance SIMD compute library with GPU support for matrix operations
# Trueno Performance Tuning Guide

**Last Updated**: 2025-11-21
**Applies To**: Trueno v0.4.1+

## Executive Summary

Trueno provides **exceptional performance** for compute-intensive operations (dot product, sum, max) with **200-400% speedups** over scalar implementations. However, simple element-wise operations (add, mul) show modest improvements (3-10%) due to **memory bandwidth limitations**.

**Key Insight**: SIMD excels when computation dominates; it can't overcome memory bottlenecks.

## When to Use Trueno

### **Excellent Use Cases** (Compute-Intensive)

Operations where SIMD computation provides massive benefits:

| Operation | Speedup | Why It's Fast |
|-----------|---------|---------------|
| **Dot Product** | 235-440% | Multiple ops per element, single result |
| **Sum Reduction** | 211-315% | Many additions, minimal memory writes |
| **Max Reduction** | 288-448% | SIMD comparison, no branching |
| **Matrix Multiply** | 179-279% vs NumPy | Cache-aware blocking + SIMD micro-kernels |
| **Convolutions** | TBD | Compute-intensive, data reuse |

**Pattern**: Operations with high **compute-to-memory ratio** benefit most.

### ⚠️ **Limited Benefit** (Memory-Bound)

Operations where memory bandwidth dominates:

| Operation | Speedup | Limitation |
|-----------|---------|------------|
| **Element-wise Add** | 3-10% | Memory transfer >> computation |
| **Element-wise Mul** | -3% to 6% | Same limitation as add |
| **Copy/Transform** | Minimal | Pure memory operations |

**Pattern**: Simple operations on large arrays are **memory-bandwidth limited**.

## Understanding the Performance Profile

### Compute-Intensive Operations (SIMD Wins)

```rust
use trueno::Vector;

// Dot product: 340% faster with SSE2!
let a = Vector::from_slice(&[1.0; 1000]);
let b = Vector::from_slice(&[2.0; 1000]);
let result = a.dot(&b).unwrap();  // Fast: 4 multiply-adds per cycle

// Why it's fast:
// - SSE2 processes 4 elements in parallel
// - Multiply + accumulate in single operation
// - Only writes single f32 result (minimal memory)
// - Computation time >> memory transfer time
```

**Performance Characteristics**:
- **Speedup increases** with vector size (more parallelism)
- **Sustained throughput**: 7.6 Gelem/s (SSE2) vs 1.7 Gelem/s (scalar)
- **Why**: Computation dominates, memory is just for reads

### Memory-Bound Operations (Limited SIMD Benefit)

```rust
use trueno::Vector;

// Element-wise add: Only 3-5% faster with SSE2
let a = Vector::from_slice(&[1.0; 10000]);
let b = Vector::from_slice(&[2.0; 10000]);
let result = a.add(&b).unwrap();  // Modest improvement

// Why it's slow:
// - Must read 20K floats from RAM
// - Must write 10K floats to RAM
// - Addition is trivial compared to memory transfer
// - Memory bandwidth >> SIMD computation speed
```

**Performance Characteristics**:
- **Speedup plateaus** at large sizes (memory-limited)
- **Throughput**: 8.2 Gelem/s (SSE2) vs 8.0 Gelem/s (scalar) - barely different!
- **Why**: Waiting for RAM, not waiting for ALU

## Benchmark Results Summary

Tested on x86_64 with SSE2 vs Scalar:

### Exceptional Performance (>200% speedup)

```
Operation    | Size | Scalar  | SSE2    | Speedup
-------------|------|---------|---------|--------
dot product  | 100  | 36.1 ns | 10.8 ns | 235%
dot product  | 1K   | 575 ns  | 131 ns  | 340%
dot product  | 10K  | 6.1 µs  | 1.5 µs  | 315%
sum          | 1K   | 575 ns  | 139 ns  | 315%
max          | 1K   | 395 ns  | 88 ns   | 348%
```

**Takeaway**: Reductions are **3-4x faster** with SIMD.

### Modest Performance (<10% speedup)

```
Operation | Size | Scalar  | SSE2    | Speedup
----------|------|---------|---------|--------
add       | 100  | 46.9 ns | 42.5 ns | 10%
add       | 1K   | 125 ns  | 122 ns  | 3%
add       | 10K  | 1.1 µs  | 1.0 µs  | 5%
mul       | 10K  | 1.0 µs  | 1.1 µs  | -3% (regression!)
```

**Takeaway**: Element-wise ops are **memory-bound**, not compute-bound.

### Matrix Multiplication Performance (v0.4.1+)

```rust
use trueno::Matrix;

// 128×128 matmul: 2.79× faster than NumPy!
let a = Matrix::from_vec(128, 128, vec![1.0; 128*128]).unwrap();
let b = Matrix::from_vec(128, 128, vec![2.0; 128*128]).unwrap();
let result = a.matmul(&b).unwrap();  // 166 μs vs NumPy's 463 μs

// Why it's fast:
// - Cache-aware blocking (L2 blocks: 64×64)
// - SIMD micro-kernels for inner products
// - Smart thresholding (small matrices use simple path)
// - O(n³) compute amortizes memory overhead
```

**Performance by Matrix Size**:

| Size | Trueno | NumPy | Speedup | Notes |
|------|--------|-------|---------|-------|
| 16×16 | 1.9 μs | 2.2 μs | 0.86× slower | Blocking overhead for tiny matrices |
| 32×32 | 15.7 μs | 2.2 μs | 7.1× slower | Simple path, needs tuning |
| 64×64 | 19.6 μs | 8.3 μs | 2.4× slower | First blocking size |
| **128×128** | **166 μs** | **463 μs** | **2.79× FASTER** | ✅ Sweet spot |
| 256×256 | 1371 μs | 574 μs | 2.4× slower | Phase 2: needs 3-level blocking |

**Key Findings**:
- **Sweet spot**: 128×128 matrices (2.79× faster than NumPy)
- **Cache-aware blocking**: Dramatic improvement from L2 cache optimization
- **Scaling**: Trueno scales correctly (8× for 2× size increase), NumPy's BLAS achieves superlinear efficiency at larger sizes
- **Phase 2 needed**: For 256×256+ matrices, requires 3-level blocking (L1/L2/L3)

**Implementation Details**:
- **L2 Block Size**: 64×64 (fits 256KB for 3 matrices)
- **Small Matrix Threshold**: ≤32 elements use simple path (no blocking overhead)
- **Algorithm**: 3-level nested loops (ii/jj/kk) over blocks with SIMD dot products
- **Memory Access**: Transpose B matrix for better cache locality

**Tuning Tips**:
1. Use 128×128 as target size for benchmarks (optimal cache utilization)
2. For matrices <64, expect modest performance (blocking overhead)
3. For matrices >256, consider Phase 2 optimizations or optional BLAS backend
4. Profile with `cargo bench --bench matrix_ops` to measure your workload

## Performance Tuning Tips

### 1. Choose the Right Operation

**Instead of this** (memory-bound):
```rust
// Compute mean: element-wise div + sum (slow path)
let sum = v.sum()?;
let mean = v.iter().map(|x| x / n).sum();  // Memory-bound division
```

**Do this** (compute-intensive):
```rust
// Compute mean: sum then scalar divide (fast path)
let sum = v.sum()?;  // Fast: 315% speedup!
let mean = sum / (v.len() as f32);  // Single scalar operation
```

**Savings**: ~3x faster by avoiding element-wise operation.

### 2. Fuse Operations

**Instead of this** (multiple passes):
```rust
let v1 = a.add(&b)?;    // Pass 1: read a, b; write v1
let v2 = v1.mul(&c)?;   // Pass 2: read v1, c; write v2
let result = v2.sum()?; // Pass 3: read v2
```

**Do this** (single pass):
```rust
// Fused operation (future feature)
let result = a.iter()
    .zip(&b.data)
    .zip(&c.data)
    .map(|((a, b), c)| (a + b) * c)
    .sum();  // Single pass, better cache utilization
```

**Savings**: Reduces memory traffic by 67%.

### 3. Use Backend Auto-Selection

```rust
use trueno::Vector;

// Auto-selects best backend for your CPU
let v = Vector::from_slice(&data);  // Uses SSE2/AVX2/AVX-512/NEON

// Only override for benchmarking/testing
use trueno::Backend;
let v = Vector::from_slice_with_backend(&data, Backend::Scalar);
```

**Why**: Runtime detection ensures optimal backend for the hardware.

### 4. Batch Small Operations

**Instead of this** (overhead-bound):
```rust
for i in 0..1000 {
    let v = Vector::from_slice(&small_data[i]);  // 100 elements each
    let result = v.sum()?;
    results.push(result);
}
```

**Do this** (amortize overhead):
```rust
// Concatenate into larger vector
let large = Vector::from_slice(&concatenated);  // 100K elements
let result = large.sum()?;
// Then split results
```

**Savings**: Reduces function call overhead and improves cache utilization.

### 5. Understand Alignment

```rust
use trueno::{Vector, Backend};

// Vectors are automatically 16-byte aligned (SSE2-friendly)
let v = Vector::from_slice(&data);

// Explicit alignment API (for documentation/future)
let v = Vector::with_alignment(1000, Backend::SSE2, 16)?;

// Check alignment
let ptr = v.as_slice().as_ptr() as usize;
assert!(ptr.is_multiple_of(16));  // Usually true!
```

**Note**: Rust's allocator provides 16-byte alignment by default on most systems.

## Common Performance Pitfalls

### ❌ Pitfall #1: Using Trueno for Simple Element-Wise Ops

```rust
// ❌ BAD: No benefit for simple operations
let result = vector_a.add(&vector_b)?;  // Only 3-5% faster than scalar

// ✅ GOOD: Use for compute-intensive operations
let dot = vector_a.dot(&vector_b)?;  // 340% faster than scalar!
```

### ❌ Pitfall #2: Small Vectors

```rust
// ❌ BAD: Overhead dominates for tiny vectors
let v = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);  // 4 elements
let sum = v.sum()?;  // SIMD overhead > benefit

// ✅ GOOD: Use scalar for small data
let sum: f32 = data.iter().sum();  // Simpler and possibly faster
```

**Recommendation**: Use Trueno for vectors with **≥100 elements**.

### ❌ Pitfall #3: Excessive Backend Switching

```rust
// ❌ BAD: Creating vectors in tight loop
for _ in 0..1000 {
    let v = Vector::from_slice(&data);  // Allocates each iteration
    process(v);
}

// ✅ GOOD: Reuse vectors
let mut v = Vector::from_slice(&data);
for _ in 0..1000 {
    process(&v);  // Borrow, don't allocate
}
```

## Performance Comparison Table

| Operation Type | Vector Size | Scalar | SSE2 | AVX2 (est) | When to Use |
|----------------|-------------|--------|------|------------|-------------|
| Dot product | 1K | 575 ns | 131 ns | ~80 ns | Always |
| Sum | 1K | 575 ns | 139 ns | ~85 ns | Always |
| Max | 1K | 395 ns | 88 ns | ~50 ns | Always |
| Add | 1K | 125 ns | 122 ns | ~120 ns | Rarely (memory-bound) |
| Mul | 1K | 119 ns | 113 ns | ~110 ns | Rarely (memory-bound) |

**Legend**:
- **Always**: Use Trueno, significant speedup
- **Rarely**: Limited benefit, use only if already using Trueno for other ops

## Future Optimizations

### Planned Improvements

1. **AVX2 Backend** (Phase 3)
   - 256-bit SIMD (8× f32 elements)
   - Expected 2x over SSE2 for reductions
   - Limited benefit for element-wise ops (still memory-bound)

2. **Fused Operations** (Phase 6)
   - `fused_multiply_add(a, b, c)` → a×b + c
   - Single memory pass instead of multiple
   - Reduces memory bandwidth pressure

3. **GPU Backend** (Phase 4)
   - For very large vectors (>100K elements)
   - Massive parallelism for compute-intensive ops
   - Requires amortizing PCIe transfer overhead

### Won't Fix

1. **Custom Memory Allocators**
   - Standard Vec already provides good alignment (16 bytes)
   - Complex implementations for marginal gains
   - Decision: Accept current performance

2. **Element-Wise Operation Optimizations**
   - Fundamental memory bandwidth limitation
   - AVX2/AVX-512 won't significantly help
   - Decision: Document limitations, focus on compute-intensive ops

## Measurement Methodology

All benchmarks use **Criterion.rs** with:
- 100 samples per benchmark
- 3-second warmup
- 5-second measurement
- Statistical outlier detection

**Hardware**: x86_64 with SSE2 support
**Compiler**: rustc 1.83, -C opt-level=3, LTO=true
**Date**: 2025-11-16

## Reproducing Results

```bash
# Run all benchmarks
cargo bench

# Run specific operation
cargo bench -- dot

# Generate HTML report
cargo bench
open target/criterion/report/index.html
```

## Further Reading

- [Benchmarks Document]BENCHMARKS.md - Detailed benchmark analysis
- [Backend Selection]../README.md#backend-selection - How backends are chosen
- [PROGRESS.md]../PROGRESS.md - Implementation details and learnings

## Summary

**Key Takeaways**:
1. ✅ Use Trueno for **compute-intensive operations** (dot, sum, max): 200-400% speedup
2. ⚠️ Limited benefit for **memory-bound operations** (add, mul): 3-10% speedup
3. 📊 **Understand your workload**: Compute vs memory ratio determines SIMD benefit
4. 🎯 **Sweet spot**: Reductions and operations with data reuse

**Design Philosophy**: Trueno is optimized for **computation**, not **memory movement**.