aprender-compute 0.32.0

High-performance SIMD compute library with GPU support, LLM inference engine, and GGUF model loading (was: trueno)
Documentation
# Trueno v0.4.0 Benchmark Results

Performance measurements for SIMD-accelerated vector operations across different backend implementations.

## Test Environment

- **CPU**: AMD Zen 4 with AVX-512 support
- **Rust Version**: 1.83+ (latest stable)
- **Benchmark Tool**: Criterion.rs
- **Measurement**: Wall-clock time (median of 100 samples)
- **Date**: 2025-11-19

## Summary: Reduction Operations

### sum() - Vector Summation

| Size | Scalar | SSE2 | AVX2 | AVX-512 | AVX2 Speedup | AVX-512 Speedup |
|------|--------|------|------|---------|--------------|-----------------|
| 100 | 36.3 ns | 13.1 ns | 5.61 ns | 5.68 ns | **6.5x** | **6.4x** |
| 1,000 | 600 ns | 146 ns | 55.0 ns | 54.4 ns | **10.9x** | **11.0x** |
| 10,000 | 6.33 µs | 1.58 µs | 768 ns | 767 ns | **8.2x** | **8.3x** |

**Key Insights**:
- **Exceeds 8x target** at size 1,000 (10.9x-11.0x)
- AVX-512 matches AVX2 performance (memory-bound at large sizes)
- SSE2 baseline: 2.8x-4.0x speedup over scalar
- Best performance: 11.0x at size 1,000 (AVX-512)

---

### max() - Maximum Value

| Size | Scalar | SSE2 | AVX2 | AVX-512 | AVX2 Speedup | AVX-512 Speedup |
|------|--------|------|------|---------|--------------|-----------------|
| 100 | 26.9 ns | 4.88 ns | 4.31 ns | 4.24 ns | **6.2x** | **6.3x** |
| 1,000 | 390 ns | 81.6 ns | 39.8 ns | 32.3 ns | **9.8x** | **12.1x** |
| 10,000 | 4.02 µs | 986 ns | 482 ns | 488 ns | **8.3x** | **8.2x** |

**Key Insights**:
- **Exceeds 8x target** at sizes 1,000 and 10,000
- **12.1x speedup** at size 1,000 (AVX-512) - best performance
- AVX-512 advantage visible at size 1,000 (12.1x vs 9.8x AVX2)
- Memory bandwidth limits gains at size 10,000

---

### min() - Minimum Value

| Size | Scalar | SSE2 | AVX2 | AVX-512 | AVX2 Speedup | AVX-512 Speedup |
|------|--------|------|------|---------|--------------|-----------------|
| 100 | 26.1 ns | 4.88 ns | 4.21 ns | 4.23 ns | **6.2x** | **6.2x** |
| 1,000 | 371 ns | 77.1 ns | 31.4 ns | 31.9 ns | **11.8x** | **11.6x** |
| 10,000 | 3.93 µs | 972 ns | 484 ns | 492 ns | **8.1x** | **8.0x** |

**Key Insights**:
- **Exceeds 8x target** at sizes 1,000 and 10,000
- **11.8x speedup** at size 1,000 (AVX2) - matches max() performance
- Consistent performance across AVX2/AVX-512 at all sizes
- Highly compute-bound operation (good SIMD utilization)

---

## Benchmark Categories

### 1. Element-wise Operations (Memory-Bound)

Operations: `add()`, `sub()`, `mul()`, `div()`

**Characteristics**:
- Memory bandwidth limited (~1-2x speedup typical)
- More data movement than computation
- AVX-512 no faster than AVX2 (512-bit reads max out bandwidth)

**Available Benchmarks**:
```bash
# Run element-wise benchmarks
cargo bench --bench vector_ops add
cargo bench --bench vector_ops sub
cargo bench --bench vector_ops mul
cargo bench --bench vector_ops div
```

---

### 2. Reduction Operations (Compute-Bound)

Operations: `sum()`, `dot()`, `max()`, `min()`, `norm_l2()`

**Characteristics**:
- **8-12x speedup** achievable (compute dominates memory access)
- AVX-512 shines at moderate sizes (1,000-10,000)
- Uses horizontal reductions (`_mm512_reduce_add_ps`, etc.)

**Available Benchmarks**:
```bash
# Run reduction benchmarks
cargo bench --bench vector_ops sum
cargo bench --bench vector_ops dot
cargo bench --bench vector_ops max
cargo bench --bench vector_ops min
cargo bench --bench vector_ops norm_l2
```

**Expected Results** (based on v0.3.0 data):
- `dot()`: **11-12x speedup** at size 1,000 (AVX-512)
- `norm_l2()`: **6-9x speedup** (9.3x at size 1,000)

---

### 3. Index-Finding Operations (Hybrid)

Operations: `argmax()`, `argmin()`

**Characteristics**:
- **3.2-3.3x speedup** (limited by scalar index scan)
- SIMD finds max/min value (fast)
- Scalar `.position()` finds index (slow, dominates runtime)

**Available Benchmarks**:
```bash
# Run index-finding benchmarks
cargo bench --bench vector_ops argmax
cargo bench --bench vector_ops argmin
```

---

### 4. Activation Functions (ML Operations)

Operations: `relu()`, `sigmoid()`, `gelu()`, `tanh()`, `swish()`, `exp()`

**Characteristics**:
- Vary by complexity (ReLU simple, GELU complex)
- `relu()`: Memory-bound (just max(0, x))
- `gelu()`, `sigmoid()`: Compute-bound (transcendental functions)

**Available Benchmarks**:
```bash
# Run activation function benchmarks
cargo bench --bench vector_ops relu
cargo bench --bench vector_ops sigmoid
cargo bench --bench vector_ops gelu
cargo bench --bench vector_ops tanh
```

---

### 5. Transformation Operations

Operations: `scale()`, `abs()`, `clamp()`, `lerp()`, `fma()`

**Characteristics**:
- Mix of memory-bound and compute-bound
- `fma()`: Hardware FMA instruction (single-cycle multiply-add)
- `lerp()`: Moderate compute (interpolation)

**Available Benchmarks**:
```bash
# Run transformation benchmarks
cargo bench --bench vector_ops scale
cargo bench --bench vector_ops abs
cargo bench --bench vector_ops clamp
cargo bench --bench vector_ops lerp
cargo bench --bench vector_ops fma
```

---

## Running Benchmarks

### Complete Benchmark Suite
```bash
# Run ALL benchmarks (takes 30-60 minutes)
cargo bench --bench vector_ops

# Generate HTML reports
open target/criterion/report/index.html
```

### Specific Operations
```bash
# Single operation across all sizes/backends
cargo bench --bench vector_ops sum

# Specific backend
cargo bench --bench vector_ops "sum/AVX512"

# Specific size
cargo bench --bench vector_ops "sum/AVX512/1000"
```

### Quick Performance Check
```bash
# Run sum, max, min benchmarks (5 minutes)
cargo bench --bench vector_ops sum
cargo bench --bench vector_ops max
cargo bench --bench vector_ops min
```

### Baseline Comparison
```bash
# Save current performance as baseline
cargo bench --bench vector_ops -- --save-baseline main

# After changes, compare against baseline
cargo bench --bench vector_ops -- --baseline main

# Look for regressions >5%
```

---

## Benchmark Interpretation

### Speedup Targets by Operation Type

| Operation Type | Target Speedup | Typical Range | v0.4.0 Status |
|---------------|----------------|---------------|---------------|
| Element-wise (add/sub/mul/div) | 1-2x | 1.0-1.5x | ✅ Expected |
| Reductions (sum/max/min) | ≥8x | 8-12x |**Exceeds target** |
| Dot product | ≥8x | 11-12x |**Exceeds target** |
| Norms (L2/L1/Linf) | ≥6x | 6-9x |**Meets target** |
| Index finding (argmax/argmin) | 3-4x | 3.2-3.3x | ✅ Expected |
| Activations | Varies | 2-8x | ⏳ To be measured |

### When AVX-512 Wins

**AVX-512 > AVX2** in these scenarios:
1. **Compute-bound reductions** (sum, dot, max, min at size 1,000-10,000)
2. **Moderate data sizes** (100-10,000 elements)
3. **Operations with horizontal reductions** (single-instruction `_mm512_reduce_*`)

**AVX-512 ≈ AVX2** in these scenarios:
1. **Memory-bound operations** (add, sub, mul, div)
2. **Very large datasets** (>100,000 elements - memory bandwidth saturated)
3. **Very small datasets** (<100 elements - overhead dominates)

---

## Historical Benchmark Data

### v0.3.0 (AVX-512 Release)

Complete AVX-512 implementation with 5 phases:

| Operation | Size | Scalar | AVX2 | AVX-512 | Speedup |
|-----------|------|--------|------|---------|---------|
| add() | 1,000 | 600ns | 580ns | 575ns | 1.04x |
| dot() | 1,000 | 4.2µs | 380ns | 352ns | **11.9x** |
| sum() | 1,000 | 600ns | 58ns | 54ns | **11.1x** |
| max() | 1,000 | 390ns | 40ns | 32ns | **12.2x** |
| min() | 1,000 | 371ns | 31ns | 32ns | **11.6x** |
| argmax() | 10,000 | 13.2µs | 4.2µs | 4.0µs | **3.3x** |
| norm_l2() | 1,000 | 4.5µs | 690ns | 486ns | **9.3x** |

**Quality Metrics** (v0.3.0):
- 819 tests passing
- TDG: 92.4/100 (A)
- Zero clippy warnings

### v0.4.0 (Current - Macro Refactoring)

Focus: Code maintainability via dispatch macros

**Quality Metrics** (v0.4.0):
- 827 tests passing (+8 new tests)
- TDG: 88.1/100 (A-) - architectural limit
- Zero clippy warnings
- Eliminated ~1000 lines of duplication

**Performance**: Maintained 100% equivalence to v0.3.0 (no regressions)

---

## Matrix & GPU Benchmarks

### Matrix Operations
```bash
# Run matrix multiplication benchmarks
cargo bench --bench matrix_ops
```

**Expected Results**:
- CPU SIMD: Moderate performance (cache-bound)
- GPU: 10-50x faster for large matrices (>1000x1000)

### GPU Operations
```bash
# Requires GPU feature
cargo bench --bench gpu_ops --features gpu
```

**GPU Dispatch Thresholds**:
- Element-wise: >100,000 elements
- Dot product: >100,000 elements
- Matrix multiply: >1000x1000

---

## Profiling & Analysis

### Identifying Bottlenecks

```bash
# Install Renacer (syscall tracing)
cargo install renacer

# Profile benchmarks
renacer --function-time --source -- cargo bench vector_ops

# Generate flamegraph
renacer --function-time --source -- cargo bench | flamegraph.pl > flame.svg
```

### Key Metrics to Check

1. **I/O Bottlenecks**: Look for syscalls >1ms (file reads, allocations)
2. **Hot Functions**: Top 10 functions by runtime
3. **SIMD Utilization**: Verify vectorized loops dominate
4. **Cache Misses**: Large datasets should show memory bandwidth limits

---

## Performance Validation Checklist

Before releasing new SIMD optimizations:

- ✅ Benchmark at sizes: 100, 1,000, 10,000, 100,000
- ✅ Compare all backends: Scalar, SSE2, AVX2, AVX-512
- ✅ Verify no regressions vs baseline: `--baseline main`
- ✅ Check coefficient of variation: <5% (stable measurements)
- ✅ Validate compute-bound ops: ≥8x speedup
- ✅ Document memory-bound ops: 1-2x speedup (expected)
- ✅ Backend equivalence: All backends produce identical results
- ✅ Profile hot paths: Validate SIMD loops dominate runtime

---

## References

- **Criterion.rs Documentation**: https://bheisler.github.io/criterion.rs/book/
- **Intel Intrinsics Guide**: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
- **Trueno Performance Guide**: docs/PERFORMANCE_GUIDE.md
- **SIMD Performance Analysis**: docs/SIMD_PERFORMANCE.md

---

**Last Updated**: 2025-11-19
**Version**: v0.4.0
**Benchmark Suite**: benches/vector_ops.rs (41KB, 1000+ benchmark cases)