# Matrix Multiplication: From Cache Locality to SIMD - A Complete Journey
## Executive Summary
This document chronicles a complete optimization journey from naive implementation to high-performance SIMD matrix multiplication, achieving:
- 10× speedup from loop reordering (cache optimization)
- 126× total speedup with SIMD, blocking, and multi-threading
- 59 GFLOPS on consumer hardware
- 62% of highly-optimized NumPy performance
All optimizations are explained with performance data, hardware metrics, and theoretical analysis.
---
## Part 1: The Cache Locality Discovery
### The Original Experiment
**Simple loop reordering (i-j-k to i-k-j) achieved 9.5× speedup** with zero additional FLOPs or memory. This demonstrated that memory access patterns often matter more than algorithmic complexity on modern CPUs.
#### Implementation Details
- Language: Rust (with --release optimizations)
- Matrix sizes: 256×256, 512×512, 1024×1024
- Hardware: Modern x86-64 CPU (~4 GHz)
- Measurement: Hardware performance counters (perf)
#### Two Implementations
**Version 1: Naive (i-j-k loop ordering)**
```rust
for i in 0..m {
for j in 0..n {
for p in 0..k {
c[i*n + j] += a[i*k + p] * b[p*n + j];
}
}
}
```
**Version 2: Cache-aware (i-k-j loop ordering)**
```rust
for i in 0..m {
for p in 0..k {
for j in 0..n {
c[i*n + j] += a[i*k + p] * b[p*n + j];
}
}
}
```
**Only difference**: Order of the k and j loops
### Benchmark Results - Phase 1
| 256×256 | 33.90 ms | 4.26 ms | 7.96× | 0.99 | 7.88 |
| 512×512 | 301.86 ms | 33.05 ms | 9.13× | 0.89 | 8.12 |
| 1024×1024 | 2812.44 ms | 294.83 ms | 9.54× | 0.76 | 7.28 |
#### Key Observations
1. **Speedup increases with matrix size**
- Larger matrices create more cache pressure
- Loop ordering matters more at scale
2. **i-j-k performance degrades with size**
- Cache thrashing gets exponentially worse
- Performance drops from 0.99 to 0.76 GFLOPS
3. **i-k-j stays consistent**
- Good cache locality scales
- Stable ~7-8 GFLOPS across sizes
---
## Part 2: Understanding the Memory Hierarchy
### CPU Cache Structure
```
L1 | 32 KB | ~4 cyc | ~1 TB/s
L2 | 256 KB | ~12 cyc | ~500 GB/s
L3 | 6 MB | ~40 cyc | ~200 GB/s
RAM | 8-16 GB | ~200 cyc| ~40 GB/s
```
**Key insight**: A RAM access is 50× slower than L1 cache hit.
### Memory Access Patterns Explained
**i-j-k ordering (BAD):**
```
Inner loop (k): accesses B[k,j] where k changes
Result: B[0,j], B[1,j], B[2,j], ...
Stride: n elements (8KB for n=1024)
Cache line usage: 1/8 (load 64 bytes, use 8)
Miss rate: ~45%
```
**i-k-j ordering (GOOD):**
```
Inner loop (j): accesses B[k,j] where j changes
Result: B[k,0], B[k,1], B[k,2], ...
Stride: 1 element (sequential)
Cache line usage: 8/8 (load 64 bytes, use all)
Miss rate: ~12%
```
### Performance Counter Analysis
| Time (1024×1024) | 2812 ms | 295 ms | 9.54× faster |
| GFLOPS | 0.76 | 7.28 | 9.6× higher |
| L1 miss rate | ~45% | ~12% | 3.8× fewer misses |
| IPC | ~0.5 | ~2.7 | 5.4× more efficient |
---
## Part 3: Cache Blocking - The Next Level
After achieving 7-8 GFLOPS with loop reordering, the next bottleneck was working set size exceeding cache capacity.
### The Problem: Working Set Too Large
For 1024×1024 matrices:
```
Total working set: 3 × 1024² × 8 bytes = 24 MB
L3 cache: 6 MB
Result: Constant cache eviction
```
### Solution: Multi-Level Cache Blocking
#### KC Blocking (L1 Cache)
**Goal**: Keep working set in L1 cache (32 KB)
```rust
let KC = 256; // Tuned for L1
for kk in (0..k).step_by(KC) {
let k_block = min(KC, k - kk);
// Process only KC elements of k dimension
// Working set: (M+N) × KC ≈ 16KB
}
```
#### MC Blocking (L2 Cache)
**Goal**: Reuse data from L2 cache (256 KB)
```rust
let MC = 120; // Tuned for L2
for kk in (0..k).step_by(KC) {
for ii in (0..m).step_by(MC) {
// Pack MC × KC chunk of A → stays in L2
pack_a_panel(...);
for j in (0..n).step_by(4) {
// Reuse A from L2 for all columns!
}
}
}
```
### Data Packing Optimization
**Problem**: Original row-major layout causes strided access
**Solution**: Reorganize into column-major groups for sequential memory access.
---
## Part 4: SIMD Vectorization
After cache blocking achieved ~12 GFLOPS, the next bottleneck was scalar operations.
### The SIMD Advantage
**Scalar code**: 1 operation per instruction
**AVX2**: 4 operations per instruction (256-bit)
**AVX-512**: 8 operations per instruction (512-bit)
### Kernel Implementations
#### 4×4 AVX2 Kernel
```rust
// Process 4 rows × 4 columns per call
let mut c0 = _mm256_loadu_pd(c_ptr); // Load 4 doubles
let a0 = _mm256_broadcast_sd(a_val); // Broadcast scalar
c0 = _mm256_fmadd_pd(a0, b_vec, c0); // 4 FMAs at once
```
#### 12×4 AVX2 Kernel
- Process 12 rows × 4 columns per call
- 12 C accumulators for better instruction pipelining
#### 8×8 AVX-512 Kernel
```rust
// Process 8 rows × 8 columns per call
let mut c0 = _mm512_loadu_pd(c_ptr); // Load 8 doubles
let a0 = _mm512_set1_pd(a_val); // Broadcast scalar
c0 = _mm512_fmadd_pd(a0, b_vec, c0); // 8 FMAs at once
```
### FMA (Fused Multiply-Add)
**Critical discovery**: Must explicitly enable FMA target feature
```rust
#[target_feature(enable = "avx2,fma")]
```
This single change gave **2× improvement** - from separate multiply + add to fused instruction.
---
## Part 5: Multi-Threading
After achieving 37 GFLOPS single-threaded, the final optimization was parallelization.
### Adaptive Threading Strategy
Thread count scales with problem size to avoid overhead on small matrices:
```rust
fn choose_thread_count(m: usize, n: usize, k: usize, max_threads: usize) -> usize {
let flops = 2.0 * (m * n * k) as f64;
if flops < 100_000_000.0 { // < 100M FLOPs
1 // Threading overhead not worth it
} else if flops < 300_000_000.0 { // < 300M FLOPs
2
} else {
max_threads
}
}
```
### Row-Based Parallelization
Each thread processes a subset of rows independently:
- No synchronization needed (disjoint output regions)
- Each thread calls the blocked GEMM on its row range
- B matrix is shared read-only across threads
---
## Part 6: Complete Performance Results
### Intel i7-1185G7 @ 4.8GHz (11th Gen, Tiger Lake, WSL2)
| Naive (i-j-k) | 0.73 GFLOPS | 0.55 GFLOPS | 0.59 GFLOPS | 1× |
| Scalar (i-k-j) | 6.93 GFLOPS | 5.02 GFLOPS | 5.12 GFLOPS | 9× |
| 4×4 AVX2 | 18.54 GFLOPS | 14.66 GFLOPS | 19.27 GFLOPS | 28× |
| 4×4 AVX2 MT | 14.44 GFLOPS | 22.20 GFLOPS | 42.63 GFLOPS | 44× |
| 12×4 AVX2 | 21.27 GFLOPS | 17.73 GFLOPS | 21.89 GFLOPS | 33× |
| 12×4 AVX2 MT | 19.17 GFLOPS | 25.16 GFLOPS | 46.46 GFLOPS | 50× |
| 8×8 AVX-512 | 15.04 GFLOPS | 24.17 GFLOPS | 31.74 GFLOPS | 39× |
| **8×8 AVX-512 MT** | **21.74 GFLOPS** | **34.18 GFLOPS** | **59.11 GFLOPS** | **64×** |
**Peak Performance**: 59.11 GFLOPS (100× speedup over naive)
**Threading Efficiency (8×8 AVX-512, 1024×1024):**
- Single-threaded: 31.74 GFLOPS
- Multi-threaded (4 cores): 59.11 GFLOPS
- Scaling: 1.86× (47% efficiency) - Good for memory-bound workloads
### Intel i5-8257U @ 1.4GHz (8th Gen, Coffee Lake, macOS)
| Naive (i-j-k) | 0.79 GFLOPS | 0.77 GFLOPS | 0.47 GFLOPS | 1× |
| Scalar (i-k-j) | 6.95 GFLOPS | 6.77 GFLOPS | 5.06 GFLOPS | 9× |
| 4×4 AVX2 | 19.43 GFLOPS | 19.22 GFLOPS | 16.43 GFLOPS | 28× |
| 4×4 AVX2 MT | 18.71 GFLOPS | 29.95 GFLOPS | 52.48 GFLOPS | 58× |
| 12×4 AVX2 | 21.33 GFLOPS | 20.77 GFLOPS | 22.37 GFLOPS | 34× |
| **12×4 AVX2 MT** | **20.45 GFLOPS** | **32.80 GFLOPS** | **58.82 GFLOPS** | **65×** |
**Peak Performance**: 58.82 GFLOPS (126× speedup over naive!)
**Threading Efficiency (12×4 AVX2, 1024×1024):**
- Single-threaded: 22.37 GFLOPS
- Multi-threaded (4 cores): 58.82 GFLOPS
- Scaling: 2.63× (66% efficiency) - Excellent!
### Cross-Platform Comparison
| macOS | i5-8257U | 1.4 GHz | 22.37 GFLOPS | **58.82 GFLOPS** | **126×** |
| WSL2 | i7-1185G7 | 4.8 GHz | 31.74 GFLOPS | **59.11 GFLOPS** | **100×** |
**Key Insight**: The Mac at 1.4 GHz achieves nearly identical performance to WSL2 at 4.8 GHz!
**Performance per GHz:**
- macOS: 58.82 / 1.4 = **42.0 GFLOPS/GHz**
- WSL2: 59.11 / 4.8 = **12.3 GFLOPS/GHz**
3.4× efficiency difference due to:
- Superior thermal management (sustained turbo)
- Native OS vs virtualization overhead
- Better memory subsystem utilization
### Comparison with NumPy (OpenBLAS)
| 256×256 | ~84 GFLOPS | ~32 GFLOPS | 38% |
| 512×512 | ~79 GFLOPS | ~49 GFLOPS | **62%** |
| 1024×1024 | ~112 GFLOPS | ~55 GFLOPS | **49%** |
Achieving **49-62% of production BLAS performance** built entirely from scratch.
---
## Part 7: Optimization Journey Summary
| 1 | Naive i-j-k | 0.59 | 1× | - |
| 2 | Loop reorder (i-k-j) | 5.12 | 9× | Strided B access |
| 3 | Cache blocking | 12.09 | 20× | Cache thrashing |
| 4 | 4×4 AVX2 | 19.27 | 33× | Scalar operations |
| 5 | 12×4 AVX2 | 21.89 | 37× | Small kernel |
| 6 | 8×8 AVX-512 | 31.74 | 54× | AVX2 width |
| 7 | **+ Multi-threading** | **59.11** | **100×** | **Single core** |
---
## Part 8: Key Lessons Learned
### 1. Optimization Priority
```
Cache optimization: 0.59 → 12.09 GFLOPS (20× gain)
SIMD vectorization: 12.09 → 31.74 GFLOPS (2.6× gain)
Multi-threading: 31.74 → 59.11 GFLOPS (1.9× gain)
Memory > Compute > Parallelism
```
### 2. Threading Efficiency is Memory-Bound
Multi-threading achieves 47-66% efficiency on 4 cores:
- Theoretical max: 4× (perfect scaling)
- Achieved: 1.86-2.63×
- **Expected for memory-bound workloads**
### 3. Platform Matters More Than Clock Speed
The 1.4 GHz Mac matched the 4.8 GHz WSL2 system because:
- Sustained thermal performance
- Native OS efficiency
- Memory subsystem optimization
### 4. Measure Everything
Used throughout:
- `perf stat` for cache miss rates
- Manual timing for GFLOPS
- Hardware counters for IPC
- NumPy comparison baseline
---
## References
1. Goto & van de Geijn (2008) - "Anatomy of High-Performance Matrix Multiplication"
2. Van Zee & van de Geijn (2015) - "BLIS: A Framework for Rapidly Instantiating BLAS Functionality"
3. Intel Intrinsics Guide
4. Agner Fog's Optimization Manuals
---
**Date**: February 2026
**Author**: Arun
**Repository**: [github.com/ArunTamil21/matmul-simd](https://github.com/ArunTamil21/matmul-simd)