# Performance Optimization Guide
**Comprehensive Guide to Optimizing TensorLogic Backend Performance**
This guide provides detailed strategies and techniques for maximizing the performance of your TensorLogic backend implementation.
## Table of Contents
- [Overview](#overview)
- [Measurement and Profiling](#measurement-and-profiling)
- [CPU Optimizations](#cpu-optimizations)
- [Memory Optimizations](#memory-optimizations)
- [Graph Optimizations](#graph-optimizations)
- [Batch Processing](#batch-processing)
- [Platform-Specific Optimizations](#platform-specific-optimizations)
- [Case Studies](#case-studies)
- [Performance Checklist](#performance-checklist)
## Overview
### Performance Targets
Reasonable performance targets for a TensorLogic backend (baseline vs. optimized):

| Operation | Baseline | Optimized |
|-----------|----------|-----------|
| Matrix multiply (1000x1000) | < 10ms | < 1ms |
| Element-wise (1M elements) | < 1ms | < 0.1ms |
| Reduction (1M elements) | < 2ms | < 0.2ms |
| Graph compilation | < 100ms | < 100ms |
### Optimization Priority
1. **Correctness First** - Verify correctness before optimizing
2. **Measure Everything** - Use profiling to identify bottlenecks
3. **Optimize Hot Paths** - Focus on frequently-called operations
4. **Test After Changes** - Ensure optimizations don't break functionality (see the regression-test sketch below)
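A cheap way to enforce points 1 and 4 is a regression test that pins each optimized kernel against a naive reference implementation. A minimal sketch, where `optimized_relu` is a hypothetical name for whichever kernel is under test:
```rust
// Reference implementation: trivially correct, used only in tests
fn relu_reference(data: &[f64]) -> Vec<f64> {
    data.iter().map(|x| x.max(0.0)).collect()
}

#[test]
fn optimized_relu_matches_reference() {
    let input: Vec<f64> = (-500..500).map(|i| i as f64 * 0.1).collect();
    let mut actual = input.clone();
    optimized_relu(&mut actual); // hypothetical kernel under test
    assert_eq!(actual, relu_reference(&input));
}
```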
## Measurement and Profiling
### Using Built-in Profiling
```rust
use tensorlogic_infer::{BottleneckAnalyzer, TlProfiledExecutor};

let mut executor = MyExecutor::new();
executor.enable_profiling();

// Execute operations
executor.execute(&graph, &inputs)?;

// Get profile data
let profile = executor.get_profile_data();
for (op_name, stats) in &profile.op_profiles {
    println!("{}: avg={:.2}ms, count={}",
        op_name, stats.avg_time_ms, stats.count);
}

// Identify bottlenecks
let analyzer = BottleneckAnalyzer::new();
let report = analyzer.analyze(&profile);
println!("{}", report);
```
### Using cargo-flamegraph
Install and run:
```bash
cargo install flamegraph
cargo flamegraph --bench my_benchmark
```
This generates an interactive flame graph showing where time is spent.
### Using perf (Linux)
```bash
# Record performance data
perf record --call-graph dwarf ./target/release/my_backend
# Analyze
perf report
```
### Custom Timing
```rust
use std::time::Instant;

fn timed_operation<F, R>(name: &str, f: F) -> R
where
    F: FnOnce() -> R,
{
    let start = Instant::now();
    let result = f();
    let elapsed = start.elapsed();
    println!("{}: {:.2}ms", name, elapsed.as_secs_f64() * 1000.0);
    result
}

// Usage
let result = timed_operation("matmul", || {
    executor.einsum("ik,kj->ij", &[a, b])
});
```
## CPU Optimizations
### 1. Use SIMD Instructions
**Example: Vectorized ReLU**
```rust
#[cfg(target_feature = "avx2")]
unsafe fn relu_simd(data: &mut [f64]) {
    use std::arch::x86_64::*;
    let zeros = _mm256_setzero_pd();
    let mut chunks = data.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let vals = _mm256_loadu_pd(chunk.as_ptr());
        let result = _mm256_max_pd(vals, zeros);
        _mm256_storeu_pd(chunk.as_mut_ptr(), result);
    }
    // Handle the tail that doesn't fill a full 256-bit lane
    for x in chunks.into_remainder() {
        *x = x.max(0.0);
    }
}

// Fallback for non-AVX2 targets
#[cfg(not(target_feature = "avx2"))]
fn relu_simd(data: &mut [f64]) {
    for x in data {
        *x = x.max(0.0);
    }
}
```
**Enable SIMD at compile time:**
```toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
```
Then enable the host CPU's instruction sets (this is what actually turns on AVX2) at build time:
```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
### 2. Parallelize Operations
**Using Rayon for Data Parallelism**
```rust
use rayon::prelude::*;

fn parallel_batch_execution(
    &self,
    graph: &EinsumGraph,
    batch_inputs: Vec<Inputs>,
) -> Result<Vec<Output>, Error> {
    batch_inputs
        .par_iter()
        .map(|inputs| {
            // Each worker clones the executor (requires Self: Clone + Sync)
            let mut executor = self.clone();
            executor.execute(graph, inputs)
        })
        .collect()
}
```
**Parallel Reduction**
```rust
use rayon::prelude::*;

fn parallel_sum(data: &[f64]) -> f64 {
    data.par_chunks(1000)
        .map(|chunk| chunk.iter().sum::<f64>())
        .sum()
}
```
### 3. Use Specialized BLAS Libraries
**Link against optimized BLAS**
```toml
[dependencies]
ndarray = { version = "0.15", features = ["blas"] }
blas-src = { version = "0.8", features = ["openblas"] }
```
**Or use Intel MKL:**
```toml
[dependencies]
ndarray = { version = "0.15", features = ["blas"] }
blas-src = { version = "0.8", features = ["intel-mkl"] }
```
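With either backend linked, nothing changes at the call site: `ndarray`'s `dot` dispatches large products to the linked BLAS. A minimal illustration:
```rust
use ndarray::Array2;

fn matmul(a: &Array2<f64>, b: &Array2<f64>) -> Array2<f64> {
    // With the `blas` feature enabled, this runs on OpenBLAS or MKL
    a.dot(b)
}
```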
### 4. Cache-Friendly Memory Access
**Bad: Column-major iteration**
```rust
// ❌ Poor cache locality (strided access in row-major storage)
for j in 0..n {
    for i in 0..m {
        process(matrix[[i, j]]);
    }
}
```
**Good: Row-major iteration**
```rust
// ✅ Good cache locality (sequential access in row-major storage)
for i in 0..m {
    for j in 0..n {
        process(matrix[[i, j]]);
    }
}
```
**Use contiguous memory layouts:**
```rust
use ndarray::{Array2, ShapeBuilder};

// Ensure C order (row-major) for cache efficiency
let array = Array2::<f64>::zeros((m, n).f()); // ❌ Fortran (column-major) order
let array = Array2::<f64>::zeros((m, n));     // ✅ C (row-major) order
```
## Memory Optimizations
### 1. Memory Pooling
**Implement a tensor pool:**
```rust
use std::collections::{HashMap, VecDeque};
use ndarray::ArrayD;

pub struct TensorPool {
    pool: HashMap<Vec<usize>, VecDeque<ArrayD<f64>>>,
    max_size: usize,
}

impl TensorPool {
    pub fn new(max_size: usize) -> Self {
        Self {
            pool: HashMap::new(),
            max_size,
        }
    }

    pub fn allocate(&mut self, shape: &[usize]) -> ArrayD<f64> {
        // Reuse a previously returned tensor of the same shape if available
        if let Some(queue) = self.pool.get_mut(shape) {
            if let Some(tensor) = queue.pop_front() {
                return tensor;
            }
        }
        // Otherwise allocate a new tensor
        ArrayD::zeros(shape)
    }

    pub fn deallocate(&mut self, shape: Vec<usize>, tensor: ArrayD<f64>) {
        let queue = self.pool.entry(shape).or_insert_with(VecDeque::new);
        if queue.len() < self.max_size {
            queue.push_back(tensor);
        }
        // Otherwise, let it drop
    }

    pub fn clear(&mut self) {
        self.pool.clear();
    }
}
```
**Usage:**
```rust
impl MyExecutor {
    fn elem_op(&mut self, op: ElemOp, x: &Tensor) -> Result<Tensor, Error> {
        let shape = x.shape().to_vec();
        let mut result = self.pool.allocate(&shape);
        // Perform the operation, writing into the pooled buffer
        // ...
        Ok(Tensor::new(result))
    }
}

impl Drop for MyExecutor {
    fn drop(&mut self) {
        self.pool.clear();
    }
}
```
### 2. In-Place Operations
**When possible, reuse buffers:**
```rust
// ❌ Allocates a new array
fn relu_allocating(x: &ArrayD<f64>) -> ArrayD<f64> {
    x.mapv(|v| v.max(0.0))
}

// ✅ Modifies in place
fn relu_inplace(x: &mut ArrayD<f64>) {
    x.mapv_inplace(|v| v.max(0.0));
}
```
### 3. Lazy Evaluation
**Delay computation until needed:**
```rust
use ndarray::{ArrayD, ArrayViewD};

pub enum TensorView<'a> {
    Owned(ArrayD<f64>),
    View(ArrayViewD<'a, f64>),
    Lazy(Box<dyn Fn() -> ArrayD<f64> + 'a>),
}

impl<'a> TensorView<'a> {
    fn materialize(&self) -> ArrayD<f64> {
        match self {
            TensorView::Owned(arr) => arr.clone(),
            TensorView::View(view) => view.to_owned(),
            TensorView::Lazy(f) => f(),
        }
    }
}
```
### 4. Reduce Cloning
**Use Cow (Copy on Write):**
```rust
use std::borrow::Cow;
use ndarray::ArrayD;

fn process<'a>(data: Cow<'a, ArrayD<f64>>, needs_modification: bool) -> Cow<'a, ArrayD<f64>> {
    if needs_modification {
        // Writing forces a clone only when actually needed
        let mut owned = data.into_owned();
        owned.mapv_inplace(|x| x * 2.0);
        Cow::Owned(owned)
    } else {
        data // Zero-copy: the borrow passes straight through
    }
}
```
## Graph Optimizations
### 1. Fusion Optimization
**Fuse consecutive operations:**
```rust
use tensorlogic_infer::{FusionPlanner, GraphOptimizer};

let optimizer = GraphOptimizer::new();
let result = optimizer.analyze(&graph);

// Apply the fusion opportunities that are worth the rewrite
for fusion in result.fusion_opportunities {
    if fusion.estimated_speedup > 1.5 {
        apply_fusion(&mut graph, &fusion);
    }
}
```
**Example: Fuse ReLU + Multiply**
```rust
use ndarray::{ArrayD, Zip};

// Before: two passes over the data
let x = relu(input);
let y = multiply(&x, weight);

// After: a single pass
let y = relu_multiply(input, weight);

fn relu_multiply(input: &ArrayD<f64>, weight: &ArrayD<f64>) -> ArrayD<f64> {
    // Clamp and scale each element in one traversal
    Zip::from(input).and(weight).map_collect(|&x, &w| x.max(0.0) * w)
}
```
### 2. Dead Code Elimination
```rust
// Remove unused computations
let optimized_graph = remove_dead_nodes(&graph);
```
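A minimal sketch of what `remove_dead_nodes` might do: walk backwards from the graph outputs and keep only reachable nodes. Here `outputs()`, `node_inputs()`, and `retain_nodes()` are illustrative names, not the real `EinsumGraph` API.
```rust
use std::collections::HashSet;

// A reachability sweep: keep only nodes that feed an output
fn remove_dead_nodes(graph: &EinsumGraph) -> EinsumGraph {
    let mut live: HashSet<NodeId> = HashSet::new();
    let mut stack: Vec<NodeId> = graph.outputs().to_vec();
    while let Some(id) = stack.pop() {
        // First visit: mark live and walk its inputs
        if live.insert(id) {
            stack.extend(graph.node_inputs(id));
        }
    }
    // Drop everything else; it can never affect a result
    let mut pruned = graph.clone();
    pruned.retain_nodes(|id| live.contains(&id));
    pruned
}
```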
### 3. Common Subexpression Elimination (CSE)
```rust
use std::collections::HashMap;

// Reuse previously computed values, keyed by expression identity
fn compute_with_cse(expr: &Expr, cache: &mut HashMap<ExprId, Tensor>) -> Tensor {
    if let Some(cached) = cache.get(&expr.id) {
        return cached.clone();
    }
    let result = compute(expr); // fall back to normal evaluation
    cache.insert(expr.id, result.clone());
    result
}
```
### 4. Graph Compilation
**Pre-compile graphs for reuse:**
```rust
pub struct CompiledGraph {
    optimized_ops: Vec<OptimizedOp>,
    memory_plan: MemoryPlan,
}

impl MyExecutor {
    pub fn compile(&mut self, graph: &EinsumGraph) -> CompiledGraph {
        let optimizer = GraphOptimizer::new();
        let result = optimizer.analyze(graph);
        CompiledGraph {
            optimized_ops: result.fused_operations,
            memory_plan: plan_memory_allocation(graph),
        }
    }

    pub fn execute_compiled(
        &mut self,
        compiled: &CompiledGraph,
        inputs: &Inputs,
    ) -> Result<Output, Error> {
        // bind_inputs / take_output stand in for the backend's I/O plumbing
        self.bind_inputs(inputs)?;
        // Much faster: analysis and planning were paid once at compile time
        for op in &compiled.optimized_ops {
            self.execute_optimized_op(op, &compiled.memory_plan)?;
        }
        self.take_output()
    }
}
```
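The payoff is amortization: compile once, then reuse the cached plan for every execution. A usage sketch under the same assumed API:
```rust
let compiled = executor.compile(&graph);
for inputs in &batches {
    // All optimization and memory-planning cost was paid once, above
    let output = executor.execute_compiled(&compiled, inputs)?;
    process(output);
}
```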
## Batch Processing
### 1. Optimal Batch Sizing
**Determine optimal batch size:**
```rust
fn find_optimal_batch_size(&self, graph: &EinsumGraph) -> usize {
    // estimate_memory_usage / get_available_memory are backend-specific helpers
    let memory_per_sample = estimate_memory_usage(graph);
    let available_memory = get_available_memory();
    let num_threads = num_cpus::get();

    // Balance memory pressure against parallelism
    let max_by_memory = available_memory / memory_per_sample;
    let max_by_threads = num_threads * 2; // 2x for pipelining
    max_by_memory.min(max_by_threads).max(1)
}
```
### 2. Batched Operations
**Combine small operations:**
```rust
// ❌ Many small operations
for i in 0..1000 {
    execute_single(input[i]);
}

// ✅ A single batched operation
execute_batch(&inputs[0..1000]);
```
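As a concrete instance, many independent matrix-vector products against the same weights can be stacked into one matrix-matrix multiply, so a single BLAS call amortizes all per-op overhead. A sketch (the shapes and `w` are assumptions):
```rust
use ndarray::{Array1, Array2, Axis};

// ❌ One GEMV call per input
fn matvec_loop(w: &Array2<f64>, inputs: &[Array1<f64>]) -> Vec<Array1<f64>> {
    inputs.iter().map(|x| w.dot(x)).collect()
}

// ✅ Stack inputs into one matrix and issue a single GEMM
fn matvec_batched(w: &Array2<f64>, inputs: &[Array1<f64>]) -> Array2<f64> {
    let n = inputs[0].len();
    let mut stacked = Array2::<f64>::zeros((inputs.len(), n));
    for (mut row, x) in stacked.axis_iter_mut(Axis(0)).zip(inputs) {
        row.assign(x);
    }
    stacked.dot(&w.t()) // each row of the result is one output vector
}
```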
### 3. Pipeline Parallelism
**Overlap computation and data loading:**
```rust
use crossbeam::channel;
use std::thread;

fn pipeline_execution(batches: Vec<Batch>) -> Vec<Output> {
    let (send, recv) = channel::bounded(2); // small buffer = 2-stage pipeline

    // Producer thread: preprocesses ahead of the consumer
    thread::spawn(move || {
        for batch in batches {
            let preprocessed = preprocess(batch);
            send.send(preprocessed).unwrap();
        }
        // Dropping `send` closes the channel and ends the consumer loop
    });

    // Consumer (main thread): execute batches as they arrive
    recv.iter().map(execute).collect()
}
```
## Platform-Specific Optimizations
### CPU Optimizations
**x86-64 with AVX-512:**
```rust
#[cfg(target_feature = "avx512f")]
fn optimized_matmul(a: &Array2<f64>, b: &Array2<f64>) -> Array2<f64> {
    unsafe {
        use std::arch::x86_64::*;
        // ... AVX-512 implementation (elided)
    }
}
```
**ARM with NEON:**
```rust
#[cfg(target_arch = "aarch64")]
fn optimized_reduce_sum(data: &[f64]) -> f64 {
    unsafe {
        use std::arch::aarch64::*;
        // ... NEON implementation (elided)
    }
}
```
### GPU Optimizations (CUDA/ROCm)
**Using cudarc:**
```rust
use std::collections::HashMap;
use std::sync::Arc;
use cudarc::driver::*;

pub struct GpuExecutor {
    device: Arc<CudaDevice>,
    kernels: HashMap<String, CudaFunction>,
}

impl GpuExecutor {
    pub fn new() -> Result<Self> {
        let device = CudaDevice::new(0)?;
        // Compile and load kernels once at startup
        let ptx = compile_kernel_ptx()?;
        device.load_ptx(ptx, "my_kernels", &["matmul", "relu"])?;
        let kernels = load_kernels(&device)?;
        Ok(Self { device, kernels })
    }

    fn execute_on_gpu(&self, op: &Op, inputs: &[&Tensor]) -> Result<Tensor> {
        // Transfer inputs host -> device
        let gpu_inputs = inputs.iter()
            .map(|t| self.device.htod_copy(t.data().to_vec()))
            .collect::<Result<Vec<_>>>()?;

        // Device-side output buffer (`output_len` is a sketch helper)
        let n = op.output_len();
        let mut gpu_output = self.device.alloc_zeros::<f64>(n)?;

        // Launch the kernel; launch() consumes the handle, so clone it
        // (params shown schematically)
        let kernel = self.kernels[op.name()].clone();
        let threads = 256u32;
        let blocks = (n as u32 + threads - 1) / threads;
        let cfg = LaunchConfig {
            grid_dim: (blocks, 1, 1),
            block_dim: (threads, 1, 1),
            shared_mem_bytes: 0,
        };
        unsafe { kernel.launch(cfg, (&gpu_inputs, &mut gpu_output))? };

        // Transfer the result device -> host
        let result = self.device.dtoh_sync_copy(&gpu_output)?;
        Ok(Tensor::new(result))
    }
}
```
### WebAssembly Optimizations
**Use wasm-simd:**
```rust
#[cfg(target_arch = "wasm32")]
#[target_feature(enable = "simd128")]
unsafe fn wasm_simd_add(a: &[f32], b: &[f32], result: &mut [f32]) {
    use std::arch::wasm32::*;
    // Assumes equal lengths that are a multiple of 4
    debug_assert!(a.len() == b.len() && a.len() == result.len());
    for i in (0..a.len()).step_by(4) {
        let va = v128_load(a.as_ptr().add(i) as *const v128);
        let vb = v128_load(b.as_ptr().add(i) as *const v128);
        let vr = f32x4_add(va, vb);
        v128_store(result.as_mut_ptr().add(i) as *mut v128, vr);
    }
}
```
## Case Studies
### Case Study 1: Matrix Multiplication Optimization
**Problem**: Naive matrix multiplication is too slow
**Solution**: Multi-level optimization
```rust
use ndarray::{s, Array2};

// Level 1: Use BLAS
fn matmul_blas(a: &Array2<f64>, b: &Array2<f64>) -> Array2<f64> {
    a.dot(b) // dispatches to the linked BLAS
}

// Level 2: Blocked algorithm for cache efficiency
fn matmul_blocked(a: &Array2<f64>, b: &Array2<f64>, block_size: usize) -> Array2<f64> {
    let (m, k) = a.dim();
    let (_, n) = b.dim();
    let mut c = Array2::zeros((m, n));
    for i in (0..m).step_by(block_size) {
        for j in (0..n).step_by(block_size) {
            for l in (0..k).step_by(block_size) {
                let i_end = (i + block_size).min(m);
                let j_end = (j + block_size).min(n);
                let l_end = (l + block_size).min(k);
                let a_block = a.slice(s![i..i_end, l..l_end]);
                let b_block = b.slice(s![l..l_end, j..j_end]);
                let mut c_block = c.slice_mut(s![i..i_end, j..j_end]);
                c_block += &a_block.dot(&b_block);
            }
        }
    }
    c
}

// Level 3: Parallel blocked algorithm
fn matmul_parallel_blocked(a: &Array2<f64>, b: &Array2<f64>) -> Array2<f64> {
    use rayon::prelude::*;
    let (m, n) = (a.nrows(), b.ncols());
    let block_size = 64;
    let blocks: Vec<_> = (0..m).step_by(block_size)
        .flat_map(|i| (0..n).step_by(block_size).map(move |j| (i, j)))
        .collect();
    // compute_block / assemble_blocks are sketch helpers
    let results: Vec<_> = blocks.par_iter()
        .map(|&(i, j)| compute_block(a, b, i, j, block_size))
        .collect();
    assemble_blocks(results, m, n)
}
```
**Results:**
- Naive: 1000ms
- BLAS: 50ms (20x faster)
- Blocked: 30ms (33x faster)
- Parallel blocked: 10ms (100x faster)
### Case Study 2: Reduce Memory Allocations
**Problem**: Excessive memory allocations during batch processing
**Solution**: Memory pooling
**Before:**
```rust
// A new tensor is allocated for every intermediate result
for input in batch {
    let temp1 = relu(input);      // allocation
    let temp2 = matmul(temp1, w); // allocation
    let output = softmax(temp2);  // allocation
}
```
**After:**
```rust
// Reuse pooled buffers across iterations
let mut pool = TensorPool::new(16); // cap each per-shape free list at 16
for input in batch {
    let mut temp1 = pool.allocate(input.shape());
    relu_into(&input, &mut temp1);
    let mut temp2 = pool.allocate(&output_shape);
    matmul_into(&temp1, &w, &mut temp2);
    let output = softmax_inplace(&mut temp2);
    pool.deallocate(temp1.shape().to_vec(), temp1);
    // temp2 is moved into `output`, so it is not returned to the pool
}
```
**Results:**
- Before: 500ms, 2GB peak memory
- After: 200ms, 500MB peak memory (2.5x faster, 4x less memory)
## Performance Checklist
Use this checklist when optimizing your backend:
### Measurement
- [ ] Profile with built-in profiling tools
- [ ] Generate flame graphs
- [ ] Identify bottleneck operations (80/20 rule)
- [ ] Measure memory usage
- [ ] Set performance targets
### CPU Optimizations
- [ ] Enable SIMD instructions (`target-cpu=native`)
- [ ] Use optimized BLAS library (OpenBLAS/MKL)
- [ ] Parallelize independent operations
- [ ] Optimize memory access patterns (cache locality)
- [ ] Reduce function call overhead (inlining)
### Memory Optimizations
- [ ] Implement memory pooling
- [ ] Use in-place operations where possible
- [ ] Reduce unnecessary cloning
- [ ] Implement lazy evaluation
- [ ] Use appropriate data structures (contiguous arrays)
### Graph Optimizations
- [ ] Implement operation fusion
- [ ] Eliminate dead code
- [ ] Apply common subexpression elimination
- [ ] Compile and cache graphs
- [ ] Optimize execution order
### Batch Processing
- [ ] Determine optimal batch size
- [ ] Implement batched operations
- [ ] Use pipeline parallelism
- [ ] Minimize data transfers
### Platform-Specific
- [ ] Use platform SIMD instructions
- [ ] Leverage GPU when available
- [ ] Optimize for target architecture
- [ ] Test on production hardware
### Validation
- [ ] Verify correctness after optimization
- [ ] Compare against baseline
- [ ] Test edge cases
- [ ] Measure improvement quantitatively
- [ ] Document optimization decisions
## Tools and Resources
### Profiling Tools
- **cargo-flamegraph**: Visual performance profiling
- **perf** (Linux): Low-level CPU profiling
- **Instruments** (macOS): System-wide profiling
- **heaptrack**: Memory profiling
### Benchmarking
- **criterion**: Statistical benchmarking (see the example below)
- **iai**: Cachegrind-based benchmarking
- **dhat**: Dynamic heap analysis
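A minimal criterion benchmark for the 1000x1000 matmul target might look like this (assumes a bench target such as `benches/matmul.rs` with `criterion` as a dev-dependency):
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use ndarray::Array2;

fn bench_matmul(c: &mut Criterion) {
    let a = Array2::<f64>::ones((1000, 1000));
    let b = Array2::<f64>::ones((1000, 1000));
    c.bench_function("matmul 1000x1000", |bencher| {
        // black_box prevents the optimizer from hoisting the work out
        bencher.iter(|| black_box(&a).dot(black_box(&b)))
    });
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);
```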
### Libraries
- **rayon**: Data parallelism
- **ndarray**: N-dimensional arrays
- **blas-src**: BLAS bindings
- **cudarc**: CUDA support
### Reading
- [The Rust Performance Book](https://nnethercote.github.io/perf-book/)
- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)
## Conclusion
Performance optimization is an iterative process:
1. Measure to identify bottlenecks
2. Optimize the hot paths
3. Validate correctness
4. Measure improvement
5. Repeat
Focus on the 20% of code that takes 80% of the time. Small improvements in critical paths can lead to dramatic overall performance gains.
---
**Version**: 1.0
**Last Updated**: 2025-12-16
**Part of**: [TensorLogic Ecosystem](https://github.com/cool-japan/tensorlogic)