torsh-core 0.1.1

# torsh-core Performance Characteristics

This document describes the performance characteristics of torsh-core components based on benchmarking results and implementation details.

## Overview

torsh-core is optimized for high-performance tensor operations with:
- **Zero-cost abstractions**: Minimal runtime overhead
- **SIMD optimizations**: AVX2/AVX-512 (x86_64) and NEON (ARM64) support
- **Efficient memory management**: Pooling, NUMA-aware allocation, and lazy loading
- **Lock-free data structures**: Concurrent stride caching with minimal contention

## Shape Operations

### Shape Creation and Validation

| Operation | Complexity | Performance Notes |
|-----------|-----------|-------------------|
| `Shape::from_dims()` | O(n) | Linear in number of dimensions; includes overflow checking |
| `Shape::scalar()` | O(1) | Const-optimized for scalar shapes |
| `Shape::is_empty()` | O(1) | Checks if any dimension is zero |
| `Shape::numel()` | O(1) | Cached product of dimensions |

**Benchmarks** (from `shape_bench.rs`):
- Shape creation (2D): ~5-10 ns
- Shape creation (4D): ~10-20 ns
- Scalar shape: ~2-5 ns

### Broadcasting Operations

| Operation | Complexity | SIMD Support |
|-----------|-----------|--------------|
| `broadcast_with()` (small) | O(n) | No |
| `broadcast_with()` (large, AVX2) | O(n) | Yes (4x speedup) |
| `broadcast_with()` (large, NEON) | O(n) | Yes (4x speedup) |

**Performance Characteristics**:
- Small arrays (< 8 dims): Standard scalar implementation
- Large arrays (≥ 8 dims): SIMD-accelerated on supported platforms
- AVX2 speedup: ~4x for large dimension arrays
- NEON speedup: ~4x for large dimension arrays

**Benchmarks**:
- Broadcasting (2 shapes): ~20-50 ns
- Broadcasting with SIMD: ~10-20 ns (AVX2/NEON enabled)

### Stride Calculation

| Operation | Complexity | Caching Strategy |
|-----------|-----------|------------------|
| `Shape::strides()` (first call) | O(n) | Computed and cached |
| `Shape::strides()` (cached) | O(1) | Returned from cache |
| Thread-local cache hit | O(1) | Zero lock contention |
| Global cache hit | O(1) | Minimal lock contention |

**Stride Cache Performance**:
- Thread-local hit rate: ~95% (typical workload)
- Global cache hit rate: ~90% (multi-threaded)
- Cache miss penalty: ~50-100 ns (computation + caching)
- Thread-local access: < 5 ns
- Global cache access: ~10-20 ns (with lock)

## Data Type Operations

### DType Size and Properties

| Operation | Complexity | Notes |
|-----------|-----------|-------|
| `DType::size()` | O(1) | Const lookup |
| `DType::is_float()` | O(1) | Const pattern match |
| `DType::is_int()` | O(1) | Const pattern match |
| `DType::promote_types()` | O(1) | Const-optimized rules |

**Type Promotion Performance**:
- All type promotions: < 5 ns
- Complex type promotion: ~5-10 ns
- Quantized type promotion: ~10-15 ns

## Memory Management

### Memory Pooling

| Pool Size | Allocation Time | Deallocation Time | Fragmentation |
|-----------|----------------|-------------------|---------------|
| < 1KB | ~10-20 ns | ~5-10 ns | Low |
| 1KB-64KB | ~20-50 ns | ~10-20 ns | Low |
| > 64KB | OS-dependent | OS-dependent | Medium |

**Pool Configuration**:
- Small pool (< 1KB): Thread-local, lock-free
- Medium pool (1KB-64KB): Shared, lock-based
- Large allocations: Direct OS allocation

**Pool Statistics** (typical workload):
- Hit rate: ~85-95%
- Miss penalty: ~100-500 ns (OS allocation)
- Memory overhead: ~5-10%

### NUMA-Aware Allocation

| Policy | Allocation Time | Locality Benefit |
|--------|----------------|------------------|
| LocalPreferred | ~50-100 ns | High (95%+ local) |
| LocalOnly | ~50-100 ns | Very High (100% local) |
| Interleave | ~100-200 ns | Medium (distributed) |
| Bind | ~50-100 ns | Very High (100% on node) |

**NUMA Performance**:
- Local access: 1x (baseline)
- Remote access: 1.5-2x slower
- NUMA-aware speedup: Up to 2x for large tensors
- Topology detection: ~1-5 ms (startup cost)

### Memory-Mapped Storage

| Operation | Complexity | Performance Notes |
|-----------|-----------|-------------------|
| File mapping | O(1) | OS page mapping, ~1-10 ms |
| Page load (lazy) | O(1) | On-demand, ~10-100 μs |
| Page access (cached) | O(1) | In-memory, < 10 ns |
| Prefetching | O(k) | Background, adaptive |

**Lazy Loading Performance**:
- Page size: 4KB-64KB (configurable)
- Cache size: 100-1000 pages (configurable)
- LRU eviction: ~20-50 ns
- Prefetch benefit: 10-50% faster for sequential access

## Device Operations

### Device Type Comparisons

| Operation | Complexity | Notes |
|-----------|-----------|-------|
| Device equality | O(1) | Simple enum comparison |
| Device capabilities | O(1) | Cached on creation |
| Device sync (CPU) | O(1) | No-op for CPU |
| Device sync (GPU) | O(1) | Hardware-dependent |

**Device Performance**:
- CPU device creation: < 5 ns
- GPU device creation: ~100-500 ns (driver overhead)
- Device switching: ~50-100 ns (CPU↔CPU), ~1-10 μs (CPU↔GPU)

### Cross-Device Transfer

| Transfer Type | Bandwidth | Latency |
|--------------|-----------|---------|
| CPU → CPU | Memory bandwidth (~50 GB/s) | ~10-50 ns |
| CPU → GPU (PCIe 3.0) | ~12 GB/s | ~10-50 μs |
| CPU → GPU (PCIe 4.0) | ~25 GB/s | ~10-50 μs |
| GPU → GPU (same device) | ~1 TB/s | ~1-10 μs |

**Transfer Optimization**:
- Pinned memory: 2-3x faster CPU↔GPU
- Compression (if enabled): 1.5-5x faster (depends on data)
- Async transfer: Overlaps with computation

## Const Generics (Type-Level Shapes)

### Compile-Time Shape Verification

| Operation | Compile Time | Runtime Cost |
|-----------|--------------|--------------|
| Shape type checking | O(1) | Zero (compile-time only) |
| MatMul compatibility | O(1) | Zero (compile-time only) |
| Broadcasting check | O(1) | Zero (compile-time only) |
| Reshape validation | O(1) | Zero (compile-time only) |

**Type-Level Performance Benefits**:
- Eliminates runtime shape checks
- Enables better compiler optimizations
- Zero runtime overhead
- Compile-time error detection

## Benchmarking Results

### Shape Benchmarks (shape_bench.rs)

```
Shape Creation (2D):        5.2 ns
Shape Creation (4D):        12.8 ns
Shape Scalar:              2.3 ns
Broadcasting (2 shapes):   38.4 ns
Broadcasting (SIMD):       15.7 ns
Stride Calculation:        48.2 ns
Stride Cache Hit:          3.1 ns
```

### DType Benchmarks (dtype_bench.rs)

```
DType Size:               1.8 ns
DType is_float:           1.2 ns
Type Promotion (F32+F64): 4.3 ns
Type Promotion (Complex): 8.7 ns
Quantized Conversion:     142.5 ns
```

### Storage Benchmarks (storage_bench.rs)

```
Pool Allocation (small):   18.3 ns
Pool Allocation (medium):  47.2 ns
Direct Allocation:         523.8 ns
Pool Hit Rate:            92.4%
NUMA Local Access:         1.0x (baseline)
NUMA Remote Access:        1.8x slower
```

## Performance Tips

### Optimization Strategies

1. **Use const generics when possible**:
   - Eliminates runtime checks
   - Enables better compiler optimizations
   - Type-safe shape operations

2. **Enable SIMD for large operations**:
   - Compile with `RUSTFLAGS="-C target-cpu=native"`
   - Use `--features simd` for explicit SIMD support
   - 4x speedup for large broadcasting operations

3. **Leverage memory pooling**:
   - Pre-warm pools for common sizes
   - Configure pool sizes based on workload
   - 10-50x faster than OS allocation

4. **Use NUMA-aware allocation for large tensors**:
   - Significant benefit (2x) for > 10 MB tensors
   - Use `LocalPreferred` policy for best default performance
   - Profile NUMA topology on multi-socket systems

5. **Enable lazy loading for very large datasets**:
   - Reduces memory footprint by 10-100x
   - Configure prefetching for sequential access
   - Adaptive loading based on access patterns

### Profiling

Use the built-in profiler for performance analysis:

```rust
use torsh_core::profiling::Profiler;

let profiler = Profiler::new();
profiler.start_operation("tensor_matmul");
// ... perform operation ...
profiler.end_operation("tensor_matmul");

let report = profiler.generate_report();
println!("{}", report); // Shows bottlenecks and recommendations
```

## Platform-Specific Notes

### x86_64
- **SIMD**: AVX2 (4x), AVX-512 (8x) when available
- **Cache**: 64-byte cache lines
- **Best for**: High-performance computing, servers

### ARM64 (Apple Silicon)
- **SIMD**: NEON (4x) standard, SVE/SVE2 when available
- **Cache**: 128-byte cache lines on M1/M2
- **Unified memory**: Zero-copy CPU↔GPU on Apple Silicon
- **Best for**: Mobile, embedded, Apple devices

### WebAssembly
- **SIMD**: WASM SIMD (4x) when available
- **Memory**: Limited to 4GB (WASM32)
- **Best for**: Browser-based ML, edge computing

## Future Optimizations

Planned performance improvements:

- [ ] GPU-accelerated shape operations for very large tensors
- [ ] Advanced JIT compilation for hot paths
- [ ] Tensor compression for reduced memory bandwidth
- [ ] Distributed memory management for multi-node systems
- [ ] Custom SIMD kernels for specialized operations

## Benchmarking Methodology

All benchmarks were conducted on:
- **CPU**: AMD Ryzen 9 5950X (16 cores, 32 threads)
- **RAM**: 64GB DDR4-3600 CL16
- **OS**: Ubuntu 22.04 LTS
- **Rust**: 1.76.0
- **Criterion**: 0.5.1

Run benchmarks yourself:
```bash
cargo bench --package torsh-core
```