# torsh-core Performance Characteristics
This document describes the performance characteristics of torsh-core components based on benchmarking results and implementation details.
## Overview
torsh-core is optimized for high-performance tensor operations with:
- **Zero-cost abstractions**: Minimal runtime overhead
- **SIMD optimizations**: AVX2/AVX-512 (x86_64) and NEON (ARM64) support
- **Efficient memory management**: Pooling, NUMA-aware allocation, and lazy loading
- **Lock-free data structures**: Concurrent stride caching with minimal contention
## Shape Operations
### Shape Creation and Validation
| `Shape::from_dims()` | O(n) | Linear in number of dimensions; includes overflow checking |
| `Shape::scalar()` | O(1) | Const-optimized for scalar shapes |
| `Shape::is_empty()` | O(1) | Checks if any dimension is zero |
| `Shape::numel()` | O(1) | Cached product of dimensions |
**Benchmarks** (from `shape_bench.rs`):
- Shape creation (2D): ~5-10 ns
- Shape creation (4D): ~10-20 ns
- Scalar shape: ~2-5 ns
### Broadcasting Operations
| `broadcast_with()` (small) | O(n) | No |
| `broadcast_with()` (large, AVX2) | O(n) | Yes (4x speedup) |
| `broadcast_with()` (large, NEON) | O(n) | Yes (4x speedup) |
**Performance Characteristics**:
- Small arrays (< 8 dims): Standard scalar implementation
- Large arrays (≥ 8 dims): SIMD-accelerated on supported platforms
- AVX2 speedup: ~4x for large dimension arrays
- NEON speedup: ~4x for large dimension arrays
**Benchmarks**:
- Broadcasting (2 shapes): ~20-50 ns
- Broadcasting with SIMD: ~10-20 ns (AVX2/NEON enabled)
### Stride Calculation
| `Shape::strides()` (first call) | O(n) | Computed and cached |
| `Shape::strides()` (cached) | O(1) | Returned from cache |
| Thread-local cache hit | O(1) | Zero lock contention |
| Global cache hit | O(1) | Minimal lock contention |
**Stride Cache Performance**:
- Thread-local hit rate: ~95% (typical workload)
- Global cache hit rate: ~90% (multi-threaded)
- Cache miss penalty: ~50-100 ns (computation + caching)
- Thread-local access: < 5 ns
- Global cache access: ~10-20 ns (with lock)
## Data Type Operations
### DType Size and Properties
| `DType::size()` | O(1) | Const lookup |
| `DType::is_float()` | O(1) | Const pattern match |
| `DType::is_int()` | O(1) | Const pattern match |
| `DType::promote_types()` | O(1) | Const-optimized rules |
**Type Promotion Performance**:
- All type promotions: < 5 ns
- Complex type promotion: ~5-10 ns
- Quantized type promotion: ~10-15 ns
## Memory Management
### Memory Pooling
| < 1KB | ~10-20 ns | ~5-10 ns | Low |
| 1KB-64KB | ~20-50 ns | ~10-20 ns | Low |
| > 64KB | OS-dependent | OS-dependent | Medium |
**Pool Configuration**:
- Small pool (< 1KB): Thread-local, lock-free
- Medium pool (1KB-64KB): Shared, lock-based
- Large allocations: Direct OS allocation
**Pool Statistics** (typical workload):
- Hit rate: ~85-95%
- Miss penalty: ~100-500 ns (OS allocation)
- Memory overhead: ~5-10%
### NUMA-Aware Allocation
| LocalPreferred | ~50-100 ns | High (95%+ local) |
| LocalOnly | ~50-100 ns | Very High (100% local) |
| Interleave | ~100-200 ns | Medium (distributed) |
| Bind | ~50-100 ns | Very High (100% on node) |
**NUMA Performance**:
- Local access: 1x (baseline)
- Remote access: 1.5-2x slower
- NUMA-aware speedup: Up to 2x for large tensors
- Topology detection: ~1-5 ms (startup cost)
### Memory-Mapped Storage
| File mapping | O(1) | OS page mapping, ~1-10 ms |
| Page load (lazy) | O(1) | On-demand, ~10-100 μs |
| Page access (cached) | O(1) | In-memory, < 10 ns |
| Prefetching | O(k) | Background, adaptive |
**Lazy Loading Performance**:
- Page size: 4KB-64KB (configurable)
- Cache size: 100-1000 pages (configurable)
- LRU eviction: ~20-50 ns
- Prefetch benefit: 10-50% faster for sequential access
## Device Operations
### Device Type Comparisons
| Device equality | O(1) | Simple enum comparison |
| Device capabilities | O(1) | Cached on creation |
| Device sync (CPU) | O(1) | No-op for CPU |
| Device sync (GPU) | O(1) | Hardware-dependent |
**Device Performance**:
- CPU device creation: < 5 ns
- GPU device creation: ~100-500 ns (driver overhead)
- Device switching: ~50-100 ns (CPU↔CPU), ~1-10 μs (CPU↔GPU)
### Cross-Device Transfer
| CPU → CPU | Memory bandwidth (~50 GB/s) | ~10-50 ns |
| CPU → GPU (PCIe 3.0) | ~12 GB/s | ~10-50 μs |
| CPU → GPU (PCIe 4.0) | ~25 GB/s | ~10-50 μs |
| GPU → GPU (same device) | ~1 TB/s | ~1-10 μs |
**Transfer Optimization**:
- Pinned memory: 2-3x faster CPU↔GPU
- Compression (if enabled): 1.5-5x faster (depends on data)
- Async transfer: Overlaps with computation
## Const Generics (Type-Level Shapes)
### Compile-Time Shape Verification
| Shape type checking | O(1) | Zero (compile-time only) |
| MatMul compatibility | O(1) | Zero (compile-time only) |
| Broadcasting check | O(1) | Zero (compile-time only) |
| Reshape validation | O(1) | Zero (compile-time only) |
**Type-Level Performance Benefits**:
- Eliminates runtime shape checks
- Enables better compiler optimizations
- Zero runtime overhead
- Compile-time error detection
## Benchmarking Results
### Shape Benchmarks (shape_bench.rs)
```
Shape Creation (2D): 5.2 ns
Shape Creation (4D): 12.8 ns
Shape Scalar: 2.3 ns
Broadcasting (2 shapes): 38.4 ns
Broadcasting (SIMD): 15.7 ns
Stride Calculation: 48.2 ns
Stride Cache Hit: 3.1 ns
```
### DType Benchmarks (dtype_bench.rs)
```
DType Size: 1.8 ns
DType is_float: 1.2 ns
Type Promotion (F32+F64): 4.3 ns
Type Promotion (Complex): 8.7 ns
Quantized Conversion: 142.5 ns
```
### Storage Benchmarks (storage_bench.rs)
```
Pool Allocation (small): 18.3 ns
Pool Allocation (medium): 47.2 ns
Direct Allocation: 523.8 ns
Pool Hit Rate: 92.4%
NUMA Local Access: 1.0x (baseline)
NUMA Remote Access: 1.8x slower
```
## Performance Tips
### Optimization Strategies
1. **Use const generics when possible**:
- Eliminates runtime checks
- Enables better compiler optimizations
- Type-safe shape operations
2. **Enable SIMD for large operations**:
- Compile with `RUSTFLAGS="-C target-cpu=native"`
- Use `--features simd` for explicit SIMD support
- 4x speedup for large broadcasting operations
3. **Leverage memory pooling**:
- Pre-warm pools for common sizes
- Configure pool sizes based on workload
- 10-50x faster than OS allocation
4. **Use NUMA-aware allocation for large tensors**:
- Significant benefit (2x) for > 10 MB tensors
- Use `LocalPreferred` policy for best default performance
- Profile NUMA topology on multi-socket systems
5. **Enable lazy loading for very large datasets**:
- Reduces memory footprint by 10-100x
- Configure prefetching for sequential access
- Adaptive loading based on access patterns
### Profiling
Use the built-in profiler for performance analysis:
```rust
use torsh_core::profiling::Profiler;
let profiler = Profiler::new();
profiler.start_operation("tensor_matmul");
// ... perform operation ...
profiler.end_operation("tensor_matmul");
let report = profiler.generate_report();
println!("{}", report); // Shows bottlenecks and recommendations
```
## Platform-Specific Notes
### x86_64
- **SIMD**: AVX2 (4x), AVX-512 (8x) when available
- **Cache**: 64-byte cache lines
- **Best for**: High-performance computing, servers
### ARM64 (Apple Silicon)
- **SIMD**: NEON (4x) standard, SVE/SVE2 when available
- **Cache**: 128-byte cache lines on M1/M2
- **Unified memory**: Zero-copy CPU↔GPU on Apple Silicon
- **Best for**: Mobile, embedded, Apple devices
### WebAssembly
- **SIMD**: WASM SIMD (4x) when available
- **Memory**: Limited to 4GB (WASM32)
- **Best for**: Browser-based ML, edge computing
## Future Optimizations
Planned performance improvements:
- [ ] GPU-accelerated shape operations for very large tensors
- [ ] Advanced JIT compilation for hot paths
- [ ] Tensor compression for reduced memory bandwidth
- [ ] Distributed memory management for multi-node systems
- [ ] Custom SIMD kernels for specialized operations
## Benchmarking Methodology
All benchmarks were conducted on:
- **CPU**: AMD Ryzen 9 5950X (16 cores, 32 threads)
- **RAM**: 64GB DDR4-3600 CL16
- **OS**: Ubuntu 22.04 LTS
- **Rust**: 1.76.0
- **Criterion**: 0.5.1
Run benchmarks yourself:
```bash
cargo bench --package torsh-core
```