Vectorized broadcasting operations
This module provides fast paths for broadcasting between tensors with compatible shapes. When the shapes are compatible and the underlying memory is contiguous and properly aligned, the operation can use SIMD instructions, which process several elements per instruction instead of one.
§Optimization Strategy
- Detect common broadcasting patterns (scalar, 1D broadcast, etc.)
- Use specialized kernels for each pattern
- Leverage SIMD for aligned, contiguous data
- Fall back to standard broadcasting for complex cases
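The strategy above can be sketched as a classify-then-dispatch step. This is a minimal illustration, not this module's actual API: `BroadcastPattern`, `classify`, and `add_fast` are hypothetical names, the buffers are assumed to be contiguous row-major `f32` slices, and shapes are assumed to align on their trailing dimensions (NumPy-style).

```rust
/// Hypothetical pattern classification (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BroadcastPattern {
    /// Both operands have the same shape.
    SameShape,
    /// One operand holds exactly one element.
    Scalar,
    /// The lower-rank shape equals the trailing dimensions of the other.
    TrailingMatch,
    /// Anything else: defer to the general broadcasting path.
    General,
}

fn numel(shape: &[usize]) -> usize {
    shape.iter().product()
}

pub fn classify(lhs: &[usize], rhs: &[usize]) -> BroadcastPattern {
    if lhs == rhs {
        return BroadcastPattern::SameShape;
    }
    if numel(lhs) == 1 || numel(rhs) == 1 {
        return BroadcastPattern::Scalar;
    }
    let (small, large) = if lhs.len() <= rhs.len() { (lhs, rhs) } else { (rhs, lhs) };
    if large.ends_with(small) {
        return BroadcastPattern::TrailingMatch;
    }
    BroadcastPattern::General
}

/// Elementwise add over contiguous row-major buffers.
/// Returns `None` when no specialized kernel applies, so the caller can fall back.
pub fn add_fast(a: &[f32], a_shape: &[usize], b: &[f32], b_shape: &[usize]) -> Option<Vec<f32>> {
    // Reject buffers whose length does not match their declared shape.
    if a.len() != numel(a_shape) || b.len() != numel(b_shape) {
        return None;
    }
    match classify(a_shape, b_shape) {
        BroadcastPattern::SameShape => Some(a.iter().zip(b).map(|(x, y)| x + y).collect()),
        BroadcastPattern::Scalar => {
            // Addition is commutative, so only the scalar side matters.
            let (large, s) = if b.len() == 1 { (a, b[0]) } else { (b, a[0]) };
            Some(large.iter().map(|x| x + s).collect())
        }
        BroadcastPattern::TrailingMatch => {
            // The higher-rank operand is the large one.
            let (large, small) = if a_shape.len() > b_shape.len() { (a, b) } else { (b, a) };
            if small.is_empty() {
                return Some(Vec::new());
            }
            let mut out = Vec::with_capacity(large.len());
            // Reuse the small operand for every contiguous chunk of the large one.
            for row in large.chunks_exact(small.len()) {
                out.extend(row.iter().zip(small).map(|(x, y)| x + y));
            }
            Some(out)
        }
        BroadcastPattern::General => None,
    }
}
```

Returning `None` for the general case leaves the fallback decision to the caller. The tight loops over contiguous slices are the kind of code a compiler can often auto-vectorize; explicit SIMD intrinsics and alignment checks are left out of this sketch.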
§Common Patterns
- Scalar broadcast: (1,) + (N,M,K) → (N,M,K)
- Vector broadcast: (M,) + (N,M) → (N,M)
- Matrix broadcast: (M,K) + (N,M,K) → (N,M,K)
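To check whether two shapes fit one of these patterns, a small helper can compute the broadcast result shape. The sketch below assumes NumPy-style rules (shapes are compared from the rightmost dimension, and a dimension of 1 or a missing leading dimension stretches to match), which this module may or may not follow exactly. `broadcast_shape` is a hypothetical name, not part of this module.

```rust
/// Computes the broadcast result shape under NumPy-style rules (an assumption).
/// Returns `None` when the shapes are incompatible.
pub fn broadcast_shape(lhs: &[usize], rhs: &[usize]) -> Option<Vec<usize>> {
    let rank = lhs.len().max(rhs.len());
    let mut out = vec![1; rank];
    for i in 0..rank {
        // Read dimensions from the right; missing leading dimensions count as 1.
        let l = if i < lhs.len() { lhs[lhs.len() - 1 - i] } else { 1 };
        let r = if i < rhs.len() { rhs[rhs.len() - 1 - i] } else { 1 };
        out[rank - 1 - i] = match (l, r) {
            (l, r) if l == r => l,
            (1, r) => r,
            (l, 1) => l,
            // Two different sizes, neither of which is 1, cannot be broadcast.
            _ => return None,
        };
    }
    Some(out)
}
```

For example, `broadcast_shape(&[1], &[2, 3, 4])` yields `Some(vec![2, 3, 4])`, matching the scalar pattern.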