Tiled (blocked) reductions for large tensors
This module implements cache-friendly reduction operations, using tiling (blocking) strategies to optimize memory access patterns for large tensors.
§Performance Benefits
- Improved cache locality through data blocking
- Reduced cache misses for large tensors
- Better memory bandwidth utilization
- Parallel reduction with thread-local accumulators
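The last point, parallel reduction with thread-local accumulators, can be sketched with the standard library alone; `parallel_sum` and `n_threads` are hypothetical names for illustration, and a real implementation would likely use a thread pool such as rayon instead of spawning scoped threads per call:

```rust
use std::thread;

/// Minimal sketch: each worker reduces its own sub-slice into a local
/// accumulator, so threads never contend on shared state; the per-thread
/// partial sums are combined once at the end.
fn parallel_sum(data: &[f32], n_threads: usize) -> f32 {
    let chunk = data.len().div_ceil(n_threads).max(1);
    thread::scope(|s| {
        // Spawn one scoped thread per chunk; each returns its partial sum.
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().sum::<f32>()))
            .collect();
        // Combine the thread-local partials into the final result.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```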
§Tiling Strategy
For a large tensor, we break it into tiles that fit in L1/L2 cache and process each tile independently before combining results.
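The strategy above can be sketched as follows for a flat `&[f32]` slice; `tiled_sum` and `TILE` are illustrative names, not part of this module's API, and the tile size would be tuned per target cache in real code:

```rust
/// Hypothetical tile size: 2048 f32s = 8 KiB, small enough to stay
/// resident in a typical L1 data cache while the tile is reduced.
const TILE: usize = 2048;

/// Minimal sketch of a tiled sum reduction: each tile is reduced
/// independently while its data is hot in cache, then the per-tile
/// partial sums are combined.
fn tiled_sum(data: &[f32]) -> f32 {
    data.chunks(TILE)
        .map(|tile| tile.iter().sum::<f32>())
        .sum()
}
```

The per-tile partials also make this scheme easy to parallelize: tiles are independent, so they can be handed to separate threads and their partial results combined at the end.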