Module tiled_reductions

Tiled (blocked) reductions for large tensors

This module implements cache-friendly reduction operations using tiling/blocking strategies to optimize memory access patterns for large tensors.

§Performance Benefits

  • Improved cache locality through data blocking
  • Reduced cache misses for large tensors
  • Better memory bandwidth utilization
  • Parallel reduction with thread-local accumulators
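
The last point can be illustrated with a minimal sketch using only the standard library. It is not this module's implementation: the function name `parallel_sum` and the `num_threads` parameter are assumptions made for illustration. Each worker sums its own contiguous chunk into a private accumulator, so no locking or atomics are needed until the final combine step.

```rust
use std::thread;

/// Hypothetical helper (not part of this module's API): sums `data`
/// using up to `num_threads` scoped worker threads.
fn parallel_sum(data: &[f64], num_threads: usize) -> f64 {
    // Treat 0 as 1 so the division below is well defined.
    let num_threads = num_threads.max(1);
    // Ceiling division; at least 1 because `chunks(0)` panics.
    let chunk_len = ((data.len() + num_threads - 1) / num_threads).max(1);

    thread::scope(|s| {
        // Spawn one worker per chunk; each borrows its slice of `data`.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    // Thread-local accumulator: written by this thread only.
                    let mut acc = 0.0_f64;
                    for &x in chunk {
                        acc += x;
                    }
                    acc
                })
            })
            .collect();

        // Combine the per-thread partial sums on the calling thread.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<f64>()
    })
}
```

Because floating-point addition is not associative, the result of a sketch like this can differ in the last bits from a sequential sum, depending on how the input is split.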

§Tiling Strategy

For a large tensor, we break it into tiles that fit in L1/L2 cache and process each tile independently before combining results.
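
The split-then-combine structure described above might look roughly like the following sketch. The name `tiled_sum` and the constant `TILE_LEN` are hypothetical, and the tile size is an assumed value rather than one taken from this module. For a flat one-dimensional sum the cache benefit is small, since the data is streamed once either way; the sketch is meant only to show the shape of the computation.

```rust
/// Assumed tile length: 4096 `f32` values occupy 16 KiB, which fits in
/// a typical L1 data cache. The module's actual tile size may differ.
const TILE_LEN: usize = 4096;

/// Hypothetical helper: reduces each tile to a partial sum, then
/// combines the partial sums into the final result.
fn tiled_sum(data: &[f32]) -> f32 {
    data.chunks(TILE_LEN)
        // Reduce one tile independently of all the others.
        .map(|tile| tile.iter().sum::<f32>())
        // Combine the per-tile partial results.
        .sum()
}
```

The final tile may be shorter than `TILE_LEN`; `chunks` handles that case, and an empty input yields `0.0`.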