Tiled (blocked) reductions for large tensors
This module implements cache-friendly reduction operations, using tiling (blocking) strategies to optimize memory access patterns for large tensors.
§Performance Benefits
- Improved cache locality through data blocking
- Reduced cache misses for large tensors
- Better memory bandwidth utilization
- Parallel reduction with thread-local accumulators
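The last point, parallel reduction with thread-local accumulators, can be sketched with the standard library alone; `parallel_sum` and `n_threads` are hypothetical names for illustration, and a real implementation would likely use a thread pool such as rayon instead of spawning scoped threads per call:

```rust
use std::thread;

/// Minimal sketch: each worker reduces its own sub-slice into a local
/// accumulator, so threads never contend on shared state; the per-thread
/// partial sums are combined once at the end.
fn parallel_sum(data: &[f32], n_threads: usize) -> f32 {
    let chunk = data.len().div_ceil(n_threads).max(1);
    thread::scope(|s| {
        // Spawn one scoped thread per chunk; each returns its partial sum.
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().sum::<f32>()))
            .collect();
        // Combine the thread-local partials into the final result.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```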
§Tiling Strategy
For a large tensor, we break it into tiles that fit in L1/L2 cache and process each tile independently before combining results.
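The strategy above can be sketched as follows for a flat `&[f32]` slice; `tiled_sum` and `TILE` are illustrative names, not part of this module's API, and the tile size would be tuned per target cache in real code:

```rust
/// Hypothetical tile size: 2048 f32s = 8 KiB, small enough to stay
/// resident in a typical L1 data cache while the tile is reduced.
const TILE: usize = 2048;

/// Minimal sketch of a tiled sum reduction: each tile is reduced
/// independently while its data is hot in cache, then the per-tile
/// partial sums are combined.
fn tiled_sum(data: &[f32]) -> f32 {
    data.chunks(TILE)
        .map(|tile| tile.iter().sum::<f32>())
        .sum()
}
```

The per-tile partials also make this scheme easy to parallelize: tiles are independent, so they can be handed to separate threads and their partial results combined at the end.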