High-performance CPU memory reduction kernels.
Reductions are bandwidth-bound: each one reads N elements and writes a single scalar. A single-threaded loop that LLVM auto-vectorizes already saturates the memory bus, so Rayon is not used.
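As an illustration of the kind of loop described above, here is a minimal sketch of a single-threaded sum written so that LLVM has a chance to vectorize it. The function name `sum_f32` and the constant `LANES` are hypothetical and are not taken from this module. Because floating-point addition is not associative, the compiler will not reorder a naive serial sum on its own, so the sketch keeps several independent accumulators. Whether the loop is actually vectorized depends on the target and optimization level.

```rust
/// Hypothetical helper for illustration only; not part of this module's API.
/// Sums a slice of f32 on a single thread.
pub fn sum_f32(data: &[f32]) -> f32 {
    // Assumed accumulator count; 8 matches one 256-bit vector of f32.
    const LANES: usize = 8;

    // Independent partial sums, one per lane, so no addition depends on
    // the result of the previous one within a chunk.
    let mut acc = [0.0f32; LANES];

    let mut chunks = data.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }

    // Elements left over when the length is not a multiple of LANES.
    let tail: f32 = chunks.remainder().iter().sum();

    // Combine the per-lane partial sums with the tail.
    acc.iter().sum::<f32>() + tail
}
```

Note that splitting the sum across lanes changes the order of additions, so the result can differ in the last bits from a strictly left-to-right sum.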