GPU compute kernels (WGSL shaders)
Parallel Reduction Algorithm (Harris 2007):
- Each thread loads one element
- Workgroup-local reduction using shared memory
- Global reduction of workgroup results
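
The three steps above can be modeled on the CPU to make the data flow concrete. The following is a minimal sketch, not the crate's actual kernel: `WORKGROUP_SIZE`, `reduce_workgroup`, and `reduce_sum` are hypothetical names, the operation is assumed to be a sum over `f32`, and a fixed-size array stands in for workgroup shared memory.

```rust
/// Hypothetical workgroup size for this sketch; must be a power of two.
const WORKGROUP_SIZE: usize = 8;

/// CPU model of a single workgroup (steps 1 and 2 of the list above).
fn reduce_workgroup(chunk: &[f32]) -> f32 {
    // Step 1: each "thread" loads one element into "shared memory".
    // Threads past the end of the input keep the additive identity 0.0.
    let mut shared = [0.0f32; WORKGROUP_SIZE];
    for (slot, &value) in shared.iter_mut().zip(chunk) {
        *slot = value;
    }

    // Step 2: tree reduction. The number of active threads halves each step.
    let mut stride = WORKGROUP_SIZE / 2;
    while stride > 0 {
        for tid in 0..stride {
            shared[tid] += shared[tid + stride];
        }
        // A real GPU kernel would need a workgroup barrier at this point.
        stride /= 2;
    }
    shared[0]
}

/// Step 3: keep reducing the per-workgroup partial results until one remains.
/// Returns 0.0 for empty input (an assumption of this sketch).
fn reduce_sum(data: &[f32]) -> f32 {
    let mut current: Vec<f32> = data.to_vec();
    while current.len() > 1 {
        current = current
            .chunks(WORKGROUP_SIZE)
            .map(reduce_workgroup)
            .collect();
    }
    current.first().copied().unwrap_or(0.0)
}
```

With this sketch, `reduce_sum(&[1.0, 2.0, 3.0, 4.0, 5.0])` evaluates to `15.0`. Each pass of the `while` loop in `reduce_sum` plays the role of one kernel dispatch; on a GPU the inner `for tid` loop would run as parallel threads rather than sequentially.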
Performance: O(N/P + log P) parallel time, where N = number of input elements and P = number of threads
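
One common reading of this bound is that each of the P threads first accumulates about N/P elements sequentially, and the P partial results are then combined in a tree of log2(P) steps. The sketch below just evaluates that cost model for concrete values. `estimated_steps` is a hypothetical helper, and it assumes P is a power of two; it is not a measurement of the actual kernels.

```rust
/// Step count under the N/P + log2(P) cost model (illustrative only).
/// `n` is the element count, `p` the thread count (assumed a power of two).
fn estimated_steps(n: u64, p: u64) -> u64 {
    assert!(p.is_power_of_two(), "this sketch assumes P is a power of two");
    // Sequential part: each thread handles ceil(N / P) elements.
    let sequential = (n + p - 1) / p;
    // Tree part: log2(P) combining steps; for a power of two this equals
    // the number of trailing zero bits.
    let tree = u64::from(p.trailing_zeros());
    sequential + tree
}
```

For example, `estimated_steps(1_048_576, 256)` gives 4096 + 8 = 4104 steps, compared with 1,048,576 for a single thread (`estimated_steps(1_048_576, 1)`). The log P term only dominates once P approaches N.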