pub fn reduce_broadcast(data: &[f64]) -> Vec<f64>
Reduce data to a single scalar and broadcast that value back to every position, so the returned vector has the same length as the input.
Mimics the GPU pattern: reduce in shared memory → broadcast from lane 0.
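
The reduce-then-broadcast pattern can be sketched on the CPU as follows. This is a minimal illustration, not the crate's actual implementation: the reduction operator is not stated here, so the sketch assumes a sum, and the name reduce_broadcast_sketch is hypothetical.

```rust
/// Hypothetical sketch of the reduce-then-broadcast pattern.
/// ASSUMPTION: the reduction is a sum; the real operator is not documented here.
pub fn reduce_broadcast_sketch(data: &[f64]) -> Vec<f64> {
    // Reduce step: collapse the whole slice to one scalar.
    // On a GPU this would be a tree reduction in shared memory.
    let total: f64 = data.iter().sum();

    // Broadcast step: copy the scalar to every output position.
    // On a GPU, lane 0 would hold the result and the other lanes would read it.
    vec![total; data.len()]
}
```

Under the sum assumption, calling reduce_broadcast_sketch(&[1.0, 2.0, 3.0, 4.0]) returns [10.0, 10.0, 10.0, 10.0], and an empty slice yields an empty vector.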