pub fn warp_reduce_sum(lanes: &[f64]) -> Vec<f64>
Simulates a warp-level sum reduction (an all-reduce): returns one value per input lane, each equal to the sum of all values in `lanes`.
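
A minimal sketch of how such a simulation could look, not necessarily how this crate implements it. It assumes a butterfly (XOR-shuffle) exchange, the pattern commonly used for warp all-reduce on GPUs, when the lane count is a power of two, and falls back to a plain sum otherwise. The fallback and the handling of empty input are assumptions for illustration.

```rust
/// Hypothetical sketch of a simulated warp all-reduce sum.
/// Assumption: a butterfly (XOR) exchange is used when the lane count is a
/// power of two; any other lane count falls back to a plain sequential sum.
pub fn warp_reduce_sum(lanes: &[f64]) -> Vec<f64> {
    let n = lanes.len();

    // Fallback path (also covers the empty slice, since 0 is not a power of two).
    if !n.is_power_of_two() {
        let total: f64 = lanes.iter().sum();
        return vec![total; n];
    }

    let mut vals = lanes.to_vec();
    let mut offset = n / 2;
    while offset > 0 {
        // Snapshot the previous step so every lane reads its partner's old
        // value, as lanes exchanging registers in lockstep would.
        let prev = vals.clone();
        for i in 0..n {
            vals[i] = prev[i] + prev[i ^ offset];
        }
        offset /= 2;
    }
    // After log2(n) steps every lane holds the full sum.
    vals
}
```

For example, calling `warp_reduce_sum(&[1.0, 2.0, 3.0, 4.0])` yields a four-element vector in which every lane holds `10.0`. Floating-point addition is not associative, so for general inputs the butterfly order may differ from a sequential sum in the last bits.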