pub fn relu_alloc(input: &[f32]) -> Vec<f32>
Applies ReLU to input and returns the result in a newly allocated Vec. The output buffer is not zero-filled first, which avoids the overhead of vec![0.0; n].
The output Vec is fully initialized by the SIMD or scalar loop before it is returned.
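One way the allocate-without-zero-fill pattern described above could look is sketched below. This is an illustrative, scalar-only approximation and not the crate's actual implementation: the name relu_alloc_sketch is hypothetical, the SIMD path is omitted, and mapping NaN to 0.0 is an assumption of this sketch.

```rust
use std::mem::MaybeUninit;

/// Hypothetical scalar-only sketch of a ReLU that allocates its output
/// without zero-filling it first. Not the real `relu_alloc`.
pub fn relu_alloc_sketch(input: &[f32]) -> Vec<f32> {
    let n = input.len();

    // Reserve room for n elements; the length stays 0 and nothing is written yet.
    let mut out: Vec<f32> = Vec::with_capacity(n);

    // View the uninitialized tail as MaybeUninit slots. The allocator may
    // hand back more capacity than requested, so restrict the view to n.
    let spare: &mut [MaybeUninit<f32>] = &mut out.spare_capacity_mut()[..n];

    // Write every slot exactly once. Assumption: NaN inputs map to 0.0,
    // because `x > 0.0` is false for NaN.
    for (slot, &x) in spare.iter_mut().zip(input) {
        slot.write(if x > 0.0 { x } else { 0.0 });
    }

    // SAFETY: `spare` and `input` both have length n, so the loop above
    // initialized the first n elements, and n <= capacity.
    unsafe { out.set_len(n) };
    out
}
```

Calling relu_alloc_sketch(&[-1.0, 0.0, 2.5]) yields a Vec containing 0.0, 0.0, 2.5. The single unsafe call is sound only because every element is written before set_len runs, which is the same invariant the note above states for the real function.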