pub fn dot_simd(a: &Array, b: &Array) -> Result<f32>
SIMD implementation of the dot product using FMA (fused multiply-add).
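
To illustrate what an FMA-based dot product does, here is a minimal sketch. It is not the implementation behind `dot_simd`: the source shows neither the `Array` type nor the crate's error type, so this version assumes plain `&[f32]` slices and a hypothetical `DotError`. The names `dot_fma_sketch` and `LANES` are also made up for illustration. It uses the standard library's `f32::mul_add` and leaves vectorization to the compiler instead of calling explicit SIMD intrinsics.

```rust
/// Hypothetical error type for this sketch; the real crate's error type
/// is not shown in the source.
#[derive(Debug, PartialEq)]
pub enum DotError {
    LengthMismatch { left: usize, right: usize },
}

/// Number of independent accumulators. Eight f32 values fill one 256-bit
/// register, but this width is an assumption, not something the source states.
const LANES: usize = 8;

/// Dot product over slices using fused multiply-add on every element.
pub fn dot_fma_sketch(a: &[f32], b: &[f32]) -> Result<f32, DotError> {
    if a.len() != b.len() {
        return Err(DotError::LengthMismatch {
            left: a.len(),
            right: b.len(),
        });
    }

    // Independent per-lane accumulators give the compiler room to map the
    // inner loop onto vector FMA instructions when the target supports them.
    let mut acc = [0.0f32; LANES];
    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);

    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        for (lane, (&x, &y)) in acc.iter_mut().zip(ca.iter().zip(cb)) {
            // x * y + *lane, computed with a single rounding step.
            *lane = x.mul_add(y, *lane);
        }
    }

    // Horizontal reduction of the lane accumulators.
    let mut sum: f32 = acc.iter().sum();

    // Scalar tail for the elements that did not fill a whole chunk.
    for (&x, &y) in a_chunks.remainder().iter().zip(b_chunks.remainder()) {
        sum = x.mul_add(y, sum);
    }

    Ok(sum)
}
```

Calling `dot_fma_sketch(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0])` returns `Ok(32.0)`, and slices of different lengths produce `Err(DotError::LengthMismatch { .. })`. Because each fused multiply-add rounds once rather than twice, results can differ in the last bits from a plain multiply-then-add loop. On targets without a hardware FMA instruction, `mul_add` may fall back to a slower software routine.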