pub fn dot_simd(a: &[f32], b: &[f32]) -> f32
SIMD-accelerated dot product.
Dispatches at runtime to the AVX2+FMA fast path on capable x86_64
CPUs (practically every deployment target since ~2014) and falls
back to the portable wide::f32x8 implementation otherwise.
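A minimal sketch of what such runtime dispatch can look like, using only the standard library. The names `dot_dispatch`, `dot_avx2_fma`, and `dot_portable` are hypothetical, the fallback here is a plain scalar loop rather than the `wide::f32x8` path, and truncating to the shorter slice is an assumption; this is not the crate's actual implementation.

```rust
// Hypothetical illustration of runtime feature dispatch; not the crate's code.
pub fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        // The check runs at runtime, so one binary serves old and new CPUs.
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both required CPU features were verified just above.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_portable(a, b)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;

    // Assumption: mismatched lengths are truncated to the shorter slice.
    let n = a.len().min(b.len());
    let chunks = n / 8;

    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // SAFETY: i * 8 + 8 <= n, so both unaligned loads stay in bounds.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb
        }
    }

    // Horizontal sum of the eight lanes.
    let mut lanes = [0.0f32; 8];
    // SAFETY: `lanes` has room for exactly eight f32 values.
    unsafe { _mm256_storeu_ps(lanes.as_mut_ptr(), acc) };
    let mut sum: f32 = lanes.iter().sum();

    // Scalar tail for the elements that did not fill a full vector.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

// Scalar stand-in for the portable path.
fn dot_portable(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Callers would use `dot_dispatch(&a, &b)` exactly as they would a scalar dot product; the feature check is hidden behind the safe wrapper.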
The portable path uses four independent accumulators so that CPUs with multiple multiply/FMA ports can keep several multiply-adds in flight at once; with a single accumulator, every add would have to wait for the previous one, serialising the loop on one dependency chain.
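The multi-accumulator idea can be sketched with scalars standing in for the `wide::f32x8` vectors. The function name `dot_four_acc` is hypothetical, and the truncation to the shorter slice is an assumption; the point is only that the four running sums do not depend on one another.

```rust
// Hypothetical scalar illustration of four independent accumulators.
pub fn dot_four_acc(a: &[f32], b: &[f32]) -> f32 {
    // Assumption: mismatched lengths are truncated to the shorter slice.
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    // Four running sums with no data dependency between them, so the CPU
    // can overlap their multiply-adds instead of waiting on a single chain.
    let mut acc = [0.0f32; 4];

    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        acc[0] += x[0] * y[0];
        acc[1] += x[1] * y[1];
        acc[2] += x[2] * y[2];
        acc[3] += x[3] * y[3];
    }

    // Combine the partial sums only once, after the main loop.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Handle the up-to-three leftover elements.
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        sum += x * y;
    }
    sum
}
```

Because floating-point addition is not associative, splitting the sum this way can change the result in the last bits compared with a strictly sequential loop.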