SIMD-accelerated reductions (sum, max, min, mean). For the full-sum case (axis None) we
implement an AVX2 vectorized loop that accumulates into an __m256
register and then reduces it horizontally to a single scalar. On other architectures, or
when AVX2 is unavailable, we fall back to a scalar loop.
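
The full-sum path described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the names `sum_all`, `sum_avx2`, and `sum_scalar` are hypothetical, the element type is assumed to be 32-bit float (eight lanes per __m256), and the runtime dispatch relies on the GCC/Clang-specific `__builtin_cpu_supports` and `target` attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC or Clang on x86; everything else takes the scalar path. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Scalar fallback: plain left-to-right accumulation. */
static float sum_scalar(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

#if HAVE_X86_DISPATCH
/* Vector path: 8 floats per iteration accumulated in one __m256 register. */
__attribute__((target("avx2")))
static float sum_avx2(const float *x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i)); /* unaligned load */

    /* Horizontal reduction: 8 lanes -> 4 -> 2 -> 1. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    float total = _mm_cvtss_f32(s);

    /* Remaining tail of fewer than 8 elements. */
    for (; i < n; ++i) total += x[i];
    return total;
}
#endif

/* Hypothetical entry point: use the AVX2 loop when the CPU reports support. */
float sum_all(const float *x, size_t n) {
    assert(x != NULL || n == 0);
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) return sum_avx2(x, n);
#endif
    return sum_scalar(x, n);
}
```

Because the vector loop adds elements in a different order than the scalar loop, the two paths can differ in the last bits of the result for general floating-point inputs; they agree exactly only when every partial sum is representable without rounding.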