Module simd

Re-exports

dot_simd: Dot product SIMD implementation with FMA (fused multiply-add)
elementwise_simd: Prototype SIMD path that currently delegates to scalar implementation.
elementwise_simd_supported: Placeholder for SIMD CPU execution strategy. In a full implementation this would contain vectorized loops and checks for AVX/NEON availability.
matmul_simd: SIMD-accelerated matrix multiplication. Uses a blocked (tiled) algorithm with SIMD vectorization for the inner loops.
reduce_simd: SIMD-accelerated reduction (sum, max, min, mean). For the full-sum case (axis is None) it implements an AVX2 vectorized loop that accumulates into an __m256 register and then reduces that register horizontally. On other architectures, or when AVX2 is absent, it falls back to a scalar loop.
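
The full-sum path described for reduce_simd can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function names `sum_f32`, `sum_f32_scalar`, and `sum_f32_avx2` are hypothetical, and the real `reduce_simd` signature (axis handling, tensor types, max/min/mean) is not shown on this page. The sketch assumes a flat `&[f32]` input, detects AVX2 at runtime, accumulates eight lanes at a time into an `__m256` register, reduces that register horizontally, and falls back to a scalar loop elsewhere.

```rust
// Hypothetical sketch: names and signatures are assumptions, not this crate's API.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sums a slice, using the AVX2 path when the CPU supports it
/// and a scalar loop otherwise.
pub fn sum_f32(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { sum_f32_avx2(data) };
        }
    }
    sum_f32_scalar(data)
}

/// Scalar fallback for other architectures or CPUs without AVX2.
fn sum_f32_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_f32_avx2(data: &[f32]) -> f32 {
    let chunks = data.chunks_exact(8);
    // Elements left over after the last full group of 8.
    let tail = chunks.remainder();

    // Eight partial sums, one per f32 lane.
    let mut acc = _mm256_setzero_ps();
    for chunk in chunks {
        // Unaligned load of 8 consecutive f32 values.
        let v = _mm256_loadu_ps(chunk.as_ptr());
        acc = _mm256_add_ps(acc, v);
    }

    // Horizontal reduction: spill the register to an array and add the lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut total: f32 = lanes.iter().sum();

    // Add the tail elements that did not fill a full vector.
    total += tail.iter().sum::<f32>();
    total
}
```

Calling `sum_f32(&values)` picks the path at runtime, so the same binary runs on CPUs with and without AVX2. Because the vector path adds elements in a different order than the scalar loop, results for general floating-point inputs may differ in the last few bits. The horizontal step here spills to memory for readability; a shuffle-based reduction would be an alternative.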