SIMD acceleration for tensor operations (AVX2, 4-wide f64).
Provides AVX2-accelerated kernels for:
- Element-wise binary operations (add, sub, mul, div)
- Element-wise unary operations (relu, abs, neg, sqrt)
- Inner loop of tiled matrix multiplication (axpy: c += a * b)
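The axpy inner loop can be sketched as below. This is a hedged sketch, not the crate's actual code: the helper names (`axpy`, `axpy_avx2`, `axpy_scalar`) are assumptions. It shows the three ingredients the list above implies: four f64 lanes per iteration, a separate multiply then add (no FMA), and a scalar tail for the remainder.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: c[0..len] += scalar * b[0..len].
fn axpy_scalar(c: &mut [f64], scalar: f64, b: &[f64]) {
    for (ci, bi) in c.iter_mut().zip(b) {
        *ci += scalar * *bi; // separate mul + add, never FMA
    }
}

/// AVX2 body: 4 lanes per iteration, scalar tail for the remainder.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn axpy_avx2(c: &mut [f64], scalar: f64, b: &[f64]) {
    let len = c.len().min(b.len());
    let s = _mm256_set1_pd(scalar);
    let mut i = 0;
    while i + 4 <= len {
        let cv = _mm256_loadu_pd(c.as_ptr().add(i));
        let bv = _mm256_loadu_pd(b.as_ptr().add(i));
        // Multiply, then add, matching scalar rounding exactly
        // (deliberately NOT _mm256_fmadd_pd).
        let prod = _mm256_mul_pd(s, bv);
        _mm256_storeu_pd(c.as_mut_ptr().add(i), _mm256_add_pd(cv, prod));
        i += 4;
    }
    axpy_scalar(&mut c[i..len], scalar, &b[i..len]); // tail elements
}

/// Dispatch: AVX2 when available at runtime, scalar otherwise.
fn axpy(c: &mut [f64], scalar: f64, b: &[f64]) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { axpy_avx2(c, scalar, b) };
    }
    axpy_scalar(c, scalar, b);
}

fn main() {
    let mut c = vec![1.0; 6];
    let b: Vec<f64> = (0..6).map(|i| i as f64).collect();
    axpy(&mut c, 2.0, &b);
    println!("{c:?}");
}
```

Because both paths use the same operation order per element, the two branches of `axpy` produce bit-identical output.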
§Determinism
All SIMD paths produce bit-identical results to scalar paths because:
- IEEE 754 mandates identical rounding for scalar and SIMD add/sub/mul/div/sqrt.
- No FMA instructions are used (`_mm256_fmadd_pd` changes rounding vs. separate mul+add, so we explicitly avoid it).
- Element-wise ops are independent: no cross-lane reductions.
- Tiled matmul SIMD processes multiple j-columns simultaneously, but each `C[i,j]` accumulates the same values in the same order.
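The FMA point is worth a concrete demonstration. A fused multiply-add rounds once, while a separate multiply and add rounds twice, so the two can produce different bits. The example below is illustrative only and uses the standard `f64::mul_add` (which is fused) against the plain expression:

```rust
fn main() {
    // (1 + eps) * (1 - eps) = 1 - eps^2 exactly, with eps = 2^-52.
    let a = 1.0 + f64::EPSILON;
    let b = 1.0 - f64::EPSILON;
    let c = -1.0;

    // Separate mul + add: a*b rounds to 1.0 first, so the sum is 0.0.
    let separate = a * b + c;

    // Fused mul-add: a single rounding keeps the -2^-104 term.
    let fused = a.mul_add(b, c);

    println!("separate = {separate:e}, fused = {fused:e}");
    assert_ne!(separate, fused);
}
```

Rust never contracts `a * b + c` into an FMA on its own, and this module applies the same discipline to its SIMD kernels, which is what keeps SIMD and scalar paths bit-identical.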
§Fallback
On non-x86_64 platforms or CPUs without AVX2, all functions fall back to scalar implementations that produce identical results.
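The fallback gate can be sketched as a `has_avx2`-style check; the body below is an assumption about how such a helper is typically written, not this crate's exact implementation. On x86_64 it queries CPU features at runtime; on every other architecture it is compile-time `false`, which routes callers to the scalar path:

```rust
/// Hypothetical sketch of a runtime AVX2 check.
fn has_avx2() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime CPUID-based detection from the standard library.
        is_x86_feature_detected!("avx2")
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        // Non-x86_64: no AVX2, always take the scalar fallback.
        false
    }
}

fn main() {
    // Either answer is fine: the scalar fallback produces identical results.
    println!("AVX2 available: {}", has_avx2());
}
```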
Enums§
- BinOp
- Dispatch tag for SIMD-able binary operations.
- UnaryOp
- Dispatch tag for SIMD-able unary operations.
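The dispatch tags can be pictured as plain enums selecting an element-wise operation. The variant names below are guesses based on the operation lists above, not the crate's definitions, and the bodies give the scalar reference semantics each tag selects:

```rust
/// Assumed shape of the binary dispatch tag.
#[derive(Clone, Copy)]
enum BinOp { Add, Sub, Mul, Div }

/// Assumed shape of the unary dispatch tag.
#[derive(Clone, Copy)]
enum UnaryOp { Relu, Abs, Neg, Sqrt }

/// Scalar semantics a BinOp tag selects for one element pair.
fn apply_bin(op: BinOp, a: f64, b: f64) -> f64 {
    match op {
        BinOp::Add => a + b,
        BinOp::Sub => a - b,
        BinOp::Mul => a * b,
        BinOp::Div => a / b,
    }
}

/// Scalar semantics a UnaryOp tag selects for one element.
fn apply_unary(op: UnaryOp, x: f64) -> f64 {
    match op {
        UnaryOp::Relu => x.max(0.0),
        UnaryOp::Abs => x.abs(),
        UnaryOp::Neg => -x,
        UnaryOp::Sqrt => x.sqrt(),
    }
}

fn main() {
    println!("{}", apply_bin(BinOp::Div, 6.0, 3.0));
    println!("{}", apply_unary(UnaryOp::Relu, -1.5));
}
```

A single tagged entry point keeps the SIMD lane logic (load, op, store) in one place instead of duplicating it per operation.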
Functions§
- has_avx2
- Runtime check for AVX2 support.
- simd_axpy
- SIMD-accelerated AXPY: `c[0..len] += scalar * b[0..len]`.
- simd_binop
- SIMD-accelerated element-wise binary operation on equal-length slices.
- simd_unary
- SIMD-accelerated element-wise unary operation.