SIMD-accelerated element-wise operations.
AVX2 implementations of ReLU, vector add, and scalar multiply. At large sizes these kernels are memory-bandwidth-bound; at small-to-medium sizes SIMD helps by reducing instruction count and enabling wider loads and stores.
§Algorithm
ReLU: _mm256_max_ps(x, zero) — single instruction per 8 elements
Add: _mm256_add_ps(a, b) — single instruction per 8 elements
Mul scalar: _mm256_mul_ps(x, scalar_vec) — single instruction per 8 elements
Contract: provable-contracts/contracts/activation-kernel-v1.yaml
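As a concrete illustration of the `_mm256_max_ps` approach above, here is a minimal sketch of an AVX2 ReLU kernel with runtime feature detection, a scalar tail for the final `len % 8` elements, and a portable fallback. The helper names (`relu_avx2`, the safe `relu` wrapper) are hypothetical and not necessarily this crate's actual implementation.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX2 path: one `_mm256_max_ps` per 8 lanes, plus a scalar tail.
/// Hypothetical sketch; not the crate's actual code.
///
/// # Safety
/// Caller must have verified AVX2 support at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn relu_avx2(x: &[f32], out: &mut [f32]) {
    let zero = _mm256_setzero_ps();
    let chunks = x.len() / 8;
    for i in 0..chunks {
        // Unaligned 8-lane load, a single max against zero, unaligned store.
        let v = _mm256_loadu_ps(x.as_ptr().add(i * 8));
        _mm256_storeu_ps(out.as_mut_ptr().add(i * 8), _mm256_max_ps(v, zero));
    }
    for i in chunks * 8..x.len() {
        out[i] = x[i].max(0.0); // scalar tail for len % 8 elements
    }
}

/// Safe wrapper: dispatches to AVX2 when available, else a scalar loop.
pub fn relu(x: &[f32], out: &mut [f32]) {
    assert_eq!(x.len(), out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            unsafe { relu_avx2(x, out) };
            return;
        }
    }
    for (o, &v) in out.iter_mut().zip(x) {
        *o = v.max(0.0);
    }
}

fn main() {
    // 9 elements: exercises one full 8-lane chunk plus the scalar tail.
    let x = [-1.0f32, 2.0, -3.0, 4.0, -5.0, 6.0, -7.0, 8.0, -9.0];
    let mut out = [0.0f32; 9];
    relu(&x, &mut out);
    assert_eq!(out, [0.0, 2.0, 0.0, 4.0, 0.0, 6.0, 0.0, 8.0, 0.0]);
    println!("{:?}", out);
}
```

The unaligned load/store intrinsics keep the sketch simple; an aligned variant would add a peel loop before the main body.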
Functions§
- add - Element-wise add: output_i = a_i + b_i
- add_alloc - Element-wise add with output allocation. Avoids zero-fill overhead.
- add_inplace - In-place add: a_i += b_i
- fused_add_relu - Fused add + ReLU: output_i = max(0, a_i + b_i)
- fused_add_relu_inplace - In-place fused add + ReLU: a_i = max(0, a_i + b_i)
- fused_mul_add - Fused multiply-add: output_i = a_i * b_i + c_i
- fused_scale_bias_relu - Fused scale + bias + ReLU: output_i = max(0, input_i * scale + bias)
- mul_scalar - Element-wise scalar multiply: output_i = input_i * scalar
- mul_scalar_alloc - Scalar multiply with output allocation. Avoids zero-fill overhead.
- relu - ReLU: output_i = max(0, input_i)
- relu_alloc - ReLU with output allocation. Avoids zero-fill overhead of vec![0.0; n].
- relu_inplace - In-place ReLU: data_i = max(0, data_i)
- scale_inplace - In-place scale: data_i *= scalar
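The fused kernels exist so that scale, bias, and ReLU touch memory once instead of three times. A minimal sketch of that pattern, assuming AVX2 + FMA with a portable scalar fallback (function names here are illustrative, not necessarily the crate's signatures):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX2+FMA path: out[i] = max(0, input[i] * scale + bias), using one
/// fmadd and one max per 8 lanes in a single pass over memory.
/// Hypothetical sketch; not the crate's actual code.
///
/// # Safety
/// Caller must have verified AVX2 and FMA support at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn fused_scale_bias_relu_avx2(input: &[f32], scale: f32, bias: f32, out: &mut [f32]) {
    let zero = _mm256_setzero_ps();
    let s = _mm256_set1_ps(scale);
    let b = _mm256_set1_ps(bias);
    let chunks = input.len() / 8;
    for i in 0..chunks {
        let v = _mm256_loadu_ps(input.as_ptr().add(i * 8));
        // _mm256_fmadd_ps computes v * s + b in one instruction.
        let r = _mm256_max_ps(_mm256_fmadd_ps(v, s, b), zero);
        _mm256_storeu_ps(out.as_mut_ptr().add(i * 8), r);
    }
    for i in chunks * 8..input.len() {
        out[i] = (input[i] * scale + bias).max(0.0); // scalar tail
    }
}

/// Safe wrapper with runtime dispatch and a portable scalar fallback.
pub fn fused_scale_bias_relu(input: &[f32], scale: f32, bias: f32, out: &mut [f32]) {
    assert_eq!(input.len(), out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            unsafe { fused_scale_bias_relu_avx2(input, scale, bias, out) };
            return;
        }
    }
    for (o, &v) in out.iter_mut().zip(input) {
        *o = (v * scale + bias).max(0.0);
    }
}

fn main() {
    let x = [1.0f32, -2.0, 0.5, 3.0, -1.0, 4.0, 0.0, -0.5, 2.0];
    let mut out = [0.0f32; 9];
    fused_scale_bias_relu(&x, 2.0, -1.0, &mut out);
    // Each element is max(0, 2x - 1).
    assert_eq!(out, [1.0, 0.0, 0.0, 5.0, 0.0, 7.0, 0.0, 0.0, 3.0]);
    println!("{:?}", out);
}
```

Note that FMA rounds once where the scalar fallback's `*` then `+` rounds twice, so the two paths can differ in the last ulp for some inputs.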