SIMD-accelerated element-wise operations.
AVX2 implementations of ReLU, vector add, and scalar multiply. At large sizes these kernels are memory-bandwidth-bound; at small-to-medium sizes SIMD helps by reducing instruction count and enabling wider loads and stores.
§Algorithm
ReLU: _mm256_max_ps(x, zero) — single instruction per 8 elements
Add: _mm256_add_ps(a, b) — single instruction per 8 elements
Mul scalar: _mm256_mul_ps(x, scalar_vec) — single instruction per 8 elements
Contract: provable-contracts/contracts/activation-kernel-v1.yaml
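As a concrete illustration of the `_mm256_max_ps` approach above, here is a minimal sketch of an AVX2 ReLU kernel with runtime feature detection, a scalar tail for the final `len % 8` elements, and a portable fallback. The helper names (`relu_avx2`, the safe `relu` wrapper) are hypothetical and not necessarily this crate's actual implementation.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX2 path: one `_mm256_max_ps` per 8 lanes, plus a scalar tail.
/// Hypothetical sketch; not the crate's actual code.
///
/// # Safety
/// Caller must have verified AVX2 support at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn relu_avx2(x: &[f32], out: &mut [f32]) {
    let zero = _mm256_setzero_ps();
    let chunks = x.len() / 8;
    for i in 0..chunks {
        // Unaligned 8-lane load, a single max against zero, unaligned store.
        let v = _mm256_loadu_ps(x.as_ptr().add(i * 8));
        _mm256_storeu_ps(out.as_mut_ptr().add(i * 8), _mm256_max_ps(v, zero));
    }
    for i in chunks * 8..x.len() {
        out[i] = x[i].max(0.0); // scalar tail for len % 8 elements
    }
}

/// Safe wrapper: dispatches to AVX2 when available, else a scalar loop.
pub fn relu(x: &[f32], out: &mut [f32]) {
    assert_eq!(x.len(), out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            unsafe { relu_avx2(x, out) };
            return;
        }
    }
    for (o, &v) in out.iter_mut().zip(x) {
        *o = v.max(0.0);
    }
}

fn main() {
    // 9 elements: exercises one full 8-lane chunk plus the scalar tail.
    let x = [-1.0f32, 2.0, -3.0, 4.0, -5.0, 6.0, -7.0, 8.0, -9.0];
    let mut out = [0.0f32; 9];
    relu(&x, &mut out);
    assert_eq!(out, [0.0, 2.0, 0.0, 4.0, 0.0, 6.0, 0.0, 8.0, 0.0]);
    println!("{:?}", out);
}
```

The unaligned load/store intrinsics keep the sketch simple; an aligned variant would add a peel loop before the main body.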
Functions§
- add - Element-wise add: output_i = a_i + b_i
- add_alloc - Element-wise add with output allocation. Avoids zero-fill overhead.
- add_inplace - In-place add: a_i += b_i
- fused_add_relu - Fused add + ReLU: output_i = max(0, a_i + b_i)
- fused_add_relu_inplace - In-place fused add + ReLU: a_i = max(0, a_i + b_i)
- fused_mul_add - Fused multiply-add: output_i = a_i * b_i + c_i
- fused_scale_bias_relu - Fused scale + bias + ReLU: output_i = max(0, input_i * scale + bias)
- mul_scalar - Element-wise scalar multiply: output_i = input_i * scalar
- mul_scalar_alloc - Scalar multiply with output allocation. Avoids zero-fill overhead.
- relu - ReLU: output_i = max(0, input_i)
- relu_alloc - ReLU with output allocation. Avoids zero-fill overhead of vec![0.0; n].
- relu_inplace - In-place ReLU: data_i = max(0, data_i)
- scale_inplace - In-place scale: data_i *= scalar
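The fused kernels exist so that scale, bias, and ReLU touch memory once instead of three times. A minimal sketch of that pattern, assuming AVX2 + FMA with a portable scalar fallback (function names here are illustrative, not necessarily the crate's signatures):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX2+FMA path: out[i] = max(0, input[i] * scale + bias), using one
/// fmadd and one max per 8 lanes in a single pass over memory.
/// Hypothetical sketch; not the crate's actual code.
///
/// # Safety
/// Caller must have verified AVX2 and FMA support at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn fused_scale_bias_relu_avx2(input: &[f32], scale: f32, bias: f32, out: &mut [f32]) {
    let zero = _mm256_setzero_ps();
    let s = _mm256_set1_ps(scale);
    let b = _mm256_set1_ps(bias);
    let chunks = input.len() / 8;
    for i in 0..chunks {
        let v = _mm256_loadu_ps(input.as_ptr().add(i * 8));
        // _mm256_fmadd_ps computes v * s + b in one instruction.
        let r = _mm256_max_ps(_mm256_fmadd_ps(v, s, b), zero);
        _mm256_storeu_ps(out.as_mut_ptr().add(i * 8), r);
    }
    for i in chunks * 8..input.len() {
        out[i] = (input[i] * scale + bias).max(0.0); // scalar tail
    }
}

/// Safe wrapper with runtime dispatch and a portable scalar fallback.
pub fn fused_scale_bias_relu(input: &[f32], scale: f32, bias: f32, out: &mut [f32]) {
    assert_eq!(input.len(), out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            unsafe { fused_scale_bias_relu_avx2(input, scale, bias, out) };
            return;
        }
    }
    for (o, &v) in out.iter_mut().zip(input) {
        *o = (v * scale + bias).max(0.0);
    }
}

fn main() {
    let x = [1.0f32, -2.0, 0.5, 3.0, -1.0, 4.0, 0.0, -0.5, 2.0];
    let mut out = [0.0f32; 9];
    fused_scale_bias_relu(&x, 2.0, -1.0, &mut out);
    // Each element is max(0, 2x - 1).
    assert_eq!(out, [1.0, 0.0, 0.0, 5.0, 0.0, 7.0, 0.0, 0.0, 3.0]);
    println!("{:?}", out);
}
```

Note that FMA rounds once where the scalar fallback's `*` then `+` rounds twice, so the two paths can differ in the last ulp for some inputs.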