Optimized compute kernels.
Provides SIMD-accelerated implementations of core operations. Uses ARM NEON intrinsics on aarch64 and falls back to scalar code elsewhere.
The primary bottleneck is matmul, which reduces to a matrix-vector multiply during single-token generation. The NEON version processes four f32 values per instruction using vfmaq_f32 (fused multiply-add).
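As an illustration of the approach described above, here is a minimal sketch of a matrix-vector multiply built on a NEON dot product with a scalar fallback. It is not this crate's actual implementation: the names `dot`, `dot_scalar`, and `matmul_vec_sketch`, the argument order, and the row-major `[rows, cols]` weight layout are all assumptions made for the example.

```rust
/// Scalar dot product, used as the portable fallback and for tail elements.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// NEON dot product: accumulates four f32 lanes per vfmaq_f32 call.
#[cfg(target_arch = "aarch64")]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::{vaddvq_f32, vdupq_n_f32, vfmaq_f32, vld1q_f32};
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    // SAFETY: each load reads 4 f32s starting at i * 4, and i * 4 + 4 <= len.
    let vector_sum = unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4));
            let vb = vld1q_f32(b.as_ptr().add(i * 4));
            acc = vfmaq_f32(acc, va, vb); // acc += va * vb, lane-wise
        }
        vaddvq_f32(acc) // horizontal sum of the four lanes
    };
    // Handle the 0..=3 leftover elements with scalar code.
    vector_sum + dot_scalar(&a[chunks * 4..], &b[chunks * 4..])
}

/// Scalar fallback for every other architecture.
#[cfg(not(target_arch = "aarch64"))]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    dot_scalar(a, b)
}

/// Hypothetical signature: `weight` is row-major [rows, cols] and
/// `input` has `cols` elements; `output` receives one value per row.
fn matmul_vec_sketch(output: &mut [f32], weight: &[f32], input: &[f32]) {
    let cols = input.len();
    assert!(cols > 0);
    assert_eq!(weight.len(), output.len() * cols);
    for (out, row) in output.iter_mut().zip(weight.chunks_exact(cols)) {
        *out = dot(row, input);
    }
}
```

Selecting the implementation with `cfg(target_arch)` keeps the call site identical on every platform. A production kernel would likely use several independent accumulators to hide FMA latency; that is omitted here for brevity.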
Functions
- attention - Optimized grouped-query attention using NEON dot products.
- elementwise_add - NEON-accelerated elementwise add
- elementwise_mul - NEON-accelerated elementwise multiply
- gelu - GeLU activation (approximate): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
- matmul - General matrix multiply: output[m,n] = input[m,k] * weight^T[k,n]
- matmul_vec - Optimized matrix-vector multiply using NEON dot products.
- rms_norm - NEON-accelerated RMSNorm: output = (input / rms(input)) * weight
- silu - Optimized SiLU: x * sigmoid(x) = x / (1 + exp(-x))
- softmax - Softmax with numerical stability
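The formulas listed for gelu, silu, rms_norm, and softmax can be written as plain scalar Rust, which may be useful as a reference when checking an optimized kernel. The sketch below is not the crate's code: the `_ref` names and signatures are hypothetical, the `eps` parameter in `rms_norm_ref` is an added assumption (the listed formula has no epsilon), and subtracting the maximum is assumed to be what "numerical stability" refers to for softmax.

```rust
/// Approximate GeLU, following the tanh formula in the function list.
fn gelu_ref(x: f32) -> f32 {
    let c = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// SiLU: x * sigmoid(x), written as x / (1 + exp(-x)).
fn silu_ref(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// RMSNorm: output = (input / rms(input)) * weight.
/// `eps` is an assumption added to avoid dividing by zero.
fn rms_norm_ref(output: &mut [f32], input: &[f32], weight: &[f32], eps: f32) {
    assert_eq!(input.len(), output.len());
    assert_eq!(input.len(), weight.len());
    assert!(!input.is_empty());
    let mean_sq = input.iter().map(|x| x * x).sum::<f32>() / input.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for ((o, x), w) in output.iter_mut().zip(input).zip(weight) {
        *o = x * inv_rms * w;
    }
}

/// In-place softmax. Subtracting the maximum before exponentiating
/// keeps exp() from overflowing on large inputs.
fn softmax_ref(values: &mut [f32]) {
    if values.is_empty() {
        return;
    }
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0_f32;
    for v in values.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in values.iter_mut() {
        *v /= sum;
    }
}
```

Without the max subtraction, an input such as `[1000.0, 1000.0]` would overflow `exp` to infinity and yield NaN after division; with it, both entries come out as 0.5.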