Module kernels

Optimized compute kernels.

Provides SIMD-accelerated implementations of core operations. Uses ARM NEON intrinsics on aarch64 and falls back to scalar code elsewhere.

The primary bottleneck is matmul, specifically the matrix-vector multiply used for single-token generation. The NEON version processes 4 f32 lanes per instruction using the fused multiply-add vfmaq_f32.
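As an illustration of the approach described above, here is a minimal sketch of a matrix-vector multiply built on a vfmaq_f32 dot product, with a scalar fallback selected at compile time. The names dot and matvec_sketch, the argument order, and the row-major [rows, cols] weight layout are assumptions made for this example; they are not the module's actual signatures.

```rust
/// Dot product of two equal-length slices.
/// Hypothetical helper for illustration, not part of the module's API.
#[cfg(target_arch = "aarch64")]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::{vaddvq_f32, vdupq_n_f32, vfmaq_f32, vld1q_f32};
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    // SAFETY: NEON is part of the baseline for standard aarch64 targets, and
    // each load reads 4 f32s that lie within the bounds checked above.
    let mut sum = unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4));
            let vb = vld1q_f32(b.as_ptr().add(i * 4));
            // acc += va * vb, on 4 lanes at once.
            acc = vfmaq_f32(acc, va, vb);
        }
        // Horizontal add of the 4 accumulator lanes.
        vaddvq_f32(acc)
    };
    // Scalar tail for lengths that are not a multiple of 4.
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

/// Scalar fallback used on every other architecture.
#[cfg(not(target_arch = "aarch64"))]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// out[r] = dot(row r of weight, input).
/// Assumes weight is stored row-major as [rows, cols] with cols == input.len().
pub fn matvec_sketch(out: &mut [f32], weight: &[f32], input: &[f32]) {
    let cols = input.len();
    assert!(cols > 0);
    assert_eq!(weight.len(), out.len() * cols);
    for (o, row) in out.iter_mut().zip(weight.chunks_exact(cols)) {
        *o = dot(row, input);
    }
}
```

A production kernel would likely keep several independent accumulators to hide FMA latency; this sketch uses one for clarity.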

Functions

attention
Optimized grouped-query attention using NEON dot products.
elementwise_add
NEON-accelerated elementwise add
elementwise_mul
NEON-accelerated elementwise multiply
gelu
GeLU activation (approximate): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
matmul
General matrix multiply: output[m,n] = input[m,k] * weight^T[k,n]
matmul_vec
Optimized matrix-vector multiply using NEON dot products.
rms_norm
NEON-accelerated RMSNorm: output = (input / rms(input)) * weight
silu
Optimized SiLU: x * sigmoid(x) = x / (1 + exp(-x))
softmax
Softmax with numerical stability
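The formulas listed above for gelu, silu, rms_norm and softmax can be sketched as scalar reference code, which is handy for checking an optimized kernel against. The function names, signatures, and the eps parameter below are assumptions for illustration (the formula above has no eps; pass 0.0 to match it exactly), and subtracting the maximum in softmax is the usual stability technique, assumed here rather than confirmed from the source.

```rust
/// GeLU, tanh approximation:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
pub fn gelu_ref(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// SiLU: x * sigmoid(x) = x / (1 + exp(-x))
pub fn silu_ref(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// RMSNorm: out = (input / rms(input)) * weight.
/// `eps` is an assumed stabilizer added under the square root.
pub fn rms_norm_ref(out: &mut [f32], input: &[f32], weight: &[f32], eps: f32) {
    assert!(!input.is_empty());
    assert_eq!(input.len(), weight.len());
    assert_eq!(input.len(), out.len());
    let mean_sq = input.iter().map(|v| v * v).sum::<f32>() / input.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for ((o, x), w) in out.iter_mut().zip(input).zip(weight) {
        *o = x * inv_rms * w;
    }
}

/// In-place softmax. Subtracting the maximum before exponentiating keeps
/// exp() from overflowing on large inputs without changing the result.
pub fn softmax_ref(x: &mut [f32]) {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}
```

Comparing an accelerated kernel's output to these references within a small tolerance is one simple way to catch lane-ordering or tail-handling mistakes.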