Module kernels

Optimized compute kernels.

Provides SIMD-accelerated implementations of core operations. Uses ARM NEON intrinsics on aarch64 and falls back to scalar code elsewhere.

The primary bottleneck is matmul (matrix-vector multiply for single-token generation). The NEON version processes four f32 lanes per instruction using the vfmaq_f32 fused multiply-add intrinsic; actual per-cycle throughput depends on the core.
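As an illustration of the vfmaq_f32 pattern described above, here is a minimal sketch of a NEON dot product with a scalar fallback, used to build a matrix-vector multiply. The names `dot` and `matmul_vec_sketch`, the row-major `[n, k]` weight layout, and the signatures are assumptions made for this example; they are not taken from this module's actual API.

```rust
/// Hypothetical helper (not this crate's API): dot product of two f32 slices.
/// aarch64 path: accumulates four lanes at a time with `vfmaq_f32`.
#[cfg(target_arch = "aarch64")]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::{vaddvq_f32, vdupq_n_f32, vfmaq_f32, vld1q_f32};
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    // SAFETY: every load reads 4 floats starting at index i * 4, and
    // i * 4 + 3 < chunks * 4 <= a.len() == b.len().
    let mut sum = unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4));
            let vb = vld1q_f32(b.as_ptr().add(i * 4));
            acc = vfmaq_f32(acc, va, vb); // acc += va * vb, lane-wise
        }
        vaddvq_f32(acc) // horizontal sum of the four lanes
    };
    // Scalar tail for lengths that are not a multiple of 4.
    for i in chunks * 4..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

/// Scalar fallback for every other architecture.
#[cfg(not(target_arch = "aarch64"))]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hypothetical matrix-vector multiply: `weight` is assumed row-major [n, k],
/// `input` has length k, and the result has length n.
fn matmul_vec_sketch(weight: &[f32], input: &[f32], n: usize) -> Vec<f32> {
    let k = input.len();
    assert_eq!(weight.len(), n * k);
    (0..n)
        .map(|row| dot(&weight[row * k..(row + 1) * k], input))
        .collect()
}
```

Each output element is one dot product over a contiguous weight row, which is why the row-major layout is assumed here: it keeps the vector loads sequential in memory. A production kernel would likely use several independent accumulators to hide FMA latency; this sketch uses one for clarity.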

Functions

attention: Optimized grouped-query attention using NEON dot products.
elementwise_add: NEON-accelerated elementwise add
elementwise_mul: NEON-accelerated elementwise multiply
gelu: GeLU activation (approximate): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
matmul: General matrix multiply: output[m,n] = input[m,k] * weight^T[k,n]
matmul_vec: Optimized matrix-vector multiply using NEON dot products.
rms_norm: NEON-accelerated RMSNorm: output = (input / rms(input)) * weight
silu: Optimized SiLU: x * sigmoid(x) = x / (1 + exp(-x))
softmax: Softmax with numerical stability
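The formulas quoted in the function summaries can be written out as plain scalar reference code, which is useful for checking an optimized kernel against. This is only a sketch: the `_ref` names, the signatures, the row-major `[n, k]` weight layout for matmul, and the `eps` parameter in RMSNorm are assumptions for illustration and are not confirmed by this module's documentation.

```rust
/// GeLU (tanh approximation): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_ref(x: f32) -> f32 {
    let c = (2.0f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// SiLU: x * sigmoid(x) = x / (1 + exp(-x))
fn silu_ref(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// In-place softmax. Subtracting the maximum before exponentiating keeps
/// exp() from overflowing, which is the usual meaning of "numerical stability".
fn softmax_ref(x: &mut [f32]) {
    if x.is_empty() {
        return;
    }
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}

/// RMSNorm: output = (input / rms(input)) * weight, with
/// rms(input) = sqrt(mean(input^2) + eps). `eps` is an assumed parameter.
fn rms_norm_ref(input: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(input.len(), weight.len());
    let mean_sq = input.iter().map(|v| v * v).sum::<f32>() / input.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    input.iter().zip(weight).map(|(x, w)| x * inv_rms * w).collect()
}

/// output[m, n] = input[m, k] * weight^T[k, n], assuming `weight` is stored
/// row-major as [n, k] so that each output element is a row-by-row dot product.
fn matmul_ref(input: &[f32], weight: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(input.len(), m * k);
    assert_eq!(weight.len(), n * k);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += input[i * k + p] * weight[j * k + p];
            }
            out[i * n + j] = acc;
        }
    }
    out
}
```

A typical use is to run both the reference and the optimized function on the same random input and compare results within a small tolerance, since fused multiply-add and a different summation order can change the low bits of an f32 result.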