Module kernels

SIMD kernels for fused operations.

These are the production kernels extracted from burnembed’s ndarray_fused.rs. Each kernel processes data in place or writes into a pre-allocated output buffer (from the arena); the kernels themselves do not allocate.
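The two calling conventions described above can be illustrated with a minimal scalar sketch. The function names `silu_inplace_ref` and `add_bias_into`, their signatures, and the row-major [n, h] layout are assumptions made for illustration; they are not the module's actual API, and the real kernels are SIMD rather than scalar loops.

```rust
/// Hypothetical in-place kernel: overwrites each element of `x`
/// with SiLU(x) = x * sigmoid(x) = x / (1 + e^(-x)).
pub fn silu_inplace_ref(x: &mut [f32]) {
    for v in x.iter_mut() {
        *v = *v / (1.0 + (-*v).exp());
    }
}

/// Hypothetical out-of-place kernel: adds a per-column `bias` of length h
/// to a row-major [n, h] input and writes the result into `out`.
/// The caller owns `out` (for example, a slice handed out by an arena),
/// so the function itself performs no allocation.
pub fn add_bias_into(x: &[f32], bias: &[f32], out: &mut [f32]) {
    let h = bias.len();
    assert!(h > 0 && x.len() % h == 0, "x must have shape [n, h]");
    assert_eq!(x.len(), out.len(), "out must have the same length as x");
    for (x_row, out_row) in x.chunks_exact(h).zip(out.chunks_exact_mut(h)) {
        for ((o, &xv), &b) in out_row.iter_mut().zip(x_row).zip(bias) {
            *o = xv + b;
        }
    }
}
```

In this sketch the output buffer would be allocated once by the caller and reused across calls, which is one plausible reading of "from the arena".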

Functions

bias_gelu
conv_transpose2d_nchw: NCHW transposed convolution (PyTorch ConvTranspose2d, no bias). Weight layout [C_in, C_out/groups, kH, kW].
gelu_approx_inplace
gelu_inplace
group_norm_nchw: NCHW group normalization: normalizes each (C/G)×H×W group.
layer_norm2d_nchw: NCHW LayerNorm2d (candle / SAM semantics): normalize across channels at each spatial position. gamma/beta are per-channel [C].
layer_norm_row
neon_sgemm_bias_small
neon_sgemm_small
neon_softmax
par_bias_gelu: Parallel bias + GELU across thread pool.
par_gelu_approx_inplace
par_gelu_inplace
par_residual_bias_ln: Parallel residual + bias + LayerNorm.
par_silu_inplace: Parallel SiLU in-place. Same threshold reasoning as par_gelu_inplace.
residual_bias_layer_norm: Fused residual + bias + LayerNorm on [n, h] buffers. Computes: output[row] = LN(a[row] + b[row] + bias, gamma, beta)
residual_bias_rms_norm: Fused residual + bias + RMSNorm on [n, h] buffers. Computes: output[row] = RmsNorm(a[row] + b[row] + bias, gamma, beta)
resize_nearest_2x_nchw: Nearest-neighbor 2× upsample on planar NCHW.
scalar_gelu_approx: Tanh-approximation GELU (matches PyTorch/candle Tensor::gelu): y = 0.5 x (1 + tanh(√(2/π) · (x + 0.044715 x³)))
silu_inplace
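As a reading aid for the formulas in the list above, here is a scalar reference sketch of the tanh-approximation GELU and of the fused residual + bias + LayerNorm computation, output[row] = LN(a[row] + b[row] + bias, gamma, beta). The names `gelu_tanh_ref` and `residual_bias_layer_norm_ref`, the argument order, the explicit `eps` parameter, and the use of the biased variance (dividing by h) are all assumptions; the actual kernels are SIMD and may differ in signature and numerics.

```rust
use std::f32::consts::FRAC_2_PI;

/// Tanh-approximation GELU:
/// y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
pub fn gelu_tanh_ref(x: f32) -> f32 {
    let k = FRAC_2_PI.sqrt(); // sqrt(2/pi)
    0.5 * x * (1.0 + (k * (x + 0.044715 * x * x * x)).tanh())
}

/// Hypothetical scalar reference for fused residual + bias + LayerNorm
/// on row-major [n, h] buffers. `bias`, `gamma`, and `beta` have length h.
/// `eps` and the biased variance are assumptions of this sketch.
pub fn residual_bias_layer_norm_ref(
    a: &[f32],
    b: &[f32],
    bias: &[f32],
    gamma: &[f32],
    beta: &[f32],
    eps: f32,
    out: &mut [f32],
) {
    let h = bias.len();
    assert!(h > 0 && a.len() % h == 0, "a must have shape [n, h]");
    assert!(a.len() == b.len() && a.len() == out.len());
    assert!(gamma.len() == h && beta.len() == h);

    let rows = a
        .chunks_exact(h)
        .zip(b.chunks_exact(h))
        .zip(out.chunks_exact_mut(h));
    for ((a_row, b_row), out_row) in rows {
        // Pass 1: store the pre-norm sum in `out_row` and accumulate the mean.
        let mut mean = 0.0f32;
        for i in 0..h {
            let s = a_row[i] + b_row[i] + bias[i];
            out_row[i] = s;
            mean += s;
        }
        mean /= h as f32;

        // Pass 2: biased variance over the row.
        let var = out_row
            .iter()
            .map(|&s| (s - mean) * (s - mean))
            .sum::<f32>()
            / h as f32;
        let inv_std = 1.0 / (var + eps).sqrt();

        // Pass 3: normalize in place, then apply scale and shift.
        for i in 0..h {
            out_row[i] = (out_row[i] - mean) * inv_std * gamma[i] + beta[i];
        }
    }
}
```

Writing the intermediate sum into the output row is what lets the sketch run without a scratch buffer, in keeping with the module's no-allocation rule.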
