Expand description
SIMD kernels for fused operations.
These are the production kernels extracted from burnembed’s ndarray_fused.rs. Each kernel processes data in-place or into a pre-allocated output buffer (from the arena). No allocation.
Functions§
- bias_
gelu - conv_
transpose2d_ nchw - NCHW transposed convolution (PyTorch
ConvTranspose2d, no bias). Weight layout[C_in, C_out/groups, kH, kW]. - gelu_
approx_ inplace - gelu_
inplace - group_
norm_ nchw - NCHW group normalization: normalizes each
(C/G)×H×Wgroup. - layer_
norm2d_ nchw - NCHW LayerNorm2d (candle / SAM semantics): normalize across channels at
each spatial position.
gamma/betaare per-channel[C]. - layer_
norm_ row - neon_
sgemm_ bias_ small - neon_
sgemm_ small - neon_
softmax - par_
bias_ gelu - Parallel bias + GELU across thread pool.
- par_
gelu_ approx_ inplace - par_
gelu_ inplace - par_
residual_ bias_ ln - Parallel residual + bias + LayerNorm.
- par_
silu_ inplace - Parallel SiLU in-place. Same threshold reasoning as
par_gelu_inplace. - residual_
bias_ layer_ norm - Fused residual + bias + LayerNorm on [n, h] buffers. Computes: output[row] = LN(a[row] + b[row] + bias, gamma, beta)
- residual_
bias_ rms_ norm - Fused residual + bias + RMSNorm on [n, h] buffers. Computes: output[row] = RmsNorm(a[row] + b[row] + bias, gamma, beta)
- resize_
nearest_ 2x_ nchw - Nearest-neighbor 2× upsample on planar NCHW.
- scalar_
gelu_ approx - Tanh-approximation GELU (matches PyTorch/candle
Tensor::gelu): y = 0.5 x (1 + tanh(√(2/π) · (x + 0.044715 x³))) - silu_
inplace