GPU kernel host-side dispatch functions.
Each submodule implements dispatch for a specific kernel family.
Modules
- adam_update - Adam optimizer step kernel + Rust dispatch.
- add_bias_row_2d - ADR-021 helper: `out[m, n] = a[m, n] + bias[n]` for a `[M, N]` row-major f32 matrix and a `[N]` f32 bias vector.
- argmax - Greedy argmax GPU dispatch — finds the index of the maximum value in a float array entirely on the GPU.
- argsort - GPU-accelerated argsort (descending) for MoE top-K routing.
- bilinear_resize_2d - ADR-021 K2: GPU antialiased bilinear resize for the Qwen3-VL ViT position-embedding table.
- block_merge_2x2 - ADR-021 K4: GPU 2×2 block-merge reshape for the Qwen3-VL ViT prelude.
- chunk_gated_delta_rule - Wave 5b.1 iter 4 — `chunk_gated_delta_rule_fwd` orchestrator.
- chunk_gated_delta_rule_tri_solve_invert - Wave 5b.1 iter 4 — per-chunk-block tri-solve invert wrapper.
- compute_g_beta - Fused GPU kernel for DeltaNet g and beta computation.
- conv1d_depthwise_causal - ADR-020 iter-11h-b — depthwise causal 1D convolution forward + backward kernels for the GpuTape autograd pipeline.
- copy - GPU-accelerated strided copy for making tensors contiguous.
- cumsum - Cumulative sum (inclusive prefix sum) along the last axis.
- dense_gemm - Dense F16 matrix multiply for the lm_head vocabulary projection.
- dense_gemv_bf16 - Dense bf16 × f32 → f32 GEMV (matrix-vector multiply) for M == 1 decode.
- dense_mm_bf16 - Dense bf16 × f32 → f32 matmul using Apple M3+ tensor cores (`mpp::tensor_ops::matmul2d`).
- dense_mm_f16 - Dense f16 × f32 → f32 matmul using Apple M3+ tensor cores (`mpp::tensor_ops::matmul2d`).
- dense_mm_f32_f32 - Dense f32 × f32 → f32 matmul using Apple M3+ tensor cores (`mpp::tensor_ops::matmul2d`).
- dequant_to_f16
- divide_elementwise - ADR-020 iter-11h-misc-1 — elementwise division forward + backward.
- elementwise - GPU-accelerated elementwise operations: add, multiply, and dtype cast.
- embedding - GPU-accelerated quantized embedding table lookup.
- embedding_autograd - FP32 embedding-table lookup with reverse-mode autograd backward.
- encode_helpers - Helper utilities for encoding compute dispatches with inline constant parameters (bytes) alongside buffer bindings.
- exp_elementwise - ADR-020 iter-11h-c1 — elementwise exp forward + backward.
- feature_concat - ADR-021 K5: GPU feature-axis concat (single-chunk strided copy).
- flash_attn_prefill - Flash-attention-style tiled prefill kernel — host dispatch.
- flash_attn_prefill_blk - Flash-attention tile-skip pre-pass — host dispatch.
- flash_attn_prefill_d512 - Flash-attention-style tiled prefill kernel — NSG=8 D=512 host dispatch.
- flash_attn_prefill_mask - SWA / causal attention-mask builder for the flash_attn_prefill kernels.
- flash_attn_train - Flash-attention training forward kernel — host dispatch.
- flash_attn_vec - Flash attention vector kernel dispatch — SIMD-vectorized decode-path SDPA.
- flash_attn_vec_hybrid - Flash attention vector kernel dispatch for hybrid F16-K + TQ-HB-V KV cache.
- flash_attn_vec_peer_port_f16 - Flash attention vector kernel — verbatim peer port (f16-K + f16-V, DK=DV=256).
- flash_attn_vec_reduce_tq_hb_undo - ADR-028 §iter-485 H3 — fused FA-vec-TQ-HB reduce + FWHT-sign-undo.
- flash_attn_vec_tq - Flash attention vector kernel dispatch for TurboQuant-compressed KV cache.
- flash_attn_vec_tq_hb - Flash attention vector kernel dispatch for higher-bit TurboQuant KV cache.
- fused_head_norm_rope - Fused per-head RMS normalization + NeoX RoPE GPU dispatch (bf16).
- fused_norm_add - Fused RMS normalization + residual addition GPU dispatch (bf16).
- fused_residual_norm - Fused residual addition + RMS normalization GPU dispatch (bf16).
- fwht_standalone - Standalone Fast Walsh-Hadamard Transform dispatch (SIMD shuffle, zero barriers).
- gated_delta_net - Gated DeltaNet fused GPU dispatch — the centerpiece of Qwen3.5 linear-attention layers.
- gated_delta_net_chunk - Wave 5b — chunk-parallel Gated DeltaNet inter-chunk state-recurrence kernel.
- gated_delta_net_chunk_o - Wave 5b.1 iter 3 — `chunk_fwd_o` Metal kernel host dispatch.
- gated_delta_net_decode - Decode-only fused Gated DeltaNet kernel — `simd_sum`-based variant.
- gated_delta_net_kkt - Wave 5b.1 iter 2 — `chunk_scaled_dot_kkt` kernel host dispatch.
- gated_delta_net_recompute_wu - Wave 5b.1 iter 2 — `recompute_w_u_fwd` Metal kernel host dispatch.
- gather - GPU-accelerated gather / index_select along dim=0.
- gather_bench - Gather throughput microbenchmark dispatch.
- gelu - GELU activation (pytorch_tanh variant) GPU dispatch.
- hadamard - Fast Walsh-Hadamard Transform (FWHT) GPU kernel dispatch.
- hadamard_quantize_kv - Hadamard-quantize KV cache kernel dispatch (ADR-007 Phase 1.1).
- im2col_2d_3ch - ADR-021 K1: GPU im2col for the Qwen3-VL ViT dual-stem patch embed.
- kv_cache_copy - KV cache GPU copy dispatch.
- l2_norm - L2 normalization GPU dispatch.
- log_elementwise - Elementwise natural log forward + backward.
- moe_dispatch - GPU-accelerated MoE expert dispatch (Stage 1: loop over selected experts).
- moe_gate - GPU-accelerated MoE gating: parallel top-K expert selection with softmax routing.
- moe_softmax_topk - GPU fused softmax + top-K + renorm for MoE routing.
- moe_weighted_reduce - GPU MoE weighted accumulate + shared expert add + optional residual.
- mul_mv_ext - ADR-022 Phase 1 P1.7 — `mul_mv_ext` r1 family for Q5_1 + IQ4_NL.
- outer_product - ADR-020 iter-11h-c2 — vector outer product forward + backward.
- qdq_affine - Differentiable affine quantize-dequantize primitives — ADR-020 iter-13b Track 2 DWQ-proper training loop.
- qdq_legacy - GGUF-legacy quantize-dequantize round-trip primitives (Q4_0, Q8_0).
- qkv_split - GPU-accelerated split of a fused QKV tensor into separate Q/K/V outputs.
- qmm_affine - Fused affine quantized matmul (`Y = X @ dequant(W)^T`) — ADR-020 iter-15 DWQ inference primitive.
- quantized_matmul - Quantized matrix multiplication host-side dispatch.
- quantized_matmul_ggml - GGML block-format quantized matrix-vector multiply dispatch.
- quantized_matmul_id - Expert-routed (MoE) quantized matrix-vector multiply dispatch.
- quantized_matmul_id_ggml - GGML block-format expert-routed (MoE) quantized matrix-vector multiply dispatch.
- repeat_tiled - GPU-accelerated tiled-GQA broadcast: `[T, Hg, K]` → `[T, H, K]` F32.
- rms_norm - RMS normalization GPU dispatch.
- rms_norm_backward - Backward pass for RMS normalization (`rms_norm_f32` forward).
- rope - Rotary Position Embedding (RoPE) GPU dispatch.
- rope_multi - Multi-section Rotary Position Embedding with optional interleaved mode.
- rope_train - Differentiable Rotary Position Embedding — forward + backward.
- row_sum - Per-row sum reduction along the last dimension of a 2-D tensor + its broadcast-along-cols backward.
- scale_mask_softmax - Fused scale-mask-softmax for non-flash-attention prefill.
- sdpa - Scaled dot-product attention (SDPA) host dispatch.
- sdpa_decode - GPU SDPA decode kernel — F32 Q/K/V, multi-simdgroup tiled, single-token decode.
- sdpa_sliding - Sliding-window scaled dot-product attention host dispatch.
- sigmoid_mul - Elementwise sigmoid-gated multiply: `out[i] = x[i] * sigmoid(gate[i])`.
- silu_backward - Elementwise SiLU (swish) forward + reverse-mode backward.
- silu_mul - Fused SiLU-gated multiply: `out[i] = gate[i] * sigmoid(gate[i]) * up[i]`.
- slice_concat_2d - 2-D row-major slice + concat-by-column primitives.
- softcap - Softcap (tanh-based logit capping) GPU dispatch.
- softmax - Numerically stable softmax GPU dispatch.
- softmax_backward - Backward pass for row-wise softmax.
- softmax_sample - Temperature-scaled softmax + categorical sample, entirely on GPU.
- sqrt_elementwise - ADR-020 iter-11h-misc-3 — elementwise sqrt forward + backward.
- ssm_conv - SSM depthwise causal 1D conv + SiLU GPU dispatch.
- ssm_norm_gate - Fused per-head RMSNorm + SiLU gate kernel for DeltaNet op 8.
- take_along_axis - ADR-020 iter-11h-e1 — take_along_axis (gather) + scatter-backward for the GpuTape autograd pipeline. Forward gathers values along the last axis using a precomputed (non-differentiable) index buffer; backward scatters gradients back into a zero-initialised dx buffer.
- top_k - GPU top-K dispatch — returns the K largest elements of a float array.
- tq_dequantize_kv - TQ KV dequantize kernel dispatch — iter-20 Leg F ablation.
- transpose - GPU-accelerated 2D matrix transpose.
- tri_solve - Lower-triangular unit-diagonal solve: `X = L \ B`.
- vision_2d_rope - 2-D NeoX RoPE for ViT vision towers (Gemma 4 Vision).