GPU kernel host-side dispatch functions.
Each submodule implements dispatch for a specific kernel family.
Modules
- adam_update - Adam optimizer step kernel + Rust dispatch.
- add_bias_row_2d - ADR-021 helper: `out[m, n] = a[m, n] + bias[n]` for a `[M, N]` row-major f32 matrix and a `[N]` f32 bias vector.
- argmax - Greedy argmax GPU dispatch — finds the index of the maximum value in a float array entirely on the GPU.
- argsort - GPU-accelerated argsort (descending) for MoE top-K routing.
- bilinear_resize_2d - ADR-021 K2: GPU antialiased bilinear resize for the Qwen3-VL ViT position-embedding table.
- block_merge_2x2 - ADR-021 K4: GPU 2×2 block-merge reshape for the Qwen3-VL ViT prelude.
- chunk_gated_delta_rule - Wave 5b.1 iter 4 — `chunk_gated_delta_rule_fwd` orchestrator.
- chunk_gated_delta_rule_tri_solve_invert - Wave 5b.1 iter 4 — per-chunk-block tri-solve invert wrapper.
- compute_g_beta - Fused GPU kernel for DeltaNet g and beta computation.
- conv1d_depthwise_causal - ADR-020 iter-11h-b — depthwise causal 1D convolution forward + backward kernels for the GpuTape autograd pipeline.
- copy - GPU-accelerated strided copy for making tensors contiguous.
- cumsum - Cumulative sum (inclusive prefix sum) along the last axis.
- dense_gemm - Dense F16 matrix multiply for the lm_head vocabulary projection.
- dense_gemv_bf16 - Dense bf16 × f32 → f32 GEMV (matrix-vector multiply) for M == 1 decode.
- dense_mm_bf16 - Dense bf16 × f32 → f32 matmul using Apple M3+ tensor cores (`mpp::tensor_ops::matmul2d`).
- dense_mm_f16 - Dense f16 × f32 → f32 matmul using Apple M3+ tensor cores (`mpp::tensor_ops::matmul2d`).
- dense_mm_f32_f32 - Dense f32 × f32 → f32 matmul using Apple M3+ tensor cores (`mpp::tensor_ops::matmul2d`).
- dequant_to_f16
- divide_elementwise - ADR-020 iter-11h-misc-1 — elementwise division forward + backward.
- elementwise - GPU-accelerated elementwise operations: add, multiply, and dtype cast.
- embedding - GPU-accelerated quantized embedding table lookup.
- embedding_autograd - FP32 embedding-table lookup with reverse-mode autograd backward.
- encode_helpers - Helper utilities for encoding compute dispatches with inline constant parameters (bytes) alongside buffer bindings.
- exp_elementwise - ADR-020 iter-11h-c1 — elementwise exp forward + backward.
- feature_concat - ADR-021 K5: GPU feature-axis concat (single-chunk strided copy).
- flash_attn_prefill - Flash-attention-style tiled prefill kernel — host dispatch.
- flash_attn_prefill_blk - Flash-attention tile-skip pre-pass — host dispatch.
- flash_attn_prefill_d512 - Flash-attention-style tiled prefill kernel — NSG=8 D=512 host dispatch.
- flash_attn_prefill_mask - SWA / causal attention-mask builder for the flash_attn_prefill kernels.
- flash_attn_train - Flash-attention training forward kernel — host dispatch.
- flash_attn_vec - Flash attention vector kernel dispatch — SIMD-vectorized decode-path SDPA.
- flash_attn_vec_hybrid - Flash attention vector kernel dispatch for hybrid F16-K + TQ-HB-V KV cache.
- flash_attn_vec_peer_port_f16 - Flash attention vector kernel — verbatim peer port (f16-K + f16-V, DK=DV=256).
- flash_attn_vec_reduce_tq_hb_undo - ADR-028 §iter-485 H3 — fused FA-vec-TQ-HB reduce + FWHT-sign-undo.
- flash_attn_vec_tq - Flash attention vector kernel dispatch for TurboQuant-compressed KV cache.
- flash_attn_vec_tq_hb - Flash attention vector kernel dispatch for higher-bit TurboQuant KV cache.
- fused_head_norm_rope - Fused per-head RMS normalization + NeoX RoPE GPU dispatch (bf16).
- fused_norm_add - Fused RMS normalization + residual addition GPU dispatch (bf16).
- fused_residual_norm - Fused residual addition + RMS normalization GPU dispatch (bf16).
- fwht_standalone - Standalone Fast Walsh-Hadamard Transform dispatch (SIMD shuffle, zero barriers).
- gated_delta_net - Gated DeltaNet fused GPU dispatch — the centerpiece of Qwen3.5 linear-attention layers.
- gated_delta_net_chunk - Wave 5b — chunk-parallel Gated DeltaNet inter-chunk state-recurrence kernel.
- gated_delta_net_chunk_o - Wave 5b.1 iter 3 — `chunk_fwd_o` Metal kernel host dispatch.
- gated_delta_net_decode - Decode-only fused Gated DeltaNet kernel — `simd_sum`-based variant.
- gated_delta_net_kkt - Wave 5b.1 iter 2 — `chunk_scaled_dot_kkt` kernel host dispatch.
- gated_delta_net_recompute_wu - Wave 5b.1 iter 2 — `recompute_w_u_fwd` Metal kernel host dispatch.
- gather - GPU-accelerated gather / index_select along dim=0.
- gather_bench - Gather throughput microbenchmark dispatch.
- gelu - GELU activation (pytorch_tanh variant) GPU dispatch.
- hadamard - Fast Walsh-Hadamard Transform (FWHT) GPU kernel dispatch.
- hadamard_quantize_kv - Hadamard-quantize KV cache kernel dispatch (ADR-007 Phase 1.1).
- im2col_2d_3ch - ADR-021 K1: GPU im2col for the Qwen3-VL ViT dual-stem patch embed.
- kv_cache_copy - KV cache GPU copy dispatch.
- l2_norm - L2 normalization GPU dispatch.
- log_elementwise - Elementwise natural log forward + backward.
- moe_dispatch - GPU-accelerated MoE expert dispatch (Stage 1: loop over selected experts).
- moe_gate - GPU-accelerated MoE gating: parallel top-K expert selection with softmax routing.
- moe_softmax_topk - GPU fused softmax + top-K + renorm for MoE routing.
- moe_weighted_reduce - GPU MoE weighted accumulate + shared expert add + optional residual.
- mul_mv_ext - ADR-022 Phase 1 P1.7 — `mul_mv_ext` r1 family for Q5_1 + IQ4_NL.
- outer_product - ADR-020 iter-11h-c2 — vector outer product forward + backward.
- qdq_affine - Differentiable affine quantize-dequantize primitives — ADR-020 iter-13b Track 2 DWQ-proper training loop.
- qdq_legacy - GGUF-legacy quantize-dequantize round-trip primitives (Q4_0, Q8_0).
- qkv_split - GPU-accelerated split of a fused QKV tensor into separate Q/K/V outputs.
- qmm_affine - Fused affine quantized matmul (`Y = X @ dequant(W)^T`) — ADR-020 iter-15 DWQ inference primitive.
- quantized_matmul - Quantized matrix multiplication host-side dispatch.
- quantized_matmul_ggml - GGML block-format quantized matrix-vector multiply dispatch.
- quantized_matmul_id - Expert-routed (MoE) quantized matrix-vector multiply dispatch.
- quantized_matmul_id_ggml - GGML block-format expert-routed (MoE) quantized matrix-vector multiply dispatch.
- repeat_tiled - GPU-accelerated tiled-GQA broadcast: `[T, Hg, K]` → `[T, H, K]` F32.
- rms_norm - RMS normalization GPU dispatch.
- rms_norm_backward - Backward pass for RMS normalization (`rms_norm_f32` forward).
- rope - Rotary Position Embedding (RoPE) GPU dispatch.
- rope_multi - Multi-section Rotary Position Embedding with optional interleaved mode.
- rope_train - Differentiable Rotary Position Embedding — forward + backward.
- row_sum - Per-row sum reduction along the last dimension of a 2-D tensor + its broadcast-along-cols backward.
- scale_mask_softmax - Fused scale-mask-softmax for non-flash-attention prefill.
- sdpa - Scaled dot-product attention (SDPA) host dispatch.
- sdpa_decode - GPU SDPA decode kernel — F32 Q/K/V, multi-simdgroup tiled, single-token decode.
- sdpa_sliding - Sliding-window scaled dot-product attention host dispatch.
- sigmoid_mul - Elementwise sigmoid-gated multiply: `out[i] = x[i] * sigmoid(gate[i])`.
- silu_backward - Elementwise SiLU (swish) forward + reverse-mode backward.
- silu_mul - Fused SiLU-gated multiply: `out[i] = gate[i] * sigmoid(gate[i]) * up[i]`.
- slice_concat_2d - 2-D row-major slice + concat-by-column primitives.
- softcap - Softcap (tanh-based logit capping) GPU dispatch.
- softmax - Numerically stable softmax GPU dispatch.
- softmax_backward - Backward pass for row-wise softmax.
- softmax_sample - Temperature-scaled softmax + categorical sample, entirely on GPU.
- sqrt_elementwise - ADR-020 iter-11h-misc-3 — elementwise sqrt forward + backward.
- ssm_conv - SSM depthwise causal 1D conv + SiLU GPU dispatch.
- ssm_norm_gate - Fused per-head RMSNorm + SiLU gate kernel for DeltaNet op 8.
- take_along_axis - ADR-020 iter-11h-e1 — take_along_axis (gather) + scatter-backward for the GpuTape autograd pipeline. Forward gathers values along the last axis using a precomputed (non-differentiable) index buffer; backward scatters gradients back into a zero-initialised dx buffer.
- top_k - GPU top-K dispatch — returns the K largest elements of a float array.
- tq_dequantize_kv - TQ KV dequantize kernel dispatch — iter-20 Leg F ablation.
- transpose - GPU-accelerated 2D matrix transpose.
- tri_solve - Lower-triangular unit-diagonal solve: `X = L \ B`.
- vision_2d_rope - 2-D NeoX RoPE for ViT vision towers (Gemma 4 Vision).