GPU kernel host-side dispatch functions.
Each submodule implements dispatch for a specific kernel family.
Modules
- argmax
- Greedy argmax GPU dispatch — finds the index of the maximum value in a float array entirely on the GPU.
- argsort
- GPU-accelerated argsort (descending) for MoE top-K routing.
- compute_g_beta
- Fused GPU kernel for DeltaNet g and beta computation.
- copy
- GPU-accelerated strided copy for making tensors contiguous.
- cumsum
- Cumulative sum (inclusive prefix sum) along the last axis.
- dense_gemm
- Dense F16 matrix multiply for the lm_head vocabulary projection.
- dense_gemv_bf16
- Dense bf16 × f32 → f32 GEMV (matrix-vector multiply) for M == 1 decode.
- dense_mm_bf16
- Dense bf16 × f32 → f32 matmul using Apple M3+ tensor cores (mpp::tensor_ops::matmul2d).
- dense_mm_f32_f32
- Dense f32 × f32 → f32 matmul using Apple M3+ tensor cores (mpp::tensor_ops::matmul2d).
- elementwise
- GPU-accelerated elementwise operations: add, multiply, and dtype cast.
- embedding
- GPU-accelerated quantized embedding table lookup.
- encode_helpers
- Helper utilities for encoding compute dispatches with inline constant parameters (bytes) alongside buffer bindings.
- flash_attn_prefill
- Flash-attention-style tiled prefill kernel — host dispatch.
- flash_attn_prefill_blk
- Flash-attention tile-skip pre-pass — host dispatch.
- flash_attn_prefill_d512
- Flash-attention-style tiled prefill kernel — NSG=8 D=512 host dispatch.
- flash_attn_prefill_mask
- SWA / causal attention-mask builder for the flash_attn_prefill kernels.
- flash_attn_vec
- Flash attention vector kernel dispatch — SIMD-vectorized decode-path SDPA.
- flash_attn_vec_tq
- Flash attention vector kernel dispatch for TurboQuant-compressed KV cache.
- flash_attn_vec_tq_hb
- Flash attention vector kernel dispatch for higher-bit TurboQuant KV cache.
- fused_head_norm_rope
- Fused per-head RMS normalization + NeoX RoPE GPU dispatch (bf16).
- fused_norm_add
- Fused RMS normalization + residual addition GPU dispatch (bf16).
- fused_residual_norm
- Fused residual addition + RMS normalization GPU dispatch (bf16).
- fwht_standalone
- Standalone Fast Walsh-Hadamard Transform dispatch (SIMD shuffle, zero barriers).
- gated_delta_net
- Gated DeltaNet fused GPU dispatch — the centerpiece of Qwen3.5 linear-attention layers.
- gather
- GPU-accelerated gather / index_select along dim=0.
- gather_bench
- Gather throughput microbenchmark dispatch.
- gelu
- GELU activation (pytorch_tanh variant) GPU dispatch.
- hadamard
- Fast Walsh-Hadamard Transform (FWHT) GPU kernel dispatch.
- hadamard_quantize_kv
- Hadamard-quantize KV cache kernel dispatch (ADR-007 Phase 1.1).
- kv_cache_copy
- KV cache GPU copy dispatch.
- l2_norm
- L2 Normalization GPU dispatch.
- moe_dispatch
- GPU-accelerated MoE expert dispatch (Stage 1: loop over selected experts).
- moe_gate
- GPU-accelerated MoE gating: parallel top-K expert selection with softmax routing.
- moe_softmax_topk
- GPU fused softmax + top-K + renorm for MoE routing (see the CPU reference sketch after this list).
- moe_weighted_reduce
- GPU MoE weighted accumulate + shared expert add + optional residual.
- quantized_matmul
- Quantized matrix multiplication host-side dispatch.
- quantized_matmul_ggml
- GGML block-format quantized matrix-vector multiply dispatch.
- quantized_matmul_id
- Expert-routed (MoE) quantized matrix-vector multiply dispatch.
- quantized_matmul_id_ggml
- GGML block-format expert-routed (MoE) quantized matrix-vector multiply dispatch.
- rms_norm
- RMS Normalization GPU dispatch.
- rope
- Rotary Position Embedding (RoPE) GPU dispatch.
- rope_multi
- Multi-section Rotary Position Embedding with optional interleaved mode.
- scale_mask_softmax
- Fused scale-mask-softmax for non-flash-attention prefill.
- sdpa
- Scaled dot-product attention (SDPA) host dispatch.
- sdpa_decode
- GPU SDPA decode kernel — F32 Q/K/V, multi-simdgroup tiled, single-token decode.
- sdpa_sliding
- Sliding-window scaled dot-product attention host dispatch.
- sigmoid_mul
- Elementwise sigmoid-gated multiply: out[i] = x[i] * sigmoid(gate[i]) (see the CPU reference sketch after this list).
- silu_mul
- Fused SiLU-gated multiply: out[i] = gate[i] * sigmoid(gate[i]) * up[i] (see the CPU reference sketch after this list).
- softcap
- Softcap (tanh-based logit capping) GPU dispatch.
- softmax
- Numerically stable softmax GPU dispatch.
- softmax_sample
- Temperature-scaled softmax + categorical sample, entirely on GPU.
- ssm_conv
- SSM depthwise causal 1D conv + SiLU GPU dispatch.
- ssm_norm_gate
- Fused per-head RMSNorm + SiLU gate kernel for DeltaNet op 8.
- top_k
- GPU top-K dispatch — returns the K largest elements of a float array.
- tq_dequantize_kv
- TQ KV dequantize kernel dispatch — iter-20 Leg F ablation.
- transpose
- GPU-accelerated 2D matrix transpose.
- tri_solve
- Lower-triangular unit-diagonal solve: X = L \ B (see the CPU reference sketch after this list).
- vision_2d_rope
- 2-D NeoX RoPE for ViT vision towers (Gemma 4 Vision).
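The moe_softmax_topk module fuses three routing steps into one GPU kernel: softmax over the router logits, selection of the K most probable experts, and renormalization of the selected weights. A minimal CPU sketch of that math is shown below; the function and variable names (softmax_topk_renorm, router_logits) are illustrative and are not this crate's dispatch API.

```rust
/// CPU reference for the softmax + top-K + renorm routing step that
/// `moe_softmax_topk` fuses on the GPU. Names and shapes here are
/// illustrative, not the module's actual API.
fn softmax_topk_renorm(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over the expert logits.
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|&e| e / sum).collect();

    // Pick the K most probable experts.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    idx.truncate(k);

    // Renormalize the selected weights so they sum to 1.
    let kept: f32 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / kept)).collect()
}

fn main() {
    let logits = [0.2_f32, 1.5, -0.3, 0.9];
    for (expert, weight) in softmax_topk_renorm(&logits, 2) {
        println!("expert {expert}: weight {weight:.3}");
    }
}
```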
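For the two gated elementwise ops, sigmoid_mul and silu_mul, the following is a minimal CPU sketch of the formulas quoted in their entries. The actual modules dispatch Metal kernels over GPU buffers; the free functions here exist only to spell out the math.

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// sigmoid_mul: out[i] = x[i] * sigmoid(gate[i])
fn sigmoid_mul(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| xi * sigmoid(gi)).collect()
}

/// silu_mul: out[i] = gate[i] * sigmoid(gate[i]) * up[i], i.e. SiLU(gate) * up
fn silu_mul(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(&gi, &ui)| gi * sigmoid(gi) * ui)
        .collect()
}

fn main() {
    let gate = [0.5_f32, -1.0, 2.0];
    let up = [1.0_f32, 1.0, 1.0];
    println!("{:?}", sigmoid_mul(&up, &gate));
    println!("{:?}", silu_mul(&gate, &up));
}
```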
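For tri_solve, the sketch below is a plain forward-substitution reference for the unit-diagonal solve X = L \ B, assuming row-major layout for illustration; the GPU module's buffer layout and signature may differ.

```rust
/// CPU reference for the unit-diagonal lower-triangular solve X = L \ B.
/// `l` is an n x n row-major matrix with an implicit unit diagonal;
/// `b` is n x m row-major. Illustrative only.
fn tri_solve_ref(l: &[f32], b: &[f32], n: usize, m: usize) -> Vec<f32> {
    let mut x = b.to_vec();
    // Forward substitution; no division because L[i][i] == 1.
    for i in 0..n {
        for j in 0..i {
            let lij = l[i * n + j];
            for c in 0..m {
                x[i * m + c] -= lij * x[j * m + c];
            }
        }
    }
    x
}

fn main() {
    // L = [[1, 0], [0.5, 1]], B = [[2], [3]]  =>  X = [[2], [2]]
    let l = [1.0_f32, 0.0, 0.5, 1.0];
    let b = [2.0_f32, 3.0];
    println!("{:?}", tri_solve_ref(&l, &b, 2, 1));
}
```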