GPU kernel host-side dispatch functions.
Each submodule implements dispatch for a specific kernel family.
Modules
- argmax
- Greedy argmax GPU dispatch — finds the index of the maximum value in a float array entirely on the GPU.
- argsort
- GPU-accelerated argsort (descending) for MoE top-K routing.
- compute_g_beta
- Fused GPU kernel for DeltaNet g and beta computation.
- copy
- GPU-accelerated strided copy for making tensors contiguous.
- cumsum
- Cumulative sum (inclusive prefix sum) along the last axis.
- dense_gemm
- Dense F16 matrix multiply for the lm_head vocabulary projection.
- dense_gemv_bf16
- Dense bf16 × f32 → f32 GEMV (matrix-vector multiply) for M == 1 decode.
- dense_mm_bf16
- Dense bf16 × f32 → f32 matmul using Apple M3+ tensor cores (mpp::tensor_ops::matmul2d).
- dense_mm_f32_f32
- Dense f32 × f32 → f32 matmul using Apple M3+ tensor cores (mpp::tensor_ops::matmul2d).
- elementwise
- GPU-accelerated elementwise operations: add, multiply, and dtype cast.
- embedding
- GPU-accelerated quantized embedding table lookup.
- encode_helpers
- Helper utilities for encoding compute dispatches with inline constant parameters (bytes) alongside buffer bindings.
- flash_attn_prefill
- Flash-attention-style tiled prefill kernel — host dispatch.
- flash_attn_prefill_blk
- Flash-attention tile-skip pre-pass — host dispatch.
- flash_attn_prefill_d512
- Flash-attention-style tiled prefill kernel — NSG=8 D=512 host dispatch.
- flash_attn_prefill_mask
- SWA / causal attention-mask builder for the flash_attn_prefill kernels.
- flash_attn_vec
- Flash attention vector kernel dispatch — SIMD-vectorized decode-path SDPA.
- flash_attn_vec_tq
- Flash attention vector kernel dispatch for TurboQuant-compressed KV cache.
- flash_attn_vec_tq_hb
- Flash attention vector kernel dispatch for higher-bit TurboQuant KV cache.
- fused_head_norm_rope
- Fused per-head RMS normalization + NeoX RoPE GPU dispatch (bf16).
- fused_norm_add
- Fused RMS normalization + residual addition GPU dispatch (bf16).
- fused_residual_norm
- Fused residual addition + RMS normalization GPU dispatch (bf16).
- fwht_standalone
- Standalone Fast Walsh-Hadamard Transform dispatch (SIMD shuffle, zero barriers).
- gated_delta_net
- Gated DeltaNet fused GPU dispatch — the centerpiece of Qwen3.5 linear-attention layers.
- gather
- GPU-accelerated gather / index_select along dim=0.
- gather_bench
- Gather throughput microbenchmark dispatch.
- gelu
- GELU activation (pytorch_tanh variant) GPU dispatch.
- hadamard
- Fast Walsh-Hadamard Transform (FWHT) GPU kernel dispatch.
- hadamard_quantize_kv
- Hadamard-quantize KV cache kernel dispatch (ADR-007 Phase 1.1).
- kv_cache_copy
- KV cache GPU copy dispatch.
- l2_norm
- L2 Normalization GPU dispatch.
- moe_dispatch
- GPU-accelerated MoE expert dispatch (Stage 1: loop over selected experts).
- moe_gate
- GPU-accelerated MoE gating: parallel top-K expert selection with softmax routing.
- moe_softmax_topk
- GPU fused softmax + top-K + renorm for MoE routing (see the CPU reference sketch after this list).
- moe_weighted_reduce
- GPU MoE weighted accumulate + shared expert add + optional residual.
- quantized_matmul
- Quantized matrix multiplication host-side dispatch.
- quantized_matmul_ggml
- GGML block-format quantized matrix-vector multiply dispatch.
- quantized_matmul_id
- Expert-routed (MoE) quantized matrix-vector multiply dispatch.
- quantized_matmul_id_ggml
- GGML block-format expert-routed (MoE) quantized matrix-vector multiply dispatch.
- rms_norm
- RMS Normalization GPU dispatch.
- rope
- Rotary Position Embedding (RoPE) GPU dispatch.
- rope_multi
- Multi-section Rotary Position Embedding with optional interleaved mode.
- scale_mask_softmax
- Fused scale-mask-softmax for non-flash-attention prefill.
- sdpa
- Scaled dot-product attention (SDPA) host dispatch.
- sdpa_decode
- GPU SDPA decode kernel — F32 Q/K/V, multi-simdgroup tiled, single-token decode.
- sdpa_sliding
- Sliding-window scaled dot-product attention host dispatch.
- sigmoid_mul
- Elementwise sigmoid-gated multiply: out[i] = x[i] * sigmoid(gate[i]) (see the CPU reference sketch after this list).
- silu_mul
- Fused SiLU-gated multiply: out[i] = gate[i] * sigmoid(gate[i]) * up[i] (see the CPU reference sketch after this list).
- softcap
- Softcap (tanh-based logit capping) GPU dispatch.
- softmax
- Numerically stable softmax GPU dispatch.
- softmax_sample
- Temperature-scaled softmax + categorical sample, entirely on GPU.
- ssm_conv
- SSM depthwise causal 1D conv + SiLU GPU dispatch.
- ssm_norm_gate
- Fused per-head RMSNorm + SiLU gate kernel for DeltaNet op 8.
- top_k
- GPU top-K dispatch — returns the K largest elements of a float array.
- tq_dequantize_kv
- TQ KV dequantize kernel dispatch — iter-20 Leg F ablation.
- transpose
- GPU-accelerated 2D matrix transpose.
- tri_solve
- Lower-triangular unit-diagonal solve: X = L \ B (see the CPU reference sketch after this list).
- vision_2d_rope
- 2-D NeoX RoPE for ViT vision towers (Gemma 4 Vision).
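The moe_softmax_topk module fuses three routing steps into one GPU kernel: softmax over the router logits, selection of the K most probable experts, and renormalization of the selected weights. A minimal CPU sketch of that math is shown below; the function and variable names (softmax_topk_renorm, router_logits) are illustrative and are not this crate's dispatch API.

```rust
/// CPU reference for the softmax + top-K + renorm routing step that
/// `moe_softmax_topk` fuses on the GPU. Names and shapes here are
/// illustrative, not the module's actual API.
fn softmax_topk_renorm(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over the expert logits.
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|&e| e / sum).collect();

    // Pick the K most probable experts.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    idx.truncate(k);

    // Renormalize the selected weights so they sum to 1.
    let kept: f32 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / kept)).collect()
}

fn main() {
    let logits = [0.2_f32, 1.5, -0.3, 0.9];
    for (expert, weight) in softmax_topk_renorm(&logits, 2) {
        println!("expert {expert}: weight {weight:.3}");
    }
}
```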
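For the two gated elementwise ops, sigmoid_mul and silu_mul, the following is a minimal CPU sketch of the formulas quoted in their entries. The actual modules dispatch Metal kernels over GPU buffers; the free functions here exist only to spell out the math.

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// sigmoid_mul: out[i] = x[i] * sigmoid(gate[i])
fn sigmoid_mul(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| xi * sigmoid(gi)).collect()
}

/// silu_mul: out[i] = gate[i] * sigmoid(gate[i]) * up[i], i.e. SiLU(gate) * up
fn silu_mul(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(&gi, &ui)| gi * sigmoid(gi) * ui)
        .collect()
}

fn main() {
    let gate = [0.5_f32, -1.0, 2.0];
    let up = [1.0_f32, 1.0, 1.0];
    println!("{:?}", sigmoid_mul(&up, &gate));
    println!("{:?}", silu_mul(&gate, &up));
}
```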
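For tri_solve, the sketch below is a plain forward-substitution reference for the unit-diagonal solve X = L \ B, assuming row-major layout for illustration; the GPU module's buffer layout and signature may differ.

```rust
/// CPU reference for the unit-diagonal lower-triangular solve X = L \ B.
/// `l` is an n x n row-major matrix with an implicit unit diagonal;
/// `b` is n x m row-major. Illustrative only.
fn tri_solve_ref(l: &[f32], b: &[f32], n: usize, m: usize) -> Vec<f32> {
    let mut x = b.to_vec();
    // Forward substitution; no division because L[i][i] == 1.
    for i in 0..n {
        for j in 0..i {
            let lij = l[i * n + j];
            for c in 0..m {
                x[i * m + c] -= lij * x[j * m + c];
            }
        }
    }
    x
}

fn main() {
    // L = [[1, 0], [0.5, 1]], B = [[2], [3]]  =>  X = [[2], [2]]
    let l = [1.0_f32, 0.0, 0.5, 1.0];
    let b = [2.0_f32, 3.0];
    println!("{:?}", tri_solve_ref(&l, &b, 2, 1));
}
```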