Module ops

GPU kernel host-side dispatch functions.

Each submodule implements dispatch for a specific kernel family.

Modules

argmax
Greedy argmax GPU dispatch — finds the index of the maximum value in a float array entirely on the GPU.
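The semantics of the kernel can be sketched as a CPU reference (the function name is illustrative, not this crate's API): return the index of the maximum value, with ties going to the first occurrence and NaNs skipped.

```rust
/// CPU reference for greedy argmax over a float slice.
/// Ties resolve to the first occurrence; NaN entries are ignored.
fn argmax(xs: &[f32]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .filter(|(_, v)| !v.is_nan())
        .fold(None, |best, (i, &v)| match best {
            Some((_, bv)) if bv >= v => best, // keep earlier, larger-or-equal value
            _ => Some((i, v)),
        })
        .map(|(i, _)| i)
}
```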
argsort
GPU-accelerated argsort (descending) for MoE top-K routing.
copy
GPU-accelerated strided copy for making tensors contiguous.
cumsum
Cumulative sum (inclusive prefix sum) along the last axis.
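A minimal CPU sketch of the operation (names and the row-major layout are assumptions, not the crate's API): an inclusive scan run independently over each row of a `[rows, cols]` buffer.

```rust
/// CPU reference for an inclusive prefix sum along the last axis.
/// `data` is a row-major [rows, cols] buffer; each row is scanned independently.
fn cumsum_last_axis(data: &mut [f32], cols: usize) {
    for row in data.chunks_mut(cols) {
        let mut acc = 0.0f32;
        for x in row.iter_mut() {
            acc += *x; // running sum includes the current element (inclusive scan)
            *x = acc;
        }
    }
}
```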
dense_gemm
Dense F16 matrix multiply for the lm_head vocabulary projection.
dense_mm_bf16
Dense bf16 × f32 → f32 matmul using Apple M3+ tensor cores (mpp::tensor_ops::matmul2d).
elementwise
GPU-accelerated elementwise operations: add, multiply, and dtype cast.
embedding
GPU-accelerated quantized embedding table lookup.
encode_helpers
Helper utilities for encoding compute dispatches with inline constant parameters (bytes) alongside buffer bindings.
flash_attn_prefill
Flash-attention-style tiled prefill kernel — host dispatch.
flash_attn_prefill_blk
Flash-attention tile-skip pre-pass — host dispatch.
flash_attn_prefill_d512
Flash-attention-style tiled prefill kernel — NSG=8 D=512 host dispatch.
flash_attn_prefill_mask
SWA / causal attention-mask builder for the flash_attn_prefill kernels.
flash_attn_vec
Flash attention vector kernel dispatch — SIMD-vectorized decode-path SDPA.
flash_attn_vec_tq
Flash attention vector kernel dispatch for TurboQuant-compressed KV cache.
flash_attn_vec_tq_hb
Flash attention vector kernel dispatch for higher-bit TurboQuant KV cache.
fused_head_norm_rope
Fused per-head RMS normalization + NeoX RoPE GPU dispatch (bf16).
fused_norm_add
Fused RMS normalization + residual addition GPU dispatch (bf16).
fused_residual_norm
Fused residual addition + RMS normalization GPU dispatch (bf16).
fwht_standalone
Standalone Fast Walsh-Hadamard Transform dispatch (SIMD shuffle, zero barriers).
gated_delta_net
Gated DeltaNet fused GPU dispatch — the centerpiece of Qwen3.5 linear-attention layers.
gather
GPU-accelerated gather / index_select along dim=0.
gather_bench
Gather throughput microbenchmark dispatch.
gelu
GELU activation (pytorch_tanh variant) GPU dispatch.
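The pytorch_tanh variant is the standard tanh approximation of GELU; a scalar CPU reference (function name hypothetical) looks like:

```rust
/// Tanh-approximation GELU ("pytorch_tanh" variant):
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu_tanh(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_6;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)).tanh())
}
```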
hadamard
Fast Walsh-Hadamard Transform (FWHT) GPU kernel dispatch.
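The transform itself is a standard in-place butterfly over a power-of-two length; a CPU reference of what the kernel computes (unnormalized, as is conventional) can be sketched as:

```rust
/// CPU reference for the Fast Walsh-Hadamard Transform: in-place butterfly
/// over a power-of-two-length slice, O(n log n), unnormalized. Applying it
/// twice yields n times the original input.
fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(h * 2) {
            for i in block..block + h {
                let (a, b) = (data[i], data[i + h]);
                data[i] = a + b; // butterfly: sum
                data[i + h] = a - b; // butterfly: difference
            }
        }
        h *= 2;
    }
}
```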
hadamard_quantize_kv
Hadamard-quantize KV cache kernel dispatch (ADR-007 Phase 1.1).
kv_cache_copy
KV cache GPU copy dispatch.
l2_norm
L2 Normalization GPU dispatch.
moe_dispatch
GPU-accelerated MoE expert dispatch (Stage 1: loop over selected experts).
moe_gate
GPU-accelerated MoE gating: parallel top-K expert selection with softmax routing.
quantized_matmul
Quantized matrix multiplication host-side dispatch.
quantized_matmul_ggml
GGML block-format quantized matrix-vector multiply dispatch.
quantized_matmul_id
Expert-routed (MoE) quantized matrix-vector multiply dispatch.
quantized_matmul_id_ggml
GGML block-format expert-routed (MoE) quantized matrix-vector multiply dispatch.
rms_norm
RMS Normalization GPU dispatch.
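A CPU sketch of RMS normalization as usually defined (the signature is illustrative, not the crate's dispatch API): scale each element by the reciprocal root-mean-square of the vector, then by a learned per-channel weight.

```rust
/// CPU reference for RMS normalization:
/// out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With unit weights and `eps = 0`, the output has mean square 1 by construction.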
rope
Rotary Position Embedding (RoPE) GPU dispatch.
rope_multi
Multi-section Rotary Position Embedding with optional interleaved mode.
scale_mask_softmax
Fused scale-mask-softmax for non-flash-attention prefill.
sdpa
Scaled dot-product attention (SDPA) host dispatch.
sdpa_sliding
Sliding-window scaled dot-product attention host dispatch.
sigmoid_mul
Elementwise sigmoid-gated multiply: out[i] = x[i] * sigmoid(gate[i]).
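The formula above can be written directly as a CPU reference (function name hypothetical):

```rust
/// CPU reference for the sigmoid-gated multiply:
/// out[i] = x[i] * sigmoid(gate[i]), where sigmoid(g) = 1 / (1 + exp(-g)).
fn sigmoid_mul(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter()
        .zip(gate)
        .map(|(&xi, &gi)| xi * (1.0 / (1.0 + (-gi).exp())))
        .collect()
}
```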
softcap
Softcap (tanh-based logit capping) GPU dispatch.
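Assuming the usual softcapping form, `cap * tanh(x / cap)`, which squashes logits smoothly into (-cap, cap) while staying near-identity for small |x|, a scalar CPU sketch is:

```rust
/// Tanh-based logit softcapping (assumed form: cap * tanh(x / cap)).
/// Near-identity for |x| << cap; saturates toward +/-cap for large |x|.
fn softcap(x: f32, cap: f32) -> f32 {
    cap * (x / cap).tanh()
}
```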
softmax
Numerically stable softmax GPU dispatch.
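"Numerically stable" here refers to the standard max-subtraction trick: subtracting the row maximum before exponentiating prevents overflow for large logits without changing the result. A CPU reference:

```rust
/// CPU reference for numerically stable softmax: subtract the max before
/// exponentiating so exp() never overflows, then normalize to sum to 1.
fn softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}
```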
softmax_sample
Temperature-scaled softmax + categorical sample, entirely on GPU.
ssm_conv
SSM depthwise causal 1D conv + SiLU GPU dispatch.
top_k
GPU top-K dispatch — returns the K largest elements of a float array.
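A CPU sketch of the top-K semantics (a full sort for clarity; the GPU kernel will use a cheaper selection strategy): the K largest values with their original indices, in descending order.

```rust
/// CPU reference for top-K: the K largest values paired with their original
/// indices, sorted descending. Assumes no NaNs in the input.
fn top_k(xs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut pairs: Vec<(usize, f32)> = xs.iter().copied().enumerate().collect();
    pairs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    pairs.truncate(k);
    pairs
}
```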
tq_dequantize_kv
TQ KV dequantize kernel dispatch — iter-20 Leg F ablation.
transpose
GPU-accelerated 2D matrix transpose.
tri_solve
Lower-triangular unit-diagonal solve: X = L \ B.
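Because L is lower-triangular with a unit diagonal, X = L \ B reduces to forward substitution with no division. A single-column CPU sketch (row-major layout assumed; not the crate's dispatch signature):

```rust
/// CPU reference for the unit-diagonal lower-triangular solve x = L \ b
/// via forward substitution. `l` is a row-major n x n matrix; `b` is a
/// length-n column solved in place. Diagonal entries of `l` are ignored
/// (assumed to be 1), so no division is needed.
fn tri_solve_unit(l: &[f32], b: &mut [f32], n: usize) {
    for i in 0..n {
        for j in 0..i {
            // x[i] = b[i] - sum_{j<i} L[i][j] * x[j]
            b[i] -= l[i * n + j] * b[j];
        }
    }
}
```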