GPU-accelerated MoE expert dispatch (Stage 1: loop over selected experts).
For each of the K selected experts, runs:

    gate_out   = gate_proj_e(x)          // [input_dim -> intermediate_dim]
    up_out     = up_proj_e(x)            // [input_dim -> intermediate_dim]
    hidden     = GELU(gate_out) * up_out
    expert_out = down_proj_e(hidden)     // [intermediate_dim -> input_dim]
    result    += routing_weight_e * expert_out
Stage 1 uses individual kernel dispatches per expert and per projection. The projections use float matmuls (the caller dequantizes the weights or provides them in float form). A Stage 2 optimization (Epic 6) would fuse these dispatches.
This module provides the high-level `moe_dispatch` function that orchestrates the per-expert loop, using the `fused_gelu_mul` and `moe_accumulate` shaders from `moe_dispatch.metal`.
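The per-expert computation above can be sketched as a plain-CPU reference in Rust. Everything here (the `matvec` helper, weight layout, dimensions) is illustrative scaffolding, not the crate's API; the real path encodes these steps as Metal kernel dispatches.

```rust
// CPU reference of the per-expert FFN loop that moe_dispatch encodes on GPU.
// Names and layouts are illustrative only.

fn gelu(x: f32) -> f32 {
    // tanh approximation of GELU
    0.5 * x * (1.0 + ((2.0 / std::f32::consts::PI).sqrt() * (x + 0.044715 * x.powi(3))).tanh())
}

/// y = W x for a row-major [rows x cols] matrix.
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    (0..rows)
        .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum())
        .collect()
}

fn main() {
    let (input_dim, intermediate_dim) = (2, 3);
    let x = vec![1.0_f32, 2.0];
    // One expert's weights: gate/up are [intermediate_dim x input_dim],
    // down is [input_dim x intermediate_dim].
    let gate = vec![0.1; intermediate_dim * input_dim];
    let up = vec![0.2; intermediate_dim * input_dim];
    let down = vec![0.3; input_dim * intermediate_dim];
    let routing_weight = 0.5_f32;

    let mut result = vec![0.0_f32; input_dim]; // zero-initialized accumulator
    let gate_out = matvec(&gate, intermediate_dim, input_dim, &x);
    let up_out = matvec(&up, intermediate_dim, input_dim, &x);
    // fused GELU-multiply step
    let hidden: Vec<f32> = gate_out.iter().zip(&up_out).map(|(g, u)| gelu(*g) * u).collect();
    let expert_out = matvec(&down, input_dim, intermediate_dim, &hidden);
    for (r, e) in result.iter_mut().zip(&expert_out) {
        *r += routing_weight * e; // moe_accumulate step
    }
    println!("{:?}", result);
}
```

Running this loop once per selected expert, accumulating into the same `result` buffer, mirrors what the Stage 1 dispatch sequence does on the GPU.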
Structs§
- `ExpertWeights` - A single expert's weight matrices (float32, pre-dequantized or float).
- `MoeDispatchParams` - Parameters for MoE dispatch.
Functions§
- `fused_gelu_mul_bf16_encode` - Encode a fused GELU-multiply on bf16 buffers.
- `moe_accumulate_encode` - Encode a weighted accumulation: `accumulator[i] += routing_weight * expert_output[i]`.
- `moe_accumulate_encode_offset` - Like `moe_accumulate_encode` but reads `expert_output` from `src_byte_offset`.
- `moe_dispatch` - Encode MoE dispatch: loop over selected experts, run FFN, accumulate.
- `moe_gather_topk_weights_encode` - Encode a GPU-side MoE top-K routing gather.
- `moe_swiglu_batch_encode` - Encode a batched SwiGLU across all top_k expert slots in one dispatch.
- `moe_swiglu_fused_encode` - Encode a fused SwiGLU on a `[2*N]` `gate_up` buffer, producing `[N]` output.
- `moe_swiglu_fused_encode_offset` - Like `moe_swiglu_fused_encode` but reads from `gate_up` at `gu_byte_offset` and writes to `output` at `out_byte_offset`.
- `moe_swiglu_seq_backward_encode` - Backward of `moe_swiglu_seq`: a single fused kernel writes both gate and up gradients into the supplied `d_gate_up` buffer (same layout as the forward `gate_up`).
- `moe_swiglu_seq_bf16_encode` - Multi-token SwiGLU for batched prefill (bf16 I/O, f32 accumulator).
- `moe_swiglu_seq_encode` - Multi-token SwiGLU for batched prefill.
- `moe_weighted_sum_encode` - Encode a weighted sum of all top_k expert outputs in one dispatch.
- `moe_weighted_sum_seq_backward_outputs_encode` - Backward of `moe_weighted_sum_seq` w.r.t. `expert_outputs`.
- `moe_weighted_sum_seq_backward_weights_encode` - Backward of `moe_weighted_sum_seq` w.r.t. `weights`.
- `moe_weighted_sum_seq_bf16_input_encode` - Multi-token weighted sum of expert outputs for batched prefill (bf16 inputs).
- `moe_weighted_sum_seq_encode` - Multi-token weighted sum of expert outputs for batched prefill.
- `moe_zero_buffer_encode` - Zero-initialize an f32 GPU buffer using the `zero_buffer` kernel.
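The fused-SwiGLU contract (a `[2*N]` `gate_up` buffer reduced to an `[N]` output) can be sketched on the CPU as below. The half-split layout (gate activations in the first `N` elements, up activations in the second `N`) is an assumption for illustration, not taken from the kernel source.

```rust
// CPU sketch of the fused SwiGLU that moe_swiglu_fused_encode dispatches:
// output[i] = SiLU(gate[i]) * up[i], with gate and up packed into one buffer.
// ASSUMPTION: gate occupies gate_up[0..N], up occupies gate_up[N..2N].

fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn swiglu_fused(gate_up: &[f32]) -> Vec<f32> {
    let n = gate_up.len() / 2;
    (0..n).map(|i| silu(gate_up[i]) * gate_up[n + i]).collect()
}

fn main() {
    // N = 2: gate = [1.0, -1.0], up = [2.0, 3.0]
    let gate_up = [1.0_f32, -1.0, 2.0, 3.0];
    println!("{:?}", swiglu_fused(&gate_up)); // [N] output
}
```

The `_offset` variant would apply the same reduction starting at `gu_byte_offset` within a larger buffer, which lets several experts share one allocation.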