Module moe_gate

GPU-accelerated MoE gating: parallel top-K expert selection with softmax routing.

One threadgroup is launched per token (grid = seq_len × 1 × 1), with 128 threads per group. Supports bf16 hidden-state input, f32 router weights, and per-expert scales.

Designed for Gemma 4: 128 experts, top-8 routing, hidden_dim=2816.
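The gating math the kernel parallelizes can be sketched on the CPU as follows. This is a hedged reference, not the kernel itself: it assumes router logits are already computed (hidden · router weights), and it applies softmax over the selected top-K logits only — the kernel's actual normalization order (softmax-then-top-K vs. top-K-then-softmax) and its handling of the per-expert scale may differ.

```rust
/// CPU reference for top-K expert selection with softmax routing.
/// A sketch of the gating math only; the GPU kernel performs this
/// per token across 128 threads in a threadgroup.
fn moe_gate_ref(logits: &[f32], top_k: usize) -> Vec<(usize, f32)> {
    // Rank experts by router logit, descending.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    let top = &idx[..top_k];

    // Numerically stable softmax over the selected logits only.
    let max = top
        .iter()
        .map(|&i| logits[i])
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = top.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // (expert index, routing weight) pairs; weights sum to 1.
    top.iter().zip(exps).map(|(&i, e)| (i, e / sum)).collect()
}

fn main() {
    // 8 experts with top-2 routing for illustration
    // (the real configuration is 128 experts, top-8).
    let logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3];
    for (expert, weight) in moe_gate_ref(&logits, 2) {
        println!("expert {expert}: weight {weight:.3}");
    }
}
```

Per token, the output is `top_k` (expert index, weight) pairs whose weights sum to 1; the downstream MoE layer uses them to mix the selected experts' outputs.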

Structs

MoeGateParams
Parameters for MoE gate routing.

Functions

moe_gate
Encode a parallel MoE gate operation.