#[non_exhaustive]
#[repr(u16)]
pub enum MoeKind {
    ScalarGguf = 0,
    Wmma = 1,
    WmmaGguf = 2,
}
Mixture-of-Experts (MoE) variant selector — used as the op
discriminant for kernel SKUs whose crate::OpCategory is
crate::OpCategory::Moe. Phase 8 Milestone 8.5 wires the three
fused per-token-dispatch + expert-matmul + accumulate kernels.
MoE forward pass shape:
- Input activations [T, D_model].
- Per-token top-k expert indices [T, top_k] (i32).
- Per-token top-k expert weights [T, top_k] (FP).
- Per-expert weight matrices [num_experts, D_model, D_expert] (dtype depends on the variant: FP for Wmma, GGUF-packed bytes for ScalarGguf / WmmaGguf).
- Output [T, D_model] (after expert mixing).
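The shapes listed above can be read as a plain loop over tokens and their top-k slots. The following is a minimal CPU sketch, not the shipped kernel: it assumes f32 everywhere, no quantization, and that each expert is a single row-major matrix mapping d_in inputs to d_out outputs. The function name moe_forward_reference and all parameter names are hypothetical.

```rust
/// Reference MoE forward pass on the CPU (illustrative only).
///
/// Layouts (all row-major, all assumptions of this sketch):
///   x:       [t, d_in]                  input activations
///   indices: [t, top_k]                 expert index per token slot
///   weights: [t, top_k]                 mixing weight per token slot
///   experts: [num_experts, d_in, d_out] one dense matrix per expert
/// Returns:   [t, d_out]                 mixed expert outputs
pub fn moe_forward_reference(
    x: &[f32],
    indices: &[i32],
    weights: &[f32],
    experts: &[f32],
    t: usize,
    top_k: usize,
    d_in: usize,
    d_out: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; t * d_out];
    for tok in 0..t {
        let x_row = &x[tok * d_in..(tok + 1) * d_in];
        for slot in 0..top_k {
            // Per-token dispatch: which expert, and with what mixing weight.
            let e = indices[tok * top_k + slot] as usize;
            let gate = weights[tok * top_k + slot];
            let w = &experts[e * d_in * d_out..(e + 1) * d_in * d_out];
            // Expert matmul followed by weighted accumulation into the output row.
            for j in 0..d_out {
                let mut acc = 0.0f32;
                for i in 0..d_in {
                    acc += x_row[i] * w[i * d_out + j];
                }
                out[tok * d_out + j] += gate * acc;
            }
        }
    }
    out
}
```

With one token, two experts (an identity matrix and twice the identity), and mixing weights of 0.5 each, the input [1, 2] comes back as [1.5, 3]. The fused kernels perform the same three steps (dispatch, expert matmul, accumulate) in a single launch.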
All three variants are inference-only by convention; backward passes are not shipped (MoE training uses higher-level autograd surfaces that compose the per-expert FFN ops manually).
Lineage: vendored from attention.rs via fuel-cuda-kernels. See
crates/baracuda-kernels-sys/LICENSE-thirdparty.md for the full
attribution chain.
Variants (Non-exhaustive)
This enum is marked as non-exhaustive
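Because the enum is non-exhaustive, a match in a downstream crate needs a wildcard arm so that variants added later do not break the build. The sketch below re-declares the enum locally so it compiles on its own; the derive list and the helper uses_tensor_cores are assumptions made for illustration and are not part of the documented API.

```rust
// Local mirror of the documented enum; the derives are an assumption.
#[non_exhaustive]
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MoeKind {
    ScalarGguf = 0,
    Wmma = 1,
    WmmaGguf = 2,
}

/// Hypothetical helper: reports whether a variant uses tensor cores,
/// or `None` for a variant this code does not know about.
pub fn uses_tensor_cores(kind: MoeKind) -> Option<bool> {
    match kind {
        MoeKind::ScalarGguf => Some(false),
        MoeKind::Wmma | MoeKind::WmmaGguf => Some(true),
        // Mandatory outside the defining crate; inside it the arm is
        // unreachable, hence the lint allowance.
        #[allow(unreachable_patterns)]
        _ => None,
    }
}
```

Since the enum is repr(u16), a cast such as MoeKind::WmmaGguf as u16 yields the discriminant 2.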
ScalarGguf = 0
Scalar dispatch path operating on GGUF-quantized expert weights
staged through a q8_1 intermediate (FP32 activations in, FP32
output out). No tensor cores. Used as a portability fallback
and as the slower-but-simpler reference for the WMMA + GGUF
hot path. Block formats: Q8_0, Q2_K, Q3_K, Q4_K,
Q5_K, Q6_K (matches Fuel’s moe_gemm_gguf switch).
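To give a feel for what a GGUF block format involves, here is a minimal sketch of dequantizing Q8_0-style blocks: 32 signed 8-bit values sharing one scale. It simplifies the real layout by storing the scale as f32 (GGUF stores it as a 16-bit float), and the names Q8Block and dequantize_q8_0 are hypothetical. It does not model the q8_1 staging step or the K-quant formats.

```rust
/// Number of quantized values per Q8_0 block.
pub const QK8_0: usize = 32;

/// Simplified Q8_0-style block. Assumption: the scale is kept as f32 here,
/// whereas the on-disk format uses a half-precision scale.
pub struct Q8Block {
    pub d: f32,
    pub qs: [i8; QK8_0],
}

/// Expands each block to 32 f32 values: value = scale * quantized.
pub fn dequantize_q8_0(blocks: &[Q8Block]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * QK8_0);
    for b in blocks {
        out.extend(b.qs.iter().map(|&q| b.d * q as f32));
    }
    out
}
```

A scalar path can dequantize (or requantize activations) block by block like this and then run an ordinary dot product, which is simple but forgoes tensor cores.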
Wmma = 1
WMMA tensor-core path operating on dense FP expert weights (f16 / bf16). The FP MoE hot path used when full-precision expert weights are available — typically training-time or FP-deployment inference. sm_70+ required.
WmmaGguf = 2
Combined WMMA tensor-core + GGUF-quantized weight path. The
dispatcher dequantizes one GGUF block per N-row into shared
memory, then issues a 16×16×16 WMMA mma.sync against the
dense activation tile. The production hot path for quantized
LLM inference. Activation dtype: f16 / bf16. Weight block
formats: same set as Self::ScalarGguf. sm_70+ required.
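The variant descriptions imply a simple selection rule: the WMMA paths need sm_70 or newer, GGUF weights can fall back to the scalar path, and dense FP weights have no scalar fallback among these three variants. The sketch below encodes that reading; select_moe_kind and its parameters are hypothetical, the enum is re-declared locally with assumed derives, and the crate's actual dispatcher may apply further checks.

```rust
// Local mirror of the documented enum; the derives are an assumption.
#[non_exhaustive]
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MoeKind {
    ScalarGguf = 0,
    Wmma = 1,
    WmmaGguf = 2,
}

/// Hypothetical selector.
/// `sm` is the compute capability as major * 10 + minor (e.g. 75 for sm_75).
/// `gguf_weights` is true when expert weights are GGUF-quantized.
pub fn select_moe_kind(sm: u32, gguf_weights: bool) -> Option<MoeKind> {
    let has_wmma = sm >= 70;
    match (gguf_weights, has_wmma) {
        (true, true) => Some(MoeKind::WmmaGguf),    // quantized hot path
        (true, false) => Some(MoeKind::ScalarGguf), // portability fallback
        (false, true) => Some(MoeKind::Wmma),       // dense f16 / bf16 path
        (false, false) => None,                     // no documented variant fits
    }
}
```

For example, GGUF weights on an sm_61 device resolve to ScalarGguf, while the same weights on sm_75 resolve to WmmaGguf.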