Expert-routed (MoE) quantized matrix-vector multiply dispatch.
Encodes a GPU compute command that performs, for each (token, expert-slot):

    expert_id = ids[token * n_expert_used + slot]
    output[token][slot][col] = sum_k(dequant(expert_weight[expert_id][col][k]) * input[token][k])
This is the _id variant of quantized_matmul: same dequantization logic but with per-token expert selection via an ids buffer, enabling fused MoE dispatch.
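The per-(token, expert-slot) computation described above can be sketched as a CPU reference loop. This is only an illustrative sketch, not the Metal kernel or this crate's API: the function name `moe_matvec_reference`, the flat row-major buffer layouts, and the `dequant` helper (int8 values with one scale per expert) are all assumptions made for the example. The real kernel's quantization formats are not described here.

```rust
/// Hypothetical stand-in for dequantization: an int8 value times a scale.
/// The actual quantized block formats are not specified in this section.
fn dequant(q: i8, scale: f32) -> f32 {
    q as f32 * scale
}

/// CPU reference for the expert-routed matrix-vector multiply.
///
/// Assumed flat row-major layouts (illustrative only):
/// - `input`:   [n_tokens][k]
/// - `weights`: [n_experts][n][k]
/// - `scales`:  [n_experts] (one scale per expert, an assumption)
/// - `ids`:     [n_tokens][n_expert_used]
/// - returns:   [n_tokens][n_expert_used][n]
fn moe_matvec_reference(
    input: &[f32],
    weights: &[i8],
    scales: &[f32],
    ids: &[u32],
    n_tokens: usize,
    n_expert_used: usize,
    n: usize,
    k: usize,
) -> Vec<f32> {
    let mut output = vec![0.0f32; n_tokens * n_expert_used * n];
    for token in 0..n_tokens {
        for slot in 0..n_expert_used {
            // Per-token expert selection via the ids buffer.
            let expert_id = ids[token * n_expert_used + slot] as usize;
            for col in 0..n {
                // Start of row `col` in the selected expert's weight matrix.
                let w_base = (expert_id * n + col) * k;
                let mut acc = 0.0f32;
                for kk in 0..k {
                    acc += dequant(weights[w_base + kk], scales[expert_id])
                        * input[token * k + kk];
                }
                output[(token * n_expert_used + slot) * n + col] = acc;
            }
        }
    }
    output
}
```

A reference like this can be useful for checking GPU output on small shapes, since each output element depends only on one row of the selected expert's weights and the token's input vector.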
Portions derived from candle-metal-kernels v0.10.2 (Apache-2.0). See src/shaders/quantized_matmul_id.metal for full attribution.
Structs§
- QuantizedMatmulIdParams - Parameters describing the expert-routed quantized matmul dimensions.
Functions§
- quantized_matmul_id - Encode an expert-routed quantized matrix multiplication onto the command encoder.