Module quantized_matmul_id

Expert-routed (MoE) quantized matrix-vector multiply dispatch.

Encodes a GPU compute command that performs, for each (token, expert-slot):

    expert_id = ids[token * n_expert_used + slot]
    output[token][slot][col] = sum_k(dequant(expert_weight[expert_id][col][k]) * input[token][k])
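The indexing above can be checked against a plain CPU loop. The following is a minimal sketch, not this crate's API: the function name `matmul_id_reference`, the flat row-major buffer layouts, and the use of `f32` weights that have already been dequantized are all assumptions made for illustration.

```rust
/// Hypothetical CPU reference for the expert-routed matmul.
///
/// Assumed layouts (not taken from the crate):
/// - `ids`:     [n_tokens * n_expert_used]
/// - `input`:   [n_tokens * k]
/// - `weights`: [n_experts * n * k], row-major, already dequantized to f32
/// - returns:   [n_tokens * n_expert_used * n]
pub fn matmul_id_reference(
    ids: &[u32],
    input: &[f32],
    weights: &[f32],
    n_tokens: usize,
    n_expert_used: usize,
    n: usize,
    k: usize,
) -> Vec<f32> {
    let mut output = vec![0.0f32; n_tokens * n_expert_used * n];
    for token in 0..n_tokens {
        // Input row for this token.
        let x = &input[token * k..(token + 1) * k];
        for slot in 0..n_expert_used {
            // Per-token expert selection through the ids buffer.
            let expert_id = ids[token * n_expert_used + slot] as usize;
            for col in 0..n {
                // Weight row `col` of the selected expert.
                let w_start = (expert_id * n + col) * k;
                let w = &weights[w_start..w_start + k];
                // Dot product over the shared dimension k.
                let acc: f32 = w.iter().zip(x).map(|(w, x)| w * x).sum();
                output[(token * n_expert_used + slot) * n + col] = acc;
            }
        }
    }
    output
}
```

A reference of this shape can be useful for comparing GPU results on small inputs, provided the real buffer layouts match the ones assumed here.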

This is the _id variant of quantized_matmul: same dequantization logic but with per-token expert selection via an ids buffer, enabling fused MoE dispatch.

Portions derived from candle-metal-kernels v0.10.2 (Apache-2.0). See src/shaders/quantized_matmul_id.metal for full attribution.

Structs

QuantizedMatmulIdParams: Parameters describing the expert-routed quantized matmul dimensions.

Functions

quantized_matmul_id: Encode an expert-routed quantized matrix multiplication onto the command encoder.