Quantized matrix multiplication host-side dispatch.
Encodes a GPU compute command that performs:
`output[row][col] = sum_k(dequant(weight[col][k]) * input[row][k])`
Weights are stored in packed quantized format (4-bit or 6-bit) with per-group bf16 scales and biases for affine dequantization.
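To make the per-group affine dequantization concrete, here is a hedged CPU reference sketch of the 4-bit path. The function name, the group size of 32, the nibble packing order, and the use of f32 in place of bf16 scales/biases are all illustrative assumptions, not this crate's API:

```rust
/// Hypothetical CPU reference for the kernel's math (not this crate's API).
/// `packed` holds two 4-bit quantized values per byte, column-major per the
/// formula above; `scales`/`biases` are per-group affine parameters
/// (bf16 on the GPU, modeled here as f32).
fn quantized_matmul_ref(
    input: &[f32],   // [rows * k]
    packed: &[u8],   // [cols * k/2]
    scales: &[f32],  // [cols * (k / GROUP)]
    biases: &[f32],  // [cols * (k / GROUP)]
    rows: usize,
    cols: usize,
    k: usize,
) -> Vec<f32> {
    const GROUP: usize = 32; // assumed quantization group size
    let groups = k / GROUP;
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            let mut acc = 0.0f32;
            for i in 0..k {
                let byte = packed[c * (k / 2) + i / 2];
                // assumed packing: low nibble = even index, high nibble = odd
                let q = (if i % 2 == 0 { byte & 0x0F } else { byte >> 4 }) as f32;
                let g = i / GROUP;
                // affine dequant: w = scale * q + bias
                let w = scales[c * groups + g] * q + biases[c * groups + g];
                acc += w * input[r * k + i];
            }
            out[r * cols + c] = acc;
        }
    }
    out
}
```

The GPU kernels below encode the same computation, with the inner `k` loop distributed across a SIMD group.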
Structs§
- QuantizedMatmulParams - Parameters describing the quantized matmul dimensions and format.
Functions§
- dispatch_quantized_matmul_simd_bf16 - Dispatch the bf16 I/O variant of the SIMD quantized matmul kernel.
- dispatch_quantized_matmul_simd_bf16_expert - Dispatch bf16 quantized matmul with an expert offset for MoE inference.
- quantized_matmul - Encode a quantized matrix multiplication onto the given command encoder.
- quantized_matmul_simd - Encode a quantized matrix-vector multiply using the SIMD-cooperative kernel that matches MLX's qmv_fast accumulation pattern exactly.