
Module quantized_matmul


Quantized matrix multiplication host-side dispatch.

Encodes a GPU compute command that performs: output[row][col] = sum_k(dequant(weight[col][k]) * input[row][k])

Weights are stored in packed quantized format (4-bit or 6-bit) with per-group bf16 scales and biases for affine dequantization.

Structs

QuantizedMatmulParams
Parameters describing the quantized matmul dimensions and format.

Functions

dispatch_quantized_matmul_simd_bf16
Dispatch the bf16 I/O variant of the SIMD quantized matmul kernel.
dispatch_quantized_matmul_simd_bf16_expert
Dispatch bf16 quantized matmul with expert offset for MoE inference.
quantized_matmul
Encode a quantized matrix multiplication onto the given command encoder.
quantized_matmul_simd
Encode a quantized matrix-vector multiply using the SIMD-cooperative kernel that matches MLX’s qmv_fast accumulation pattern exactly.