
Module quantized_matmul


Quantized matrix multiplication host-side dispatch.

Encodes a GPU compute command that performs: output[row][col] = sum_k(dequant(weight[col][k]) * input[row][k])

Weights are stored in packed quantized format (4-bit or 6-bit) with per-group bf16 scales and biases for affine dequantization.

Structs

QuantizedMatmulParams
Parameters describing the quantized matmul dimensions and format.

Functions

dispatch_quantized_matmul_simd_bf16
Dispatch the bf16 I/O variant of the SIMD quantized matmul kernel.
dispatch_quantized_matmul_simd_bf16_expert
Dispatch bf16 quantized matmul with expert offset for MoE inference.
quantized_matmul
Encode a quantized matrix multiplication onto the given command encoder.
quantized_matmul_simd
Encode a quantized matrix-vector multiply using the SIMD-cooperative kernel that matches MLX’s qmv_fast accumulation pattern exactly.