Skip to main content

Module marlin_matmul

Module marlin_matmul 

Source
Expand description

marlin_matmul (GPTQ INT4) op-diff harness — PARTIAL: planning stub.

The full op needs:

  • A: fp16 input [m, k]
  • B: packed INT4 weight in Marlin tile layout [k / pack_factor, n]
  • scales: fp16 [k / group_size, n]
  • zeros: int32 optional, [k / group_size, n / pack_factor]
  • g_idx: int32 optional, [k] for desc_act

Setup needs a Marlin packer that converts a reference fp32 weight matrix into the specific tile layout (pack_factor=8, tile_size=16, interleaved nibbles). The packer lives in ferrum-quantization / ferrum-kernels/quantization/gptq_marlin/ but isn’t exposed as a testkit-callable helper.

Reference impl: CPU backend’s gemm_quant for QuantKind::Gptq dequantizes the packed B back to fp32 then runs a regular sgemm. That’s what we’d compare CUDA’s hand-tuned Marlin kernel against.

Punted to follow-up: needs marlin_pack_fixture(fp32 weight) -> QuantWeights<B> helper that all backends agree on. Without it the test would be testing the PACKER not the matmul.

Structs§

MarlinMatmulOp