marlin_matmul (GPTQ INT4) op-diff harness — PARTIAL: planning stub.
The full op needs:
- A: fp16 input, [m, k]
- B: packed INT4 weight in Marlin tile layout, [k / pack_factor, n]
- scales: fp16, [k / group_size, n]
- zeros: int32, optional, [k / group_size, n / pack_factor]
- g_idx: int32, optional, [k], for desc_act
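The operand shapes listed above can be derived from m, k, n and group_size. The following is a minimal sketch, assuming pack_factor = 8 (eight INT4 values per int32) and that k and n divide evenly. `MarlinShapes` and `marlin_shapes` are hypothetical names used for illustration and are not part of the ferrum API.

```rust
/// Hypothetical shape summary for the marlin_matmul operands.
/// Illustration only; not a type from the ferrum crates.
#[derive(Debug, PartialEq)]
pub struct MarlinShapes {
    pub a: [usize; 2],      // fp16 input, [m, k]
    pub b: [usize; 2],      // packed INT4 weight, [k / pack_factor, n]
    pub scales: [usize; 2], // fp16, [k / group_size, n]
    pub zeros: [usize; 2],  // int32, optional, [k / group_size, n / pack_factor]
    pub g_idx: [usize; 1],  // int32, optional, [k]
}

/// Computes the operand shapes, or returns None when the dimensions
/// do not divide evenly (an assumption of this sketch).
pub fn marlin_shapes(m: usize, k: usize, n: usize, group_size: usize) -> Option<MarlinShapes> {
    const PACK_FACTOR: usize = 8; // eight 4-bit values per 32-bit word
    if group_size == 0
        || k % PACK_FACTOR != 0
        || n % PACK_FACTOR != 0
        || k % group_size != 0
    {
        return None;
    }
    Some(MarlinShapes {
        a: [m, k],
        b: [k / PACK_FACTOR, n],
        scales: [k / group_size, n],
        zeros: [k / group_size, n / PACK_FACTOR],
        g_idx: [k],
    })
}
```

For example, m = 4, k = 256, n = 128 and group_size = 128 would give B as [32, 128], scales as [2, 128] and zeros as [2, 16].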
Setup needs a Marlin packer that converts a reference fp32 weight
matrix into the specific tile layout (pack_factor=8, tile_size=16,
interleaved nibbles). The packer lives in ferrum-quantization /
ferrum-kernels/quantization/gptq_marlin/ but isn’t exposed as a
testkit-callable helper.
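To make pack_factor = 8 concrete, here is a minimal sketch of packing eight 4-bit values into one 32-bit word in plain sequential order. It deliberately does not reproduce the Marlin tile layout or its nibble interleave, which only the real packer defines. The function names are hypothetical, and u32 is used in place of int32 to keep the bit manipulation simple.

```rust
/// Packs eight 4-bit values into one u32, lowest nibble first.
/// Plain sequential order; this is NOT the Marlin interleaved layout.
pub fn pack_nibbles(vals: [u8; 8]) -> u32 {
    vals.iter()
        .enumerate()
        // Mask each value to 4 bits, then shift it into nibble slot i.
        .fold(0u32, |acc, (i, &v)| acc | (u32::from(v & 0xF) << (4 * i)))
}

/// Inverse of pack_nibbles: extracts the eight 4-bit values.
pub fn unpack_nibbles(word: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((word >> (4 * i)) & 0xF) as u8;
    }
    out
}
```

A round trip through these two functions returns the original nibbles, which is the property a shared packing fixture would also need to guarantee for the real layout.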
Reference impl: CPU backend’s gemm_quant for QuantKind::Gptq
dequantizes the packed B back to fp32 then runs a regular sgemm.
That’s what we’d compare CUDA’s hand-tuned Marlin kernel against.
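The dequantize-then-sgemm reference path can be sketched as follows. This is not the actual gemm_quant implementation. It assumes a plain GPTQ-style packing (nibble i of the word at row r holds the weight at k = r * 8 + i) rather than the Marlin tile layout, symmetric quantization with an implicit zero point of 8 (no zeros tensor, no g_idx), and f32 scales. `dequant_b` and `matmul_ref` are hypothetical names.

```rust
const PACK_FACTOR: usize = 8; // eight INT4 values per 32-bit word

/// Dequantizes B to row-major fp32 [k, n].
/// Assumptions of this sketch: qweight is row-major [k / 8, n] in plain
/// sequential nibble order, scales is row-major [k / group_size, n],
/// and the zero point is fixed at 8.
pub fn dequant_b(qweight: &[u32], scales: &[f32], k: usize, n: usize, group_size: usize) -> Vec<f32> {
    assert_eq!(qweight.len(), k / PACK_FACTOR * n);
    assert_eq!(scales.len(), k / group_size * n);
    let mut w = vec![0.0f32; k * n];
    for row in 0..k {
        for col in 0..n {
            let word = qweight[(row / PACK_FACTOR) * n + col];
            // Extract the 4-bit value for this row from its word.
            let q = ((word >> (4 * (row % PACK_FACTOR))) & 0xF) as i32;
            let scale = scales[(row / group_size) * n + col];
            w[row * n + col] = (q - 8) as f32 * scale;
        }
    }
    w
}

/// Naive row-major matmul: C[m, n] = A[m, k] * W[k, n].
pub fn matmul_ref(a: &[f32], w: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(w.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += a_ip * w[p * n + j];
            }
        }
    }
    c
}
```

An op-diff test would then compare the output of `matmul_ref` on the dequantized weights against the kernel under test, within a tolerance suited to fp16 accumulation.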
Punted to follow-up: needs a marlin_pack_fixture(fp32 weight) -> QuantWeights<B>
helper that all backends agree on. Without it, the test would be
testing the PACKER, not the matmul.