ferrum_testkit/op_diff/marlin_matmul.rs

//! `marlin_matmul` (GPTQ INT4) op-diff harness. **PARTIAL: planning stub.**
//!
//! The full op takes:
//!   - A: fp16 input `[m, k]`
//!   - B: packed INT4 weight in Marlin tile layout `[k / pack_factor, n]`
//!   - scales: fp16 `[k / group_size, n]`
//!   - zeros: optional int32, `[k / group_size, n / pack_factor]`
//!   - g_idx: optional int32, `[k]`, used with `desc_act`
//!
//! Setup needs a Marlin packer that converts a reference fp32 weight
//! matrix into the kernel's tile layout (`pack_factor=8`, `tile_size=16`,
//! interleaved nibbles). The packer lives in `ferrum-quantization` /
//! `ferrum-kernels/quantization/gptq_marlin/` but isn't exposed as a
//! testkit-callable helper.
//!
//! Reference impl: the CPU backend's `gemm_quant` for `QuantKind::Gptq`
//! dequantizes the packed B back to fp32, then runs a regular sgemm.
//! That is the baseline CUDA's hand-tuned Marlin kernel would be
//! compared against.
//!
//! Deferred to a follow-up: this needs a `marlin_pack_fixture(fp32 weight) ->
//! QuantWeights<B>` helper that all backends agree on. Without it,
//! the test would exercise the packer rather than the matmul.

#![allow(dead_code)]

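/// Problem size for one `marlin_matmul` op-diff case.
pub struct MarlinMatmulOp {
    /// Rows of the fp16 input A (`[m, k]`).
    pub m: usize,
    /// Columns of the weight B and of the scales.
    pub n: usize,
    /// Shared inner dimension of A and B.
    pub k: usize,
    /// Number of rows along `k` that share one row of scales.
    pub group_size: usize,
}
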
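The operand list in the module docs can be turned into a small shape calculator. The following is a minimal sketch, not part of the testkit: `OperandShapes`, `operand_shapes`, and the `PACK_FACTOR` constant are hypothetical names introduced here, and the shapes are taken verbatim from the doc comment above (including the `[k / pack_factor, n]` shape for B).

```rust
/// Assumption: eight 4-bit values per 32-bit word (`pack_factor=8`).
pub const PACK_FACTOR: usize = 8;

/// Hypothetical container for the operand shapes listed in the module docs.
#[derive(Debug, PartialEq, Eq)]
pub struct OperandShapes {
    pub a: [usize; 2],
    pub b_packed: [usize; 2],
    pub scales: [usize; 2],
    pub zeros: [usize; 2],
    pub g_idx: usize,
}

/// Hypothetical helper: derive operand shapes from the problem size.
/// Returns `None` when the dimensions do not divide evenly.
pub fn operand_shapes(
    m: usize,
    n: usize,
    k: usize,
    group_size: usize,
) -> Option<OperandShapes> {
    if group_size == 0
        || k % group_size != 0
        || k % PACK_FACTOR != 0
        || n % PACK_FACTOR != 0
    {
        return None;
    }
    Some(OperandShapes {
        a: [m, k],
        b_packed: [k / PACK_FACTOR, n],
        scales: [k / group_size, n],
        zeros: [k / group_size, n / PACK_FACTOR],
        g_idx: k,
    })
}
```

Returning `Option` rather than panicking lets a harness skip problem sizes that the packed layout cannot represent.
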
// impl OpUnderTest for MarlinMatmulOp: pending the marlin_pack_fixture helper.
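The "dequantize, then run a regular GEMM" reference path described in the module docs can be sketched as follows. This is an illustration under loud assumptions, not the `gemm_quant` implementation: it uses a plain, non-interleaved packing (nibble `r % 8` of word row `r / 8`) rather than the Marlin tile layout, assumes symmetric quantization with a fixed zero point of 8, and stores scales as fp32. The names `pack_rows`, `dequantize`, and `matmul` are hypothetical.

```rust
/// Assumption: eight 4-bit values per 32-bit word.
const PACK_FACTOR: usize = 8;
/// Assumption: symmetric quantization, stored nibble 8 represents 0.0.
const ZERO_POINT: i32 = 8;

/// Pack quantized values `q` (`[k, n]` row-major, each in 0..=15) into
/// `[k / 8, n]` words; row `r` lands in nibble `r % 8` of word row `r / 8`.
/// This is a plain layout, NOT the interleaved Marlin tile layout.
pub fn pack_rows(q: &[u8], k: usize, n: usize) -> Vec<u32> {
    assert_eq!(q.len(), k * n);
    assert_eq!(k % PACK_FACTOR, 0);
    let mut packed = vec![0u32; (k / PACK_FACTOR) * n];
    for r in 0..k {
        for c in 0..n {
            let nibble = u32::from(q[r * n + c] & 0xF);
            packed[(r / PACK_FACTOR) * n + c] |= nibble << (4 * (r % PACK_FACTOR));
        }
    }
    packed
}

/// Dequantize packed weights back to fp32 `[k, n]`.
/// `scales` is `[k / group_size, n]` row-major.
pub fn dequantize(
    packed: &[u32],
    scales: &[f32],
    k: usize,
    n: usize,
    group_size: usize,
) -> Vec<f32> {
    let mut w = vec![0.0f32; k * n];
    for r in 0..k {
        for c in 0..n {
            let word = packed[(r / PACK_FACTOR) * n + c];
            let q = ((word >> (4 * (r % PACK_FACTOR))) & 0xF) as i32;
            let scale = scales[(r / group_size) * n + c];
            w[r * n + c] = (q - ZERO_POINT) as f32 * scale;
        }
    }
    w
}

/// Naive fp32 GEMM: `a` is `[m, k]`, `w` is `[k, n]`, result is `[m, n]`.
pub fn matmul(a: &[f32], w: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += a_ip * w[p * n + j];
            }
        }
    }
    out
}
```

Because packing and dequantization here share one layout, a round trip through `pack_rows` and `dequantize` checks only that pair. That is the concern raised in the module docs: a matmul comparison is meaningful only once every backend consumes the same packed fixture.
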