GPTQ linear projection.
GPTQ quantizes f16 weights to int4 in groups, with every weight in a group sharing one scale and one zero point. On-disk layout, as written by AutoGPTQ / GPTQ-for-LLaMa:
qweight: [in_features / 8, out_features] i32 — 8 int4s per int32
qzeros: [in_features / group_size, out_features / 8] i32
scales: [in_features / group_size, out_features] f16
g_idx: [in_features] i32 — per-row scale-group map (desc_act only)
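As a rough illustration of how the four tensors above combine, the following sketch dequantizes a single weight. The struct `GptqTensors`, the functions `unpack_int4` and `dequant_at`, and the `zero_offset` field are hypothetical names introduced here, not part of this crate. It assumes low-nibble-first packing, row-major storage, and scales already widened to f32. Some checkpoints are reported to store the zero point minus one, so the offset is left as a parameter rather than fixed.

```rust
/// Hypothetical view over the on-disk tensors, all row-major.
/// `scales` is widened from f16 to f32 to keep the sketch dependency-free.
struct GptqTensors<'a> {
    qweight: &'a [i32],        // [in_features / 8, out_features]
    qzeros: &'a [i32],         // [in_features / group_size, out_features / 8]
    scales: &'a [f32],         // [in_features / group_size, out_features]
    g_idx: Option<&'a [i32]>,  // [in_features], present only with desc_act
    out_features: usize,
    group_size: usize,
    zero_offset: i32,          // assumption: 0, or 1 if zeros were stored minus one
}

/// Extract the `idx`-th 4-bit value (0..8) from a packed word, low nibble first.
fn unpack_int4(word: i32, idx: usize) -> u8 {
    (((word as u32) >> (4 * idx)) & 0xF) as u8
}

/// Dequantize the weight at input row `i`, output column `o`.
fn dequant_at(t: &GptqTensors, i: usize, o: usize) -> f32 {
    // With desc_act the group comes from g_idx; otherwise rows are grouped contiguously.
    let group = match t.g_idx {
        Some(map) => map[i] as usize,
        None => i / t.group_size,
    };
    // qweight packs eight consecutive input rows into one i32.
    let q = unpack_int4(t.qweight[(i / 8) * t.out_features + o], i % 8) as i32;
    // qzeros packs eight consecutive output columns into one i32.
    let z_word = t.qzeros[group * (t.out_features / 8) + o / 8];
    let z = unpack_int4(z_word, o % 8) as i32 + t.zero_offset;
    t.scales[group * t.out_features + o] * (q - z) as f32
}
```

Note that the two packed tensors are packed along different axes: `qweight` along `in_features`, `qzeros` along `out_features`, which is why the nibble index is `i % 8` in one case and `o % 8` in the other.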
GptqLinear<B> stores a backend-specific B::GptqStore produced by
Backend::load_gptq. The store holds whatever format the backend
needs (CUDA: Marlin-repacked tiles; CPU: dequantized f32 weights;
Metal: unsupported).
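The backend-owned store can be pictured with an associated type on the backend trait. The sketch below is illustrative only: `SketchBackend`, `RawGptq`, `CpuSketch`, `MetalSketch`, and `GptqLinearSketch` are hypothetical stand-ins, and the `load_gptq` signature shown is an assumption rather than this crate's actual one. The CPU path dequantizes eagerly, ignoring `g_idx` and assuming the zero point is stored as is; the unsupported backend returns an error.

```rust
/// Hypothetical raw checkpoint tensors (scales widened to f32 for this sketch).
struct RawGptq {
    qweight: Vec<i32>,
    qzeros: Vec<i32>,
    scales: Vec<f32>,
    in_features: usize,
    out_features: usize,
    group_size: usize,
}

/// Stand-in for the backend trait: each backend picks its own store type.
trait SketchBackend {
    type GptqStore;
    fn load_gptq(raw: &RawGptq) -> Result<Self::GptqStore, String>;
}

struct CpuSketch;
struct MetalSketch;

impl SketchBackend for CpuSketch {
    /// Row-major [in_features, out_features] dequantized weights.
    type GptqStore = Vec<f32>;

    fn load_gptq(raw: &RawGptq) -> Result<Vec<f32>, String> {
        // Pull the k-th 4-bit value out of a packed word, low nibble first.
        let nib = |w: i32, k: usize| (((w as u32) >> (4 * k)) & 0xF) as i32;
        let mut out = Vec::with_capacity(raw.in_features * raw.out_features);
        for i in 0..raw.in_features {
            let g = i / raw.group_size; // contiguous groups; no g_idx here
            for o in 0..raw.out_features {
                let q = nib(raw.qweight[(i / 8) * raw.out_features + o], i % 8);
                let z = nib(raw.qzeros[g * (raw.out_features / 8) + o / 8], o % 8);
                out.push(raw.scales[g * raw.out_features + o] * (q - z) as f32);
            }
        }
        Ok(out)
    }
}

impl SketchBackend for MetalSketch {
    type GptqStore = ();

    fn load_gptq(_raw: &RawGptq) -> Result<(), String> {
        Err("GPTQ is unsupported on this backend".to_string())
    }
}

/// Layer wrapper that owns whatever store its backend produced.
struct GptqLinearSketch<B: SketchBackend> {
    store: B::GptqStore,
}

impl<B: SketchBackend> GptqLinearSketch<B> {
    fn load(raw: &RawGptq) -> Result<Self, String> {
        Ok(Self { store: B::load_gptq(raw)? })
    }
}
```

Keeping the store opaque behind the associated type means the layer code never needs to know whether it holds repacked tiles or plain f32 weights; only the backend's own matmul path interprets it.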