Module gptq


GPTQ linear projection.

GPTQ quantizes f16 weights to int4 in groups, with every group sharing one scale and one zero_point. The on-disk layout written by AutoGPTQ / gptq-for-llama is:

qweight: [in_features / 8, out_features] i32 (8 int4s per int32)
qzeros:  [in_features / group_size, out_features / 8] i32
scales:  [in_features / group_size, out_features] f16
g_idx:   [in_features] i32 (per-row scale-group map, desc_act only)
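To make the layout concrete, here is a minimal sketch of recovering a single weight from the packed tensors. It is not this module's implementation: `unpack_int4` and `dequant_weight` are hypothetical helpers. It assumes the lowest nibble of each word holds the first element, that tensors are row-major, that scales were already widened from f16 to f32, and that the stored zero point is used as-is (some checkpoint writers apply an offset to the stored zeros, which this sketch ignores).

```rust
/// Extracts the `i`-th 4-bit value (i in 0..8) from a packed 32-bit word.
/// Assumption: element 0 lives in the lowest nibble.
fn unpack_int4(word: i32, i: usize) -> u8 {
    (((word as u32) >> (4 * i)) & 0xF) as u8
}

/// Dequantizes the weight at (row, col) of the logical
/// [in_features, out_features] matrix. Hypothetical helper.
fn dequant_weight(
    qweight: &[i32],       // [in_features / 8, out_features], row-major
    qzeros: &[i32],        // [in_features / group_size, out_features / 8], row-major
    scales: &[f32],        // [in_features / group_size, out_features], widened from f16
    g_idx: Option<&[i32]>, // [in_features], present only with desc_act
    group_size: usize,
    out_features: usize,
    row: usize,
    col: usize,
) -> f32 {
    // With desc_act the group comes from g_idx; otherwise rows are grouped contiguously.
    let group = match g_idx {
        Some(map) => map[row] as usize,
        None => row / group_size,
    };
    // qweight packs 8 consecutive input rows into one word.
    let q = unpack_int4(qweight[(row / 8) * out_features + col], row % 8);
    // qzeros packs 8 consecutive output columns into one word.
    let z = unpack_int4(qzeros[group * (out_features / 8) + col / 8], col % 8);
    let scale = scales[group * out_features + col];
    (q as f32 - z as f32) * scale
}
```

Note the asymmetry: qweight is packed along the input dimension, while qzeros is packed along the output dimension, which is why the two lookups index with `row % 8` and `col % 8` respectively.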

GptqLinear<B> stores a backend-specific B::GptqStore produced by Backend::load_gptq. The store holds whatever format the backend needs (CUDA: Marlin-repacked tiles; CPU: dequantized f32 weights; Metal: unsupported).
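The associated-type pattern described above can be sketched as follows. The names `Backend`, `GptqStore`, `load_gptq`, and `GptqLinear` come from this page, but every signature, field, and the `RawGptq` container below is an assumption made for illustration; the crate's real definitions may differ. The CPU path also ignores g_idx for brevity.

```rust
// Hypothetical sketch: all signatures and fields here are assumptions.

/// Raw GPTQ tensors as read from a checkpoint (hypothetical container).
struct RawGptq {
    qweight: Vec<i32>, // [in_features / 8, out_features]
    qzeros: Vec<i32>,  // [in_features / group_size, out_features / 8]
    scales: Vec<f32>,  // [in_features / group_size, out_features], widened from f16
    group_size: usize,
    in_features: usize,
    out_features: usize,
}

trait Backend {
    /// Whatever weight format this backend wants to keep.
    type GptqStore;
    fn load_gptq(raw: &RawGptq) -> Result<Self::GptqStore, String>;
}

struct GptqLinear<B: Backend> {
    store: B::GptqStore,
}

impl<B: Backend> GptqLinear<B> {
    fn load(raw: &RawGptq) -> Result<Self, String> {
        Ok(Self { store: B::load_gptq(raw)? })
    }
}

/// CPU stand-in: keeps fully dequantized f32 weights.
struct Cpu;
impl Backend for Cpu {
    type GptqStore = Vec<f32>;
    fn load_gptq(raw: &RawGptq) -> Result<Vec<f32>, String> {
        // Assumes the lowest nibble holds the first packed element.
        let nib = |w: i32, i: usize| (((w as u32) >> (4 * i)) & 0xF) as f32;
        let n = raw.out_features;
        let mut w = Vec::with_capacity(raw.in_features * n);
        for row in 0..raw.in_features {
            let g = row / raw.group_size;
            for col in 0..n {
                let q = nib(raw.qweight[(row / 8) * n + col], row % 8);
                let z = nib(raw.qzeros[g * (n / 8) + col / 8], col % 8);
                w.push((q - z) * raw.scales[g * n + col]);
            }
        }
        Ok(w)
    }
}

/// Metal stand-in: reports GPTQ as unsupported.
struct Metal;
impl Backend for Metal {
    type GptqStore = ();
    fn load_gptq(_raw: &RawGptq) -> Result<(), String> {
        Err("GPTQ is not supported on the Metal backend".to_string())
    }
}
```

Because the store type is chosen by the backend, `GptqLinear<Cpu>` and `GptqLinear<Metal>` are distinct types, and code generic over `B: Backend` never needs to know which weight format it is holding.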

Structs

GptqLinear