Module quant_linear

QuantLinear<B>: keeps Q4_K_M (or, in future, other k-quant) weights quantised in backend memory and dequantises them on demand, once per forward call.

Contrast with GgufLinear<B>, which eagerly dequantises Q4_K_M to fp32/fp16 at load time. That eager path inflates an 8B model from ~5 GB on disk to 16-32 GB in RAM (16 GB as fp16, 32 GB as fp32). A footprint of that size is expected for safetensors-fp16 sources, but it is wasteful for GGUF Q4_K_M and a non-starter for 30B-A3B on a 32 GB Mac.
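The sizes quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical and not part of this crate. It assumes the standard Q4_K layout of 144 bytes per 256-weight super-block (4.5 bits per weight). Q4_K_M files keep some tensors at higher precision, so real files come out somewhat larger than the pure-Q4_K figure, which is consistent with the ~5 GB quoted for an 8B model.

```rust
/// Bytes in one Q4_K super-block of 256 weights: 4 B for two fp16 scale
/// factors, 12 B of packed 6-bit sub-block scales/mins, 128 B of 4-bit quants.
const Q4_K_BLOCK_BYTES: u64 = 144;
const Q4_K_BLOCK_WEIGHTS: u64 = 256;

/// Approximate size in bytes of `n_params` weights stored as pure Q4_K.
/// (Hypothetical helper, for illustration only.)
fn q4_k_bytes(n_params: u64) -> u64 {
    // Round up to a whole number of super-blocks.
    let blocks = (n_params + Q4_K_BLOCK_WEIGHTS - 1) / Q4_K_BLOCK_WEIGHTS;
    blocks * Q4_K_BLOCK_BYTES
}

/// Size in bytes of the same weights stored densely
/// (`bytes_per_weight` = 2 for fp16, 4 for fp32).
fn dense_bytes(n_params: u64, bytes_per_weight: u64) -> u64 {
    n_params * bytes_per_weight
}
```

For 8 billion parameters, `q4_k_bytes` gives 4.5 GB, while `dense_bytes` gives 16 GB at fp16 and 32 GB at fp32.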

The Q4 → fp16 conversion happens inside Backend::gemm_q4_k, which writes into a transient buffer that is freed after the matmul. The memory footprint is therefore the on-disk Q4 size plus a per-call transient of roughly one weight matrix’s worth of fp16.
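To make the dequantise-then-matmul flow concrete, here is a minimal CPU sketch. It is not the Backend::gemm_q4_k implementation. `ToyBlock`, `dequantize` and `toy_gemv_q4` are hypothetical names, the block format is a simplified 32-weight scheme rather than the real Q4_K super-block layout, and the transient buffer is f32 because stable Rust has no built-in fp16 type.

```rust
/// Simplified 4-bit block: 32 weights sharing one scale and one minimum.
/// NOTE: illustrative only; this is NOT the real Q4_K super-block layout.
struct ToyBlock {
    scale: f32,
    min: f32,
    quants: [u8; 16], // two 4-bit values per byte, low nibble first
}

const BLOCK: usize = 32;

/// Expand quantised blocks into a dense buffer: w = scale * q + min.
fn dequantize(blocks: &[ToyBlock]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK);
    for b in blocks {
        for &byte in &b.quants {
            out.push(b.scale * (byte & 0x0F) as f32 + b.min);
            out.push(b.scale * (byte >> 4) as f32 + b.min);
        }
    }
    out
}

/// Computes y = W x, where W is `rows` x `cols`, row-major, stored quantised.
/// Hypothetical stand-in for the behaviour the text attributes to `gemm_q4_k`.
fn toy_gemv_q4(blocks: &[ToyBlock], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(blocks.len() * BLOCK, rows * cols);
    assert_eq!(x.len(), cols);
    // Transient dense copy of the weights: it exists only for this call.
    let dense = dequantize(blocks);
    let y: Vec<f32> = dense
        .chunks(cols)
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
        .collect();
    // `dense` is dropped on return; only the quantised blocks stay resident.
    y
}
```

The resident cost is the packed `quants` plus per-block metadata. The dense copy is allocated and released inside a single call, which mirrors the footprint described above.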

Phase 1D scope: direct (un-fused) Q4_K_M projections only, such as o_proj, down_proj, lm_head and embed_tokens. Fused projections (qkv_proj, gate_up_proj) still fall through to GgufLinear’s eager-dequant path; the loader’s split-fusion logic already concatenates the dequantised parts into one dense weight.
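The scope rule above amounts to a routing decision at load time. The sketch below only illustrates that rule. `LoadPath` and `load_path` are hypothetical names, the weight is assumed to be Q4_K_M already, and the real loader may detect fused projections by some means other than matching on the name.

```rust
/// Which load path a Q4_K_M projection takes under the Phase 1D scope.
#[derive(Debug, PartialEq)]
enum LoadPath {
    /// Weight stays quantised in backend memory (the QuantLinear path).
    Quantised,
    /// Parts are dequantised and concatenated into one dense weight
    /// (the GgufLinear eager path).
    EagerDense,
}

/// Hypothetical helper: classify a projection by name.
/// Assumes the underlying weight is Q4_K_M.
fn load_path(proj_name: &str) -> LoadPath {
    // The two fused projections named in the text; anything else is
    // treated as a direct projection.
    if matches!(proj_name, "qkv_proj" | "gate_up_proj") {
        LoadPath::EagerDense
    } else {
        LoadPath::Quantised
    }
}
```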

Structs

QuantLinear: Linear projection backed by a GGUF k-quant weight kept quantised in backend memory.