QuantLinear<B>: keeps Q4_K_M (or future k-quant) weights quantised
in backend memory and dequantises them on demand, once per forward call.
Contrast with GgufLinear<B>, which eagerly dequantises Q4_K_M to
fp32/fp16 at load time. That eager path inflates an 8B model from
~5 GB on disk to 16-32 GB in RAM (16 GB as fp16, 32 GB as fp32). That
is fine for safetensors-fp16 sources, which are already that size, but
wasteful for GGUF Q4_K_M and a non-starter for 30B-A3B on a
32 GB Mac.
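The sizes quoted above follow from simple bits-per-weight arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the ~4.85 bits per weight used for Q4_K_M is an approximate average assumed here, not a figure taken from this crate.

```rust
/// Approximate bytes needed to store `params` weights at `bits_per_weight`.
/// Hypothetical helper for illustration; not part of this crate.
fn weight_bytes(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0
}

/// Convert bytes to decimal gigabytes.
fn gb(bytes: f64) -> f64 {
    bytes / 1e9
}

/// Footprints for a model with `params` weights:
/// (Q4_K_M on disk, fp16 dense, fp32 dense), all in GB.
/// ASSUMPTION: Q4_K_M averages roughly 4.85 bits per weight.
fn footprints_gb(params: f64) -> (f64, f64, f64) {
    (
        gb(weight_bytes(params, 4.85)),
        gb(weight_bytes(params, 16.0)),
        gb(weight_bytes(params, 32.0)),
    )
}
```

Under those assumptions, `footprints_gb(8e9)` gives roughly 4.85 GB, 16 GB, and 32 GB, which lines up with the "~5 GB on disk to 16-32 GB in RAM" range.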
The Q4 → fp16 conversion happens inside Backend::gemm_q4_k, into a
transient buffer that’s freed after the matmul. The memory footprint is
therefore the on-disk Q4 size plus a per-call transient of roughly one
weight matrix’s worth of fp16.
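The dequant-on-demand pattern can be sketched with a toy 4-bit format. This is not the real Q4_K block layout and not the body of Backend::gemm_q4_k: the types and functions below are hypothetical, the block format is a simplified scale-plus-minimum scheme, and the transient buffer is f32 rather than fp16 because stable Rust has no built-in half-precision type.

```rust
/// Weights per toy block. ASSUMPTION: simplified format, not the real Q4_K layout.
const BLOCK: usize = 32;

/// One block: 32 four-bit codes sharing a scale and a minimum.
struct ToyQ4Block {
    scale: f32,
    min: f32,
    codes: [u8; BLOCK / 2], // two codes per byte, low nibble first
}

/// Row-major quantised matrix; `cols` must be a multiple of BLOCK.
struct ToyQ4Matrix {
    rows: usize,
    cols: usize,
    blocks: Vec<ToyQ4Block>,
}

/// Quantise a dense row-major matrix into toy 4-bit blocks.
fn quantise(weights: &[f32], rows: usize, cols: usize) -> ToyQ4Matrix {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(cols % BLOCK, 0);
    let blocks = weights
        .chunks(BLOCK)
        .map(|chunk| {
            let min = chunk.iter().cloned().fold(f32::INFINITY, f32::min);
            let max = chunk.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            // A constant block has no range; any non-zero scale works.
            let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
            let mut codes = [0u8; BLOCK / 2];
            for (i, &w) in chunk.iter().enumerate() {
                let q = (((w - min) / scale).round() as u8).min(15);
                codes[i / 2] |= q << (4 * (i % 2));
            }
            ToyQ4Block { scale, min, codes }
        })
        .collect();
    ToyQ4Matrix { rows, cols, blocks }
}

/// Expand every block into a freshly allocated dense buffer.
fn dequantise(m: &ToyQ4Matrix) -> Vec<f32> {
    let mut out = Vec::with_capacity(m.rows * m.cols);
    for b in &m.blocks {
        for i in 0..BLOCK {
            let q = (b.codes[i / 2] >> (4 * (i % 2))) & 0x0F;
            out.push(b.min + b.scale * q as f32);
        }
    }
    out
}

/// y = W x. The dense copy of W exists only for the duration of this call.
fn gemv_toy_q4(m: &ToyQ4Matrix, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), m.cols);
    let dense = dequantise(m); // transient buffer
    let y: Vec<f32> = dense
        .chunks(m.cols)
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
        .collect();
    y // `dense` is dropped here; only the packed blocks stay resident
}
```

Between calls, only `ToyQ4Matrix` is held in memory. The dense buffer is allocated and dropped inside `gemv_toy_q4`, which mirrors the "on-disk size plus one transient matrix" footprint.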
Phase 1D scope: direct (un-fused) Q4_K_M projections only, such as
o_proj, down_proj, lm_head, and embed_tokens. Fused
projections (qkv_proj, gate_up_proj) still fall through to
GgufLinear’s eager-dequant path; the loader’s split-fusion logic
already concatenates the dequantised parts into one dense weight.
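The scope rule can be read as a small dispatch decision made per projection. The sketch below is a guess at the shape of that decision, not the loader's actual code: `LinearPath`, `choose_path`, and the name-matching strategy are all hypothetical, and sending non-Q4_K_M weights down the eager path is an assumption.

```rust
/// Which implementation backs a projection. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum LinearPath {
    /// Weights stay quantised; dequantised per forward call (QuantLinear).
    QuantisedOnDemand,
    /// Weights dequantised to dense at load time (GgufLinear).
    EagerDequant,
}

/// Fused projections named in the scope note above.
const FUSED: [&str; 2] = ["qkv_proj", "gate_up_proj"];

/// Pick a path from the tensor name and whether its weight is Q4_K_M.
/// ASSUMPTION: anything that is fused, or not Q4_K_M, takes the eager path.
fn choose_path(tensor_name: &str, is_q4_k_m: bool) -> LinearPath {
    let fused = FUSED.iter().any(|f| tensor_name.contains(f));
    if is_q4_k_m && !fused {
        LinearPath::QuantisedOnDemand
    } else {
        LinearPath::EagerDequant
    }
}
```

For instance, `choose_path("o_proj", true)` selects the on-demand path, while `choose_path("qkv_proj", true)` selects the eager one.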
Structs
- QuantLinear - Linear projection backed by a GGUF k-quant weight kept quantised in backend memory.