Block-quantized K/V cache — store decode-time history as q8_0,
q4_0, or q5_0 GGUF-encoded blocks instead of f32/f16. Memory
saving vs f16 is roughly:
| scheme | bits/elem | ratio vs f16 |
|---|---|---|
| f16 | 16 | 1.0× |
| q8_0 | 8.5 | 0.53× |
| q5_0 | 5.5 | 0.34× |
| q4_0 | 4.5 | 0.28× |
Trade-off: quantization adds noise to attention scores. q8_0 is near-lossless for most decoder LMs; q4_0 typically costs ~0.3 perplexity for roughly 3.6× memory savings vs f16 (16 / 4.5 bits per element).
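The bits/elem column follows from the GGUF block encodings: each 32-element block carries a 2-byte f16 scale plus its packed values. The sketch below reproduces that arithmetic. The `Scheme` enum and its methods are hypothetical names used for illustration only, not this crate's API.

```rust
/// Elements per block for q8_0, q5_0 and q4_0.
const BLOCK_ELEMS: usize = 32;

/// Hypothetical stand-in for the cache's scheme enum (illustration only).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Scheme {
    Q8_0,
    Q5_0,
    Q4_0,
}

impl Scheme {
    /// Bytes per 32-element block: a 2-byte f16 scale plus packed values.
    fn block_bytes(self) -> usize {
        match self {
            Scheme::Q8_0 => 2 + 32,     // scale + 32 signed bytes
            Scheme::Q5_0 => 2 + 4 + 16, // scale + 32 high bits + 32 low nibbles
            Scheme::Q4_0 => 2 + 16,     // scale + 32 nibbles
        }
    }

    /// Average storage cost per element, in bits.
    fn bits_per_elem(self) -> f64 {
        (self.block_bytes() * 8) as f64 / BLOCK_ELEMS as f64
    }

    /// Size relative to an f16 cache (16 bits per element).
    fn ratio_vs_f16(self) -> f64 {
        self.bits_per_elem() / 16.0
    }
}
```

For example, `Scheme::Q4_0.ratio_vs_f16()` evaluates to 0.28125, which the table rounds to 0.28×.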
§Layout per layer
Each layer’s K and V buffer is a flat Vec<u8> of past_len
quantized rows. Every “row” is kv_dim f32 elements when
dequantized; rows are stored back-to-back. kv_dim must be a
multiple of the scheme’s block size (32 for all three schemes).
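Because rows are fixed-size and stored back-to-back, locating a row in the flat buffer is plain stride arithmetic. The following is a minimal sketch of that indexing, assuming the caller passes the scheme's block size in bytes (34 for q8_0, 22 for q5_0, 18 for q4_0). `row_bytes` and `row_window` are hypothetical helpers, not functions exported by this module.

```rust
use std::ops::Range;

/// Elements per block for all three supported schemes.
const BLOCK_ELEMS: usize = 32;

/// Bytes occupied by one quantized row, or `None` when `kv_dim`
/// is zero or not a multiple of the block size.
fn row_bytes(kv_dim: usize, block_bytes: usize) -> Option<usize> {
    if kv_dim == 0 || kv_dim % BLOCK_ELEMS != 0 {
        return None;
    }
    Some(kv_dim / BLOCK_ELEMS * block_bytes)
}

/// Byte range covering rows `start..start + count` of a flat row buffer.
/// Returns `None` on an invalid `kv_dim` or on arithmetic overflow.
fn row_window(
    kv_dim: usize,
    block_bytes: usize,
    start: usize,
    count: usize,
) -> Option<Range<usize>> {
    let stride = row_bytes(kv_dim, block_bytes)?;
    let begin = start.checked_mul(stride)?;
    let len = count.checked_mul(stride)?;
    let end = begin.checked_add(len)?;
    Some(begin..end)
}
```

With `kv_dim = 1024` and q8_0 blocks, one row takes 32 × 34 = 1088 bytes, so rows 2 through 4 occupy bytes 2176..5440 of the buffer.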
On read, callers materialize a window of rows to f32 via
[dequant_rows]. On write, freshly produced f32 K/V is quantized
one row at a time via [quant_rows] before being appended. The
quantization wrappers route to the rlx_gguf::quantize /
dequant_* kernels for parity with on-disk GGUF blocks.
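To make the write/read round trip concrete, here is a dependency-free sketch of q8_0-style quantization: each 32-element block stores one scale (the block's absolute maximum divided by 127) and 32 signed bytes. It is not the rlx_gguf kernel. As a simplifying assumption the scale is kept as f32 in a struct, whereas the GGUF encoding stores it as f16 inside a 34-byte block; `BlockQ8`, `quantize_row` and `dequantize_row` are hypothetical names.

```rust
/// Elements per q8_0 block.
const BLOCK_ELEMS: usize = 32;

/// In-memory stand-in for one q8_0 block (assumption: f32 scale
/// instead of the f16 scale used by the GGUF byte layout).
#[derive(Clone, Debug)]
struct BlockQ8 {
    d: f32,
    qs: [i8; BLOCK_ELEMS],
}

/// Quantize one f32 row; its length must be a multiple of 32.
fn quantize_row(row: &[f32]) -> Vec<BlockQ8> {
    assert!(row.len() % BLOCK_ELEMS == 0, "row length must be a multiple of 32");
    row.chunks_exact(BLOCK_ELEMS)
        .map(|chunk| {
            // Scale so the largest magnitude in the block maps to +/-127.
            let amax = chunk.iter().fold(0.0f32, |m, x| m.max(x.abs()));
            let d = amax / 127.0;
            let inv = if d > 0.0 { 1.0 / d } else { 0.0 };
            let mut qs = [0i8; BLOCK_ELEMS];
            for (q, x) in qs.iter_mut().zip(chunk) {
                *q = (*x * inv).round() as i8;
            }
            BlockQ8 { d, qs }
        })
        .collect()
}

/// Expand blocks back to f32; each value is `q * d`.
fn dequantize_row(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.qs.iter().map(move |&q| q as f32 * b.d))
        .collect()
}
```

In this scheme the reconstruction error of each element is bounded by about half the block's scale, which is why blocks dominated by one large outlier lose the most precision on their small values.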
Structs§
- QuantizedKvCache - All layers of a quantized KV cache.
- QuantizedKvLayer - One layer’s quantized K/V buffers.
Enums§
- KvQuant - Quantization scheme for cache rows. Restricted to the three q-formats whose blocks are 32 elements wide and stable across llama.cpp versions. The K-quants (Q4_K etc.) require 256-element blocks, which doesn’t compose cleanly with typical kv_dim values (e.g. 128 head dim) so we don’t expose them here.