Module quantized_kv

Block-quantized K/V cache — store decode-time history as q8_0, q4_0, or q5_0 GGUF-encoded blocks instead of f32/f16. Memory saving vs f16 is roughly:

scheme	bits/elem	ratio vs f16
f16	16	1.0×
q8_0	8.5	0.53×
q5_0	5.5	0.34×
q4_0	4.5	0.28×

Trade-off: quantization adds noise to attention scores. q8_0 is near-lossless for most decoder LMs; q4_0 typically costs ~0.3 ppl in exchange for roughly 3.6× less memory than f16 (16 / 4.5 bits per element).
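
The bits/elem column in the table above follows from the standard GGUF block layouts: every block covers 32 elements and carries a 2-byte f16 scale plus the packed quants. The sketch below reproduces that arithmetic. The `Scheme` enum and its method names are hypothetical stand-ins for illustration and are not this module's `KvQuant` API.

```rust
/// Hypothetical stand-in for the cache's scheme enum (names are assumptions).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Scheme {
    Q8_0,
    Q5_0,
    Q4_0,
}

/// Elements covered by one block; 32 for all three schemes.
pub const BLOCK_ELEMS: usize = 32;

impl Scheme {
    /// Encoded size in bytes of one 32-element block.
    pub fn block_bytes(self) -> usize {
        match self {
            // 2-byte f16 scale + 32 one-byte quants
            Scheme::Q8_0 => 2 + 32,
            // 2-byte f16 scale + 4 bytes of fifth bits + 16 bytes of packed nibbles
            Scheme::Q5_0 => 2 + 4 + 16,
            // 2-byte f16 scale + 16 bytes of packed nibbles
            Scheme::Q4_0 => 2 + 16,
        }
    }

    /// Average storage cost per element, in bits.
    pub fn bits_per_elem(self) -> f64 {
        (self.block_bytes() * 8) as f64 / BLOCK_ELEMS as f64
    }

    /// Size relative to an f16 cache (16 bits per element).
    pub fn ratio_vs_f16(self) -> f64 {
        self.bits_per_elem() / 16.0
    }
}
```

For example, `Scheme::Q4_0.block_bytes()` is 18, which gives 144 bits over 32 elements, i.e. 4.5 bits per element.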

§Layout per layer

Each layer’s K and V buffers are each a flat Vec<u8> holding past_len quantized rows. Each “row” dequantizes to kv_dim f32 elements; rows are stored back-to-back. kv_dim must be a multiple of the scheme’s block size (32 for all three schemes).
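
To make the flat layout concrete, the following sketch shows how row size, row count, and a window of rows could be derived from a byte buffer. The function names are hypothetical and do not come from this module; `block_bytes` is assumed to be the encoded size of one 32-element block (34 for q8_0, 22 for q5_0, 18 for q4_0).

```rust
/// Elements per quantization block (32 for q8_0, q5_0 and q4_0).
const BLOCK_ELEMS: usize = 32;

/// Bytes per quantized row, or `None` if `kv_dim` is zero or not a
/// multiple of the block size.
pub fn row_bytes(kv_dim: usize, block_bytes: usize) -> Option<usize> {
    if kv_dim == 0 || kv_dim % BLOCK_ELEMS != 0 {
        return None;
    }
    Some(kv_dim / BLOCK_ELEMS * block_bytes)
}

/// Number of rows held in a flat buffer (the `past_len` of the layout),
/// or `None` if the buffer is not a whole number of rows.
pub fn stored_rows(buf: &[u8], kv_dim: usize, block_bytes: usize) -> Option<usize> {
    let rb = row_bytes(kv_dim, block_bytes)?;
    if rb == 0 || buf.len() % rb != 0 {
        return None;
    }
    Some(buf.len() / rb)
}

/// Bytes backing rows `start .. start + count`, or `None` if the window
/// runs past the end of the buffer.
pub fn row_window(
    buf: &[u8],
    kv_dim: usize,
    block_bytes: usize,
    start: usize,
    count: usize,
) -> Option<&[u8]> {
    let rb = row_bytes(kv_dim, block_bytes)?;
    let lo = start.checked_mul(rb)?;
    let hi = start.checked_add(count)?.checked_mul(rb)?;
    buf.get(lo..hi)
}
```

With kv_dim = 128 and q8_0, a row is 4 blocks of 34 bytes, so 136 bytes; row i starts at byte offset i * 136.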

On read, callers materialize a window of rows to f32 via [dequant_rows]. On write, freshly produced f32 K/V is quantized one row at a time via [quant_rows] before being appended. The quantization wrappers route to the rlx_gguf::quantize / dequant_* kernels for parity with on-disk GGUF blocks.
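
As a rough illustration of the write and read paths for one row, here is a simplified q8_0-style round trip: each 32-element block stores a scale of max|x| / 127 and 32 rounded int8 quants. This is a sketch under stated assumptions, not the rlx_gguf kernels: the `BlockQ8` type and both function names are hypothetical, and the scale is kept as f32 for self-containment, whereas GGUF q8_0 blocks store it as f16, so the result is not byte-compatible with on-disk blocks.

```rust
/// Elements per block.
const QK: usize = 32;

/// Simplified q8_0-style block. Assumption: scale kept as f32 here;
/// real GGUF q8_0 blocks store it as a 2-byte f16.
#[derive(Clone, Debug, PartialEq)]
pub struct BlockQ8 {
    pub d: f32,
    pub qs: [i8; QK],
}

/// Quantize one row of f32 values into blocks of 32 elements.
/// Panics if the row length is not a multiple of 32.
pub fn quantize_row_q8(row: &[f32]) -> Vec<BlockQ8> {
    assert!(row.len() % QK == 0, "row length must be a multiple of 32");
    row.chunks_exact(QK)
        .map(|chunk| {
            // Largest magnitude in the block determines the scale.
            let amax = chunk.iter().fold(0.0f32, |m, x| m.max(x.abs()));
            let d = amax / 127.0;
            // Avoid dividing by zero for an all-zero block.
            let inv = if d != 0.0 { 1.0 / d } else { 0.0 };
            let mut qs = [0i8; QK];
            for (q, x) in qs.iter_mut().zip(chunk) {
                *q = (x * inv).round() as i8;
            }
            BlockQ8 { d, qs }
        })
        .collect()
}

/// Expand blocks back to f32, one value per stored quant.
pub fn dequantize_row_q8(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.qs.iter().map(move |&q| q as f32 * b.d))
        .collect()
}
```

In this scheme the reconstruction error of each element is bounded by about half the block scale, which is why a block with one large outlier loses precision on its small values.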

Structs§

QuantizedKvCache: All layers of a quantized KV cache.
QuantizedKvLayer: One layer’s quantized K/V buffers.

Enums§

KvQuant: Quantization scheme for cache rows. Restricted to the three q-formats whose blocks are 32 elements wide and stable across llama.cpp versions. The K-quants (Q4_K etc.) require 256-element blocks, which do not compose cleanly with typical kv_dim values (e.g. a head dim of 128), so we don’t expose them here.
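
The block-size restriction can be checked with simple arithmetic. The helpers below are hypothetical and only illustrate the divisibility argument: a 128-element row splits evenly into 32-element blocks but would need padding to fill a single 256-element block.

```rust
/// True if rows of `kv_dim` elements split into whole blocks.
pub fn divides_rows(kv_dim: usize, block_elems: usize) -> bool {
    block_elems != 0 && kv_dim % block_elems == 0
}

/// Elements a row would occupy if padded up to a whole number of blocks.
/// Panics if `block_elems` is zero.
pub fn padded_elems(kv_dim: usize, block_elems: usize) -> usize {
    assert!(block_elems != 0, "block size must be non-zero");
    // Ceiling division, then scale back to elements.
    (kv_dim + block_elems - 1) / block_elems * block_elems
}
```

For kv_dim = 128, `padded_elems(128, 256)` is 256, so half of every block would be padding.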