
Module kv_dtype

KV cache element-type markers (Dim 5 of the 5-dimension architecture).

These are pure marker types with no GPU dependencies, so they live in ferrum-interfaces rather than ferrum-kernels. The capability trait that links them to a backend (BackendKvDtype<K>: BackendPagedKv) does need GPU types, so it stays in ferrum-kernels::backend.

Each model’s KV cache has its own precision, independent of the model’s compute precision. vLLM 0.6+ ships INT8 / FP8 KV caches that halve KV memory relative to FP16 at a small (<1%) accuracy cost. ferrum’s type system exposes this axis via the K: KvDtypeKind parameter on KvCache<B, K> (default K = KvFp16).
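A minimal sketch of how a defaulted marker parameter like this can work. Only the names KvCache, KvDtypeKind, KvFp16 and KvInt8 come from this page; the trait's NAME constant, the CpuBackend type, and the methods are hypothetical stand-ins, not ferrum's actual definitions.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for the real `KvDtypeKind` trait.
/// The `NAME` constant is an assumption made for this illustration.
pub trait KvDtypeKind {
    const NAME: &'static str;
}

/// Zero-sized marker types: they carry no data, only type-level identity.
pub struct KvFp16;
pub struct KvInt8;

impl KvDtypeKind for KvFp16 {
    const NAME: &'static str = "fp16";
}
impl KvDtypeKind for KvInt8 {
    const NAME: &'static str = "int8";
}

/// Hypothetical backend marker; a real `B` would be a GPU backend type.
pub struct CpuBackend;

/// Cache generic over backend `B` and KV element type `K`.
/// Writing `KvCache<B>` in type position selects `K = KvFp16`.
pub struct KvCache<B, K: KvDtypeKind = KvFp16> {
    _marker: PhantomData<(B, K)>,
}

impl<B, K: KvDtypeKind> KvCache<B, K> {
    pub fn new() -> Self {
        Self { _marker: PhantomData }
    }

    /// Reports which KV element type this cache was instantiated with.
    pub fn dtype_name(&self) -> &'static str {
        K::NAME
    }
}
```

With these definitions, `let c: KvCache<CpuBackend> = KvCache::new();` picks the FP16 default, while `KvCache<CpuBackend, KvInt8>` opts into INT8 explicitly. Because the markers are zero-sized, the choice costs nothing at runtime.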

Structs§

KvBf16
BF16 KV cache (drop-in replacement for FP16 on Ampere+ / Apple Silicon).
KvFp8
FP8 KV cache, E4M3 by default. Requires Hopper+ on CUDA; Metal support is planned.
KvFp16
FP16 KV cache (the existing default on CUDA + Metal).
KvInt8
INT8 KV cache: half the memory of FP16, with per-token / per-channel scale factors. CUDA path planned via vLLM’s quant_kv kernels.
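To illustrate what a per-token scale factor means for an INT8 cache, here is a generic symmetric-quantization sketch. It is an assumption for illustration only: the function names are hypothetical, and this is neither ferrum's code nor vLLM's quant_kv kernel.

```rust
/// Quantize one token's K or V vector to INT8 using a single scale
/// for the whole vector (symmetric, zero-point-free quantization).
pub fn quantize_per_token(values: &[f32]) -> (Vec<i8>, f32) {
    // Largest magnitude in the vector determines the scale.
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero for an all-zero vector.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Recover approximate f32 values from the INT8 codes and their scale.
pub fn dequantize_per_token(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```

Each element is stored in one byte instead of two, and the scale is stored once per token. A per-channel scheme would instead keep one scale per head dimension shared across tokens.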

Traits§

KvDtypeKind
Marker trait + metadata for a KV cache element type.
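As a rough sketch of the kind of metadata such a trait could carry, the example below attaches a byte width to each marker and uses it to estimate KV memory per token. The constant BYTES_PER_ELEM and the helper kv_bytes_per_token are assumptions; the real trait's items are not shown on this page.

```rust
/// Hypothetical shape for `KvDtypeKind`; the associated constant is
/// an assumption, chosen to show how metadata can live on a marker.
pub trait KvDtypeKind {
    const BYTES_PER_ELEM: usize;
}

pub struct KvFp16;
pub struct KvBf16;
pub struct KvFp8;
pub struct KvInt8;

impl KvDtypeKind for KvFp16 {
    const BYTES_PER_ELEM: usize = 2;
}
impl KvDtypeKind for KvBf16 {
    const BYTES_PER_ELEM: usize = 2;
}
impl KvDtypeKind for KvFp8 {
    const BYTES_PER_ELEM: usize = 1;
}
impl KvDtypeKind for KvInt8 {
    const BYTES_PER_ELEM: usize = 1;
}

/// Bytes of KV cache needed per token: one K and one V vector per layer.
/// Scale-factor storage for quantized types is ignored here.
pub fn kv_bytes_per_token<K: KvDtypeKind>(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
) -> usize {
    2 * layers * kv_heads * head_dim * K::BYTES_PER_ELEM
}
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, this gives 131072 bytes per token for KvFp16 and 65536 for KvInt8 or KvFp8, which is the halving described in the module overview.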