KV cache element-type markers (Dim 5 of the 5-dimension architecture).
These are pure marker types with no GPU dependencies, so they live
in ferrum-interfaces rather than ferrum-kernels. The capability
trait that links them to a backend (BackendKvDtype<K>: BackendPagedKv)
does need GPU types, so it stays in ferrum-kernels::backend.
Each model’s KV cache has its own precision independent of the
model’s compute precision. vLLM 0.6+ ships INT8 / FP8 KV caches that
halve KV memory relative to FP16 at a small (<1%) accuracy cost. ferrum’s type system
exposes this axis via the K: KvDtypeKind parameter on
KvCache<B, K> (default K = KvFp16).
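The default-parameter pattern described above can be sketched as follows. This is a minimal illustration, not ferrum's actual definitions: the NAME constant, the CpuBackend type, and the new / dtype_name methods are hypothetical, and the real KvCache carries backend storage that is omitted here.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for the marker trait; the real trait's items may differ.
pub trait KvDtypeKind {
    /// Assumed metadata field, used here only to observe which marker was chosen.
    const NAME: &'static str;
}

/// Zero-sized marker types: no data, no GPU dependencies.
pub struct KvFp16;
pub struct KvInt8;

impl KvDtypeKind for KvFp16 {
    const NAME: &'static str = "fp16";
}

impl KvDtypeKind for KvInt8 {
    const NAME: &'static str = "int8";
}

/// Hypothetical backend marker used only for this sketch.
pub struct CpuBackend;

/// Cache type parameterised by backend `B` and KV element kind `K`.
/// `K` defaults to `KvFp16`, so `KvCache<B>` means `KvCache<B, KvFp16>`.
pub struct KvCache<B, K: KvDtypeKind = KvFp16> {
    _marker: PhantomData<(B, K)>,
}

impl<B, K: KvDtypeKind> KvCache<B, K> {
    /// Builds an empty cache handle (no storage in this sketch).
    pub fn new() -> Self {
        Self { _marker: PhantomData }
    }

    /// Reports the element kind selected at the type level.
    pub fn dtype_name(&self) -> &'static str {
        K::NAME
    }
}
```

With these definitions, a binding annotated as KvCache<CpuBackend> resolves K to KvFp16, while KvCache<CpuBackend, KvInt8> opts into INT8 explicitly. Because the markers are zero-sized, the choice costs nothing at runtime.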
Structs§
- KvBf16
- BF16 KV cache (drop-in replacement for FP16 on Ampere+ / Apple Silicon).
- KvFp8
- FP8 KV cache — E4M3 by default. Hopper+ on CUDA, future on Metal.
- KvFp16
- FP16 KV cache (the existing default on CUDA + Metal).
- KvInt8
- INT8 KV cache — half the memory of FP16 with per-token / per-channel scale factors. CUDA path planned via vLLM’s quant_kv kernels.
Traits§
- KvDtypeKind
- Marker trait + metadata for a KV cache element type.
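To make the memory trade-off concrete, the sketch below attaches a bytes-per-element constant to each marker and computes the raw KV payload size for a model. The trait name ElemSize, the BYTES_PER_ELEM constant, and the kv_payload_bytes function are assumptions made for illustration; the metadata actually exposed by KvDtypeKind is not shown in this page. The figure excludes the scale factors that the INT8 path stores alongside the data.

```rust
/// Hypothetical metadata trait: storage size of one KV element.
pub trait ElemSize {
    const BYTES_PER_ELEM: usize;
}

pub struct KvFp16;
pub struct KvBf16;
pub struct KvFp8;
pub struct KvInt8;

impl ElemSize for KvFp16 {
    const BYTES_PER_ELEM: usize = 2;
}

impl ElemSize for KvBf16 {
    const BYTES_PER_ELEM: usize = 2;
}

impl ElemSize for KvFp8 {
    const BYTES_PER_ELEM: usize = 1;
}

impl ElemSize for KvInt8 {
    const BYTES_PER_ELEM: usize = 1;
}

/// Raw KV payload in bytes: keys and values (factor of 2) for every layer,
/// KV head, head dimension, and cached token. Quantization scale factors
/// and allocator padding are not included.
pub fn kv_payload_bytes<K: ElemSize>(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    tokens: usize,
) -> usize {
    2 * layers * kv_heads * head_dim * tokens * K::BYTES_PER_ELEM
}
```

For an illustrative configuration of 32 layers, 8 KV heads, a head dimension of 128, and 4096 cached tokens, kv_payload_bytes::<KvFp16> gives 536,870,912 bytes (512 MiB) and kv_payload_bytes::<KvInt8> gives 268,435,456 bytes (256 MiB), which is the halving described in the module text.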