//! KV cache element-type markers (Dim 5 of the 5-dimension architecture).
//!
//! These are pure marker types with no GPU dependencies, so they live
//! in `ferrum-interfaces` rather than `ferrum-kernels`. The capability
//! trait that links them to a backend (`BackendKvDtype<K>: BackendPagedKv`)
//! does need GPU types, so it stays in `ferrum-kernels::backend`.
//!
//! Each model's KV cache has its own precision, independent of the
//! model's compute precision. vLLM 0.6+ ships INT8 / FP8 KV caches that
//! halve KV memory at a small (<1%) accuracy cost. ferrum's type system
//! exposes this axis through the `K: KvDtypeKind` parameter on
//! `KvCache<B, K>` (default `K = KvFp16`).
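//!
//! The memory argument can be sketched with a zero-sized marker per element
//! type. This is a minimal illustration, not ferrum's actual definition: the
//! trait below is a self-contained stand-in for `KvDtypeKind`, and
//! `BYTES_PER_ELEM`, `KvInt8`, and `kv_cache_bytes` are hypothetical names
//! introduced only for this example.
//!
//! ```rust
//! /// Stand-in marker trait; the real trait's items may differ.
//! trait KvDtypeKind {
//!     /// Bytes per stored K or V element (assumed metadata).
//!     const BYTES_PER_ELEM: usize;
//! }
//!
//! /// FP16 marker (name taken from the default `K = KvFp16`).
//! struct KvFp16;
//! /// INT8 marker (hypothetical name).
//! struct KvInt8;
//!
//! impl KvDtypeKind for KvFp16 {
//!     const BYTES_PER_ELEM: usize = 2;
//! }
//! impl KvDtypeKind for KvInt8 {
//!     const BYTES_PER_ELEM: usize = 1;
//! }
//!
//! /// Bytes needed for K and V across all layers. Quantization scale
//! /// factors are ignored, so the INT8 figure is a lower bound.
//! fn kv_cache_bytes<K: KvDtypeKind>(
//!     layers: usize,
//!     kv_heads: usize,
//!     head_dim: usize,
//!     tokens: usize,
//! ) -> usize {
//!     // Leading 2 = one K tensor plus one V tensor per layer.
//!     2 * layers * kv_heads * head_dim * tokens * K::BYTES_PER_ELEM
//! }
//! ```
//!
//! Under these assumptions, 32 layers, 8 KV heads, a head dimension of 128,
//! and 4096 tokens come to 512 MiB with `KvFp16` and 256 MiB with the
//! one-byte marker. The element type is chosen at compile time, with no
//! runtime dtype tag.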
/// Marker trait + metadata for a KV cache element type.
/// FP16 KV cache (the existing default on CUDA + Metal).
;
/// BF16 KV cache (drop-in replacement for FP16 on Ampere+ / Apple Silicon).
;
/// INT8 KV cache — half the memory of FP16 with per-token / per-channel
/// scale factors. CUDA path planned via vLLM's quant_kv kernels.
;
/// FP8 KV cache — E4M3 by default. Hopper+ on CUDA, future on Metal.
;