pub struct KvCache<B: Backend, K: KvDtypeKind = KvFp16> {
    pub k: B::Buffer,
    pub v: B::Buffer,
    pub len: usize,
    pub capacity: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub block_size: usize,
    pub block_table: Option<B::Buffer>,
    pub context_lens: Option<B::Buffer>,
    pub paged_block_indices: Vec<u32>,
    pub _kv_dtype: PhantomData<K>,
}
Per-layer KV cache. Each model owns its own Vec<KvCache<B, K>> per
sequence. The K: KvDtypeKind parameter selects the cache element
type and defaults to KvFp16, so existing call sites that write
KvCache<B> keep compiling unchanged.
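A minimal sketch of why the default type parameter keeps KvCache<B> call sites compiling. Backend, KvDtypeKind, KvFp16, KvInt8 and CpuBackend below are hypothetical stand-ins (their real definitions are not shown on this page), and the struct is trimmed to the fields needed to make the point.

```rust
use std::marker::PhantomData;

// Hypothetical stand-ins; the real trait definitions are not shown here.
pub trait Backend {
    type Buffer;
}

pub trait KvDtypeKind {
    const NAME: &'static str;
}

pub struct KvFp16;
impl KvDtypeKind for KvFp16 {
    const NAME: &'static str = "fp16";
}

pub struct KvInt8;
impl KvDtypeKind for KvInt8 {
    const NAME: &'static str = "int8";
}

// Hypothetical backend whose buffers are plain host vectors.
pub struct CpuBackend;
impl Backend for CpuBackend {
    type Buffer = Vec<f32>;
}

// Trimmed-down cache: the `K = KvFp16` default is what lets `KvCache<B>`
// keep compiling after the second parameter was added.
pub struct KvCache<B: Backend, K: KvDtypeKind = KvFp16> {
    pub k: B::Buffer,
    pub len: usize,
    pub _kv_dtype: PhantomData<K>,
}

// Old-style call site: names only the backend, so K falls back to KvFp16.
pub fn make_default(k: Vec<f32>) -> KvCache<CpuBackend> {
    KvCache { k, len: 0, _kv_dtype: PhantomData }
}

// The marker carries no data but is still readable at compile time.
pub fn dtype_name<B: Backend, K: KvDtypeKind>(_cache: &KvCache<B, K>) -> &'static str {
    K::NAME
}
```

Because PhantomData<K> is zero-sized, choosing a different K in this sketch does not change the struct's size; only the type-level marker differs.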
Two layouts are supported, selected at allocation time:
- Contiguous (default): k/v are [num_kv_heads, capacity, head_dim] f32 buffers. block_size == 0 and block_table/context_lens are None. This is the original ferrum layout, used when FERRUM_METAL_PAGED_KV is unset.
- Paged (vLLM-style): k/v are [num_blocks, num_kv_heads, block_size, head_dim] block pools. block_size > 0, and block_table (u32[max_num_blocks_per_seq]) and context_lens (u32[1], single-seq for now) are populated. Multi-seq sharing is a Phase 4 concern; today every paged cache_id has its own pool, but the kernel-level indirection works.
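To make the difference between the two layouts concrete, the hypothetical helpers below compute the element offset of (head, pos, d) in each one. This is a sketch assuming row-major storage with the innermost dimension last, offsets counted in elements rather than bytes, and a block_table already available as a host slice; the function names are illustrative, not ferrum's.

```rust
/// Element offset into a contiguous k/v buffer shaped
/// [num_kv_heads, capacity, head_dim] (row-major).
pub fn contiguous_offset(head: usize, pos: usize, d: usize, capacity: usize, head_dim: usize) -> usize {
    debug_assert!(pos < capacity && d < head_dim);
    (head * capacity + pos) * head_dim + d
}

/// Element offset into a paged pool shaped
/// [num_blocks, num_kv_heads, block_size, head_dim] (row-major).
/// The logical position is first mapped to a physical block via the block table.
pub fn paged_offset(
    head: usize,
    pos: usize,
    d: usize,
    block_table: &[u32],
    num_kv_heads: usize,
    block_size: usize,
    head_dim: usize,
) -> usize {
    assert!(block_size > 0, "block_size == 0 means the contiguous layout");
    let logical_block = pos / block_size; // which block of this sequence
    let slot = pos % block_size; // position inside that block
    let physical = block_table[logical_block] as usize; // logical -> physical
    ((physical * num_kv_heads + head) * block_size + slot) * head_dim + d
}
```

For example, with block_table = [5, 2], block_size = 4 and 2 KV heads of head_dim 8, position 6 of head 1 lives in logical block 1, slot 2, which the table maps to physical block 2. Within a block the data is still dense, so the block-table lookup is the only extra step a paged kernel needs.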
The K parameter is currently a phantom-type marker — the buffer
fields stay B::Buffer regardless. Future PRs will switch backends
to BackendKvDtype<KvInt8> etc. and the kernel dispatch will read
K::NAME / K::BYTES_PER_ELEM to pick the right append / attention
kernel without any KvCache struct change.
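A sketch of how dispatch could read the marker's associated constants once buffers actually change element type (today they stay B::Buffer regardless of K). The constants NAME and BYTES_PER_ELEM come from the text above; their values here, the "<op>_<dtype>" kernel-naming scheme, and the helper functions are all hypothetical.

```rust
// Hypothetical marker trait mirroring the K::NAME / K::BYTES_PER_ELEM constants.
pub trait KvDtypeKind {
    const NAME: &'static str;
    const BYTES_PER_ELEM: usize;
}

pub struct KvFp16;
impl KvDtypeKind for KvFp16 {
    const NAME: &'static str = "fp16";
    const BYTES_PER_ELEM: usize = 2;
}

pub struct KvInt8;
impl KvDtypeKind for KvInt8 {
    const NAME: &'static str = "int8";
    const BYTES_PER_ELEM: usize = 1;
}

/// Hypothetical naming scheme: "<op>_<dtype>", e.g. "kv_append_int8".
pub fn kernel_name<K: KvDtypeKind>(op: &str) -> String {
    format!("{}_{}", op, K::NAME)
}

/// Bytes for one contiguous K (or V) buffer shaped [num_kv_heads, capacity, head_dim].
pub fn contiguous_buffer_bytes<K: KvDtypeKind>(num_kv_heads: usize, capacity: usize, head_dim: usize) -> usize {
    num_kv_heads * capacity * head_dim * K::BYTES_PER_ELEM
}

/// Bytes for one paged K (or V) pool shaped [num_blocks, num_kv_heads, block_size, head_dim].
pub fn paged_pool_bytes<K: KvDtypeKind>(num_blocks: usize, num_kv_heads: usize, block_size: usize, head_dim: usize) -> usize {
    num_blocks * num_kv_heads * block_size * head_dim * K::BYTES_PER_ELEM
}
```

Since the constants are resolved per monomorphized K, the choice of kernel and allocation size is fixed at compile time without adding any field to KvCache.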
Fields
- k: B::Buffer
- v: B::Buffer
- len: usize
- capacity: usize
- num_kv_heads: usize
- head_dim: usize
- block_size: usize. Paged: KV positions per physical block. 0 => contiguous layout.
- block_table: Option<B::Buffer>. Paged: [max_num_blocks_per_seq] u32, logical → physical block.
- context_lens: Option<B::Buffer>. Paged: [1] u32, the current context length for the kernel to read.
- paged_block_indices: Vec<u32>. Paged: host-side mirror of the physical block indices owned by this cache. Lets the model’s release path return blocks to the shared allocator without reading them back from device.
- _kv_dtype: PhantomData<K>. Marker for the KV cache element type. Zero-sized.
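A minimal sketch of why the host-side paged_block_indices mirror is useful: growth and release can be done entirely on the host against a shared free list. BlockAllocator and PagedMirror are hypothetical types, and the step that uploads the indices into the device block_table is omitted; this is not ferrum's allocator.

```rust
/// Hypothetical shared pool of physical block indices (a simple free list).
pub struct BlockAllocator {
    free: Vec<u32>,
}

impl BlockAllocator {
    pub fn new(num_blocks: u32) -> Self {
        // Reversed so that pop() hands out 0, 1, 2, ... first.
        BlockAllocator { free: (0..num_blocks).rev().collect() }
    }
    pub fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }
    pub fn free_count(&self) -> usize {
        self.free.len()
    }
    /// Moves every index out of `blocks` back into the pool, leaving `blocks` empty.
    pub fn release(&mut self, blocks: &mut Vec<u32>) {
        self.free.append(blocks);
    }
}

/// Host-side view of one paged cache: just the fields this sketch needs.
pub struct PagedMirror {
    pub len: usize,
    pub block_size: usize,
    pub paged_block_indices: Vec<u32>,
}

impl PagedMirror {
    /// Ensures ceil(new_len / block_size) blocks are owned. Returns false if the
    /// pool runs out (blocks already taken stay owned; len is not updated).
    pub fn reserve(&mut self, new_len: usize, alloc: &mut BlockAllocator) -> bool {
        assert!(self.block_size > 0, "contiguous layout has no blocks");
        let needed = (new_len + self.block_size - 1) / self.block_size;
        while self.paged_block_indices.len() < needed {
            match alloc.alloc() {
                Some(b) => self.paged_block_indices.push(b),
                None => return false,
            }
        }
        self.len = new_len;
        true
    }

    /// Returns all owned blocks to the pool using only the host mirror;
    /// no device read-back of block_table is needed.
    pub fn release(&mut self, alloc: &mut BlockAllocator) {
        alloc.release(&mut self.paged_block_indices);
        self.len = 0;
    }
}
```

With block_size = 16, reserving 20 positions takes two blocks; releasing hands both back, so the pool's free count returns to its starting value.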