pub struct KvCache<B: Backend> {
    pub k: B::Buffer,
    pub v: B::Buffer,
    pub len: usize,
    pub capacity: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub block_size: usize,
    pub block_table: Option<B::Buffer>,
    pub context_lens: Option<B::Buffer>,
    pub paged_block_indices: Vec<u32>,
}
Per-layer KV cache. Each model owns its own Vec<KvCache<B>> per sequence.
Two layouts are supported, selected at allocation time:
- Contiguous (default): k/v are [num_kv_heads, capacity, head_dim] f32 buffers. block_size == 0 and block_table / context_lens are None. This is the original ferrum layout, used when FERRUM_METAL_PAGED_KV is unset.
- Paged (vLLM-style): k/v are [num_blocks, num_kv_heads, block_size, head_dim] block pools. block_size > 0 and block_table (u32[max_num_blocks_per_seq]) + context_lens (u32[1], single-seq for now) are populated. Multi-seq sharing is a Phase 4 concern; today every paged cache_id has its own pool, but the kernel-level indirection works.
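The difference between the two layouts is easiest to see as index arithmetic. The following is a minimal sketch, assuming row-major storage in the dimension order given above; `contiguous_offset` and `paged_offset` are hypothetical helper names for illustration and are not part of the crate.

```rust
/// Hypothetical helper: flat element offset of (head, pos, dim) in the
/// contiguous layout [num_kv_heads, capacity, head_dim].
/// Assumes row-major storage.
fn contiguous_offset(
    head: usize,
    pos: usize,
    dim: usize,
    capacity: usize,
    head_dim: usize,
) -> usize {
    (head * capacity + pos) * head_dim + dim
}

/// Hypothetical helper: flat element offset of (head, pos, dim) in the
/// paged layout [num_blocks, num_kv_heads, block_size, head_dim].
/// `block_table[i]` maps logical block `i` to a physical block index.
/// Assumes row-major storage and block_size > 0.
fn paged_offset(
    block_table: &[u32],
    head: usize,
    pos: usize,
    dim: usize,
    num_kv_heads: usize,
    block_size: usize,
    head_dim: usize,
) -> usize {
    let logical_block = pos / block_size; // which block the position falls in
    let slot = pos % block_size; // position inside that block
    let physical_block = block_table[logical_block] as usize;
    ((physical_block * num_kv_heads + head) * block_size + slot) * head_dim + dim
}
```

In the paged case, consecutive logical positions only stay adjacent in memory within a single block; crossing a block boundary goes through `block_table`, which is the indirection the kernel performs.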
Fields
k: B::Buffer
v: B::Buffer
len: usize
capacity: usize
num_kv_heads: usize
head_dim: usize
block_size: usize
    Paged: KV positions per physical block. 0 ⇒ contiguous layout.
block_table: Option<B::Buffer>
    Paged: [max_num_blocks_per_seq] u32, mapping logical → physical block.
context_lens: Option<B::Buffer>
    Paged: [1] u32, the current context length for the kernel to read.
paged_block_indices: Vec<u32>
    Paged: host-side mirror of the physical block indices owned by this cache. Lets the model's release path return blocks to the shared allocator without reading them back from device.
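To illustrate why a host-side mirror is useful, here is a rough sketch of an allocate-and-release cycle. Everything in it is an assumption made for illustration: `BlockAllocator` and `PagedCacheHost` are hypothetical stand-ins (the real shared allocator and its API are not shown on this page), and only the field names `len`, `block_size`, and `paged_block_indices` come from `KvCache`.

```rust
/// Hypothetical shared allocator: a free list of physical block indices.
struct BlockAllocator {
    free: Vec<u32>,
}

impl BlockAllocator {
    fn new(num_blocks: u32) -> Self {
        // Reverse so that `pop` hands out block 0 first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }

    /// Takes back every block in `blocks`, leaving `blocks` empty.
    fn release(&mut self, blocks: &mut Vec<u32>) {
        self.free.append(blocks);
    }
}

/// Hypothetical host-side stand-in for the paged fields of `KvCache`.
struct PagedCacheHost {
    len: usize,
    block_size: usize,
    paged_block_indices: Vec<u32>,
}

impl PagedCacheHost {
    /// Grows the cache so it can hold `new_len` positions.
    /// Returns false if the pool runs out; blocks acquired so far are kept
    /// and `len` is left unchanged.
    fn reserve(&mut self, new_len: usize, alloc: &mut BlockAllocator) -> bool {
        let needed = (new_len + self.block_size - 1) / self.block_size;
        while self.paged_block_indices.len() < needed {
            match alloc.alloc() {
                Some(block) => self.paged_block_indices.push(block),
                None => return false,
            }
        }
        self.len = new_len;
        true
    }

    /// Returns all owned blocks using only host memory: no device readback
    /// of the block table is needed.
    fn release(&mut self, alloc: &mut BlockAllocator) {
        alloc.release(&mut self.paged_block_indices);
        self.len = 0;
    }
}
```

A real implementation would also need to upload the updated block table and context length to the device buffers after `reserve`; that step is omitted here because it depends on the backend.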
Auto Trait Implementations
impl<B> Freeze for KvCache<B>
impl<B> RefUnwindSafe for KvCache<B>
impl<B> Send for KvCache<B>
impl<B> Sync for KvCache<B>
impl<B> Unpin for KvCache<B>
impl<B> UnsafeUnpin for KvCache<B>
impl<B> UnwindSafe for KvCache<B>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
    Mutably borrows from an owned value.