Struct KvCache

pub struct KvCache<B: Backend> {
    pub k: B::Buffer,
    pub v: B::Buffer,
    pub len: usize,
    pub capacity: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub block_size: usize,
    pub block_table: Option<B::Buffer>,
    pub context_lens: Option<B::Buffer>,
    pub paged_block_indices: Vec<u32>,
}

Per-layer KV cache. Each model owns its own Vec<KvCache> per sequence.

Two layouts are supported, selected at allocation time:

Contiguous (default): k/v are [num_kv_heads, capacity, head_dim] f32 buffers, block_size == 0, and block_table / context_lens are None. This is the original ferrum layout, used when FERRUM_METAL_PAGED_KV is unset.
Paged (vLLM-style): k/v are [num_blocks, num_kv_heads, block_size, head_dim] block pools, block_size > 0, and both block_table (u32[max_num_blocks_per_seq]) and context_lens (u32[1], single-sequence for now) are populated. Multi-sequence sharing is a Phase 4 concern: today every paged cache_id has its own pool, but the kernel-level indirection already works.
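
The two shapes above imply different flat-index arithmetic. The following is a minimal sketch of that arithmetic, assuming dense row-major storage (the source gives the shapes but not the strides). `KvShape` and its methods are hypothetical helpers that mirror the scalar fields of `KvCache`; they are not part of the crate.

```rust
/// Hypothetical shape descriptor mirroring the scalar fields of `KvCache`.
/// Not part of the crate; used only to illustrate the two layouts.
#[derive(Clone, Copy, Debug)]
pub struct KvShape {
    pub capacity: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    /// 0 means contiguous layout, > 0 means paged layout.
    pub block_size: usize,
}

impl KvShape {
    /// The layout is encoded in `block_size`, as described in the docs.
    pub fn is_paged(&self) -> bool {
        self.block_size > 0
    }

    /// Flat element offset of (head, pos, dim) in a
    /// [num_kv_heads, capacity, head_dim] buffer (row-major assumed).
    pub fn contiguous_offset(&self, head: usize, pos: usize, dim: usize) -> usize {
        debug_assert!(!self.is_paged());
        (head * self.capacity + pos) * self.head_dim + dim
    }

    /// Flat element offset of (physical_block, head, slot, dim) in a
    /// [num_blocks, num_kv_heads, block_size, head_dim] pool (row-major assumed).
    pub fn paged_offset(
        &self,
        physical_block: usize,
        head: usize,
        slot: usize,
        dim: usize,
    ) -> usize {
        debug_assert!(self.is_paged());
        ((physical_block * self.num_kv_heads + head) * self.block_size + slot) * self.head_dim
            + dim
    }
}
```

In the contiguous case a token position indexes the buffer directly; in the paged case the position must first be split into a physical block and a slot within that block, which is what the block table provides.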

Fields§

§k: B::Buffer

§v: B::Buffer

§len: usize

§capacity: usize

§num_kv_heads: usize

§head_dim: usize

§block_size: usize

Paged: KV positions per physical block. 0 ⇒ contiguous layout.

§block_table: Option<B::Buffer>

Paged: [max_num_blocks_per_seq] u32 table mapping each logical block index to its physical block index.
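
As an illustration of the logical → physical mapping, here is a minimal host-side sketch. It assumes the table is available as a plain `&[u32]` slice (in the real struct it is a device `B::Buffer`), and `locate` is a hypothetical helper, not a crate function.

```rust
/// Translate a token position into (physical_block, slot_in_block).
///
/// `block_table` stands in for the device-side table; index i holds the
/// physical block backing logical block i. Returns `None` for the
/// contiguous layout (`block_size == 0`) or when the position falls
/// beyond the blocks listed in the table.
pub fn locate(block_table: &[u32], block_size: usize, pos: usize) -> Option<(u32, usize)> {
    if block_size == 0 {
        return None; // contiguous layout: there is no block table to consult
    }
    let logical_block = pos / block_size; // which logical block holds `pos`
    let slot = pos % block_size; // offset of `pos` inside that block
    block_table
        .get(logical_block)
        .map(|&physical_block| (physical_block, slot))
}
```

With a table of `[7, 2, 5]` and `block_size = 16`, position 20 falls in logical block 1 at slot 4, so the lookup yields physical block 2, slot 4.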

§context_lens: Option<B::Buffer>

Paged: [1] u32 holding the current context length, which the kernel reads.

§paged_block_indices: Vec<u32>

Paged: host-side mirror of the physical block indices owned by this cache. Lets the model’s release path return blocks to the shared allocator without reading them back from device.
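
To show why the host-side mirror is useful, here is a minimal sketch of a release path. `BlockAllocator` and `release_cache_blocks` are hypothetical: the source mentions a shared allocator but does not describe its API, so a simple free list is assumed.

```rust
/// Hypothetical free-list allocator over physical block indices.
/// The crate's actual allocator may look quite different.
pub struct BlockAllocator {
    free: Vec<u32>,
}

impl BlockAllocator {
    /// Create an allocator in which blocks 0..num_blocks are all free.
    pub fn new(num_blocks: u32) -> Self {
        // Reversed so that `pop` hands out the lowest index first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    /// Take one free block, or `None` if the pool is exhausted.
    pub fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }

    /// Return blocks to the free list.
    pub fn release(&mut self, blocks: impl IntoIterator<Item = u32>) {
        self.free.extend(blocks);
    }

    /// Number of blocks currently available.
    pub fn free_count(&self) -> usize {
        self.free.len()
    }
}

/// Return every block owned by a cache using only the host-side mirror,
/// so no device-to-host read of `block_table` is needed.
pub fn release_cache_blocks(paged_block_indices: &mut Vec<u32>, allocator: &mut BlockAllocator) {
    // `drain(..)` empties the mirror while handing ownership of each index back.
    allocator.release(paged_block_indices.drain(..));
}
```

Because the indices already live in a `Vec<u32>` on the host, releasing a cache is a plain vector drain rather than a synchronizing readback from the device.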