Struct KvFp16 

pub struct KvFp16;

FP16 KV cache (the existing default on CUDA + Metal).

Trait Implementations

impl BackendKvDtype<KvFp16> for CpuBackend

type KvBuffer = <CpuBackend as Backend>::Buffer

Per-layer K/V element storage.

type KvScales = ()

Per-token per-kv-head scale storage. () for FP16 (no scales).
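
Choosing `()` means the FP16 path carries no scale buffer at all. The pattern can be sketched as follows; `ScaleStorage`, `Fp16Like`, and `Int8Like` are hypothetical stand-ins for illustration, not this crate's traits, and the `Vec<f32>` scale type for the quantized case is an assumption.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a dtype marker that picks its scale storage.
trait ScaleStorage {
    /// Storage for per-token, per-kv-head scales.
    type Scales: Default;
}

/// Unquantized marker: no scales are needed, so the unit type is used.
struct Fp16Like;
/// Quantized marker: one f32 scale per (token, kv head) in this sketch.
struct Int8Like;

impl ScaleStorage for Fp16Like {
    type Scales = ();
}

impl ScaleStorage for Int8Like {
    type Scales = Vec<f32>;
}

/// Size in bytes of the scale handle itself (not of any heap data behind it).
fn scale_handle_bytes<D: ScaleStorage>() -> usize {
    size_of::<D::Scales>()
}
```

Because `()` is zero-sized, `scale_handle_bytes::<Fp16Like>()` is 0, so a generic cache struct holding a `D::Scales` field pays nothing for it on the unquantized path.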

impl KvDtypeKind for KvFp16

const NAME: &'static str = "fp16"

Stable name for logging / debug (e.g. “fp16”, “int8”).

const BYTES_PER_ELEM: usize = 2

Bytes per element on disk + in cache memory.
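
As a rough illustration of what `BYTES_PER_ELEM` implies for memory, the sketch below estimates the K plus V footprint of one layer. `DtypeKind` and `Fp16` are hypothetical mirrors of the two constants above, and the formula ignores scales, padding, and block-granularity rounding.

```rust
/// Hypothetical mirror of the two constants documented above.
trait DtypeKind {
    const NAME: &'static str;
    const BYTES_PER_ELEM: usize;
}

struct Fp16;

impl DtypeKind for Fp16 {
    const NAME: &'static str = "fp16";
    const BYTES_PER_ELEM: usize = 2;
}

/// Bytes needed to hold K and V for one layer, ignoring scales and padding.
fn kv_bytes_per_layer<D: DtypeKind>(
    tokens: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> usize {
    // Factor of 2: one tensor for K, one for V.
    2 * tokens * num_kv_heads * head_dim * D::BYTES_PER_ELEM
}
```

For example, `kv_bytes_per_layer::<Fp16>(4096, 8, 128)` evaluates to 16 MiB (16,777,216 bytes).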

impl<B: Backend + BackendPagedKv> KvLayer<B> for KvFp16

type Layer = KvCache<B>

Per-layer cache type (FP16 → KvCache, INT8 → KvCacheQuant).

fn alloc_paged(
    max_blocks_per_seq: usize,
    block_size: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Self::Layer

Allocate a paged cache layer for one sequence.
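
The `max_blocks_per_seq` and `block_size` parameters suggest the usual paged-cache addressing scheme. The sketch below assumes (this page does not confirm it) that a sequence can hold at most `max_blocks_per_seq * block_size` tokens and that a logical token position resolves to a physical slot through a per-sequence list of block indices. All three function names are hypothetical.

```rust
/// Upper bound on tokens for one sequence, assuming every block is full.
fn paged_capacity(max_blocks_per_seq: usize, block_size: usize) -> usize {
    max_blocks_per_seq * block_size
}

/// Map a logical token position to
/// (index into the block list, offset inside that block).
/// Panics if `block_size` is zero.
fn locate(pos: usize, block_size: usize) -> (usize, usize) {
    (pos / block_size, pos % block_size)
}

/// Resolve a logical position to a physical token slot in the shared pool.
/// Returns `None` if the sequence has no block allocated for that position.
fn physical_slot(block_indices: &[u32], pos: usize, block_size: usize) -> Option<usize> {
    let (slot, offset) = locate(pos, block_size);
    block_indices
        .get(slot)
        .map(|&block| block as usize * block_size + offset)
}
```

With `block_size = 16` and block indices `[7, 3, 9]`, position 37 falls in the third logical block at offset 5, so it resolves to physical slot `9 * 16 + 5 = 149`.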

fn alloc_contig(
    capacity: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Self::Layer

Allocate a contiguous cache layer (FP16 only; INT8 panics).

fn len(layer: &Self::Layer) -> usize

fn set_len(layer: &mut Self::Layer, new_len: usize)

fn capacity(layer: &Self::Layer) -> usize

fn block_size(layer: &Self::Layer) -> usize

fn num_kv_heads(layer: &Self::Layer) -> usize

fn head_dim(layer: &Self::Layer) -> usize

fn block_table(layer: &Self::Layer) -> Option<&B::Buffer>

fn block_table_mut(layer: &mut Self::Layer) -> Option<&mut B::Buffer>

fn context_lens(layer: &Self::Layer) -> Option<&B::Buffer>

fn context_lens_mut(layer: &mut Self::Layer) -> Option<&mut B::Buffer>

fn paged_block_indices(layer: &Self::Layer) -> &[u32]

fn paged_block_indices_mut(layer: &mut Self::Layer) -> &mut Vec<u32>

fn paged_write(
    ctx: &mut B::Context,
    layer: &mut Self::Layer,
    qkv: &B::Buffer,
    q_norm_w: &B::Buffer,
    k_norm_w: &B::Buffer,
    cos: &B::Buffer,
    sin: &B::Buffer,
    q_out: &mut B::Buffer,
    _k_scratch: &mut B::Buffer,
    _v_scratch: &mut B::Buffer,
    pool_k: &mut B::Buffer,
    pool_v: &mut B::Buffer,
    tokens: usize,
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    pos_offset: usize,
    eps: f32,
    qk_mode: i32,
) -> Result<()>

Paged write: split QKV → norm → RoPE → write K/V into the paged pool. FP16 uses B::split_qkv_norm_rope_into_paged_cache. INT8 uses B::split_qkv_norm_rope + B::int8_kv_append_paged.
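
The first two stages of that pipeline (split, then norm) can be sketched on the CPU in plain `f32`. This is an illustrative reference only, not the backend kernel: it assumes the fused row is laid out as `[all Q heads | all K heads | all V heads]` and that the norm is a per-head RMS norm with a learned weight, neither of which this page confirms. `split_qkv` and `rms_norm_head` are hypothetical names; RoPE and the pool write are omitted.

```rust
/// Split one token's fused QKV row into (q, k, v) slices.
/// Assumes the row is laid out as [all Q heads | all K heads | all V heads].
fn split_qkv(
    row: &[f32],
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> (&[f32], &[f32], &[f32]) {
    let q_len = num_q_heads * head_dim;
    let kv_len = num_kv_heads * head_dim;
    assert_eq!(row.len(), q_len + 2 * kv_len, "unexpected fused QKV width");
    let (q, rest) = row.split_at(q_len);
    let (k, v) = rest.split_at(kv_len);
    (q, k, v)
}

/// RMS-normalize one head in place:
/// x[i] <- x[i] * weight[i] / sqrt(mean(x^2) + eps).
fn rms_norm_head(x: &mut [f32], weight: &[f32], eps: f32) {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    for (v, w) in x.iter_mut().zip(weight) {
        *v *= inv * w;
    }
}
```

With 2 query heads, 1 KV head, and `head_dim = 2`, an 8-element row splits into a 4-element Q slice followed by 2-element K and V slices.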

fn paged_decode_attention(
    ctx: &mut B::Context,
    layer: &mut Self::Layer,
    q: &B::Buffer,
    pool_k: &B::Buffer,
    pool_v: &B::Buffer,
    output: &mut B::Buffer,
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    final_kv_len: usize,
    tokens: usize,
) -> Result<()>

Paged decode attention. Reads from the per-layer cache, writes the attended output to output. FP16 reads from pool_k/pool_v; INT8 reads from layer-internal INT8 buffers (pool args ignored).
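
To make the FP16 read path concrete, here is a scalar `f32` reference for one query head attending over a paged pool. It is a sketch under stated assumptions, not the backend kernel: the pool layout `[block][slot][kv_head][head_dim]`, the `block_indices` lookup, and the function name `paged_decode_head` are all hypothetical.

```rust
/// Single-token, single-query-head attention over a paged K/V pool.
/// Assumed pool layout: [block][slot][kv_head][head_dim], flattened row-major.
fn paged_decode_head(
    q: &[f32],             // one query head, length head_dim
    pool_k: &[f32],
    pool_v: &[f32],
    block_indices: &[u32], // physical block for each logical block
    block_size: usize,
    num_kv_heads: usize,
    kv_head: usize,
    kv_len: usize,         // number of cached tokens to attend over
) -> Vec<f32> {
    let head_dim = q.len();
    let scale = 1.0 / (head_dim as f32).sqrt();
    // Offset of (token position, kv_head) inside the flattened pool.
    let base = |pos: usize| {
        let block = block_indices[pos / block_size] as usize;
        ((block * block_size + pos % block_size) * num_kv_heads + kv_head) * head_dim
    };
    // Scaled dot-product scores against every cached key.
    let scores: Vec<f32> = (0..kv_len)
        .map(|p| {
            let k = &pool_k[base(p)..base(p) + head_dim];
            q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
        })
        .collect();
    // Numerically stable softmax over the scores.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    // Weighted sum of the cached values.
    let mut out = vec![0.0f32; head_dim];
    for (p, e) in exps.iter().enumerate() {
        let v = &pool_v[base(p)..base(p) + head_dim];
        for (o, x) in out.iter_mut().zip(v) {
            *o += (e / denom) * x;
        }
    }
    out
}
```

With grouped-query attention, several query heads would share one `kv_head`; the caller picks that index, and the mapping is left out of the sketch.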

fn contig_write(
    ctx: &mut B::Context,
    layer: &mut Self::Layer,
    qkv: &B::Buffer,
    q_norm_w: &B::Buffer,
    k_norm_w: &B::Buffer,
    cos: &B::Buffer,
    sin: &B::Buffer,
    q_out: &mut B::Buffer,
    k_scratch: &mut B::Buffer,
    v_scratch: &mut B::Buffer,
    q_buf: &mut B::Buffer,
    k_buf: &mut B::Buffer,
    v_buf: &mut B::Buffer,
    tokens: usize,
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    pos_offset: usize,
    eps: f32,
    qk_mode: i32,
) -> Result<()>

Contig write: FP16 only. INT8 inherits the panic default — KvInt8::alloc_contig panics in ensure_kv, so this branch is dead code on the INT8 path.
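
The "panic default" described here is an ordinary Rust trait default method that one implementor overrides and another inherits. A minimal sketch of that pattern, with hypothetical names (`ContigLayer`, `Fp16Marker`, `Int8Marker`) and a deliberately simplified signature:

```rust
/// Hypothetical trait with a contiguous path that only some dtypes support.
trait ContigLayer {
    fn name() -> &'static str;

    /// Default: the contiguous path is unsupported and panics if reached.
    fn contig_write(_buf: &mut Vec<u16>, _values: &[u16]) {
        unimplemented!("contiguous KV write is not supported for {}", Self::name());
    }
}

struct Fp16Marker;
struct Int8Marker;

impl ContigLayer for Fp16Marker {
    fn name() -> &'static str {
        "fp16"
    }

    /// Overrides the default: append the raw 16-bit values.
    fn contig_write(buf: &mut Vec<u16>, values: &[u16]) {
        buf.extend_from_slice(values);
    }
}

impl ContigLayer for Int8Marker {
    fn name() -> &'static str {
        "int8"
    }
    // No override: inherits the panicking default.
}
```

Calling `Int8Marker::contig_write` panics, while `Fp16Marker::contig_write` appends to the buffer. When allocation already rejects the unsupported layout, as the description above says happens for INT8, the default body is never reached in practice.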

fn contig_decode_attention(
    ctx: &mut B::Context,
    layer: &Self::Layer,
    q: &B::Buffer,
    output: &mut B::Buffer,
    attn_cfg: AttnConfig,
    tokens: usize,
    pos_offset: usize,
) -> Result<()>

Contig decode attention: FP16 only.

fn is_paged(layer: &Self::Layer) -> bool

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.