

ferrum_kernels::backend

Struct KvFp16

pub struct KvFp16;


Marker type selecting the FP16 KV cache (the existing default on CUDA and Metal).

Trait Implementations

impl BackendKvDtype<KvFp16> for CpuBackend

type KvBuffer = <CpuBackend as Backend>::Buffer

Per-layer K/V element storage.

type KvScales = ()

Per-token, per-KV-head scale storage. This is () for FP16, which stores no scales.

impl KvDtypeKind for KvFp16

const NAME: &'static str = "fp16"

Stable name for logging / debug (e.g. “fp16”, “int8”).

const BYTES_PER_ELEM: usize = 2

Bytes per element, both on disk and in cache memory.
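As a rough illustration of how BYTES_PER_ELEM feeds into memory sizing, the sketch below estimates the K plus V footprint of one layer. The DtypeKind trait, the Fp16 struct and kv_bytes_per_layer are hypothetical stand-ins written for this example; they are not part of ferrum_kernels, and the real cache may add padding or metadata that this arithmetic ignores.

```rust
/// Hypothetical mirror of the two `KvDtypeKind` constants documented above.
pub trait DtypeKind {
    const NAME: &'static str;
    const BYTES_PER_ELEM: usize;
}

/// Hypothetical stand-in for `KvFp16`.
pub struct Fp16;

impl DtypeKind for Fp16 {
    const NAME: &'static str = "fp16";
    const BYTES_PER_ELEM: usize = 2;
}

/// Bytes needed to store K and V for one layer across `tokens` positions.
/// Assumes a dense layout with no padding and no scale storage.
pub fn kv_bytes_per_layer<D: DtypeKind>(
    tokens: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> usize {
    // Factor 2 covers the separate K and V tensors.
    2 * tokens * num_kv_heads * head_dim * D::BYTES_PER_ELEM
}
```

Under these assumptions, 4096 tokens with 8 KV heads of dimension 128 come to 16 MiB per layer in FP16.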

impl<B: Backend + BackendPagedKv> KvLayer<B> for KvFp16

type Layer = KvCache<B>

Per-layer cache type (FP16 → KvCache, INT8 → KvCacheQuant).
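The associated Layer type lets generic code work with either cache representation without runtime branching. The sketch below shows that pattern in isolation. LayerKind, DenseLayer, QuantLayer, Fp16Kind, Int8Kind and advance are all hypothetical names invented for this example, and the real KvLayer trait has many more methods.

```rust
/// Hypothetical dense (FP16-like) layer state.
pub struct DenseLayer {
    len: usize,
    capacity: usize,
}

/// Hypothetical quantized (INT8-like) layer state with per-token scales.
pub struct QuantLayer {
    len: usize,
    capacity: usize,
    scales: Vec<f32>,
}

/// Simplified stand-in for a trait like `KvLayer`: the dtype marker picks the layer type.
pub trait LayerKind {
    type Layer;
    fn alloc(capacity: usize) -> Self::Layer;
    fn len(layer: &Self::Layer) -> usize;
    fn set_len(layer: &mut Self::Layer, new_len: usize);
    fn capacity(layer: &Self::Layer) -> usize;
}

pub struct Fp16Kind;
pub struct Int8Kind;

impl LayerKind for Fp16Kind {
    type Layer = DenseLayer;
    fn alloc(capacity: usize) -> DenseLayer {
        DenseLayer { len: 0, capacity }
    }
    fn len(layer: &DenseLayer) -> usize { layer.len }
    fn set_len(layer: &mut DenseLayer, new_len: usize) { layer.len = new_len; }
    fn capacity(layer: &DenseLayer) -> usize { layer.capacity }
}

impl LayerKind for Int8Kind {
    type Layer = QuantLayer;
    fn alloc(capacity: usize) -> QuantLayer {
        QuantLayer { len: 0, capacity, scales: vec![0.0; capacity] }
    }
    fn len(layer: &QuantLayer) -> usize { layer.len }
    fn set_len(layer: &mut QuantLayer, new_len: usize) { layer.len = new_len; }
    fn capacity(layer: &QuantLayer) -> usize { layer.capacity }
}

/// Generic over the dtype marker: advances the logical length after a write,
/// rejecting writes that would exceed the layer's capacity.
pub fn advance<K: LayerKind>(layer: &mut K::Layer, tokens: usize) -> Result<usize, String> {
    let new_len = K::len(layer) + tokens;
    if new_len > K::capacity(layer) {
        return Err(format!("{} tokens exceed capacity {}", new_len, K::capacity(layer)));
    }
    K::set_len(layer, new_len);
    Ok(new_len)
}
```

Because the marker is a type parameter, the choice between the two layer types is resolved at compile time.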

fn alloc_paged(
    max_blocks_per_seq: usize,
    block_size: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Self::Layer

Allocate a paged cache layer for one sequence.
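To make the two paging parameters concrete, here is a minimal sketch of the arithmetic a paged layout usually implies. PagedLayout and its methods are hypothetical and written only for this example. The assumptions that capacity equals max_blocks_per_seq × block_size, and that a per-sequence list of physical block ids maps logical positions to pool slots, follow the common paged-attention design and are not confirmed by this page.

```rust
/// Hypothetical helper; field names mirror the `alloc_paged` parameters.
pub struct PagedLayout {
    pub max_blocks_per_seq: usize,
    pub block_size: usize,
}

impl PagedLayout {
    /// Maximum number of tokens one sequence can hold (assumed: all blocks full).
    pub fn capacity(&self) -> usize {
        self.max_blocks_per_seq * self.block_size
    }

    /// Number of blocks needed for `tokens` positions (ceiling division).
    /// Panics if `block_size` is zero.
    pub fn blocks_needed(&self, tokens: usize) -> usize {
        (tokens + self.block_size - 1) / self.block_size
    }

    /// Maps a logical token position to (physical block id, offset within block)
    /// using the sequence's list of physical block ids.
    /// Returns `None` if no block has been assigned for that position yet.
    pub fn locate(&self, block_indices: &[u32], pos: usize) -> Option<(u32, usize)> {
        let logical_block = pos / self.block_size;
        let offset = pos % self.block_size;
        block_indices.get(logical_block).map(|&b| (b, offset))
    }
}
```

With block_size = 16 and block ids [7, 3], position 20 falls in the second logical block, so it resolves to physical block 3 at offset 4.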

fn alloc_contig(
    capacity: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Self::Layer

Allocate a contiguous cache layer (FP16 only; INT8 panics).

fn len(layer: &Self::Layer) -> usize

fn set_len(layer: &mut Self::Layer, new_len: usize)

fn capacity(layer: &Self::Layer) -> usize

fn block_size(layer: &Self::Layer) -> usize

fn num_kv_heads(layer: &Self::Layer) -> usize

fn head_dim(layer: &Self::Layer) -> usize

fn block_table(layer: &Self::Layer) -> Option<&B::Buffer>

fn block_table_mut(layer: &mut Self::Layer) -> Option<&mut B::Buffer>

fn context_lens(layer: &Self::Layer) -> Option<&B::Buffer>

fn context_lens_mut(layer: &mut Self::Layer) -> Option<&mut B::Buffer>

fn paged_block_indices(layer: &Self::Layer) -> &[u32]

fn paged_block_indices_mut(layer: &mut Self::Layer) -> &mut Vec<u32>

fn paged_write(
    ctx: &mut B::Context,
    layer: &mut Self::Layer,
    qkv: &B::Buffer,
    q_norm_w: &B::Buffer,
    k_norm_w: &B::Buffer,
    cos: &B::Buffer,
    sin: &B::Buffer,
    q_out: &mut B::Buffer,
    _k_scratch: &mut B::Buffer,
    _v_scratch: &mut B::Buffer,
    pool_k: &mut B::Buffer,
    pool_v: &mut B::Buffer,
    tokens: usize,
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    pos_offset: usize,
    eps: f32,
    qk_mode: i32,
) -> Result<()>

Paged write: split QKV → norm → RoPE → write K/V into the paged pool. FP16 uses B::split_qkv_norm_rope_into_paged_cache. INT8 uses B::split_qkv_norm_rope + B::int8_kv_append_paged.
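The first stage of that pipeline, splitting the fused QKV buffer, can be sketched on the host as plain index arithmetic. qkv_ranges is a hypothetical helper, and the [Q | K | V] ordering within each token row is an assumption; this page does not state the layout that B::split_qkv_norm_rope_into_paged_cache expects.

```rust
use std::ops::Range;

/// Element ranges of Q, K and V inside one token's fused QKV row.
/// Assumes the row is laid out as [Q | K | V]; the real kernel's layout
/// is not confirmed by this page.
pub fn qkv_ranges(
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> (Range<usize>, Range<usize>, Range<usize>) {
    let q_len = num_q_heads * head_dim;
    // K and V each hold one `head_dim` vector per KV head.
    let kv_len = num_kv_heads * head_dim;
    (
        0..q_len,
        q_len..q_len + kv_len,
        q_len + kv_len..q_len + 2 * kv_len,
    )
}
```

With grouped-query attention (num_q_heads greater than num_kv_heads), the Q slice is correspondingly wider than the K and V slices.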

fn paged_decode_attention(
    ctx: &mut B::Context,
    layer: &mut Self::Layer,
    q: &B::Buffer,
    pool_k: &B::Buffer,
    pool_v: &B::Buffer,
    output: &mut B::Buffer,
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    final_kv_len: usize,
    tokens: usize,
) -> Result<()>

Paged decode attention. Reads from the per-layer cache, writes the attended output to output. FP16 reads from pool_k/pool_v; INT8 reads from layer-internal INT8 buffers (pool args ignored).
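For orientation, the computation that decode attention performs for a single head can be written as a small CPU reference. decode_attention_ref is a hypothetical function operating on f32 slices. It illustrates standard scaled dot-product attention only and makes no claim about how the backend kernels are implemented, how they read the paged pool, or which precision they use.

```rust
/// Single-head decode attention over the cached positions.
/// `q` is `[head_dim]`; `k` and `v` are row-major `[kv_len, head_dim]`.
pub fn decode_attention_ref(q: &[f32], k: &[f32], v: &[f32], head_dim: usize) -> Vec<f32> {
    assert_eq!(q.len(), head_dim);
    assert_eq!(k.len(), v.len());
    assert_eq!(k.len() % head_dim, 0);
    let kv_len = k.len() / head_dim;
    let scale = 1.0 / (head_dim as f32).sqrt();

    // Scaled dot-product score for each cached position.
    let scores: Vec<f32> = (0..kv_len)
        .map(|t| {
            let row = &k[t * head_dim..(t + 1) * head_dim];
            q.iter().zip(row).map(|(a, b)| a * b).sum::<f32>() * scale
        })
        .collect();

    // Numerically stable softmax: subtract the maximum before exponentiating.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();

    // Output is the softmax-weighted sum of the value rows.
    let mut out = vec![0.0f32; head_dim];
    for t in 0..kv_len {
        let w = exps[t] / denom;
        let row = &v[t * head_dim..(t + 1) * head_dim];
        for (o, x) in out.iter_mut().zip(row) {
            *o += w * x;
        }
    }
    out
}
```

When every key is identical, all positions receive the same weight and the output is the mean of the value rows, which gives a simple sanity check.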

fn contig_write(
    ctx: &mut B::Context,
    layer: &mut Self::Layer,
    qkv: &B::Buffer,
    q_norm_w: &B::Buffer,
    k_norm_w: &B::Buffer,
    cos: &B::Buffer,
    sin: &B::Buffer,
    q_out: &mut B::Buffer,
    k_scratch: &mut B::Buffer,
    v_scratch: &mut B::Buffer,
    q_buf: &mut B::Buffer,
    k_buf: &mut B::Buffer,
    v_buf: &mut B::Buffer,
    tokens: usize,
    num_q_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    pos_offset: usize,
    eps: f32,
    qk_mode: i32,
) -> Result<()>

Contiguous write: FP16 only. INT8 inherits the panicking default implementation. Because KvInt8::alloc_contig already panics in ensure_kv, this branch is dead code on the INT8 path.

fn contig_decode_attention(
    ctx: &mut B::Context,
    layer: &Self::Layer,
    q: &B::Buffer,
    output: &mut B::Buffer,
    attn_cfg: AttnConfig,
    tokens: usize,
    pos_offset: usize,
) -> Result<()>

Contiguous decode attention: FP16 only.

fn is_paged(layer: &Self::Layer) -> bool

Auto Trait Implementations

impl Freeze for KvFp16

impl RefUnwindSafe for KvFp16

impl Send for KvFp16

impl Sync for KvFp16

impl Unpin for KvFp16

impl UnsafeUnpin for KvFp16

impl UnwindSafe for KvFp16

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Same for T

type Output = T

Should always be Self.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.