pub struct QuantizedKvLayer {
pub num_kv_heads: usize,
pub head_dim: usize,
pub capacity: usize,
pub len: usize,
/* private fields */
}
A single layer’s INT8-quantized KV cache.
Memory layout for the INT8 data arrays uses the token-major order
[token_pos * num_kv_heads * head_dim], so sequential decode steps
append contiguous blocks. Scale arrays use [token_pos * num_kv_heads].
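The token-major layout described above can be sketched as a pair of index helpers. The function names are hypothetical and are not part of this crate's API; the head-then-dimension ordering within a token is an assumption based on the shape documented for push.

```rust
/// Hypothetical helper: offset of the first INT8 element for `(token_pos, head)`
/// in a token-major buffer of shape [capacity, num_kv_heads, head_dim].
/// Assumes heads are stored one after another within each token block.
fn data_offset(token_pos: usize, head: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    (token_pos * num_kv_heads + head) * head_dim
}

/// Hypothetical helper: offset of the scale for `(token_pos, head)` in a
/// buffer of shape [capacity, num_kv_heads] (one scale per head per token).
fn scale_offset(token_pos: usize, head: usize, num_kv_heads: usize) -> usize {
    token_pos * num_kv_heads + head
}
```

Under this layout, consecutive token positions are exactly num_kv_heads * head_dim elements apart, which is why each decode step appends one contiguous block.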
Fields§
num_kv_heads: usize
Number of KV attention heads.
head_dim: usize
Dimension of each attention head.
capacity: usize
Maximum number of token positions pre-allocated.
len: usize
Number of token positions actually stored so far.
Implementations§
impl QuantizedKvLayer
pub fn new(capacity: usize, num_kv_heads: usize, head_dim: usize) -> Self
Allocate an empty quantized KV layer with the given dimensions.
Pre-allocates all storage so that subsequent push calls
do not allocate.
pub fn push(&mut self, keys: &[f32], values: &[f32]) -> Result<(), QuantKvError>
Append keys and values for the next token position.
keys must be a flat slice of shape [num_kv_heads * head_dim] (heads
first, then dims). values must have the same shape.
Each head’s row is quantized independently with its own scale.
§Errors
QuantKvError::CapacityExceeded if self.len == self.capacity.
QuantKvError::ShapeMismatch if keys or values has the wrong length.
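Per-head quantization with an independent scale might look like the sketch below. The exact scheme used by push is not documented here, so this assumes symmetric absmax quantization (scale = max|x| / 127); the function names quantize_row and quantize_token are hypothetical.

```rust
/// Hypothetical sketch: quantize one head's row to INT8 with a single f32
/// scale, assuming symmetric absmax quantization so that x ≈ q as f32 * scale.
fn quantize_row(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    // An all-zero row gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q: Vec<i8> = row
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Hypothetical sketch: quantize a flat [num_kv_heads * head_dim] slice head
/// by head, producing one scale per head.
fn quantize_token(flat: &[f32], head_dim: usize) -> (Vec<i8>, Vec<f32>) {
    let mut data = Vec::with_capacity(flat.len());
    let mut scales = Vec::with_capacity(flat.len() / head_dim);
    for row in flat.chunks_exact(head_dim) {
        let (q, s) = quantize_row(row);
        data.extend_from_slice(&q);
        scales.push(s);
    }
    (data, scales)
}
```

Giving each head its own scale keeps a head with large activations from crushing the resolution of a head with small ones.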
pub fn get_key(&self, token_pos: usize, head: usize) -> Result<Vec<f32>, QuantKvError>
Get dequantized keys for a specific token position and head.
Returns a Vec<f32> of length head_dim.
§Errors
QuantKvError::PositionOutOfRange if token_pos >= self.len.
QuantKvError::HeadOutOfRange if head >= self.num_kv_heads.
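Reading one head back out amounts to slicing its row from the flat INT8 buffer and multiplying by the matching scale. The sketch below assumes the token-major layout from the type description and a multiply-by-scale dequantization; dequantize_head is a hypothetical free function, and unlike get_key it does no range checking and panics on out-of-range indices.

```rust
/// Hypothetical sketch: dequantize the row for `(token_pos, head)` from a
/// token-major INT8 buffer with one f32 scale per (token, head) row.
fn dequantize_head(
    data: &[i8],
    scales: &[f32],
    token_pos: usize,
    head: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Vec<f32> {
    // Row index is shared by the data buffer (times head_dim) and the scales.
    let row = token_pos * num_kv_heads + head;
    let scale = scales[row];
    data[row * head_dim..(row + 1) * head_dim]
        .iter()
        .map(|&q| q as f32 * scale)
        .collect()
}
```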
pub fn get_value(&self, token_pos: usize, head: usize) -> Result<Vec<f32>, QuantKvError>
Get dequantized values for a specific token position and head.
Returns a Vec<f32> of length head_dim.
§Errors
QuantKvError::PositionOutOfRange if token_pos >= self.len.
QuantKvError::HeadOutOfRange if head >= self.num_kv_heads.
pub fn get_keys_at(&self, token_pos: usize) -> Result<Vec<f32>, QuantKvError>
Get all dequantized keys for a token position (all heads, interleaved).
Returns a flat Vec<f32> of length num_kv_heads * head_dim.
§Errors
QuantKvError::PositionOutOfRange if token_pos >= self.len.
pub fn get_values_at(&self, token_pos: usize) -> Result<Vec<f32>, QuantKvError>
Get all dequantized values for a token position (all heads, interleaved).
Returns a flat Vec<f32> of length num_kv_heads * head_dim.
§Errors
QuantKvError::PositionOutOfRange if token_pos >= self.len.
pub fn memory_bytes(&self) -> usize
Memory used by this layer in bytes (INT8 data + f32 scales).
Only accounts for the pre-allocated storage slabs, not struct overhead.
pub fn fp32_memory_bytes(&self) -> usize
Equivalent memory if the same data were stored as FP32 (no scales).
Computed as 2 * capacity * num_kv_heads * head_dim * 4 bytes (keys plus values, 4 bytes per element).
pub fn compression_ratio(&self) -> f32
Compression ratio versus FP32 storage.
Values approaching 4.0 indicate near-ideal INT8 compression. The ratio is slightly below 4.0 because per-row f32 scales add overhead.
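The arithmetic behind the ratio can be sketched as follows. This assumes one f32 scale per (token, head) row for keys and another for values, which matches the scale layout described for this type; the function names are hypothetical stand-ins for memory_bytes, fp32_memory_bytes, and compression_ratio.

```rust
/// Hypothetical sketch: INT8 storage for keys and values (1 byte per element)
/// plus one 4-byte f32 scale per (token, head) row for each of keys and values.
fn int8_bytes(capacity: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    let rows = capacity * num_kv_heads;
    2 * rows * head_dim + 2 * rows * 4
}

/// Equivalent FP32 storage for keys and values, with no scales.
fn fp32_bytes(capacity: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    2 * capacity * num_kv_heads * head_dim * 4
}

/// Ratio of FP32 bytes to INT8 bytes; simplifies to 4 * head_dim / (head_dim + 4).
fn ratio(capacity: usize, num_kv_heads: usize, head_dim: usize) -> f32 {
    fp32_bytes(capacity, num_kv_heads, head_dim) as f32
        / int8_bytes(capacity, num_kv_heads, head_dim) as f32
}
```

Under these assumptions the ratio depends only on head_dim: with head_dim = 128 it is 512 / 132, roughly 3.88, and it moves closer to 4.0 as head_dim grows.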
Trait Implementations§
Auto Trait Implementations§
impl Freeze for QuantizedKvLayer
impl RefUnwindSafe for QuantizedKvLayer
impl Send for QuantizedKvLayer
impl Sync for QuantizedKvLayer
impl Unpin for QuantizedKvLayer
impl UnsafeUnpin for QuantizedKvLayer
impl UnwindSafe for QuantizedKvLayer
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.