pub struct QuantizedKvCache {
pub num_layers: usize,
pub num_kv_heads: usize,
pub head_dim: usize,
/* private fields */
}
Full multi-layer INT8-quantized KV cache for autoregressive decoding.
Wraps one QuantizedKvLayer per transformer layer and exposes a
unified decode-step interface through push_step.
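The page does not say which INT8 scheme the cache uses. As a rough illustration only, the sketch below assumes symmetric absmax quantization with one f32 scale per head vector. The helpers quantize_i8 and dequantize_i8 are hypothetical names and are not part of this crate's API.

```rust
/// Quantize one head vector to INT8 using a single absmax scale.
/// Hypothetical helper: assumes a symmetric per-vector scheme, which the
/// crate documentation does not confirm.
pub fn quantize_i8(values: &[f32]) -> (Vec<i8>, f32) {
    // The largest magnitude in the vector determines the scale.
    let max_abs = values.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    // Fall back to a scale of 1.0 so an all-zero vector does not divide by zero.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let codes = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (codes, scale)
}

/// Recover approximate f32 values from INT8 codes and their scale.
pub fn dequantize_i8(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| f32::from(c) * scale).collect()
}
```

Under this assumption, each restored element differs from the original by at most about half a quantization step (scale / 2).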
Fields
num_layers: usize
Number of transformer layers.
num_kv_heads: usize
Number of KV attention heads per layer.
head_dim: usize
Dimension of each attention head.
Implementations
impl QuantizedKvCache
pub fn new(
    num_layers: usize,
    capacity: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Self
Allocate a new quantized KV cache for num_layers transformer layers.
Each layer is pre-allocated for capacity token positions.
pub fn push_step(
    &mut self,
    all_keys: &[Vec<f32>],
    all_values: &[Vec<f32>],
) -> Result<(), QuantKvError>
Append KV tensors for all layers at the current decode step.
all_keys[layer] must be a flat slice of shape [num_kv_heads * head_dim].
all_values[layer] must have the same shape.
Errors
- QuantKvError::LayerOutOfRange if all_keys.len() != self.num_layers.
- Propagates QuantKvError from each layer's push.
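The shape contract above can be checked before calling push_step. The sketch below is a standalone pre-flight check, not the crate's own validation: check_step_shapes is a hypothetical helper, and it reports problems as a String rather than as QuantKvError, whose variants are only partly documented here.

```rust
/// Hypothetical pre-flight check mirroring the documented push_step contract:
/// one flat vector per layer, each of length num_kv_heads * head_dim.
pub fn check_step_shapes(
    all_keys: &[Vec<f32>],
    all_values: &[Vec<f32>],
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
) -> Result<(), String> {
    // There must be exactly one key vector and one value vector per layer.
    if all_keys.len() != num_layers || all_values.len() != num_layers {
        return Err(format!(
            "expected {} layers, got {} key and {} value entries",
            num_layers,
            all_keys.len(),
            all_values.len()
        ));
    }
    let expected = num_kv_heads * head_dim;
    // Every per-layer vector must hold all heads for a single token position.
    for (layer, (k, v)) in all_keys.iter().zip(all_values).enumerate() {
        if k.len() != expected || v.len() != expected {
            return Err(format!(
                "layer {}: expected length {}, got keys {} and values {}",
                layer,
                expected,
                k.len(),
                v.len()
            ));
        }
    }
    Ok(())
}
```

For a cache with 3 layers, 2 KV heads, and head_dim 4, each of the 3 entries in all_keys and all_values would hold 8 floats.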
pub fn get_key(
    &self,
    layer: usize,
    token_pos: usize,
    head: usize,
) -> Result<Vec<f32>, QuantKvError>
Get dequantized keys for a specific layer, token position, and head.
Errors
- QuantKvError::LayerOutOfRange if layer >= self.num_layers.
- Propagates position/head errors from the underlying layer.
pub fn get_value(
    &self,
    layer: usize,
    token_pos: usize,
    head: usize,
) -> Result<Vec<f32>, QuantKvError>
Get dequantized values for a specific layer, token position, and head.
Errors
- QuantKvError::LayerOutOfRange if layer >= self.num_layers.
- Propagates position/head errors from the underlying layer.
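To relate the head argument of get_key and get_value to the flat per-layer vectors passed to push_step, the sketch below assumes a head-major layout, where head h occupies elements h * head_dim up to (h + 1) * head_dim. The page does not confirm this layout, and head_slice is a hypothetical helper rather than a crate function.

```rust
/// Return the sub-slice for one head from a flat [num_kv_heads * head_dim]
/// vector, assuming a head-major layout (an assumption, not documented).
/// Returns None when the head index falls outside the vector.
pub fn head_slice(flat: &[f32], head: usize, head_dim: usize) -> Option<&[f32]> {
    // Checked arithmetic avoids overflow on very large indices.
    let start = head.checked_mul(head_dim)?;
    let end = start.checked_add(head_dim)?;
    // slice::get returns None instead of panicking on out-of-range bounds.
    flat.get(start..end)
}
```

Returning None for an out-of-range head plays the role that the propagated head error plays in the real API.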
pub fn total_memory_bytes(&self) -> usize
Total memory used across all layers in bytes.
pub fn total_fp32_memory_bytes(&self) -> usize
FP32-equivalent memory across all layers.
pub fn compression_ratio(&self) -> f32
Overall compression ratio vs FP32.
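The three accounting methods can be approximated by hand. The sketch below rests on assumptions the page does not state: keys and values are both counted, each (token, head) vector stores head_dim INT8 codes plus one f32 scale, and the ratio is FP32 bytes divided by quantized bytes. All function names are hypothetical.

```rust
use std::mem::size_of;

/// FP32 bytes for keys and values over `tokens` stored positions.
pub fn fp32_kv_bytes(num_layers: usize, tokens: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    2 * num_layers * tokens * num_kv_heads * head_dim * size_of::<f32>()
}

/// INT8 bytes under the assumed layout: one byte per element plus one
/// f32 scale per (token, head) vector.
pub fn int8_kv_bytes(num_layers: usize, tokens: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    let per_vector = head_dim * size_of::<i8>() + size_of::<f32>();
    2 * num_layers * tokens * num_kv_heads * per_vector
}

/// Ratio of FP32 bytes to quantized bytes; 0.0 for an empty cache.
pub fn ratio_vs_fp32(fp32_bytes: usize, quantized_bytes: usize) -> f32 {
    if quantized_bytes == 0 {
        return 0.0;
    }
    fp32_bytes as f32 / quantized_bytes as f32
}
```

With 2 layers, 10 tokens, 4 KV heads, and head_dim 64, this gives 40960 FP32 bytes against 10880 quantized bytes, a ratio of roughly 3.76. The per-vector scale is what keeps the ratio below the ideal 4.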
Trait Implementations
Auto Trait Implementations
impl Freeze for QuantizedKvCache
impl RefUnwindSafe for QuantizedKvCache
impl Send for QuantizedKvCache
impl Sync for QuantizedKvCache
impl Unpin for QuantizedKvCache
impl UnsafeUnpin for QuantizedKvCache
impl UnwindSafe for QuantizedKvCache
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.