pub struct KvCacheQuant<B: BackendKvDtype<K>, K: KvDtypeKind> {
    pub k: <B as BackendKvDtype<K>>::KvBuffer,
    pub v: <B as BackendKvDtype<K>>::KvBuffer,
    pub k_scales: <B as BackendKvDtype<K>>::KvScales,
    pub v_scales: <B as BackendKvDtype<K>>::KvScales,
    pub len: usize,
    pub capacity: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub block_size: usize,
    pub block_table: Option<B::Buffer>,
    pub context_lens: Option<B::Buffer>,
    pub paged_block_indices: Vec<u32>,
    pub _kv_dtype: PhantomData<K>,
}
Quantized-KV cache (Dim 5 INT8 / future FP8 paths). Sibling of
KvCache for backends that store K/V in a non-FP16 element type
plus per-token per-kv-head scales.
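As a rough illustration of what "per-token per-kv-head scales" can mean, the host-side sketch below quantizes a [tokens, num_kv_heads, head_dim] tensor to INT8 with one scale per (token, head) row. The symmetric absmax scheme, the f32 scale type, and the function names are assumptions made for this example; the crate's actual kernels and scale computation are not shown in this page and may differ (on CUDA the scales are stored as f16).

```rust
/// Quantize one `head_dim`-long row to INT8 with a single symmetric scale.
/// Returns `(q, scale)` such that `x ≈ q as f32 * scale`.
/// Assumption: absmax scaling to the range [-127, 127].
fn quantize_row(row: &[f32]) -> (Vec<i8>, f32) {
    let absmax = row.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    // An all-zero row gets scale 1.0 so we never divide by zero.
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let q = row
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Reverse of `quantize_row`: multiply each INT8 value by the row's scale.
fn dequantize_row(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

/// Quantize a flat [tokens, num_kv_heads, head_dim] tensor.
/// Produces one scale per (token, kv head), i.e. `x.len() / head_dim` scales.
fn quantize_kv(x: &[f32], num_kv_heads: usize, head_dim: usize) -> (Vec<i8>, Vec<f32>) {
    assert!(num_kv_heads > 0 && head_dim > 0);
    assert!(x.len() % (num_kv_heads * head_dim) == 0);
    let mut q = Vec::with_capacity(x.len());
    let mut scales = Vec::with_capacity(x.len() / head_dim);
    for row in x.chunks(head_dim) {
        let (qr, s) = quantize_row(row);
        q.extend(qr);
        scales.push(s);
    }
    (q, scales)
}
```

With this layout the scale buffer holds tokens * num_kv_heads entries, which is why the struct carries k_scales and v_scales alongside k and v.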
Why a separate struct: the FP16 KvCache<B, K> uses B::Buffer
uniformly, which is FP16 on every concrete backend. Stuffing INT8
storage into that buffer would require unsafe transmutes; making
the FP16 struct generic over the storage type would force every
existing call site (4 model files, ~20 functions) to pick up an
equality bound on the associated type. Keeping a parallel struct
for INT8 is the cheaper trade-off, since the kernel launchers in
[crate::int8_kv] take cudarc primitives directly anyway.
The K/V storage type (KStorage) and the scale storage type
(ScaleStorage) come from BackendKvDtype<K>::KvBuffer and
BackendKvDtype<K>::KvScales, respectively. On CUDA they wrap
CudaSlice<i8> and CudaSlice<f16>.
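The associated-type pattern described above can be sketched as follows. Everything here is a simplified stand-in: the trait bodies, the alloc_kv and alloc_scales helpers, CpuBackend, Int8Kv, QuantCache, and the buffer sizing (capacity * num_kv_heads * head_dim elements, capacity * num_kv_heads scales) are hypothetical and chosen only to show how a backend picks its storage types per KV dtype. The real BackendKvDtype and KvDtypeKind traits may have different items.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for `KvDtypeKind`: a marker for the KV element kind.
trait KvDtypeKind {}
struct Int8Kv;
impl KvDtypeKind for Int8Kv {}

/// Hypothetical stand-in for `BackendKvDtype<K>`: maps (backend, kind)
/// to concrete storage types. The alloc helpers are assumptions.
trait BackendKvDtype<K: KvDtypeKind> {
    type KvBuffer;
    type KvScales;
    fn alloc_kv(elems: usize) -> Self::KvBuffer;
    fn alloc_scales(rows: usize) -> Self::KvScales;
}

/// Host-memory backend used only for illustration.
struct CpuBackend;
impl BackendKvDtype<Int8Kv> for CpuBackend {
    type KvBuffer = Vec<i8>;
    type KvScales = Vec<f32>;
    fn alloc_kv(elems: usize) -> Vec<i8> {
        vec![0; elems]
    }
    fn alloc_scales(rows: usize) -> Vec<f32> {
        vec![0.0; rows]
    }
}

/// Reduced version of the quantized cache: storage types are projected
/// from the backend, so no equality bound leaks to call sites.
struct QuantCache<B: BackendKvDtype<K>, K: KvDtypeKind> {
    k: <B as BackendKvDtype<K>>::KvBuffer,
    v: <B as BackendKvDtype<K>>::KvBuffer,
    k_scales: <B as BackendKvDtype<K>>::KvScales,
    v_scales: <B as BackendKvDtype<K>>::KvScales,
    len: usize,
    capacity: usize,
    _kv_dtype: PhantomData<K>,
}

impl<B: BackendKvDtype<K>, K: KvDtypeKind> QuantCache<B, K> {
    /// Assumed layout: one scale per (token, kv head) row.
    fn new(capacity: usize, num_kv_heads: usize, head_dim: usize) -> Self {
        let rows = capacity * num_kv_heads;
        Self {
            k: <B as BackendKvDtype<K>>::alloc_kv(rows * head_dim),
            v: <B as BackendKvDtype<K>>::alloc_kv(rows * head_dim),
            k_scales: <B as BackendKvDtype<K>>::alloc_scales(rows),
            v_scales: <B as BackendKvDtype<K>>::alloc_scales(rows),
            len: 0,
            capacity,
            _kv_dtype: PhantomData,
        }
    }
}
```

A caller would write QuantCache::<CpuBackend, Int8Kv>::new(4, 2, 8) and get Vec<i8> and Vec<f32> storage without naming those types; a CUDA backend could substitute device slices in the same positions.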
Fields
k: <B as BackendKvDtype<K>>::KvBuffer
v: <B as BackendKvDtype<K>>::KvBuffer
k_scales: <B as BackendKvDtype<K>>::KvScales
v_scales: <B as BackendKvDtype<K>>::KvScales
len: usize
capacity: usize
num_kv_heads: usize
head_dim: usize
block_size: usize
block_table: Option<B::Buffer>
context_lens: Option<B::Buffer>
paged_block_indices: Vec<u32>
_kv_dtype: PhantomData<K>