
Crate mistralrs_kv_cache


Trait interface for compressed KV-cache implementations.

This crate defines CompressedKVCache, the contract between inference engines (such as mistral.rs) and KV-cache compression libraries (such as turboquant).

Implementing this trait is the only requirement for integrating a new cache compression method. The inference engine calls prefill() during multi-token processing and decode() during single-token generation. All compression decisions (fused vs dequantized, lazy vs immediate, CPU vs GPU) are made internally by the implementation.
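The division of labor described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: `ToyKvCache`, `Int8Cache`, `QuantVec`, and `run_demo` are hypothetical names, the method signatures are simplified to plain `f32` slices, and the per-vector int8 scheme is just one possible compression method. The real `CompressedKVCache` trait operates on `Tensor` values, and its exact signatures are not reproduced here. Only the `prefill()`/`decode()` split is taken from the description.

```rust
/// Hypothetical, simplified stand-in for the crate's trait (NOT the real API).
trait ToyKvCache {
    /// Multi-token path: store keys/values for several tokens at once.
    fn prefill(&mut self, keys: &[Vec<f32>], values: &[Vec<f32>]);
    /// Single-token path: append one key/value pair, then attend over the cache.
    fn decode(&mut self, query: &[f32], key: &[f32], value: &[f32]) -> Vec<f32>;
    /// Number of cached tokens.
    fn len(&self) -> usize;
}

/// One vector stored as 8-bit integers plus a single f32 scale.
struct QuantVec {
    data: Vec<i8>,
    scale: f32,
}

impl QuantVec {
    fn quantize(x: &[f32]) -> Self {
        let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        // Avoid dividing by zero for an all-zero vector.
        let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
        let data = x.iter().map(|v| (v / scale).round() as i8).collect();
        QuantVec { data, scale }
    }

    fn dequantize(&self) -> Vec<f32> {
        self.data.iter().map(|&q| q as f32 * self.scale).collect()
    }
}

/// Toy cache that compresses immediately and dequantizes on every decode.
/// Those choices stay internal; the caller only sees prefill/decode.
#[derive(Default)]
struct Int8Cache {
    keys: Vec<QuantVec>,
    values: Vec<QuantVec>,
}

impl ToyKvCache for Int8Cache {
    fn prefill(&mut self, keys: &[Vec<f32>], values: &[Vec<f32>]) {
        for (k, v) in keys.iter().zip(values) {
            self.keys.push(QuantVec::quantize(k));
            self.values.push(QuantVec::quantize(v));
        }
    }

    fn decode(&mut self, query: &[f32], key: &[f32], value: &[f32]) -> Vec<f32> {
        self.keys.push(QuantVec::quantize(key));
        self.values.push(QuantVec::quantize(value));

        // Scaled dot-product scores against every cached key.
        let inv_sqrt_d = 1.0 / (query.len() as f32).sqrt();
        let scores: Vec<f32> = self
            .keys
            .iter()
            .map(|k| k.dequantize().iter().zip(query).map(|(a, b)| a * b).sum::<f32>() * inv_sqrt_d)
            .collect();

        // Numerically stable softmax.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();

        // Attention output: softmax-weighted sum of dequantized values.
        let mut out = vec![0.0f32; value.len()];
        for (w, v) in exps.iter().zip(&self.values) {
            for (o, x) in out.iter_mut().zip(v.dequantize()) {
                *o += (w / denom) * x;
            }
        }
        out
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

/// Drives the cache the way an engine might: one prefill, then one decode step.
fn run_demo() -> (usize, Vec<f32>) {
    let mut cache = Int8Cache::default();
    let prompt_k = vec![vec![0.5, -0.25], vec![0.1, 0.9]];
    let prompt_v = vec![vec![1.0, 2.0], vec![1.0, 2.0]];
    cache.prefill(&prompt_k, &prompt_v);
    let out = cache.decode(&[0.3, 0.7], &[0.2, 0.4], &[1.0, 2.0]);
    (cache.len(), out)
}
```

In this sketch the caller never learns that the cache stores int8 data or that it dequantizes before attending. Swapping in a fused or lazily compressed variant would change only the `impl` block, which is the kind of encapsulation the trait description above refers to.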

Structs§

AttendConfig
Configuration for attention computation during decode.
DequantResult
Result of decompressing cached KV data.
Tensor
The core struct for manipulating tensors.

Enums§

DType
The different types of elements allowed in tensors.
DecodeOutput
Result of a decode step.
Device
Cpu, Cuda, or Metal

Traits§

CompressedKVCache
Trait for compressed KV-cache implementations.

Type Aliases§

Result