Trait interface for compressed KV-cache implementations.
This crate defines CompressedKVCache — the contract between inference
engines (like mistral.rs) and KV-cache compression libraries (like turboquant).
Implementing this trait is the only requirement for integrating a new
cache compression method. The inference engine calls prefill() during
multi-token processing and decode() during single-token generation.
All compression decisions (fused vs dequantized, lazy vs immediate,
CPU vs GPU) are made internally by the implementation.
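
The prefill/decode split described above can be sketched with a toy stand-in. Everything named here (`ToyCompressedCache`, `Int8Cache`, `run_engine`, `cached_tokens`) is hypothetical and exists only for illustration: the real CompressedKVCache trait operates on Tensor values and its method signatures are not reproduced here. The sketch assumes a non-empty prompt, uses plain `Vec<f32>` rows, and picks simple per-vector int8 quantization with a dequantize-then-attend decode path as one possible internal choice.

```rust
/// Hypothetical, simplified stand-in for the crate's trait (NOT its real signatures).
trait ToyCompressedCache {
    /// Multi-token path: compress and store keys/values for a whole prompt.
    fn prefill(&mut self, keys: &[Vec<f32>], values: &[Vec<f32>]);
    /// Single-token path: store one key/value pair, then attend over the cache.
    fn decode(&mut self, query: &[f32], key: &[f32], value: &[f32]) -> Vec<f32>;
    /// Number of token positions currently cached.
    fn cached_tokens(&self) -> usize;
}

/// Toy cache: each vector is stored as int8 codes plus one f32 scale.
#[derive(Default)]
struct Int8Cache {
    keys: Vec<(Vec<i8>, f32)>,
    values: Vec<(Vec<i8>, f32)>,
}

fn quantize(v: &[f32]) -> (Vec<i8>, f32) {
    let max = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max > 0.0 { max / 127.0 } else { 1.0 };
    (v.iter().map(|x| (x / scale).round() as i8).collect(), scale)
}

fn dequantize(entry: &(Vec<i8>, f32)) -> Vec<f32> {
    entry.0.iter().map(|&x| x as f32 * entry.1).collect()
}

impl ToyCompressedCache for Int8Cache {
    fn prefill(&mut self, keys: &[Vec<f32>], values: &[Vec<f32>]) {
        self.keys.extend(keys.iter().map(|k| quantize(k)));
        self.values.extend(values.iter().map(|v| quantize(v)));
    }

    fn decode(&mut self, query: &[f32], key: &[f32], value: &[f32]) -> Vec<f32> {
        self.keys.push(quantize(key));
        self.values.push(quantize(value));

        // Internal decision: dequantize, then run scaled dot-product attention.
        let scale = 1.0 / (query.len() as f32).sqrt();
        let scores: Vec<f32> = self
            .keys
            .iter()
            .map(|k| dequantize(k).iter().zip(query).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();

        // Numerically stable softmax over the scores.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();

        // Weighted sum of the dequantized values.
        let mut out = vec![0.0f32; value.len()];
        for (w, v) in exps.iter().zip(&self.values) {
            for (o, x) in out.iter_mut().zip(&dequantize(v)) {
                *o += (w / denom) * x;
            }
        }
        out
    }

    fn cached_tokens(&self) -> usize {
        self.keys.len()
    }
}

/// Engine side: one prefill() for the prompt, then one decode() per new token.
/// Reusing the previous output as query, key, and value is a toy shortcut.
fn run_engine<C: ToyCompressedCache>(cache: &mut C, prompt: &[Vec<f32>], steps: usize) -> Vec<f32> {
    cache.prefill(prompt, prompt);
    let mut token = prompt.last().cloned().unwrap_or_default();
    for _ in 0..steps {
        token = cache.decode(&token, &token, &token);
    }
    token
}
```

Note that `run_engine` never inspects how the data is stored. Swapping `Int8Cache` for a different scheme would not change the engine loop, which is the separation of concerns the trait is meant to provide.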
Structs§
- AttendConfig - Configuration for attention computation during decode.
- DequantResult - Result of decompressing cached KV data.
- Tensor - The core struct for manipulating tensors.
Enums§
- DType - The different types of elements allowed in tensors.
- DecodeOutput - Result of a decode step.
- Device - Cpu, Cuda, or Metal
Traits§
- CompressedKVCache - Trait for compressed KV-cache implementations.