Crate mistralrs_kv_cache

Trait interface for compressed KV-cache implementations.

This crate defines CompressedKVCache, the contract between inference engines (like mistral.rs) and KV-cache compression libraries (like turboquant).

Implementing this trait is the only requirement for integrating a new cache compression method. The inference engine calls prefill() during multi-token processing and decode() during single-token generation. All compression decisions (fused vs dequantized, lazy vs immediate, CPU vs GPU) are made internally by the implementation.
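The call pattern described above can be illustrated with a minimal, dependency-free sketch. Everything in it is hypothetical: ToyCompressedCache, Int8Cache, and run_engine are stand-ins invented for illustration, and the method signatures are assumptions. The real CompressedKVCache trait operates on Tensor values and returns Result, and its exact signatures are not reproduced here. The sketch only shows the division of labour: the engine calls prefill() once for the prompt and decode() once per generated token, while the compression scheme (here a toy 8-bit quantizer) stays hidden inside the implementation.

```rust
/// Hypothetical stand-in for the contract described above. The real
/// `CompressedKVCache` trait works on `Tensor` values and returns `Result`;
/// these simplified signatures are assumptions made for illustration.
pub trait ToyCompressedCache {
    /// Multi-token path: store keys and values for a whole prompt chunk.
    fn prefill(&mut self, keys: &[f32], values: &[f32]);
    /// Single-token path: store one key/value pair, then return every cached
    /// pair in decompressed form so the caller can attend over it.
    fn decode(&mut self, key: f32, value: f32) -> Vec<(f32, f32)>;
    /// Number of cached token positions.
    fn len(&self) -> usize;
}

/// Toy implementation: symmetric 8-bit quantization with a fixed scale,
/// assuming all inputs lie in [-1.0, 1.0].
#[derive(Default)]
pub struct Int8Cache {
    entries: Vec<(i8, i8)>,
}

fn quantize(x: f32) -> i8 {
    (x.clamp(-1.0, 1.0) * 127.0).round() as i8
}

fn dequantize(q: i8) -> f32 {
    q as f32 / 127.0
}

impl ToyCompressedCache for Int8Cache {
    fn prefill(&mut self, keys: &[f32], values: &[f32]) {
        assert_eq!(keys.len(), values.len(), "one value per key");
        for (k, v) in keys.iter().zip(values) {
            self.entries.push((quantize(*k), quantize(*v)));
        }
    }

    fn decode(&mut self, key: f32, value: f32) -> Vec<(f32, f32)> {
        self.entries.push((quantize(key), quantize(value)));
        // The choice to dequantize everything eagerly is internal to the
        // implementation; the caller never sees the compressed form.
        self.entries
            .iter()
            .map(|&(k, v)| (dequantize(k), dequantize(v)))
            .collect()
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}

/// Engine-side driver: one prefill for the prompt, then one decode call
/// per generated token. Returns the final number of cached positions.
pub fn run_engine<C: ToyCompressedCache>(cache: &mut C, prompt: &[f32], steps: usize) -> usize {
    cache.prefill(prompt, prompt);
    for step in 0..steps {
        // Placeholder for a freshly computed key/value of the new token.
        let x = step as f32 * 0.1;
        let _cached = cache.decode(x, x);
    }
    cache.len()
}
```

Because run_engine is generic over the trait, swapping Int8Cache for another compression scheme would require no change on the engine side, which is the property the crate description emphasizes.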

Structs§

AttendConfig: Configuration for attention computation during decode.
DequantResult: Result of decompressing cached KV data.
Tensor: The core struct for manipulating tensors.

Enums§

DType: The different types of elements allowed in tensors.
DecodeOutput: Result of a decode step.
Device: Cpu, Cuda, or Metal

Traits§

CompressedKVCache: Trait for compressed KV-cache implementations.

Type Aliases§

Result