# mistralrs-kv-cache
Trait interface for compressed KV-cache implementations in [mistral.rs](https://github.com/EricLBuehler/mistral.rs).
## Overview
This crate defines `CompressedKVCache`, the contract between inference engines and KV-cache compression libraries. Implementing this trait and registering the implementation with the engine's cache factory is all that is needed to integrate a new cache compression method.
```text
Inference Engine (mistral.rs)
|
v
CompressedKVCache trait <-- this crate
^
|
Compression Library (e.g. turboquant-rs)
```
## The Trait
```rust
pub trait CompressedKVCache: Send + Sync {
    /// Prefill: store multiple tokens, return decompressed KV for attention.
    fn prefill(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor)
        -> Result<DequantResult>;

    /// Decode: store one token, compute or prepare attention.
    fn decode(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor,
        config: &AttendConfig) -> Result<DecodeOutput>;

    /// Tokens currently stored for a layer.
    fn seq_len(&self, layer: usize) -> usize;

    /// Reset all layers.
    fn reset(&self) -> Result<()>;

    /// Persistent memory usage in bytes.
    fn memory_usage(&self) -> usize;
}
```
All methods take `&self`, so a single cache can be shared across threads
without an external lock. Implementations are responsible for interior
synchronization (e.g. per-layer locks) so that calls for different layers
may proceed in parallel. This is what enables use cases like speculative
decoding, where draft and target models run concurrently.
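
The per-layer locking idea can be sketched as follows. This is a minimal illustration, not the layout of any real implementation: `Tensor` and `Result` are local stand-ins for the engine's types, and `PerLayerCache`, `LayerState`, and `append` are hypothetical names. It stores tokens uncompressed and only shows how `&self` methods can mutate one layer without blocking the others.

```rust
use std::sync::Mutex;

// Stand-in for the engine's tensor type (assumption for this sketch).
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

// Stand-in for the crate's `Result` alias (assumption for this sketch).
pub type Result<T> = std::result::Result<T, String>;

// Hypothetical per-layer state. A real implementation would hold
// compressed blocks here; this sketch stores tokens uncompressed.
#[derive(Default)]
struct LayerState {
    keys: Vec<Tensor>,
    values: Vec<Tensor>,
}

/// Hypothetical cache with one lock per layer.
pub struct PerLayerCache {
    layers: Vec<Mutex<LayerState>>,
}

impl PerLayerCache {
    pub fn new(num_layers: usize) -> Self {
        let layers = (0..num_layers)
            .map(|_| Mutex::new(LayerState::default()))
            .collect();
        Self { layers }
    }

    /// Append one token's K/V to `layer` and return the new length.
    /// Only that layer's lock is taken, so other layers stay available.
    pub fn append(&self, layer: usize, k: Tensor, v: Tensor) -> Result<usize> {
        let slot = self
            .layers
            .get(layer)
            .ok_or_else(|| format!("no layer {}", layer))?;
        let mut state = slot.lock().map_err(|e| e.to_string())?;
        state.keys.push(k);
        state.values.push(v);
        Ok(state.keys.len())
    }

    /// Tokens currently stored for a layer (0 for an unknown layer).
    pub fn seq_len(&self, layer: usize) -> usize {
        self.layers
            .get(layer)
            .map_or(0, |slot| slot.lock().map_or(0, |s| s.keys.len()))
    }

    /// Clear every layer, taking the locks one at a time.
    pub fn reset(&self) -> Result<()> {
        for slot in &self.layers {
            let mut state = slot.lock().map_err(|e| e.to_string())?;
            state.keys.clear();
            state.values.clear();
        }
        Ok(())
    }
}
```

Because each layer has its own `Mutex`, two threads working on different layers never contend for the same lock. A single global lock would also satisfy `Send + Sync`, but it would serialize all calls.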
## Key Types
- **`DequantResult`** — Decompressed K, V tensors + optional `logit_bias` (e.g. QJL correction)
- **`DecodeOutput`** — Either `Fused(Tensor)` (implementation computed attention) or `Dequantized(DequantResult)` (caller runs SDPA)
- **`AttendConfig`** — Softmax scale + GQA group count
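
To make the relationship between these types concrete, here is a rough sketch of how a caller might handle a `DecodeOutput`. The type shapes are inferred from the descriptions above and are assumptions: the field names `k`, `v`, `softmax_scale`, and `n_kv_groups`, the `Tensor` stand-in, and the `finish_attention` helper are all hypothetical, and the `sdpa` closure stands in for whatever attention kernel the engine uses.

```rust
// Stand-in for the engine's tensor type (assumption for this sketch).
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

// Assumed shape: decompressed K and V plus an optional logit bias.
pub struct DequantResult {
    pub k: Tensor,
    pub v: Tensor,
    pub logit_bias: Option<Tensor>,
}

// Either the cache already computed attention, or the caller must.
pub enum DecodeOutput {
    Fused(Tensor),
    Dequantized(DequantResult),
}

// Assumed shape: softmax scale and GQA group count.
pub struct AttendConfig {
    pub softmax_scale: f32,
    pub n_kv_groups: usize,
}

/// Hypothetical caller-side helper: return the attention output,
/// running `sdpa` only when the cache did not fuse attention itself.
pub fn finish_attention<F>(
    out: DecodeOutput,
    q: &Tensor,
    cfg: &AttendConfig,
    sdpa: F,
) -> Tensor
where
    F: FnOnce(&Tensor, &DequantResult, &AttendConfig) -> Tensor,
{
    match out {
        // The implementation already produced the attention result.
        DecodeOutput::Fused(attn) => attn,
        // The caller runs SDPA on the decompressed K/V.
        DecodeOutput::Dequantized(kv) => sdpa(q, &kv, cfg),
    }
}
```

The two variants let an implementation choose per call: a backend with a fused kernel returns `Fused`, while a simpler one returns `Dequantized` and leaves attention to the engine.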
## Implementing a New Compression Method
1. Add this crate as a dependency
2. Implement `CompressedKVCache` for your cache struct
3. Register it in the inference engine's cache factory
4. All model implementations automatically benefit — no per-model changes needed
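
The steps above might look roughly like the sketch below. Everything here is illustrative: the trait is a trimmed local stand-in with only the tensor-free methods (and `Result<(), String>` in place of the crate's `Result`), while `CountingCache`, `CacheKind`, `build_cache`, and `total_tokens` are hypothetical names. The actual factory API in mistral.rs is not shown in this README and may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Trimmed stand-in for the trait (assumption: tensor methods omitted).
pub trait CompressedKVCache: Send + Sync {
    fn seq_len(&self, layer: usize) -> usize;
    fn reset(&self) -> Result<(), String>;
    fn memory_usage(&self) -> usize;
}

/// Hypothetical implementation that only counts tokens per layer.
pub struct CountingCache {
    lens: Vec<AtomicUsize>,
    bytes_per_token: usize,
}

impl CountingCache {
    pub fn new(num_layers: usize, bytes_per_token: usize) -> Self {
        let lens = (0..num_layers).map(|_| AtomicUsize::new(0)).collect();
        Self { lens, bytes_per_token }
    }

    /// Record that one token was stored for `layer`.
    pub fn record_token(&self, layer: usize) {
        if let Some(n) = self.lens.get(layer) {
            n.fetch_add(1, Ordering::Relaxed);
        }
    }
}

impl CompressedKVCache for CountingCache {
    fn seq_len(&self, layer: usize) -> usize {
        self.lens.get(layer).map_or(0, |n| n.load(Ordering::Relaxed))
    }

    fn reset(&self) -> Result<(), String> {
        for n in &self.lens {
            n.store(0, Ordering::Relaxed);
        }
        Ok(())
    }

    fn memory_usage(&self) -> usize {
        let tokens: usize = self.lens.iter().map(|n| n.load(Ordering::Relaxed)).sum();
        tokens * self.bytes_per_token
    }
}

/// Hypothetical factory selector; a real engine would list its backends here.
pub enum CacheKind {
    Counting { bytes_per_token: usize },
}

/// Hypothetical factory: the one place that names concrete cache types.
pub fn build_cache(kind: CacheKind, num_layers: usize) -> Arc<dyn CompressedKVCache> {
    match kind {
        CacheKind::Counting { bytes_per_token } => {
            Arc::new(CountingCache::new(num_layers, bytes_per_token))
        }
    }
}

/// Model-side code depends only on the trait, never on a concrete cache.
pub fn total_tokens(cache: &dyn CompressedKVCache, num_layers: usize) -> usize {
    (0..num_layers).map(|layer| cache.seq_len(layer)).sum()
}
```

Since model code only sees `dyn CompressedKVCache`, adding a new backend in this sketch means one new `CacheKind` variant and one new match arm in the factory, with no per-model edits.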
## Known Implementations
- [turboquant-rs](https://github.com/SaschaOnTour/turboquant) — 3-4 bit PolarQuant with optional QJL correction
## License
Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or [MIT License](LICENSE-MIT) at your option.