# mistralrs-kv-cache
Trait interface for compressed KV-cache implementations in mistral.rs.
## Overview
This crate defines CompressedKVCache — the contract between inference engines and KV-cache compression libraries. Implementing this trait is the only requirement for integrating a new cache compression method.
```text
Inference Engine (mistral.rs)
          |
          v
CompressedKVCache trait      <-- this crate
          ^
          |
Compression Library (e.g. turboquant-rs)
```
## The Trait

### Key Types
- `DequantResult` — decompressed K, V tensors plus an optional `logit_bias` (e.g. a QJL correction)
- `DecodeOutput` — either `Fused(Tensor)` (the implementation computed attention) or `Dequantized(DequantResult)` (the caller runs SDPA)
- `AttendConfig` — softmax scale plus the GQA group count
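The types above might look roughly like the following sketch. The real definitions live in this crate and may differ in field names and tensor type; the `Tensor` newtype here is a placeholder standing in for the engine's actual tensor.

```rust
/// Placeholder for the engine's tensor type (hypothetical; for illustration only).
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

/// Decompressed K/V plus an optional additive logit bias (e.g. a QJL correction).
pub struct DequantResult {
    pub k: Tensor,
    pub v: Tensor,
    pub logit_bias: Option<Tensor>,
}

/// What a decode step hands back: either a finished attention output,
/// or the decompressed tensors for the caller to run SDPA on.
pub enum DecodeOutput {
    Fused(Tensor),
    Dequantized(DequantResult),
}

/// Parameters the cache needs to attend correctly.
pub struct AttendConfig {
    pub softmax_scale: f32, // typically 1 / sqrt(head_dim)
    pub gqa_groups: usize,  // query heads per KV head
}
```

The `DecodeOutput` split lets a fused-kernel implementation skip materializing full-precision K/V entirely, while simpler implementations can just decompress and defer attention to the caller.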
## Implementing a New Compression Method
- Add this crate as a dependency
- Implement `CompressedKVCache` for your cache struct
- Register it in the inference engine's cache factory
- All model implementations automatically benefit — no per-model changes are needed
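The steps above can be sketched end to end with a toy int8 absmax cache. Everything here is hypothetical: the snippet redeclares illustrative versions of the trait and its types so it is self-contained, and the method names (`append`, `decode`) are assumptions, not this crate's actual API.

```rust
// Hypothetical placeholder types (the real ones live in this crate).
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

pub struct DequantResult {
    pub k: Tensor,
    pub v: Tensor,
    pub logit_bias: Option<Tensor>,
}

pub enum DecodeOutput {
    Fused(Tensor),
    Dequantized(DequantResult),
}

pub struct AttendConfig {
    pub softmax_scale: f32,
    pub gqa_groups: usize,
}

// Hypothetical trait shape; method names are assumptions.
pub trait CompressedKVCache {
    fn append(&mut self, k: &Tensor, v: &Tensor);
    fn decode(&self, q: &Tensor, cfg: &AttendConfig) -> DecodeOutput;
}

/// Toy cache: stores K/V as i8 with one absmax scale per appended tensor.
pub struct Int8Cache {
    k: Vec<(Vec<i8>, f32)>,
    v: Vec<(Vec<i8>, f32)>,
}

impl Int8Cache {
    pub fn new() -> Self {
        Self { k: Vec::new(), v: Vec::new() }
    }

    fn quantize(t: &Tensor) -> (Vec<i8>, f32) {
        let absmax = t.0.iter().fold(0f32, |m, x| m.max(x.abs()));
        let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
        (t.0.iter().map(|x| (x / scale).round() as i8).collect(), scale)
    }

    fn dequantize(q: &[(Vec<i8>, f32)]) -> Tensor {
        Tensor(
            q.iter()
                .flat_map(|(d, s)| d.iter().map(move |&x| x as f32 * *s))
                .collect(),
        )
    }
}

impl CompressedKVCache for Int8Cache {
    fn append(&mut self, k: &Tensor, v: &Tensor) {
        self.k.push(Self::quantize(k));
        self.v.push(Self::quantize(v));
    }

    // This toy cache never fuses attention: it always returns the
    // dequantized K/V and lets the caller run SDPA itself.
    fn decode(&self, _q: &Tensor, _cfg: &AttendConfig) -> DecodeOutput {
        DecodeOutput::Dequantized(DequantResult {
            k: Self::dequantize(&self.k),
            v: Self::dequantize(&self.v),
            logit_bias: None,
        })
    }
}
```

Because the engine only calls through the trait, swapping this toy cache for a real method like PolarQuant is a registration change, not a per-model change.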
## Known Implementations
- `turboquant-rs` — 3–4 bit PolarQuant with optional QJL correction
## License
Licensed under either of Apache License, Version 2.0 or MIT License at your option.