# mistralrs-kv-cache
Trait interface for compressed KV-cache implementations in mistral.rs.
## Overview
This crate defines CompressedKVCache — the contract between inference engines and KV-cache compression libraries. Implementing this trait is the only requirement for integrating a new cache compression method.
```text
Inference Engine (mistral.rs)
          |
          v
CompressedKVCache trait      <-- this crate
          ^
          |
Compression Library (e.g. turboquant-rs)
```
## The Trait

### Key Types
- `DequantResult` — decompressed K, V tensors plus an optional `logit_bias` (e.g. a QJL correction)
- `DecodeOutput` — either `Fused(Tensor)` (the implementation computed attention) or `Dequantized(DequantResult)` (the caller runs SDPA)
- `AttendConfig` — softmax scale plus the GQA group count
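The types above might look roughly like the following sketch. The real definitions live in this crate and may differ in field names and tensor type; the `Tensor` newtype here is a placeholder standing in for the engine's actual tensor.

```rust
/// Placeholder for the engine's tensor type (hypothetical; for illustration only).
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

/// Decompressed K/V plus an optional additive logit bias (e.g. a QJL correction).
pub struct DequantResult {
    pub k: Tensor,
    pub v: Tensor,
    pub logit_bias: Option<Tensor>,
}

/// What a decode step hands back: either a finished attention output,
/// or the decompressed tensors for the caller to run SDPA on.
pub enum DecodeOutput {
    Fused(Tensor),
    Dequantized(DequantResult),
}

/// Parameters the cache needs to attend correctly.
pub struct AttendConfig {
    pub softmax_scale: f32, // typically 1 / sqrt(head_dim)
    pub gqa_groups: usize,  // query heads per KV head
}
```

The `DecodeOutput` split lets a fused-kernel implementation skip materializing full-precision K/V entirely, while simpler implementations can just decompress and defer attention to the caller.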
## Implementing a New Compression Method
- Add this crate as a dependency
- Implement `CompressedKVCache` for your cache struct
- Register it in the inference engine's cache factory
- All model implementations automatically benefit — no per-model changes are needed
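The steps above can be sketched end to end with a toy int8 absmax cache. Everything here is hypothetical: the snippet redeclares illustrative versions of the trait and its types so it is self-contained, and the method names (`append`, `decode`) are assumptions, not this crate's actual API.

```rust
// Hypothetical placeholder types (the real ones live in this crate).
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor(pub Vec<f32>);

pub struct DequantResult {
    pub k: Tensor,
    pub v: Tensor,
    pub logit_bias: Option<Tensor>,
}

pub enum DecodeOutput {
    Fused(Tensor),
    Dequantized(DequantResult),
}

pub struct AttendConfig {
    pub softmax_scale: f32,
    pub gqa_groups: usize,
}

// Hypothetical trait shape; method names are assumptions.
pub trait CompressedKVCache {
    fn append(&mut self, k: &Tensor, v: &Tensor);
    fn decode(&self, q: &Tensor, cfg: &AttendConfig) -> DecodeOutput;
}

/// Toy cache: stores K/V as i8 with one absmax scale per appended tensor.
pub struct Int8Cache {
    k: Vec<(Vec<i8>, f32)>,
    v: Vec<(Vec<i8>, f32)>,
}

impl Int8Cache {
    pub fn new() -> Self {
        Self { k: Vec::new(), v: Vec::new() }
    }

    fn quantize(t: &Tensor) -> (Vec<i8>, f32) {
        let absmax = t.0.iter().fold(0f32, |m, x| m.max(x.abs()));
        let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
        (t.0.iter().map(|x| (x / scale).round() as i8).collect(), scale)
    }

    fn dequantize(q: &[(Vec<i8>, f32)]) -> Tensor {
        Tensor(
            q.iter()
                .flat_map(|(d, s)| d.iter().map(move |&x| x as f32 * *s))
                .collect(),
        )
    }
}

impl CompressedKVCache for Int8Cache {
    fn append(&mut self, k: &Tensor, v: &Tensor) {
        self.k.push(Self::quantize(k));
        self.v.push(Self::quantize(v));
    }

    // This toy cache never fuses attention: it always returns the
    // dequantized K/V and lets the caller run SDPA itself.
    fn decode(&self, _q: &Tensor, _cfg: &AttendConfig) -> DecodeOutput {
        DecodeOutput::Dequantized(DequantResult {
            k: Self::dequantize(&self.k),
            v: Self::dequantize(&self.v),
            logit_bias: None,
        })
    }
}
```

Because the engine only calls through the trait, swapping this toy cache for a real method like PolarQuant is a registration change, not a per-model change.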
## Known Implementations
- `turboquant-rs` — 3–4 bit PolarQuant with optional QJL correction
## License
Licensed under either of Apache License, Version 2.0 or MIT License at your option.