mistralrs-kv-cache 0.3.0

mistralrs-kv-cache

Trait interface for compressed KV-cache implementations in mistral.rs.

Overview

This crate defines CompressedKVCache — the contract between inference engines and KV-cache compression libraries. Implementing this trait is the only requirement for integrating a new cache compression method.

Inference Engine (mistral.rs)
        |
        v
  CompressedKVCache trait  <-- this crate
        ^
        |
Compression Library (e.g. turboquant-rs)

The Trait

pub trait CompressedKVCache: Send + Sync {
    /// Prefill: store multiple tokens, return decompressed KV for attention.
    fn prefill(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor)
        -> Result<DequantResult>;

    /// Decode: store one token, compute or prepare attention.
    fn decode(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor,
        config: &AttendConfig) -> Result<DecodeOutput>;

    /// Tokens currently stored for a layer.
    fn seq_len(&self, layer: usize) -> usize;

    /// Reset all layers.
    fn reset(&self) -> Result<()>;

    /// Persistent memory usage in bytes.
    fn memory_usage(&self) -> usize;
}

All methods take &self. Implementations are responsible for interior synchronization (e.g. per-layer locks), so that calls for different layers can proceed in parallel. This is what enables use cases such as speculative decoding, where draft and target models run concurrently against the same cache.

Key Types

  • DequantResult — Decompressed K, V tensors + optional logit_bias (e.g. QJL correction)
  • DecodeOutput — Either Fused(Tensor) (the implementation already computed attention) or Dequantized(DequantResult) (the caller runs SDPA itself)
  • AttendConfig — Softmax scale + GQA group count
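
The two-variant DecodeOutput shape means the caller dispatches on what the implementation chose to do. A sketch of that dispatch, using stand-in versions of these types (field names and the `sdpa` helper are assumptions, not the crate's real API):

```rust
// Stand-in shapes for the crate's key types (assumed; real ones are tensor-based).
type Tensor = Vec<f32>;

struct DequantResult {
    k: Tensor,
    v: Tensor,
    logit_bias: Option<Tensor>, // e.g. a QJL correction term
}

enum DecodeOutput {
    Fused(Tensor),              // the implementation already computed attention
    Dequantized(DequantResult), // the caller must run SDPA itself
}

// Placeholder SDPA: a real engine would run scaled dot-product attention,
// adding `logit_bias` to the attention logits when present.
fn sdpa(_q: &Tensor, _k: &Tensor, _v: &Tensor, _bias: Option<&Tensor>) -> Tensor {
    vec![0.0]
}

// How a caller handles both variants.
fn attend(q: &Tensor, out: DecodeOutput) -> Tensor {
    match out {
        DecodeOutput::Fused(attn) => attn, // nothing left to do
        DecodeOutput::Dequantized(d) => sdpa(q, &d.k, &d.v, d.logit_bias.as_ref()),
    }
}

fn main() {
    let q = vec![1.0];
    let fused = attend(&q, DecodeOutput::Fused(vec![42.0]));
    assert_eq!(fused, vec![42.0]);
}
```

The Fused path lets a compression method keep its own fused attention kernel; the Dequantized path keeps simple implementations simple by reusing the engine's SDPA.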

Implementing a New Compression Method

  1. Add this crate as a dependency
  2. Implement CompressedKVCache for your cache struct
  3. Register it in the inference engine's cache factory
  4. All model implementations automatically benefit — no per-model changes needed
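
As a sketch of step 2, here is a minimal, non-compressing implementation against stand-in versions of the crate's types. Everything except the trait's method names is illustrative: a real method would quantize on write and dequantize on read, and would count tokens rather than calls.

```rust
use std::sync::Mutex;

// Stand-ins so the sketch compiles standalone; the real crate exports
// tensor-based versions of these types.
type Tensor = Vec<f32>;
type Result<T> = std::result::Result<T, String>;
struct DequantResult { k: Tensor, v: Tensor, logit_bias: Option<Tensor> }
struct AttendConfig { softmax_scale: f32, gqa_groups: usize }
enum DecodeOutput { Fused(Tensor), Dequantized(DequantResult) }

trait CompressedKVCache: Send + Sync {
    fn prefill(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor)
        -> Result<DequantResult>;
    fn decode(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor,
        config: &AttendConfig) -> Result<DecodeOutput>;
    fn seq_len(&self, layer: usize) -> usize;
    fn reset(&self) -> Result<()>;
    fn memory_usage(&self) -> usize;
}

// "Identity" cache: stores K/V uncompressed, per-layer locked.
struct IdentityCache { layers: Vec<Mutex<(Vec<f32>, Vec<f32>, usize)>> }

impl CompressedKVCache for IdentityCache {
    fn prefill(&self, layer: usize, k: &Tensor, v: &Tensor, _q: &Tensor)
        -> Result<DequantResult> {
        let mut l = self.layers[layer].lock().unwrap();
        l.0.extend_from_slice(k);
        l.1.extend_from_slice(v);
        l.2 += 1; // placeholder: counts calls, since the stand-in has no token dim
        Ok(DequantResult { k: l.0.clone(), v: l.1.clone(), logit_bias: None })
    }
    fn decode(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor,
        _config: &AttendConfig) -> Result<DecodeOutput> {
        // Defer attention to the caller by returning the dequantized KV.
        Ok(DecodeOutput::Dequantized(self.prefill(layer, k, v, q)?))
    }
    fn seq_len(&self, layer: usize) -> usize { self.layers[layer].lock().unwrap().2 }
    fn reset(&self) -> Result<()> {
        for l in &self.layers { *l.lock().unwrap() = (Vec::new(), Vec::new(), 0); }
        Ok(())
    }
    fn memory_usage(&self) -> usize {
        self.layers.iter()
            .map(|l| { let g = l.lock().unwrap(); (g.0.len() + g.1.len()) * 4 })
            .sum()
    }
}

fn main() {
    let cache = IdentityCache {
        layers: (0..2).map(|_| Mutex::new((Vec::new(), Vec::new(), 0))).collect(),
    };
    let r = cache.prefill(0, &vec![1.0, 2.0], &vec![3.0, 4.0], &vec![0.0]).unwrap();
    assert_eq!(r.k, vec![1.0, 2.0]);
    assert_eq!(cache.seq_len(0), 1);
    assert_eq!(cache.memory_usage(), 16); // 4 stored f32s * 4 bytes
}
```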

Known Implementations

  • turboquant-rs — 3-4 bit PolarQuant with optional QJL correction

License

Licensed under either the Apache License, Version 2.0 or the MIT License, at your option.