mistralrs-kv-cache 0.3.0

# mistralrs-kv-cache

Trait interface for compressed KV-cache implementations in [mistral.rs](https://github.com/EricLBuehler/mistral.rs).

## Overview

This crate defines `CompressedKVCache`, the contract between inference engines and KV-cache compression libraries. To integrate a new cache compression method, implement this trait and register the implementation with the engine's cache factory; no per-model changes are required.

```text
Inference Engine (mistral.rs)
        |
        v
  CompressedKVCache trait  <-- this crate
        ^
        |
Compression Library (e.g. turboquant-rs)
```

## The Trait

```rust
pub trait CompressedKVCache: Send + Sync {
    /// Prefill: store multiple tokens, return decompressed KV for attention.
    fn prefill(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor)
        -> Result<DequantResult>;

    /// Decode: store one token, compute or prepare attention.
    fn decode(&self, layer: usize, k: &Tensor, v: &Tensor, q: &Tensor,
        config: &AttendConfig) -> Result<DecodeOutput>;

    /// Tokens currently stored for a layer.
    fn seq_len(&self, layer: usize) -> usize;

    /// Reset all layers.
    fn reset(&self) -> Result<()>;

    /// Persistent memory usage in bytes.
    fn memory_usage(&self) -> usize;
}
```

All methods take `&self`. Implementations are responsible for interior
synchronization (e.g. per-layer locks) so that calls for different layers
may proceed in parallel — this is what enables use cases like speculative
decoding where draft and target models run concurrently.
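The per-layer locking idea can be sketched with the standard library alone. This is a minimal illustration, not the layout any real implementation uses: `PerLayerCache`, `append`, and `fill_in_parallel` are hypothetical names, and a `Vec<f32>` stands in for a compressed per-layer store.

```rust
use std::sync::Mutex;
use std::thread;

/// Toy cache with one lock per layer, so calls for different layers
/// never contend. `Vec<f32>` is a stand-in for compressed storage.
pub struct PerLayerCache {
    layers: Vec<Mutex<Vec<f32>>>,
}

impl PerLayerCache {
    pub fn new(num_layers: usize) -> Self {
        Self {
            layers: (0..num_layers).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    /// Append values for one layer. Takes `&self` and locks only that layer.
    pub fn append(&self, layer: usize, values: &[f32]) {
        let mut store = self.layers[layer].lock().unwrap();
        store.extend_from_slice(values);
    }

    /// Number of values currently stored for a layer.
    pub fn seq_len(&self, layer: usize) -> usize {
        self.layers[layer].lock().unwrap().len()
    }

    /// Clear every layer, taking each lock in turn.
    pub fn reset(&self) {
        for layer in &self.layers {
            layer.lock().unwrap().clear();
        }
    }
}

/// Write to every layer from its own thread through a shared reference.
pub fn fill_in_parallel(cache: &PerLayerCache, num_layers: usize, tokens: usize) {
    thread::scope(|s| {
        for layer in 0..num_layers {
            s.spawn(move || {
                for t in 0..tokens {
                    cache.append(layer, &[t as f32]);
                }
            });
        }
    });
}
```

Because every method takes `&self`, the cache can be shared across threads without an outer lock; a single global `Mutex` would also satisfy `Send + Sync` but would serialize all layers.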

## Key Types

- **`DequantResult`** — Decompressed K, V tensors + optional `logit_bias` (e.g. QJL correction)
- **`DecodeOutput`** — Either `Fused(Tensor)` (implementation computed attention) or `Dequantized(DequantResult)` (caller runs SDPA)
- **`AttendConfig`** — Softmax scale + GQA group count
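As a rough illustration of how a caller might consume these types, the sketch below uses local stand-ins rather than the crate's real definitions. The `Tensor` alias, the field names `k`, `v`, and `softmax_scale`, and the function `finish_attention` are assumptions made for the example; it also assumes `logit_bias` is added to the pre-softmax scores, and it handles only a single query and a single head (no GQA grouping).

```rust
/// Stand-in for the engine's tensor type (hypothetical): a flat f32 vector.
pub type Tensor = Vec<f32>;

/// Stand-in for `DequantResult`. Field names `k` and `v` are assumptions.
pub struct DequantResult {
    pub k: Vec<Tensor>,               // one key vector per cached token
    pub v: Vec<Tensor>,               // one value vector per cached token
    pub logit_bias: Option<Vec<f32>>, // optional per-token additive correction
}

pub enum DecodeOutput {
    Fused(Tensor),
    Dequantized(DequantResult),
}

/// Stand-in for `AttendConfig`; the GQA group count is omitted here.
pub struct AttendConfig {
    pub softmax_scale: f32,
}

/// Return the attention output, running plain softmax attention only
/// when the cache did not already compute it.
pub fn finish_attention(out: DecodeOutput, q: &Tensor, cfg: &AttendConfig) -> Tensor {
    match out {
        // The implementation already computed attention: pass it through.
        DecodeOutput::Fused(attn) => attn,
        DecodeOutput::Dequantized(d) => {
            // scores[i] = scale * dot(q, k_i)
            let mut scores: Vec<f32> = d
                .k
                .iter()
                .map(|k| cfg.softmax_scale * q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>())
                .collect();
            // Assumed semantics: the bias is added before the softmax.
            if let Some(bias) = &d.logit_bias {
                for (s, b) in scores.iter_mut().zip(bias) {
                    *s += b;
                }
            }
            // Numerically stable softmax.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            // Weighted sum of the value vectors.
            let dim = d.v.first().map_or(0, |v| v.len());
            let mut acc = vec![0.0f32; dim];
            for (w, v) in exps.iter().zip(&d.v) {
                for (o, x) in acc.iter_mut().zip(v) {
                    *o += (w / sum) * x;
                }
            }
            acc
        }
    }
}
```

The two-variant return lets an implementation with a fused kernel skip dequantization entirely, while simpler implementations hand back tensors and leave SDPA to the engine.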

## Implementing a New Compression Method

1. Add this crate as a dependency
2. Implement `CompressedKVCache` for your cache struct
3. Register it in the inference engine's cache factory
4. All model implementations automatically benefit — no per-model changes needed
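The steps above can be sketched as follows. To keep the example free of an engine dependency, it uses a reduced stand-in trait (`KvCacheLike`) instead of the real `CompressedKVCache`, with a single `store` method in place of `prefill` and `decode`. `ByteCache`, `make_cache`, and the `"byte"` key are hypothetical; the actual factory in mistral.rs may look quite different.

```rust
use std::sync::{Arc, Mutex};

/// Reduced stand-in for `CompressedKVCache` (tensor methods replaced by
/// a simple `store`) so the sketch compiles on its own.
pub trait KvCacheLike: Send + Sync {
    fn store(&self, layer: usize, values: &[f32]);
    fn seq_len(&self, layer: usize) -> usize;
    fn reset(&self);
    fn memory_usage(&self) -> usize;
}

/// Hypothetical compression method: each value in [-1, 1] becomes one byte.
pub struct ByteCache {
    layers: Vec<Mutex<Vec<u8>>>,
}

impl ByteCache {
    pub fn new(num_layers: usize) -> Self {
        Self {
            layers: (0..num_layers).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }
}

impl KvCacheLike for ByteCache {
    fn store(&self, layer: usize, values: &[f32]) {
        let mut bytes = self.layers[layer].lock().unwrap();
        // Clamp to [-1, 1], then map linearly onto 0..=255.
        bytes.extend(
            values
                .iter()
                .map(|&x| ((x.clamp(-1.0, 1.0) + 1.0) * 127.5).round() as u8),
        );
    }

    fn seq_len(&self, layer: usize) -> usize {
        self.layers[layer].lock().unwrap().len()
    }

    fn reset(&self) {
        for layer in &self.layers {
            layer.lock().unwrap().clear();
        }
    }

    fn memory_usage(&self) -> usize {
        // One byte per stored value.
        self.layers.iter().map(|l| l.lock().unwrap().len()).sum()
    }
}

/// Hypothetical cache factory: model code only ever sees the trait object.
pub fn make_cache(method: &str, num_layers: usize) -> Option<Arc<dyn KvCacheLike>> {
    match method {
        "byte" => Some(Arc::new(ByteCache::new(num_layers))),
        _ => None,
    }
}
```

Since callers hold only an `Arc<dyn KvCacheLike>`, adding another method means adding one struct and one factory arm; code that uses the cache is unchanged.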

## Known Implementations

- [turboquant-rs](https://github.com/SaschaOnTour/turboquant): 3-4 bit PolarQuant with optional QJL correction

## License

Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or [MIT License](LICENSE-MIT) at your option.