//! Trait interface for compressed KV-cache implementations.
//!
//! This crate defines [`CompressedKVCache`] — the contract between inference
//! engines (like mistral.rs) and KV-cache compression libraries (like turboquant).
//!
//! Implementing this trait is the only code a new cache compression method
//! has to provide; on the engine side, one match arm in the cache factory
//! selects it. The inference engine calls `prefill()` during
//! multi-token processing and `decode()` during single-token generation.
//! All compression decisions (fused vs dequantized, lazy vs immediate,
//! CPU vs GPU) are made internally by the implementation.
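//!
//! A minimal sketch of the prefill/decode split described above. The trait
//! `PhaseCache`, the struct `CountingCache`, and the helper `step` are
//! hypothetical names written for this example, not the signatures of
//! [`CompressedKVCache`]; dispatching on the number of new tokens is likewise
//! an assumption about how an engine might choose the phase.
//!
//! ```rust
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! // Hypothetical two-phase interface (illustration only).
//! pub trait PhaseCache {
//!     fn prefill(&self, tokens: &[u32]);
//!     fn decode(&self, token: u32);
//! }
//!
//! // Toy cache that only counts how often each phase was entered.
//! #[derive(Default)]
//! pub struct CountingCache {
//!     pub prefill_calls: AtomicUsize,
//!     pub decode_calls: AtomicUsize,
//! }
//!
//! impl PhaseCache for CountingCache {
//!     fn prefill(&self, _tokens: &[u32]) {
//!         self.prefill_calls.fetch_add(1, Ordering::Relaxed);
//!     }
//!     fn decode(&self, _token: u32) {
//!         self.decode_calls.fetch_add(1, Ordering::Relaxed);
//!     }
//! }
//!
//! // Engine-side dispatch: one new token goes to decode, several to prefill.
//! pub fn step(cache: &dyn PhaseCache, new_tokens: &[u32]) {
//!     match new_tokens {
//!         [] => {}
//!         [single] => cache.decode(*single),
//!         many => cache.prefill(many),
//!     }
//! }
//! ```
//!
//! Calling `step` with a three-token prompt and then with one generated token
//! would record one prefill call and one decode call in this toy cache.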
pub use ;
/// Result of decompressing cached KV data.
///
/// Contains the full (decompressed) key and value tensors for all cached
/// tokens, plus an optional bias to be added to attention logits before
/// softmax (e.g. QJL correction in TurboQuant).

/// Result of a decode step.
///
/// The cache implementation decides internally whether to compute attention
/// via a fused kernel or return decompressed data for the caller's SDPA.

/// Configuration for attention computation during decode.
///
/// Passed to [`CompressedKVCache::decode`]. Extensible without breaking
/// the trait signature — new fields can be added here.

/// Trait for compressed KV-cache implementations.
///
/// The trait has one method for each phase of LLM inference:
/// - [`prefill`](Self::prefill): stores multiple tokens and returns the decompressed KV for Flash Attention.
/// - [`decode`](Self::decode): stores a single token, then either computes attention itself or prepares the data for the caller.
///
/// The implementation makes **all** internal decisions:
/// - Fused kernel vs full decompression (based on device, kernel availability)
/// - Immediate vs deferred compression during prefill
/// - QJL bias computation (only when needed)
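///
/// The two decode outcomes can be sketched as follows. This is a toy model
/// under stated assumptions: `DecodeOutcome` and `attend` are hypothetical
/// names, keys and values are scalars (head dimension 1), and the real result
/// types of this crate may look different. It shows a caller that either
/// passes a fused result through or runs its own attention, adding the
/// optional bias to the logits before softmax.
///
/// ```rust
/// // Hypothetical shape of a decode result (illustration only).
/// pub enum DecodeOutcome {
///     // A fused kernel already produced the attention output.
///     Fused(Vec<f32>),
///     // Decompressed scalar keys/values plus an optional logit bias.
///     Decompressed { keys: Vec<f32>, values: Vec<f32>, bias: Option<Vec<f32>> },
/// }
///
/// // Caller-side attention for one scalar query over all cached tokens.
/// pub fn attend(query: f32, outcome: DecodeOutcome) -> Vec<f32> {
///     match outcome {
///         DecodeOutcome::Fused(out) => out,
///         DecodeOutcome::Decompressed { keys, values, bias } => {
///             // logits[i] = q * k[i], plus bias[i] when a bias is present.
///             let logits: Vec<f32> = keys
///                 .iter()
///                 .enumerate()
///                 .map(|(i, k)| query * k + bias.as_ref().map_or(0.0, |b| b[i]))
///                 .collect();
///             // Numerically stable softmax over the biased logits.
///             let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
///             let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
///             let sum: f32 = exps.iter().sum();
///             let out: f32 = exps.iter().zip(&values).map(|(e, v)| e / sum * v).sum();
///             vec![out]
///         }
///     }
/// }
/// ```
///
/// With equal logits and no bias, `attend` averages the values; a large bias
/// on one position shifts nearly all of the weight to that position.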
///
/// # Synchronization
///
/// All methods take `&self`. Implementations are responsible for interior
/// synchronization (e.g. per-layer locks) so that calls for different layers
/// may proceed in parallel. This enables use cases like speculative decoding
/// where draft and target models run concurrently.
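///
/// One way to meet this requirement is a lock per layer, sketched below with
/// the standard library only. `PerLayerCache` and `fill_in_parallel` are
/// hypothetical names for this example, and a real implementation would hold
/// compressed tensors rather than a `Vec<f32>`.
///
/// ```rust
/// use std::sync::Mutex;
///
/// // Hypothetical cache with one lock per layer (illustration only).
/// pub struct PerLayerCache {
///     layers: Vec<Mutex<Vec<f32>>>,
/// }
///
/// impl PerLayerCache {
///     pub fn new(num_layers: usize) -> Self {
///         Self { layers: (0..num_layers).map(|_| Mutex::new(Vec::new())).collect() }
///     }
///
///     // Takes `&self`: only the lock of `layer` is held while appending.
///     pub fn append(&self, layer: usize, value: f32) {
///         self.layers[layer].lock().unwrap().push(value);
///     }
///
///     pub fn len(&self, layer: usize) -> usize {
///         self.layers[layer].lock().unwrap().len()
///     }
/// }
///
/// // Appends `n` values to every layer, using one scoped thread per layer.
/// pub fn fill_in_parallel(cache: &PerLayerCache, n: usize) {
///     std::thread::scope(|s| {
///         for layer in 0..cache.layers.len() {
///             s.spawn(move || {
///                 for i in 0..n {
///                     cache.append(layer, i as f32);
///                 }
///             });
///         }
///     });
/// }
/// ```
///
/// Because each thread touches a different layer, no thread waits on another
/// thread's lock; contention arises only when two callers target the same layer.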
///
/// # Adding a new compression method
///
/// 1. Implement this trait for your cache struct
/// 2. Add a match arm in the inference engine's cache factory
/// 3. Done — no model code changes needed
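///
/// The steps above can be sketched as follows. Everything here is a
/// hypothetical stand-in: `SketchCache` replaces this trait with a single
/// method, and `CacheKind` and `build_cache` are assumed names for the
/// engine's selector and factory, whose real shape is not shown in this crate.
///
/// ```rust
/// // Hypothetical stand-in for the cache trait (illustration only).
/// pub trait SketchCache {
///     fn name(&self) -> &'static str;
/// }
///
/// // An existing method already known to the factory.
/// pub struct NoCompression;
/// impl SketchCache for NoCompression {
///     fn name(&self) -> &'static str {
///         "none"
///     }
/// }
///
/// // Step 1: implement the trait for the new cache struct.
/// pub struct MyNewMethod;
/// impl SketchCache for MyNewMethod {
///     fn name(&self) -> &'static str {
///         "my-new-method"
///     }
/// }
///
/// // Hypothetical selector that the engine's cache factory matches on.
/// pub enum CacheKind {
///     None,
///     MyNewMethod,
/// }
///
/// pub fn build_cache(kind: CacheKind) -> Box<dyn SketchCache> {
///     match kind {
///         CacheKind::None => Box::new(NoCompression),
///         // Step 2: the one new match arm in the factory.
///         CacheKind::MyNewMethod => Box::new(MyNewMethod),
///     }
/// }
/// ```
///
/// Model code only ever sees the boxed trait object, which is why it does not
/// change when a new variant is added.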