# turboquant
Rust implementation of Google's TurboQuant algorithm (Zandieh et al., ICLR 2026) for extreme KV-cache compression in LLM inference.
## What is TurboQuant?
TurboQuant compresses the key-value cache of large language models to 3-4 bits per value with zero accuracy loss. It is training-free and data-oblivious -- no calibration data required. This reduces KV-cache memory by up to 5x while maintaining full model quality.
The algorithm combines two stages:
- PolarQuant (Stage 1): Random rotation (Walsh-Hadamard Transform) + optimal scalar quantization via pre-computed Lloyd-Max codebooks (see the sketch after this list)
- QJL (Stage 2): 1-bit bias correction via Quantized Johnson-Lindenstrauss projection, ensuring unbiased inner product estimates for attention computation
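For intuition, here is a minimal, self-contained sketch of the core of the Stage-1 rotation: a normalized in-place fast Walsh-Hadamard butterfly. This is an illustration, not the crate's code; in the randomized-Hadamard construction a per-coordinate random sign flip typically precedes the transform to randomize the rotation, and it is omitted here.

```rust
/// Normalized in-place fast Walsh-Hadamard transform (illustration only).
/// Requires a power-of-two dimension; scaling by 1/sqrt(d) makes it orthonormal.
fn walsh_hadamard_in_place(x: &mut [f32]) {
    let d = x.len();
    assert!(d.is_power_of_two(), "WHT requires a power-of-two dimension");
    let mut h = 1;
    while h < d {
        // Butterfly stage: combine pairs at distance h.
        for i in (0..d).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (d as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}
```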
## Project Status
| Metric | Value |
|---|---|
| Quality Score | 100.0% (rustqual) |
| Tests | 327 |
| Functions | ~343 |
| Dependencies | half + thiserror (2 total) |
## Quick Start
```rust
use turboquant::packed::{PackedBlock, TurboQuantConfig};
use turboquant::packed::{dequantize_vec, quantize_vec};

// NOTE: import paths and argument order are reconstructed from the API
// overview below; the seed and input values are illustrative.

// Configure for 3-bit quantization, head dimension 128
let config = TurboQuantConfig::new(3, 128).unwrap().with_seed(42);

// Quantize a key vector
let key = vec![0.1_f32; 128];
let packed: PackedBlock = quantize_vec(&key, &config).unwrap();

// Dequantize back
let recovered: Vec<f32> = dequantize_vec(&packed, &config).unwrap();
```
See examples/ for runnable demos.
## API Overview
| Type | Module | Description |
|---|---|---|
| `TurboQuantConfig` | `packed` | Configuration: bit-width (2/3/4), head dimension, seed |
| `PackedBlock` | `packed` | Bit-packed quantized block (unified format for all bit-widths) |
| `QuantizedKVCache` | `attention` | High-level cache with attention score computation |
| `QjlBlock` | `qjl` | QJL-enhanced quantized block for unbiased inner products |
| `EstimationContext` | `qjl` | Pre-fetched context for efficient batch attention scoring |
## Compression & Accuracy
| Variant | Bits/Value | Compression vs FP16 | Normalized MSE | Paper Target |
|---|---|---|---|---|
| TQ3 (3-bit) | 3 | 4.9x | ~0.034 | 0.034 |
| TQ4 (4-bit) | 4 | ~3.5x | ~0.009 | 0.009 |
Normalized MSE is measured over 10,000 random vectors at d=128 and matches the values reported in the paper.
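For reference, "normalized MSE" here means reconstruction error relative to the vector's own energy. The helper below is a sketch of how such a measurement could be scripted (one common definition; the exact normalization used for the table is defined by the crate's benchmark, not by this snippet):

```rust
/// Squared reconstruction error divided by the squared norm of the original
/// vector. Averaging this over many random vectors (the table above uses
/// 10,000 at d=128) yields the reported normalized MSE.
fn normalized_mse(original: &[f32], recovered: &[f32]) -> f32 {
    assert_eq!(original.len(), recovered.len());
    let err: f32 = original
        .iter()
        .zip(recovered)
        .map(|(a, b)| (a - b) * (a - b))
        .sum();
    let energy: f32 = original.iter().map(|a| a * a).sum();
    err / energy
}
```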
## Performance (d=128, release build)
| Operation | Latency |
|---|---|
| PolarQuant quantize | ~1.1 us |
| PolarQuant dequantize | ~0.8 us |
| QJL quantize | ~19 us |
| QJL inner product (batch, per key) | ~1.1 us |
| Attention over 1024 keys | ~1.1 ms |
| Estimated 100k context / 32 layers | ~3.5 s |
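The 100k-context estimate is just the per-key cost scaled up: 100,000 keys x ~1.1 us per key x 32 layers ≈ 3.5 s.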
## mistral.rs Integration
turboquant integrates transparently into mistral.rs as a KV-cache quantization backend. All models are supported.
```bash
# Run any model with TurboQuant TQ3 KV-cache compression
```
### Integration Benchmarks (CPU-only, Qwen3-0.6B, 512 prompt + 128 decode)
| Metric | Normal | TQ3 | Overhead |
|---|---|---|---|
| Prefill | 129.7 tok/s | 130.0 tok/s | ~0% |
| Decode | 10.5 tok/s | 8.4 tok/s | ~20% (amortized, includes one-time flush) |
| KV-Cache Memory | 1x | ~4.9x compression | - |
The decode overhead is amortized over 128 tokens and includes a one-time lazy quantization flush. A future GPU kernel implementation (Approach B) would eliminate this overhead entirely. See Approach B Roadmap.
### Optimizations
The following optimizations were implemented to achieve near-zero overhead:
- Delta dequantization: Avoids O(N^2) redundant work by only dequantizing newly added heads (see the sketch after this list)
- Pre-allocated GPU tensor buffer: Uses `slice_set`/`narrow` for O(1) per-step tensor updates instead of creating new tensors
- Lazy quantization: Defers quantization from prefill to the first decode step, keeping prefill at full speed
- Parallel head processing: Uses rayon for multi-threaded quantization/dequantization across attention heads
- Batch quantize: Shares codebook and sign_pattern setup across heads in a batch
- Zero-copy tensor data extraction: Extracts tensor data without unnecessary allocations
- Reusable Vec buffers: Pre-allocated buffers reused across decode steps to avoid repeated allocation
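As an illustration of the delta-dequantization item above, here is a hypothetical sketch (the struct and helper are not part of the crate; the crate call signatures are assumed to match the Quick Start): keep a cursor into the quantized cache and expand only the blocks appended since the previous decode step.

```rust
use turboquant::packed::{dequantize_vec, PackedBlock, TurboQuantConfig};

/// Hypothetical helper showing delta dequantization.
struct DeltaCache {
    expanded: Vec<Vec<f32>>, // vectors already dequantized on earlier steps
    cursor: usize,           // how many cached blocks have been expanded
}

impl DeltaCache {
    /// Expand only blocks appended since the last decode step, instead of
    /// re-dequantizing the whole cache every step (avoids O(N^2) work).
    fn sync(&mut self, blocks: &[PackedBlock], cfg: &TurboQuantConfig) {
        for block in &blocks[self.cursor..] {
            self.expanded.push(dequantize_vec(block, cfg).unwrap());
        }
        self.cursor = blocks.len();
    }
}
```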
## Improvements over llama.cpp TurboQuant (tq3_0)
This implementation differs from the llama.cpp tq3_0 branch in several important ways:
### 1. QJL Bias Correction (mandatory, not omitted)
llama.cpp tq3_0 implements only PolarQuant (Stage 1) and omits QJL entirely. Without QJL, inner product estimates carry a systematic multiplicative bias of 2/pi that accumulates across all keys in the softmax during attention. This bias is not visible in short-context benchmarks but degrades quality at long contexts (8k+ tokens), which is the primary use case for KV-cache compression.
Our implementation includes the full TURBOQUANT-prod algorithm (Algorithm 2 from the paper) with QJL bias correction, guaranteeing E[<y,x>_est] = <y,x> (mathematically unbiased).
### 2. Dimension-Specific Codebooks (exact Beta distribution)
| Aspect | llama.cpp tq3_0 | turboquant (this crate) |
|---|---|---|
| Distribution | Gaussian N(0,1) approximation | Exact Beta distribution per dimension |
| Codebooks | Single fixed set for all dimensions | Pre-computed per (bits, dim) pair |
| 3-bit centroids (d=128) | [-2.15, -1.34, -0.76, -0.25, +0.25, +0.76, +1.34, +2.15] | [-0.189, -0.118, -0.067, -0.022, +0.022, +0.067, +0.118, +0.189] |
| Relationship | Centroids for normalized coordinates | llama_centroid ~= ours * sqrt(d) |
After rotation, the coordinates of a d-dimensional unit vector follow a Beta-type distribution on [-1, 1], not a Gaussian. The Gaussian is the limiting distribution as d approaches infinity. For practical dimensions (d=64-256), the Beta distribution is a better fit. Our codebooks are optimal Lloyd-Max quantizers for the exact distribution, yielding slightly lower MSE.
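As a quick numerical check of the "Relationship" row above: scaling our normalized-coordinate centroids by sqrt(128) ≈ 11.31 reproduces the llama.cpp values to within about 0.01.

```rust
// Check the "Relationship" row: llama_centroid ~= ours * sqrt(d) at d = 128.
fn main() {
    let ours = [0.022_f64, 0.067, 0.118, 0.189]; // positive half of our 3-bit codebook
    let llama = [0.25_f64, 0.76, 1.34, 2.15];    // positive half of llama.cpp's centroids
    let scale = (128.0_f64).sqrt();              // ~11.31
    for (o, l) in ours.iter().zip(llama.iter()) {
        println!("{o:.3} * sqrt(128) = {:.3}   (llama.cpp: {l:.2})", o * scale);
    }
}
```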
### 3. Flexible Block Sizes
llama.cpp uses a fixed block size of 32 values. Our implementation supports variable dimensions (64, 128, 256) matching common LLM head dimensions directly, avoiding padding waste.
### 4. Hash-Based Rademacher (no crypto RNG needed)
The QJL projection matrix uses deterministic hash-based sign generation instead of requiring a cryptographic RNG, making it fast and reproducible across platforms.
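A minimal sketch of the idea (the mixing function and constants below are illustrative, not the crate's actual hash): derive each ±1 entry of the projection deterministically from the seed and the entry's coordinates, so the same seed reproduces the same matrix on any platform without a cryptographic RNG.

```rust
/// Illustrative only: deterministic ±1 sign from (seed, row, col) using a
/// splitmix64-style integer mix. The crate's actual hash may differ.
fn rademacher_sign(seed: u64, row: u64, col: u64) -> f32 {
    let mut x = seed
        ^ row.wrapping_mul(0x9E3779B97F4A7C15)
        ^ col.wrapping_mul(0xD1B54A32D192ED03);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
    x ^= x >> 31;
    if x & 1 == 0 { 1.0 } else { -1.0 }
}
```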
### 5. Bit-Packing Compatible with llama.cpp
The 3-bit packing layout is identical to llama.cpp tq3_0 (8 indices into 3 bytes, same byte order), ensuring potential interoperability at the data level.
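For illustration, here is one way to pack 8 three-bit indices into 3 bytes; the actual tq3_0-compatible bit and byte order is defined by llama.cpp, so treat the layout below as a sketch of the density (8 indices -> 24 bits -> 3 bytes) only.

```rust
/// Illustrative packing of 8 three-bit indices (values 0..8) into 3 bytes.
fn pack8_3bit(idx: [u8; 8]) -> [u8; 3] {
    let mut bits: u32 = 0;
    for (i, &v) in idx.iter().enumerate() {
        debug_assert!(v < 8);
        bits |= (v as u32 & 0b111) << (3 * i);
    }
    [
        (bits & 0xFF) as u8,
        ((bits >> 8) & 0xFF) as u8,
        ((bits >> 16) & 0xFF) as u8,
    ]
}
```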
## Installation
```toml
[dependencies]
turboquant = { git = "https://github.com/nicosql/turboquant.git" }
```
### Building with Native CPU Optimizations
The crate is configured to use native CPU features (AVX2, FMA, etc.) automatically via .cargo/config.toml; build in release mode for maximum performance.
## Running Examples
## References
- Paper: TurboQuant (Zandieh et al., ICLR 2026)
- Blog: Google Research Blog
- llama.cpp reference: tq3_0 branch
## License
Licensed under either of Apache License, Version 2.0 or MIT license, at your option.