turboquant

Rust implementation of Google's TurboQuant algorithm (Zandieh et al., ICLR 2026) for extreme KV-cache compression in LLM inference.

What is TurboQuant?

TurboQuant compresses the key-value cache of large language models to 3-4 bits per value with zero accuracy loss. It is training-free and data-oblivious -- no calibration data required. This reduces KV-cache memory by up to 5x while maintaining full model quality.

The algorithm combines two stages:

  • PolarQuant (Stage 1): Random rotation (Walsh-Hadamard Transform) + optimal scalar quantization via pre-computed Lloyd-Max codebooks (the rotation step is sketched below)
  • QJL (Stage 2): 1-bit bias correction via Quantized Johnson-Lindenstrauss projection, ensuring unbiased inner product estimates for attention computation
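
Stage 1's random rotation is a sign-flipped Walsh-Hadamard transform. The standalone sketch below illustrates that rotation step only; it is not the crate's internal code, and in the crate the sign pattern is derived from the configured seed rather than hard-coded as here.

// Illustrative Stage 1 rotation: Rademacher sign flips followed by an
// in-place fast Walsh-Hadamard transform, rescaled so the rotation is orthonormal.
fn fwht_in_place(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "the WHT needs a power-of-two length");
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
}

fn randomized_rotation(x: &mut [f32], signs: &[f32]) {
    for (v, s) in x.iter_mut().zip(signs) {
        *v *= *s; // per-coordinate +1/-1 sign flip
    }
    fwht_in_place(x);
    let scale = 1.0 / (x.len() as f32).sqrt(); // orthonormal scaling
    for v in x.iter_mut() {
        *v *= scale;
    }
}

fn main() {
    let signs: Vec<f32> = (0..8).map(|i| if i % 3 == 0 { -1.0 } else { 1.0 }).collect();
    let mut v = vec![1.0_f32, 0.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5];
    let before: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    randomized_rotation(&mut v, &signs);
    let after: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    println!("norm before = {before:.4}, after = {after:.4}"); // equal up to rounding
}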

Project Status

| Metric | Value |
|---|---|
| Quality Score | 100.0% (rustqual) |
| Tests | 327 |
| Functions | ~343 |
| Dependencies | half + thiserror (2 total) |

Quick Start

use turboquant::packed::{TurboQuantConfig, PackedBlock};
use turboquant::quantize::{quantize_vec, dequantize_vec};

// Configure for 3-bit quantization, head dimension 128
let config = TurboQuantConfig::new(3, 128).unwrap().with_seed(42);

// A key vector of length 128 (placeholder values; use a real attention key in practice)
let key_vec: Vec<f32> = vec![0.1_f32; 128];

// Quantize the key vector
let packed: PackedBlock = quantize_vec(&config, &key_vec).unwrap();

// Dequantize back
let recovered: Vec<f32> = dequantize_vec(&config, &packed).unwrap();

See examples/ for runnable demos.

API Overview

| Type | Module | Description |
|---|---|---|
| TurboQuantConfig | packed | Configuration: bit-width (2/3/4), head dimension, seed |
| PackedBlock | packed | Bit-packed quantized block (unified format for all bit-widths) |
| QuantizedKVCache | attention | High-level cache with attention score computation |
| QjlBlock | qjl | QJL-enhanced quantized block for unbiased inner products |
| EstimationContext | qjl | Pre-fetched context for efficient batch attention scoring |

Compression & Accuracy

| Variant | Bits/Value | Compression vs FP16 | Normalized MSE | Paper Target |
|---|---|---|---|---|
| TQ3 (3-bit) | 3 | 4.9x | ~0.034 | 0.034 |
| TQ4 (4-bit) | 4 | ~3.5x | ~0.009 | 0.009 |

MSE measured over 10,000 random vectors at d=128, matching paper values exactly.
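
A measurement of this kind can be reproduced approximately with the public API shown in Quick Start. The sketch below is illustrative only: the xorshift generator, the input distribution, and the reduced trial count are stand-ins, not the benchmark harness actually used.

use turboquant::packed::TurboQuantConfig;
use turboquant::quantize::{dequantize_vec, quantize_vec};

// Tiny xorshift PRNG so the sketch needs no extra dependencies; maps to roughly [-1, 1).
fn random_f32(state: &mut u64) -> f32 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    ((*state >> 40) as f32 / (1u64 << 23) as f32) - 1.0
}

fn main() {
    let config = TurboQuantConfig::new(3, 128).unwrap().with_seed(42);
    let mut rng = 0x1234_5678_9abc_def0_u64;
    let trials = 1_000; // the README figure uses 10,000
    let mut total = 0.0_f64;

    for _ in 0..trials {
        let x: Vec<f32> = (0..128).map(|_| random_f32(&mut rng)).collect();
        let packed = quantize_vec(&config, &x).unwrap();
        let x_hat = dequantize_vec(&config, &packed).unwrap();

        let err: f32 = x.iter().zip(&x_hat).map(|(a, b)| (a - b) * (a - b)).sum();
        let norm: f32 = x.iter().map(|a| a * a).sum();
        total += (err / norm) as f64; // per-vector normalized squared error
    }
    println!("mean normalized MSE over {trials} vectors: {:.4}", total / trials as f64);
}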

Performance (d=128, release build)

| Operation | Latency |
|---|---|
| PolarQuant quantize | ~1.1 µs |
| PolarQuant dequantize | ~0.8 µs |
| QJL quantize | ~19 µs |
| QJL inner product (batch, per key) | ~1.1 µs |
| Attention over 1024 keys | ~1.1 ms |
| Estimated 100k context / 32 layers | ~3.5 s |

The last row extrapolates the ~1.1 µs per-key attention cost: 100,000 keys × 32 layers × 1.1 µs ≈ 3.5 s.

mistral.rs Integration

turboquant integrates transparently into mistral.rs as a KV-cache quantization backend. All models are supported.

# Run any model with TurboQuant TQ3 KV-cache compression
mistralrs run --pa-cache-type tq3 -m Qwen/Qwen3-0.6B
mistralrs run --pa-cache-type tq4 -m mistralai/Mistral-7B-Instruct-v0.3

Integration Benchmarks (CPU-only, Qwen3-0.6B, 512 prompt + 128 decode)

| | Normal | TQ3 | Overhead |
|---|---|---|---|
| Prefill | 129.7 tok/s | 130.0 tok/s | ~0% |
| Decode | 10.5 tok/s | 8.4 tok/s | ~20% (amortized, includes one-time flush) |
| KV-cache memory | 1x | ~4.9x compression | |

The decode overhead is amortized over 128 tokens and includes a one-time lazy quantization flush. A future GPU kernel implementation (Approach B) would eliminate this overhead entirely. See Approach B Roadmap.

Optimizations

The following optimizations were implemented to achieve near-zero overhead:

  • Delta dequantization: Avoids O(N^2) redundant work by only dequantizing newly added heads (see the sketch after this list)
  • Pre-allocated GPU tensor buffer: Uses slice_set/narrow for O(1) per-step tensor updates instead of creating new tensors
  • Lazy quantization: Defers quantization from prefill to first decode step, keeping prefill at full speed
  • Parallel head processing: Uses rayon for multi-threaded quantization/dequantization across attention heads
  • Batch quantize: Shares codebook and sign_pattern setup across heads in a batch
  • Zero-copy tensor data extraction: Extracts tensor data without unnecessary allocations
  • Reusable Vec buffers: Pre-allocated buffers reused across decode steps to avoid repeated allocation
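
As an illustration of the delta-dequantization item above, the sketch below keeps packed blocks plus the already-decoded prefix and only dequantizes blocks appended since the previous step. The DeltaCache type and its methods are hypothetical and simplified; they are not the actual mistral.rs integration code.

use turboquant::packed::{PackedBlock, TurboQuantConfig};
use turboquant::quantize::{dequantize_vec, quantize_vec};

// Hypothetical cache that decodes incrementally instead of re-decoding
// the whole history on every decode step (which would be O(N^2) overall).
struct DeltaCache {
    config: TurboQuantConfig,
    packed: Vec<PackedBlock>, // one packed block per cached key vector
    decoded: Vec<Vec<f32>>,   // dequantized prefix, grown incrementally
}

impl DeltaCache {
    fn new(config: TurboQuantConfig) -> Self {
        Self { config, packed: Vec::new(), decoded: Vec::new() }
    }

    fn push(&mut self, block: PackedBlock) {
        self.packed.push(block);
    }

    // Dequantize only the blocks added since the last call.
    fn materialize_new(&mut self) {
        for block in &self.packed[self.decoded.len()..] {
            self.decoded.push(dequantize_vec(&self.config, block).unwrap());
        }
    }

    fn all_keys(&self) -> &[Vec<f32>] {
        &self.decoded
    }
}

fn main() {
    let config = TurboQuantConfig::new(3, 128).unwrap().with_seed(42);
    let mut cache = DeltaCache::new(config);

    // Imagine one new key arriving per decode step.
    for step in 0..3 {
        let key = vec![0.01 * step as f32; 128];
        let block = quantize_vec(&cache.config, &key).unwrap();
        cache.push(block);
        cache.materialize_new(); // decodes only the newly appended block
    }
    println!("decoded {} key vectors", cache.all_keys().len());
}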

Improvements over llama.cpp TurboQuant (tq3_0)

This implementation differs from the llama.cpp tq3_0 branch in several important ways:

1. QJL Bias Correction (mandatory, not omitted)

llama.cpp tq3_0 implements only PolarQuant (Stage 1) and omits QJL entirely. Without QJL, inner product estimates carry a systematic multiplicative bias of 2/pi that accumulates across all keys in the softmax during attention. This bias is not visible in short-context benchmarks but degrades quality at long contexts (8k+ tokens), which is the primary use case for KV-cache compression.

Our implementation includes the full TURBOQUANTprod algorithm (Algorithm 2 from the paper) with QJL bias correction, guaranteeing E[<y,x>_est] = <y,x> (mathematically unbiased).
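
A classical way to see where a multiplicative bias of this kind comes from is the arcsine law for 1-bit (sign) quantization of jointly Gaussian variables, E[sign(X)·sign(Y)] = (2/pi)·arcsin(rho), which shrinks small correlations by roughly a factor of 2/pi. The Monte-Carlo toy below only illustrates that shrinkage; it is not the paper's estimator and not code from this crate.

// Toy check of the arcsine law: sign-quantized correlation estimates of
// jointly Gaussian (X, Y) are shrunk toward zero by roughly 2/pi for small rho.
fn uniform(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    ((*state >> 11) as f64 + 0.5) / (1u64 << 53) as f64
}

fn gaussian(state: &mut u64) -> f64 {
    // Box-Muller transform from two uniforms
    let (u1, u2) = (uniform(state), uniform(state));
    (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos()
}

fn main() {
    let rho = 0.05_f64; // small correlation, as for a typical query-key pair
    let trials = 2_000_000;
    let mut state = 0x9e37_79b9_7f4a_7c15_u64;
    let mut sign_mean = 0.0_f64;

    for _ in 0..trials {
        let x = gaussian(&mut state);
        let y = rho * x + (1.0 - rho * rho).sqrt() * gaussian(&mut state);
        sign_mean += x.signum() * y.signum() / trials as f64;
    }

    let predicted = 2.0 / std::f64::consts::PI * rho.asin();
    println!("true correlation    : {rho:.4}");
    println!("sign-based estimate : {sign_mean:.4} (arcsine law predicts {predicted:.4})");
    println!("shrinkage factor    : {:.3} (2/pi = {:.3})", sign_mean / rho, 2.0 / std::f64::consts::PI);
}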

2. Dimension-Specific Codebooks (exact Beta distribution)

| Aspect | llama.cpp tq3_0 | turboquant (this crate) |
|---|---|---|
| Distribution | Gaussian N(0,1) approximation | Exact Beta distribution per dimension |
| Codebooks | Single fixed set for all dimensions | Pre-computed per (bits, dim) pair |
| 3-bit centroids (d=128) | [-2.15, -1.34, -0.76, -0.25, +0.25, +0.76, +1.34, +2.15] | [-0.189, -0.118, -0.067, -0.022, +0.022, +0.067, +0.118, +0.189] |
| Relationship | Centroids for normalized coordinates | llama_centroid ~= ours * sqrt(d) |

After rotation, the coordinates of a d-dimensional unit vector follow a Beta-type distribution on [-1, 1], not a Gaussian. The Gaussian is the limiting distribution as d approaches infinity. For practical dimensions (d=64-256), the Beta distribution is a better fit. Our codebooks are optimal Lloyd-Max quantizers for the exact distribution, yielding slightly lower MSE.
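
The sqrt(d) relationship can be checked directly from the centroid values quoted in the table; the following standalone arithmetic sketch (not crate code) multiplies this crate's d=128 centroids by sqrt(128) and compares them to the llama.cpp set.

// Verify llama_centroid ~= turboquant_centroid * sqrt(d) for d = 128,
// using the 3-bit centroid values quoted above.
fn main() {
    let llama = [-2.15_f32, -1.34, -0.76, -0.25, 0.25, 0.76, 1.34, 2.15];
    let ours = [-0.189_f32, -0.118, -0.067, -0.022, 0.022, 0.067, 0.118, 0.189];
    let scale = 128.0_f32.sqrt(); // ~11.31

    for (l, o) in llama.iter().zip(&ours) {
        println!("{:+.3} ~= {:+.3} * sqrt(128) = {:+.3}", l, o, o * scale);
    }
}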

3. Flexible Block Sizes

llama.cpp uses a fixed block size of 32 values. Our implementation supports variable dimensions (64, 128, 256) matching common LLM head dimensions directly, avoiding padding waste.

4. Hash-Based Rademacher (no crypto RNG needed)

The QJL projection matrix uses deterministic hash-based sign generation instead of requiring a cryptographic RNG, making it fast and reproducible across platforms.
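
For reference, hash-based Rademacher generation typically mixes the seed and the matrix position through an integer hash and keeps one bit of the result. The sketch below uses a splitmix64-style finalizer; it shows the technique in general, not the crate's exact hash function.

// Deterministic, platform-independent +1/-1 sign from (seed, row, col):
// mix through a splitmix64-style finalizer and take the low bit.
fn rademacher_sign(seed: u64, row: u64, col: u64) -> f32 {
    let mut x = seed
        ^ row.wrapping_mul(0x9e37_79b9_7f4a_7c15)
        ^ col.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^= x >> 31;
    if x & 1 == 0 { 1.0 } else { -1.0 }
}

fn main() {
    // The same (seed, row, col) always yields the same sign, so the projection
    // matrix never needs to be stored or transmitted.
    let signs: Vec<f32> = (0..8).map(|c| rademacher_sign(42, 0, c)).collect();
    println!("{signs:?}");
}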

5. Bit-Packing Compatible with llama.cpp

The 3-bit packing layout is identical to llama.cpp tq3_0 (8 indices into 3 bytes, same byte order), ensuring potential interoperability at the data level.
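
The arithmetic of that layout, fitting eight 3-bit indices into 24 bits, is shown in the sketch below. The bit order here is illustrative (little-endian within the 24-bit group); consult the crate or llama.cpp source for the authoritative layout.

// Pack eight 3-bit indices (values 0..=7) into 3 bytes and unpack them again.
fn pack3(indices: [u8; 8]) -> [u8; 3] {
    let mut bits: u32 = 0;
    for (i, &idx) in indices.iter().enumerate() {
        debug_assert!(idx < 8);
        bits |= (idx as u32) << (3 * i); // 8 * 3 = 24 bits total
    }
    [bits as u8, (bits >> 8) as u8, (bits >> 16) as u8]
}

fn unpack3(bytes: [u8; 3]) -> [u8; 8] {
    let bits = bytes[0] as u32 | (bytes[1] as u32) << 8 | (bytes[2] as u32) << 16;
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((bits >> (3 * i)) & 0b111) as u8;
    }
    out
}

fn main() {
    let idx = [5, 0, 7, 3, 1, 6, 2, 4];
    assert_eq!(unpack3(pack3(idx)), idx); // round-trips losslessly
    println!("packed bytes: {:?}", pack3(idx));
}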

Installation

[dependencies]
turboquant = { git = "https://github.com/nicosql/turboquant.git" }

Building with Native CPU Optimizations

The crate is configured to use native CPU features (AVX2, FMA, etc.) automatically via .cargo/config.toml. For maximum performance:

cargo build --release

Running Examples

cargo run --example basic_quantize
cargo run --example kv_cache_demo

References

  • Zandieh et al., "TurboQuant", ICLR 2026.

License

Licensed under either of Apache License, Version 2.0 or MIT license, at your option.