turboquant-rs

Rust port of turboquant_plus — KV cache compression for LLM inference via PolarQuant + QJL.

Background

This project implements the algorithms described in the TurboQuant paper (ICLR 2026, arXiv 2504.19874) and the related PolarQuant paper (AISTATS 2026, arXiv 2502.02617).

The original Python reference implementation by TheTom demonstrated that KV cache tensors in transformer models can be aggressively compressed with minimal quality loss, achieving 3.8–6.4× compression ratios. This crate is a faithful Rust port of that work.

Key ideas

V compression is nearly free — value cache compression down to 2 bits shows negligible quality impact when key precision is maintained.
K precision dominates — all observed quality degradation stems from key cache compression.
Boundary layers matter — protecting the first and last two layers at higher precision recovers 37–91% of the quality gap.

Compression formats

Format	Bits/value	Compression	PPL vs q8_0
turbo4	4.25	3.8×	+0.23%
turbo3	3.50	4.6×	+1.06%
turbo2	2.50	6.4×	+6.48%

Architecture

The two-stage TurboQuant algorithm (Algorithm 2 from the paper):

PolarQuant (b−1 bits) — random rotation (Haar-distributed or fast Walsh-Hadamard) followed by optimal scalar quantization using Lloyd's algorithm on the post-rotation Gaussian distribution.
QJL (1 bit) — Quantized Johnson-Lindenstrauss sign projection on the residual to eliminate inner-product bias.

For V cache, only the MSE-optimal PolarQuant stage is used (Algorithm 1), since value reconstruction prioritises MSE over inner-product preservation.

Modules

Module	Description
`rotation`	Haar-distributed rotation via QR, fast Walsh-Hadamard transform
`codebook`	Optimal MSE centroids — closed-form (1–2 bit) and Lloyd's algorithm (3+ bit)
`polar_quant`	PolarQuant quantizer with norm extraction and correction
`qjl`	QJL 1-bit sign quantizer preserving inner products
`turboquant`	Combined TurboQuant and MSE-only variant
`kv_cache`	KV cache compressor — TurboQuant for K, PolarQuant-MSE for V
`outlier`	Non-integer bit rates via outlier/normal channel splitting
`utils`	Bit packing, index packing, memory footprint calculation

Usage

use turboquant::{TurboQuant, KVCacheCompressor};
use ndarray::Array1;

// Quantize a single vector at 3 bits
let tq = TurboQuant::new(128, 3, 42, true);
let x = Array1::from_shape_fn(128, |i| (i as f64) / 128.0);
let compressed = tq.quantize(&x);
let x_hat = tq.dequantize(&compressed);

// KV cache memory savings
let compressor = KVCacheCompressor::new(128, 3, 3, 42, true);
let stats = compressor.memory_stats(1024, 32, 32);
println!("Compression ratio: {:.1}×", stats.compression_ratio);

Building and testing

cargo build
cargo test

References

TurboQuant: Ankur Moitra, Mark Sellke, TurboQuant: Online Vector Quantization for Efficient KV Cache Compression, ICLR 2026. arXiv:2504.19874
PolarQuant: Ankur Moitra, Mark Sellke, PolarQuant: Provably Good Quantization via Polar Coordinates, AISTATS 2026. arXiv:2502.02617
Reference implementation: TheTom/turboquant_plus (Python)

License

MIT

turboquant-plus-rs 0.1.0