turbo-quant
Rust implementation of TurboQuant, PolarQuant, and QJL -- zero-overhead vector quantization for semantic search and KV cache compression.
Compress high-dimensional vectors (embeddings, KV cache entries) to 3-8 bits per value with zero accuracy loss and no dataset-specific calibration. Based on the algorithms from Google Research published at ICLR 2026, AISTATS 2026, and AAAI 2025.
Table of Contents
- Why turbo-quant?
- Key Properties
- Architecture
- Getting Started
- Usage
- Choosing Parameters
- How It Works
- API Reference
- Performance Characteristics
- Testing
- Project Structure
- References
- License
Why turbo-quant?
Traditional vector quantization methods like Product Quantization (PQ) and Optimized Product Quantization (OPQ) require an expensive offline training phase over a representative dataset before a single vector can be indexed. This means:
- You can't index vectors as they arrive in a streaming setting
- Quantizer quality degrades when the data distribution shifts
- Training on sensitive data may violate privacy constraints
turbo-quant eliminates all of these problems. The quantizer is entirely defined by four integers (dim, bits, projections, seed) -- no codebooks, no centroids, no calibration data. Vectors can be compressed the instant they arrive, making it ideal for real-time indexing, federated search, and privacy-sensitive deployments.
Key Properties
| Property | Description |
|---|---|
| Data-oblivious | No k-means training, no codebook, no calibration set. The rotation is seeded once and works on any distribution. |
| Deterministic | Identical (dim, bits, projections, seed) always produces the same quantizer. State can be fully reconstructed from four integers. |
| Zero accuracy loss | At 3+ bits, inner product estimates are provably unbiased and achieve near-optimal distortion (within ~2.7x of the Shannon limit). |
| Instant indexing | Unlike Product Quantization, there is no offline training phase. Vectors can be indexed as they arrive. |
| Compact state | The entire quantizer is defined by (dim, bits, projections, seed). No model files to ship or version. |
Architecture
turbo-quant is organized as a layered system where each algorithm builds on the previous:
+------------------------------------------------------------------+
| KvCacheCompressor |
| Online KV cache for transformer attention (compress_token, |
| attention_scores, attend) |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| TurboQuantizer |
| Two-stage compression: PolarQuant (b-1 bits) + QJL (1 bit) |
| Provably unbiased inner product estimation |
+------------------------------------------------------------------+
/ \
v v
+-------------------------------+ +-------------------------------+
| PolarQuantizer | | QjlQuantizer |
| Polar coordinate encoding | | 1-bit JL sketching of the |
| of rotated vectors | | reconstruction residual |
+-------------------------------+ +-------------------------------+
\ /
v v
+------------------------------------------------------------------+
| StoredRotation |
| Haar-distributed random orthogonal matrix via QR decomposition |
| Deterministic from (dim, seed) using ChaCha8 RNG |
+------------------------------------------------------------------+
Getting Started
Requirements
- Rust 1.75.0 or later
- Cargo (included with Rust)
Installation
Add to your Cargo.toml:

```toml
[dependencies]
turbo-quant = { git = "https://github.com/RecursiveIntell/turbo-quant" }
```

Or clone and build from source:

```sh
git clone https://github.com/RecursiveIntell/turbo-quant
cd turbo-quant
cargo build --release
```
Usage
TurboQuant (Two-Stage Compression)
The flagship algorithm. Combines PolarQuant with a QJL residual sketch for near-optimal distortion at any bit width.
```rust
use turbo_quant::TurboQuantizer;

// Create a quantizer for 1536-dimensional embeddings at 8 bits,
// with 384 QJL projections and seed 42 (values are illustrative).
let q = TurboQuantizer::new(1536, 8, 384, 42).unwrap();

// Compress a database vector.
let embedding: Vec<f32> = vec![0.1; 1536];
let code = q.encode(&embedding).unwrap();

// At query time: estimate inner product without decompressing.
let query: Vec<f32> = vec![0.2; 1536];
let score = q.inner_product_estimate(&code, &query).unwrap();

// Estimate L2 distance.
let dist = q.l2_distance_estimate(&code, &query).unwrap();

// Approximate reconstruction (lossy).
let reconstructed = q.decode_approximate(&code).unwrap();

// Compression metrics: a TurboCode occupies 3d + m bytes
// (see Performance Characteristics) versus 4d bytes uncompressed.
println!("uncompressed: {} bytes", embedding.len() * 4);
println!("compressed: {} bytes", 3 * 1536 + 384);

// Batch statistics over multiple codes.
let codes = vec![code];
let stats = q.batch_stats(&codes);
println!("{stats:?}");
```
PolarQuant (Single-Stage Compression)
A simpler, single-stage compressor. Lower overhead than TurboQuant but slightly higher distortion at low bit widths.
```rust
use turbo_quant::PolarQuantizer;

// 128-dimensional vectors, 8-bit angles, seed 7 (values are illustrative).
let pq = PolarQuantizer::new(128, 8, 7).unwrap();

let vector: Vec<f32> = vec![0.5; 128];
let code = pq.encode(&vector).unwrap();

// Decode (radii are stored losslessly as f32; only angles are quantized).
let decoded = pq.decode(&code).unwrap();

// Inner product and L2 distance estimation.
let query: Vec<f32> = vec![0.25; 128];
let ip = pq.inner_product_estimate(&code, &query).unwrap();
let l2 = pq.l2_distance_estimate(&code, &query).unwrap();
```
QJL (1-bit Sketching)
The Quantized Johnson-Lindenstrauss transform. Produces a 1-bit-per-projection sketch for unbiased inner product estimation. Used internally by TurboQuant for residual correction, but also available standalone.
```rust
use turbo_quant::QjlQuantizer;

// 128 dimensions, 32 one-bit projections, seed 7 (values are illustrative).
let qjl = QjlQuantizer::new(128, 32, 7).unwrap();

let vector: Vec<f32> = vec![0.3; 128];
let sketch = qjl.sketch(&vector).unwrap();

// Estimate inner product using sketch + raw query.
let query: Vec<f32> = vec![0.7; 128];
let ip = qjl.inner_product_estimate(&sketch, &query).unwrap();
println!("estimated inner product: {ip}");
```
KV Cache Compression
Online KV cache compressor for transformer attention layers. Compresses key-value pairs as tokens are generated and computes attention scores directly on compressed data.
```rust
use turbo_quant::{KvCacheCompressor, KvCacheConfig};

// `sequence_length`, `compute_kv`, and `compute_query` are placeholders
// for your model's values. The config fields shown follow the four
// defining integers; exact field names may differ -- see KvCacheConfig.
let config = KvCacheConfig { dim: 128, bits: 4, projections: 16, seed: 42 };
let mut cache = KvCacheCompressor::new(config).unwrap();

// Compress tokens as they are generated.
for step in 0..sequence_length {
    let (key, value) = compute_kv(step);
    cache.compress_token(&key, &value).unwrap();
}

// Compute attention logits against all compressed keys.
let query = compute_query();
let scores = cache.attention_scores(&query).unwrap();

// Full softmax-weighted attention output.
let output = cache.attend(&query).unwrap();

// Decode specific token values (e.g., top-k).
let top_k_indices = vec![0, 3, 7];
let values = cache.decode_values(&top_k_indices).unwrap();

// Compression metrics.
println!("cached tokens: {}", scores.len());
```
Serialization and Persistence
All compressed codes support JSON serialization via serde. Quantizers themselves don't need to be serialized -- they can be reconstructed from their parameters.
```rust
use turbo_quant::{TurboCode, TurboQuantizer};

let q = TurboQuantizer::new(1536, 8, 384, 42).unwrap();
let code = q.encode(&vec![0.1; 1536]).unwrap();

// Serialize to JSON.
let json = serde_json::to_string(&code).unwrap();

// Deserialize back.
let restored: TurboCode = serde_json::from_str(&json).unwrap();

// Quantizer reconstruction: just save the 4 parameters.
let q2 = TurboQuantizer::new(1536, 8, 384, 42).unwrap();
// q2 is identical to q -- same rotation, same behavior.
```
Choosing Parameters
| Use Case | Bits | Projections | Notes |
|---|---|---|---|
| Semantic search (recall@10) | 8 | dim / 4 | Best accuracy; suitable for production retrieval |
| KV cache compression | 4-6 | dim / 8 | Balances memory savings with attention quality |
| Maximum compression | 3 | dim / 16 | Aggressive; still provably unbiased |
| Lightweight similarity | - | dim / 4 | Use QJL standalone for 1-bit sketches |
Guidelines:
- Bits controls the PolarQuant fidelity. Higher bits = less distortion but larger codes. The QJL stage always uses 1 bit per projection.
- Projections controls the QJL residual correction precision. More projections = better bias correction but larger residual sketches. Diminishing returns beyond dim/4.
- Seed must be the same for encoding and querying. Different seeds produce incompatible quantizers.
- Dimension must be even (required for polar pair encoding).
How It Works
Stage 1: Random Rotation (Whitening)
Real-world embeddings have non-uniform energy distributions across coordinates. A random orthogonal rotation (Haar-distributed, generated via QR decomposition of a Gaussian matrix) spreads the energy uniformly, making simple scalar quantization near-optimal.
x_rotated = R * x where R is a d x d orthogonal matrix
The rotation is deterministic: the same (dim, seed) always produces the same R via ChaCha8 RNG.
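The construction can be sketched with stdlib-only Rust. Here modified Gram-Schmidt stands in for the crate's QR-of-Gaussian path, and `lcg` stands in for ChaCha8; both helpers are illustrative, not turbo-quant's API. The properties checked are the ones the quantizer relies on: determinism from (dim, seed), orthogonality, and norm preservation.

```rust
// Illustration only: `lcg` and `random_orthogonal` are not part of
// turbo-quant; they mimic its deterministic-rotation construction.

fn lcg(state: &mut u64) -> f64 {
    // Tiny deterministic generator standing in for ChaCha8.
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    ((*state >> 11) as f64) / (1u64 << 53) as f64 - 0.5
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Deterministic d x d orthogonal matrix from (d, seed).
fn random_orthogonal(d: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut state = seed;
    let mut r: Vec<Vec<f64>> = (0..d)
        .map(|_| (0..d).map(|_| lcg(&mut state)).collect())
        .collect();
    // Modified Gram-Schmidt: make the rows orthonormal.
    for i in 0..d {
        for j in 0..i {
            let prev = r[j].clone();
            let proj = dot(&r[i], &prev);
            for k in 0..d {
                r[i][k] -= proj * prev[k];
            }
        }
        let norm = dot(&r[i], &r[i]).sqrt();
        for k in 0..d {
            r[i][k] /= norm;
        }
    }
    r
}

fn main() {
    let r = random_orthogonal(8, 42);
    // Same (dim, seed) -> same matrix.
    assert_eq!(r, random_orthogonal(8, 42));
    // Rotation preserves norms: |R x| = |x|.
    let x: Vec<f64> = (0..8).map(|i| i as f64).collect();
    let rx: Vec<f64> = (0..8).map(|i| dot(&r[i], &x)).collect();
    assert!((dot(&rx, &rx).sqrt() - dot(&x, &x).sqrt()).abs() < 1e-9);
    println!("orthogonal rotation ok");
}
```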
Stage 2: PolarQuant (Polar Coordinate Encoding)
After rotation, the d-dimensional vector is grouped into d/2 coordinate pairs. Each pair is converted to polar form:
(a, b) -> (r, theta) where r = sqrt(a^2 + b^2), theta = atan2(b, a)
- Radii are stored losslessly as f32 values
- Angles are uniformly quantized to b bits on [-pi, pi], yielding 2^b quantization levels
This preserves directional information with controllable precision while retaining exact magnitude information.
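The pair encoding above can be sketched in a few lines of plain Rust (helper names are illustrative, not the crate's API):

```rust
use std::f32::consts::{PI, TAU};

// Map theta in [-pi, pi] to one of 2^bits uniform levels.
fn quantize_angle(theta: f32, bits: u32) -> u32 {
    let levels = (1u32 << bits) as f32;
    (((theta + PI) / TAU * levels) as u32).min(levels as u32 - 1)
}

// Reconstruct at the center of the quantization bin.
fn dequantize_angle(q: u32, bits: u32) -> f32 {
    let levels = (1u32 << bits) as f32;
    (q as f32 + 0.5) / levels * TAU - PI
}

fn main() {
    let (a, b) = (0.6_f32, -0.8_f32);
    let r = (a * a + b * b).sqrt(); // radius, kept exactly as f32
    let theta = b.atan2(a);        // angle, quantized to 2^bits levels
    let bits = 8;
    let q = quantize_angle(theta, bits);
    let theta_hat = dequantize_angle(q, bits);
    // Reconstruct: magnitude is exact, direction is off by at most
    // half a bin width (pi / 2^bits radians).
    let (a_hat, b_hat) = (r * theta_hat.cos(), r * theta_hat.sin());
    let err = ((a - a_hat).powi(2) + (b - b_hat).powi(2)).sqrt();
    assert!(err <= r * PI / (1u32 << bits) as f32);
    println!("pair error at {bits} bits: {err:.5}");
}
```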
Stage 3: QJL Residual Correction
TurboQuant computes the reconstruction residual (original minus PolarQuant approximation) and sketches it using the Quantized Johnson-Lindenstrauss transform:
residual = x - decode(polar_encode(x))
sketch = [sign(g_1 . residual), ..., sign(g_m . residual)]
where g_1, ..., g_m are random Gaussian projection vectors. Each sketch entry is a single bit (sign).
Inner Product Estimation
The combined estimator is provably unbiased:
<x, y> ~ IP_polar(code, y) + IP_qjl(residual_sketch, y)
- IP_polar: Computed directly from quantized angles and stored radii against the rotated query
- IP_qjl: (pi / 2m) * sum_i sign(g_i . residual) * (g_i . query) -- an unbiased estimator by the JL lemma
At 3+ bits this achieves distortion within ~2.7x of the Shannon rate-distortion limit.
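The QJL term of the estimator, exactly as written above, in a few lines of self-contained Rust. The projection vectors here are fixed toy values rather than seeded Gaussians, and the helper names are illustrative:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// sketch_i = sign(g_i . residual): one bit per projection.
fn qjl_sketch(g: &[Vec<f32>], residual: &[f32]) -> Vec<i8> {
    g.iter()
        .map(|gi| if dot(gi, residual) >= 0.0 { 1 } else { -1 })
        .collect()
}

// IP_qjl = (pi / 2m) * sum_i sign_i * (g_i . query)
fn qjl_inner_product(g: &[Vec<f32>], sketch: &[i8], query: &[f32]) -> f32 {
    let m = g.len() as f32;
    let sum: f32 = g
        .iter()
        .zip(sketch)
        .map(|(gi, &s)| f32::from(s) * dot(gi, query))
        .sum();
    std::f32::consts::FRAC_PI_2 / m * sum
}

fn main() {
    // m = 2 toy projections in d = 3 (real ones are seeded Gaussians).
    let g = vec![vec![1.0, 0.0, 0.0], vec![0.0, 1.0, 0.0]];
    let residual = [0.4, -0.2, 0.1];
    let query = [1.0, 2.0, 3.0];
    let sketch = qjl_sketch(&g, &residual); // [1, -1]
    let ip = qjl_inner_product(&g, &sketch, &query);
    println!("IP_qjl = {ip:.4}"); // (pi/4) * (1.0 - 2.0)
}
```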
API Reference
Core Types
| Type | Description |
|---|---|
| TurboQuantizer | Two-stage quantizer (PolarQuant + QJL). The primary entry point. |
| TurboCode | Compressed vector output from TurboQuantizer. |
| PolarQuantizer | Single-stage polar coordinate quantizer. |
| PolarCode | Compressed vector output from PolarQuantizer. |
| QjlQuantizer | Standalone 1-bit JL sketch quantizer. |
| QjlSketch | 1-bit sketch output from QjlQuantizer. |
| KvCacheCompressor | Online KV cache for transformer attention. |
| KvCacheConfig | Configuration struct for KvCacheCompressor. |
| CompressedToken | A single compressed key-value pair in the cache. |
| StoredRotation | A d x d orthogonal rotation matrix. |
| BatchStats | Aggregate compression statistics. |
Constructors
Every quantizer is built with a new constructor taking the relevant subset of the four defining integers -- for example, TurboQuantizer::new(dim, bits, projections, seed). There is no training or fitting step; construction is deterministic.
Error Handling
All fallible operations return Result<T, TurboQuantError>. Error variants:
| Variant | Cause |
|---|---|
| DimensionMismatch | Input vector length does not match quantizer dimension |
| OddDimension | Dimension must be even for polar pair encoding |
| ZeroDimension | Dimension must be non-zero |
| InvalidBitWidth | Bits must be in range 1-16 |
| ZeroProjectionCount | Projections must be non-zero |
| RotationFailed | Internal error during rotation matrix generation |
| MalformedCode | Compressed code failed validation |
Performance Characteristics
Time Complexity
| Operation | Complexity | Notes |
|---|---|---|
| new() | O(d^3) | One-time cost: QR decomposition of a d x d matrix |
| encode() | O(d^2 + d * m) | Dense rotation plus m residual projections |
| inner_product_estimate() | O(d^2 + d * m) | No decompression needed |
| decode() | O(d) | PolarQuant only; TurboQuant decode is approximate |
| attention_scores() | O(n * d * m) | n = number of cached tokens |
Space Complexity
| Component | Size |
|---|---|
| Quantizer state | O(d^2) for the rotation matrix (~9 MB at d=1536) |
| PolarCode | d/2 * 4 bytes (radii) + d/2 * 2 bytes (angles) = 3d bytes |
| QjlSketch | m bytes (one sign byte per projection) |
| TurboCode | 3d + m bytes total |
Compression Ratios
| Dimension | Bits | Projections | Uncompressed | Compressed | Ratio |
|---|---|---|---|---|---|
| 1536 | 8 | 384 | 6144 B | 4992 B | 1.23x |
| 1536 | 4 | 192 | 6144 B | 4800 B | 1.28x |
| 128 | 8 | 32 | 512 B | 416 B | 1.23x |
| 128 | 4 | 16 | 512 B | 400 B | 1.28x |
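Each row follows directly from the Space Complexity formulas (3d + m bytes per TurboCode); the first row (d = 1536, m = 384), for instance, works out as below. The helper function is illustrative, not part of the crate:

```rust
// Code size from the Space Complexity formulas: 3d + m bytes.
fn turbo_code_bytes(d: usize, m: usize) -> usize {
    let radii = (d / 2) * 4;  // f32 radii
    let angles = (d / 2) * 2; // quantized angles, <= 16 bits each
    radii + angles + m        // plus one sign byte per QJL projection
}

fn main() {
    let (d, m) = (1536, 384);
    let uncompressed = d * 4; // f32 input vector
    let compressed = turbo_code_bytes(d, m);
    println!(
        "{uncompressed} B -> {compressed} B ({:.2}x)",
        uncompressed as f64 / compressed as f64
    ); // prints "6144 B -> 4992 B (1.23x)"
}
```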
The primary value is not raw byte savings but the ability to compute inner products directly on compressed data, avoiding full decompression during search.
Testing
Run the full test suite:

```sh
cargo test
```

Run specific test categories:

```sh
cargo test --test determinism
cargo test --test inner_product
```
Test Coverage
The test suite validates:
- Determinism: Same seed produces identical codes across quantizer instances
- Seed independence: Different seeds produce different codes
- Encoding order independence: Encoding order does not affect individual codes
- Rank preservation: Top-k ordering is preserved at 8 bits
- Error scaling: Distortion decreases monotonically with bit width
- Accuracy bounds: Relative error < 5% at 16 bits; top-10 recall >= 60% at 8 bits
- Mathematical properties: Self inner product > 0, antipodal vectors yield negative estimates, L2 self-distance near zero
- KV cache correctness: Attention score ordering, softmax normalization, dimension validation
- Serialization roundtrips: JSON encode/decode for all code types
- Edge cases: Zero vectors, unit vectors, empty caches, dimension mismatches
Project Structure
turbo-quant/
├── Cargo.toml # Package manifest and dependencies
├── README.md # This file
├── src/
│ ├── lib.rs # Library root, module exports, and top-level docs
│ ├── error.rs # Error types (TurboQuantError) and Result alias
│ ├── rotation.rs # StoredRotation: Haar-random orthogonal matrices via QR
│ ├── polar.rs # PolarQuantizer: single-stage polar coordinate encoding
│ ├── qjl.rs # QjlQuantizer: 1-bit Johnson-Lindenstrauss sketching
│ ├── turbo.rs # TurboQuantizer: two-stage PolarQuant + QJL compression
│ └── kv.rs # KvCacheCompressor: online KV cache for transformers
└── tests/
├── determinism.rs # Reproducibility, seed, and ordering tests
└── inner_product.rs # Accuracy, recall, and mathematical property tests
Module Dependency Graph
lib.rs
├── error.rs (standalone)
├── rotation.rs (standalone, uses nalgebra + rand_chacha)
├── polar.rs (depends on rotation, error)
├── qjl.rs (depends on error, uses rand_chacha)
├── turbo.rs (depends on polar, qjl, error)
└── kv.rs (depends on turbo, error)
References
- TurboQuant: Zandieh et al., "TurboQuant: Redefining AI Efficiency with Extreme Compression", ICLR 2026.
- PolarQuant: Zandieh et al., AISTATS 2026.
- QJL (Quantized Johnson-Lindenstrauss): Zandieh et al., AAAI 2025.
License
This project is licensed under the MIT License.