TurboQuant – KV-cache quantization with near-zero accuracy loss.
Implements Google’s TurboQuant algorithm (Zandieh et al., ICLR 2026) for compressing LLM key-value caches to 3-4 bits per value.
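To make the 3–4-bits-per-value claim concrete, here is a minimal, self-contained sketch of a block-wise 4-bit quantize/dequantize round trip. It uses a plain uniform quantizer, not the Lloyd-Max codebooks TurboQuant actually uses, and it does not call the crate's own quantize_vec / dequantize_vec API (whose signatures are not shown on this page); all names below are ours.

```rust
/// Toy 4-bit uniform quantizer: illustrative only. TurboQuant's real
/// pipeline adds rotations and Lloyd-Max codebooks; this only shows
/// the quantize -> codes -> dequantize round trip at 4 bits.
fn quantize_4bit(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // 16 levels span [min, max]; guard against a constant block.
    let scale = ((max - min) / 15.0).max(f32::EPSILON);
    let codes = values
        .iter()
        .map(|&v| (((v - min) / scale).round() as u8).min(15))
        .collect();
    (codes, min, scale)
}

fn dequantize_4bit(codes: &[u8], min: f32, scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| min + c as f32 * scale).collect()
}

fn main() {
    let v = vec![0.10_f32, -0.72, 0.31, 0.94];
    let (codes, min, scale) = quantize_4bit(&v);
    println!("codes: {codes:?}");
    println!("round trip: {:?}", dequantize_4bit(&codes, min, scale));
}
```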
Re-exports
pub use attention::PackedImport;
pub use attention::QuantizedKVCache;
pub use error::Result;
pub use error::TurboQuantError;
pub use packed::PackedBlock;
pub use packed::TurboQuantConfig;
pub use qjl::compute_qjl_signs;
pub use qjl::estimate_inner_product;
pub use qjl::estimate_inner_product_single;
pub use qjl::estimate_inner_product_with_codebook;
pub use qjl::precompute_query_projections;
pub use qjl::quantize_with_qjl;
pub use qjl::quantize_with_qjl_resources;
pub use qjl::EstimationContext;
pub use qjl::QjlBatchResources;
pub use qjl::QjlBlock;
pub use quantize::dequantize_into_with_codebook;
pub use quantize::dequantize_vec;
pub use quantize::dequantize_vec_with_codebook;
pub use quantize::quantize_vec;
pub use quantize::quantize_vec_with_codebook;
pub use rotation::RotationOrder;
Modules
- attention: High-level quantized KV-cache attention API.
- codebook: Codebook lookup, Beta PDF, and pre-computed tables for Lloyd-Max quantization (sketched after this list).
- error: Error types (TurboQuantError) and the crate-wide Result alias.
- math: Mathematical helper functions used across TurboQuant modules.
- packed: Packed data structures for quantized blocks.
- qjl: QJL (Quantized Johnson-Lindenstrauss) bias correction (see the estimator sketch below).
- quantize: Core quantize / dequantize pipeline.
- rotation: Walsh-Hadamard Transform and random rotation (see the WHT sketch below).
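For the codebook module above: Lloyd-Max quantization alternates between assigning samples to their nearest codeword and moving each codeword to the centroid of its cell, converging to a locally MSE-optimal scalar codebook. The crate pre-computes its tables against a Beta PDF; the sketch below instead fits a codebook to empirical samples, which conveys the same iteration without assuming anything about the crate's API.

```rust
/// Toy Lloyd-Max iteration for a scalar codebook, fitted to empirical
/// samples rather than the analytic Beta PDF the codebook module's
/// pre-computed tables use. Illustrative only.
fn lloyd_max(samples: &[f32], levels: usize, iters: usize) -> Vec<f32> {
    let lo = samples.iter().cloned().fold(f32::INFINITY, f32::min);
    let hi = samples.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Initialize centroids uniformly over the sample range.
    let mut c: Vec<f32> = (0..levels)
        .map(|i| lo + (hi - lo) * (i as f32 + 0.5) / levels as f32)
        .collect();
    for _ in 0..iters {
        let mut sum = vec![0.0f32; levels];
        let mut cnt = vec![0usize; levels];
        for &x in samples {
            // Assign each sample to its nearest centroid...
            let k = c
                .iter()
                .enumerate()
                .min_by(|a, b| (a.1 - x).abs().partial_cmp(&(b.1 - x).abs()).unwrap())
                .unwrap()
                .0;
            sum[k] += x;
            cnt[k] += 1;
        }
        // ...then move each centroid to the mean of its cell.
        for k in 0..levels {
            if cnt[k] > 0 {
                c[k] = sum[k] / cnt[k] as f32;
            }
        }
    }
    c
}

fn main() {
    // Fit a 16-level (4-bit) codebook to deterministic sample data.
    let samples: Vec<f32> = (0..1000)
        .map(|i| ((i * 37 % 1000) as f32 / 1000.0).powi(2))
        .collect();
    println!("codebook: {:?}", lloyd_max(&samples, 16, 25));
}
```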
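For the qjl module: the idea behind QJL is to store each key as the 1-bit signs of a random projection and correct the known bias of the resulting inner-product estimate, where sqrt(pi/2) is the correction factor. The dependency-free sketch below is our own illustration of that estimator; the crate's compute_qjl_signs / estimate_inner_product functions presumably add packing and batching, but their signatures are not shown here, so nothing below is the crate's API.

```rust
/// Sketch of the 1-bit Quantized JL estimator: a key k is stored as
/// sign(S k) for a random Gaussian matrix S, and <q, k> is recovered
/// via the bias-corrected estimate sqrt(pi/2)/m * ||k|| * <S q, sign(S k)>.

// Tiny xorshift PRNG plus a sum-of-uniforms approximation to N(0, 1),
// used only to keep this sketch free of external crates.
fn xorshift(state: &mut u64) -> f32 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 40) as f32 / (1u64 << 24) as f32 // uniform in [0, 1)
}
fn gauss(state: &mut u64) -> f32 {
    (0..12).map(|_| xorshift(state)).sum::<f32>() - 6.0
}

fn main() {
    let (d, m) = (64usize, 256usize); // input dim, sketch dim
    let mut st = 0x9E37_79B9_7F4A_7C15u64;
    // Random projection S: an m x d matrix of N(0, 1) entries.
    let s: Vec<Vec<f32>> = (0..m)
        .map(|_| (0..d).map(|_| gauss(&mut st)).collect())
        .collect();
    let key: Vec<f32> = (0..d).map(|_| gauss(&mut st)).collect();
    let query: Vec<f32> = (0..d).map(|_| gauss(&mut st)).collect();

    // 1-bit sketch of the key: the signs of S * key.
    let signs: Vec<f32> = s
        .iter()
        .map(|row| row.iter().zip(&key).map(|(a, b)| a * b).sum::<f32>().signum())
        .collect();
    let key_norm = key.iter().map(|x| x * x).sum::<f32>().sqrt();

    // Bias-corrected estimate of <query, key> from the 1-bit sketch.
    let sq_dot_signs: f32 = s
        .iter()
        .zip(&signs)
        .map(|(row, &sg)| sg * row.iter().zip(&query).map(|(a, b)| a * b).sum::<f32>())
        .sum();
    let est = (std::f32::consts::PI / 2.0).sqrt() / m as f32 * key_norm * sq_dot_signs;

    let exact: f32 = query.iter().zip(&key).map(|(a, b)| a * b).sum();
    println!("exact = {exact:.3}, estimate = {est:.3}");
}
```

The estimate is unbiased but noisy; its error shrinks as the sketch dimension m grows, which is the usual accuracy/memory trade-off for 1-bit JL sketches.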
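For the rotation module: the Walsh-Hadamard Transform is the standard fast building block for cheap "random rotations" (composed with random sign flips, it spreads energy across coordinates in O(n log n)). Below is a self-contained in-place FWHT; how the crate composes it with sign flips and RotationOrder is an assumption on our part.

```rust
/// In-place fast Walsh-Hadamard Transform. Length must be a power of
/// two; applying the transform twice returns the input scaled by n,
/// so dividing by n after a second pass inverts it.
fn fwht(data: &mut [f32]) {
    assert!(data.len().is_power_of_two());
    let mut h = 1;
    while h < data.len() {
        for block in data.chunks_mut(h * 2) {
            for i in 0..h {
                let (a, b) = (block[i], block[i + h]);
                block[i] = a + b;
                block[i + h] = a - b;
            }
        }
        h *= 2;
    }
}

fn main() {
    let mut v = vec![1.0_f32, 0.0, 2.0, -1.0];
    fwht(&mut v);
    println!("transformed: {v:?}");
    fwht(&mut v); // second pass inverts, up to a factor of n
    let n = v.len() as f32;
    v.iter_mut().for_each(|x| *x /= n);
    println!("round trip:  {v:?}"); // back to the original input
}
```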