# TurboQuant
TurboQuant is a Rust library for research-grade vector quantization of LLM KV caches. It now includes three benchmark/evaluation paths:
- `synthetic`: model-shaped random vectors
- `trace`: exported per-head safetensors traces
- `real-model`: true end-to-end decoder inference on lightweight ONNX models with iterative past-key-value reuse
Current status as of 2026-03-25: alpha. Suitable for local research, benchmarking, and integration experiments. Not yet a production inference backend.
## Installation

```toml
[dependencies]
turboquant = "0.1.1"
```

Rust 1.87.0 or newer is required.
Feature flags:
- `default`: scalar CPU path plus runtime-dispatched SIMD
- `gpu`: experimental Burn/WGPU batch kernels
## Core APIs
| Type | Purpose | Notes |
|---|---|---|
| `TurboQuantMSE` | Reconstruction-oriented vector quantization | Unit-norm input contract |
| `TurboQuantProd` | Inner-product-oriented vector quantization | Requires `bit_width >= 2` |
| `BatchQuantizedMSE` / `BatchQuantizedProd` | Packed batch storage | Validate layout after deserialization |
| `QuantizedKVCache` / `MultiHeadKVCache` | Quantized KV cache helpers | Keys and values can be reconstructed |
| `KvTrace` | Trace loader for exported per-head workloads | Rejects invalid query positions |
| `RealModelRunner` | End-to-end ONNX decoder runner via `ort` / ONNX Runtime | CPU-oriented real-model path |
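To give a feel for what a reconstruction-oriented (MSE-style) quantizer does under the unit-norm contract, here is a hedged, self-contained sketch. It is not the crate's actual API; all names below are invented for illustration.

```rust
// Hypothetical sketch of reconstruction-oriented (MSE-style) scalar
// quantization under a unit-norm input contract. Not the crate's API.

/// Quantize to i8 codes with one symmetric per-vector scale.
fn quantize_i8(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    (v.iter().map(|x| (x / scale).round() as i8).collect(), scale)
}

/// Rebuild an approximate float vector from codes and scale.
fn reconstruct(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

/// Mean squared reconstruction error.
fn mse(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32
}

fn main() {
    // Normalize the input first, matching the unit-norm contract.
    let raw = [0.3f32, -0.5, 0.8, 0.1];
    let n = raw.iter().map(|x| x * x).sum::<f32>().sqrt();
    let v: Vec<f32> = raw.iter().map(|x| x / n).collect();

    let (codes, scale) = quantize_i8(&v);
    let r = reconstruct(&codes, scale);
    println!("reconstruction MSE: {:.2e}", mse(&v, &r));
}
```

Because the input is unit-norm, a single per-vector scale bounds the per-element error at `scale / 2`, which is what makes compact codes workable for reconstruction.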
## Real-Model Support
The repository now has a true decoder loop for lightweight open-source models:
- load a tokenizer and ONNX decoder bundle
- run prompt prefill
- run iterative decoding with explicit `past_key_values`
- compare exact cache reuse with quantized cache reuse in the actual decode loop
The real-model execution backend is ONNX Runtime on CPU via the Rust ort binding. Burn remains in the repository for optional WGPU batch quantization kernels, but it is not the primary path for full decoder inference.
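The decode loop above can be sketched with a toy scalar stand-in for the decoder. Everything here is invented for illustration (the real path drives an ONNX decoder through `ort`); the point is the control flow: prefill both caches, then iteratively feed each path its own previous output, with the quantized path only ever storing lossy cache entries.

```rust
// Toy stand-in for a decoder step: "attends" over a cache of scalars
// and returns a fake next "logit". Purely illustrative.
fn step(token: f32, cache: &mut Vec<f32>) -> f32 {
    cache.push(token);
    cache.iter().sum::<f32>() / cache.len() as f32
}

// Coarse lossy grid standing in for KV-cache quantization.
fn lossy(x: f32) -> f32 {
    (x * 16.0).round() / 16.0
}

fn main() {
    let prompt = [0.2f32, 0.7, -0.4];
    let (mut exact_cache, mut quant_cache) = (Vec::new(), Vec::new());
    let (mut e_out, mut q_out) = (0.0f32, 0.0f32);

    // Prefill: both paths consume the same prompt, but the quantized
    // path stores lossy cache entries.
    for &t in &prompt {
        e_out = step(t, &mut exact_cache);
        q_out = step(lossy(t), &mut quant_cache);
    }

    // Iterative decode: each path reuses its own cache and output,
    // so cache error can compound across steps.
    for _ in 0..4 {
        e_out = step(e_out, &mut exact_cache);
        q_out = step(lossy(q_out), &mut quant_cache);
    }
    println!("exact {:.4} vs quantized-cache {:.4}", e_out, q_out);
}
```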
### Supported Lightweight Models
Verified end-to-end on the Rust real-model path today:
- `distilgpt2`
- `HuggingFaceTB/SmolLM2-135M-Instruct`
The export helper also includes additional presets for experimentation, but only the verified models above should be treated as supported.
Other decoder-only models can work if their exported ONNX bundle exposes:
- `input_ids`
- optional `attention_mask`, `position_ids`, `cache_position`, and `use_cache_branch` inputs
- `past_key_values.<layer>.{key,value}` inputs with matching `present.<layer>.{key,value}` outputs
- `logits`
## Important Honesty Note
The quantized real-model path quantizes the cache in the real decode loop, then reconstructs float tensors before feeding them back into ONNX Runtime for the next step. That means:
- KV storage metrics reflect the quantized cache representation
- generation quality reflects quantized-cache reuse
- ONNX Runtime still performs standard float attention math internally
This is true end-to-end model execution with quantized cache feedback, but it is not a custom quantized attention kernel inside the ONNX runtime.
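To make the storage-metric point concrete, here is a hedged back-of-the-envelope accounting. The shapes, layer counts, and 8-bit-codes-plus-per-vector-scale layout are assumptions for illustration, not the crate's actual format.

```rust
// Hypothetical byte accounting: the cache stores compact codes plus a
// little metadata, but a full f32 tensor is rebuilt before every ONNX
// Runtime step. The layout here is made up.

/// Bytes for an exact f32 KV tensor with `elems` elements.
fn exact_bytes(elems: usize) -> usize {
    elems * std::mem::size_of::<f32>()
}

/// Bytes for a quantized tensor: 1 byte per element of codes plus one
/// f32 scale per vector of dimension `dim`.
fn quantized_bytes(elems: usize, dim: usize) -> usize {
    elems + (elems / dim) * std::mem::size_of::<f32>()
}

fn main() {
    // e.g. 12 layers x 12 heads x 128 cached positions, head_dim 64
    let dim = 64;
    let elems = 12 * 12 * 128 * dim;
    let exact = exact_bytes(elems);
    let quant = quantized_bytes(elems, dim);
    println!("exact:     {} bytes", exact);
    println!(
        "quantized: {} bytes ({:.2}x smaller)",
        quant,
        exact as f64 / quant as f64
    );
}
```

Note that this ratio describes storage only; per the honesty note, the attention math inside ONNX Runtime still runs in float.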
## ONNX Export Workflow
Pinned Python dependencies for the real-model scripts live in `scripts/requirements-real-model.txt`.
The Rust real-model path also pulls ONNX Runtime CPU binaries through the `ort` crate on first build.
Example setup:
Export a documented lightweight preset:
Or export the verified SmolLM2 preset:
Or export an explicit model id:
The export helper targets text-generation-with-past and defaults to fp32, which is the current verified dtype on the CPU ONNX Runtime path.
## Benchmark CLI
Synthetic quick run:
Trace run:
Real-model exact run:
Real-model quantized run:
Real-model side-by-side comparison:
The CLI reports source explicitly as synthetic, trace, or real-model to avoid confusing model-shaped workloads with true decoder runs.
## One-Command Real-Model Eval
For a fuller end-to-end workflow, use the orchestration helper:
What it does:
- exports or reuses a real ONNX decoder bundle
- builds the Rust benchmark example once
- runs an exact baseline
- runs multiple exact-vs-quantized compare benchmarks
- writes raw JSON plus a markdown summary report under `artifacts/real-model-evals/`
Useful options:
The default prompt suite lives at `scripts/prompts/real_model_eval_prompts.jsonl`.
## Real-Model Metrics

`real-model` mode can report:
- next-token logit RMSE
- top-k agreement
- token match rate and divergence rate
- reference-token cross-entropy / perplexity
- latency
- tokens/sec
- exact vs quantized KV memory usage
For compare mode, cross-entropy is computed against the exact run's generated token at each shared step. This is a distribution-drift metric, not a dataset perplexity claim.
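As a sketch of how a few of these metrics can be computed, consider the snippet below. The formulas are standard, but the function names are made up and the crate may compute them differently.

```rust
// Illustrative metric formulas only; names are invented and the crate
// may compute these differently.

/// Root-mean-square error between two logit vectors.
fn rmse(a: &[f32], b: &[f32]) -> f32 {
    (a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32).sqrt()
}

/// Argmax token id.
fn top1(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Cross-entropy of a logit vector against one reference token, e.g.
/// the token the exact run generated at this step.
fn xent_vs_token(logits: &[f32], token: usize) -> f32 {
    // Max-subtracted log-sum-exp for numerical stability.
    let m = logits.iter().cloned().fold(f32::MIN, f32::max);
    let log_z = logits.iter().map(|x| (x - m).exp()).sum::<f32>().ln();
    -(logits[token] - m - log_z)
}

fn main() {
    let exact = [2.0f32, 0.5, -1.0];
    let quant = [1.8f32, 0.7, -0.9];
    let reference = top1(&exact);
    println!("logit RMSE    = {:.4}", rmse(&exact, &quant));
    println!("token match   = {}", top1(&quant) == reference);
    println!("xent vs exact = {:.4}", xent_vs_token(&quant, reference));
}
```

Averaging the per-step cross-entropy over shared steps and exponentiating is one way to produce the drift-style perplexity described above.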
## Trace Workflow
The existing trace exporter is still available when you want per-head analysis rather than full-model decode:
Then run:
## Validation Commands
### CI-Safe vs Manual Tests
CI-safe:
- `cargo test --all-features`: includes a tiny local ONNX fixture that exercises the ONNX Runtime/tokenizer/KV-cache path without downloading external weights.
Manual heavier smoke test:
```sh
TURBOQUANT_REAL_MODEL_DIR=artifacts/real-model-bundles/distilgpt2 \
```
## Limitations
- The real-model backend is CPU-oriented and currently uses ONNX Runtime, not Burn.
- Quantized real-model evaluation reconstructs float past tensors before the next ONNX Runtime step.
- The WGPU/Burn path remains experimental and is still primarily a batch-kernel benchmark surface.
- The crate still assumes unit-norm vectors for the core quantizer APIs; the real-model path handles raw KV tensors by storing norms separately.
- The verified real-model surface is currently limited to `distilgpt2` and `HuggingFaceTB/SmolLM2-135M-Instruct`.
- This repository does not provide production serving, observability, or deployment tooling.
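The norm-separation point in the limitations can be sketched as follows. The helper names are hypothetical, not the crate's API; the idea is simply that a raw vector is split into a scalar norm plus a unit direction, the direction goes through the unit-norm quantizers, and the norm rides along as side metadata.

```rust
// Hypothetical sketch of storing norms separately so raw (non-unit)
// KV vectors can pass through unit-norm quantizers. Not the crate's API.

/// Split a raw vector into (norm, unit direction).
fn split_norm(v: &[f32]) -> (f32, Vec<f32>) {
    let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if n == 0.0 {
        return (0.0, v.to_vec());
    }
    (n, v.iter().map(|x| x / n).collect())
}

/// Rescale a (possibly quantized-and-reconstructed) unit direction.
fn rejoin(norm: f32, unit: &[f32]) -> Vec<f32> {
    unit.iter().map(|x| x * norm).collect()
}

fn main() {
    let raw = [3.0f32, 4.0];
    let (norm, unit) = split_norm(&raw);
    // The unit direction is what a unit-norm quantizer would see.
    println!("norm = {}", norm);
    println!("unit = {:?}", unit);
    println!("reconstructed = {:?}", rejoin(norm, &unit));
}
```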
## Contributing
See CONTRIBUTING.md, ARCHITECTURE.md, and AGENTS.md.