# TurboQuant
TurboQuant is a Rust library for research-grade vector quantization of LLM KV caches. It now includes three benchmark/evaluation paths:
- `synthetic`: model-shaped random vectors
- `trace`: exported per-head safetensors traces
- `real-model`: true end-to-end decoder inference on lightweight ONNX models with iterative past-key-value reuse
Current status as of 2026-03-25: alpha. Suitable for local research, benchmarking, and integration experiments. Not yet a production inference backend.
## Installation
```toml
[dependencies]
turboquant = "0.1.1"
```
Rust `1.87.0+` is required.
Feature flags:
- `default`: scalar CPU path plus runtime-dispatched SIMD
- `gpu`: experimental Burn/WGPU batch kernels
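Runtime-dispatched SIMD typically follows the standard Rust pattern: detect CPU features once, then select a kernel function pointer. The sketch below is a minimal, self-contained illustration of that pattern; the scalar kernel stands in for both branches, and none of the names are TurboQuant's actual internals:

```rust
// Sketch of runtime SIMD dispatch: detect CPU features once and select a
// kernel. Both branches return the scalar kernel here; a real build would
// return a vectorized (e.g. AVX2) implementation instead.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn select_dot() -> fn(&[f32], &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return dot_scalar; // stand-in for an AVX2 kernel
    }
    dot_scalar // scalar fallback on all other targets
}

fn main() {
    let dot = select_dot();
    println!("{}", dot(&[1.0, 2.0], &[3.0, 4.0])); // prints 11
}
```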
## Core APIs
| Type | Purpose | Notes |
| --- | --- | --- |
| `TurboQuantMSE` | Reconstruction-oriented vector quantization | Unit-norm input contract |
| `TurboQuantProd` | Inner-product-oriented vector quantization | Requires `bit_width >= 2` |
| `BatchQuantizedMSE` / `BatchQuantizedProd` | Packed batch storage | Validate layout after deserialization |
| `QuantizedKVCache` / `MultiHeadKVCache` | Quantized KV cache helpers | Keys and values can be reconstructed |
| `KvTrace` | Trace loader for exported per-head workloads | Rejects invalid query positions |
| `RealModelRunner` | End-to-end ONNX decoder runner via `ort` / ONNX Runtime | CPU-oriented real-model path |
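The unit-norm input contract works the same way the real-model path handles raw KV tensors: split each vector into a direction and a separately stored norm, quantize the direction, and rescale after reconstruction. A self-contained sketch, with a uniform 4-bit scalar quantizer standing in for the crate's actual codebooks:

```rust
// Sketch of the unit-norm contract: normalize first, keep the norm on the
// side, and rescale after reconstruction. The uniform 4-bit quantizer is
// a stand-in, not TurboQuant's actual scheme.
fn main() {
    let v: Vec<f32> = vec![0.8, -1.2, 0.3, 2.1];

    // Split into a unit-norm direction plus a stored norm.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();

    // Uniform 4-bit quantization of each component in [-1, 1]: 16 levels.
    let levels = 15.0_f32;
    let codes: Vec<u8> = unit
        .iter()
        .map(|x| (((x + 1.0) / 2.0) * levels).round() as u8)
        .collect();

    // Decode each code back to [-1, 1], then rescale by the stored norm.
    let recon: Vec<f32> = codes
        .iter()
        .map(|&c| ((c as f32 / levels) * 2.0 - 1.0) * norm)
        .collect();

    let rmse = (v.iter()
        .zip(&recon)
        .map(|(a, b)| (a - b) * (a - b))
        .sum::<f32>()
        / v.len() as f32)
        .sqrt();
    println!("codes = {codes:?}, rmse = {rmse:.4}");
    assert!(rmse < 0.2);
}
```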
## Real-Model Support
The repository now implements a true decoder loop for lightweight open-source models:
- load a tokenizer and ONNX decoder bundle
- run prompt prefill
- run iterative decoding with explicit `past_key_values`
- compare exact cache reuse vs quantized cache reuse in the actual decode loop
The real-model execution backend is ONNX Runtime on CPU via the Rust `ort` binding. Burn remains in the repository for optional WGPU batch quantization kernels, but it is not the primary path for full decoder inference.
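The prefill-then-decode structure above can be sketched without any model at all. Here `fake_step` is a made-up stand-in for one ONNX Runtime forward pass and the rounding quantizer stands in for the crate's quantizers; only the cache-feedback shape mirrors the real path:

```rust
// Toy decode loop: prefill once, then feed each generated token back in
// with a growing KV cache. The quantized path requantizes (and thereby
// reconstructs) the cache between steps, as the real path does.
fn fake_step(token: u32, cache: &mut Vec<f32>) -> u32 {
    cache.push(token as f32 * 0.1); // append this step's "key/value" row
    let s: f32 = cache.iter().sum();
    (s * 10.0) as u32 % 50 // toy "logits" reduced to an argmax token
}

fn quantize_cache(cache: &mut Vec<f32>) {
    // Coarse rounding stand-in for 4-bit quantize + float reconstruction.
    for x in cache.iter_mut() {
        *x = (*x * 8.0).round() / 8.0;
    }
}

fn main() {
    let prompt = [3u32, 7, 1];
    let (mut exact_cache, mut quant_cache) = (Vec::new(), Vec::new());

    // Prefill both caches with the prompt tokens.
    let (mut tok_exact, mut tok_quant) = (0, 0);
    for &t in &prompt {
        tok_exact = fake_step(t, &mut exact_cache);
        tok_quant = fake_step(t, &mut quant_cache);
    }

    // Iterative decode: exact path reuses raw floats; quantized path
    // requantizes its cache before every step.
    let mut matches = 0;
    for _ in 0..8 {
        tok_exact = fake_step(tok_exact, &mut exact_cache);
        quantize_cache(&mut quant_cache);
        tok_quant = fake_step(tok_quant, &mut quant_cache);
        if tok_exact == tok_quant {
            matches += 1;
        }
    }
    println!("token match rate: {matches}/8");
}
```

The real runner's `--real-eval-mode compare` flag reports exactly this kind of per-step token agreement, alongside logit-level metrics.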
### Supported Lightweight Models
Verified end-to-end on the Rust real-model path today:
- `distilgpt2`
- `HuggingFaceTB/SmolLM2-135M-Instruct`
The export helper also includes additional presets for experimentation, but only the verified models above should be treated as supported.
Other decoder-only models can work if their exported ONNX bundle exposes:
- `input_ids`
- optional `attention_mask`, `position_ids`, `cache_position`, `use_cache_branch`
- `past_key_values.<layer>.{key,value}`
- `present.<layer>.{key,value}`
- `logits`
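That IO contract can be checked mechanically against an exported bundle's input/output names. A small sketch that generates the expected per-layer tensor names for a hypothetical n-layer bundle (`expected_io_names` is illustrative, not part of the crate):

```rust
// Generate the tensor names the IO contract above requires for an
// n-layer decoder-only ONNX bundle.
fn expected_io_names(num_layers: usize) -> Vec<String> {
    let mut names = vec!["input_ids".to_string(), "logits".to_string()];
    for layer in 0..num_layers {
        for kv in ["key", "value"] {
            names.push(format!("past_key_values.{layer}.{kv}"));
            names.push(format!("present.{layer}.{kv}"));
        }
    }
    names
}

fn main() {
    let names = expected_io_names(2);
    // A 2-layer bundle needs 2 fixed names plus 8 past/present tensors.
    assert_eq!(names.len(), 10);
    println!("{names:?}");
}
```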
### Important Honesty Note
The quantized real-model path quantizes the cache in the real decode loop, then reconstructs float tensors before feeding them back into ONNX Runtime for the next step. That means:
- KV storage metrics reflect the quantized cache representation
- generation quality reflects quantized-cache reuse
- ONNX Runtime still performs standard float attention math internally
This is true end-to-end model execution with quantized cache feedback, but it is not a custom quantized attention kernel inside the ONNX runtime.
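As a back-of-envelope illustration of what the storage metric captures, compare fp32 cache bytes against a 4-bit packed layout with one f32 scale per cached vector. The dimensions below (one head, sequence length 1024, head dimension 64) and the packing scheme are assumptions for illustration, not measurements from the crate:

```rust
// Back-of-envelope KV storage for one head: exact fp32 cache vs a packed
// 4-bit representation with a per-vector f32 scale.
fn main() {
    let seq_len = 1024usize;
    let head_dim = 64usize;

    let exact_bytes = seq_len * head_dim * 4; // 4 bytes per f32 component
    // 4 bits per component, two components per byte, plus one f32 scale
    // per cached vector.
    let quant_bytes = seq_len * (head_dim / 2) + seq_len * 4;

    println!(
        "exact = {} KiB, 4-bit = {} KiB, ratio = {:.2}x",
        exact_bytes / 1024,
        quant_bytes / 1024,
        exact_bytes as f64 / quant_bytes as f64
    );
}
```

With these assumed dimensions the per-vector scale overhead keeps the saving just above 7x rather than the naive 8x.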
## ONNX Export Workflow
Pinned Python dependencies for the real-model scripts live in [`scripts/requirements-real-model.txt`](scripts/requirements-real-model.txt).
The Rust real-model path also pulls ONNX Runtime CPU binaries through the `ort` crate on first build.
Example setup:
```bash
python3 -m venv .venv
. .venv/bin/activate
pip install -r scripts/requirements-real-model.txt
```
Export a documented lightweight preset:
```bash
python3 scripts/export_hf_decoder_onnx.py \
--preset distilgpt2 \
--output-dir artifacts/distilgpt2-onnx
```
Or export the verified SmolLM2 preset:
```bash
python3 scripts/export_hf_decoder_onnx.py \
--preset smollm2-135m-instruct \
--output-dir artifacts/smollm2-135m-instruct-onnx
```
Or export an explicit model id:
```bash
python3 scripts/export_hf_decoder_onnx.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output-dir artifacts/tinyllama-onnx
```
The export helper targets `text-generation-with-past` and defaults to `fp32`, which is the current verified dtype on the CPU ONNX Runtime path.
## Benchmark CLI
Synthetic quick run:
```bash
cargo run --release --example benchmark -- --workload synthetic --quick
```
Trace run:
```bash
cargo run --release --example benchmark -- \
--workload trace \
--trace traces/example.safetensors
```
Real-model exact run:
```bash
cargo run --release --example benchmark -- \
--workload real-model \
--real-model-dir artifacts/distilgpt2-onnx \
--prompt "Summarize the role of a KV cache in one sentence." \
--real-eval-mode exact \
--max-new-tokens 16
```
Real-model quantized run:
```bash
cargo run --release --example benchmark -- \
--workload real-model \
--real-model-dir artifacts/distilgpt2-onnx \
--prompt "Summarize the role of a KV cache in one sentence." \
--real-eval-mode quantized \
--bits 4 \
--real-key-strategy prod \
--max-new-tokens 16
```
Real-model side-by-side comparison:
```bash
cargo run --release --example benchmark -- \
--workload real-model \
--real-model-dir artifacts/distilgpt2-onnx \
--prompt "Summarize the role of a KV cache in one sentence." \
--real-eval-mode compare \
--bits 4 \
--value-bits 4 \
--real-key-strategy prod \
--top-k 5 \
--max-new-tokens 16
```
The CLI reports `source` explicitly as `synthetic`, `trace`, or `real-model` to avoid confusing model-shaped workloads with true decoder runs.
## One-Command Real-Model Eval
For a fuller end-to-end workflow, use the orchestration helper:
```bash
python3 scripts/run_real_model_eval.py \
--preset distilgpt2 \
--bits 2 4 8 \
--strategies prod mse
```
What it does:
- exports or reuses a real ONNX decoder bundle
- builds the Rust benchmark example once
- runs an exact baseline
- runs multiple exact-vs-quantized compare benchmarks
- writes raw JSON plus a markdown summary report under `artifacts/real-model-evals/`
Useful options:
```bash
python3 scripts/run_real_model_eval.py \
--model-dir artifacts/distilgpt2-onnx \
--prompts scripts/prompts/real_model_eval_prompts.jsonl \
--max-prompts 6 \
--max-new-tokens 24 \
--top-k 5 \
--bits 4 8 \
--strategies prod
```
The default prompt suite lives at [`scripts/prompts/real_model_eval_prompts.jsonl`](scripts/prompts/real_model_eval_prompts.jsonl).
## Real-Model Metrics
`real-model` mode can report:
- next-token logit RMSE
- top-k agreement
- token match rate and divergence rate
- reference-token cross-entropy / perplexity
- latency
- tokens/sec
- exact vs quantized KV memory usage
For `compare` mode, cross-entropy is computed against the exact run's generated token at each shared step. This is a distribution-drift metric, not a dataset perplexity claim.
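The per-step score is the standard log-softmax cross-entropy of the quantized run's logits at the exact run's chosen token. A minimal sketch with made-up logit values:

```rust
// Cross-entropy of a logit vector against a reference token index, using
// a numerically stable log-sum-exp. Logit values are illustrative.
fn cross_entropy(logits: &[f32], reference_token: usize) -> f32 {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_z = logits.iter().map(|x| (x - max).exp()).sum::<f32>().ln() + max;
    log_z - logits[reference_token]
}

fn main() {
    let quantized_logits = [2.0f32, 0.5, -1.0, 0.1];
    let exact_token = 0; // token the exact run generated at this step
    let ce = cross_entropy(&quantized_logits, exact_token);
    println!("cross-entropy = {ce:.4}, perplexity = {:.4}", ce.exp());
    assert!(ce > 0.0);
}
```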
## Trace Workflow
The existing trace exporter is still available when you want per-head analysis rather than full-model decode:
```bash
python3 scripts/export_hf_kv.py \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--input prompts.txt \
--output traces/mistral_layer0_head0.safetensors \
--layer 0 \
--head 0
```
Then run:
```bash
cargo run --release --example benchmark -- \
--workload trace \
--trace traces/mistral_layer0_head0.safetensors
```
## Validation Commands
```bash
cargo fmt -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo check --examples --all-features
cargo llvm-cov --workspace --all-features --summary-only
cargo audit
```
## CI-Safe vs Manual Tests
CI-safe:
- `cargo test --all-features`, which includes a tiny local ONNX fixture that exercises the ONNX Runtime/tokenizer/KV-cache path without downloading external weights.
Manual heavier smoke test:
```bash
TURBOQUANT_REAL_MODEL_DIR=artifacts/real-model-bundles/distilgpt2 \
cargo test --all-features manual_exported_real_model_smoke_test -- --ignored --nocapture
```
## Limitations
- The real-model backend is CPU-oriented and currently uses ONNX Runtime, not Burn.
- Quantized real-model evaluation reconstructs float past tensors before the next ONNX Runtime step.
- The WGPU/Burn path remains experimental and is still primarily a batch-kernel benchmark surface.
- The crate still assumes unit-norm vectors for the core quantizer APIs; the real-model path handles raw KV tensors by storing norms separately.
- The verified real-model surface is currently limited to `distilgpt2` and `HuggingFaceTB/SmolLM2-135M-Instruct`.
- This repository does not provide production serving, observability, or deployment tooling.
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md), [ARCHITECTURE.md](ARCHITECTURE.md), and [AGENTS.md](AGENTS.md).