# TurboQuant
TurboQuant is a Rust library for research-grade vector quantization of LLM KV caches. It now includes three benchmark/evaluation paths:
- `synthetic`: model-shaped random vectors
- `trace`: exported per-head safetensors traces
- `real-model`: true end-to-end decoder inference on lightweight ONNX models with iterative past-key-value reuse
Current status as of 2026-03-25: alpha. Suitable for local research, benchmarking, and integration experiments. Not yet a production inference backend.
## Installation

```toml
[dependencies]
turboquant = "0.1.1"
```

Rust 1.87.0 or newer is required.
Feature flags:
- `default`: scalar CPU path plus runtime-dispatched SIMD
- `gpu`: experimental Burn/WGPU batch kernels
## Core APIs
| Type | Purpose | Notes |
|---|---|---|
| `TurboQuantMSE` | Reconstruction-oriented vector quantization | Unit-norm input contract |
| `TurboQuantProd` | Inner-product-oriented vector quantization | Requires `bit_width >= 2` |
| `BatchQuantizedMSE` / `BatchQuantizedProd` | Packed batch storage | Validate layout after deserialization |
| `QuantizedKVCache` / `MultiHeadKVCache` | Quantized KV cache helpers | Keys and values can be reconstructed |
| `KvTrace` | Trace loader for exported per-head workloads | Rejects invalid query positions |
| `RealModelRunner` | End-to-end ONNX decoder runner via `ort` / ONNX Runtime | CPU-oriented real-model path |
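To give a feel for what a reconstruction-oriented (MSE-style) quantizer does under the unit-norm contract, here is a hedged, self-contained sketch. It is not the crate's actual API; all names below are invented for illustration.

```rust
// Hypothetical sketch of reconstruction-oriented (MSE-style) scalar
// quantization under a unit-norm input contract. Not the crate's API.

/// Quantize to i8 codes with one symmetric per-vector scale.
fn quantize_i8(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    (v.iter().map(|x| (x / scale).round() as i8).collect(), scale)
}

/// Rebuild an approximate float vector from codes and scale.
fn reconstruct(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

/// Mean squared reconstruction error.
fn mse(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32
}

fn main() {
    // Normalize the input first, matching the unit-norm contract.
    let raw = [0.3f32, -0.5, 0.8, 0.1];
    let n = raw.iter().map(|x| x * x).sum::<f32>().sqrt();
    let v: Vec<f32> = raw.iter().map(|x| x / n).collect();

    let (codes, scale) = quantize_i8(&v);
    let r = reconstruct(&codes, scale);
    println!("reconstruction MSE: {:.2e}", mse(&v, &r));
}
```

Because the input is unit-norm, a single per-vector scale bounds the per-element error at `scale / 2`, which is what makes compact codes workable for reconstruction.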
## Real-Model Support
The repository now has a true decoder loop for lightweight open-source models:
- load a tokenizer and ONNX decoder bundle
- run prompt prefill
- run iterative decoding with explicit `past_key_values`
- compare exact cache reuse with quantized cache reuse in the actual decode loop
The real-model execution backend is ONNX Runtime on CPU via the Rust ort binding. Burn remains in the repository for optional WGPU batch quantization kernels, but it is not the primary path for full decoder inference.
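The decode loop above can be sketched with a toy scalar stand-in for the decoder. Everything here is invented for illustration (the real path drives an ONNX decoder through `ort`); the point is the control flow: prefill both caches, then iteratively feed each path its own previous output, with the quantized path only ever storing lossy cache entries.

```rust
// Toy stand-in for a decoder step: "attends" over a cache of scalars
// and returns a fake next "logit". Purely illustrative.
fn step(token: f32, cache: &mut Vec<f32>) -> f32 {
    cache.push(token);
    cache.iter().sum::<f32>() / cache.len() as f32
}

// Coarse lossy grid standing in for KV-cache quantization.
fn lossy(x: f32) -> f32 {
    (x * 16.0).round() / 16.0
}

fn main() {
    let prompt = [0.2f32, 0.7, -0.4];
    let (mut exact_cache, mut quant_cache) = (Vec::new(), Vec::new());
    let (mut e_out, mut q_out) = (0.0f32, 0.0f32);

    // Prefill: both paths consume the same prompt, but the quantized
    // path stores lossy cache entries.
    for &t in &prompt {
        e_out = step(t, &mut exact_cache);
        q_out = step(lossy(t), &mut quant_cache);
    }

    // Iterative decode: each path reuses its own cache and output,
    // so cache error can compound across steps.
    for _ in 0..4 {
        e_out = step(e_out, &mut exact_cache);
        q_out = step(lossy(q_out), &mut quant_cache);
    }
    println!("exact {:.4} vs quantized-cache {:.4}", e_out, q_out);
}
```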
### Supported Lightweight Models
Verified end-to-end on the Rust real-model path today:
- `distilgpt2`
- `HuggingFaceTB/SmolLM2-135M-Instruct`
The export helper also includes additional presets for experimentation, but only the verified models above should be treated as supported.
Other decoder-only models can work if their exported ONNX bundle exposes:
- `input_ids`
- optional `attention_mask`, `position_ids`, `cache_position`, and `use_cache_branch` inputs
- `past_key_values.<layer>.{key,value}` inputs with matching `present.<layer>.{key,value}` outputs
- `logits`
## Important Honesty Note
The quantized real-model path quantizes the cache in the real decode loop, then reconstructs float tensors before feeding them back into ONNX Runtime for the next step. That means:
- KV storage metrics reflect the quantized cache representation
- generation quality reflects quantized-cache reuse
- ONNX Runtime still performs standard float attention math internally
This is true end-to-end model execution with quantized cache feedback, but it is not a custom quantized attention kernel inside the ONNX runtime.
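To make the storage-metric point concrete, here is a hedged back-of-the-envelope accounting. The shapes, layer counts, and 8-bit-codes-plus-per-vector-scale layout are assumptions for illustration, not the crate's actual format.

```rust
// Hypothetical byte accounting: the cache stores compact codes plus a
// little metadata, but a full f32 tensor is rebuilt before every ONNX
// Runtime step. The layout here is made up.

/// Bytes for an exact f32 KV tensor with `elems` elements.
fn exact_bytes(elems: usize) -> usize {
    elems * std::mem::size_of::<f32>()
}

/// Bytes for a quantized tensor: 1 byte per element of codes plus one
/// f32 scale per vector of dimension `dim`.
fn quantized_bytes(elems: usize, dim: usize) -> usize {
    elems + (elems / dim) * std::mem::size_of::<f32>()
}

fn main() {
    // e.g. 12 layers x 12 heads x 128 cached positions, head_dim 64
    let dim = 64;
    let elems = 12 * 12 * 128 * dim;
    let exact = exact_bytes(elems);
    let quant = quantized_bytes(elems, dim);
    println!("exact:     {} bytes", exact);
    println!(
        "quantized: {} bytes ({:.2}x smaller)",
        quant,
        exact as f64 / quant as f64
    );
}
```

Note that this ratio describes storage only; per the honesty note, the attention math inside ONNX Runtime still runs in float.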
## ONNX Export Workflow
Pinned Python dependencies for the real-model scripts live in `scripts/requirements-real-model.txt`.
The Rust real-model path also pulls ONNX Runtime CPU binaries through the `ort` crate on first build.
Example setup:
Export a documented lightweight preset:
Or export the verified SmolLM2 preset:
Or export an explicit model id:
The export helper targets text-generation-with-past and defaults to fp32, which is the current verified dtype on the CPU ONNX Runtime path.
## Benchmark CLI
Synthetic quick run:
Trace run:
Real-model exact run:
Real-model quantized run:
Real-model side-by-side comparison:
The CLI reports source explicitly as synthetic, trace, or real-model to avoid confusing model-shaped workloads with true decoder runs.
## One-Command Real-Model Eval
For a fuller end-to-end workflow, use the orchestration helper:
What it does:
- exports or reuses a real ONNX decoder bundle
- builds the Rust benchmark example once
- runs an exact baseline
- runs multiple exact-vs-quantized compare benchmarks
- writes raw JSON plus a markdown summary report under `artifacts/real-model-evals/`
Useful options:
The default prompt suite lives at `scripts/prompts/real_model_eval_prompts.jsonl`.
## Real-Model Metrics

`real-model` mode can report:
- next-token logit RMSE
- top-k agreement
- token match rate and divergence rate
- reference-token cross-entropy / perplexity
- latency
- tokens/sec
- exact vs quantized KV memory usage
For compare mode, cross-entropy is computed against the exact run's generated token at each shared step. This is a distribution-drift metric, not a dataset perplexity claim.
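As a sketch of how a few of these metrics can be computed, consider the snippet below. The formulas are standard, but the function names are made up and the crate may compute them differently.

```rust
// Illustrative metric formulas only; names are invented and the crate
// may compute these differently.

/// Root-mean-square error between two logit vectors.
fn rmse(a: &[f32], b: &[f32]) -> f32 {
    (a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32).sqrt()
}

/// Argmax token id.
fn top1(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Cross-entropy of a logit vector against one reference token, e.g.
/// the token the exact run generated at this step.
fn xent_vs_token(logits: &[f32], token: usize) -> f32 {
    // Max-subtracted log-sum-exp for numerical stability.
    let m = logits.iter().cloned().fold(f32::MIN, f32::max);
    let log_z = logits.iter().map(|x| (x - m).exp()).sum::<f32>().ln();
    -(logits[token] - m - log_z)
}

fn main() {
    let exact = [2.0f32, 0.5, -1.0];
    let quant = [1.8f32, 0.7, -0.9];
    let reference = top1(&exact);
    println!("logit RMSE    = {:.4}", rmse(&exact, &quant));
    println!("token match   = {}", top1(&quant) == reference);
    println!("xent vs exact = {:.4}", xent_vs_token(&quant, reference));
}
```

Averaging the per-step cross-entropy over shared steps and exponentiating is one way to produce the drift-style perplexity described above.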
## Trace Workflow
The existing trace exporter is still available when you want per-head analysis rather than full-model decode:
Then run:
## Validation Commands
### CI-Safe vs Manual Tests
CI-safe:
- `cargo test --all-features`: includes a tiny local ONNX fixture that exercises the ONNX Runtime/tokenizer/KV-cache path without downloading external weights.
Manual heavier smoke test:
```sh
TURBOQUANT_REAL_MODEL_DIR=artifacts/real-model-bundles/distilgpt2 \
```
## Limitations
- The real-model backend is CPU-oriented and currently uses ONNX Runtime, not Burn.
- Quantized real-model evaluation reconstructs float past tensors before the next ONNX Runtime step.
- The WGPU/Burn path remains experimental and is still primarily a batch-kernel benchmark surface.
- The crate still assumes unit-norm vectors for the core quantizer APIs; the real-model path handles raw KV tensors by storing norms separately.
- The verified real-model surface is currently limited to `distilgpt2` and `HuggingFaceTB/SmolLM2-135M-Instruct`.
- This repository does not provide production serving, observability, or deployment tooling.
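The norm-separation point in the limitations can be sketched as follows. The helper names are hypothetical, not the crate's API; the idea is simply that a raw vector is split into a scalar norm plus a unit direction, the direction goes through the unit-norm quantizers, and the norm rides along as side metadata.

```rust
// Hypothetical sketch of storing norms separately so raw (non-unit)
// KV vectors can pass through unit-norm quantizers. Not the crate's API.

/// Split a raw vector into (norm, unit direction).
fn split_norm(v: &[f32]) -> (f32, Vec<f32>) {
    let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if n == 0.0 {
        return (0.0, v.to_vec());
    }
    (n, v.iter().map(|x| x / n).collect())
}

/// Rescale a (possibly quantized-and-reconstructed) unit direction.
fn rejoin(norm: f32, unit: &[f32]) -> Vec<f32> {
    unit.iter().map(|x| x * norm).collect()
}

fn main() {
    let raw = [3.0f32, 4.0];
    let (norm, unit) = split_norm(&raw);
    // The unit direction is what a unit-norm quantizer would see.
    println!("norm = {}", norm);
    println!("unit = {:?}", unit);
    println!("reconstructed = {:?}", rejoin(norm, &unit));
}
```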
## Contributing
See CONTRIBUTING.md, ARCHITECTURE.md, and AGENTS.md.