# TurboQuant
TurboQuant is a Rust library for research-grade vector quantization of LLM KV caches. It now includes three benchmark/evaluation paths:
- `synthetic`: model-shaped random vectors
- `trace`: exported per-head safetensors traces
- `real-model`: true end-to-end decoder inference on lightweight ONNX models with iterative past-key-value reuse
Current status as of 2026-03-25: alpha. Suitable for local research, benchmarking, and integration experiments. Not yet a production inference backend.
## Installation
```toml
[dependencies]
turboquant = "0.1.1"
```
Rust `1.87.0+` is required.
Feature flags:
- `default`: scalar CPU path plus runtime-dispatched SIMD
- `gpu`: experimental Burn/WGPU batch kernels
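Runtime-dispatched SIMD typically follows the standard Rust pattern: detect CPU features once, then select a kernel function pointer. The sketch below is a minimal, self-contained illustration of that pattern; the scalar kernel stands in for both branches, and none of the names are TurboQuant's actual internals:

```rust
// Sketch of runtime SIMD dispatch: detect CPU features once and select a
// kernel. Both branches return the scalar kernel here; a real build would
// return a vectorized (e.g. AVX2) implementation instead.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn select_dot() -> fn(&[f32], &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return dot_scalar; // stand-in for an AVX2 kernel
    }
    dot_scalar // scalar fallback on all other targets
}

fn main() {
    let dot = select_dot();
    println!("{}", dot(&[1.0, 2.0], &[3.0, 4.0])); // prints 11
}
```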
## Core APIs
| Type | Purpose | Notes |
| --- | --- | --- |
| `TurboQuantMSE` | Reconstruction-oriented vector quantization | Unit-norm input contract |
| `TurboQuantProd` | Inner-product-oriented vector quantization | Requires `bit_width >= 2` |
| `BatchQuantizedMSE` / `BatchQuantizedProd` | Packed batch storage | Validate layout after deserialization |
| `QuantizedKVCache` / `MultiHeadKVCache` | Quantized KV cache helpers | Keys and values can be reconstructed |
| `KvTrace` | Trace loader for exported per-head workloads | Rejects invalid query positions |
| `RealModelRunner` | End-to-end ONNX decoder runner via `ort` / ONNX Runtime | CPU-oriented real-model path |
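The unit-norm input contract works the same way the real-model path handles raw KV tensors: split each vector into a direction and a separately stored norm, quantize the direction, and rescale after reconstruction. A self-contained sketch, with a uniform 4-bit scalar quantizer standing in for the crate's actual codebooks:

```rust
// Sketch of the unit-norm contract: normalize first, keep the norm on the
// side, and rescale after reconstruction. The uniform 4-bit quantizer is
// a stand-in, not TurboQuant's actual scheme.
fn main() {
    let v: Vec<f32> = vec![0.8, -1.2, 0.3, 2.1];

    // Split into a unit-norm direction plus a stored norm.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();

    // Uniform 4-bit quantization of each component in [-1, 1]: 16 levels.
    let levels = 15.0_f32;
    let codes: Vec<u8> = unit
        .iter()
        .map(|x| (((x + 1.0) / 2.0) * levels).round() as u8)
        .collect();

    // Decode each code back to [-1, 1], then rescale by the stored norm.
    let recon: Vec<f32> = codes
        .iter()
        .map(|&c| ((c as f32 / levels) * 2.0 - 1.0) * norm)
        .collect();

    let rmse = (v.iter()
        .zip(&recon)
        .map(|(a, b)| (a - b) * (a - b))
        .sum::<f32>()
        / v.len() as f32)
        .sqrt();
    println!("codes = {codes:?}, rmse = {rmse:.4}");
    assert!(rmse < 0.2);
}
```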
## Real-Model Support
The repository now implements a true decoder loop for lightweight open-source models:
- load a tokenizer and ONNX decoder bundle
- run prompt prefill
- run iterative decoding with explicit `past_key_values`
- compare exact cache reuse vs quantized cache reuse in the actual decode loop
The real-model execution backend is ONNX Runtime on CPU via the Rust `ort` binding. Burn remains in the repository for optional WGPU batch quantization kernels, but it is not the primary path for full decoder inference.
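The prefill-then-decode structure above can be sketched without any model at all. Here `fake_step` is a made-up stand-in for one ONNX Runtime forward pass and the rounding quantizer stands in for the crate's quantizers; only the cache-feedback shape mirrors the real path:

```rust
// Toy decode loop: prefill once, then feed each generated token back in
// with a growing KV cache. The quantized path requantizes (and thereby
// reconstructs) the cache between steps, as the real path does.
fn fake_step(token: u32, cache: &mut Vec<f32>) -> u32 {
    cache.push(token as f32 * 0.1); // append this step's "key/value" row
    let s: f32 = cache.iter().sum();
    (s * 10.0) as u32 % 50 // toy "logits" reduced to an argmax token
}

fn quantize_cache(cache: &mut Vec<f32>) {
    // Coarse rounding stand-in for 4-bit quantize + float reconstruction.
    for x in cache.iter_mut() {
        *x = (*x * 8.0).round() / 8.0;
    }
}

fn main() {
    let prompt = [3u32, 7, 1];
    let (mut exact_cache, mut quant_cache) = (Vec::new(), Vec::new());

    // Prefill both caches with the prompt tokens.
    let (mut tok_exact, mut tok_quant) = (0, 0);
    for &t in &prompt {
        tok_exact = fake_step(t, &mut exact_cache);
        tok_quant = fake_step(t, &mut quant_cache);
    }

    // Iterative decode: exact path reuses raw floats; quantized path
    // requantizes its cache before every step.
    let mut matches = 0;
    for _ in 0..8 {
        tok_exact = fake_step(tok_exact, &mut exact_cache);
        quantize_cache(&mut quant_cache);
        tok_quant = fake_step(tok_quant, &mut quant_cache);
        if tok_exact == tok_quant {
            matches += 1;
        }
    }
    println!("token match rate: {matches}/8");
}
```

The real runner's `--real-eval-mode compare` flag reports exactly this kind of per-step token agreement, alongside logit-level metrics.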
### Supported Lightweight Models
Verified end-to-end on the Rust real-model path today:
- `distilgpt2`
- `HuggingFaceTB/SmolLM2-135M-Instruct`
The export helper also includes additional presets for experimentation, but only the verified models above should be treated as supported.
Other decoder-only models can work if their exported ONNX bundle exposes:
- `input_ids`
- optional `attention_mask`, `position_ids`, `cache_position`, `use_cache_branch`
- `past_key_values.<layer>.{key,value}`
- `present.<layer>.{key,value}`
- `logits`
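That IO contract can be checked mechanically against an exported bundle's input/output names. A small sketch that generates the expected per-layer tensor names for a hypothetical n-layer bundle (`expected_io_names` is illustrative, not part of the crate):

```rust
// Generate the tensor names the IO contract above requires for an
// n-layer decoder-only ONNX bundle.
fn expected_io_names(num_layers: usize) -> Vec<String> {
    let mut names = vec!["input_ids".to_string(), "logits".to_string()];
    for layer in 0..num_layers {
        for kv in ["key", "value"] {
            names.push(format!("past_key_values.{layer}.{kv}"));
            names.push(format!("present.{layer}.{kv}"));
        }
    }
    names
}

fn main() {
    let names = expected_io_names(2);
    // A 2-layer bundle needs 2 fixed names plus 8 past/present tensors.
    assert_eq!(names.len(), 10);
    println!("{names:?}");
}
```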
### Important Honesty Note
The quantized real-model path quantizes the cache in the real decode loop, then reconstructs float tensors before feeding them back into ONNX Runtime for the next step. That means:
- KV storage metrics reflect the quantized cache representation
- generation quality reflects quantized-cache reuse
- ONNX Runtime still performs standard float attention math internally
This is true end-to-end model execution with quantized cache feedback, but it is not a custom quantized attention kernel inside the ONNX runtime.
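As a back-of-envelope illustration of what the storage metric captures, compare fp32 cache bytes against a 4-bit packed layout with one f32 scale per cached vector. The dimensions below (one head, sequence length 1024, head dimension 64) and the packing scheme are assumptions for illustration, not measurements from the crate:

```rust
// Back-of-envelope KV storage for one head: exact fp32 cache vs a packed
// 4-bit representation with a per-vector f32 scale.
fn main() {
    let seq_len = 1024usize;
    let head_dim = 64usize;

    let exact_bytes = seq_len * head_dim * 4; // 4 bytes per f32 component
    // 4 bits per component, two components per byte, plus one f32 scale
    // per cached vector.
    let quant_bytes = seq_len * (head_dim / 2) + seq_len * 4;

    println!(
        "exact = {} KiB, 4-bit = {} KiB, ratio = {:.2}x",
        exact_bytes / 1024,
        quant_bytes / 1024,
        exact_bytes as f64 / quant_bytes as f64
    );
}
```

With these assumed dimensions the per-vector scale overhead keeps the saving just above 7x rather than the naive 8x.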
## ONNX Export Workflow
Pinned Python dependencies for the real-model scripts live in [`scripts/requirements-real-model.txt`](scripts/requirements-real-model.txt).
The Rust real-model path also pulls ONNX Runtime CPU binaries through the `ort` crate on first build.
Example setup:
```bash
python3 -m venv .venv
. .venv/bin/activate
pip install -r scripts/requirements-real-model.txt
```
Export a documented lightweight preset:
```bash
python3 scripts/export_hf_decoder_onnx.py \
--preset distilgpt2 \
--output-dir artifacts/distilgpt2-onnx
```
Or export the verified SmolLM2 preset:
```bash
python3 scripts/export_hf_decoder_onnx.py \
--preset smollm2-135m-instruct \
--output-dir artifacts/smollm2-135m-instruct-onnx
```
Or export an explicit model id:
```bash
python3 scripts/export_hf_decoder_onnx.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output-dir artifacts/tinyllama-onnx
```
The export helper targets `text-generation-with-past` and defaults to `fp32`, which is the current verified dtype on the CPU ONNX Runtime path.
## Benchmark CLI
Synthetic quick run:
```bash
cargo run --release --example benchmark -- --workload synthetic --quick
```
Trace run:
```bash
cargo run --release --example benchmark -- \
--workload trace \
--trace traces/example.safetensors
```
Real-model exact run:
```bash
cargo run --release --example benchmark -- \
--workload real-model \
--real-model-dir artifacts/distilgpt2-onnx \
--prompt "Summarize the role of a KV cache in one sentence." \
--real-eval-mode exact \
--max-new-tokens 16
```
Real-model quantized run:
```bash
cargo run --release --example benchmark -- \
--workload real-model \
--real-model-dir artifacts/distilgpt2-onnx \
--prompt "Summarize the role of a KV cache in one sentence." \
--real-eval-mode quantized \
--bits 4 \
--real-key-strategy prod \
--max-new-tokens 16
```
Real-model side-by-side comparison:
```bash
cargo run --release --example benchmark -- \
--workload real-model \
--real-model-dir artifacts/distilgpt2-onnx \
--prompt "Summarize the role of a KV cache in one sentence." \
--real-eval-mode compare \
--bits 4 \
--value-bits 4 \
--real-key-strategy prod \
--top-k 5 \
--max-new-tokens 16
```
The CLI reports `source` explicitly as `synthetic`, `trace`, or `real-model` to avoid confusing model-shaped workloads with true decoder runs.
## One-Command Real-Model Eval
For a fuller end-to-end workflow, use the orchestration helper:
```bash
python3 scripts/run_real_model_eval.py \
--preset distilgpt2 \
--bits 2 4 8 \
--strategies prod mse
```
What it does:
- exports or reuses a real ONNX decoder bundle
- builds the Rust benchmark example once
- runs an exact baseline
- runs multiple exact-vs-quantized compare benchmarks
- writes raw JSON plus a markdown summary report under `artifacts/real-model-evals/`
Useful options:
```bash
python3 scripts/run_real_model_eval.py \
--model-dir artifacts/distilgpt2-onnx \
--prompts scripts/prompts/real_model_eval_prompts.jsonl \
--max-prompts 6 \
--max-new-tokens 24 \
--top-k 5 \
--bits 4 8 \
--strategies prod
```
The default prompt suite lives at [`scripts/prompts/real_model_eval_prompts.jsonl`](scripts/prompts/real_model_eval_prompts.jsonl).
## Real-Model Metrics
`real-model` mode can report:
- next-token logit RMSE
- top-k agreement
- token match rate and divergence rate
- reference-token cross-entropy / perplexity
- latency
- tokens/sec
- exact vs quantized KV memory usage
For `compare` mode, cross-entropy is computed against the exact run's generated token at each shared step. This is a distribution-drift metric, not a dataset perplexity claim.
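The per-step score is the standard log-softmax cross-entropy of the quantized run's logits at the exact run's chosen token. A minimal sketch with made-up logit values:

```rust
// Cross-entropy of a logit vector against a reference token index, using
// a numerically stable log-sum-exp. Logit values are illustrative.
fn cross_entropy(logits: &[f32], reference_token: usize) -> f32 {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_z = logits.iter().map(|x| (x - max).exp()).sum::<f32>().ln() + max;
    log_z - logits[reference_token]
}

fn main() {
    let quantized_logits = [2.0f32, 0.5, -1.0, 0.1];
    let exact_token = 0; // token the exact run generated at this step
    let ce = cross_entropy(&quantized_logits, exact_token);
    println!("cross-entropy = {ce:.4}, perplexity = {:.4}", ce.exp());
    assert!(ce > 0.0);
}
```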
## Trace Workflow
The existing trace exporter is still available when you want per-head analysis rather than full-model decode:
```bash
python3 scripts/export_hf_kv.py \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--input prompts.txt \
--output traces/mistral_layer0_head0.safetensors \
--layer 0 \
--head 0
```
Then run:
```bash
cargo run --release --example benchmark -- \
--workload trace \
--trace traces/mistral_layer0_head0.safetensors
```
## Validation Commands
```bash
cargo fmt -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo check --examples --all-features
cargo llvm-cov --workspace --all-features --summary-only
cargo audit
```
## CI-Safe vs Manual Tests
CI-safe:
- `cargo test --all-features`, which includes a tiny local ONNX fixture that exercises the ONNX Runtime/tokenizer/KV-cache path without downloading external weights.
Manual heavier smoke test:
```bash
TURBOQUANT_REAL_MODEL_DIR=artifacts/real-model-bundles/distilgpt2 \
cargo test --all-features manual_exported_real_model_smoke_test -- --ignored --nocapture
```
## Limitations
- The real-model backend is CPU-oriented and currently uses ONNX Runtime, not Burn.
- Quantized real-model evaluation reconstructs float past tensors before the next ONNX Runtime step.
- The WGPU/Burn path remains experimental and is still primarily a batch-kernel benchmark surface.
- The crate still assumes unit-norm vectors for the core quantizer APIs; the real-model path handles raw KV tensors by storing norms separately.
- The verified real-model surface is currently limited to `distilgpt2` and `HuggingFaceTB/SmolLM2-135M-Instruct`.
- This repository does not provide production serving, observability, or deployment tooling.
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md), [ARCHITECTURE.md](ARCHITECTURE.md), and [AGENTS.md](AGENTS.md).