# Architecture
## Overview
TurboQuant is a library, not a long-running service. The repository has three evaluation surfaces:
- synthetic vector workloads
- exported per-head trace workloads
- real end-to-end ONNX decoder workloads via `tract`
The core quantization path is still:
```text
finite vector
-> normalize to unit norm
-> random rotation / scalar codebook quantization (TurboQuantMSE)
-> optional residual sketch (TurboQuantProd / QJL)
-> bit-packed batch or KV-cache storage (batch, kv_cache, bitpack)
-> score / reconstruct on scalar, SIMD, or WGPU paths (backend, gpu)
```
The real-model path layers on top of that by normalizing each raw K/V vector, storing its norm separately, quantizing the unit vector, and reconstructing float `past_key_values` tensors for the next tract decode step.
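The normalize / store-norm / reconstruct flow above can be sketched in a few lines. This is an illustrative stand-in, not the crate's API: `split_norm` and `reconstruct` are hypothetical names, and the actual stage-1 quantization of the unit vector is elided.

```rust
// Hypothetical sketch of the real-model K/V handling: normalize each raw
// vector to unit norm, keep the norm separately, and rescale after the
// unit vector round-trips through the quantizer.

fn split_norm(v: &[f32]) -> (Vec<f32>, f32) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let unit = v.iter().map(|x| x / norm).collect();
    (unit, norm)
}

fn reconstruct(unit: &[f32], norm: f32) -> Vec<f32> {
    unit.iter().map(|x| x * norm).collect()
}

fn main() {
    let raw = vec![3.0_f32, 4.0];
    let (unit, norm) = split_norm(&raw);
    // ... quantize `unit` here; `norm` is stored alongside the codes ...
    let restored = reconstruct(&unit, norm);
    assert!((restored[1] - 4.0).abs() < 1e-6);
}
```

This keeps the public unit-norm contract intact: only unit vectors ever reach the quantizer, and the scale is reapplied when rebuilding float `past_key_values` tensors.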
## Modules
- `turboquant_mse`: stage-1 quantizer for reconstruction quality.
- `turboquant_prod`: stage-2 composition for inner-product quality.
- `qjl`: residual sketch and estimators.
- `batch`: packed batch containers and high-throughput scoring helpers.
- `kv_cache`: stateful quantized cache wrappers for attention-style workloads.
- `real_model`: tract/tokenizer-based end-to-end ONNX decoder runner.
- `backend`: scalar/SIMD math kernels.
- `gpu`: optional Burn/WGPU batch acceleration.
- `trace`: safetensors trace ingestion for exported benchmark inputs.
- `codebook`, `scalar_quant`, `rotation`, `bitpack`, `utils`: supporting primitives.
## Real-Model Runtime
The real-model runtime expects an exported decoder-only ONNX bundle containing:
- one of `model.onnx`, `decoder_model_merged.onnx`, or `decoder_model.onnx`
- `tokenizer.json`
- `config.json`
Common optional inputs supported by the runner:
- `attention_mask`
- `position_ids`
- `cache_position`
- `use_cache_branch`
Required decoder-cache I/O pattern:
- `past_key_values.<layer>.key`
- `past_key_values.<layer>.value`
- `present.<layer>.key`
- `present.<layer>.value`
- `logits`
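A minimal sketch of generating the per-layer cache tensor names the runner has to bind. `cache_io_names` and `num_layers` are illustrative names; only the `past_key_values.<layer>.key` / `present.<layer>.value` pattern comes from the bundle contract above.

```rust
// Build (past-input, present-output) name pairs for each decoder layer,
// following the naming pattern the runner expects.
fn cache_io_names(num_layers: usize) -> Vec<(String, String)> {
    (0..num_layers)
        .flat_map(|l| {
            [
                (format!("past_key_values.{l}.key"), format!("present.{l}.key")),
                (format!("past_key_values.{l}.value"), format!("present.{l}.value")),
            ]
        })
        .collect()
}

fn main() {
    let names = cache_io_names(2);
    assert_eq!(names[0].0, "past_key_values.0.key");
    assert_eq!(names[3].1, "present.1.value");
}
```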
## Key Design Decisions
- Finite, unit-norm vectors remain the crate-level contract for the public quantizer APIs.
- The real-model path does not weaken that contract. Raw model K/V tensors are normalized per vector and their norms are stored separately.
- `tract` is the real decoder inference backend because it currently gives the cleanest Rust path for ONNX causal-LM execution with explicit `past_key_values`.
- Burn/WGPU remains optional and experimental for batch quantization kernels. It is not the primary backend for full decoder inference.
- Quantized end-to-end evaluation feeds reconstructed float caches back into the decoder. This measures real decode-loop behavior under cache quantization without pretending the ONNX runtime itself performs quantized attention.
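The feed-reconstructed-floats decision can be illustrated with a toy round trip. Uniform 8-bit quantization over `[-1, 1]` here stands in for the crate's actual codebooks; the point is only that the decoder always consumes `f32` caches, never raw codes.

```rust
// Toy stand-in for cache quantization: codes are dequantized back to f32
// before the next decode step, so the ONNX decoder itself stays unchanged.

fn quantize(v: &[f32]) -> Vec<u8> {
    v.iter()
        .map(|x| ((x.clamp(-1.0, 1.0) + 1.0) * 127.5).round() as u8)
        .collect()
}

fn dequantize(codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|c| *c as f32 / 127.5 - 1.0).collect()
}

fn main() {
    let unit = vec![0.6_f32, 0.8];
    let restored = dequantize(&quantize(&unit));
    // The decode loop sees these reconstructed floats, with small
    // per-element quantization error.
    assert!((restored[0] - 0.6).abs() < 1e-2);
    assert!((restored[1] - 0.8).abs() < 1e-2);
}
```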
## Invariants
- Dimensions must match the quantizer or cache configuration.
- Quantizer inputs must be finite. Most public quantizer and cache APIs also require unit-norm vectors.
- Packed batch payloads must have the expected row count, bit width, and packed byte length.
- Trace query positions must be non-negative and within the available token prefix.
- Real-model ONNX bundles must expose logits plus per-layer past/present key/value tensors.
- Real-model cache tensors are expected to be float32 and decoder-only.
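The first three invariants can be sketched as a validation helper. The error type, function names, and tolerance here are illustrative, not the crate's actual `TurboQuantError` surface.

```rust
// Hypothetical input validation mirroring the invariants above:
// dimension match, finiteness, and (approximate) unit norm.
fn check_input(v: &[f32], dim: usize) -> Result<(), String> {
    if v.len() != dim {
        return Err(format!("dimension mismatch: {} vs {}", v.len(), dim));
    }
    if v.iter().any(|x| !x.is_finite()) {
        return Err("non-finite input".into());
    }
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if (norm - 1.0).abs() > 1e-4 {
        return Err(format!("not unit-norm (norm = {norm})"));
    }
    Ok(())
}

// Expected packed byte length: `rows` vectors of `dim` values at `bits`
// bits each, rounded up to whole bytes per row.
fn packed_len(rows: usize, dim: usize, bits: usize) -> usize {
    rows * ((dim * bits + 7) / 8)
}

fn main() {
    assert!(check_input(&[0.6, 0.8], 2).is_ok());
    assert!(check_input(&[f32::NAN, 0.0], 2).is_err());
    assert_eq!(packed_len(4, 64, 2), 64);
}
```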
## Failure Model
- Invalid user input yields a `TurboQuantError`; the library does not silently coerce malformed data.
- Unsupported or malformed ONNX bundles fail with typed model/runtime errors.
- The tiny checked-in ONNX fixture is only a plumbing test. It is not a language-model quality benchmark.
- There is no service deployment/runtime surface in this repository.
## Operational Runbook
For a release candidate:
1. Run `cargo fmt -- --check`.
2. Run `cargo clippy --all-targets --all-features -- -D warnings`.
3. Run `cargo test --all-features`.
4. Run `cargo check --examples --all-features`.
5. Run `cargo llvm-cov --workspace --all-features --summary-only`.
6. Run `cargo audit`.
7. Run the synthetic quick benchmark.
8. Run the real-model quick path on at least one exported lightweight bundle manually.
## Non-Goals
- Production inference serving.
- Fleet observability, tracing, health checks, or deployment manifests.
- Claiming that the current tract path is a custom quantized attention runtime.