# Architecture
## Overview
TurboQuant is a library, not a long-running service. The repository has three evaluation surfaces:
- synthetic vector workloads
- exported per-head trace workloads
- real end-to-end ONNX decoder workloads via `tract`
The core quantization path is still:
```text
finite vector
-> normalize to unit norm
-> random rotation / scalar codebook quantization (TurboQuantMSE)
-> optional residual sketch (TurboQuantProd / QJL)
-> bit-packed batch or KV-cache storage (batch, kv_cache, bitpack)
-> score / reconstruct on scalar, SIMD, or WGPU paths (backend, gpu)
```
The real-model path layers on top of that by normalizing each raw K/V vector, storing its norm separately, quantizing the unit vector, and reconstructing float `past_key_values` tensors for the next tract decode step.
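The normalize / store-norm / reconstruct flow above can be sketched in a few lines. This is an illustrative stand-in, not the crate's API: `split_norm` and `reconstruct` are hypothetical names, and the actual stage-1 quantization of the unit vector is elided.

```rust
// Hypothetical sketch of the real-model K/V handling: normalize each raw
// vector to unit norm, keep the norm separately, and rescale after the
// unit vector round-trips through the quantizer.

fn split_norm(v: &[f32]) -> (Vec<f32>, f32) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let unit = v.iter().map(|x| x / norm).collect();
    (unit, norm)
}

fn reconstruct(unit: &[f32], norm: f32) -> Vec<f32> {
    unit.iter().map(|x| x * norm).collect()
}

fn main() {
    let raw = vec![3.0_f32, 4.0];
    let (unit, norm) = split_norm(&raw);
    // ... quantize `unit` here; `norm` is stored alongside the codes ...
    let restored = reconstruct(&unit, norm);
    assert!((restored[1] - 4.0).abs() < 1e-6);
}
```

This keeps the public unit-norm contract intact: only unit vectors ever reach the quantizer, and the scale is reapplied when rebuilding float `past_key_values` tensors.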
## Modules
- `turboquant_mse`: stage-1 quantizer for reconstruction quality.
- `turboquant_prod`: stage-2 composition for inner-product quality.
- `qjl`: residual sketch and estimators.
- `batch`: packed batch containers and high-throughput scoring helpers.
- `kv_cache`: stateful quantized cache wrappers for attention-style workloads.
- `real_model`: tract/tokenizer-based end-to-end ONNX decoder runner.
- `backend`: scalar/SIMD math kernels.
- `gpu`: optional Burn/WGPU batch acceleration.
- `trace`: safetensors trace ingestion for exported benchmark inputs.
- `codebook`, `scalar_quant`, `rotation`, `bitpack`, `utils`: supporting primitives.
## Real-Model Runtime
The real-model runtime expects an exported decoder-only ONNX bundle containing:
- one of `model.onnx`, `decoder_model_merged.onnx`, or `decoder_model.onnx`
- `tokenizer.json`
- `config.json`
Common optional inputs supported by the runner:
- `attention_mask`
- `position_ids`
- `cache_position`
- `use_cache_branch`
Required decoder-cache I/O pattern:
- `past_key_values.<layer>.key`
- `past_key_values.<layer>.value`
- `present.<layer>.key`
- `present.<layer>.value`
- `logits`
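A minimal sketch of generating the per-layer cache tensor names the runner has to bind. `cache_io_names` and `num_layers` are illustrative names; only the `past_key_values.<layer>.key` / `present.<layer>.value` pattern comes from the bundle contract above.

```rust
// Build (past-input, present-output) name pairs for each decoder layer,
// following the naming pattern the runner expects.
fn cache_io_names(num_layers: usize) -> Vec<(String, String)> {
    (0..num_layers)
        .flat_map(|l| {
            [
                (format!("past_key_values.{l}.key"), format!("present.{l}.key")),
                (format!("past_key_values.{l}.value"), format!("present.{l}.value")),
            ]
        })
        .collect()
}

fn main() {
    let names = cache_io_names(2);
    assert_eq!(names[0].0, "past_key_values.0.key");
    assert_eq!(names[3].1, "present.1.value");
}
```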
## Key Design Decisions
- Finite, unit-norm vectors remain the crate-level contract for the public quantizer APIs.
- The real-model path does not weaken that contract. Raw model K/V tensors are normalized per vector and their norms are stored separately.
- `tract` is the real decoder inference backend because it currently gives the cleanest Rust path for ONNX causal-LM execution with explicit `past_key_values`.
- Burn/WGPU remains optional and experimental for batch quantization kernels. It is not the primary backend for full decoder inference.
- Quantized end-to-end evaluation feeds reconstructed float caches back into the decoder. This measures real decode-loop behavior under cache quantization without pretending the ONNX runtime itself performs quantized attention.
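The feed-reconstructed-floats decision can be illustrated with a toy round trip. Uniform 8-bit quantization over `[-1, 1]` here stands in for the crate's actual codebooks; the point is only that the decoder always consumes `f32` caches, never raw codes.

```rust
// Toy stand-in for cache quantization: codes are dequantized back to f32
// before the next decode step, so the ONNX decoder itself stays unchanged.

fn quantize(v: &[f32]) -> Vec<u8> {
    v.iter()
        .map(|x| ((x.clamp(-1.0, 1.0) + 1.0) * 127.5).round() as u8)
        .collect()
}

fn dequantize(codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|c| *c as f32 / 127.5 - 1.0).collect()
}

fn main() {
    let unit = vec![0.6_f32, 0.8];
    let restored = dequantize(&quantize(&unit));
    // The decode loop sees these reconstructed floats, with small
    // per-element quantization error.
    assert!((restored[0] - 0.6).abs() < 1e-2);
    assert!((restored[1] - 0.8).abs() < 1e-2);
}
```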
## Invariants
- Dimensions must match the quantizer or cache configuration.
- Quantizer inputs must be finite. Most public quantizer and cache APIs also require unit-norm vectors.
- Packed batch payloads must have the expected row count, bit width, and packed byte length.
- Trace query positions must be non-negative and within the available token prefix.
- Real-model ONNX bundles must expose logits plus per-layer past/present key/value tensors.
- Real-model cache tensors are expected to be float32 and decoder-only.
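The first three invariants can be sketched as a validation helper. The error type, function names, and tolerance here are illustrative, not the crate's actual `TurboQuantError` surface.

```rust
// Hypothetical input validation mirroring the invariants above:
// dimension match, finiteness, and (approximate) unit norm.
fn check_input(v: &[f32], dim: usize) -> Result<(), String> {
    if v.len() != dim {
        return Err(format!("dimension mismatch: {} vs {}", v.len(), dim));
    }
    if v.iter().any(|x| !x.is_finite()) {
        return Err("non-finite input".into());
    }
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if (norm - 1.0).abs() > 1e-4 {
        return Err(format!("not unit-norm (norm = {norm})"));
    }
    Ok(())
}

// Expected packed byte length: `rows` vectors of `dim` values at `bits`
// bits each, rounded up to whole bytes per row.
fn packed_len(rows: usize, dim: usize, bits: usize) -> usize {
    rows * ((dim * bits + 7) / 8)
}

fn main() {
    assert!(check_input(&[0.6, 0.8], 2).is_ok());
    assert!(check_input(&[f32::NAN, 0.0], 2).is_err());
    assert_eq!(packed_len(4, 64, 2), 64);
}
```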
## Failure Model
- Invalid user input yields a `TurboQuantError`; the library does not silently coerce malformed data.
- Unsupported or malformed ONNX bundles fail with typed model/runtime errors.
- The tiny checked-in ONNX fixture is only a plumbing test. It is not a language-model quality benchmark.
- There is no service deployment/runtime surface in this repository.
## Operational Runbook
For a release candidate:
1. Run `cargo fmt -- --check`.
2. Run `cargo clippy --all-targets --all-features -- -D warnings`.
3. Run `cargo test --all-features`.
4. Run `cargo check --examples --all-features`.
5. Run `cargo llvm-cov --workspace --all-features --summary-only`.
6. Run `cargo audit`.
7. Run the synthetic quick benchmark.
8. Run the real-model quick path on at least one exported lightweight bundle manually.
## Non-Goals
- Production inference serving.
- Fleet observability, tracing, health checks, or deployment manifests.
- Claiming that the current tract path is a custom quantized attention runtime.