turbo-quant 0.2.2

Experimental vector compression sidecars with PolarQuant, TurboQuant, QJL sketches, wire formats, and benchmark receipts
Documentation

turbo-quant

Experimental vector compression sidecars for embedding search.

turbo-quant implements three compression sidecars (PolarQuant, TurboQuant, and QJL sketches) and the surrounding infrastructure needed to use them in a real retrieval system: bit-packed wire formats, candidate generation, exact rerank, KV-cache shadow mode, and a complete benchmark harness that validates quality against a raw-vector reference.

Status: experimental / research substrate. See the "Scope and limits" section below for what this crate is and is not safe to claim.

What's in the box

  • PolarQuant — angular quantization of vectors onto a uniform angle grid. Asymmetric: scoring in compressed space only (no decode). Source code: src/polar.rs.
  • TurboQuant — adds a QJL residual sketch on top of PolarQuant to recover accuracy. Symmetric: the residual sketch lets you approximate the inner product, and exact rerank uses the raw vector. Source: src/turbo.rs.
  • QJL sketches — randomized sign-based Johnson-Lindenstrauss projections for cheap approximate inner product estimation. Source: src/qjl.rs.
  • Bit-packed wire formatsPackedPolarCode, PackedQjlSketch, PackedTurboCode with a fixed storage_layout: "polar_radii_f32_angles_bitpacked_qjl_signs_bitpacked". Source: src/packed.rs, src/wire.rs.
  • KV-cache shadow modeKvRuntimeConfig, KvShadowToken. Lets you score a compressed KV cache against a raw baseline and emit a KvShadowReceipt. Experimental, not for production. Source: src/kv.rs.
  • Codec profiles — typed CodecProfileV1 that captures the codec kind, dim, bits, projections, rotation, and a profile_digest (FNV-1a 64-bit) for receipt comparison. Source: src/profile.rs.
  • Benchmark harnesstools/semantic_memory_harness/ validates the sidecar against semantic_memory::search::cosine_similarity as the raw-vector reference, and emits a SemanticMemoryHarnessSummaryV1 receipt.

Quick Start

use turbo_quant::{CodecProfile, TurboSidecarCode, TurboSidecarIndex};
use nalgebra::DVector;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a codec profile.
    let profile = CodecProfile::turbo_quant_8bit(32)
        .with_projections(16)
        .with_seed(42);

    // Encode a corpus.
    let corpus: Vec<DVector<f32>> = /* your vectors */;
    let code = TurboSidecarCode::encode(&profile, &corpus)?;
    let index = TurboSidecarIndex::build(&profile, code)?;

    // Search — get candidates in compressed space, then rerank on raw.
    let query = DVector::from_vec(/* query vector */);
    let candidates = index.candidates(&profile, &query, 40, 10)?;  // oversample × top_k
    let reranked = index.exact_rerank(&candidates, &corpus, &query, 10)?;

    Ok(())
}

The examples/ directory has runnable versions of this and three other flows: bench_embeddings.rs, kv_shadow.rs, profile_receipt.rs, and compat_0_1_smoke.rs (the P26 release-gate smoke test).

Benchmarks — measured

These are real numbers from the P26 release-evidence (docs/release-evidence/0.2.0/semantic_memory_harness_receipt.json). They are run on a synthetic deterministic corpus (synthetic-deterministic-unit-vectors-v1, 128 vectors × 32 dim, 12 queries, k=10, oversample=4, TurboQuant 8-bit, 16 QJL projections, fast-Hadamard rotation, seed=42):

Metric Value Notes
Recall@1 (after exact rerank) 0.917 11/12 queries had the raw top-1 in the reranked top-1
Recall@5 (after exact rerank) 0.983 11.8/12 queries
Recall@10 (after exact rerank) 0.992 11.9/12 queries
Exact-rerank recovery rate 0.917 top-1 in compressed candidates → top-1 in raw
Top-k overlap (compressed top-10 vs raw top-10) 0.567 Exploratory metric — see note
Rank drift (mean) 0.083 Position changes in top-10
Rank drift (p95) 1.0
Rank drift (max) 1
Score error (mean) 0.479 Absolute score difference, compressed vs raw
Score error (p95) 1.085
Score error (max) 1.203

Storage (128 vectors × 32 dim):

Layout Bytes Ratio to raw
Raw f32 (reference) 16,384 1.0×
f16 baseline 8,192 0.5×
TurboQuant sidecar (no fallback) 10,240 0.625×
TurboQuant sidecar + exact fallback 26,624 1.625×

The sidecar is 0.625× the raw size when the raw fallback is not kept. With the fallback, total storage is 1.625× raw — this is the production configuration where you keep the compressed sidecar and the raw vector for the rerank step. The "exact-rerank recovery rate" is the gate: at 0.917 (11/12 queries exact top-1) the harness passed.

P31 retrieval-benchmark numbers (semantic-memory, May 2026):

Run on a different harness with 1,000 vectors × 384 dim, 50 queries, k=10, candidate multiplier 20:

Metric Value
Candidate scoring p50 138.0 ms
Candidate scoring p95 148.1 ms
Exact rerank p95 0.087 ms
Mean abs score error 0.0024
P95 abs score error 0.0061
NDCG@10 1.0
Mean rank drift 0.0
Recall@10 1.0

P32 retrieval-benchmark numbers (semantic-memory, May 2026, smoke):

Same harness, 1,000 × 384, 50 queries, with the optimized candidate-then-exact flow:

Metric Value
Candidate scoring p50 109.1 ms
Candidate scoring p95 111.3 ms
Exact rerank p95 0.046 ms
Mean abs score error 0.0024
P95 abs score error 0.0061
NDCG@10 1.0
Recall@10 1.0
Fallback rate 0.0

Both P31 and P32 receipts classify as green. The full receipts are at semantic-memory/docs/codex-runs/archive/.../turboquant-*-benchmark-summary.json.

To reproduce: cd turbo-quant && cargo run --release --example bench_embeddings.

Scope and limits

This crate is experimental. The following claims are explicitly forbidden in documentation, rustdoc, README, and release notes unless scoped to a specific external paper claim or local receipt evidence:

  • "zero accuracy loss"
  • "zero overhead"
  • "production KV cache runtime"
  • "drop-in replacement"
  • "better than semantic-memory"
  • "proven deployment quality"
  • "no dataset-specific calibration needed"

What's allowed:

  • "experimental codec substrate"
  • "derived sidecar (not canonical vectors)"
  • "approximate scoring; exact fallback or rerank required"
  • "workload-specific benchmark receipts required"
  • "semantic-memory reference harness validates retrieval drift locally"

The full release-claim law is at turbo-quant/AGENTS.md (P26 patch).

What's verified

  • cargo test --all-targets --all-features --locked passes (29 tests, 11.3s in CI).
  • cargo check --all-targets --all-features --locked clean.
  • cargo fmt --all -- --check clean.
  • python3 scripts/assert_p26_invariants.py . passes — all required P26 release artifacts are present and content-addressed.
  • cargo package succeeds.
  • The SemanticMemoryHarnessSummaryV1 is emitted and SHA-256 recorded in docs/release-evidence/0.2.0/release_receipt.json.

Test coverage

  • 18 integration test files in tests/:
    • api_compat.rs, bitpack.rs, determinism.rs, encoded_size.rs, inner_product.rs, invalid_inputs.rs, kv_policy.rs, malformed_artifacts.rs, packed_index.rs, profile_receipt.rs, query_workspace.rs, readiness.rs, rotation_policy.rs, serialization.rs, wire_format.rs, workspace.rs — plus 2 more.
  • 4 examples: bench_embeddings, compat_0_1_smoke, kv_shadow, profile_receipt.
  • 1 criterion bench: benches/turbo_quant_search.rs.

MSRV

Rust 1.75 (2021 edition). Stable features only.

Dependencies

  • serde, nalgebra (with serde-serialize).
  • bitvec (transitive, for BitPack).
  • Workspace Cargo.toml pin.

Zero platform-specific code, zero FFI, zero unsafe (unsafe_code is denied at the workspace level).

License

MIT OR Apache-2.0 (dual-licensed). See LICENSE-MIT and LICENSE-APACHE for the full texts.

Changelog

See CHANGELOG.md for the release history. The v0.2.0 release notes are in RELEASE_NOTES.md and the receipts in docs/release-evidence/0.2.0/.

Where it's used

turbo-quant is the experimental vector compression sidecar for:

  • semantic-memory — every projection import with AdmissibilityClass::Standard or below can route through the sidecar (gated by quant-governor).
  • scr-runtime-compression — the cross-runtime compression scheduler can use TurboSidecarCode for batched candidate generation.
  • The KV-cache shadow mode is used by the tools/semantic_memory_harness/ to validate the sidecar against a raw-vector reference.

Adopting turbo-quant directly is appropriate for systems that need a vector compression sidecar with a documented, receipted benchmark harness, and that can run their own workload-specific benchmarks to confirm the sidecar is appropriate.