turbo-quant 0.1.0

Rust implementation of TurboQuant, PolarQuant, and QJL — zero-overhead vector quantization for semantic search and KV cache compression (ICLR 2026)

Compress high-dimensional vectors (embeddings, KV cache entries) to 3-8 bits per value with zero accuracy loss and no dataset-specific calibration. Based on the algorithms from Google Research published at ICLR 2026, AISTATS 2026, and AAAI 2025.


Why turbo-quant?

Traditional vector quantization methods like Product Quantization (PQ) and Optimized Product Quantization (OPQ) require an expensive offline training phase over a representative dataset before a single vector can be indexed. This means:

  • You can't index vectors as they arrive in a streaming setting
  • Quantizer quality degrades when the data distribution shifts
  • Training on sensitive data may violate privacy constraints

turbo-quant eliminates all of these problems. The quantizer is entirely defined by four integers (dim, bits, projections, seed) -- no codebooks, no centroids, no calibration data. Vectors can be compressed the instant they arrive, making it ideal for real-time indexing, federated search, and privacy-sensitive deployments.


Key Properties

  • Data-oblivious -- No k-means training, no codebook, no calibration set. The rotation is seeded once and works on any distribution.
  • Deterministic -- Identical (dim, bits, projections, seed) always produces the same quantizer. State can be fully reconstructed from four integers.
  • Zero accuracy loss -- At 3+ bits, inner product estimates are provably unbiased and achieve near-optimal distortion (within ~2.7x of the Shannon limit).
  • Instant indexing -- Unlike Product Quantization, there is no offline training phase. Vectors can be indexed as they arrive.
  • Compact state -- The entire quantizer is defined by (dim, bits, projections, seed). No model files to ship or version.

Architecture

turbo-quant is organized as a layered system where each algorithm builds on the previous:

+------------------------------------------------------------------+
|                        KvCacheCompressor                         |
|  Online KV cache for transformer attention (compress_token,      |
|  attention_scores, attend)                                       |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                        TurboQuantizer                            |
|  Two-stage compression: PolarQuant (b-1 bits) + QJL (1 bit)     |
|  Provably unbiased inner product estimation                      |
+------------------------------------------------------------------+
                      /                  \
                     v                    v
+-------------------------------+  +-------------------------------+
|       PolarQuantizer          |  |        QjlQuantizer           |
|  Polar coordinate encoding    |  |  1-bit JL sketching of the   |
|  of rotated vectors           |  |  reconstruction residual      |
+-------------------------------+  +-------------------------------+
                     \                    /
                      v                  v
+------------------------------------------------------------------+
|                       StoredRotation                             |
|  Haar-distributed random orthogonal matrix via QR decomposition  |
|  Deterministic from (dim, seed) using ChaCha8 RNG               |
+------------------------------------------------------------------+

Getting Started

Requirements

  • Rust 1.75.0 or later
  • Cargo (included with Rust)

Installation

Add to your Cargo.toml:

[dependencies]
turbo-quant = { git = "https://github.com/RecursiveIntell/turbo-quant" }

Or clone and build from source:

git clone https://github.com/RecursiveIntell/turbo-quant.git
cd turbo-quant
cargo build --release

Usage

TurboQuant (Two-Stage Compression)

The flagship algorithm. Combines PolarQuant with a QJL residual sketch for near-optimal distortion at any bit width.

use turbo_quant::TurboQuantizer;

// Create a quantizer for 1536-dimensional embeddings at 8 bits.
let q = TurboQuantizer::new(
    1536,   // dimension (must be even)
    8,      // bits per value (1-16)
    384,    // QJL projections (dim/4 recommended for search)
    42,     // seed (any u64)
).unwrap();

// Compress a database vector.
let embedding: Vec<f32> = vec![0.1; 1536];
let code = q.encode(&embedding).unwrap();

// At query time: estimate inner product without decompressing.
let query: Vec<f32> = vec![0.05; 1536];
let score = q.inner_product_estimate(&code, &query).unwrap();

// Estimate L2 distance.
let dist = q.l2_distance_estimate(&code, &query).unwrap();

// Approximate reconstruction (lossy).
let reconstructed = q.decode_approximate(&code).unwrap();

// Compression metrics.
println!("Encoded size: {} bytes", code.encoded_bytes());
println!("Compression ratio: {:.2}x", code.compression_ratio());

// Batch statistics over multiple codes.
let codes = vec![code];
let stats = q.batch_stats(&codes);
println!("Average compression: {:.2}x", stats.compression_ratio);

PolarQuant (Single-Stage Compression)

A simpler, single-stage compressor. Lower overhead than TurboQuant but slightly higher distortion at low bit widths.

use turbo_quant::PolarQuantizer;

let pq = PolarQuantizer::new(1536, 8, 42).unwrap();

let vector: Vec<f32> = vec![0.1; 1536];
let code = pq.encode(&vector).unwrap();

// Decode (approximate: radii are stored losslessly as f32, but angles are quantized).
let decoded = pq.decode(&code).unwrap();

// Inner product and L2 distance estimation.
let query: Vec<f32> = vec![0.05; 1536];
let ip = pq.inner_product_estimate(&code, &query).unwrap();
let l2 = pq.l2_distance_estimate(&code, &query).unwrap();

QJL (1-bit Sketching)

The Quantized Johnson-Lindenstrauss transform. Produces a 1-bit-per-projection sketch for unbiased inner product estimation. Used internally by TurboQuant for residual correction, but also available standalone.

use turbo_quant::QjlQuantizer;

let qjl = QjlQuantizer::new(1536, 384, 42).unwrap();

let vector: Vec<f32> = vec![0.1; 1536];
let sketch = qjl.sketch(&vector).unwrap();

// Estimate inner product using sketch + raw query.
let query: Vec<f32> = vec![0.05; 1536];
let ip = qjl.inner_product_estimate(&sketch, &query).unwrap();

println!("Sketch size: {} bytes", sketch.encoded_bytes());

KV Cache Compression

Online KV cache compressor for transformer attention layers. Compresses key-value pairs as tokens are generated and computes attention scores directly on compressed data.

use turbo_quant::kv::{KvCacheCompressor, KvCacheConfig};

let config = KvCacheConfig {
    head_dim: 128,       // per-attention-head dimension
    bits: 6,             // 4-6 bits recommended for KV cache
    projections: 16,     // QJL projections for residual correction
    seed: 42,
};

let mut cache = KvCacheCompressor::new(config).unwrap();

// Compress tokens as they are generated.
for step in 0..sequence_length {
    let key = compute_key(step);     // Vec<f32> of length head_dim
    let value = compute_value(step); // Vec<f32> of length head_dim
    cache.compress_token(&key, &value).unwrap();
}

// Compute attention logits against all compressed keys.
let query = compute_query();
let scores = cache.attention_scores(&query).unwrap();

// Full softmax-weighted attention output.
let output = cache.attend(&query).unwrap();

// Decode specific token values (e.g., top-k).
let top_k_indices = vec![0, 5, 12];
let values = cache.decode_values(&top_k_indices).unwrap();

// Compression metrics.
println!("Tokens: {}", cache.len());
println!("Compressed bytes: {}", cache.compressed_bytes());
println!("Compression ratio: {:.2}x", cache.compression_ratio());

Serialization and Persistence

All compressed codes support JSON serialization via serde. Quantizers themselves don't need to be serialized -- they can be reconstructed from their parameters.

use turbo_quant::{TurboQuantizer, TurboCode};

let q = TurboQuantizer::new(64, 8, 16, 42).unwrap();
let code = q.encode(&vec![0.1; 64]).unwrap();

// Serialize to JSON.
let json = serde_json::to_string(&code).unwrap();

// Deserialize back.
let restored: TurboCode = serde_json::from_str(&json).unwrap();

// Quantizer reconstruction: just save the 4 parameters.
let q2 = TurboQuantizer::new(64, 8, 16, 42).unwrap();
// q2 is identical to q -- same rotation, same behavior.

Choosing Parameters

Use Case                     Bits  Projections  Notes
Semantic search (recall@10)  8     dim / 4      Best accuracy; suitable for production retrieval
KV cache compression         4-6   dim / 8      Balances memory savings with attention quality
Maximum compression          3     dim / 16     Aggressive; still provably unbiased
Lightweight similarity       -     dim / 4      Use QJL standalone for 1-bit sketches

Guidelines:

  • Bits controls the PolarQuant fidelity. Higher bits = less distortion but larger codes. The QJL stage always uses 1 bit per projection.
  • Projections controls the QJL residual correction precision. More projections = better bias correction but larger residual sketches. Diminishing returns beyond dim/4.
  • Seed must be the same for encoding and querying. Different seeds produce incompatible quantizers.
  • Dimension must be even (required for polar pair encoding).

How It Works

Stage 1: Random Rotation (Whitening)

Real-world embeddings have non-uniform energy distributions across coordinates. A random orthogonal rotation (Haar-distributed, generated via QR decomposition of a Gaussian matrix) spreads the energy uniformly, making simple scalar quantization near-optimal.

x_rotated = R * x     where R is a d x d orthogonal matrix

The rotation is deterministic: the same (dim, seed) always produces the same R via ChaCha8 RNG.
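To make this concrete, here is a minimal std-only sketch of the construction: fill a d x d matrix with Gaussian entries, then orthonormalize it. The crate itself uses nalgebra's QR decomposition and a ChaCha8 RNG; this sketch swaps in a xorshift generator with Box-Muller sampling and modified Gram-Schmidt (names and structure are illustrative, not the crate's internals).

```rust
use std::f64::consts::PI;

// Simple xorshift64 PRNG standing in for the crate's ChaCha8;
// returns a uniform sample in [0, 1).
fn uniform(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

// Box-Muller transform: two uniform samples -> one standard normal sample.
fn gaussian(state: &mut u64) -> f64 {
    let u1 = uniform(state).max(f64::MIN_POSITIVE);
    let u2 = uniform(state);
    (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
}

// Build a d x d random orthogonal matrix (as rows) from a seed: Gaussian
// entries orthonormalized by modified Gram-Schmidt (the effect of QR).
fn random_rotation(d: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut state = seed | 1; // xorshift must not start at zero
    let mut rows: Vec<Vec<f64>> = (0..d)
        .map(|_| (0..d).map(|_| gaussian(&mut state)).collect())
        .collect();
    for i in 0..d {
        for j in 0..i {
            // Remove the component of row i along the finished row j.
            let dot: f64 = (0..d).map(|k| rows[i][k] * rows[j][k]).sum();
            for k in 0..d {
                let v = rows[j][k];
                rows[i][k] -= dot * v;
            }
        }
        let norm = rows[i].iter().map(|x| x * x).sum::<f64>().sqrt();
        rows[i].iter_mut().for_each(|x| *x /= norm);
    }
    rows
}

fn main() {
    let r = random_rotation(8, 42);
    let r00: f64 = (0..8).map(|k| r[0][k] * r[0][k]).sum();
    let r01: f64 = (0..8).map(|k| r[0][k] * r[1][k]).sum();
    // Rows are orthonormal: r0.r0 ~ 1, r0.r1 ~ 0.
    println!("r0.r0 = {r00:.6}, r0.r1 = {r01:.6}");
}
```

Because the matrix is a pure function of (dim, seed), regenerating it on another machine yields the identical rotation, which is what makes the quantizer state "four integers".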

Stage 2: PolarQuant (Polar Coordinate Encoding)

After rotation, the d-dimensional vector is grouped into d/2 coordinate pairs. Each pair is converted to polar form:

(a, b) -> (r, theta)     where r = sqrt(a^2 + b^2), theta = atan2(b, a)
  • Radii are stored losslessly as f32 values
  • Angles are uniformly quantized to b bits on [-pi, pi], yielding 2^b quantization levels

This preserves directional information with controllable precision while retaining exact magnitude information.
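The pair transform can be sketched in a few lines (illustrative only, not the crate's internal code): the radius keeps full f32 precision while the angle is snapped to one of 2^b uniform levels on [-pi, pi].

```rust
use std::f32::consts::PI;

// Encode one coordinate pair as (radius, b-bit angle code).
fn polar_encode_pair(a: f32, b: f32, bits: u32) -> (f32, u32) {
    let r = (a * a + b * b).sqrt();
    let theta = b.atan2(a); // angle in [-pi, pi]
    let levels = 1u32 << bits;
    let step = 2.0 * PI / levels as f32;
    // Shift to [0, 2pi], round to the nearest level, wrap pi back onto -pi.
    let code = (((theta + PI) / step).round() as u32) % levels;
    (r, code)
}

// Reconstruct the pair; the angle error is at most step / 2 = pi / 2^b.
fn polar_decode_pair(r: f32, code: u32, bits: u32) -> (f32, f32) {
    let step = 2.0 * PI / (1u32 << bits) as f32;
    let theta = code as f32 * step - PI;
    (r * theta.cos(), r * theta.sin())
}

fn main() {
    let (r, code) = polar_encode_pair(0.6, 0.8, 8);
    let (a, b) = polar_decode_pair(r, code, 8);
    println!("r = {r}, code = {code}, pair = ({a:.4}, {b:.4})");
}
```

Note that the magnitude of each pair survives exactly; only the direction within the pair is quantized, with error bounded by pi / 2^b radians.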

Stage 3: QJL Residual Correction

TurboQuant computes the reconstruction residual (original minus PolarQuant approximation) and sketches it using the Quantized Johnson-Lindenstrauss transform:

residual = x - decode(polar_encode(x))
sketch   = [sign(g_1 . residual), ..., sign(g_m . residual)]

where g_1, ..., g_m are random Gaussian projection vectors. Each sketch entry is a single bit (sign).

Inner Product Estimation

The combined estimator is provably unbiased:

<x, y> ~ IP_polar(code, y) + IP_qjl(residual_sketch, y)
  • IP_polar: Computed directly from quantized angles and stored radii against the rotated query
  • IP_qjl: (||residual|| * sqrt(pi / 2) / m) * sum_i sign(g_i . residual) * (g_i . query) -- an unbiased estimator of <residual, query> when the g_i are i.i.d. standard Gaussian

At 3+ bits this achieves distortion within ~2.7x of the Shannon rate-distortion limit.
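The QJL half of the estimator can be checked numerically. The sketch below (std-only; the RNG and helper names are stand-ins, assuming i.i.d. standard Gaussian projections) estimates <x, y> from one-bit projections of x plus raw projections of y, scaled by ||x|| * sqrt(pi/2) / m.

```rust
use std::f64::consts::PI;

// xorshift64 + Box-Muller: a cheap stand-in for a proper Gaussian source.
fn gaussian(state: &mut u64) -> f64 {
    let mut next = || {
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        (*state >> 11) as f64 / (1u64 << 53) as f64
    };
    let u1 = next().max(f64::MIN_POSITIVE);
    let u2 = next();
    (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Estimate <x, y> from m one-bit projections of x and raw projections of y.
fn qjl_estimate(x: &[f64], y: &[f64], m: usize, seed: u64) -> f64 {
    let mut state = seed | 1;
    let norm_x = dot(x, x).sqrt();
    let mut acc = 0.0;
    for _ in 0..m {
        let g: Vec<f64> =
            (0..x.len()).map(|_| gaussian(&mut state)).collect();
        acc += dot(&g, x).signum() * dot(&g, y); // one bit of x, full y
    }
    norm_x * (PI / 2.0).sqrt() / m as f64 * acc
}

fn main() {
    let x = vec![1.0, 2.0, -1.0, 0.5];
    let y = vec![0.5, -1.0, 2.0, 1.0];
    // Exact inner product is -3.0; the estimate converges to it as m grows.
    println!("exact = {}, estimate = {:.3}",
             dot(&x, &y), qjl_estimate(&x, &y, 20_000, 42));
}
```

With m = 20,000 projections the estimate lands within a few percent of the exact value; in TurboQuant this estimator is applied only to the small PolarQuant residual, so far fewer projections suffice.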


API Reference

Core Types

Type               Description
TurboQuantizer     Two-stage quantizer (PolarQuant + QJL). The primary entry point.
TurboCode          Compressed vector output from TurboQuantizer.
PolarQuantizer     Single-stage polar coordinate quantizer.
PolarCode          Compressed vector output from PolarQuantizer.
QjlQuantizer       Standalone 1-bit JL sketch quantizer.
QjlSketch          1-bit sketch output from QjlQuantizer.
KvCacheCompressor  Online KV cache for transformer attention.
KvCacheConfig      Configuration struct for KvCacheCompressor.
CompressedToken    A single compressed key-value pair in the cache.
StoredRotation     A d x d orthogonal rotation matrix.
BatchStats         Aggregate compression statistics.

Constructors

TurboQuantizer::new(dim: usize, bits: u8, projections: usize, seed: u64) -> Result<Self>
PolarQuantizer::new(dim: usize, bits: u8, seed: u64) -> Result<Self>
QjlQuantizer::new(dim: usize, projections: usize, seed: u64) -> Result<Self>
KvCacheCompressor::new(config: KvCacheConfig) -> Result<Self>

Error Handling

All fallible operations return Result<T, TurboQuantError>. Error variants:

Variant              Cause
DimensionMismatch    Input vector length does not match quantizer dimension
OddDimension         Dimension must be even for polar pair encoding
ZeroDimension        Dimension must be non-zero
InvalidBitWidth      Bits must be in range 1-16
ZeroProjectionCount  Projections must be non-zero
RotationFailed       Internal error during rotation matrix generation
MalformedCode        Compressed code failed validation
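The constructor validation implied by these variants can be sketched as follows. This is a std-only mirror for illustration (the real TurboQuantError lives in turbo_quant::error and its variants may carry payloads such as the offending dimensions); the variant names follow the table above.

```rust
// Mirror of the documented error variants, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TurboQuantError {
    DimensionMismatch,
    OddDimension,
    ZeroDimension,
    InvalidBitWidth,
    ZeroProjectionCount,
    RotationFailed,
    MalformedCode,
}

// Parameter checks implied by the table: even non-zero dimension,
// bits in 1..=16, non-zero projection count.
fn validate(dim: usize, bits: u8, projections: usize)
    -> Result<(), TurboQuantError>
{
    if dim == 0 {
        return Err(TurboQuantError::ZeroDimension);
    }
    if dim % 2 != 0 {
        return Err(TurboQuantError::OddDimension);
    }
    if bits == 0 || bits > 16 {
        return Err(TurboQuantError::InvalidBitWidth);
    }
    if projections == 0 {
        return Err(TurboQuantError::ZeroProjectionCount);
    }
    Ok(())
}

fn main() {
    // A typical call-site match on the Result.
    match validate(1537, 8, 384) {
        Ok(()) => println!("parameters accepted"),
        Err(TurboQuantError::OddDimension) => println!("dimension must be even"),
        Err(e) => println!("rejected: {e:?}"),
    }
}
```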

Performance Characteristics

Time Complexity

Operation                 Complexity    Notes
new()                     O(d^2)        One-time cost: QR decomposition of d x d matrix
encode()                  O(d * m)      d = dimension, m = projections
inner_product_estimate()  O(d * m)      No decompression needed
decode()                  O(d)          PolarQuant only; TurboQuant decode is approximate
attention_scores()        O(n * d * m)  n = number of cached tokens

Space Complexity

Component        Size
Quantizer state  O(d^2) for the rotation matrix (~9 MB at d=1536)
PolarCode        d/2 * 4 bytes (radii) + d/2 * 2 bytes (angles) = 3d bytes
QjlSketch        m bytes (one sign byte per projection)
TurboCode        3d + m bytes total
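The size formulas above are easy to turn into a small calculator (a sketch, not the crate's own accounting; function names are illustrative):

```rust
// A PolarCode stores d/2 f32 radii plus d/2 two-byte angle codes = 3d bytes.
fn polar_code_bytes(dim: usize) -> usize {
    dim / 2 * 4 + dim / 2 * 2
}

// A TurboCode appends one sign byte per QJL projection.
fn turbo_code_bytes(dim: usize, projections: usize) -> usize {
    polar_code_bytes(dim) + projections
}

// Ratio against the uncompressed f32 vector (4 bytes per value).
fn compression_ratio(dim: usize, projections: usize) -> f64 {
    (dim * 4) as f64 / turbo_code_bytes(dim, projections) as f64
}

fn main() {
    // Matches the first row of the Compression Ratios table: 4992 B, 1.23x.
    println!("bytes = {}", turbo_code_bytes(1536, 384));
    println!("ratio = {:.2}x", compression_ratio(1536, 384));
}
```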

Compression Ratios

Dimension  Bits  Projections  Uncompressed  Compressed  Ratio
1536       8     384          6144 B        4992 B      1.23x
1536       4     192          6144 B        4800 B      1.28x
128        8     32           512 B         416 B       1.23x
128        4     16           512 B         400 B       1.28x

The primary value is not raw byte savings but the ability to compute inner products directly on compressed data, avoiding full decompression during search.


Testing

Run the full test suite:

cargo test

Run specific test categories:

cargo test --lib                     # Unit tests in each module
cargo test --test determinism        # Reproducibility and ordering tests
cargo test --test inner_product      # Accuracy and recall tests

Test Coverage

The test suite validates:

  • Determinism: Same seed produces identical codes across quantizer instances
  • Seed independence: Different seeds produce different codes
  • Encoding order independence: Encoding order does not affect individual codes
  • Rank preservation: Top-k ordering is preserved at 8 bits
  • Error scaling: Distortion decreases monotonically with bit width
  • Accuracy bounds: Relative error < 5% at 16 bits; top-10 recall >= 60% at 8 bits
  • Mathematical properties: Self inner product > 0, antipodal vectors yield negative estimates, L2 self-distance near zero
  • KV cache correctness: Attention score ordering, softmax normalization, dimension validation
  • Serialization roundtrips: JSON encode/decode for all code types
  • Edge cases: Zero vectors, unit vectors, empty caches, dimension mismatches

Project Structure

turbo-quant/
├── Cargo.toml                 # Package manifest and dependencies
├── README.md                  # This file
├── src/
│   ├── lib.rs                 # Library root, module exports, and top-level docs
│   ├── error.rs               # Error types (TurboQuantError) and Result alias
│   ├── rotation.rs            # StoredRotation: Haar-random orthogonal matrices via QR
│   ├── polar.rs               # PolarQuantizer: single-stage polar coordinate encoding
│   ├── qjl.rs                 # QjlQuantizer: 1-bit Johnson-Lindenstrauss sketching
│   ├── turbo.rs               # TurboQuantizer: two-stage PolarQuant + QJL compression
│   └── kv.rs                  # KvCacheCompressor: online KV cache for transformers
└── tests/
    ├── determinism.rs          # Reproducibility, seed, and ordering tests
    └── inner_product.rs        # Accuracy, recall, and mathematical property tests

Module Dependency Graph

lib.rs
 ├── error.rs          (standalone)
 ├── rotation.rs       (standalone, uses nalgebra + rand_chacha)
 ├── polar.rs          (depends on rotation, error)
 ├── qjl.rs            (depends on error, uses rand_chacha)
 ├── turbo.rs          (depends on polar, qjl, error)
 └── kv.rs             (depends on turbo, error)

References

  1. TurboQuant: Zandieh et al., "TurboQuant: Redefining AI Efficiency with Extreme Compression", ICLR 2026.
  2. PolarQuant: Zandieh et al., AISTATS 2026.
  3. QJL (Quantized Johnson-Lindenstrauss): Zandieh et al., AAAI 2025.

License

This project is licensed under the MIT License.