- recall@k, done correctly: compares an index's results against the true top-k from an exact iqdb-flat oracle (or against a known .ivecs ground-truth set); never approximated
- Latency percentiles: mean / min / max and nearest-rank p50 / p95 / p99 in microseconds, with build cost excluded by construction
- Throughput: single-thread queries-per-second over the measured query set
- Index-agnostic: one generic surface measures any backend behind the Index / IndexCore traits
- Standard datasets: zero-dependency loaders for the TEXMEX SIFT family (SIFT1M, GIST1M, siftsmall) in .fvecs / .ivecs format
- Reproducible: deterministic aggregation and a documented VectorId::U64 row-index convention, so numbers are comparable across runs
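As a rough illustration of the latency and throughput figures listed above, the sketch below reduces a set of per-query timings to mean / min / max in microseconds and a single-thread QPS. `LatencySummary` and `summarize` are hypothetical names for this example, not the crate's report types, and it assumes throughput is the query count divided by the summed per-query time.

```rust
use std::time::Duration;

/// Hypothetical summary type for this sketch; not the crate's report type.
#[derive(Debug)]
pub struct LatencySummary {
    pub mean_us: f64,
    pub min_us: f64,
    pub max_us: f64,
    pub qps: f64,
}

/// Summarize per-query wall-clock samples. Returns None for an empty set.
pub fn summarize(samples: &[Duration]) -> Option<LatencySummary> {
    if samples.is_empty() {
        return None;
    }
    // Convert a duration to fractional microseconds.
    let to_us = |d: &Duration| d.as_secs_f64() * 1e6;
    let total: Duration = samples.iter().sum();
    let min = samples.iter().min()?;
    let max = samples.iter().max()?;
    let total_secs = total.as_secs_f64();
    Some(LatencySummary {
        mean_us: to_us(&total) / samples.len() as f64,
        min_us: to_us(min),
        max_us: to_us(max),
        // Assumed definition: queries divided by the summed query time.
        qps: if total_secs > 0.0 {
            samples.len() as f64 / total_secs
        } else {
            f64::INFINITY
        },
    })
}
```

Only the timed search calls feed the sample slice, which is how build cost stays out of the numbers in a design like this.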
Installation
[dependencies]
iqdb-eval = "1.0"
iqdb-eval takes its vocabulary — VectorId, SearchParams, DistanceMetric, IqdbError — from iqdb-types, the Index / IndexCore traits from iqdb-index, and the exact oracle from iqdb-flat. A typical consumer depends on all four:
[dependencies]
iqdb-eval = "1.0"
iqdb-flat = "1.0" # the exact oracle (and a fine first index under test)
iqdb-index = "1.0" # the Index / IndexCore traits
iqdb-types = "1.0" # VectorId, SearchParams, DistanceMetric, ...
MSRV is Rust 1.87 (edition 2024). The crate is std-only; the optional serde feature derives Serialize / Deserialize on the report types.
Quick Start
Build the index under test and an exact oracle from the same base set, then ask the harness for recall@k and latency:
use ;
use ;
use ;
The complete surface — every function, parameter, error, and more examples — is
in docs/API.md.
Measuring an approximate index
Swap the target for any backend behind the Index / IndexCore traits; the
oracle stays flat. recall@k now reports how much accuracy the approximate index
trades for its speed:
use ;
use ;
use ;
use ;
let metric = Euclidean;
let target: HnswIndex = build_index_from_base?;
let oracle: FlatIndex = build_index_from_base?;
let params = new;
let report = recall_at_k_vs_oracle?;
println!;
The one rule: build both indexes with build_index_from_base (or insert each base row at VectorId::U64(row_index) by hand). That convention is what lets .ivecs ground-truth ids line up with the ids search returns.
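To see why the shared id convention matters, here is a minimal sketch of recall@k as a set-overlap computation over raw `u64` ids. The function names are hypothetical and the signatures are not the crate's API. It assumes each retrieved list contains unique ids. If the index under test and the ground truth numbered rows differently, the membership test below would silently report near-zero recall.

```rust
use std::collections::HashSet;

/// recall@k for one query: the fraction of the true top-k ids that appear
/// among the first k retrieved ids. Assumes retrieved ids are unique.
pub fn recall_at_k(truth: &[u64], retrieved: &[u64], k: usize) -> f64 {
    // Hash the true top-k once, then test each retrieved id against it.
    let truth_k: HashSet<u64> = truth.iter().take(k).copied().collect();
    if truth_k.is_empty() {
        return 0.0;
    }
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| truth_k.contains(*id))
        .count();
    hits as f64 / truth_k.len() as f64
}

/// Mean recall@k over a query set; `truth[i]` and `retrieved[i]` belong to
/// the same query. Returns 0.0 for an empty query set.
pub fn mean_recall_at_k(truth: &[Vec<u64>], retrieved: &[Vec<u64>], k: usize) -> f64 {
    if truth.is_empty() {
        return 0.0;
    }
    let sum: f64 = truth
        .iter()
        .zip(retrieved)
        .map(|(t, r)| recall_at_k(t, r, k))
        .sum();
    sum / truth.len() as f64
}
```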
Standard datasets
The loaders read the TEXMEX corpus layout — a little-endian u32 dim header
followed by dim payload values per record — shared by SIFT1M, GIST1M, and
siftsmall. Point load_sift_dataset at a directory and a prefix; it resolves
{prefix}_base.fvecs, {prefix}_query.fvecs, and {prefix}_groundtruth.ivecs,
validates dimensions and lengths, and returns a SiftDataset:
use load_sift_dataset;
#
Datasets are read from local files; downloading and caching them is left to the
caller (so the crate pulls in no network dependency). read_fvecs and
read_ivecs are available directly for non-standard layouts.
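The record layout described above can be sketched as a small in-memory parser. `parse_fvecs` is a hypothetical helper written for this example, not the crate's `read_fvecs`, whose signature and error type may differ. It assumes `.fvecs` payload values are little-endian `f32` and rejects records whose dimension differs from the first record.

```rust
/// Parse an in-memory .fvecs buffer: each record is a little-endian u32
/// dimension followed by that many little-endian f32 values.
/// Hypothetical helper for illustration only.
pub fn parse_fvecs(bytes: &[u8]) -> Result<Vec<Vec<f32>>, String> {
    let mut rows: Vec<Vec<f32>> = Vec::new();
    let mut pos = 0usize;
    while pos < bytes.len() {
        // Read the 4-byte dimension header.
        let header = bytes
            .get(pos..pos + 4)
            .ok_or("truncated dimension header")?;
        let dim = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
        pos += 4;
        // Read `dim` 4-byte floats.
        let payload = bytes
            .get(pos..pos + dim * 4)
            .ok_or("truncated record payload")?;
        let row: Vec<f32> = payload
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        // Every record must share the first record's dimension.
        if let Some(first) = rows.first() {
            if first.len() != row.len() {
                return Err(format!(
                    "dimension mismatch: expected {}, found {}",
                    first.len(),
                    row.len()
                ));
            }
        }
        pos += dim * 4;
        rows.push(row);
    }
    Ok(rows)
}
```

An `.ivecs` parser would follow the same shape with integer payload values in place of `f32`.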
Tiered API
- Tier 1: the lazy path. build_index_from_base, recall_at_k_vs_oracle, and latency cover the whole common case in three calls.
- Tier 2: the configured path. Precompute ground truth once with compute_ground_truth and reuse it across recall_at_k calls; tune the timing loop with LatencyConfig { warmup }; load standard corpora with load_sift_dataset / read_fvecs / read_ivecs.
- Tier 3: the trait seam. Everything is generic over iqdb_index::IndexCore (and Index for construction), so any custom backend behind those traits is measurable with no extra wiring.
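The trait seam and the warmup knob can be pictured with a small generic timing loop. Everything here is an assumption made for illustration: `Searcher` is a hypothetical stand-in for the search half of an index trait (the real IndexCore trait may look quite different), `time_queries` is not a crate function, and `warmup` is taken to mean untimed passes over the query set.

```rust
use std::cell::Cell;
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an index's search interface.
pub trait Searcher {
    fn search(&self, query: &[f32], k: usize) -> Vec<u64>;
}

/// Time each query once, after `warmup` untimed passes over the query set.
/// The index is only borrowed, so building it is never part of a sample.
pub fn time_queries<S: Searcher>(
    index: &S,
    queries: &[Vec<f32>],
    k: usize,
    warmup: usize,
) -> Vec<Duration> {
    for _ in 0..warmup {
        for q in queries {
            black_box(index.search(q, k));
        }
    }
    // Pre-size the buffer so recording a sample never reallocates.
    let mut samples = Vec::with_capacity(queries.len());
    for q in queries {
        let start = Instant::now();
        let hits = index.search(q, k);
        samples.push(start.elapsed());
        // Keep the result observable so the call is not optimized away.
        black_box(hits);
    }
    samples
}

/// Toy backend that counts search calls and returns ids 0..k.
pub struct CountingIndex {
    pub calls: Cell<usize>,
}

impl Searcher for CountingIndex {
    fn search(&self, _query: &[f32], k: usize) -> Vec<u64> {
        self.calls.set(self.calls.get() + 1);
        (0..k as u64).collect()
    }
}
```

Any type that implements the trait can be passed to the same loop, which is the property the third tier relies on.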
Performance
- The harness is thin. A measurement run's cost is dominated by the index search calls it drives; iqdb-eval adds only an O(k) set-membership check per query for recall and a single sort for latency percentiles.
- No allocation in the timing window. latency records into a pre-sized sample buffer; the index is borrowed, so build cost is never timed.
- Recall sets are hashed once. Each query's true top-k goes into a HashSet<u64>, and the retrieved hits are checked against it, so the cost is linear in the result size.
- Nearest-rank percentiles. Every reported percentile is an observed sample (clamp(ceil(q·n) − 1, 0, n − 1)), never an interpolation.
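The nearest-rank rule in the last bullet can be sketched directly from its formula. The names `nearest_rank` and `percentiles_us` are hypothetical, and the sketch assumes `q` is a fraction in `[0, 1]`.

```rust
/// Nearest-rank percentile over an ascending-sorted slice.
/// index = clamp(ceil(q * n) - 1, 0, n - 1), so the result is always one
/// of the observed samples. Returns None for empty input or q outside [0, 1].
pub fn nearest_rank<T: Copy>(sorted: &[T], q: f64) -> Option<T> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let n = sorted.len();
    // ceil(q * n) is 0 when q == 0; saturating_sub clamps the low end.
    let rank = (q * n as f64).ceil() as usize;
    let idx = rank.saturating_sub(1).min(n - 1);
    Some(sorted[idx])
}

/// Sort the samples once, then read p50 / p95 / p99 by nearest rank.
pub fn percentiles_us(mut samples_us: Vec<u64>) -> Option<(u64, u64, u64)> {
    samples_us.sort_unstable();
    Some((
        nearest_rank(&samples_us, 0.50)?,
        nearest_rank(&samples_us, 0.95)?,
        nearest_rank(&samples_us, 0.99)?,
    ))
}
```

With ten samples, p95 and p99 both land on the largest observation: ceil(0.95 · 10) and ceil(0.99 · 10) are both 10, so the index is 9.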
Benchmarks live in benches/eval_bench.rs (cargo bench).
Examples
Runnable end-to-end programs in examples/:
| Example | Shows |
|---|---|
| recall_quickstart | recall@k against the exact iqdb-flat oracle |
| latency_report | latency percentiles + single-thread QPS |
| precomputed_ground_truth | compute ground truth once, sweep recall across several k |
| multi_metric | comparing latency across distance metrics on one corpus |
| serde_report | serializing reports to JSON (--features serde) |
| sift_eval | loading a real SIFT dataset and evaluating it end to end |
Status
v1.0.0 is stable: recall@k against an exact oracle, latency percentiles
and throughput, and the TEXMEX SIFT-family loaders are committed under the SemVer
1.x guarantee — no breaking changes until 2.0. The surface is covered by unit,
property-based, differential (against the exact iqdb-flat oracle), and
real-corpus integration tests, plus a runnable
examples/ suite, and is recorded in the
ROADMAP. Only additive, non-breaking
changes are made within 1.x.
Where It Fits
iqdb-eval is a Phase-4 evaluation tool. It builds on:
- iqdb-types: core types
- iqdb-index: generic over any index via the Index / IndexCore traits
- iqdb-flat: exact ground-truth generation
Standards
Built to the iQDB Rust standard. See REPS.md (Rust Efficiency & Performance Standards) and dev/DIRECTIVES.md for the engineering law and the definition of done. Before a PR: cargo fmt --all, cargo clippy --all-targets --all-features -- -D warnings, and cargo test --all-features must be clean.