iqdb-eval 1.0.0

Benchmarking and evaluation: recall@k, latency, and throughput for vector indexes - part of the iQDB family.
  • recall@k, done correctly — compares an index's results against the true top-k from an exact iqdb-flat oracle (or against a known .ivecs ground-truth set); never approximated
  • Latency percentiles — mean / min / max and nearest-rank p50 / p95 / p99 in microseconds, with build cost excluded by construction
  • Throughput — single-thread queries-per-second over the measured query set
  • Index-agnostic — one generic surface measures any backend behind the Index / IndexCore traits
  • Standard datasets — zero-dependency loaders for the TEXMEX SIFT family (SIFT1M, GIST1M, siftsmall) in .fvecs / .ivecs format
  • Reproducible — deterministic aggregation and a documented VectorId::U64 row-index convention, so numbers are comparable across runs

Installation

[dependencies]
iqdb-eval = "1.0"

iqdb-eval takes its vocabulary — VectorId, SearchParams, DistanceMetric, IqdbError — from iqdb-types, the Index / IndexCore traits from iqdb-index, and the exact oracle from iqdb-flat. A typical consumer depends on all four:

[dependencies]
iqdb-eval  = "1.0"
iqdb-flat  = "1.0"   # the exact oracle (and a fine first index under test)
iqdb-index = "1.0"   # the Index / IndexCore traits
iqdb-types = "1.0"   # VectorId, SearchParams, DistanceMetric, ...

MSRV is Rust 1.87 (edition 2024). The crate is std-only; the optional serde feature derives Serialize / Deserialize on the report types.

Quick Start

Build the index under test and an exact oracle from the same base set, then ask the harness for recall@k and latency:

use iqdb_eval::{build_index_from_base, latency, recall_at_k_vs_oracle, LatencyConfig};
use iqdb_flat::{FlatConfig, FlatIndex};
use iqdb_types::{DistanceMetric, SearchParams};

fn main() -> Result<(), iqdb_eval::EvalError> {
    let base: Vec<Vec<f32>> = vec![vec![0.0, 0.0], vec![3.0, 4.0], vec![1.0, 1.0]];
    let queries: Vec<Vec<f32>> = vec![vec![0.5, 0.5]];
    let metric = DistanceMetric::Euclidean;

    // The index under test and an exact oracle, built identically.
    let target: FlatIndex = build_index_from_base(FlatConfig, 2, metric, &base)?;
    let oracle: FlatIndex = build_index_from_base(FlatConfig, 2, metric, &base)?;
    let params = SearchParams::new(2, metric);

    // recall@k against the oracle's true top-k.
    let recall = recall_at_k_vs_oracle(&target, &oracle, &queries, &params)?;
    assert_eq!(recall.mean_recall, 1.0); // flat is exact

    // Latency percentiles (build cost is excluded — `target` is borrowed).
    let lat = latency(&target, &queries, &params, &LatencyConfig::default())?;
    assert!(lat.p50_us <= lat.p95_us);
    Ok(())
}

The complete surface — every function, parameter, error, and more examples — is in docs/API.md.

Measuring an approximate index

Swap the target for any backend behind the Index / IndexCore traits; the oracle stays flat. recall@k now reports how much accuracy the approximate index trades for its speed:

use iqdb_eval::{build_index_from_base, recall_at_k_vs_oracle};
use iqdb_flat::{FlatConfig, FlatIndex};
use iqdb_hnsw::{HnswConfig, HnswIndex};
use iqdb_types::{DistanceMetric, SearchParams};

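// `dim`, `base`, and `queries` are assumed to be in scope, inside a function that returns a Result.
let metric = DistanceMetric::Euclidean;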
let target: HnswIndex = build_index_from_base(HnswConfig::default(), dim, metric, &base)?;
let oracle: FlatIndex = build_index_from_base(FlatConfig, dim, metric, &base)?;
let params = SearchParams::new(10, metric);

let report = recall_at_k_vs_oracle(&target, &oracle, &queries, &params)?;
println!("recall@10 = {:.4}", report.mean_recall);

The one rule: build both indexes with build_index_from_base (or insert each base row at VectorId::U64(row_index) by hand). That convention is what lets .ivecs ground-truth ids line up with the ids search returns.
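
To see why the row-index convention matters, here is a minimal standalone sketch of recall@k computed over plain u64 ids. It is not the crate's implementation: recall_for_query and mean_recall are hypothetical helper names, u64 stands in for VectorId::U64(row_index), and the data is made up. Each ground-truth row holds base-row indices, as an .ivecs file does, so the comparison only works if the index returns those same row indices as ids.

```rust
use std::collections::HashSet;

/// Recall@k for one query: the fraction of the true top-k ids that also
/// appear among the first k retrieved ids.
/// (Hypothetical helper; ids are plain row indices.)
fn recall_for_query(truth: &[u64], retrieved: &[u64], k: usize) -> f64 {
    let k = k.min(truth.len());
    if k == 0 {
        return 0.0;
    }
    // Hash the true top-k once, then test each retrieved id for membership.
    let truth_set: HashSet<u64> = truth[..k].iter().copied().collect();
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| truth_set.contains(*id))
        .count();
    hits as f64 / k as f64
}

/// Mean of the per-query recall values. (Hypothetical helper.)
fn mean_recall(truth: &[Vec<u64>], retrieved: &[Vec<u64>], k: usize) -> f64 {
    assert_eq!(truth.len(), retrieved.len(), "one result list per query");
    if truth.is_empty() {
        return 0.0;
    }
    let total: f64 = truth
        .iter()
        .zip(retrieved)
        .map(|(t, r)| recall_for_query(t, r, k))
        .sum();
    total / truth.len() as f64
}

fn main() {
    // Two queries; each ground-truth row lists base-row indices.
    let truth: Vec<Vec<u64>> = vec![vec![0, 2, 1], vec![5, 3, 4]];
    // Ids returned by an imagined approximate index: the first query misses one neighbour.
    let retrieved: Vec<Vec<u64>> = vec![vec![2, 0, 7], vec![5, 3, 4]];
    println!("recall@3 = {:.4}", mean_recall(&truth, &retrieved, 3));
}
```

If the index had been populated with ids other than the row index, the same neighbours would be found but none of the ids would match, and recall would read as zero.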

Standard datasets

The loaders read the TEXMEX corpus layout — a little-endian u32 dim header followed by dim payload values per record — shared by SIFT1M, GIST1M, and siftsmall. Point load_sift_dataset at a directory and a prefix; it resolves {prefix}_base.fvecs, {prefix}_query.fvecs, and {prefix}_groundtruth.ivecs, validates dimensions and lengths, and returns a SiftDataset:

use iqdb_eval::load_sift_dataset;

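fn run() -> Result<(), iqdb_eval::EvalError> {
    // Resolves siftsmall_base.fvecs, siftsmall_query.fvecs, and siftsmall_groundtruth.ivecs.
    let data = load_sift_dataset(".bench-data/siftsmall", "siftsmall")?;
    // One ground-truth row per query.
    assert_eq!(data.queries.len(), data.ground_truth.len());
    Ok(())
}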

Datasets are read from local files; downloading and caching them is left to the caller (so the crate pulls in no network dependency). read_fvecs and read_ivecs are available directly for non-standard layouts.
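
For readers who have not met the record layout before, the following is a minimal sketch of an .fvecs parser using only the standard library. It illustrates the format and is not the crate's read_fvecs: parse_fvecs and truncated are hypothetical names, and the sketch assumes f32 payload values (the .ivecs variant stores i32 values in the same framing) and reads the whole input into memory.

```rust
use std::io::{self, Read};

/// Build the error returned when a record is cut short. (Hypothetical helper.)
fn truncated(what: &str) -> io::Error {
    io::Error::new(io::ErrorKind::UnexpectedEof, format!("truncated {what}"))
}

/// Parse .fvecs records: each record is a little-endian u32 dimension
/// followed by that many little-endian f32 values.
/// (Hypothetical helper, not the crate's read_fvecs.)
fn parse_fvecs<R: Read>(mut reader: R) -> io::Result<Vec<Vec<f32>>> {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?;

    let mut rows = Vec::new();
    let mut pos = 0usize;
    while pos < bytes.len() {
        // 4-byte dimension header.
        let header = bytes
            .get(pos..pos + 4)
            .ok_or_else(|| truncated("dim header"))?;
        let dim = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
        pos += 4;

        // `dim` payload values of 4 bytes each.
        let payload = bytes
            .get(pos..pos + dim * 4)
            .ok_or_else(|| truncated("payload"))?;
        let row: Vec<f32> = payload
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        pos += dim * 4;
        rows.push(row);
    }
    Ok(rows)
}

fn main() -> io::Result<()> {
    // Encode two 2-dimensional records in memory, then parse them back.
    let mut buf = Vec::new();
    for row in [[1.0f32, 2.0], [3.0, 4.0]] {
        buf.extend_from_slice(&(row.len() as u32).to_le_bytes());
        for v in row {
            buf.extend_from_slice(&v.to_le_bytes());
        }
    }
    let rows = parse_fvecs(buf.as_slice())?;
    println!("parsed {} rows of dim {}", rows.len(), rows[0].len());
    Ok(())
}
```

A production loader would also check that every record carries the same dimension, which this sketch leaves out.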

Tiered API

  • Tier 1 — the lazy path. build_index_from_base + recall_at_k_vs_oracle + latency cover the whole common case in three calls.
  • Tier 2 — the configured path. Precompute ground truth once with compute_ground_truth and reuse it across recall_at_k; tune the timing loop with LatencyConfig { warmup }; load standard corpora with load_sift_dataset / read_fvecs / read_ivecs.
  • Tier 3 — the trait seam. Everything is generic over iqdb_index::IndexCore (and Index for construction), so any custom backend behind those traits is measurable with no extra wiring.

Performance

  • The harness is thin. A measurement run's cost is dominated by the index search calls it drives; iqdb-eval adds only an O(k) set-membership check per query for recall and a single sort for latency percentiles.
  • No allocation in the timing window. latency records into a pre-sized sample buffer; the index is borrowed, so build cost is never timed.
  • Recall sets are hashed once. Each query's true top-k is loaded into a HashSet<u64>, and the retrieved hits are tested for membership against it, which is linear in the result size.
  • Nearest-rank percentiles. Every reported percentile is an observed sample (clamp(ceil(q·n) − 1, 0, n − 1)), never an interpolation.
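
The timing and percentile points above can be illustrated with a short standalone sketch. It is an assumption-laden illustration rather than the crate's code: nearest_rank and time_queries are hypothetical names, the warmup handling is a guess at what LatencyConfig { warmup } controls, and the closure stands in for a real index search call.

```rust
use std::time::Instant;

/// Nearest-rank percentile over an ascending-sorted slice:
/// index = clamp(ceil(q * n) - 1, 0, n - 1), so the result is always an
/// observed sample. (Hypothetical helper.)
fn nearest_rank(sorted: &[u64], q: f64) -> Option<u64> {
    let n = sorted.len();
    if n == 0 {
        return None;
    }
    // 1-based rank; the float-to-usize cast saturates at 0 for negative input.
    let rank = (q * n as f64).ceil() as usize;
    let idx = rank.saturating_sub(1).min(n - 1);
    Some(sorted[idx])
}

/// Time each call of `search` in microseconds and return the sorted samples.
/// The buffer is sized before the loop, so no allocation happens while timing.
/// (Hypothetical helper; the warmup behaviour is an assumption.)
fn time_queries<F: FnMut(usize)>(n_queries: usize, warmup: usize, mut search: F) -> Vec<u64> {
    // Untimed warmup passes over the first few queries.
    for i in 0..warmup.min(n_queries) {
        search(i);
    }
    let mut samples = Vec::with_capacity(n_queries);
    for i in 0..n_queries {
        let start = Instant::now();
        search(i);
        samples.push(start.elapsed().as_micros() as u64);
    }
    // A single sort serves every percentile.
    samples.sort_unstable();
    samples
}

fn main() {
    // Fixed samples so the percentile arithmetic is easy to follow.
    let sorted: [u64; 10] = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
    println!("p50 = {:?}", nearest_rank(&sorted, 0.50)); // ceil(5.0) - 1 = index 4
    println!("p95 = {:?}", nearest_rank(&sorted, 0.95)); // ceil(9.5) - 1 = index 9

    // Time a stand-in workload; a real run would call the index's search here.
    let samples = time_queries(100, 10, |i| {
        std::hint::black_box(i * 2);
    });
    println!("timed {} queries", samples.len());
}
```

Because the rank is rounded up and clamped, a small sample set reports its maximum as p95 and p99 instead of inventing an interpolated value between two measurements.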

Benchmarks live in benches/eval_bench.rs (cargo bench).

Examples

Runnable end-to-end programs in examples/:

Example                    Shows
recall_quickstart          recall@k against the exact iqdb-flat oracle
latency_report             latency percentiles + single-thread QPS
precomputed_ground_truth   compute ground truth once, sweep recall across several k
multi_metric               comparing latency across distance metrics on one corpus
serde_report               serializing reports to JSON (--features serde)
sift_eval                  loading a real SIFT dataset and evaluating it end to end

cargo run --example recall_quickstart

Status

v1.0.0 is stable: recall@k against an exact oracle, latency percentiles and throughput, and the TEXMEX SIFT-family loaders are committed under the SemVer 1.x guarantee, so only additive, non-breaking changes are made until 2.0. The surface is covered by unit, property-based, differential (against the exact iqdb-flat oracle), and real-corpus integration tests, plus a runnable examples/ suite, and is recorded in the ROADMAP.

Where It Fits

iqdb-eval is a Phase-4 evaluation tool. It builds on:

  • iqdb-types — core types
  • iqdb-index — generic over any index via the Index / IndexCore traits
  • iqdb-flat — exact ground-truth generation

Standards

Built to the iQDB Rust standard. See REPS.md (Rust Efficiency & Performance Standards) and dev/DIRECTIVES.md for the engineering law and the definition of done. Before a PR: cargo fmt --all, cargo clippy --all-targets --all-features -- -D warnings, and cargo test --all-features must be clean.