rabitq-rs 0.1.2

Rust implementation of the RaBitQ quantization scheme with IVF search tooling

RaBitQ Rust Library

This crate provides a pure-Rust implementation of the RaBitQ quantization scheme and an IVF + RaBitQ searcher that mirrors the behavior of the C++ RaBitQ Library. The library focuses on efficient approximate nearest-neighbor search for high-dimensional vectors and now ships with tooling to reproduce the GIST benchmark pipeline described in example.sh.

Highlights

  • Full IVF + RaBitQ searcher – the IvfRabitqIndex supports both L2 and inner-product metrics, fastscan-style pruning, and optional extended codes.
  • Pre-clustered training support – IvfRabitqIndex::train_with_clusters lets you reuse centroids and cluster assignments generated by external tooling (e.g. the python/ivf.py helper that wraps FAISS), matching the workflow used by the upstream C++ library.
  • Dataset utilities – the new rabitq_rs::io module parses .fvecs and .ivecs files, including convenience helpers for cluster-id lists and ground-truth tables.
  • Command-line evaluation – cargo run --bin gist builds an IVF + RaBitQ index from the GIST dataset and reports recall and throughput for a configurable nprobe / top-k budget.

Quick start

Add the crate to your project by adding rabitq-rs from crates.io, pointing Cargo.toml at this repository, or linking to a local checkout. The snippet below constructs an IVF index from randomly generated vectors, queries it, and prints the nearest neighbour id.

use rabitq_rs::ivf::{IvfRabitqIndex, SearchParams};
use rabitq_rs::Metric;
use rand::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rng = StdRng::seed_from_u64(42);
    let dim = 32;
    let dataset: Vec<Vec<f32>> = (0..1_000)
        .map(|_| (0..dim).map(|_| rng.gen::<f32>() * 2.0 - 1.0).collect())
        .collect();

    let index = IvfRabitqIndex::train(&dataset, 64, 7, Metric::L2, 7_654)?; // 64 IVF lists, 7-bit codes, fixed seed
    let params = SearchParams::new(10, 32); // top-k and probe budget
    let results = index.search(&dataset[0], params)?;

    println!("nearest neighbour id: {}", results[0].id);
    Ok(())
}

Training with pre-computed clusters

When you already have k-means centroids and assignments (for example produced by FAISS), call train_with_clusters:

use rabitq_rs::ivf::IvfRabitqIndex;
use rabitq_rs::Metric;

let index = IvfRabitqIndex::train_with_clusters(
    &dataset,
    &centroids,      // Vec<Vec<f32>> with shape [nlist, dim]
    &assignments,    // Vec<usize> with length dataset.len()
    7,               // total quantisation bits
    Metric::L2,
    0xFEED_FACE,     // rotation seed
)?;

Reproducing the GIST IVF + RaBitQ benchmark

Follow the same data preparation steps shown in example.sh:

  1. Download and unpack the dataset

    mkdir -p data/gist
    wget -P data/gist ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
    tar -xzvf data/gist/gist.tar.gz -C data/gist
    

    If FTP is blocked in your environment, fetch the files from an alternative mirror and place them under data/gist/ with the same filenames (gist_base.fvecs, gist_query.fvecs, gist_groundtruth.ivecs).

After the dataset is in place you can choose between two training workflows:

Option 1: Use pre-computed clusters (FAISS-compatible)

  1. Cluster the base vectors – the helper script mirrors the FAISS call used by the C++ sample:

    python python/ivf.py \
        data/gist/gist_base.fvecs \
        4096 \
        data/gist/gist_centroids_4096.fvecs \
        data/gist/gist_clusterids_4096.ivecs \
        l2
    

    (Swap l2 for ip if you plan to evaluate inner-product similarity.)

  2. Build and evaluate the Rust index – the CLI supports limiting the number of base vectors and queries so you can perform a smoke test without loading the full 1M-vector dataset:

    cargo run --release --bin gist -- \
        --base data/gist/gist_base.fvecs \
        --centroids data/gist/gist_centroids_4096.fvecs \
        --assignments data/gist/gist_clusterids_4096.ivecs \
        --queries data/gist/gist_query.fvecs \
        --groundtruth data/gist/gist_groundtruth.ivecs \
        --bits 7 \
        --top-k 100 \
        --nprobe 1024 \
        --metric l2 \
        --max-base 1000000 \
        --max-queries 200 \
        --seed 1337
    

    The command prints the construction time, the evaluated recall@top-k, and the observed queries-per-second. Remove the --max-base / --max-queries limits to run the full benchmark once you are comfortable with the workflow.

Option 2: Train everything in Rust (no pre-computed centroids)

Skip the Python/FAISS clustering step and let the crate execute k-means internally. Provide the desired IVF list count via --nlist:

cargo run --release --bin gist -- \
    --base data/gist/gist_base.fvecs \
    --bits 7 \
    --nlist 4096 \
    --queries data/gist/gist_query.fvecs \
    --groundtruth data/gist/gist_groundtruth.ivecs \
    --top-k 100 \
    --nprobe 1024 \
    --metric l2 \
    --max-base 1000000 \
    --max-queries 200 \
    --seed 1337

The command mirrors the pre-clustered flow but runs k-means in-process via the Rust IvfRabitqIndex::train helper, so expect the build phase to take longer than the pre-clustered path. Once the index is trained, the evaluation output matches the format of the pre-computed mode.

All CLI options are documented in cargo run --bin gist -- --help.

Persisting trained indexes

Use the persistence hooks to avoid retraining between benchmarking runs:

cargo run --release --bin gist -- \
    --base data/gist/gist_base.fvecs \
    --bits 7 \
    --nlist 4096 \
    --queries data/gist/gist_query.fvecs \
    --groundtruth data/gist/gist_groundtruth.ivecs \
    --top-k 100 \
    --nprobe 1024 \
    --metric l2 \
    --max-base 1000000 \
    --max-queries 200 \
    --seed 1337 \
    --save-index data/gist/gist_rbq.idx

cargo run --release --bin gist -- \
    --base data/gist/gist_base.fvecs \
    --queries data/gist/gist_query.fvecs \
    --groundtruth data/gist/gist_groundtruth.ivecs \
    --top-k 100 \
    --nprobe 1024 \
    --metric l2 \
    --max-base 1000000 \
    --max-queries 200 \
    --load-index data/gist/gist_rbq.idx

The first command trains with the Rust pipeline, writes the persisted index, and records the benchmark. The second command reuses the saved index for subsequent recall sweeps or profiling runs.

Testing and linting

The test suite now includes regression checks for the dataset readers and the pre-clustered IVF flow. Run the full suite along with the standard linters before submitting changes:

cargo fmt
cargo clippy --all-targets --all-features
cargo test

For dataset-backed evaluation, invoke the gist binary as described above.

Publishing to crates.io

The crate is configured for publication on crates.io. Before publishing a new release:

  1. Update the version – bump the version field in Cargo.toml following semantic versioning.

  2. Log in to crates.io – authenticate once per workstation:

    cargo login <your-api-token>
    
  3. Validate the package – ensure the crate builds cleanly and packages without missing files:

    cargo fmt
    cargo clippy --all-targets --all-features
    cargo test
    cargo package
    

    Inspect the generated .crate archive under target/package/ if you need to double-check the bundle contents.

  4. Publish – when you are ready, push the package live:

    cargo publish
    

If you need to yank a release, run cargo yank --vers <version> (optionally with --undo). Remember that published versions are immutable, so double-check the README and API docs before releasing.

Project structure

src/
  bin/gist.rs     # CLI for building & evaluating IVF + RaBitQ on GIST
  io.rs           # .fvecs/.ivecs readers and helpers
  ivf.rs          # IVF + RaBitQ searcher and training routines
  kmeans.rs       # Lightweight k-means used for in-crate training
  math.rs         # Vector math helpers
  quantizer.rs    # Core RaBitQ quantisation logic
  rotation.rs     # Random orthonormal rotator

Refer to README.origin.md for the original upstream documentation.