RaBitQ Rust Library

This crate provides a pure-Rust implementation of the RaBitQ quantisation scheme and an IVF + RaBitQ searcher that mirrors the behaviour of the reference C++ pipeline. The library focuses on efficient approximate nearest-neighbour search for high-dimensional vectors and ships with tooling to reproduce the GIST benchmark pipeline described in example.sh. The crates.io package is distributed as rabitq-rs because the rabitq name was already taken before this Rust port was published in 2025.

Highlights

Full IVF + RaBitQ searcher – the IvfRabitqIndex supports both L2 and inner-product metrics, fastscan-style pruning, and optional extended codes.
Pre-clustered training support – IvfRabitqIndex::train_with_clusters lets you reuse centroids and cluster assignments generated by external tooling (e.g. the python/ivf.py helper that wraps FAISS), matching the workflow used by the C++ binaries in this repository.
Dataset utilities – the new rabitq_rs::io module parses .fvecs and .ivecs files, including convenience helpers for cluster-id lists and ground-truth tables.
Command-line evaluation – cargo run --bin gist builds an IVF + RaBitQ index from the GIST dataset and reports recall and throughput for a configurable nprobe / top-k budget.
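For readers unfamiliar with the dataset files mentioned above: an .fvecs file is a sequence of records, each a little-endian 32-bit integer dimension followed by that many little-endian f32 components (.ivecs uses the same layout with i32 components). The sketch below decodes that layout from an in-memory byte buffer. It is an independent illustration, not the code in rabitq_rs::io, and the function name parse_fvecs is hypothetical.

```rust
/// Decode the raw bytes of an `.fvecs` file into a list of vectors.
/// Hypothetical helper for illustration; not part of the rabitq_rs API.
fn parse_fvecs(bytes: &[u8]) -> Result<Vec<Vec<f32>>, String> {
    let mut vectors = Vec::new();
    let mut offset = 0usize;

    while offset < bytes.len() {
        // Every record starts with a 4-byte little-endian dimension.
        if bytes.len() - offset < 4 {
            return Err(format!("truncated header at byte {offset}"));
        }
        let dim = i32::from_le_bytes([
            bytes[offset],
            bytes[offset + 1],
            bytes[offset + 2],
            bytes[offset + 3],
        ]);
        if dim <= 0 {
            return Err(format!("invalid dimension {dim} at byte {offset}"));
        }
        offset += 4;

        // The payload is `dim` little-endian f32 values.
        let payload = dim as usize * 4;
        if bytes.len() - offset < payload {
            return Err(format!("truncated vector at byte {offset}"));
        }
        let vector = bytes[offset..offset + payload]
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        vectors.push(vector);
        offset += payload;
    }

    Ok(vectors)
}
```

To try it on a real file, read the bytes with std::fs::read and pass the resulting buffer to the function. For the full GIST base set a streaming reader would use less memory than loading everything at once.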

Quick start

Add the crate to your project by pointing Cargo.toml at this repository, adding rabitq-rs from crates.io, or linking to a local checkout. The snippet below constructs an IVF index from randomly generated vectors, queries it, and prints the nearest neighbour id.

use rabitq_rs::ivf::{IvfRabitqIndex, SearchParams};
use rabitq_rs::Metric;
use rand::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rng = StdRng::seed_from_u64(42);
    let dim = 32;
    let dataset: Vec<Vec<f32>> = (0..1_000)
        .map(|_| (0..dim).map(|_| rng.gen::<f32>() * 2.0 - 1.0).collect())
        .collect();

    let index = IvfRabitqIndex::train(&dataset, 64, 7, Metric::L2, 7_654)?;
    let params = SearchParams::new(10, 32);
    let results = index.search(&dataset[0], params)?;

    println!("nearest neighbour id: {}", results[0].id);
    Ok(())
}

Training with pre-computed clusters

When you already have k-means centroids and assignments (for example produced by FAISS), call train_with_clusters:

use rabitq_rs::ivf::IvfRabitqIndex;
use rabitq_rs::Metric;

let index = IvfRabitqIndex::train_with_clusters(
    &dataset,
    &centroids,      // Vec<Vec<f32>> with shape [nlist, dim]
    &assignments,    // Vec<usize> with length dataset.len()
    7,               // total quantisation bits
    Metric::L2,
    0xFEED_FACE,     // rotation seed
)?;

The new unit tests (preclustered_training_matches_naive_l2 and _ip) verify that the pre-clustered fastscan path matches the naïve reconstruction baseline for both distance metrics.
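Because the centroids and assignments come from external tooling, it can help to sanity-check their shapes before training. The sketch below checks the constraints stated in the comments of the call above (centroids shaped [nlist, dim], one assignment per dataset vector) plus the assumption that every cluster id must index an existing centroid. The helper check_cluster_inputs is hypothetical. Whether train_with_clusters performs similar checks internally is not covered here.

```rust
/// Verify that externally produced clusters are consistent with the dataset.
/// Hypothetical helper for illustration; not part of the rabitq_rs API.
fn check_cluster_inputs(
    dataset: &[Vec<f32>],
    centroids: &[Vec<f32>],
    assignments: &[usize],
) -> Result<(), String> {
    // The first vector defines the expected dimensionality.
    let dim = match dataset.first() {
        Some(v) => v.len(),
        None => return Err("dataset is empty".to_string()),
    };
    if dataset.iter().any(|v| v.len() != dim) {
        return Err("dataset vectors have inconsistent dimensions".to_string());
    }

    // Centroids must have shape [nlist, dim].
    if centroids.is_empty() {
        return Err("no centroids supplied".to_string());
    }
    if let Some(pos) = centroids.iter().position(|c| c.len() != dim) {
        return Err(format!(
            "centroid {pos} has dimension {}, expected {dim}",
            centroids[pos].len()
        ));
    }

    // Exactly one cluster id per dataset vector.
    if assignments.len() != dataset.len() {
        return Err(format!(
            "{} assignments for {} vectors",
            assignments.len(),
            dataset.len()
        ));
    }

    // Every cluster id must refer to an existing centroid.
    for (row, &id) in assignments.iter().enumerate() {
        if id >= centroids.len() {
            return Err(format!(
                "vector {row} is assigned to cluster {id}, but only {} centroids exist",
                centroids.len()
            ));
        }
    }
    Ok(())
}
```

Calling the check right after loading the files gives a readable error message early, before any training time has been spent.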

Reproducing the GIST IVF + RaBitQ benchmark

Follow the same data preparation steps shown in example.sh:

Download and unpack the dataset
```
mkdir -p data/gist
wget -P data/gist ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzvf data/gist/gist.tar.gz -C data/gist
```
If FTP is blocked in your environment, fetch the files from an alternative mirror and place them under data/gist/ with the same filenames (gist_base.fvecs, gist_query.fvecs, gist_groundtruth.ivecs).

After the dataset is in place you can choose between two training workflows:

Option 1: Use pre-computed clusters (FAISS-compatible)

Cluster the base vectors – the helper script mirrors the FAISS call used by the C++ sample:

python python/ivf.py \
    data/gist/gist_base.fvecs \
    4096 \
    data/gist/gist_centroids_4096.fvecs \
    data/gist/gist_clusterids_4096.ivecs \
    l2

(Swap l2 for ip if you plan to evaluate inner-product similarity.)

Build and evaluate the Rust index – the CLI supports limiting the number of base vectors and queries so you can perform a smoke test without loading the full 1M-vector dataset:

cargo run --release --bin gist -- \
    --base data/gist/gist_base.fvecs \
    --centroids data/gist/gist_centroids_4096.fvecs \
    --assignments data/gist/gist_clusterids_4096.ivecs \
    --queries data/gist/gist_query.fvecs \
    --groundtruth data/gist/gist_groundtruth.ivecs \
    --bits 7 \
    --top-k 100 \
    --nprobe 1024 \
    --metric l2 \
    --max-base 200000 \
    --max-queries 200 \
    --seed 1337

The command prints the construction time, the evaluated recall@top-k, and the observed queries-per-second. Remove the --max-base / --max-queries limits to run the full benchmark once you are comfortable with the workflow.
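As a reference for interpreting the reported number, recall@k is commonly defined as the fraction of the true k nearest neighbours that appear among the k returned ids, averaged over all queries. The sketch below implements that common definition. The exact formula used by the gist binary is not spelled out in this README, so treat the function (and its name, recall_at_k) as an assumption.

```rust
use std::collections::HashSet;

/// Average recall@k over all queries.
/// `results[i]` holds the ids returned for query i and `ground_truth[i]`
/// holds the true neighbours of query i, nearest first.
/// Hypothetical helper for illustration; not part of the rabitq_rs API.
fn recall_at_k(results: &[Vec<u32>], ground_truth: &[Vec<u32>], k: usize) -> f64 {
    let mut hits = 0usize;
    let mut total = 0usize;

    for (found, truth) in results.iter().zip(ground_truth) {
        // Only the first k entries on each side count.
        let expected: HashSet<u32> = truth.iter().take(k).copied().collect();
        let returned: HashSet<u32> = found.iter().take(k).copied().collect();

        // Sets make sure a duplicated id is never counted twice.
        hits += returned.intersection(&expected).count();
        total += expected.len();
    }

    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64
    }
}
```

With this definition, a value of 1.0 means every true neighbour was retrieved, and raising --nprobe generally trades throughput for higher recall.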

Option 2: Train everything in Rust (no pre-computed centroids)

Skip the Python/FAISS clustering step and let the crate execute k-means internally. Provide the desired IVF list count via --nlist:

cargo run --release --bin gist -- \
    --base data/gist/gist_base.fvecs \
    --bits 7 \
    --nlist 4096 \
    --queries data/gist/gist_query.fvecs \
    --groundtruth data/gist/gist_groundtruth.ivecs \
    --top-k 100 \
    --nprobe 1024 \
    --metric l2 \
    --max-base 200000 \
    --max-queries 200 \
    --seed 1337

The command mirrors the pre-computed flow but clusters the base vectors in-process through the Rust IvfRabitqIndex::train helper, so expect the build phase to take longer than the pre-clustered path. Once the index is trained, the evaluation output has the same format as in the pre-computed mode.
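To give a sense of where the extra build time goes, here is a minimal sketch of one iteration of Lloyd's algorithm, the textbook form of k-means: assign every vector to its nearest centroid, then move each centroid to the mean of its members. This is an assumption about the general technique only. The crate's kmeans.rs may differ in initialisation, iteration count, and empty-cluster handling, and the function names below are hypothetical.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// One Lloyd iteration returning the updated centroids.
/// Assumes `centroids` is non-empty and all vectors share one dimension.
/// Hypothetical helper for illustration; not part of the rabitq_rs API.
fn lloyd_step(dataset: &[Vec<f32>], centroids: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let dim = centroids[0].len();
    let mut sums = vec![vec![0.0f32; dim]; centroids.len()];
    let mut counts = vec![0usize; centroids.len()];

    for v in dataset {
        // Assignment step: find the closest centroid under squared L2.
        let mut nearest = 0usize;
        let mut best = f32::INFINITY;
        for (id, c) in centroids.iter().enumerate() {
            let d = squared_l2(v, c);
            if d < best {
                best = d;
                nearest = id;
            }
        }
        // Accumulate the vector into its cluster's running sum.
        counts[nearest] += 1;
        for (s, x) in sums[nearest].iter_mut().zip(v) {
            *s += *x;
        }
    }

    // Update step: each centroid becomes the mean of its members.
    sums.into_iter()
        .zip(counts)
        .zip(centroids)
        .map(|((sum, n), old)| {
            if n == 0 {
                old.clone() // empty cluster: keep the previous centroid
            } else {
                sum.into_iter().map(|s| s / n as f32).collect()
            }
        })
        .collect()
}
```

Each iteration compares every base vector against every centroid, so the cost grows with the product of the dataset size and --nlist, which is why the smoke-test limits shorten this build mode noticeably.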

All CLI options are documented in cargo run --bin gist -- --help.

Testing and linting

The test suite now includes regression checks for the dataset readers and the pre-clustered IVF flow. Run the full suite along with the standard linters before submitting changes:

cargo fmt
cargo clippy --all-targets --all-features
cargo test

For dataset-backed evaluation, invoke the gist binary as described above (optionally with reduced limits for quicker runs).

Publishing to crates.io

The crate is configured for publication on crates.io. Before publishing a new release:

Update the version – bump the version field in Cargo.toml following semantic versioning.
Log in to crates.io – authenticate once per workstation:
```
cargo login <your-api-token>
```
Validate the package – ensure the crate builds cleanly and packages without missing files:
```
cargo fmt
cargo clippy --all-targets --all-features
cargo test
cargo package
```
Inspect the generated .crate archive under target/package/ if you need to double-check the bundle contents.
Publish – when you are ready, push the package live:
```
cargo publish
```

If you need to yank a release, run cargo yank --vers <version> (optionally with --undo). Remember that published versions are immutable, so double-check the README and API docs before releasing.

Project structure

src/
  bin/gist.rs     # CLI for building & evaluating IVF + RaBitQ on GIST
  io.rs           # .fvecs/.ivecs readers and helpers
  ivf.rs          # IVF + RaBitQ searcher and training routines
  kmeans.rs       # Lightweight k-means used for in-crate training
  math.rs         # Vector math helpers
  quantizer.rs    # Core RaBitQ quantisation logic
  rotation.rs     # Random orthonormal rotator

Refer to README.origin.md for the original upstream documentation.

rabitq-rs 0.1.1