# RaBitQ Rust Library
This crate provides a pure-Rust implementation of the RaBitQ quantization scheme and an IVF + RaBitQ searcher that mirrors the
behavior of the C++ [RaBitQ Library](https://github.com/VectorDB-NTU/RaBitQ-Library). The library focuses on efficient approximate nearest-neighbor search for high-dimensional vectors and now ships with tooling to reproduce the GIST benchmark pipeline described in `example.sh`.
## Highlights
- **Full IVF + RaBitQ searcher** – the `IvfRabitqIndex` supports both L2 and inner-product metrics, fastscan-style pruning, and
optional extended codes.
- **Pre-clustered training support** – `IvfRabitqIndex::train_with_clusters` lets you reuse centroids and cluster assignments
generated by external tooling (e.g. the `python/ivf.py` helper that wraps FAISS), matching the workflow used by the upstream C++
library.
- **Dataset utilities** – the new `rabitq_rs::io` module parses `.fvecs` and `.ivecs` files, including convenience helpers for
cluster-id lists and ground-truth tables.
- **Command-line evaluation** – `cargo run --bin ivf_rabitq` builds an IVF + RaBitQ index from any `.fvecs` dataset and reports recall and
throughput for a configurable `nprobe` / `top-k` budget.
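
For readers unfamiliar with the file formats: `.fvecs` and `.ivecs` files, as distributed with the TEXMEX corpora, store each vector as a little-endian 32-bit integer dimension followed by that many 4-byte components (`f32` for `.fvecs`, `i32` for `.ivecs`). The sketch below decodes such a byte buffer with the standard library only. The names `split_records`, `parse_fvecs`, and `parse_ivecs` are illustrative assumptions and are not the `rabitq_rs::io` API.

```rust
/// Split a raw `.fvecs`/`.ivecs` buffer into per-vector payloads.
/// Each record is an i32 dimension (little endian) followed by `dim` 4-byte values.
fn split_records(bytes: &[u8]) -> Result<Vec<&[u8]>, String> {
    let mut records = Vec::new();
    let mut offset = 0;
    while offset < bytes.len() {
        let header = bytes
            .get(offset..offset + 4)
            .ok_or("truncated dimension header")?;
        let dim = i32::from_le_bytes([header[0], header[1], header[2], header[3]]);
        if dim <= 0 {
            return Err(format!("invalid dimension {}", dim));
        }
        let start = offset + 4;
        let end = start + dim as usize * 4;
        let payload = bytes.get(start..end).ok_or("truncated vector payload")?;
        records.push(payload);
        offset = end;
    }
    Ok(records)
}

/// Decode a `.fvecs` buffer into rows of f32 values (hypothetical helper).
fn parse_fvecs(bytes: &[u8]) -> Result<Vec<Vec<f32>>, String> {
    let records = split_records(bytes)?;
    Ok(records
        .into_iter()
        .map(|payload| {
            payload
                .chunks_exact(4)
                .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                .collect()
        })
        .collect())
}

/// Decode an `.ivecs` buffer into rows of i32 values (hypothetical helper).
fn parse_ivecs(bytes: &[u8]) -> Result<Vec<Vec<i32>>, String> {
    let records = split_records(bytes)?;
    Ok(records
        .into_iter()
        .map(|payload| {
            payload
                .chunks_exact(4)
                .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                .collect()
        })
        .collect())
}
```

Passing the result of `std::fs::read("data/gist/gist_base.fvecs")` to `parse_fvecs` would yield one `Vec<f32>` per base vector. For large files a streaming reader is likely preferable to loading the whole buffer.
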
## Quick start
Add the crate to your project by pointing `Cargo.toml` at this repository, by adding `rabitq-rs` from crates.io, or by linking to a
local checkout. The snippet below constructs an IVF index from randomly generated vectors, queries it with the first dataset
vector, and prints the id of the nearest neighbour.
```rust
use rabitq_rs::ivf::{IvfRabitqIndex, SearchParams};
use rabitq_rs::{Metric, RotatorType};
use rand::prelude::*;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rng = StdRng::seed_from_u64(42);
    let dim = 32;
    let dataset: Vec<Vec<f32>> = (0..1_000)
        .map(|_| (0..dim).map(|_| rng.gen::<f32>() * 2.0 - 1.0).collect())
        .collect();
    let index = IvfRabitqIndex::train(
        &dataset,
        64,                         // nlist
        7,                          // total_bits
        Metric::L2,
        RotatorType::FhtKacRotator, // Use FHT for better performance
        7_654,                      // seed
        false,                      // use_faster_config (set to true for 100-500x faster training)
    )?;
    let params = SearchParams::new(10, 32);
    let results = index.search(&dataset[0], params)?;
    println!("nearest neighbour id: {}", results[0].id);
    Ok(())
}
```
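
Because IVF + RaBitQ search is approximate, it can be useful to compare its output against an exact scan on a small dataset. The sketch below is a plain brute-force L2 search built on the standard library. `squared_l2` and `exact_top_k` are hypothetical helpers for illustration and are not part of the crate.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Return the ids (indices into `dataset`) of the `k` exact nearest
/// neighbours of `query`, closest first.
fn exact_top_k(dataset: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = dataset
        .iter()
        .enumerate()
        .map(|(id, v)| (id, squared_l2(v, query)))
        .collect();
    // total_cmp is a total order on f32, so a stray NaN cannot panic the sort.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}
```

Querying with a vector that is itself in the dataset, as the quick start does, should place that vector's own id first in the exact result, which gives a simple reference point for the approximate search.
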
## Training with pre-computed clusters
When you already have k-means centroids and assignments (for example produced by FAISS), call `train_with_clusters`:
```rust
use rabitq_rs::ivf::IvfRabitqIndex;
use rabitq_rs::{Metric, RotatorType};
let index = IvfRabitqIndex::train_with_clusters(
    &dataset,
    &centroids,                 // Vec<Vec<f32>> with shape [nlist, dim]
    &assignments,               // Vec<usize> with length dataset.len()
    7,                          // total quantisation bits
    Metric::L2,
    RotatorType::FhtKacRotator,
    0xFEED_FACE,                // rotation seed
    false,                      // use_faster_config (set to true for 100-500x faster training)
)?;
```
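
The comments above imply two shape constraints: `centroids` is `[nlist, dim]` and `assignments` has one entry per dataset vector. When the inputs come from external tooling, a quick pre-flight check can catch mismatches early. The function below is a hypothetical helper, not part of the crate, and this README does not say whether `train_with_clusters` performs similar validation itself.

```rust
/// Check that externally produced clusters are consistent with the dataset.
/// Hypothetical helper: verifies dimensions, lengths, and id ranges only.
fn check_cluster_inputs(
    dataset: &[Vec<f32>],
    centroids: &[Vec<f32>],
    assignments: &[usize],
) -> Result<(), String> {
    let dim = match dataset.first() {
        Some(v) => v.len(),
        None => return Err("dataset is empty".to_string()),
    };
    if centroids.is_empty() {
        return Err("no centroids supplied".to_string());
    }
    // Every centroid must live in the same space as the data.
    if let Some(i) = centroids.iter().position(|c| c.len() != dim) {
        return Err(format!("centroid {} does not have dimension {}", i, dim));
    }
    // One assignment per dataset vector.
    if assignments.len() != dataset.len() {
        return Err(format!(
            "expected {} assignments, found {}",
            dataset.len(),
            assignments.len()
        ));
    }
    // Each assignment must index an existing centroid.
    if let Some(&bad) = assignments.iter().find(|&&a| a >= centroids.len()) {
        return Err(format!("cluster id {} is out of range", bad));
    }
    Ok(())
}
```
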
## Faster quantization with `faster_config`
By default, RaBitQ computes an optimal scaling factor for each vector during quantization, which provides the best accuracy but can
be slow. The `faster_config` mode precomputes a single constant scaling factor for all vectors, trading <1% accuracy for **100-500x
faster** quantization.
**When to use faster_config:**
- Large datasets (>100K vectors) where training time is a bottleneck
- Production scenarios where index build time matters
- When the small accuracy loss (<1%) is acceptable
**When NOT to use faster_config:**
- Small datasets where training is already fast
- When you need the absolute best accuracy
- Research scenarios where precision is critical
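
To give some intuition for the trade-off, the toy sketch below contrasts a per-vector scale search with a single fixed scale. It is deliberately not RaBitQ's scaling rule or the crate's quantizer: it uses a simple uniform scalar quantizer, and the candidate grid, the `bits` handling, and both function names are assumptions made for illustration. The point is only that the per-vector path evaluates every candidate for every vector, while the constant path evaluates one.

```rust
/// Quantize `v` to signed integer levels at the given `scale` and return the
/// squared reconstruction error. Toy quantizer, not the crate's implementation.
fn quantization_error(v: &[f32], scale: f32, bits: u32) -> f32 {
    assert!(bits >= 1 && bits <= 16, "bits must be in 1..=16");
    // Largest representable level, e.g. 3 for 3 bits.
    let max_level = ((1i32 << (bits - 1)) - 1) as f32;
    v.iter()
        .map(|&x| {
            let level = (x / scale).round().clamp(-max_level, max_level);
            let reconstructed = level * scale;
            (x - reconstructed) * (x - reconstructed)
        })
        .sum()
}

/// Per-vector search: try every candidate scale and keep the one with the
/// lowest error. Returns `None` when no candidates are supplied.
fn best_scale(v: &[f32], candidates: &[f32], bits: u32) -> Option<(f32, f32)> {
    candidates
        .iter()
        .map(|&s| (s, quantization_error(v, s, bits)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
}
```

With a grid of `m` candidates, `best_scale` does roughly `m` times the work of one `quantization_error` call per vector. In this toy model that multiplier is the cost a precomputed constant avoids, and the price is whatever error the constant gives up relative to the per-vector optimum.
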
### Example usage:
```rust
use rabitq_rs::ivf::IvfRabitqIndex;
use rabitq_rs::{Metric, RotatorType};
// With faster_config enabled
let index = IvfRabitqIndex::train(
    &dataset,
    4096,                       // nlist
    7,                          // total_bits
    Metric::L2,
    RotatorType::FhtKacRotator,
    12345,                      // seed
    true,                       // use_faster_config = true (100-500x faster!)
)?;
// Or from CLI:
// cargo run --release --bin ivf_rabitq -- \
// --base data.fvecs \
// --nlist 4096 \
// --bits 7 \
// --faster-config \
// --save index.bin
```
## Reproducing the GIST IVF + RaBitQ benchmark
Follow the same data preparation steps shown in `example.sh`:
1. **Download and unpack the dataset**
```bash
mkdir -p data/gist
wget -P data/gist ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzvf data/gist/gist.tar.gz -C data/gist
```
If FTP is blocked in your environment, fetch the files from an alternative mirror and place them under `data/gist/` with the
same filenames (`gist_base.fvecs`, `gist_query.fvecs`, `gist_groundtruth.ivecs`).
### Typical Workflow (with FAISS clustering)
1. **Cluster the base vectors** using the Python helper:
```bash
python python/ivf.py \
data/gist/gist_base.fvecs \
4096 \
data/gist/gist_centroids_4096.fvecs \
data/gist/gist_clusterids_4096.ivecs \
l2
```
2. **Build the index**:
```bash
cargo run --release --bin ivf_rabitq -- \
--base data/gist/gist_base.fvecs \
--centroids data/gist/gist_centroids_4096.fvecs \
--assignments data/gist/gist_clusterids_4096.ivecs \
--bits 3 \
--faster-config \
--save data/gist/ivf_4096_3.index
```
The `--faster-config` flag in the command above enables 100-500x faster training with <1% accuracy loss; omit it to use the default per-vector scaling.
3. **Query with benchmark mode** (nprobe sweep + 5-round benchmark):
```bash
cargo run --release --bin ivf_rabitq -- \
--load data/gist/ivf_4096_3.index \
--queries data/gist/gist_query.fvecs \
--gt data/gist/gist_groundtruth.ivecs \
--benchmark
```
This performs an automatic `nprobe` sweep (5, 10, 20, ... up to 15000), stops when recall plateaus, then runs a 5-round benchmark
and outputs a table: `nprobe | QPS | recall`.
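
For reference, the recall column can be read as recall@k: the fraction of ground-truth top-k ids that also appear in the returned top-k, averaged over all queries. The sketch below implements that common definition. The CLI's exact formula is not spelled out in this README, so treat the function (and its name, `recall_at_k`) as an assumption rather than the tool's implementation.

```rust
use std::collections::HashSet;

/// Average recall@k over all queries.
/// `results[i]` holds the ids returned for query i, best first;
/// `ground_truth[i]` holds the exact neighbour ids for query i, best first.
fn recall_at_k(results: &[Vec<usize>], ground_truth: &[Vec<usize>], k: usize) -> f64 {
    assert_eq!(results.len(), ground_truth.len(), "one ground-truth row per query");
    if results.is_empty() || k == 0 {
        return 0.0;
    }
    let mut hits = 0usize;
    for (found, truth) in results.iter().zip(ground_truth) {
        // Sets make the count robust to duplicate ids in either list.
        let found_set: HashSet<usize> = found.iter().take(k).copied().collect();
        let truth_set: HashSet<usize> = truth.iter().take(k).copied().collect();
        hits += found_set.intersection(&truth_set).count();
    }
    hits as f64 / (results.len() * k) as f64
}
```
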
### Alternative: Build with Rust k-means
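Skip the Python clustering step and let the built-in k-means compute the clusters by passing `--nlist` instead of `--centroids` and `--assignments`: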
```bash
cargo run --release --bin ivf_rabitq -- \
--base data/gist/gist_base.fvecs \
--nlist 4096 \
--bits 3 \
--save data/gist/index.bin
```
### Single-Config Evaluation
For a specific nprobe value (without sweep):
```bash
cargo run --release --bin ivf_rabitq -- \
--load data/gist/index.bin \
--queries data/gist/gist_query.fvecs \
--gt data/gist/gist_groundtruth.ivecs \
--nprobe 1024 \
--top-k 100
```
This evaluates at the specified nprobe and reports recall, QPS, and latency percentiles.
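
As a rough guide to interpreting these numbers, the sketch below derives QPS and latency percentiles from a list of per-query timings. It assumes queries run one after another on a single thread and uses the nearest-rank percentile method. Both choices, and the function names, are assumptions for illustration, and the CLI may compute its figures differently.

```rust
use std::time::Duration;

/// Queries per second, assuming the queries were executed sequentially.
fn queries_per_second(latencies: &[Duration]) -> f64 {
    let total: Duration = latencies.iter().sum();
    let secs = total.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    latencies.len() as f64 / secs
}

/// Nearest-rank percentile: the smallest sample such that at least `p`
/// percent of all samples are less than or equal to it.
fn percentile(latencies: &[Duration], p: f64) -> Option<Duration> {
    if latencies.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = latencies.to_vec();
    sorted.sort();
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // rank is 1-based; p = 0 maps to the smallest sample.
    Some(sorted[rank.max(1) - 1])
}
```

Timings could be collected by wrapping each search call with `std::time::Instant::now()` and `elapsed()`, then passing the resulting durations to both functions.
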
### Build and Query in One Command
```bash
cargo run --release --bin ivf_rabitq -- \
--base data/gist/gist_base.fvecs \
--nlist 4096 \
--bits 3 \
--queries data/gist/gist_query.fvecs \
--gt data/gist/gist_groundtruth.ivecs \
--benchmark
```
All CLI options are documented in `cargo run --bin ivf_rabitq -- --help`.
## Testing and linting
The test suite now includes regression checks for the dataset readers and the pre-clustered IVF flow. Run the full suite along
with the standard linters before submitting changes:
```bash
cargo fmt
cargo clippy --all-targets --all-features
cargo test
```
For dataset-backed evaluation, invoke the `ivf_rabitq` binary as described above.
## Publishing to crates.io
The crate is configured for publication on [crates.io](https://crates.io/crates/rabitq-rs). Before publishing a new release:
1. **Update the version** – bump the `version` field in `Cargo.toml` following semantic versioning.
2. **Log in to crates.io** – authenticate once per workstation:
```bash
cargo login <your-api-token>
```
3. **Validate the package** – ensure the crate builds cleanly and packages without missing files:
```bash
cargo fmt
cargo clippy --all-targets --all-features
cargo test
cargo package
```
Inspect the generated `.crate` archive under `target/package/` if you need to double-check the bundle contents.
4. **Publish** – when you are ready, push the package live:
```bash
cargo publish
```
If you need to yank a release, run `cargo yank --vers <version>` (optionally with `--undo`). Remember that published versions
are immutable, so double-check the README and API docs before releasing.
## Project structure
```text
src/
bin/ivf_rabitq.rs # CLI for building & evaluating IVF + RaBitQ on any .fvecs dataset
io.rs # .fvecs/.ivecs readers and helpers
ivf.rs # IVF + RaBitQ searcher and training routines
kmeans.rs # Lightweight k-means used for in-crate training
math.rs # Vector math helpers
quantizer.rs # Core RaBitQ quantisation logic
rotation.rs # Random orthonormal rotator
```
Refer to `README.origin.md` for the original upstream documentation.