DiskANN Implementation in Rust

A Rust implementation of DiskANN (Disk-based Approximate Nearest Neighbor search) using the Vamana graph algorithm. This project provides an efficient and scalable solution for large-scale vector similarity search with minimal memory footprint.

Overview

This implementation follows the DiskANN paper's approach by:

Using the Vamana graph algorithm for index construction
Memory-mapping the index file for efficient disk-based access
Implementing beam search with medoid entry points
Supporting both Euclidean distance and Cosine similarity
Maintaining minimal memory footprint during search operations

Key Features

Single-file storage: All index data stored in one memory-mapped file
Vamana graph construction: Efficient graph building with α-pruning
Memory-efficient search: Uses beam search that visits < 1% of vectors
Distance metrics: Support for both Euclidean and Cosine similarity
Medoid-based entry points: Smart starting points for search
Parallel query processing: Using rayon for concurrent searches
Minimal memory footprint: ~330MB RAM for 2GB index (16% of file size)

Usage

Building a New Index

use diskann_rs::{DiskANN, DistanceMetric};

// Your vectors to index
let vectors = vec![
    vec![0.1, 0.2, 0.3, ...],
    vec![0.4, 0.5, 0.6, ...],
    // ...
];

let index = DiskANN::build_index(
    &vectors,
    32,       // max neighbors per node (max_degree)
    128,      // beam width for construction
    1.2,      // alpha parameter for pruning
    DistanceMetric::Euclidean,
    "index.db"
)?;

Opening an Existing Index

let index = DiskANN::open_index("index.db")?;

Searching the Index

// Prepare your query vector
let query = vec![0.1, 0.2, ...; 128];  // must match index dimension

// Search for nearest neighbors
let k = 10;           // number of neighbors to return
let beam_width = 64;  // search beam width (higher = better recall, slower)
let neighbors = index.search(&query, k, beam_width);

// neighbors contains the IDs of the k nearest vectors

Parallel Search

use rayon::prelude::*;
use std::sync::Arc;

// Create shared index reference
let index = Arc::new(index);

// Perform parallel queries
let results: Vec<Vec<u32>> = query_batch
    .par_iter()
    .map(|query| index.search(query, k, beam_width))
    .collect();

Algorithm Details

Vamana Graph Construction

Initialize with random graph connectivity
Iteratively improve edges using greedy search
Apply α-pruning to maintain diversity in neighbors
Ensure graph remains undirected

Beam Search Algorithm

Start from medoid (vector closest to dataset centroid)
Maintain beam of promising candidates
Explore neighbors of best candidates
Terminate when no improvement found
Return top-k results

Memory Management

Vectors: Memory-mapped, loaded on-demand during distance calculations
Graph structure: Adjacency lists stored contiguously in file
Search memory: Only beam_width vectors in memory at once
Typical usage: 10-100MB RAM for billion-scale indices

Performance Characteristics

Index Build Time: O(n * max_degree * beam_width)
Search Time: O(beam_width * log n) - typically visits < 1% of dataset
Memory Usage: O(beam_width) during search
Disk Space: n * (dimension * 4 + max_degree * 4) bytes
Query Throughput: Scales linearly with CPU cores

Parameters Tuning

Build Parameters

max_degree: 32-64 for most datasets
build_beam_width: 128-256 for good graph quality
alpha: 1.2-2.0 (higher = more diverse neighbors)

Search Parameters

beam_width: 32-128 (trade-off between speed and recall)
Higher beam_width = better recall but slower search

Building and Testing

# Build the library
cargo build --release

# Run tests
cargo test

# Run demo
cargo run --release --example demo

# Run performance test
cargo run --release --example perf_test

Examples

See the examples/ directory for:

demo.rs: Demo with 100k vectors
perf_test.rs: Performance benchmarking with 1M vectors

Current Implementation

✅ Completed Features:

Single-file storage format with memory mapping
Vamana graph construction with α-pruning
Beam search with medoid entry points
Multiple distance metrics (Euclidean, Cosine)
Parallel query processing
Comprehensive test suite
Memory-efficient design (< 100MB for large indices)

Future Improvements

Support for incremental index updates
Additional distance metrics (Manhattan, Hamming)
Compressed vector storage
Distributed index support
GPU acceleration for distance calculations
Auto-tuning of parameters

Contributing

Contributions are welcome! Please feel free to:

Open issues for bugs or feature requests
Submit PRs for improvements
Share performance benchmarks
Suggest optimizations

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

Acknowledgments

This implementation is based on the DiskANN paper and the official Microsoft implementation, adapted for Rust with focus on simplicity and memory efficiency.

diskann-rs 0.2.0