§DiskANN Library
The core DiskANN library, providing high-performance, graph-based approximate nearest neighbor (ANN) search in Rust.
§Overview
DiskANN is a high-performance, scalable approximate nearest neighbor (ANN) search library implemented in Rust. It provides both in-memory and disk-based out-of-core indexing for large-scale vector search (e.g., billions of vectors) with high recall and low latency.
§Features
- Disk-based indexing - Support for datasets that don’t fit in memory
- High performance - Optimized for fast approximate nearest neighbor search
- Parallel processing - Leverages Rust’s concurrency features
- Multiple distance metrics - Support for L2, cosine, inner product, and other distance functions
- Flexible data types - Support for f32, f16, and other numeric types
- Memory efficient - Smart caching and memory management for large datasets
§Quick Start
use diskann::{IndexBuilder, Metric, ANNResult};
fn main() -> ANNResult<()> {
// Create an in-memory index
let mut index = IndexBuilder::new()
.with_dimension(4) // Vector dimension (matches the 4-element vectors below)
.with_metric(Metric::L2) // Distance metric
.with_max_degree(64) // Maximum graph degree
.with_search_list_size(100) // Search beam width
.with_alpha(1.2) // Graph density parameter
.with_num_threads(4) // Number of threads
.build_in_memory::<f32>()?;
// Insert vectors
let vectors = vec![
vec![1.0, 2.0, 3.0, 4.0],
vec![2.0, 3.0, 4.0, 5.0],
vec![3.0, 4.0, 5.0, 6.0],
];
index.insert_batch(&vectors)?;
// Build the index
index.build(&vectors)?;
// Search for nearest neighbors
let query = vec![1.0, 2.0, 3.0, 4.0];
let results = index.search(&query, 5, 50)?;
println!("Found {} nearest neighbors", results.len());
for result in results {
println!("ID: {}, Distance: {:.4}", result.id, result.distance);
}
Ok(())
}
§API Reference
§IndexBuilder
The main entry point for creating indices with a fluent builder pattern:
let index = IndexBuilder::new()
.with_dimension(128) // Required: vector dimension
.with_metric(Metric::L2) // Required: distance metric
.with_max_degree(64) // Optional: max graph degree (default: 64)
.with_search_list_size(100) // Optional: search beam width (default: 100)
.with_alpha(1.2) // Optional: graph density (default: 1.2)
.with_num_threads(4) // Optional: thread count (default: 1)
.with_opq(false) // Optional: enable OPQ compression (default: false)
.build_in_memory::<f32>()?; // Build in-memory index
§Distance Metrics
Available distance metrics:
- Metric::L2 - Euclidean distance (fastest, most common)
- Metric::Cosine - Cosine distance (for normalized vectors)
- Metric::InnerProduct - Inner product similarity
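Choosing a metric is a one-line builder change. The sketch below uses only the builder calls shown in Quick Start, picks cosine, and L2-normalizes the data first; the normalize helper is illustrative and not part of the crate:
use diskann::{IndexBuilder, Metric, ANNResult};

// Illustrative helper (not part of the crate): L2-normalize a vector in place
// so that cosine distance behaves as expected.
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

fn build_cosine_index(mut vectors: Vec<Vec<f32>>) -> ANNResult<()> {
    vectors.iter_mut().for_each(|v| normalize(v));

    let mut index = IndexBuilder::new()
        .with_dimension(4)              // must match the vectors being indexed
        .with_metric(Metric::Cosine)    // cosine: direction-only similarity
        .build_in_memory::<f32>()?;

    index.insert_batch(&vectors)?;
    index.build(&vectors)?;
    Ok(())
}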
§Search Parameters
Fine-tune search behavior:
let results = index.search(&query, k, l)?;
// k: number of results to return
// l: search beam width (higher = more accurate, slower)
§Advanced Configuration
use diskann::{SearchParams, ANNResult};
let params = SearchParams {
k: 10, // Number of results
l: 50, // Search beam width
return_distances: true, // Include distances in results
};
let results = index.search_with_params(&query, &params)?;
§Architecture
§In-Memory Index
Best for:
- Small to medium datasets (< 1M vectors)
- High-performance requirements
- Frequent updates
let mut index = IndexBuilder::new()
.with_dimension(128)
.with_metric(Metric::L2)
.build_in_memory::<f32>()?;
§Disk-Based Index
Best for:
- Large datasets (> 1M vectors)
- Memory-constrained environments
- Persistent storage requirements
let mut index = IndexBuilder::new()
.with_dimension(128)
.with_metric(Metric::L2)
.build_disk_index::<f32>("index_path")?;
§Performance Tips
§Index Building
- Alpha parameter: Higher values (1.2-1.4) create denser graphs with better accuracy but slower search
- Max degree: Higher values improve recall but increase memory usage
- Thread count: Use more threads for faster building on multi-core systems (see the configuration sketch below)
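As a sketch of how these knobs combine, a recall-oriented build might look like the following; the parameter values are illustrative, not recommendations from the crate:
use diskann::{IndexBuilder, Metric, ANNResult};

fn build_high_recall_index(vectors: Vec<Vec<f32>>) -> ANNResult<()> {
    // A denser graph (higher alpha and degree) trades build time and memory
    // for recall; more threads mainly speed up construction.
    let mut index = IndexBuilder::new()
        .with_dimension(128)
        .with_metric(Metric::L2)
        .with_alpha(1.4)                // denser graph, better accuracy
        .with_max_degree(96)            // more edges per node, more memory
        .with_search_list_size(200)     // wider beam during construction
        .with_num_threads(8)            // parallel build
        .build_in_memory::<f32>()?;

    index.insert_batch(&vectors)?;
    index.build(&vectors)?;
    Ok(())
}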
§Search Optimization
- k value: Start with k=10-20 for most applications
- l value: Start with l=50-100, increase for better accuracy
- Distance metric: L2 is fastest; cosine works well for normalized vectors
- Batch operations: Use batch search for multiple queries (see the sketch below)
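No batch-search method is documented on this page (the batch_operations.rs example listed later presumably covers it), so a minimal approach is to reuse one index across a slice of queries; whether searches can run concurrently depends on the actual index API:
use diskann::{IndexBuilder, Metric, ANNResult};

fn search_many(vectors: Vec<Vec<f32>>, queries: &[Vec<f32>]) -> ANNResult<()> {
    let mut index = IndexBuilder::new()
        .with_dimension(4)              // must match the data
        .with_metric(Metric::L2)
        .build_in_memory::<f32>()?;
    index.insert_batch(&vectors)?;
    index.build(&vectors)?;

    // One (k, l) pair shared across the batch; raise l if recall is too low.
    let (k, l) = (10, 50);
    for (i, query) in queries.iter().enumerate() {
        let results = index.search(query, k, l)?;
        println!("query {}: {} neighbors", i, results.len());
    }
    Ok(())
}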
§Memory Management
- In-memory: ~2-4x vector data size for index overhead
- Disk-based: Only the cache is held in memory; the rest stays on disk
- Batch size: Larger batches are more efficient but use more memory
§Use Cases
§Image Similarity Search
// Load image embeddings
let embeddings = load_image_embeddings("images.bin")?;
let mut index = IndexBuilder::new()
.with_dimension(512) // Image embedding dimension
.with_metric(Metric::Cosine) // Cosine for normalized embeddings
.build_in_memory::<f32>()?;
index.insert_batch(&embeddings)?;
index.build(&embeddings)?;
// Find similar images
let query_embedding = extract_image_embedding("query.jpg")?;
let similar_images = index.search(&query_embedding, 10, 50)?;
§Text Search with Embeddings
// Load text embeddings
let text_embeddings = load_text_embeddings("documents.bin")?;
let mut index = IndexBuilder::new()
.with_dimension(768) // BERT embedding dimension
.with_metric(Metric::L2)
.build_in_memory::<f32>()?;
index.insert_batch(&text_embeddings)?;
index.build(&text_embeddings)?;
// Semantic search
let query_embedding = embed_text("search query")?;
let relevant_docs = index.search(&query_embedding, 20, 100)?;
§Large-Scale Recommendation Systems
// For very large datasets, use disk-based index
let mut index = IndexBuilder::new()
.with_dimension(128)
.with_metric(Metric::InnerProduct) // For recommendation scores
.with_max_degree(128)
.with_search_list_size(200)
.with_num_threads(16)
.build_disk_index::<f32>("recommendations")?;
// Build from file
index.build_from_file("user_embeddings.bin")?;
// Get recommendations
let user_embedding = get_user_embedding(user_id)?;
let recommendations = index.search(&user_embedding, 50, 200)?;
§Benchmarks
Performance on typical hardware (Intel i7-8700K, 32GB RAM):
Dataset Size | Index Type | Build Time | Search Time (ms) | Memory Usage
------------ | ---------- | ---------- | ---------------- | ------------
100K vectors | In-Memory  | 2.3s       | 0.8              | 512MB
1M vectors   | In-Memory  | 18.7s      | 1.2              | 4.2GB
10M vectors  | Disk-Based | 3m 45s     | 2.1              | 1.1GB
100M vectors | Disk-Based | 42m 12s    | 3.8              | 2.3GB
Search times are for k=10, l=50 on 128-dimensional vectors
§Development
§Building from Source
RUSTFLAGS="-C target-feature=+avx2 -C target-cpu=native" cargo build --release --no-default-features --features "disk_store,disk_store_linux"
§Running Tests
RUSTFLAGS="-C target-feature=+avx2 -C target-cpu=native" cargo test --release index_end_to_end_test_multithread --no-default-features --features "disk_store,disk_store_linux"
§Running Benchmarks
cargo bench
§Dependencies
- vector: Vector operations and distance metrics
- platform: Platform-specific optimizations
- logger: Logging and instrumentation
- rayon: Parallel processing
- serde: Serialization support
§License
This project is licensed under the MIT License - see the LICENSE file for details.
§Contributing
We welcome contributions! Please see the main README for contribution guidelines.
This crate is a Linux port of the DiskANN library in Rust; the original only supports Windows, and there is no macOS support yet. Credit goes to the original authors. This fork adds a significant amount of new functionality, including Hamming similarity at different precisions (sketched below), improved logging, and better parallelism. The modified crate will be applied to large-scale biological database search.
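The Hamming support mentioned above is not documented elsewhere on this page; purely as an illustration of the idea (not the crate's API), Hamming distance can be computed generically over the word precision used to pack the bits:
// Illustrative only: not the crate's API. Hamming distance over bit-packed
// vectors, generic over the packing precision (u8, u16, u32, or u64).
fn hamming_distance<T: Copy + Into<u64>>(a: &[T], b: &[T]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| (x.into() ^ y.into()).count_ones())
        .sum()
}

fn main() {
    let a: Vec<u64> = vec![0b1010, 0b1111];
    let b: Vec<u64> = vec![0b0010, 0b0000];
    // 1 differing bit in the first word + 4 in the second = 5
    assert_eq!(hamming_distance(&a, &b), 5);
}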
§References
Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R. and Kadekodi, R., 2019. DiskANN: Fast accurate billion-point nearest neighbor search on a single node. Advances in Neural Information Processing Systems, 32.
§DiskANN - Approximate Nearest Neighbor Search in Rust
DiskANN is a high-performance, scalable approximate nearest neighbor (ANN) search library implemented in Rust. It provides both in-memory and disk-based indexing for large-scale vector search with high recall and low latency.
§Key Features
- Pure Rust implementation - No C/C++ dependencies, leveraging Rust’s memory safety
- Disk-based indexing - Support for datasets that don’t fit in memory
- High performance - Optimized for fast approximate nearest neighbor search
- Parallel processing - Leverages Rust’s concurrency features
- Multiple distance metrics - Support for L2, cosine, and other distance functions
- Flexible data types - Support for f32, f16, and other numeric types
§Quick Start
use diskann::{IndexBuilder, Metric, SearchParams};
// Create an in-memory index
let mut index = IndexBuilder::new()
.with_dimension(3) // Vector dimension (matches the 3-element vectors below)
.with_metric(Metric::L2)
.with_max_degree(64)
.with_search_list_size(100)
.build_in_memory::<f32>()?;
// Insert vectors
let vectors = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
index.insert_batch(&vectors)?;
index.build(&vectors)?;
// Search for nearest neighbors
let query = vec![1.0, 2.0, 3.0];
let results = index.search(&query, 5, 50)?;
println!("Found {} nearest neighbors", results.len());
§Examples
See the examples/ directory for complete working examples:
- basic_usage.rs - Basic in-memory index usage
- disk_index.rs - Disk-based index for large datasets
- batch_operations.rs - Batch insert and search operations
- custom_metrics.rs - Using custom distance metrics
Re-exports§
pub use common::ANNError;
pub use common::ANNResult;
pub use model::configuration::index_configuration::IndexConfiguration;
pub use model::configuration::index_write_parameters::IndexWriteParametersBuilder;
Modules§
Structs§
- DiskIndex - High-level disk-based index for large datasets
- InMemoryIndex - High-level in-memory index for vector search
- IndexBuilder - High-level index builder for easy configuration
- SearchParams - Search parameters for fine-tuning search behavior
- SearchResult - Result of a search operation
Enums§
- Metric - Distance metric