DiskANN Library
The core DiskANN library, providing high-performance, graph-based approximate nearest neighbor (ANN) search in Rust.
Overview
DiskANN is a high-performance, scalable approximate nearest neighbor (ANN) search library implemented in Rust. It provides both in-memory and disk-based out-of-core indexing for large-scale vector search (e.g., billions of vectors) with high recall and low latency.
Features
- Disk-based indexing - Support for datasets that don't fit in memory
- High performance - Optimized for fast approximate nearest neighbor search
- Parallel processing - Leverages Rust's concurrency features
- Multiple distance metrics - Support for L2, cosine, inner product, and other distance functions
- Flexible data types - Support for f32, f16, and other numeric types
- Memory efficient - Smart caching and memory management for large datasets
Quick Start
use diskann::{IndexBuilder, Metric}; // crate path assumed
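A minimal end-to-end sketch using the import above; the build and search signatures are assumptions pieced together from the API reference and examples below, not verified against the crate:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1,000 placeholder 128-dimensional vectors
    let vectors: Vec<Vec<f32>> = (0..1000)
        .map(|i| (0..128).map(|j| ((i * 31 + j) % 97) as f32).collect())
        .collect();

    // Build an in-memory index (see the API Reference section for all options)
    let mut index = IndexBuilder::new()
        .with_dimension(128)
        .with_metric(Metric::L2)
        .build()?;

    index.insert_batch(&vectors)?;
    index.build()?;

    // 10 nearest neighbors of the first vector, with search beam width 50
    let results = index.search(&vectors[0], 10, 50)?;
    println!("{:?}", results);
    Ok(())
}
```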
API Reference
IndexBuilder
The main entry point for creating indices with a fluent builder pattern:
let index = IndexBuilder::new()
    .with_dimension(128)          // Required: vector dimension
    .with_metric(Metric::L2)      // Required: distance metric
    .with_max_degree(64)          // Optional: max graph degree (default: 64)
    .with_search_list_size(100)   // Optional: search beam width (default: 100)
    .with_alpha(1.2)              // Optional: graph density (default: 1.2)
    .with_num_threads(1)          // Optional: thread count (default: 1)
    .with_opq(false)              // Optional: enable OPQ compression (default: false)
    .build()?;                    // Build in-memory index
Distance Metrics
Available distance metrics:
- Metric::L2 - Euclidean distance (fastest, most common)
- Metric::Cosine - Cosine distance (for normalized vectors)
- Metric::InnerProduct - Inner product similarity
Search Parameters
Fine-tune search behavior:
let results = index.search(&query, k, l)?;
// k: number of results to return
// l: search beam width (higher = more accurate, slower)
Advanced Configuration
use diskann::SearchParams;

// Field names are illustrative; see SearchParams for the full set of options.
let params = SearchParams { k: 10, l: 100, ..Default::default() };
let results = index.search_with_params(&query, &params)?;
Architecture
In-Memory Index
Best for:
- Small to medium datasets (< 1M vectors)
- High-performance requirements
- Frequent updates
let mut index = IndexBuilder::new()
    .with_dimension(128)
    .with_metric(Metric::L2)
    .build()?;
Disk-Based Index
Best for:
- Large datasets (> 1M vectors)
- Memory-constrained environments
- Persistent storage requirements
let mut index = IndexBuilder::new()
    .with_dimension(128)
    .with_metric(Metric::L2)
    .build_disk()?; // disk-backed build; exact method name may differ
Performance Tips
Index Building
- Alpha parameter: Higher values (1.2-1.4) create denser graphs with better accuracy but slower search
- Max degree: Higher values improve recall but increase memory usage
- Thread count: Use more threads for faster building on multi-core systems
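Putting these tips together, a sketch of a recall-oriented build configuration; the parameter values are illustrative starting points and the builder calls follow the API reference above:

```rust
use diskann::{IndexBuilder, Metric}; // crate path assumed

fn build_recall_oriented_index() -> Result<(), Box<dyn std::error::Error>> {
    let _index = IndexBuilder::new()
        .with_dimension(128)
        .with_metric(Metric::L2)
        .with_alpha(1.3)       // denser graph: better accuracy, slower search
        .with_max_degree(96)   // higher degree: better recall, more memory
        .with_num_threads(8)   // parallel build on a multi-core machine
        .build()?;
    Ok(())
}
```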
Search Optimization
- k value: Start with k=10-20 for most applications
- l value: Start with l=50-100, increase for better accuracy
- Distance metric: L2 is fastest, cosine good for normalized vectors
- Batch operations: Use batch search for multiple queries
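A sketch of tuning these parameters empirically: sweep the beam width l for a fixed k and measure per-query latency. Here `index` and `queries` are assumed to already exist, and `search(&query, k, l)` follows the signature used elsewhere in this README:

```rust
use std::time::Instant;

let k = 10;
for l in [50, 100, 200] {
    let start = Instant::now();
    for q in &queries {
        let _results = index.search(q, k, l)?;
    }
    let ms_per_query = start.elapsed().as_secs_f64() * 1000.0 / queries.len() as f64;
    println!("l = {l}: {ms_per_query:.2} ms/query");
}
```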
Memory Management
- In-memory: ~2-4x vector data size for index overhead
- Disk-based: Only cache size in memory, rest on disk
- Batch size: Larger batches are more efficient but use more memory
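As a back-of-the-envelope example of the in-memory rule of thumb above, for 1M 128-dimensional f32 vectors:

```rust
// Rough in-memory footprint estimate using the ~2-4x rule of thumb above.
fn main() {
    let num_vectors: u64 = 1_000_000;
    let dimension: u64 = 128;
    let bytes_per_value = std::mem::size_of::<f32>() as u64; // 4 bytes

    let raw_bytes = num_vectors * dimension * bytes_per_value; // ~512 MB of raw vectors
    let estimated_low = 2 * raw_bytes;  // ~1 GB
    let estimated_high = 4 * raw_bytes; // ~2 GB

    println!(
        "raw: {:.1} GB, estimated index: {:.1}-{:.1} GB",
        raw_bytes as f64 / 1e9,
        estimated_low as f64 / 1e9,
        estimated_high as f64 / 1e9
    );
}
```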
Use Cases
Image Similarity Search
// Load image embeddings
let embeddings = load_image_embeddings()?;
let mut index = IndexBuilder::new()
    .with_dimension(512)          // Image embedding dimension
    .with_metric(Metric::Cosine)  // Cosine for normalized embeddings
    .build()?;
index.insert_batch(&embeddings)?;
index.build()?;
// Find similar images
let query_embedding = extract_image_embedding(&query_image)?;
let similar_images = index.search(&query_embedding, 10, 50)?;
Text Search with Embeddings
// Load text embeddings
let text_embeddings = load_text_embeddings()?;
let mut index = IndexBuilder::new()
    .with_dimension(768)          // BERT embedding dimension
    .with_metric(Metric::Cosine)
    .build()?;
index.insert_batch(&text_embeddings)?;
index.build()?;
// Semantic search
let query_embedding = embed_text("example query")?;
let relevant_docs = index.search(&query_embedding, 10, 50)?;
Large-Scale Recommendation Systems
// For very large datasets, use a disk-based index
let mut index = IndexBuilder::new()
    .with_dimension(128)
    .with_metric(Metric::InnerProduct) // For recommendation scores
    .with_max_degree(64)
    .with_search_list_size(100)
    .with_num_threads(8)
    .build_disk()?;
// Build from file
index.build_from_file("vectors.bin")?;
// Get recommendations
let user_embedding = get_user_embedding(user_id)?;
let recommendations = index.search(&user_embedding, 20, 100)?;
Benchmarks
Performance on typical hardware (Intel i7-8700K, 32GB RAM):
| Dataset Size | Index Type | Build Time | Search Time (ms) | Memory Usage |
|---|---|---|---|---|
| 100K vectors | In-Memory | 2.3s | 0.8 | 512MB |
| 1M vectors | In-Memory | 18.7s | 1.2 | 4.2GB |
| 10M vectors | Disk-Based | 3m 45s | 2.1 | 1.1GB |
| 100M vectors | Disk-Based | 42m 12s | 3.8 | 2.3GB |
Search times are for k=10, l=50 on 128-dimensional vectors
Development
Building from Source
RUSTFLAGS="-C target-feature=+avx2 -C target-cpu=native" cargo build --release
Running Tests
RUSTFLAGS="-C target-feature=+avx2 -C target-cpu=native" cargo test --release
Running Benchmarks
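Assuming the standard Cargo bench harness:
cargo bench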
Dependencies
- vector: Vector operations and distance metrics
- platform: Platform-specific optimizations
- logger: Logging and instrumentation
- rayon: Parallel processing
- serde: Serialization support
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions! Please see the main README for contribution guidelines.
This crate is a Linux port of the DistANN library (Rust); the latter only supports Windows, and there is no macOS support yet. Credit to the original authors. I have added a significant number of new functions, including Hamming similarity at different precisions, better logging, and better parallelism. This modified crate will be used in large-scale biological database search.
References
Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R. and Kadekodi, R., 2019. DiskANN: Fast accurate billion-point nearest neighbor search on a single node. Advances in Neural Information Processing Systems, 32.