SWARC - Small World Approximate Recall Crate
A high-performance implementation of the Hierarchical Navigable Small World (HNSW) algorithm in Rust. SWARC provides state-of-the-art approximate nearest neighbor search with excellent performance for high-dimensional vector similarity search.
Features
- Fast Approximate Nearest Neighbor Search: Efficient k-NN search with logarithmic time complexity
- Document Linking: Associate embeddings with external data using a flexible document structure
- Dynamic Operations: Insert, remove, and rebalance the index at runtime
- Modular Architecture: Clean separation of concerns with dedicated modules for different operations
- Type Safety: Generic implementation that works with any data type
- Serialization Support: Built-in support for JSON serialization/deserialization
Algorithm Overview
HNSW constructs a multi-layer graph where:
- Higher layers contain fewer nodes and provide long-range connections
- Lower layers contain more nodes and provide fine-grained search
- Search starts from the top layer and navigates down to find nearest neighbors
- Insertion uses probabilistic level assignment and intelligent neighbor selection
Installation
Add this to your Cargo.toml:
[]
= { = "path/to/swarc" }
Or if using from crates.io (when published):
[]
= "0.1.0"
Quick Start
use ;
use Rng;
API Reference
Core Types
Document<T>
A wrapper for external data associated with embeddings.
HNSWIndex<T>
The main index structure that provides all HNSW operations.
Main Methods
new(dim: usize, m: usize, ef_construction: usize) -> HNSWIndex<T>
Creates a new HNSW index.
dim: Dimensionality of the embedding vectorsm: Maximum number of connections per node (except layer 0)ef_construction: Size of dynamic candidate list during construction
insert(id: String, embedding: Vec<f32>, document: Option<Document<T>>) -> Result<(), String>
Inserts a new node into the index.
id: Unique identifier for the nodeembedding: The vector embeddingdocument: Optional associated document data
search(query: &[f32], k: usize) -> Vec<(String, f32, Option<&Document<T>>)>
Searches for k nearest neighbors.
query: The query vectork: Number of nearest neighbors to return- Returns: Vector of (node_id, distance, document) tuples
remove(id: &str) -> Result<Option<Document<T>>, String>
Removes a node from the index.
id: The node identifier to remove- Returns: The removed document if it existed
rebalance() -> Result<(), String>
Rebalances the index structure (currently a placeholder for future enhancements).
Utility Methods
len() -> usize: Get the number of nodes in the indexis_empty() -> bool: Check if the index is emptyget_node(id: &str) -> Option<&HNSWNode<T>>: Get a node by IDget_all_ids() -> Vec<String>: Get all node IDscontains(id: &str) -> bool: Check if a node existsclear(): Remove all nodes from the indexremove_multiple(ids: &[&str]) -> Result<Vec<Option<Document<T>>>, String>: Remove multiple nodes
Architecture
The implementation is organized into several modules:
types.rs: Core data structures (Document,HNSWNode)index.rs: Main index structure and basic operationsinsert.rs: Insertion and rebalancing logicsearch.rs: Search and nearest neighbor algorithmsremove.rs: Node removal and cleanup operations
Performance Characteristics
- Search Complexity: O(log N) for approximate nearest neighbor search
- Insertion Complexity: O(log N) for adding new nodes
- Memory Usage: O(N × M) where N is the number of nodes and M is the average connections per node
- Distance Metric: Currently uses Euclidean distance (L2 norm)
Performance Benchmarks
SWARC has been benchmarked with high-dimensional embeddings (3072 dimensions) across various dataset sizes. See the performance tests directory for detailed benchmarks and visualizations.
Key Performance Metrics:
- Insertion Throughput: Up to millions of embeddings per hour
- Search Time: Sub-millisecond query times for large datasets
- Memory Efficiency: Linear scaling with dataset size
- Scalability: Tested up to 5 million 3072-dimensional embeddings
Insertion time scales logarithmically with dataset size
Search time remains relatively constant across dataset sizes
Memory usage scales linearly with dataset size
Insertion throughput across different dataset sizes
Configuration Parameters
m (Maximum Connections)
- Controls the maximum number of connections per node
- Higher values: Better recall, more memory usage, slower search
- Lower values: Faster search, less memory, potentially lower recall
- Typical range: 8-32
ef_construction (Construction Parameter)
- Size of dynamic candidate list during index construction
- Higher values: Better index quality, slower construction
- Lower values: Faster construction, potentially lower search quality
- Typical range: 100-500
max_layers
- Automatically calculated based on expected dataset size
- Higher layers provide long-range navigation
- Lower layers provide fine-grained search
Examples
Basic Usage
let mut index = new;
index.insert?;
let results = index.search;
Performance Testing
# Run benchmarks with millions of 3072-dimensional embeddings
# Generate performance plots
With Documents
let doc = Document ;
index.insert?;
Batch Operations
// Insert multiple documents
for in embeddings.iter.enumerate
// Remove multiple documents
let ids = ;
let removed = index.remove_multiple?;
Limitations and Future Work
Current Limitations
- Only supports Euclidean distance (L2 norm)
- Rebalancing is a placeholder implementation
- No persistence/serialization of the index structure
- No support for different distance metrics
Planned Enhancements
- Support for multiple distance metrics (cosine, inner product, etc.)
- Index persistence and loading
- Advanced rebalancing strategies
- Parallel search and insertion
- Memory-mapped storage for large datasets
- Benchmarking and performance optimization tools
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
References
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs - Original HNSW paper
- HNSW: Hierarchical Navigable Small World - Reference C++ implementation