SWARC - Small World Approximate Recall Crate

A high-performance implementation of the Hierarchical Navigable Small World (HNSW) algorithm in Rust. SWARC provides state-of-the-art approximate nearest neighbor search with excellent performance for high-dimensional vector similarity search.

Features

Fast Approximate Nearest Neighbor Search: Efficient k-NN search with logarithmic time complexity
Document Linking: Associate embeddings with external data using a flexible document structure
Dynamic Operations: Insert, remove, and rebalance the index at runtime
Modular Architecture: Clean separation of concerns with dedicated modules for different operations
Type Safety: Generic implementation that works with any data type
Serialization Support: Built-in support for JSON serialization/deserialization

Algorithm Overview

HNSW constructs a multi-layer graph where:

Higher layers contain fewer nodes and provide long-range connections
Lower layers contain more nodes and provide fine-grained search
Search starts from the top layer and navigates down to find nearest neighbors
Insertion uses probabilistic level assignment and intelligent neighbor selection

Installation

Add this to your Cargo.toml:

[dependencies]
swarc = { path = "path/to/swarc" }

Or if using from crates.io (when published):

[dependencies]
swarc = "0.1.0"

Quick Start

use swarc::{HNSWIndex, Document};
use rand::Rng;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a new HNSW index
    // Parameters: dimension, max_connections, ef_construction
    let mut index = HNSWIndex::new(128, 16, 200);
    
    // Create a document with external data
    let document = Document {
        id: "doc1".to_string(),
        data: "This is my document content".to_string(),
    };
    
    // Generate a random embedding (in practice, use your embedding model)
    let mut rng = rand::thread_rng();
    let embedding: Vec<f32> = (0..128).map(|_| rng.gen_range(-1.0..1.0)).collect();
    
    // Insert the document into the index
    index.insert("node1".to_string(), embedding, Some(document))?;
    
    // Search for nearest neighbors
    let query: Vec<f32> = (0..128).map(|_| rng.gen_range(-1.0..1.0)).collect();
    let results = index.search(&query, 5); // Find 5 nearest neighbors
    
    for (id, distance, document) in results {
        println!("ID: {}, Distance: {:.4}, Document: {:?}", id, distance, document);
    }
    
    // Remove a document
    let removed_doc = index.remove("node1")?;
    println!("Removed: {:?}", removed_doc);
    
    Ok(())
}

API Reference

Core Types

`Document<T>`

pub struct Document<T> {
    pub id: String,
    pub data: T,
}

A wrapper for external data associated with embeddings.

`HNSWIndex<T>`

The main index structure that provides all HNSW operations.

Main Methods

`new(dim: usize, m: usize, ef_construction: usize) -> HNSWIndex<T>`

Creates a new HNSW index.

dim: Dimensionality of the embedding vectors
m: Maximum number of connections per node (except layer 0)
ef_construction: Size of dynamic candidate list during construction

`insert(id: String, embedding: Vec<f32>, document: Option<Document<T>>) -> Result<(), String>`

Inserts a new node into the index.

id: Unique identifier for the node
embedding: The vector embedding
document: Optional associated document data

`search(query: &[f32], k: usize) -> Vec<(String, f32, Option<&Document<T>>)>`

Searches for k nearest neighbors.

query: The query vector
k: Number of nearest neighbors to return
Returns: Vector of (node_id, distance, document) tuples

`remove(id: &str) -> Result<Option<Document<T>>, String>`

Removes a node from the index.

id: The node identifier to remove
Returns: The removed document if it existed

`rebalance() -> Result<(), String>`

Rebalances the index structure (currently a placeholder for future enhancements).

Utility Methods

len() -> usize: Get the number of nodes in the index
is_empty() -> bool: Check if the index is empty
get_node(id: &str) -> Option<&HNSWNode<T>>: Get a node by ID
get_all_ids() -> Vec<String>: Get all node IDs
contains(id: &str) -> bool: Check if a node exists
clear(): Remove all nodes from the index
remove_multiple(ids: &[&str]) -> Result<Vec<Option<Document<T>>>, String>: Remove multiple nodes

Architecture

The implementation is organized into several modules:

types.rs: Core data structures (Document, HNSWNode)
index.rs: Main index structure and basic operations
insert.rs: Insertion and rebalancing logic
search.rs: Search and nearest neighbor algorithms
remove.rs: Node removal and cleanup operations

Performance Characteristics

Search Complexity: O(log N) for approximate nearest neighbor search
Insertion Complexity: O(log N) for adding new nodes
Memory Usage: O(N × M) where N is the number of nodes and M is the average connections per node
Distance Metric: Currently uses Euclidean distance (L2 norm)

Performance Benchmarks

SWARC has been benchmarked with high-dimensional embeddings (3072 dimensions) across various dataset sizes. See the performance tests directory for detailed benchmarks and visualizations.

Key Performance Metrics:

Insertion Throughput: Up to millions of embeddings per hour
Search Time: Sub-millisecond query times for large datasets
Memory Efficiency: Linear scaling with dataset size
Scalability: Tested up to 5 million 3072-dimensional embeddings

Insertion Performance Insertion time scales logarithmically with dataset size

Search Performance Search time remains relatively constant across dataset sizes

Memory usage scales linearly with dataset size

Insertion throughput across different dataset sizes

Configuration Parameters

`m` (Maximum Connections)

Controls the maximum number of connections per node
Higher values: Better recall, more memory usage, slower search
Lower values: Faster search, less memory, potentially lower recall
Typical range: 8-32

`ef_construction` (Construction Parameter)

Size of dynamic candidate list during index construction
Higher values: Better index quality, slower construction
Lower values: Faster construction, potentially lower search quality
Typical range: 100-500

`max_layers`

Automatically calculated based on expected dataset size
Higher layers provide long-range navigation
Lower layers provide fine-grained search

Examples

Basic Usage

let mut index = HNSWIndex::new(128, 16, 200);
index.insert("node1".to_string(), vec![0.1, 0.2, 0.3], None)?;
let results = index.search(&vec![0.1, 0.2, 0.3], 1);

Performance Testing

# Run benchmarks with millions of 3072-dimensional embeddings
cargo run --bin benchmark

# Generate performance plots
cargo run --bin plot_results

With Documents

let doc = Document {
    id: "article_1".to_string(),
    data: Article { title: "AI Research", content: "..." },
};
index.insert("node1".to_string(), embedding, Some(doc))?;

Batch Operations

// Insert multiple documents
for (i, embedding) in embeddings.iter().enumerate() {
    let doc = Document {
        id: format!("doc_{}", i),
        data: documents[i].clone(),
    };
    index.insert(format!("node_{}", i), embedding.clone(), Some(doc))?;
}

// Remove multiple documents
let ids = ["node_1", "node_2", "node_3"];
let removed = index.remove_multiple(&ids)?;

Limitations and Future Work

Current Limitations

Only supports Euclidean distance (L2 norm)
Rebalancing is a placeholder implementation
No persistence/serialization of the index structure
No support for different distance metrics

Planned Enhancements

Support for multiple distance metrics (cosine, inner product, etc.)
Index persistence and loading
Advanced rebalancing strategies
Parallel search and insertion
Memory-mapped storage for large datasets
Benchmarking and performance optimization tools

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs - Original HNSW paper
HNSW: Hierarchical Navigable Small World - Reference C++ implementation

swarc 0.1.0