Hierarchical Navigable Small World (HNSW) Vector Search
This module provides high-performance approximate nearest neighbor (ANN) search using the HNSW algorithm. HNSW builds a multi-layer graph structure where higher layers provide “express lanes” for fast navigation, while the bottom layer contains all vectors for precise search.
§What is HNSW?
HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for efficient approximate nearest neighbor search in high-dimensional vector spaces. It constructs a hierarchical graph where:
- Layer 0 contains all vectors with dense connectivity
- Higher layers contain exponentially fewer vectors, forming “skip lists”
- Search starts at the top layer and greedily descends to the bottom
- Result: O(log N) search complexity with high recall
This design provides excellent trade-offs between search speed, accuracy, and memory usage for vector similarity search.
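The layered descent described above can be sketched as a toy in plain Rust: a hand-built two-layer graph over 1-dimensional vectors, walked greedily from the top layer down. This is an illustration only, not the crate's implementation.

```rust
use std::collections::HashMap;

// Squared Euclidean distance (sufficient for comparisons).
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Greedily walk one layer: move to the closest neighbor until no
/// neighbor is closer to the query than the current node.
fn greedy_step(
    layer: &HashMap<usize, Vec<usize>>,
    vectors: &[Vec<f32>],
    query: &[f32],
    mut current: usize,
) -> usize {
    loop {
        let mut best = current;
        let mut best_d = dist(&vectors[current], query);
        for &n in layer.get(&current).map(|v| v.as_slice()).unwrap_or(&[]) {
            let d = dist(&vectors[n], query);
            if d < best_d {
                best = n;
                best_d = d;
            }
        }
        if best == current {
            return current;
        }
        current = best;
    }
}

fn main() {
    let vectors = vec![vec![0.0], vec![1.0], vec![2.0], vec![3.0]];
    // Layer 1: sparse "express lane" containing only nodes 0 and 3.
    let layer1: HashMap<usize, Vec<usize>> = HashMap::from([(0, vec![3]), (3, vec![0])]);
    // Layer 0: all nodes, chain connectivity.
    let layer0: HashMap<usize, Vec<usize>> = HashMap::from([
        (0, vec![1]),
        (1, vec![0, 2]),
        (2, vec![1, 3]),
        (3, vec![2]),
    ]);
    let query = vec![2.1];
    // Descend: coarse hop on the top layer, then refine on layer 0.
    let entry = greedy_step(&layer1, &vectors, &query, 0);
    let nearest = greedy_step(&layer0, &vectors, &query, entry);
    println!("nearest id = {}", nearest); // node 2 (value 2.0)
}
```

The real algorithm keeps a candidate list (`ef`) per layer rather than a single current node; the single-node walk above just shows why higher layers act as express lanes.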
§Module Architecture
The HNSW implementation is organized into focused modules:
- `config`: `HnswConfig` - Algorithm configuration and parameters
- `builder`: `HnswConfigBuilder` - Fluent configuration builder with validation
- `index`: `HnswIndex` - Main HNSW index implementation with search/insert
- `storage`: `VectorStorage` trait - Pluggable vector persistence (in-memory/SQLite)
- `distance_metric`: SIMD-ready vector distance calculations
- `distance_functions`: Low-level distance functions with SIMD optimization
- `simd`: SIMD-accelerated distance functions with runtime CPU detection
- `batch_filter`: SIMD-accelerated batch ID filtering for multi-tenant operations
- `serialization`: SIMD-accelerated varint and delta encoding for HNSW persistence
- `errors`: Comprehensive error handling for all HNSW operations
- `multilayer`: Multi-layer graph construction and management
- `neighborhood`: Neighbor selection and heuristics for graph connectivity
§Key Types
- `HnswIndex` - Main HNSW index with insert, search, and persistence
- `HnswConfig` - Configuration parameters (dimension, M, ef_construction)
- `VectorStorage` - Pluggable storage backend for vectors
- `DistanceMetric` - Supported distance metrics (Cosine, Euclidean, etc.)
§Invariants and Guarantees
§Approximate Results
HNSW provides approximate nearest neighbor search, not exact results:
- Recall depends on the `ef_construction` and `ef_search` parameters
- Higher ef = better recall but slower search
- Typical recall: 95%+ for well-tuned parameters
- No guarantee of finding exact nearest neighbor
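"Recall" here is the fraction of true nearest neighbors that the approximate search returns. It can be measured against an exact brute-force baseline, as in this stand-alone sketch (toy code, not the crate's API):

```rust
/// Exact k-NN by brute force: sort all ids by squared distance to the query.
fn brute_force_knn(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..vectors.len()).collect();
    ids.sort_by(|&a, &b| {
        let da: f32 = vectors[a].iter().zip(query).map(|(x, y)| (x - y).powi(2)).sum();
        let db: f32 = vectors[b].iter().zip(query).map(|(x, y)| (x - y).powi(2)).sum();
        da.partial_cmp(&db).unwrap()
    });
    ids.truncate(k);
    ids
}

/// Recall@k = |approximate ∩ exact| / k.
fn recall_at_k(approximate: &[usize], exact: &[usize]) -> f32 {
    let hits = approximate.iter().filter(|&id| exact.contains(id)).count();
    hits as f32 / exact.len() as f32
}

fn main() {
    let vectors: Vec<Vec<f32>> = (0..10).map(|i| vec![i as f32]).collect();
    let exact = brute_force_knn(&vectors, &[2.2], 3); // ids 2, 3, 1
    let approximate = vec![2, 3, 5]; // pretend ANN output missed one neighbor
    println!("recall@3 = {}", recall_at_k(&approximate, &exact)); // 2 of 3 found
}
```

Running a sweep of `ef_search` values against a baseline like this is the usual way to pick the smallest ef that meets a recall target.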
§Determinism
- Same input + same config → same results (deterministic)
- Random seed controls layer assignment (via `multilayer_deterministic_seed`)
- Insert order affects graph structure (but not correctness)
§Thread Safety
NOT thread-safe for concurrent writes. HnswIndex uses interior mutability
and is not Sync. Do not share across threads for insert operations.
For concurrent search:
- Read-only search operations can access shared data
- Use separate indexes per thread for writes
- Or wrap in `Mutex`/`RwLock` for explicit synchronization
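The lock-wrapping approach might look like the following. The `Index` struct here is a hypothetical stand-in for `HnswIndex`, used only to show the `std::sync` pattern:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical stand-in for HnswIndex, to illustrate the locking pattern.
struct Index {
    vectors: Vec<Vec<f32>>,
}

impl Index {
    fn insert(&mut self, v: Vec<f32>) {
        self.vectors.push(v);
    }
    fn len(&self) -> usize {
        self.vectors.len()
    }
}

fn main() {
    let index = Arc::new(RwLock::new(Index { vectors: Vec::new() }));

    // Writer thread: each insert takes the exclusive write lock.
    let writer = {
        let index = Arc::clone(&index);
        thread::spawn(move || {
            for i in 0..4 {
                index.write().unwrap().insert(vec![i as f32]);
            }
        })
    };
    writer.join().unwrap();

    // Reader threads: many may hold the shared read lock concurrently,
    // which matches the "read-only search" case above.
    let readers: Vec<_> = (0..2)
        .map(|_| {
            let index = Arc::clone(&index);
            thread::spawn(move || index.read().unwrap().len())
        })
        .collect();
    for r in readers {
        assert_eq!(r.join().unwrap(), 4);
    }
}
```

`RwLock` fits the read-heavy search workload; `Mutex` is simpler but serializes searches too.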
§Vector Dimension Consistency
All vectors in an index must have the same dimension:
- Configured at index creation time via `HnswConfig::dimension`
- Enforced on insert - returns error for mismatched dimensions
- Cannot change after index creation (recreate index to change)
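The enforcement can be illustrated with a minimal stand-alone sketch (a toy struct, not the crate's `HnswIndex`; the real index reports mismatches through its own error types):

```rust
// Toy sketch of the dimension check performed on insert (illustration only).
struct ToyIndex {
    dimension: usize,
    vectors: Vec<Vec<f32>>,
}

impl ToyIndex {
    fn insert(&mut self, v: Vec<f32>) -> Result<(), String> {
        // Reject any vector whose length differs from the configured dimension.
        if v.len() != self.dimension {
            return Err(format!(
                "dimension mismatch: expected {}, got {}",
                self.dimension,
                v.len()
            ));
        }
        self.vectors.push(v);
        Ok(())
    }
}

fn main() {
    let mut index = ToyIndex { dimension: 3, vectors: Vec::new() };
    assert!(index.insert(vec![1.0, 2.0, 3.0]).is_ok());
    assert!(index.insert(vec![1.0, 2.0]).is_err()); // wrong dimension rejected
}
```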
§Performance Characteristics
§Search Performance
- Time Complexity: O(log N) average case
- Space Complexity: O(N × M) where M = max_connections per node
- Accuracy: 95%+ recall for typical workloads with proper tuning
§Insert Performance
- Time Complexity: O(log N) average case
- Amortized: Faster than rebuilding entire index
- Dynamic: No full rebuild required for new vectors
§Memory Usage
- Base overhead: 2-3x vector data size
- Graph edges: O(N × M) where M is configurable (default 16)
- Multi-layer: Additional ~15% overhead for higher layers
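A back-of-the-envelope estimate based on the figures above (the 8-byte per-edge id cost is an assumption for illustration; the exact per-edge footprint depends on the implementation):

```rust
// Rough memory estimate: raw vector data plus O(N * M) graph edges,
// plus ~15% multi-layer overhead (figures from the text above).
fn estimate_bytes(n: usize, dimension: usize, m: usize) -> usize {
    let vector_bytes = n * dimension * std::mem::size_of::<f32>();
    let edge_bytes = n * m * std::mem::size_of::<u64>(); // assume 8-byte ids
    let base = vector_bytes + edge_bytes;
    base + base * 15 / 100 // ~15% multi-layer overhead
}

fn main() {
    // Example: 1M vectors, 768 dims, default M = 16.
    let bytes = estimate_bytes(1_000_000, 768, 16);
    println!("~{} MiB", bytes / (1024 * 1024));
}
```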
§Configuration Parameters
Key parameters in HnswConfig:
§dimension (Required)
- Vector dimensionality (e.g., 768 for sentence embeddings)
- Must match all vectors inserted into the index
- Cannot be changed after index creation
§m (max_connections, default: 16)
- Max connections per node in the graph
- Higher M: Better recall, more memory, slower inserts
- Lower M: Faster inserts, less memory, lower recall
- Recommended: 12-32 depending on dataset size
§ef_construction (default: 200)
- Candidate list size during index construction
- Higher ef: Better graph quality, higher recall, slower builds
- Must be ≥ M: Required for valid index construction
- Recommended: 2× M for good quality
§ef_search (default: 50)
- Candidate list size during search operations
- Higher ef: Better recall, slower search
- Can adjust at runtime without rebuilding index
- Recommended: Same as ef_construction or slightly lower
§Persistence Behavior
HNSW indexes persist across database sessions:
§Metadata Persistence
- Index name and configuration saved to the `hnsw_indexes` table
- Automatically restored on `SqliteGraph::open()`
- Survives database restarts and reconnections
§Vector Persistence
- Vectors persisted to the `hnsw_vectors` table (file-based databases)
- In-memory databases use in-memory storage (no persistence)
- Graph structure rebuilt from persisted vectors on load
§Limitations
- Graph edges NOT persisted - rebuilt from vectors on load
- Rebuild cost: O(N log N) on index open
- Workaround: Keep index in-memory for hot restarts
§Usage Examples
§Create Index and Insert Vectors
```rust
use sqlitegraph::{SqliteGraph, hnsw::{HnswConfig, DistanceMetric}};

let graph = SqliteGraph::open("vectors.db")?;
let config = HnswConfig::builder()
    .dimension(768)
    .m_connections(16)
    .ef_construction(200)
    .ef_search(50)
    .distance_metric(DistanceMetric::Cosine)
    .build()?;
let mut index = graph.create_hnsw_index("docs", config)?;

// Insert vectors
for (id, vector) in vectors {
    index.insert(id, &vector)?;
}
```

§Search for Nearest Neighbors
```rust
let results = index.search(&query_vector, 10)?;
for (vector_id, distance) in results {
    println!("ID: {}, Distance: {}", vector_id, distance);
}
```

§Distance Metrics
Supported distance metrics via DistanceMetric:
- Cosine: Ideal for normalized vectors and text embeddings
  - Range: [0, 2] where 0 = identical
  - Use case: Semantic similarity, document embeddings
- Euclidean: L2 distance, general-purpose similarity
  - Range: [0, ∞) where 0 = identical
  - Use case: Image embeddings, feature vectors
- Dot Product: Fast approximate cosine for normalized vectors
  - Range: [-1, 1] where 1 = identical (negated for min-heap)
  - Use case: Pre-normalized embeddings, fast approximate search
- Manhattan: L1 distance, robust to outliers
  - Range: [0, ∞) where 0 = identical
  - Use case: Sparse vectors, robust similarity search
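Written out as plain scalar Rust, the four metrics look as follows (stand-alone formulas for illustration; the crate's SIMD-accelerated versions live in the `simd` and `distance_functions` modules):

```rust
// Scalar reference versions of the four supported metrics.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let norm = |v: &[f32]| dot(v, v).sqrt();
    1.0 - dot(a, b) / (norm(a) * norm(b)) // range [0, 2], 0 = identical
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

fn main() {
    // Two orthogonal unit vectors.
    let a = [1.0, 0.0];
    let b = [0.0, 1.0];
    println!("cosine    = {}", cosine_distance(&a, &b)); // 1.0
    println!("euclidean = {}", euclidean(&a, &b)); // sqrt(2)
    println!("manhattan = {}", manhattan(&a, &b)); // 2.0
    println!("dot       = {}", dot(&a, &b)); // 0.0
}
```

Note that for unit-length vectors, cosine distance and dot product rank neighbors identically, which is why dot product works as a fast substitute for pre-normalized embeddings.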
§Error Handling
All HNSW operations return Result types with comprehensive errors:
```rust
use sqlitegraph::hnsw::{HnswConfig, HnswConfigError};

match HnswConfig::builder().dimension(0).build() {
    Ok(config) => println!("Valid config"),
    Err(HnswConfigError::InvalidDimension) => {
        println!("Vector dimension must be > 0");
    }
    Err(e) => println!("Other error: {}", e),
}
```

Re-exports§

pub use builder::HnswConfigBuilder;
pub use config::HnswConfig;
pub use config::hnsw_config;
pub use distance_metric::DistanceMetric;
pub use distance_metric::compute_distance;
pub use errors::HnswConfigError;
pub use errors::HnswError;
pub use errors::HnswIndexError;
pub use errors::HnswMultiLayerError;
pub use errors::HnswStorageError;
pub use index::HnswIndex;
pub use index::HnswIndexStats;
pub use storage::InMemoryVectorStorage;
pub use storage::VectorBatch;
pub use storage::VectorRecord;
pub use storage::VectorStorage;
pub use storage::VectorStorageStats;
pub use multilayer::LayerMappings;
pub use multilayer::LevelDistributor;
pub use multilayer::MultiLayerNodeManager;
pub use batch_filter::filter_allowed_scalar;
pub use batch_filter::filter_batch;
pub use batch_filter::filter_denied_scalar;
pub use serialization::decode_varint_scalar;
pub use serialization::delta_decode;
pub use serialization::delta_encode;
pub use serialization::encode_varint_scalar;
Modules§
- `batch_filter` - SIMD-Accelerated Batch ID Filtering
- `builder` - HNSW Configuration Builder
- `config` - HNSW Algorithm Configuration
- `distance_functions` - Distance Calculation Functions
- `distance_metric` - Distance Metrics for Vector Similarity
- `errors` - HNSW Error Types
- `index` - HNSW Vector Search Index API
- `layer` - HNSW Layer Management
- `multilayer` - Multi-layer HNSW Implementation
- `neighborhood` - HNSW Neighborhood Search Algorithms
- `serialization` - SIMD-Accelerated Serialization for HNSW Persistence
- `simd` - SIMD-Accelerated Distance Functions
- `storage` - HNSW Vector Storage Abstraction