Module hnsw

Hierarchical Navigable Small World (HNSW) Vector Search

This module provides high-performance approximate nearest neighbor (ANN) search using the HNSW algorithm. HNSW builds a multi-layer graph structure where higher layers provide “express lanes” for fast navigation, while the bottom layer contains all vectors for precise search.

§What is HNSW?

HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for efficient approximate nearest neighbor search in high-dimensional vector spaces. It constructs a hierarchical graph where:

  • Layer 0 contains all vectors with dense connectivity
  • Higher layers contain exponentially fewer vectors, acting as skip-list-style shortcuts
  • Search starts at the top layer and greedily descends to the bottom
  • Result: O(log N) search complexity with high recall

This design provides excellent trade-offs between search speed, accuracy, and memory usage for vector similarity search.
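The exponential thinning of the upper layers comes from how each node's top layer is drawn. A common formulation (a sketch of the standard HNSW rule, not this crate's code) assigns level = floor(-ln(u) * mL) with mL = 1/ln(M), so layer k holds roughly N / M^k nodes:

```rust
// Sketch of the standard HNSW layer-assignment rule (not this crate's
// implementation): each node draws level = floor(-ln(u) * mL), where
// mL = 1 / ln(M), so each layer is ~1/M the size of the one below it.
fn assign_level(u: f64, m: usize) -> usize {
    let ml = 1.0 / (m as f64).ln();
    (-u.ln() * ml).floor() as usize
}

fn main() {
    let m = 16;
    // A tiny deterministic LCG stands in for the index's seeded RNG.
    let mut state: u64 = 42;
    let mut counts = [0usize; 8];
    for _ in 0..100_000 {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Map the high bits to (0, 1]; avoid u = 0, whose log is -inf.
        let u = ((state >> 11) as f64 + 1.0) / (1u64 << 53) as f64;
        let level = assign_level(u, m).min(7);
        counts[level] += 1;
    }
    // Layer 0 dominates; each higher layer shrinks by roughly 1/M.
    println!("{:?}", counts);
    assert!(counts[0] > counts[1] && counts[1] > counts[2]);
}
```

With M = 16, about 15/16 of all nodes live only in layer 0, which is what makes the upper layers cheap "express lanes".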

§Module Architecture

The HNSW implementation is organized into focused modules:

  • config: HnswConfig - Algorithm configuration and parameters
  • builder: HnswConfigBuilder - Fluent configuration builder with validation
  • index: HnswIndex - Main HNSW index implementation with search/insert
  • storage: VectorStorage trait - Pluggable vector persistence (in-memory/SQLite)
  • distance_metric: SIMD-ready vector distance calculations
  • distance_functions: Low-level distance functions with SIMD optimization
  • simd: SIMD-accelerated distance functions with runtime CPU detection
  • batch_filter: SIMD-accelerated batch ID filtering for multi-tenant operations
  • serialization: SIMD-accelerated varint and delta encoding for HNSW persistence
  • errors: Comprehensive error handling for all HNSW operations
  • layer: HNSW layer management
  • multilayer: Multi-layer graph construction and management
  • neighborhood: Neighbor selection and heuristics for graph connectivity

§Key Types

  • HnswIndex - Main HNSW index with insert, search, and persistence
  • HnswConfig - Configuration parameters (dimension, M, ef_construction)
  • VectorStorage - Pluggable storage backend for vectors
  • DistanceMetric - Supported distance metrics (Cosine, Euclidean, etc.)

§Invariants and Guarantees

§Approximate Results

HNSW provides approximate nearest neighbor search, not exact results:

  • Recall depends on ef_construction and ef_search parameters
  • Higher ef = better recall but slower search
  • Typical recall: 95%+ for well-tuned parameters
  • No guarantee of finding exact nearest neighbor

§Determinism

  • Same input + same config → same results (deterministic)
  • Random seed controls layer assignment (via multilayer_deterministic_seed)
  • Insert order affects graph structure (but not correctness)

§Thread Safety

NOT thread-safe for concurrent writes. HnswIndex uses interior mutability and is not Sync, so a single index must not be shared across threads for insert operations.

For concurrent search:

  • Read-only search operations can access shared data
  • Use separate indexes per thread for writes
  • Or wrap in Mutex/RwLock for explicit synchronization
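The RwLock option can be sketched with a stand-in struct in place of HnswIndex (the stand-in's insert/len methods are hypothetical and exist only to show the locking pattern):

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Stand-in for HnswIndex: the RwLock pattern is the point, not this struct.
struct Index {
    vectors: Vec<(u64, Vec<f32>)>,
}

impl Index {
    fn insert(&mut self, id: u64, v: &[f32]) {
        self.vectors.push((id, v.to_vec()));
    }
    fn len(&self) -> usize {
        self.vectors.len()
    }
}

fn main() {
    let index = Arc::new(RwLock::new(Index { vectors: Vec::new() }));

    // One writer at a time: `write()` gives exclusive access for inserts.
    let writer = {
        let index = Arc::clone(&index);
        thread::spawn(move || {
            for id in 0..100 {
                index.write().unwrap().insert(id, &[0.0; 4]);
            }
        })
    };
    writer.join().unwrap();

    // Many readers in parallel: `read()` guards can coexist for searches.
    let readers: Vec<_> = (0..4)
        .map(|_| {
            let index = Arc::clone(&index);
            thread::spawn(move || index.read().unwrap().len())
        })
        .collect();
    for r in readers {
        assert_eq!(r.join().unwrap(), 100);
    }
}
```

RwLock fits the read-heavy workload described above: searches take cheap shared read guards while inserts briefly take the exclusive write guard.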

§Vector Dimension Consistency

All vectors in an index must have the same dimension:

  • Configured at index creation time via HnswConfig::dimension
  • Enforced on insert - returns error for mismatched dimensions
  • Cannot change after index creation (recreate index to change)
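The insert-time enforcement described above amounts to a length comparison; a sketch of the documented rule (not the crate's actual code or error type):

```rust
// Sketch of the documented dimension check (the real index returns one
// of the hnsw::errors types; a plain String error stands in here).
fn check_dimension(configured: usize, vector: &[f32]) -> Result<(), String> {
    if vector.len() != configured {
        Err(format!(
            "dimension mismatch: index expects {}, vector has {}",
            configured,
            vector.len()
        ))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(check_dimension(768, &vec![0.0; 768]).is_ok());
    assert!(check_dimension(768, &vec![0.0; 512]).is_err());
}
```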

§Performance Characteristics

§Search Performance

  • Time Complexity: O(log N) average case
  • Space Complexity: O(N × M) where M = max_connections per node
  • Accuracy: 95%+ recall for typical workloads with proper tuning

§Insert Performance

  • Time Complexity: O(log N) average case
  • Amortized: Faster than rebuilding entire index
  • Dynamic: No full rebuild required for new vectors

§Memory Usage

  • Base overhead: 2-3x vector data size
  • Graph edges: O(N × M) where M is configurable (default 16)
  • Multi-layer: Additional ~15% overhead for higher layers
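These figures can be combined into a back-of-envelope sizing estimate. The constants below (f32 vectors, 8 bytes per stored edge, the ~15% layer overhead) are assumptions taken from the bullets above, not values read from the crate:

```rust
// Back-of-envelope HNSW memory estimate from the figures above.
// Assumptions: f32 vectors, 8 bytes per stored graph edge, and the
// ~15% multi-layer overhead quoted in the docs.
fn estimate_bytes(n: usize, dim: usize, m: usize) -> usize {
    let vectors = n * dim * std::mem::size_of::<f32>(); // raw vector data
    let edges = n * m * 8;                              // O(N x M) graph edges
    let layered = (edges as f64 * 1.15).round() as usize; // ~15% for upper layers
    vectors + layered
}

fn main() {
    // 1M vectors of dimension 768 with M = 16: the raw vectors dominate.
    let bytes = estimate_bytes(1_000_000, 768, 16);
    println!("~{} MiB", bytes / (1024 * 1024));
}
```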

§Configuration Parameters

Key parameters in HnswConfig:

§dimension (Required)

  • Vector dimensionality (e.g., 768 for sentence embeddings)
  • Must match all vectors inserted into the index
  • Cannot be changed after index creation

§m (max_connections, default: 16)

  • Max connections per node in the graph
  • Higher M: Better recall, more memory, slower inserts
  • Lower M: Faster inserts, less memory, lower recall
  • Recommended: 12-32 depending on dataset size

§ef_construction (default: 200)

  • Candidate list size during index construction
  • Higher ef: Better graph quality, higher recall, slower builds
  • Must be ≥ M: Required for valid index construction
  • Recommended: 2× M for good quality
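The ef_construction ≥ M rule can be stated as a plain check. This is a sketch of the documented invariant, not the builder's actual validation code:

```rust
// Sketch of the documented invariant (not the crate's builder code):
// ef_construction must be at least M, or the construction-time candidate
// list cannot even hold a node's full neighbor set.
fn validate_ef_construction(m: usize, ef_construction: usize) -> Result<(), String> {
    if ef_construction < m {
        Err(format!(
            "ef_construction ({}) must be >= M ({})",
            ef_construction, m
        ))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(validate_ef_construction(16, 200).is_ok()); // recommended ~2x M
    assert!(validate_ef_construction(16, 8).is_err());  // rejected: ef < M
}
```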

§ef_search (default: 50)

  • Candidate list size during search operations
  • Higher ef: Better recall, slower search
  • Can adjust at runtime without rebuilding index
  • Recommended: Same as ef_construction or slightly lower

§Persistence Behavior

HNSW indexes persist across database sessions:

§Metadata Persistence

  • Index name and configuration saved to hnsw_indexes table
  • Automatically restored on SqliteGraph::open()
  • Survives database restarts and reconnections

§Vector Persistence

  • Vectors persisted to hnsw_vectors table (file-based databases)
  • In-memory databases use in-memory storage (no persistence)
  • Graph structure rebuilt from persisted vectors on load

§Limitations

  • Graph edges NOT persisted - rebuilt from vectors on load
  • Rebuild cost: O(N log N) on index open
  • Workaround: Keep index in-memory for hot restarts

§Usage Examples

§Create Index and Insert Vectors

use sqlitegraph::{SqliteGraph, hnsw::{HnswConfig, DistanceMetric}};

let graph = SqliteGraph::open("vectors.db")?;

let config = HnswConfig::builder()
    .dimension(768)
    .m_connections(16)
    .ef_construction(200)
    .ef_search(50)
    .distance_metric(DistanceMetric::Cosine)
    .build()?;

let mut index = graph.create_hnsw_index("docs", config)?;

// Insert vectors (`vectors` here stands for any iterator of
// (id, Vec<f32>) pairs supplied by the application)
for (id, vector) in vectors {
    index.insert(id, &vector)?;
}

§Search for Nearest Neighbors

// `query_vector` must be a vector of the index's configured dimension
let results = index.search(&query_vector, 10)?;

for (vector_id, distance) in results {
    println!("ID: {}, Distance: {}", vector_id, distance);
}

§Distance Metrics

Supported distance metrics via DistanceMetric:

  • Cosine: Ideal for normalized vectors and text embeddings

    • Range: [0, 2] where 0 = identical
    • Use case: Semantic similarity, document embeddings
  • Euclidean: L2 distance, general-purpose similarity

    • Range: [0, ∞) where 0 = identical
    • Use case: Image embeddings, feature vectors
  • Dot Product: Fast approximate cosine for normalized vectors

    • Range: [-1, 1] where 1 = identical (negated for min-heap)
    • Use case: Pre-normalized embeddings, fast approximate search
  • Manhattan: L1 distance, robust to outliers

    • Range: [0, ∞) where 0 = identical
    • Use case: Sparse vectors, robust similarity search
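For reference, the four metrics have straightforward scalar definitions. These are the textbook formulas matching the ranges above, not the module's SIMD-accelerated implementations:

```rust
// Scalar reference formulas for the four metrics above (the module's
// simd / distance_functions versions are optimized equivalents).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    // 1 - cos(theta): 0 for identical directions, up to 2 for opposite.
    let norm = |v: &[f32]| dot(v, v).sqrt();
    1.0 - dot(a, b) / (norm(a) * norm(b))
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

fn main() {
    let a = [1.0, 0.0];
    let b = [0.0, 1.0];
    assert_eq!(cosine_distance(&a, &b), 1.0); // orthogonal vectors
    assert_eq!(manhattan(&a, &b), 2.0);
    assert!((euclidean(&a, &b) - 2f32.sqrt()).abs() < 1e-6);
    assert_eq!(dot(&a, &a), 1.0); // a similarity, not a distance: higher = closer
}
```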

§Error Handling

All HNSW operations return Result types with comprehensive errors:

use sqlitegraph::hnsw::{HnswConfig, HnswConfigError};

match HnswConfig::builder().dimension(0).build() {
    Ok(_config) => println!("Valid config"),
    Err(HnswConfigError::InvalidDimension) => {
        println!("Vector dimension must be > 0");
    }
    Err(e) => println!("Other error: {}", e),
}

Re-exports§

pub use builder::HnswConfigBuilder;
pub use config::HnswConfig;
pub use config::hnsw_config;
pub use distance_metric::DistanceMetric;
pub use distance_metric::compute_distance;
pub use errors::HnswConfigError;
pub use errors::HnswError;
pub use errors::HnswIndexError;
pub use errors::HnswMultiLayerError;
pub use errors::HnswStorageError;
pub use index::HnswIndex;
pub use index::HnswIndexStats;
pub use storage::InMemoryVectorStorage;
pub use storage::VectorBatch;
pub use storage::VectorRecord;
pub use storage::VectorStorage;
pub use storage::VectorStorageStats;
pub use multilayer::LayerMappings;
pub use multilayer::LevelDistributor;
pub use multilayer::MultiLayerNodeManager;
pub use batch_filter::filter_allowed_scalar;
pub use batch_filter::filter_batch;
pub use batch_filter::filter_denied_scalar;
pub use serialization::decode_varint_scalar;
pub use serialization::delta_decode;
pub use serialization::delta_encode;
pub use serialization::encode_varint_scalar;

Modules§

batch_filter
SIMD-Accelerated Batch ID Filtering
builder
HNSW Configuration Builder
config
HNSW Algorithm Configuration
distance_functions
Distance Calculation Functions
distance_metric
Distance Metrics for Vector Similarity
errors
HNSW Error Types
index
HNSW Vector Search Index API
layer
HNSW Layer Management
multilayer
Multi-layer HNSW Implementation
neighborhood
HNSW Neighborhood Search Algorithms
serialization
SIMD-Accelerated Serialization for HNSW Persistence
simd
SIMD-Accelerated Distance Functions
storage
HNSW Vector Storage Abstraction