# crvecdb

A fast vector database library with HNSW indexing for Rust.
## Features

- **HNSW Indexing** - Hierarchical Navigable Small World graphs for fast approximate nearest-neighbor search
- **Multiple Distance Metrics** - Cosine, Euclidean (L2), Dot Product
- **SIMD Acceleration** - Cross-platform support for ARM NEON and x86 SSE/AVX2
- **Memory-Mapped Storage** - Persistent indexes with automatic memory mapping
- **Parallel Operations** - Rayon-powered parallel insert and search
## Installation

```toml
[dependencies]
crvecdb = "0.1"
```
## Quick Start

```rust
use crvecdb::{Index, Metric};

// Create an in-memory index
let index = Index::builder(128) // 128 dimensions
    .metric(Metric::Cosine)
    .m(16)                      // HNSW connections per node
    .ef_construction(200)       // Build-time search width
    .capacity(10_000)
    .build()
    .unwrap();

// Insert vectors
index.insert(0, &[0.1; 128]).unwrap();
index.insert(1, &[0.2; 128]).unwrap();

// Search for nearest neighbors
let results = index.search(&[0.1; 128], 10).unwrap();
for result in results {
    println!("id={} distance={}", result.id, result.distance);
}
```
## Parallel Bulk Insert

```rust
use crvecdb::{Index, Metric};

let index = Index::builder(128)
    .metric(Metric::Euclidean)
    .capacity(100_000)
    .build()
    .unwrap();

// Prepare batch
let vectors: Vec<(u64, Vec<f32>)> = (0..100_000u64)
    .map(|i| (i, vec![i as f32; 128]))
    .collect();

// Parallel insert - uses all CPU cores
index.insert_parallel(&vectors).unwrap();
```
## Persistent Storage

```rust
use crvecdb::{open_mmap, Index, Metric};

// Create a memory-mapped index
let index = Index::builder(128)
    .metric(Metric::Cosine)
    .capacity(10_000)
    .build_mmap("index.db")
    .unwrap();

// Data persists automatically
index.insert(0, &[0.1; 128]).unwrap();
index.flush().unwrap(); // Saves both vectors and HNSW graph

// Reopen later
let index = open_mmap("index.db").unwrap();
// Graph is restored - no rebuild needed!
```
## Distance Metrics

| Metric | Description | Use Case |
|---|---|---|
| `Cosine` | Normalized angular distance | Text embeddings, semantic search |
| `Euclidean` | L2 distance | Image features, spatial data |
| `DotProduct` | Inner product | Recommendation systems |
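As a quick intuition for the table above, here is a small plain-Rust sketch (independent of the crvecdb API) showing how the three metrics score the same pair of vectors differently:

```rust
// Standalone illustration of the three metrics; not part of the crvecdb API.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn norm(a: &[f32]) -> f32 {
    dot(a, a).sqrt()
}

/// Cosine distance: 1 - cos(theta); ignores vector magnitude.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    1.0 - dot(a, b) / (norm(a) * norm(b))
}

/// Euclidean (L2) distance; sensitive to magnitude.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

fn main() {
    let a = [1.0, 0.0];
    let b = [2.0, 0.0];
    // Same direction, different magnitude: cosine treats them as identical...
    assert!(cosine_distance(&a, &b).abs() < 1e-6);
    // ...while Euclidean does not.
    assert!((euclidean_distance(&a, &b) - 1.0).abs() < 1e-6);
    // Dot product rewards magnitude, which suits recommendation scoring.
    assert!((dot(&a, &b) - 2.0).abs() < 1e-6);
}
```

This is why cosine is the usual choice for text embeddings, where direction carries the meaning and magnitude is noise.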
## HNSW Parameters

| Parameter | Default | Description |
|---|---|---|
| `m` | 16 | Max connections per node. Higher = better recall, more memory |
| `ef_construction` | 200 | Search width during build. Higher = better graph, slower insert |
| `ef_search` | 50 | Search width at query time. Higher = better recall, slower search |
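To make the "more memory" trade-off for `m` concrete, here is a back-of-the-envelope estimate. It assumes a standard HNSW layout where layer 0 stores up to `2*m` neighbor ids of 4 bytes each; crvecdb's actual on-disk layout may differ:

```rust
fn main() {
    // SIFT1M-scale index with the default m from the table above.
    let n: u64 = 1_000_000; // vectors
    let dim: u64 = 128;     // dimensions
    let m: u64 = 16;        // max connections per node

    // f32 vector data: n * dim * 4 bytes
    let vector_mb = n * dim * 4 / (1024 * 1024);
    // Layer-0 neighbor lists dominate graph memory: up to 2*m u32 ids per node
    let graph_mb = n * 2 * m * 4 / (1024 * 1024);

    println!("vector data: ~{} MB", vector_mb); // ~488 MB
    println!("graph links: ~{} MB", graph_mb);  // ~122 MB
}
```

Under these assumptions, doubling `m` to 32 roughly doubles the link memory while the vector data itself is unchanged.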
## Feature Flags

```toml
[features]
default = ["simd", "parallel"]
simd = ["simdeez"]   # SIMD acceleration
parallel = ["rayon"] # Parallel insert and search
```

The `parallel` feature enables multi-threaded operations:

- `insert_parallel()` uses all CPU cores for bulk loading
- Search benchmarks run queries in parallel

Disable it for single-threaded builds:

```toml
[dependencies]
crvecdb = { version = "0.1", default-features = false, features = ["simd"] }
```
## Performance

SIFT1M benchmark (1M vectors, 128 dimensions, Euclidean distance):

| Operation | Throughput | Notes |
|---|---|---|
| Parallel Insert | 4,000 vectors/sec | m=16, ef_construction=200 |
| Parallel Search (k=10) | 4,000 QPS | 97% recall@10 |
| Single Query Latency | ~1 ms p50 | k=10, single-threaded |
## Benchmarks

### SIFT1M Benchmark

Download the SIFT1M dataset from the INRIA TEXMEX corpus (it is not included in the repo), then run the benchmark.
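The exact commands depend on how this repo lays out its examples; a typical invocation might look like the following (the `sift1m` example name and the `data/` directory are assumptions, only the dataset URL is the well-known TEXMEX location):

```shell
# Fetch and unpack SIFT1M from the INRIA TEXMEX corpus
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
mkdir -p data && tar -xzf sift.tar.gz -C data/

# Run the benchmark in release mode (example name is hypothetical)
cargo run --release --example sift1m
```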
Expected output:

```text
=== SIFT1M Benchmark ===
[1/4] Loading dataset...
  Base vectors:  1000000 x 128
  Query vectors: 10000 x 128
  Ground truth:  10000 x 100
[2/4] Building index (parallel)...
  Build time:  ~4 minutes
  Vectors/sec: ~4000
[3/4] Benchmarking search (parallel)...
  Recall@1:   96.7% | QPS: ~4000
  Recall@10:  97.1% | QPS: ~4000
  Recall@100: 94.0% | QPS: ~4000
[4/4] Latency distribution (k=10, single-threaded)...
  Avg: ~1.0 ms
  P50: ~1.0 ms
  P95: ~1.5 ms
  P99: ~1.7 ms
```
## License

MIT OR Apache-2.0