§MemVDB - An In-Memory Vector Database
MemVDB is a fast, lightweight in-memory vector database written in Rust. It supports multiple distance metrics and provides efficient similarity search for machine learning applications, recommendation systems, and semantic search.
§Features
- Multiple Distance Metrics: Euclidean, Cosine, and Dot Product
- High Performance: Optimized similarity search with binary heap algorithms
- Flexible Metadata: Store arbitrary metadata with each embedding
- Batch Operations: Efficient batch insertion and updates
- Thread Safety: Safe concurrent access with proper locking
- Minimal Dependencies: Few external dependencies for core functionality
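The thread-safety bullet above can be illustrated with the standard library alone. The following is a minimal sketch of lock-protected shared state, assuming a hypothetical `TinyStore` type and `concurrent_inserts` function; neither is part of MemVDB, and the crate's actual locking strategy may differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a vector store; this is not MemVDB's CacheDB.
#[derive(Default)]
struct TinyStore {
    vectors: HashMap<String, Vec<f32>>,
}

/// Spawns `writers` threads that each insert one vector,
/// then returns the number of vectors stored.
fn concurrent_inserts(writers: usize) -> usize {
    let store = Arc::new(Mutex::new(TinyStore::default()));

    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                // The lock is held only for the duration of this insert.
                let mut guard = store.lock().expect("mutex poisoned");
                guard.vectors.insert(format!("doc_{}", i), vec![i as f32; 4]);
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("writer thread panicked");
    }

    let guard = store.lock().expect("mutex poisoned");
    guard.vectors.len()
}
```

Calling `concurrent_inserts(8)` runs eight writers against the same store; because every insert goes through the mutex and each writer uses a distinct key, all eight entries are present once the threads have joined.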
§Quick Start
use memvdb::{CacheDB, Distance, Embedding};
use std::collections::HashMap;
// Create a new in-memory vector database
let mut db = CacheDB::new();
// Create a collection with 128-dimensional vectors using cosine similarity
db.create_collection("documents".to_string(), 128, Distance::Cosine).unwrap();
// Create an embedding with metadata
let mut id = HashMap::new();
id.insert("doc_id".to_string(), "doc_001".to_string());
let mut metadata = HashMap::new();
metadata.insert("title".to_string(), "Sample Document".to_string());
metadata.insert("category".to_string(), "AI".to_string());
let vector = vec![0.1; 128]; // 128-dimensional vector
let embedding = Embedding {
    id,
    vector,
    metadata: Some(metadata),
};
// Insert the embedding
db.insert_into_collection("documents", embedding).unwrap();
// Perform similarity search
let query_vector = vec![0.2; 128];
let collection = db.get_collection("documents").unwrap();
let results = collection.get_similarity(&query_vector, 5);
println!("Found {} similar documents", results.len());
§Distance Metrics
MemVDB supports three distance metrics optimized for different use cases:
- Euclidean Distance: Best for spatial data and when absolute distances matter
- Cosine Similarity: Ideal for text embeddings and high-dimensional sparse data
- Dot Product: Efficient for normalized vectors and neural network outputs
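To make the three metrics concrete, here is a minimal sketch of how each could be computed over `f32` slices. The function names are hypothetical and are not MemVDB's API; the crate's own implementations (for example, the one returned by `get_distance_fn`) may differ in signature, sign convention, and handling of zero vectors.

```rust
// Illustrative helpers only; these names are assumptions, not MemVDB functions.

/// Dot product of two equal-length vectors: larger means more similar.
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Euclidean (L2) distance: smaller means more similar.
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
/// Returns 0.0 if either vector has zero norm (a choice made for this sketch).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let norm_a = dot_product(a, a).sqrt();
    let norm_b = dot_product(b, b).sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot_product(a, b) / (norm_a * norm_b)
}
```

Note that for unit-length vectors the cosine similarity and the dot product coincide, which is why the dot product is a cheap choice when vectors are already normalized.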
§Architecture
The library is organized into three main modules:
- db: Core database functionality, collections, and embedding management
- similarity: Distance calculation and vector operations
- Public API exports for easy integration
§Performance Characteristics
- Insertion: O(1) average case for single embeddings
- Similarity Search: O(n) where n is the number of embeddings in the collection
- Memory Usage: Linear with number of embeddings and vector dimensions
- Concurrency: Thread-safe operations with mutex-based protection
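The linear-scan search described above can be sketched with a bounded binary heap: every stored vector is scored once, and the heap keeps only the best `k` candidates. This is an illustration under stated assumptions, not MemVDB's implementation; `Scored` and `k_nearest` are hypothetical names (the crate's own helper is `ScoreIndex`, whose definition is not shown here), and squared Euclidean distance is used as the score.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical (score, index) pair, ordered by score so it can live in a heap.
#[derive(Debug, Clone, Copy)]
struct Scored {
    score: f32,
    index: usize,
}

impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Scored {}

impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order, so NaN cannot break the heap.
        self.score
            .total_cmp(&other.score)
            .then_with(|| self.index.cmp(&other.index))
    }
}

/// Returns (index, squared distance) for the `k` vectors closest to `query`,
/// nearest first.
fn k_nearest(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    if k == 0 {
        return Vec::new();
    }
    // Max-heap on distance: the root is the worst of the current best k.
    let mut heap: BinaryHeap<Scored> = BinaryHeap::new();
    for (index, v) in vectors.iter().enumerate() {
        let score: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
        heap.push(Scored { score, index });
        if heap.len() > k {
            heap.pop(); // drop the farthest candidate
        }
    }
    // into_sorted_vec is ascending by score, i.e. nearest first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|s| (s.index, s.score))
        .collect()
}
```

The scan visits each of the n vectors once and each heap operation costs O(log k), so for a fixed `k` the running time grows linearly with the collection size, and the heap never holds more than `k + 1` entries.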
§Structs
- BatchInsertEmbeddingsStruct: Configuration for batch embedding operations.
- CacheDB: The main in-memory vector database structure.
- Collection: A collection of embeddings with a specific dimensionality and distance metric.
- CollectionHandlerStruct: Configuration for collection operations.
- CreateCollectionStruct: Configuration for creating a new collection.
- Embedding: An individual embedding (vector) with associated metadata.
- GetSimilarityStruct: Configuration for similarity search operations.
- InsertEmbeddingStruct: Configuration for inserting a single embedding.
- ScoreIndex: A helper structure for k-nearest neighbor search operations.
- SimilarityResult: Result of a similarity search operation.
§Enums
- Distance: Supported distance metrics for similarity calculations.
- Error: Error types for database operations.
§Functions
- add: Simple addition function for testing purposes.
- create_database: Convenience function for quick database setup.
- get_cache_attr: Pre-computes cacheable attributes for distance calculations.
- get_distance_fn: Returns the appropriate distance function for the specified metric.
- normalize: Normalizes a vector to unit length (L2 normalization).
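As an illustration of what L2 normalization means, the sketch below divides each component by the vector's Euclidean norm. `normalize_l2` is a hypothetical name chosen to avoid implying anything about the signature of the crate's own `normalize`, which is not shown on this page; returning a zero vector unchanged is likewise an assumption made for this example.

```rust
/// Hypothetical helper: returns `v` scaled to unit L2 norm.
/// A zero vector is returned unchanged to avoid dividing by zero.
fn normalize_l2(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}
```

For example, `[3.0, 4.0]` has norm 5, so normalizing it gives approximately `[0.6, 0.8]`, whose norm is 1.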