# SatoriDB

Billion-scale embedded vector database for approximate nearest neighbor (ANN) search.
## Architecture
SatoriDB uses a two-tier architecture: an HNSW index routes queries to the most relevant buckets (clusters of similar vectors), then CPU-pinned workers scan those buckets in parallel. Vectors are automatically clustered and rebalanced as data grows.
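The two-tier flow can be sketched in plain Rust. This is an illustration only: a brute-force scan over centroids stands in for the HNSW index, buckets are plain `Vec`s, and all names are hypothetical rather than SatoriDB's actual types.

```rust
// Squared L2 distance between two vectors.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Tier 1: route to the `router_top_k` nearest centroids.
/// Tier 2: scan only those buckets and keep the `top_k` best hits.
fn search(
    centroids: &[Vec<f32>],
    buckets: &[Vec<(u64, Vec<f32>)>], // bucket i belongs to centroid i
    query: &[f32],
    router_top_k: usize,
    top_k: usize,
) -> Vec<(u64, f32)> {
    // Tier 1: pick the closest buckets by centroid distance
    // (SatoriDB uses HNSW here; brute force keeps the sketch small).
    let mut routed: Vec<usize> = (0..centroids.len()).collect();
    routed.sort_by(|&i, &j| l2(&centroids[i], query).total_cmp(&l2(&centroids[j], query)));
    routed.truncate(router_top_k);

    // Tier 2: scan the routed buckets only.
    let mut hits: Vec<(u64, f32)> = routed
        .iter()
        .flat_map(|&b| buckets[b].iter().map(|(id, v)| (*id, l2(v, query))))
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(top_k);
    hits
}

fn main() {
    let centroids = vec![vec![0.0, 0.0], vec![10.0, 10.0]];
    let buckets = vec![
        vec![(1, vec![0.1, 0.2]), (2, vec![0.3, 0.1])],
        vec![(3, vec![9.8, 10.1])],
    ];
    // Probing 1 of 2 buckets: only the cluster near the query is scanned.
    let res = search(&centroids, &buckets, &[0.0, 0.0], 1, 2);
    println!("{:?}", res);
}
```

Raising `router_top_k` scans more buckets, trading latency for recall, which is the knob described in the Parameters table below.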
## Features
- Embedded: runs entirely in-process, no external services
- Two-tier search: HNSW routing + parallel bucket scanning
- Automatic clustering: vectors grouped by similarity; buckets split as they grow
- CPU-pinned workers: Glommio executors with io_uring
- SIMD acceleration: AVX2/AVX-512 for distance computation
- Configurable durability: fsync schedules from "every write" to "no sync"
- Persistent storage: Walrus (topic-based append storage) + RocksDB indexes
**Linux only** (requires io_uring, kernel 5.8+)
## Install
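The install command below is an assumption: the README does not state the published package name, so `satoridb` is a hypothetical crate name; check the repository for the actual dependency line.

```toml
[dependencies]
# Crate name and version are assumptions, not confirmed by this README.
satoridb = "0.1"
```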
## Quick Start

```rust
use std::sync::Arc;
use walrus::Walrus; // module path assumed; only the type name survived formatting
// Two further `use` lines (and the body of this example) were lost in
// formatting; the API section below shows the core calls.
```
## API

### Core Operations

```rust
// Argument lists below were lost in formatting; the names shown are
// reconstructed from the comments and are illustrative.

// Insert a vector (id, data, optional bucket_hint)
api.upsert_blocking(id, vector, None)?;

// Query: returns Vec<(id, distance)>
let results = api.query_blocking(query, top_k, router_top_k)?;

// Query with vectors inline: returns Vec<(id, distance, vector)>
let results = api.query_with_vectors_blocking(query, top_k, router_top_k)?;

// Fetch vectors by ID (via the RocksDB index)
let vectors = api.fetch_vectors_by_id_blocking(ids)?;
```
### Parameters

| Parameter | Description |
|---|---|
| `top_k` | Number of results to return |
| `router_top_k` | Number of buckets to probe (higher = better recall, slower) |
## Architecture Details

See `docs/architecture.md` for detailed documentation, including:
- System overview and component diagrams
- Two-tier search architecture
- Storage layer (Walrus + RocksDB)
- Rebalancer and clustering algorithms
- Data flow diagrams
```text
SatoriHandle ──▶ Router Manager ──▶ HNSW Index (centroids)
    │                 │
    │   ┌─────────────┘
    │   ▼
    │ Bucket IDs ──▶ Consistent Hash Ring
    │                      │
    ▼                      ▼
Workers ◀───────────── bucket_id → shard
    │
    ▼
Walrus (storage) + RocksDB (indexes)
```
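The `bucket_id → shard` step in the diagram is a consistent-hash lookup. A toy ring, illustrative rather than SatoriDB's actual implementation, might look like:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash64<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

/// Toy consistent-hash ring: each shard owns several virtual points; a
/// bucket maps to the first point clockwise from its own hash, so adding
/// or removing a shard only moves the buckets adjacent to its points.
struct Ring {
    points: BTreeMap<u64, usize>, // ring position -> shard
}

impl Ring {
    fn new(shards: usize, vnodes: usize) -> Self {
        let mut points = BTreeMap::new();
        for s in 0..shards {
            for v in 0..vnodes {
                points.insert(hash64(&(s, v)), s);
            }
        }
        Ring { points }
    }

    fn shard_for(&self, bucket_id: u64) -> usize {
        let h = hash64(&bucket_id);
        // First point at or after h, wrapping around to the ring's start.
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next())
            .map(|(_, &s)| s)
            .unwrap()
    }
}

fn main() {
    let ring = Ring::new(4, 16);
    // The same bucket always resolves to the same shard.
    println!("bucket 42 -> shard {}", ring.shard_for(42));
}
```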
## Configuration
| Variable | Default | Description |
|---|---|---|
| `SATORI_REBALANCE_THRESHOLD` | `2000` | Split a bucket when its vector count exceeds this |
| `SATORI_ROUTER_REBUILD_EVERY` | `1000` | Rebuild the HNSW index after N upserts |
| `SATORI_WORKER_CACHE_BUCKETS` | `64` | Max buckets cached per worker |
| `SATORI_WORKER_CACHE_BUCKET_MB` | `64` | Max MB per cached bucket |
| `SATORI_VECTOR_INDEX_PATH` | `vector_index` | RocksDB path for the id→vector index |
| `SATORI_BUCKET_INDEX_PATH` | `bucket_index` | RocksDB path for the id→bucket index |
| `WALRUS_DATA_DIR` | `./wal_files` | Storage directory |
## Durability

Configure via `FsyncSchedule` when creating Walrus:

```rust
// Fsync every 200 ms (the default): balances durability and throughput
FsyncSchedule::Milliseconds(200) // argument form reconstructed, not verified

// Fsync on every write: maximum durability
FsyncSchedule::SyncEach

// Never fsync: maximum throughput, data loss on crash
FsyncSchedule::NoFsync
```
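For intuition, the three policies boil down to a per-write decision like the following sketch. The variant names come from this README; the logic is illustrative, not Walrus's implementation:

```rust
use std::time::{Duration, Instant};

/// Mirrors the schedule variants named above (argument form assumed).
enum FsyncSchedule {
    Milliseconds(u64), // fsync at most once per interval
    SyncEach,          // fsync after every write
    NoFsync,           // never fsync; the OS decides when data hits disk
}

/// Would a writer call fsync for a write happening at `now`,
/// given the last sync time?
fn should_fsync(s: &FsyncSchedule, last_sync: Instant, now: Instant) -> bool {
    match s {
        FsyncSchedule::SyncEach => true,
        FsyncSchedule::NoFsync => false,
        FsyncSchedule::Milliseconds(ms) => now - last_sync >= Duration::from_millis(*ms),
    }
}

fn main() {
    let t0 = Instant::now();
    // Under the 200 ms schedule, a write 250 ms after the last sync fsyncs.
    assert!(should_fsync(&FsyncSchedule::Milliseconds(200), t0, t0 + Duration::from_millis(250)));
}
```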
## Build
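The original build instructions were lost in formatting; assuming a standard Cargo workspace with no extra build steps, a release build would be:

```shell
cargo build --release
```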
## Test
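Likewise, assuming the standard Cargo test harness (the original command was lost in formatting):

```shell
cargo test
```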
## Benchmark (BigANN)
- Requires significant disk space (~1 TB+ for the download plus the converted dataset); see the `Makefile` targets.
- Run `make benchmark` to download the BigANN base/query/ground-truth sets, convert the base set via `prepare_dataset`, and execute the benchmark (`SATORI_RUN_BENCH=1 cargo run --release --bin satoridb`).
- The default ingest ceiling is 1B vectors (BigANN); ingestion is streamed, and queries run via `src/bin/satoridb.rs`.
- On 1B+ (bigger-than-RAM) workloads, the benchmark reports 95%+ recall using the default settings.
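The recall figure comes from comparing ANN results against the BigANN ground truth. A self-contained recall@k helper (illustrative, not the benchmark's actual code):

```rust
/// Recall@k: the fraction of the true top-k neighbors that the ANN
/// results recovered. 1.0 means every true neighbor was found.
fn recall_at_k(found: &[u64], ground_truth: &[u64], k: usize) -> f64 {
    let truth = &ground_truth[..k.min(ground_truth.len())];
    let hits = found
        .iter()
        .take(k)
        .filter(|&id| truth.contains(id))
        .count();
    hits as f64 / truth.len() as f64
}

fn main() {
    let truth = [1, 2, 3, 4, 5]; // exact top-5 from ground truth
    let ann = [1, 2, 3, 9, 5];   // ANN found 4 of the 5 true neighbors
    println!("recall@5 = {}", recall_at_k(&ann, &truth, 5));
}
```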
## License
See LICENSE.
Note: SatoriDB is in early development (v0.1.0). APIs may change between versions. See CHANGELOG.md for release notes.