satoridb 0.1.0

Embedded vector database for approximate nearest neighbor search (experimental).

SatoriDB

Billion-scale embedded vector database for approximate nearest neighbor (ANN) search.

Architecture

SatoriDB uses a two-tier architecture: an HNSW index routes queries to the most relevant buckets (clusters of similar vectors), then CPU-pinned workers scan those buckets in parallel. Vectors are automatically clustered and rebalanced as data grows.
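The two-tier flow can be sketched in plain Rust. This is an illustrative model, not SatoriDB's internals: the real router uses an HNSW index rather than this brute-force centroid scan, and the names here are hypothetical.

```rust
// Tier 1 routes a query to the closest buckets via their centroids;
// tier 2 scans only those buckets. Brute-force stand-in for HNSW.

fn sq_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Tier 1: pick the `router_top_k` buckets whose centroids are closest.
fn route(query: &[f32], centroids: &[Vec<f32>], router_top_k: usize) -> Vec<usize> {
    let mut ranked: Vec<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(i, c)| (i, sq_l2(query, c)))
        .collect();
    ranked.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    ranked.into_iter().take(router_top_k).map(|(i, _)| i).collect()
}

/// Tier 2: scan the chosen buckets for the `top_k` nearest vectors.
fn scan(
    query: &[f32],
    buckets: &[Vec<(u64, Vec<f32>)>],
    bucket_ids: &[usize],
    top_k: usize,
) -> Vec<(u64, f32)> {
    let mut hits: Vec<(u64, f32)> = bucket_ids
        .iter()
        .flat_map(|&b| buckets[b].iter())
        .map(|(id, v)| (*id, sq_l2(query, v)))
        .collect();
    hits.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    hits.truncate(top_k);
    hits
}

fn main() {
    let buckets = vec![
        vec![(1, vec![0.1, 0.2]), (2, vec![0.2, 0.1])],
        vec![(3, vec![0.9, 0.8])],
    ];
    let centroids = vec![vec![0.15, 0.15], vec![0.9, 0.8]];
    let probe = route(&[0.12, 0.18], &centroids, 1);
    let results = scan(&[0.12, 0.18], &buckets, &probe, 2);
    println!("{:?}", results); // only vectors from the probed bucket appear
}
```

Probing fewer buckets trades recall for speed; probing all of them degenerates to an exact scan.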

Features

  • Embedded: runs entirely in-process, no external services
  • Two-tier search: HNSW routing + parallel bucket scanning
  • Automatic clustering: vectors grouped by similarity, splits when buckets grow
  • CPU-pinned workers: Glommio executors with io_uring
  • SIMD acceleration: AVX2/AVX-512 for distance computation
  • Configurable durability: fsync schedules from "every write" to "no sync"
  • Persistent storage: Walrus (topic-based append storage) + RocksDB indexes

Linux only (requires io_uring, kernel 5.8+)

Install

cargo add satoridb

Quick Start

use std::sync::Arc;
use satoridb::wal::runtime::Walrus;
use satoridb::wal::{FsyncSchedule, ReadConsistency};
use satoridb::{SatoriDb, SatoriDbConfig};

fn main() -> anyhow::Result<()> {
    // Initialize storage (writes to wal_files/my_app/)
    let wal = Arc::new(Walrus::with_consistency_and_schedule_for_key(
        "my_app",
        ReadConsistency::StrictlyAtOnce,
        FsyncSchedule::Milliseconds(200),
    )?);

    // Start database with 4 worker threads
    let mut cfg = SatoriDbConfig::new(wal);
    cfg.workers = 4;
    let db = SatoriDb::start(cfg)?;
    let api = db.handle();

    // Upsert vectors
    api.upsert_blocking(1, vec![0.1, 0.2, 0.3], None)?;
    api.upsert_blocking(2, vec![0.2, 0.3, 0.4], None)?;
    api.upsert_blocking(3, vec![0.9, 0.8, 0.7], None)?;

    // Query: find 10 nearest neighbors, probe 200 buckets
    let results = api.query_blocking(vec![0.15, 0.25, 0.35], 10, 200)?;
    for (id, distance) in results {
        println!("id={id} distance={distance}");
    }

    db.shutdown()?;
    Ok(())
}
Or run the bundled example:

cargo run --example embedded_basic

API

Core Operations

// Insert a vector (id, data, optional bucket_hint)
api.upsert_blocking(id, vector, None)?;

// Query: returns Vec<(id, distance)>
let results = api.query_blocking(query_vector, top_k, router_top_k)?;

// Query with vectors inline: returns Vec<(id, distance, vector)>
let results = api.query_with_vectors_blocking(query_vector, top_k, router_top_k)?;

// Fetch vectors by ID (via RocksDB index)
let vectors = api.fetch_vectors_by_id_blocking(vec![1, 2, 3])?;

Parameters

Parameter      Description
top_k          Number of results to return
router_top_k   Number of buckets to probe (higher values improve recall at the cost of latency)

Architecture

See docs/architecture.md for detailed documentation including:

  • System overview and component diagrams
  • Two-tier search architecture
  • Storage layer (Walrus + RocksDB)
  • Rebalancer and clustering algorithms
  • Data flow diagrams

Query path:

SatoriHandle ──▶ Router Manager ──▶ HNSW Index (centroids)
      │                                   │
      │         ┌─────────────────────────┘
      │         ▼
      │    Bucket IDs ──▶ Consistent Hash Ring
      │                          │
      ▼                          ▼
  Workers ◀──────────────── bucket_id → shard
      │
      ▼
  Walrus (storage) + RocksDB (indexes)
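The bucket_id → shard step in the diagram can be pictured as a minimal consistent hash ring. This is a hypothetical sketch of the technique, not SatoriDB's actual ring; the type names and replica count are illustrative.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal consistent hash ring: each worker shard owns several virtual
/// points on a u64 ring; a bucket maps to the first point at or after
/// its own hash, wrapping around to the start of the ring.
struct Ring {
    points: BTreeMap<u64, usize>, // ring position -> shard id
}

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

impl Ring {
    fn new(shards: usize, replicas: usize) -> Self {
        let mut points = BTreeMap::new();
        for s in 0..shards {
            for r in 0..replicas {
                points.insert(hash_of(&(s, r)), s);
            }
        }
        Ring { points }
    }

    fn shard_for(&self, bucket_id: u64) -> usize {
        let h = hash_of(&bucket_id);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next()) // wrap around
            .map(|(_, &s)| s)
            .unwrap()
    }
}

fn main() {
    let ring = Ring::new(4, 16);
    for bucket in [7u64, 42, 1001] {
        println!("bucket {bucket} -> shard {}", ring.shard_for(bucket));
    }
}
```

A ring like this keeps the bucket → shard mapping stable as shards are added or removed, so only a small fraction of buckets move.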

Configuration

Variable                        Default        Description
SATORI_REBALANCE_THRESHOLD      2000           Split a bucket when its vector count exceeds this
SATORI_ROUTER_REBUILD_EVERY     1000           Rebuild the HNSW index after N upserts
SATORI_WORKER_CACHE_BUCKETS     64             Max buckets cached per worker
SATORI_WORKER_CACHE_BUCKET_MB   64             Max MB per cached bucket
SATORI_VECTOR_INDEX_PATH        vector_index   RocksDB path for the id→vector index
SATORI_BUCKET_INDEX_PATH        bucket_index   RocksDB path for the id→bucket index
WALRUS_DATA_DIR                 ./wal_files    Storage directory
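For example, to tune these before launching an application (the variable names come from the table above; the values are illustrative, not recommendations):

```shell
# Split buckets later and cache more of them per worker
export SATORI_REBALANCE_THRESHOLD=4000
export SATORI_WORKER_CACHE_BUCKETS=128
export WALRUS_DATA_DIR=/var/lib/myapp/wal

cargo run --release
```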

Durability

Configure via FsyncSchedule when creating Walrus:

// Fsync every 200ms (default), balances durability and throughput
FsyncSchedule::Milliseconds(200)

// Fsync every write, maximum durability
FsyncSchedule::SyncEach

// No fsync, maximum throughput, data loss on crash
FsyncSchedule::NoFsync
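The tradeoff between the three schedules comes down to a sync-decision policy. The sketch below is a hypothetical model of that logic for intuition only, not Walrus's implementation; it mirrors the variant names from the source.

```rust
use std::time::Duration;

/// Illustrative model of the fsync policies above (not Walrus internals).
enum FsyncSchedule {
    SyncEach,         // fsync after every write: maximum durability
    Milliseconds(u64),// fsync at most once per interval: group commit
    NoFsync,          // never fsync: fastest, data loss on crash
}

impl FsyncSchedule {
    /// Should the writer fsync now, given the time since the last sync?
    fn should_sync(&self, since_last_sync: Duration) -> bool {
        match self {
            FsyncSchedule::SyncEach => true,
            FsyncSchedule::Milliseconds(ms) => since_last_sync >= Duration::from_millis(*ms),
            FsyncSchedule::NoFsync => false,
        }
    }
}

fn main() {
    let batched = FsyncSchedule::Milliseconds(200);
    assert!(!batched.should_sync(Duration::from_millis(50)));  // batch still open
    assert!(batched.should_sync(Duration::from_millis(250)));  // interval elapsed
    assert!(FsyncSchedule::SyncEach.should_sync(Duration::ZERO));
    assert!(!FsyncSchedule::NoFsync.should_sync(Duration::from_secs(60)));
    println!("policy checks passed");
}
```

With Milliseconds(200), a crash can lose at most the writes from the last unsynced interval, which is why it sits between the other two schedules.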

Build

cargo build --release

Test

cargo test

Benchmark (BigANN)

  • Requires significant disk space (~1TB+ for the download plus the converted dataset). See the Makefile targets.
  • Run make benchmark to download the BigANN base, query, and ground-truth sets, convert the base set via prepare_dataset, and execute the benchmark (SATORI_RUN_BENCH=1 cargo run --release --bin satoridb).
  • The default ingest ceiling is 1B vectors (the full BigANN base set); ingestion and querying stream through src/bin/satoridb.rs.
  • On 1B+ (larger-than-RAM) workloads, the benchmark reports 95%+ recall with the default settings.

License

See LICENSE.

Note: SatoriDB is in early development (v0.1.0). APIs may change between versions. See CHANGELOG.md for release notes.