bplus_store

Copy-on-write B+ tree with page-aligned storage, split/merge, and crash-safety primitives.

Status: alpha — APIs may change. License: MIT OR Apache-2.0

Embedded, copy-on-write B+-tree key-value store in Rust. Synchronous, zero-network. Multi-writer with optimistic commits (CAS). Snapshot readers via epoch-based reclamation.


Why

  • Multi-writer: concurrent writers proceed in parallel; losers retry via OCC.
  • Crash-safe: COW + atomic metadata publish; no WAL, no fsck.
  • Predictable perf: slotted pages, bounded fanout, no heap storms.
  • Epoch snapshots: readers pin an epoch and never block writers.
  • Embedded-first: link as a library, call put/get/delete.
  • Pluggable encoding: NodeStorage is a swappable node encoding strategy on top of raw PageStorage.

Features

  • Copy-on-write page mutation via NodeView over [u8; 4096] pages
  • Multiple concurrent writers (optimistic concurrency; CAS on metadata)
  • Batched write transactions (stage → commit → reclaim)
  • In-memory page cache in PagedNodeStorage (COW-coherent, eviction via epoch GC)
  • Cursor-based range iteration (parent-stack traversal, no sibling pointers)
  • Physical fullness handling: large values trigger page splits before reaching max keys
  • Pluggable node encoding via NodeStorage trait; raw page I/O via PageStorage trait
  • Multi-tree support: one database directory, many named trees (create, rename, drop, list)
  • Manifest-based crash recovery with CRC-framed catalog log
  • Superblock and metadata page CRC validation
  • Exclusive file locking to prevent multi-process corruption
  • Built-in order-preserving codecs for u64, i64, String, Vec<u8>
  • Typed Tree<K, V> API with KeyCodec / ValueCodec traits
  • Thread-safe handles via Arc-based storage ownership (no unsafe in the public API)

Quick start

[dependencies]
bplus_store = "0.4"

Build & test

cargo build
cargo test --tests
cargo run --example bytes_api
cargo run --example typed_api
cargo bench

Bytes-level API

use bplus_store::api::Db;

let dir = tempfile::tempdir()?;
let db = Db::open(dir.path())?;
let tree = db.create_tree::<Vec<u8>, Vec<u8>>("data", 64)?;

tree.put(&b"alpha".to_vec(), &b"1".to_vec())?;
tree.put(&b"beta".to_vec(), &b"2".to_vec())?;

let val = tree.get(&b"alpha".to_vec())?;
assert_eq!(val.as_deref(), Some(&b"1"[..]));

tree.delete(&b"alpha".to_vec())?;

Typed API

use bplus_store::api::Db;

let dir = tempfile::tempdir()?;
let db = Db::open(dir.path())?;
let tree = db.create_tree::<u64, String>("users", 64)?;

tree.put(&42, &"answer".to_string())?;
assert_eq!(tree.get(&42)?.as_deref(), Some("answer"));

Batched write transaction

let tree = db.create_tree::<u64, String>("events", 64)?;

let mut txn = tree.txn();
txn.insert(&1, &"first".to_string());
txn.insert(&2, &"second".to_string());
txn.commit()?;  // atomic CAS; retries internally on conflict

API surface

Db

  • Db::open(dir) — opens or creates a database
  • db.create_tree::<K, V>(name, order) — creates a named tree
  • db.open_tree::<K, V>(name) — opens an existing tree
  • db.tree::<K, V>(name, order) — open-or-create
  • db.rename_tree(old, new) — rename a tree (recorded in manifest)
  • db.drop_tree(name) — remove a tree from the catalog
  • db.list_trees() — returns all tree names
  • db.close() — checkpoints the freelist and drops the database
  • db.format_version() — on-disk format version from the superblock

Tree<K, V>

  • tree.put(&key, &value) — insert or replace
  • tree.get(&key) — lookup, returns Option<V>
  • tree.contains_key(&key) — check existence without decoding the value
  • tree.delete(&key) — remove
  • tree.txn() — start a batched WriteTxn
  • tree.range(&start, &end) — forward range scan [start, end)
  • tree.range_from(&start) — forward range scan from start to end of tree
  • tree.len() / tree.is_empty()
  • tree.height() — current tree height (1 = single leaf)

WriteTxn<K, V>

  • txn.insert(&key, &value) — stage an insert
  • txn.delete(&key) — stage a delete
  • txn.commit() — atomically apply all staged operations

delete returns an error if the key is not found. commit returns Err(ApiError::TxnAborted) if the retry budget is exhausted.
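
A hedged sketch of handling these errors, continuing the typed example above. It assumes commit() returns Result<(), ApiError> and that ApiError is importable from bplus_store::api (per the source layout below); only the TxnAborted variant is named in this README, so the other arm stays generic.

use bplus_store::api::ApiError;

let mut txn = tree.txn();
txn.insert(&3, &"third".to_string());
match txn.commit() {
    Ok(()) => {}
    // Retry budget exhausted under heavy write contention: re-stage the batch or give up.
    Err(ApiError::TxnAborted) => eprintln!("commit aborted after exhausting retries"),
    // Any other failure; assumes the enclosing function returns Result<(), ApiError>.
    Err(other) => return Err(other),
}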


Multi-writer semantics (OCC)

Multiple writers run in parallel:

  1. Capture a base version (committed metadata pointer).
  2. Apply writes on a staged tree (COW pages).
  3. Commit by CAS-ing the metadata pointer.

If another writer published first, the transaction rebases from the latest root and retries (up to a configurable limit). Readers never block writers.
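
A simplified, self-contained sketch of that loop, using a single AtomicU64 as a stand-in for the committed metadata pointer (illustrative only, not the crate's internals):

use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the committed metadata pointer (e.g. the page ID of the current root).
static META: AtomicU64 = AtomicU64::new(0);

fn commit_with_occ(mut retries: u32) -> Result<u64, &'static str> {
    loop {
        // 1. Capture the base version (committed metadata).
        let base = META.load(Ordering::Acquire);
        // 2. Build the staged tree against `base` (COW pages); just a placeholder here.
        let new_root = base + 1;
        // 3. Publish with a CAS; if another writer won, rebase from the new base and retry.
        match META.compare_exchange(base, new_root, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return Ok(new_root),
            Err(_) if retries > 0 => retries -= 1,
            Err(_) => return Err("retry budget exhausted"),
        }
    }
}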


Durability and fsync

Each commit follows a strict sequence:

  1. CAS publish — the new metadata pointer becomes visible to in-process readers immediately (atomic swap, no disk I/O).
  2. Write metadata page — the new (root_id, height, size, txn_id) is written to the inactive A/B metadata slot via positional write_all_at() (kernel page cache, not yet durable).
  3. fdatasync() — a single sync_data() call flushes all dirty pages in the data file to disk: both the COW node pages written during the transaction and the metadata page from step 2.
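
Purely as illustration, steps 2 and 3 map onto standard library calls roughly as follows. The slot offsets and the publish_meta/meta_slot_offset helpers are hypothetical, not the crate's actual layout or API:

use std::fs::File;
use std::os::unix::fs::FileExt;

// Hypothetical fixed offsets: superblock on page 0, metadata slots A/B right after it.
fn meta_slot_offset(slot: u64) -> u64 {
    4096 + slot * 4096
}

fn publish_meta(data_file: &File, meta_page: &[u8; 4096], txn_id: u64) -> std::io::Result<()> {
    // Step 2: write the encoded metadata page into the inactive A/B slot.
    let slot = txn_id % 2;                                       // alternate slots per commit
    data_file.write_all_at(meta_page, meta_slot_offset(slot))?;  // lands in the kernel page cache
    // Step 3: one fdatasync makes the COW node pages and this metadata page durable.
    data_file.sync_data()
}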

Crash safety

The A/B metadata slot alternation provides atomic commit semantics without a WAL. Each commit writes to slot = txn_id % 2, leaving the previous slot untouched. On recovery, MetadataManager::read_active_meta reads both slots and picks the one with the highest txn_id and a valid CRC32 checksum.

  • Crash before fdatasync() — the new metadata page may not be on disk. Recovery reads the old slot, which is still valid. The tree rolls back to the prior commit.
  • Torn write to new slot — the CRC32 checksum detects it. Recovery falls back to the old slot.
  • Crash after fdatasync() — both node pages and metadata are durable. Recovery picks the new slot.
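
A sketch of that slot selection; the MetaSlot struct and pick_active helper are hypothetical stand-ins for what MetadataManager::read_active_meta decodes and compares:

// Hypothetical decoded form of one A/B metadata slot.
struct MetaSlot {
    txn_id: u64,
    crc_ok: bool, // CRC32 over the slot contents, checked at read time
    root_id: u64,
    height: u32,
    size: u64,
}

// Pick the newest slot that passes its checksum; otherwise fall back to the older one.
fn pick_active(slot_a: MetaSlot, slot_b: MetaSlot) -> Option<MetaSlot> {
    let mut slots = [slot_a, slot_b];
    slots.sort_by_key(|s| std::cmp::Reverse(s.txn_id));
    slots.into_iter().find(|s| s.crc_ok)
}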

Why no WAL?

COW + A/B metadata swap provides atomic commits without a write-ahead log — old pages are never modified, so there is nothing to undo. The only trade-off is that a crash between CAS and fdatasync can leak pages (allocated but unreachable); these waste space but never cause data loss. A WAL may be added in the future primarily as a replication log, where the per-commit fsync is unavoidable anyway.

Known side effect

sync_data() operates on the entire file descriptor, not a byte range. This means a commit also flushes speculative COW pages written by other concurrent writers that have not yet committed. Those pages are harmless (orphaned if the writer never commits) but represent minor wasted I/O under concurrent write workloads. This is inherent to the single-file, shared page pool design and is not a correctness issue.


Epoch-based reclamation

Readers pin an epoch while walking a snapshot; writers tag retired pages with the current epoch at commit. A reclaimer frees a page only after every reader whose pinned epoch is at or below the page's retire epoch has unpinned. No blocking, no use-after-free.

  1. Pin: reader grabs epoch_now and reads from the committed root.
  2. Write: writer builds a staged tree (COW), collects reclaimed node IDs.
  3. Commit: writer CAS-publishes new metadata (root_id, height, size). On success, tag each reclaimed page with retire_epoch = epoch_now.
  4. GC: compute min_pinned across threads; free any page with retire_epoch < min_pinned.
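
A condensed sketch of the rule in steps 3 and 4, using illustrative names rather than the crate's EpochManager API:

use std::collections::HashMap;

type Epoch = u64;
type PageId = u64;

// Pages retired at commit time, tagged with the epoch current at that commit.
struct RetireList {
    retired: Vec<(PageId, Epoch)>,
}

impl RetireList {
    // Free every page whose retire epoch is older than the oldest pinned epoch.
    fn collect(&mut self, pinned: &HashMap<std::thread::ThreadId, Epoch>) -> Vec<PageId> {
        let min_pinned = pinned.values().copied().min().unwrap_or(Epoch::MAX);
        let (free, keep): (Vec<_>, Vec<_>) =
            self.retired.drain(..).partition(|&(_, e)| e < min_pinned);
        self.retired = keep;
        free.into_iter().map(|(id, _)| id).collect()
    }
}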

On-disk layout

<dir>/
  data.db            # all pages: superblock, tree nodes, metadata slots
  manifest.log       # append-only CRC-framed catalog log
  freelist.snapshot  # optional; written on graceful shutdown
  db.lock            # exclusive flock held while the database is open

Recovery path

database::open acquires an exclusive file lock (db.lock), validates the superblock (page 0, including CRC-32C), replays the CRC-framed manifest to rebuild the in-memory catalog, then reconciles each tree's catalog entry against its on-disk A/B metadata pages (source of truth for root_id, height, size after a crash). If a freelist snapshot exists, freed page IDs are restored so they can be reused.

Key components

  • Superblock (page 0): magic number, format version, generation counter, CRC-32C.
  • Manifest: append-only log of CreateTree, RenameTree, DeleteTree, Checkpoint records. Each record is CRC-framed; truncated trailing records (crash mid-write) are silently skipped, while CRC mismatches are reported as corruption.
  • Catalog: in-memory map of TreeId -> TreeMeta, rebuilt by replaying the manifest.
  • Per-tree metadata: A/B alternating pages storing (root_node_id, height, size, txn_id). Commit writes to the inactive slot; readers always see a consistent pair. Each page is CRC32-validated on read.
  • File lock: exclusive flock on db.lock prevents concurrent access from multiple processes.

Architecture

For a detailed description of the architecture, design decisions, and trade-offs, see ARCHITECTURE.md.

src/
  api.rs, api/                  # Db, Tree<K,V>, WriteTxn, ApiError
  codec.rs, codec/              # KeyCodec/ValueCodec traits, bincode codecs, kv (API codecs)
  database.rs, database/        # Database, catalog, manifest (reader/writer), metadata, superblock
  bplustree/                    # BPlusTree core: search, insert, delete, commit, transaction
  storage.rs, storage/          # PageStorage, NodeStorage, FilePageStorage, PagedNodeStorage,
                                #   EpochManager, MetadataManager, page cache
  page.rs, page/                # Slotted page layouts (leaf, internal)
  keyfmt.rs, keyfmt/            # Key encoding formats (raw, prefix-compressed)
  layout.rs                     # PAGE_SIZE constant
examples/
  bytes_api.rs                  # Vec<u8> key/value CRUD
  typed_api.rs                  # u64/String with batched transaction
  concurrent_web_store.rs       # Multi-threaded HTTP fetch + concurrent tree writes
  example.rs                    # async HTTP fetch + store
benches/
  bench_insert.rs               # Criterion benchmarks

Layer overview (bottom to top)

Page layer (page/, layout.rs): fixed 4 KB slotted pages. Header → slot directory → packed data region. Leaf pages store (key, value) pairs; internal pages store (key, right_child) with leftmost_child in the header.
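
A rough illustration of that layout; the header fields and widths are hypothetical, not the crate's actual page encoding:

const PAGE_SIZE: usize = 4096;

// One 4 KB page, illustratively:  [ header | slot directory -> ... free ... <- packed cells ]
// Slots grow down from the header; cells (key/value or key/right_child) grow up from the end.
#[repr(C)]
struct PageHeader {
    node_kind: u8,          // leaf or internal
    key_count: u16,         // occupied slots
    slot_dir_end: u16,      // byte offset where the slot directory ends
    cell_region_start: u16, // byte offset where the packed cell region begins
    leftmost_child: u64,    // internal nodes only, per the description above
}

// Each slot is just the offset of one cell within the page.
type Slot = u16;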

Storage layer (storage.rs, storage/): PageStorage trait for raw page I/O; NodeStorage trait for encoded node I/O (pluggable encoding strategy). FilePageStorage is the concrete file-backed PageStorage. PagedNodeStorage<S> wraps any PageStorage into a NodeStorage with an in-memory read cache of decoded NodeViews. Cache correctness relies on COW immutability — a page ID's content never changes while the page is live.
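
The split between the two traits might be sketched like this. Only the trait names come from the crate; the associated types, method names, and signatures are hypothetical:

// Raw page I/O: fixed-size pages addressed by ID.
trait PageStorage {
    type Error;
    fn read_page(&self, page_id: u64, buf: &mut [u8; 4096]) -> Result<(), Self::Error>;
    fn write_page(&self, page_id: u64, buf: &[u8; 4096]) -> Result<(), Self::Error>;
    fn allocate_page(&self) -> Result<u64, Self::Error>;
}

// Encoded node I/O layered on top: decode/encode nodes, COW means a store always gets a new ID.
trait NodeStorage {
    type Node; // decoded node view, e.g. NodeView
    type Error;
    fn load(&self, node_id: u64) -> Result<Self::Node, Self::Error>;
    fn store(&self, node: &Self::Node) -> Result<u64, Self::Error>;
}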

Database layer (database.rs, database/): Database<S> owns a PagedNodeStorage<S> for node encoding and an Arc<S> for raw metadata I/O (both share the same underlying storage instance). Manages the superblock, manifest, catalog, and tree lifecycle.

B+ tree core (bplustree/): BPlusTree / SharedBPlusTree — search, insert, delete, commit with CAS. WriteTransaction buffers operations for batched atomic commits.

API layer (api.rs, api/): Db wraps a Database in Arc and hands out typed Tree<K, V> handles. Storage is shared via Arc, so tree handles are independently owned and can be freely sent across threads. Purely synchronous.


Design trade-offs: COW, sibling pointers, and batched writes

Why COW?

Copy-on-write is the foundation of the concurrency model. Every write clones only the pages it touches (leaf + ancestors), then atomically publishes a new root via CAS on the metadata pointer. Readers never block writers because they see a consistent snapshot pinned at their epoch. This is the same approach used by LMDB, BoltDB, and redb in production.

Sibling pointers and range iteration

Traditional B+ trees link leaves with next/prev pointers for fast sequential scans. Under COW this creates a cascade problem: COW-copying one leaf gives it a new page ID, which invalidates its left sibling's next pointer, forcing a COW-copy of that sibling too, and so on through the entire leaf chain.

The standard solution (used by LMDB, BoltDB, redb) is to not use sibling pointers at all. Range iteration instead uses a cursor that maintains a stack of (node_id, index) frames from root to leaf. When a leaf is exhausted, the cursor pops up to the parent, advances the index, and descends back down. The cost is O(log n) per leaf transition in the worst case, but in practice the tree height is 3-5 even for millions of keys, and parent pages are hot in cache.

This cursor-based iterator is implemented in BPlusTreeIter and exposed through tree.range() and tree.range_from().
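
A simplified sketch of the leaf-to-leaf advance, with node accessors abstracted as closures (illustrative only, not the BPlusTreeIter internals):

// Each frame remembers which child (or key) index we descended through.
struct Frame {
    node_id: u64,
    index: usize,
}

struct Cursor {
    stack: Vec<Frame>, // root .. leaf
}

impl Cursor {
    // When the current leaf is exhausted: pop until a parent has a next child,
    // advance that index, then descend along the leftmost path back down to a leaf.
    // `child_count` returns 0 for leaves; `child_at` returns the child's node ID.
    fn advance_leaf(&mut self,
                    child_count: impl Fn(u64) -> usize,
                    child_at: impl Fn(u64, usize) -> u64) -> Option<u64> {
        while let Some(mut frame) = self.stack.pop() {
            if frame.index + 1 < child_count(frame.node_id) {
                frame.index += 1;
                let mut next = child_at(frame.node_id, frame.index);
                self.stack.push(frame);
                while child_count(next) > 0 {
                    self.stack.push(Frame { node_id: next, index: 0 });
                    next = child_at(next, 0);
                }
                self.stack.push(Frame { node_id: next, index: 0 }); // the new current leaf
                return Some(next);
            }
        }
        None // ran off the end of the tree
    }
}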

Batched writes

The WriteTransaction buffers operations and replays them against the current root at commit time. If the CAS fails (another writer committed first), it rebases from the new root and retries. This is correct for the OCC model. Two potential improvements for large batches:

  • Sort the batch by key before replay, so leaf access is sequential and minimises the number of distinct COW page copies.
  • Bulk-load path for initial data ingestion: build subtrees bottom-up rather than inserting through the tree one key at a time.

Physical fullness and large values

The tree handles both logical overflow (keys_len() > max_keys) and physical overflow (PageFull from the slotted page layer). Large values can fill a 4 KB page before reaching the tree order, triggering page splits at the physical level. Entries are validated upfront: key_len + val_len must not exceed MAX_ENTRY_PAYLOAD (2038 bytes), guaranteeing that at least two entries always fit per page so splits produce valid halves.
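
The upfront check amounts to something like the following; the constant value comes from the text above, and the rejection maps to the TreeError::EntryTooLarge mentioned under Gotchas:

const MAX_ENTRY_PAYLOAD: usize = 2038; // guarantees at least two entries per 4 KB page

fn validate_entry(key: &[u8], val: &[u8]) -> Result<(), &'static str> {
    if key.len() + val.len() > MAX_ENTRY_PAYLOAD {
        // The crate reports this as TreeError::EntryTooLarge.
        return Err("entry exceeds MAX_ENTRY_PAYLOAD");
    }
    Ok(())
}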

Where this design fits

  • Embedded databases (the LMDB/redb/BoltDB niche) where the store is linked as a library, not accessed over a network.
  • Read-heavy workloads where readers must never block and always see consistent snapshots.
  • Crash safety without a WAL: COW gives atomic commits for free since old pages survive until the new root is published.
  • Low-to-moderate write contention: OCC retries are cheap when conflicts are rare.

Where it struggles

  • Write-heavy workloads with high contention: OCC retries discard and redo all speculative work.
  • Large sequential bulk loads: COW copies O(height) pages per insert; a bulk-load path would amortise this.
  • Values larger than ~2 KB: entries must fit within MAX_ENTRY_PAYLOAD (2038 bytes). Overflow pages or external value storage are not yet implemented.

Gotchas

  • Order-preserving keys: if your codec doesn't preserve lexicographic order, scans will be wrong (see the encoding sketch after this list).
  • Commit conflicts: normal under load. WriteTxn retries automatically up to a budget.
  • Entry size limit: key + value must fit within 2038 bytes (MAX_ENTRY_PAYLOAD). Entries exceeding this limit are rejected with TreeError::EntryTooLarge.
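
To illustrate the order-preserving requirement from the first gotcha: big-endian integer encoding sorts correctly under byte-wise comparison, while the native little-endian bytes do not. A standalone sketch, not the crate's built-in codec:

// Big-endian bytes compare in the same order as the integers themselves.
fn encode_key(k: u64) -> [u8; 8] {
    k.to_be_bytes()
}

let (a, b) = (1u64, 256u64);
assert!(encode_key(a) < encode_key(b));     // big-endian: order preserved
assert!(a.to_le_bytes() > b.to_le_bytes()); // little-endian mis-sorts under byte comparison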

Roadmap

  • Prefix-compressed key block format (PrefixRestarts) — Keys are currently stored verbatim in each slot. When keys share long common prefixes, this wastes significant page space. Prefix compression stores the shared prefix once and only the differing suffix per key, with periodic restart points for random access within the block. This increases key density per page and reduces I/O for prefix-heavy workloads.

  • Bulk-load path for large initial imports — Inserting N keys one-by-one through the tree incurs O(height) COW copies per key. A bulk-load path sorts all keys upfront, fills leaves left-to-right, and builds internal nodes bottom-up. Orders of magnitude faster for initial data ingestion compared to incremental inserts.

  • Overflow pages for values exceeding MAX_ENTRY_PAYLOAD — Currently key + value must fit within 2038 bytes. Overflow pages would store large values across multiple linked pages, removing this size constraint. This is standard in production B-trees (SQLite, LMDB).

  • Fuzz testing (cargo-fuzz) — Use coverage-guided fuzzing to generate random sequences of inserts, deletes, splits, and merges, then verify tree invariants hold after each operation. Catches edge cases in the slotted page layout and codec encode/decode roundtrips that hand-written tests are unlikely to cover.

  • Configurable page size — Currently hardcoded to 4 KB. Some workloads benefit from larger pages (16 KB, 64 KB) for fewer tree levels and better sequential throughput; smaller pages reduce write amplification under update-heavy workloads. Making this configurable requires storing the page size in the superblock and threading it through the page layer.

  • Sharded epoch pinning — EpochManager::pin()/unpin() currently acquire a central Mutex<HashMap<ThreadId, Epoch>> on every read operation. Under high reader concurrency this serialises the pin/unpin brackets even though the tree walk itself is lock-free. Replace with per-thread atomic slots (a Vec<AtomicU64> indexed by a thread-claimed slot) so that pin is a single atomic store and oldest_active() is a lock-free scan. This is the approach used by crossbeam-epoch and similar libraries.
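
A sketch of the per-thread slot scheme described in that item; slot claiming, resizing, and epoch wraparound are elided:

use std::sync::atomic::{AtomicU64, Ordering};

const UNPINNED: u64 = u64::MAX;

// One atomic slot per reader thread; the index is claimed once per thread.
struct EpochSlots {
    slots: Vec<AtomicU64>,
}

impl EpochSlots {
    fn pin(&self, slot: usize, epoch_now: u64) {
        self.slots[slot].store(epoch_now, Ordering::SeqCst); // single atomic store
    }
    fn unpin(&self, slot: usize) {
        self.slots[slot].store(UNPINNED, Ordering::SeqCst);
    }
    // Lock-free scan: the oldest epoch still pinned by any reader.
    fn oldest_active(&self) -> u64 {
        self.slots.iter().map(|s| s.load(Ordering::SeqCst)).min().unwrap_or(UNPINNED)
    }
}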


License

Dual-licensed under MIT or Apache-2.0. You may choose either license.

Contact

Paris Mesidis — pmesidis@gmail.com