# bplus_store
> Status: **alpha** — APIs may change.
> License: **MIT OR Apache-2.0**
Embedded, copy-on-write **B+-tree** key-value store in Rust.
Synchronous, zero-network. **Multi-writer** with optimistic commits (CAS).
**Snapshot** readers via epoch-based reclamation.
---
## Why
- **Multi-writer:** concurrent writers proceed in parallel; losers retry via OCC.
- **Crash-safe:** COW + atomic metadata publish; no WAL, no fsck.
- **Predictable perf:** slotted pages, bounded fanout, no heap storms.
- **Epoch snapshots:** readers pin an epoch and never block writers.
- **Embedded-first:** link as a library, call `put`/`get`/`delete`.
- **Pluggable encoding:** `NodeStorage` is a swappable node encoding strategy on top of raw `PageStorage`.
---
## Features
- Copy-on-write page mutation via `NodeView` over `[u8; 4096]` pages
- Multiple concurrent writers (optimistic concurrency; CAS on metadata)
- Batched write transactions (stage → commit → reclaim)
- In-memory page cache in `PagedNodeStorage` (COW-coherent, eviction via epoch GC)
- Cursor-based range iteration (parent-stack traversal, no sibling pointers)
- Physical fullness handling: large values trigger page splits before reaching max keys
- Pluggable node encoding via `NodeStorage` trait; raw page I/O via `PageStorage` trait
- Multi-tree support: one database directory, many named trees (create, rename, drop, list)
- Manifest-based crash recovery with CRC-framed catalog log
- Superblock and metadata page CRC validation
- Exclusive file locking to prevent multi-process corruption
- Built-in order-preserving codecs for `u64`, `i64`, `String`, `Vec<u8>`
- Typed `Tree<K, V>` API with `KeyCodec` / `ValueCodec` traits
- Thread-safe handles via `Arc`-based storage ownership (no `unsafe` in the public API)
---
## Quick start
```toml
[dependencies]
bplus_store = "0.4"
```
### Build & test
```bash
cargo build
cargo test --tests
cargo run --example bytes_api
cargo run --example typed_api
cargo bench
```
### Bytes-level API
```rust
use bplus_store::api::Db;
let dir = tempfile::tempdir()?;
let db = Db::open(dir.path())?;
let tree = db.create_tree::<Vec<u8>, Vec<u8>>("data", 64)?;
tree.put(&b"alpha".to_vec(), &b"1".to_vec())?;
tree.put(&b"beta".to_vec(), &b"2".to_vec())?;
let val = tree.get(&b"alpha".to_vec())?;
assert_eq!(val.as_deref(), Some(&b"1"[..]));
tree.delete(&b"alpha".to_vec())?;
```
### Typed API
```rust
use bplus_store::api::Db;
let dir = tempfile::tempdir()?;
let db = Db::open(dir.path())?;
let tree = db.create_tree::<u64, String>("users", 64)?;
tree.put(&42, &"answer".to_string())?;
assert_eq!(tree.get(&42)?.as_deref(), Some("answer"));
```
### Batched write transaction
```rust
let tree = db.create_tree::<u64, String>("events", 64)?;
let mut txn = tree.txn();
txn.insert(&1, &"first".to_string());
txn.insert(&2, &"second".to_string());
txn.commit()?; // atomic CAS; retries internally on conflict
```
---
## API surface
### `Db`
- `Db::open(dir)` — opens or creates a database
- `db.create_tree::<K, V>(name, order)` — creates a named tree
- `db.open_tree::<K, V>(name)` — opens an existing tree
- `db.tree::<K, V>(name, order)` — open-or-create
- `db.rename_tree(old, new)` — rename a tree (recorded in manifest)
- `db.drop_tree(name)` — remove a tree from the catalog
- `db.list_trees()` — returns all tree names
- `db.close()` — checkpoints the freelist and drops the database
- `db.format_version()` — on-disk format version from the superblock
### `Tree<K, V>`
- `tree.put(&key, &value)` — insert or replace
- `tree.get(&key)` — lookup, returns `Option<V>`
- `tree.contains_key(&key)` — check existence without decoding the value
- `tree.delete(&key)` — remove
- `tree.txn()` — start a batched `WriteTxn`
- `tree.range(&start, &end)` — forward range scan `[start, end)`
- `tree.range_from(&start)` — forward range scan from `start` to end of tree
- `tree.len()` / `tree.is_empty()`
- `tree.height()` — current tree height (1 = single leaf)
### `WriteTxn<K, V>`
- `txn.insert(&key, &value)` — stage an insert
- `txn.delete(&key)` — stage a delete
- `txn.commit()` — atomically apply all staged operations
> `delete` returns an error if the key is not found.
> `commit` returns `Err(ApiError::TxnAborted)` if the retry budget is exhausted.
---
## Multi-writer semantics (OCC)
Multiple writers run in parallel:
1. Capture a **base version** (committed metadata pointer).
2. Apply writes on a staged tree (COW pages).
3. **Commit** by CAS-ing the metadata pointer.
If another writer published first, the transaction rebases from the latest root and
retries (up to a configurable limit). Readers never block writers.
---
## Durability and fsync
Each commit follows a strict sequence:
. **CAS publish** — the new metadata pointer becomes visible to in-process readers
mmediately (atomic swap, no disk I/O).
2. **Write metadata page** — the new `(root_id, height, size, txn_id)` is written to the
inactive A/B metadata slot via positional `write_all_at()` (kernel page cache, not yet
durable).
3. **`fdatasync()`** — a single `sync_data()` call flushes all dirty pages in the data
file to disk: both the COW node pages written during the transaction and the metadata
page from step 2.
### Crash safety
The **A/B metadata slot alternation** provides atomic commit semantics without a WAL.
Each commit writes to `slot = txn_id % 2`, leaving the previous slot untouched. On
recovery, `MetadataManager::read_active_meta` reads both slots and picks the one with
the highest `txn_id` and a valid CRC32 checksum.
- **Crash before `fdatasync()`** — the new metadata page may not be on disk. Recovery
reads the old slot, which is still valid. The tree rolls back to the prior commit.
- **Torn write to new slot** — the CRC32 checksum detects it. Recovery falls back to
the old slot.
- **Crash after `fdatasync()`** — both node pages and metadata are durable. Recovery
picks the new slot.
### Why no WAL?
COW + A/B metadata swap provides atomic commits without a write-ahead log — old
pages are never modified, so there is nothing to undo. The only trade-off is that
a crash between CAS and `fdatasync` can leak pages (allocated but unreachable);
these waste space but never cause data loss. A WAL may be added in the future
primarily as a **replication log**, where the per-commit `fsync` is unavoidable
anyway.
### Known side effect
`sync_data()` operates on the entire file descriptor, not a byte range. This means a
commit also flushes speculative COW pages written by other concurrent writers that have
not yet committed. Those pages are harmless (orphaned if the writer never commits) but
represent minor wasted I/O under concurrent write workloads. This is inherent to the
single-file, shared page pool design and is not a correctness issue.
---
## Epoch-based reclamation
Readers pin an **epoch** while walking a snapshot; writers retire old pages with the
**current epoch** at commit. A reclaimer frees pages only after all readers older than
that epoch have unpinned. No blocking, no use-after-free.
1. **Pin**: reader grabs `epoch_now` and reads from the committed root.
2. **Write**: writer builds a staged tree (COW), collects reclaimed node IDs.
3. **Commit**: writer CAS-publishes new metadata `(root_id, height, size)`. On success,
tag each reclaimed page with `retire_epoch = epoch_now`.
4. **GC**: compute `min_pinned` across threads; free any page with
`retire_epoch < min_pinned`.
---
## On-disk layout
```
<dir>/
data.db # all pages: superblock, tree nodes, metadata slots
manifest.log # append-only CRC-framed catalog log
freelist.snapshot # optional; written on graceful shutdown
db.lock # exclusive flock held while the database is open
```
### Recovery path
`database::open` acquires an exclusive file lock (`db.lock`), validates the superblock
(page 0, including CRC-32C), replays the CRC-framed manifest to rebuild the in-memory
catalog, then reconciles each tree's catalog entry against its on-disk A/B metadata pages
(source of truth for `root_id`, `height`, `size` after a crash). If a freelist snapshot
exists, freed page IDs are restored so they can be reused.
### Key components
- **Superblock** (page 0): magic number, format version, generation counter, CRC-32C.
- **Manifest**: append-only log of `CreateTree`, `RenameTree`, `DeleteTree`,
`Checkpoint` records. Each record is CRC-framed; truncated trailing records (crash
mid-write) are silently skipped, CRC mismatches are reported as corruption.
- **Catalog**: in-memory map of `TreeId -> TreeMeta`, rebuilt by replaying the manifest.
- **Per-tree metadata**: A/B alternating pages storing `(root_node_id, height, size, txn_id)`.
Commit writes to the inactive slot; readers always see a consistent pair.
Each page is CRC32-validated on read.
- **File lock**: exclusive `flock` on `db.lock` prevents concurrent access from multiple
processes.
---
## Architecture
For a detailed description of the architecture, design decisions, and trade-offs, see
[ARCHITECTURE.md](ARCHITECTURE.md).
```
src/
api.rs, api/ # Db, Tree<K,V>, WriteTxn, ApiError
codec.rs, codec/ # KeyCodec/ValueCodec traits, bincode codecs, kv (API codecs)
database.rs, database/ # Database, catalog, manifest (reader/writer), metadata, superblock
bplustree/ # BPlusTree core: search, insert, delete, commit, transaction
storage.rs, storage/ # PageStorage, NodeStorage, FilePageStorage, PagedNodeStorage,
# EpochManager, MetadataManager, page cache
page.rs, page/ # Slotted page layouts (leaf, internal)
keyfmt.rs, keyfmt/ # Key encoding formats (raw, prefix-compressed)
layout.rs # PAGE_SIZE constant
examples/
bytes_api.rs # Vec<u8> key/value CRUD
typed_api.rs # u64/String with batched transaction
concurrent_web_store.rs # Multi-threaded HTTP fetch + concurrent tree writes
example.rs # async HTTP fetch + store
benches/
bench_insert.rs # Criterion benchmarks
```
### Layer overview (bottom to top)
**Page layer** (`page/`, `layout.rs`): fixed 4 KB slotted pages. Header → slot
directory → packed data region. Leaf pages store `(key, value)` pairs; internal
pages store `(key, right_child)` with `leftmost_child` in the header.
**Storage layer** (`storage.rs`, `storage/`): `PageStorage` trait for raw page I/O;
`NodeStorage` trait for encoded node I/O (pluggable encoding strategy).
`FilePageStorage` is the concrete file-backed `PageStorage`.
`PagedNodeStorage<S>` wraps any `PageStorage` into a `NodeStorage` with an
in-memory read cache of decoded `NodeView`s. Cache correctness relies on COW
immutability — a page ID's content never changes while the page is live.
**Database layer** (`database.rs`, `database/`): `Database<S>` owns a
`PagedNodeStorage<S>` for node encoding and an `Arc<S>` for raw metadata I/O (both
share the same underlying storage instance). Manages the superblock, manifest, catalog,
and tree lifecycle.
**B+ tree core** (`bplustree/`): `BPlusTree` / `SharedBPlusTree` — search, insert,
delete, commit with CAS. `WriteTransaction` buffers operations for batched atomic
commits.
**API layer** (`api.rs`, `api/`): `Db` wraps a `Database` in `Arc` and hands out
typed `Tree<K, V>` handles. Storage is shared via `Arc`, so tree handles are
independently owned and can be freely sent across threads. Purely synchronous.
---
## Design trade-offs: COW, sibling pointers, and batched writes
### Why COW?
Copy-on-write is the foundation of the concurrency model. Every write clones only the
pages it touches (leaf + ancestors), then atomically publishes a new root via CAS on the
metadata pointer. Readers never block writers because they see a consistent snapshot
pinned at their epoch. This is the same approach used by LMDB, BoltDB, and redb in
production.
### Sibling pointers and range iteration
Traditional B+ trees link leaves with `next`/`prev` pointers for fast sequential scans.
Under COW this creates a cascade problem: COW-copying one leaf gives it a new page ID,
which invalidates its left sibling's `next` pointer, forcing a COW-copy of that sibling
too, and so on through the entire leaf chain.
The standard solution (used by LMDB, BoltDB, redb) is to not use sibling pointers at
all. Range iteration instead uses a **cursor** that maintains a stack of
`(node_id, index)` frames from root to leaf. When a leaf is exhausted, the cursor pops
up to the parent, advances the index, and descends back down. The cost is O(log n) per
leaf transition in the worst case, but in practice the tree height is 3-5 even for
millions of keys, and parent pages are hot in cache.
This cursor-based iterator is implemented in `BPlusTreeIter` and exposed through
`tree.range()` and `tree.range_from()`.
### Batched writes
The `WriteTransaction` buffers operations and replays them against the current root at
commit time. If the CAS fails (another writer committed first), it rebases from the new
root and retries. This is correct for the OCC model. Two potential improvements for
large batches:
- **Sort the batch by key** before replay, so leaf access is sequential and minimises
the number of distinct COW page copies.
- **Bulk-load path** for initial data ingestion: build subtrees bottom-up rather than
inserting through the tree one key at a time.
### Physical fullness and large values
The tree handles both logical overflow (`keys_len() > max_keys`) and physical overflow
(`PageFull` from the slotted page layer). Large values can fill a 4 KB page before
reaching the tree order, triggering page splits at the physical level. Entries are
validated upfront: `key_len + val_len` must not exceed `MAX_ENTRY_PAYLOAD` (2038 bytes),
guaranteeing that at least two entries always fit per page so splits produce valid halves.
### Where this design fits
- **Embedded databases** (the LMDB/redb/BoltDB niche) where the store is linked as a
library, not accessed over a network.
- **Read-heavy workloads** where readers must never block and always see consistent
snapshots.
- **Crash safety without a WAL**: COW gives atomic commits for free since old pages
survive until the new root is published.
- **Low-to-moderate write contention**: OCC retries are cheap when conflicts are rare.
### Where it struggles
- **Write-heavy workloads with high contention**: OCC retries discard and redo all
speculative work.
- **Large sequential bulk loads**: COW copies O(height) pages per insert; a bulk-load
path would amortise this.
- **Values larger than ~2 KB**: entries must fit within `MAX_ENTRY_PAYLOAD` (2038 bytes).
Overflow pages or external value storage are not yet implemented.
---
## Gotchas
- **Order-preserving keys:** if your codec doesn't preserve lexicographic order, scans will be wrong.
- **Commit conflicts:** normal under load. `WriteTxn` retries automatically up to a budget.
- **Entry size limit:** key + value must fit within 2038 bytes (`MAX_ENTRY_PAYLOAD`).
Entries exceeding this limit are rejected with `TreeError::EntryTooLarge`.
---
## Roadmap
- **Prefix-compressed key block format (`PrefixRestarts`)** — Keys are currently stored
verbatim in each slot. When keys share long common prefixes, this wastes significant
page space. Prefix compression stores the shared prefix once and only the differing
suffix per key, with periodic restart points for random access within the block. This
increases key density per page and reduces I/O for prefix-heavy workloads.
- **Bulk-load path for large initial imports** — Inserting N keys one-by-one through the
tree incurs O(height) COW copies per key. A bulk-load path sorts all keys upfront,
fills leaves left-to-right, and builds internal nodes bottom-up. Orders of magnitude
faster for initial data ingestion compared to incremental inserts.
- **Overflow pages for values exceeding `MAX_ENTRY_PAYLOAD`** — Currently key + value
must fit within 2038 bytes. Overflow pages would store large values across multiple
linked pages, removing this size constraint. This is standard in production B-trees
(SQLite, LMDB).
- **Fuzz testing (`cargo-fuzz`)** — Use coverage-guided fuzzing to generate random
sequences of inserts, deletes, splits, and merges, then verify tree invariants hold
after each operation. Catches edge cases in the slotted page layout and codec
encode/decode roundtrips that hand-written tests are unlikely to cover.
- **Configurable page size** — Currently hardcoded to 4 KB. Some workloads benefit from
larger pages (16 KB, 64 KB) for fewer tree levels and better sequential throughput;
smaller pages reduce write amplification under update-heavy workloads. Making this
configurable requires storing the page size in the superblock and threading it through
the page layer.
- **Sharded epoch pinning** — `EpochManager::pin()`/`unpin()` currently acquire a
central `Mutex<HashMap<ThreadId, Epoch>>` on every read operation. Under high reader
concurrency this serialises the pin/unpin brackets even though the tree walk itself is
lock-free. Replace with per-thread atomic slots (a `Vec<AtomicU64>` indexed by a
thread-claimed slot) so that pin is a single atomic store and `oldest_active()` is a
lock-free scan. This is the approach used by crossbeam-epoch and similar libraries.
---
## License
Dual-licensed under MIT or Apache-2.0. You may choose either license.
## Contact
Paris Mesidis — pmesidis@gmail.com