holt
A carefully crafted adaptive radix tree for path-shaped metadata.
⚠️ Pre-1.0 (v0.5.3 released). The public API is now narrow and SemVer-stable inside a minor release, but minor releases may still break source compatibility before 1.0. Pin the exact published version in your
Cargo.toml(holt = "=0.5.3") until 1.0 stabilises the surface.
holt is an embedded Rust library for storing hierarchical
keys — file paths, S3 object names, multi-tenant namespaces,
time-bucketed identifiers — with sub-microsecond lookups, per-blob
concurrency, and crash-safe persistence.
It targets workloads where:
- Keys are hierarchical / path-shaped (so prefix compression pays).
- The dominant access is point lookup + prefix range scan.
- Concurrency is high (many readers + writers across disjoint subtrees).
- Latency is micro-critical — no LSM compaction stalls, no single-writer locks.
It is not a general-purpose KV store; if you need full-text
or vector similarity, reach for the right tool. For this shape,
holt's strongest wins are metadata-native operations: delimiter
directory rollup and mixed metadata workloads. In the v0.3 Linux
release run, objstore_list_dir is 151× faster than RocksDB
and fs_list_dir is 268× faster; objstore_metadata_mix is
43× faster than RocksDB and fs_metadata_mix is 66×
faster. Point reads stay ahead through 2 M keys; point writes are
competitive but intentionally not the headline claim — see
benches/RESULTS.md.
Why "holt"?
A holt (Old English holt) is a small grove or copse — a
self-contained collection of trees on a single piece of ground.
That maps directly to the design: each holt::Tree is one ART
made of many 512 KB blob frames, bounded and self-contained,
grown by repeated splitBlob. Short to type, distinct from other
crates, easy to say.
When to reach for holt
| Engine | Data structure | Persistence | Concurrency | Notes |
|---|---|---|---|---|
| LMDB | B+tree | mmap | Single-writer MVCC | Battle-tested; page chasing for short hot keys. |
| RocksDB | LSM | SST + WAL | MVCC | Compaction stalls; large hot dataset is RAM-heavy. |
| SQLite | B-tree | File | Single writer | Convenient, but writer is the bottleneck under load. |
| Sled | Hybrid LSM | Log-structured | Lock-free | Rust-native, largely unmaintained. |
| holt | Adaptive Radix Tree | 512 KB blobs | Per-blob 3-mode latch | Path compression + lookup is O(key.len) |
ART's lookup cost is O(key.len), not O(log N). For short hot keys
(< 64 bytes), that beats any tree-based competitor. The per-blob
HybridLatch lets N readers traverse disjoint subtrees in parallel
without coordinating.
Project status
Pre-1.0, actively maintained. The metadata-engine core is landed:
put / get / delete / rename / range / atomic / compact, multi-blob
crossings, online maintenance gates, persistent store with O_DIRECT
and optional Linux io_uring, logical WAL with group commit, sharded
buffer manager, 3-thread background checkpointer, SIMD CRC32 + node
scans, copy-on-write snapshots, and stateful Tree::range with prefix,
start_after, and S3 delimiter rollup.
0.5 adds a two-axis durability model (Durability::Wal /
Durability::StateMachine, orthogonal to storage) and the pieces a
replicated metadata service needs: crash-consistent on-disk recovery
with no WAL when an external log owns durability
(DB::commit_durable / durable_applied_index), whole-DB checkpoint
export/install, and the metadata-shaped fast paths DB::scatter,
Tree::put_many_if_absent, and per-scan ScanStats. See the
Durability section below.
See CHANGELOG.md for the release notes and
ROADMAP.md for direction.
cargo bench --manifest-path benches/Cargo.toml --bench main
runs a side-by-side comparison with RocksDB, SQLite, and sled
across three metadata workload shapes — see
benches/README.md for the methodology and
headline numbers.
Usage
Add holt to your Cargo.toml:
[]
= "0.5"
The supported user surface is deliberately small:
DB, DBAtomicBatch, DBView, Scatter, TreeBuilder, Tree, TreeConfig,
Storage, Durability, RangeBuilder, RangeEntry, RangeIter,
KeyRangeBuilder, KeyRangeEntry, KeyRangeIter, ScanStats, Snapshot,
View, AtomicBatch, Record, RecordVersion, PutOutcome,
KeyPathBuf, KeyPrefixBuf, KeyPathError, CheckpointConfig,
CheckpointImage, TreeStats / DBStats / related stats structs,
Error / Result, and the custom-store surface (BlobStore,
MemoryBlobStore, FileBlobStore, AlignedBlobBuf, BlobGuid,
DurableManifest). Internal layout, WAL, walker, and buffer-manager
modules are not public API.
Open a tree
Two storage modes; same TreeBuilder, one knob switches between them:
use ;
// File-backed production mode, Unix-only — Linux `O_DIRECT`,
// macOS `F_NOCACHE`. The directory is created if missing.
let tree = new
.buffer_pool_size // optional: 512 blobs = 256 MiB
.durability // async group-commit WAL (default)
.open?;
// In-memory — volatile, dies with the last handle. The path
// argument becomes informational once `.memory()` flips the
// mode. Good for tests, sidecar caches, ephemeral session stores.
let tree = new.memory.open?;
File-backed trees default to 256 resident 512 KB blobs (128 MiB). In-memory trees keep the smaller 64-blob default (32 MiB).
Path-shaped keys
The core API takes byte keys. For object-store and filesystem-like
metadata, KeyPathBuf builds canonical slash-separated keys without
hand-written string formatting. It is optional: callers with opaque
keys can keep passing raw bytes.
use KeyPathBuf;
let mut key = with_namespace?;
key.push?;
key.push?;
key.push?;
key.push?;
tree.put?;
let mut prefix = with_namespace?;
prefix.push?;
prefix.push?;
let prefix = prefix.into_prefix;
for entry in tree.scan_keys.delimiter
# Ok::
Single-key CRUD
Bytes in, bytes out. put and delete are the hot paths: they write
or tombstone without reading the existing leaf. rename is atomic and
errors if dst exists unless force = true.
tree.put?;
let value: = tree.get?;
assert_eq!;
let existed: bool = tree.delete?;
assert!;
tree.put?;
tree.rename?;
Conditional writes
RecordVersion is Holt's lightweight compare-and-set token. It is
not an MVCC timestamp and does not provide historical snapshot
reads; it only says "this live record still has the same leaf seq".
tree.put_if_absent?;
let record = tree.get_record?.unwrap;
let updated: bool = tree.compare_and_put?;
assert!;
let current = tree.get_version?.unwrap;
let deleted: bool = tree.delete_if_version?;
assert!;
// Point-in-time read helper for rmdir-style metadata checks.
assert!;
Atomic batch
Multiple ops under one WAL record. Holt preflights rename and
conditional-write / prefix-emptiness guards before mutating,
applies the batch behind the tree-wide mutation gate, then emits
one Batch WAL record.
Ok(true) means committed, Ok(false) means a conditional guard
failed and nothing was published, and Err reports hard failures
such as a missing rename source or destination collision.
let template = tree.get_record?.unwrap;
let committed = tree.atomic?;
assert!;
Range scans and S3 delimiter rollup
Tree::range yields full records: key, value, and
RecordVersion. Use it when the list response needs metadata
bytes. Tree::range_keys / Tree::scan_keys use the same cursor
and delimiter semantics but skip value materialisation; use them
for name-only directory and object listings. Chain .prefix() to
anchor the scan, .start_after() for paging, and .delimiter()
for S3-style ?delimiter=/ rollup.
use ;
// Full prefix scan — `take(50)` for a paged "first 50" view
// when the caller needs metadata bytes.
let first_50: = tree
.range
.prefix
.into_iter
.take
.map
.?;
// S3-style "list one level" without copying values — leaves under
// `/img/` get emitted as Key; deeper paths roll up to a single
// `CommonPrefix` per distinct subdir.
for entry in tree.scan_keys.delimiter
Snapshot reads
Tree::range and Tree::range_keys are the hot restart-on-conflict
iterators. Use Tree::view(prefix, |view| { ... }) when multiple
reads must observe one stable prefix snapshot. A view copies the
reachable blob frames for prefix, releases the live tree, then
runs point reads and scans against that private frame set.
tree.view?;
Durability
Durability is a policy orthogonal to storage — pick who owns the durable
record with TreeConfig::durability / TreeBuilder::durability:
Durability::Wal { sync } — holt owns durability (single node). Each write
updates the BufferManager cache and appends one logical WAL record. sync: false
(the default) returns after the group-commit worker queues the record; sync: true
waits for sync_data, and concurrent writers share one fsync. Disk-truth advances at:
- Background checkpoint — enabled by default for file-backed trees. It flushes the WAL, writes dirty blobs through to the store, syncs, applies pending deletes, and truncates the WAL once the pipeline is clean.
Tree::checkpoint()— the same protocol, run synchronously.- WAL auto-flush — drains the pending buffer to the OS page cache past 64 KB, bounding in-memory buffering even when checkpoints are rare.
tree.checkpoint?; // flush WAL + write through + truncate
Durability::StateMachine — an external log owns durability (replicated).
For a replicated state machine (e.g. a Raft apply loop): holt attaches no WAL
— the external log is the durable, replicated record. Writes apply to the
BufferManager; you periodically pin a crash-consistent on-disk checkpoint with
commit_durable, and on restart replay only the log tail past the recovered index:
use ;
let mut cfg = new;
cfg.durability = StateMachine;
let db = DBopen?;
// ... apply committed log entries to the state machine, then pin a checkpoint:
db.commit_durable?; // atomic, crash-consistent on-disk point
// After a crash, reopen and replay only the tail past the recovered index:
let db = DBopen?;
let replay_from = db.durable_applied_index?; // re-apply entries > replay_from
commit_durable takes a copy-on-write snapshot, persists its frame closure, and
atomically commits a manifest recording the durable roots plus applied_index
and the resume sequence — the manifest rename is the commit point, so a crash
lands on the previous or the new checkpoint, never a torn state (fault-injection
and SIGKILL-soak tested). Two StateMachine-only fast paths exploit the external
log already owning write order: DB::scatter (independent cross-family CAS with
no atomic barrier, each on its own per-key concurrent path) and atomic batches
that take the mutation gate shared instead of fencing concurrent range scans.
Validation and observability
Holt has four validation layers:
- Unit and integration tests cover WAL replay, checkpoint recovery, range/view semantics, conditional atomic batches, and checkpoint failpoints.
- Property tests compare random operation streams against an oracle.
fuzz/contains acargo-fuzztarget for the atomic/WAL/range model.verified/contains Verus specs for ART node shape, grow/shrink, prefix split, delimiter rollup bounds, virtual terminators, and leaf alignment.
Long lifecycle campaigns live in the explicit soak tool:
With the metrics feature, holt::metrics::render_prometheus exposes
cache hit/miss, eviction/admission, route-cache, WAL work/debt,
checkpoint debt, dirty/pending-delete counts, and reopen WAL replay
time. The caller owns the HTTP endpoint; Holt only renders the
Prometheus text payload.
The GitHub workflows are split by cost:
- normal CI runs unit/integration/property tests, a bounded fuzz smoke, coverage, and a short soak smoke;
- nightly validation runs checkpoint/WAL fault tests, longer normal and crash soak campaigns, and a time-bounded fuzz campaign;
- Verus is available as a manual
Nightly Validationoption because hosted runners do not ship a Verus binary.
See TESTING.md for the full test matrix and release-gate
commands.
See examples/ for full programs:
basic_kv,
filesystem_meta,
session_store,
s3_metadata.
Architecture
See ARCHITECTURE.md for the deep dive.
The design draws on Leis et al.'s ART paper (ICDE 2013) for the
four-node-size scheme and LeanStore (ICDE 2018) for the HybridLatch
contract.
Not on the roadmap
holt is just the metadata engine — single-node, embed-in- your-process, Unix-only. Out of scope:
- Windows —
compile_error!s the crate (UnixO_DIRECT/F_NOCACHEhas no Windows analog worth carrying). - Object-storage frontend / S3 layer — no RPC server, no multi-tenant bucket registry, no distributed checkpointer.
- SQL / vector / full-text — combine with a domain-appropriate
engine (
+ FAISSfor vectors,+ Tantivyfor full-text). - Replication / consensus — build above; we'll expose hooks (change feed, snapshot transfer) but won't ship Raft.
- Network server — this is a library; wrap it in your own RPC.
License
Licensed under the MIT licence.