holt 0.3.2

An adaptive-radix-tree metadata storage engine for path-shaped keys, with per-blob concurrency and crash-safe persistence.
Documentation

holt

Crates.io Docs.rs CI MSRV License: MIT

A carefully crafted adaptive radix tree for path-shaped metadata.

⚠️ Pre-1.0 (v0.3.2 released). The public API is now narrow and SemVer-stable inside a minor release, but minor releases may still break source compatibility before 1.0. Pin the exact published version in your Cargo.toml (holt = "=0.3.2") until 1.0 stabilises the surface.

holt is an embedded Rust library for storing hierarchical keys — file paths, S3 object names, multi-tenant namespaces, time-bucketed identifiers — with sub-microsecond lookups, per-blob concurrency, and crash-safe persistence.

It targets workloads where:

  • Keys are hierarchical / path-shaped (so prefix compression pays).
  • The dominant access is point lookup + prefix range scan.
  • Concurrency is high (many readers + writers across disjoint subtrees).
  • Latency is micro-critical — no LSM compaction stalls, no single-writer locks.

It is not a general-purpose KV store; if you need full-text or vector similarity, reach for the right tool. For this shape, holt's strongest wins are metadata-native operations: delimiter directory rollup and mixed metadata workloads. In the v0.3 Linux release run, objstore_list_dir is 151× faster than RocksDB and fs_list_dir is 268× faster; objstore_metadata_mix is 43× faster than RocksDB and fs_metadata_mix is 66× faster. Point reads stay ahead through 2 M keys; point writes are competitive but intentionally not the headline claim — see benches/RESULTS.md.

Why "holt"?

A holt (Old English holt) is a small grove or copse — a self-contained collection of trees on a single piece of ground. That maps directly to the design: each holt::Tree is one ART made of many 512 KB blob frames, bounded and self-contained, grown by repeated splitBlob. Short to type, distinct from other crates, easy to say.

When to reach for holt

Engine Data structure Persistence Concurrency Notes
LMDB B+tree mmap Single-writer MVCC Battle-tested; page chasing for short hot keys.
RocksDB LSM SST + WAL MVCC Compaction stalls; large hot dataset is RAM-heavy.
SQLite B-tree File Single writer Convenient, but writer is the bottleneck under load.
Sled Hybrid LSM Log-structured Lock-free Rust-native, largely unmaintained.
holt Adaptive Radix Tree 512 KB blobs Per-blob 3-mode latch Path compression + lookup is O(key.len)

ART's lookup cost is O(key.len), not O(log N). For short hot keys (< 64 bytes), that beats any tree-based competitor. The per-blob HybridLatch lets N readers traverse disjoint subtrees in parallel without coordinating.

Project status

Pre-1.0, actively maintained. The v0.3 metadata-engine core is landed: put / get / delete / rename / range / atomic / compact, multi-blob crossings, online maintenance gates, persistent store with O_DIRECT and optional Linux io_uring, logical WAL with group commit, sharded buffer manager, 3-thread background checkpointer, SIMD CRC32 + node scans, and stateful Tree::range with prefix, start_after, and S3 delimiter rollup.

See CHANGELOG.md for the v0.3 release notes and ROADMAP.md for post-0.3 direction.

cargo bench --manifest-path benches/Cargo.toml --bench main runs a side-by-side comparison with RocksDB, SQLite, and sled across three metadata workload shapes — see benches/README.md for the methodology and headline numbers.

Usage

Add holt to your Cargo.toml:

[dependencies]
holt = "=0.3.2"

The supported user surface is deliberately small: TreeBuilder, Tree, TreeConfig, Storage, RangeBuilder, RangeEntry, RangeIter, KeyRangeBuilder, KeyRangeEntry, KeyRangeIter, AtomicBatch, Record, RecordVersion, CheckpointConfig, TreeStats / related stats structs, Error / Result, and the custom-store surface (BlobStore, MemoryBlobStore, FileBlobStore, AlignedBlobBuf, BlobGuid). Internal layout, WAL, walker, and buffer-manager modules are not public API.

Open a tree

Two storage modes; same TreeBuilder, one knob switches between them:

use holt::{TreeBuilder, WalCommit};

// File-backed production mode, Unix-only — Linux `O_DIRECT`,
// macOS `F_NOCACHE`. The directory is created if missing.
let tree = TreeBuilder::new("/var/lib/myapp/meta.holt")
    .buffer_pool_size(128)        // pinned 512 KB blobs (default 64)
    .wal_commit(WalCommit::Enqueue) // default async journal acknowledgement
    .open()?;

// In-memory — volatile, dies with the last handle. The path
// argument becomes informational once `.memory()` flips the
// mode. Good for tests, sidecar caches, ephemeral session stores.
let tree = TreeBuilder::new("scratch").memory().open()?;

Single-key CRUD

Bytes in, bytes out. put and delete are the hot paths: they write or tombstone without reading the existing leaf. rename is atomic and errors if dst exists unless force = true.

tree.put(b"img/01.jpg", b"rgb_data_blob_id_abc")?;

let value: Option<Vec<u8>> = tree.get(b"img/01.jpg")?;
assert_eq!(value.as_deref(), Some(&b"rgb_data_blob_id_abc"[..]));

let existed: bool = tree.delete(b"img/01.jpg")?;
assert!(existed);

tree.put(b"old/path", b"v")?;
tree.rename(b"old/path", b"new/path", /*force=*/ false)?;

Conditional writes

RecordVersion is Holt's lightweight compare-and-set token. It is not an MVCC timestamp and does not provide historical snapshot reads; it only says "this live record still has the same leaf seq".

tree.put_if_absent(b"users/alice", b"{...}")?;
let record = tree.get_record(b"users/alice")?.unwrap();

let updated: bool = tree.compare_and_put(
    b"users/alice",
    record.version,
    b"{\"tier\":\"hot\"}",
)?;
assert!(updated);

let current = tree.get_version(b"users/alice")?.unwrap();
let deleted: bool = tree.delete_if_version(b"users/alice", current)?;
assert!(deleted);

// Point-in-time read helper for rmdir-style metadata checks.
assert!(tree.is_prefix_empty(b"users/alice/")?);

Atomic batch

Multiple ops under one WAL record. Holt preflights rename and conditional-write / prefix-emptiness guards before mutating, applies the batch behind the tree-wide mutation gate, then emits one Batch WAL record. Ok(true) means committed, Ok(false) means a conditional guard failed and nothing was published, and Err reports hard failures such as a missing rename source or destination collision.

let template = tree.get_record(b"users/template")?.unwrap();
let committed = tree.atomic(|batch| {
    batch.assert_version(b"users/template", template.version);
    batch.put(b"users/alice", b"{...}");
    batch.put(b"users/bob",   b"{...}");
    batch.put_if_absent(b"users/new", b"{...}");
    batch.delete(b"users/legacy-account");
    batch.assert_prefix_empty(b"users/temp/");
    batch.rename(b"users/temp", b"users/permanent", true);
})?;
assert!(committed);

Range scans and S3 delimiter rollup

Tree::range yields full records: key, value, and RecordVersion. Use it when the list response needs metadata bytes. Tree::range_keys / Tree::scan_keys use the same cursor and delimiter semantics but skip value materialisation; use them for name-only directory and object listings. Chain .prefix() to anchor the scan, .start_after() for paging, and .delimiter() for S3-style ?delimiter=/ rollup.

use holt::{KeyRangeEntry, RangeEntry};

// Full prefix scan — `take(50)` for a paged "first 50" view
// when the caller needs metadata bytes.
let first_50: Vec<_> = tree
    .range()
    .prefix(b"users/")
    .into_iter()
    .take(50)
    .map(|r| r.map(|e| match e {
        RangeEntry::Key { key, value, version } => (key, value, version),
        _ => unreachable!("no delimiter set"),
    }))
    .collect::<Result<_, _>>()?;

// S3-style "list one level" without copying values — leaves under
// `/img/` get emitted as Key; deeper paths roll up to a single
// `CommonPrefix` per distinct subdir.
for entry in tree.scan_keys(b"img/").delimiter(b'/') {
    match entry? {
        KeyRangeEntry::Key { key, .. }      => println!("file {key:?}"),
        KeyRangeEntry::CommonPrefix(prefix) => println!("dir  {prefix:?}/"),
        _ => {} // `KeyRangeEntry` is `#[non_exhaustive]`
    }
}

Snapshot reads

Tree::range and Tree::range_keys are the hot restart-on-conflict iterators. Use Tree::view(prefix, |view| { ... }) when multiple reads must observe one stable prefix snapshot. A view copies the reachable blob frames for prefix, releases the live tree, then runs point reads and scans against that private frame set.

tree.view(b"img/", |view| {
    let meta = view.get_record(b"img/01.jpg")?;
    let first_page: Vec<_> = view
        .range_keys()
        .delimiter(b'/')
        .into_iter()
        .take(100)
        .collect::<Result<_, _>>()?;
    Ok(())
})?;

Durability

Per-op writes land in the journal worker + BufferManager cache. The WAL acknowledgement boundary is explicit:

  • WalCommit::Enqueue — default. Return after the journal worker queue accepts the encoded record.
  • WalCommit::Write — return after WAL bytes reach the OS page cache, with no per-op sync_data. This is the benchmark profile matching RocksDB WAL on, sync=false.
  • WalCommit::Sync — return after sync_data; concurrent writers can share one fsync through group commit.

Disk-truth advances at:

  • Tree::checkpoint() — flush the journal (sync_data), write dirty blobs through to the store, fdatasync the store, truncate the WAL. Call this at your own application checkpoint cadence.
  • WAL auto-flush — once the WAL writer's pending buffer crosses 64 KB it drains to the OS page cache (no sync_data). Bounds in-memory buffering even if checkpoint is rare.
tree.checkpoint()?;   // flush WAL + write through + truncate

See examples/ for full programs: basic_kv, filesystem_meta, session_store, s3_metadata.

Architecture

See ARCHITECTURE.md for the deep dive. The design draws on Leis et al.'s ART paper (ICDE 2013) for the four-node-size scheme and LeanStore (ICDE 2018) for the HybridLatch contract.

Not on the roadmap

holt is just the metadata engine — single-node, embed-in- your-process, Unix-only. Out of scope:

  • Windowscompile_error!s the crate (Unix O_DIRECT / F_NOCACHE has no Windows analog worth carrying).
  • Object-storage frontend / S3 layer — no RPC server, no multi-tenant bucket registry, no distributed checkpointer.
  • SQL / vector / full-text — combine with a domain-appropriate engine (+ FAISS for vectors, + Tantivy for full-text).
  • Replication / consensus — build above; we'll expose hooks (change feed, snapshot transfer) but won't ship Raft.
  • Network server — this is a library; wrap it in your own RPC.

License

Licensed under the MIT licence.