holt 0.3.3

An adaptive-radix-tree metadata storage engine for path-shaped keys, with per-blob concurrency and crash-safe persistence.
Documentation
# holt

[![Crates.io](https://img.shields.io/crates/v/holt.svg)](https://crates.io/crates/holt)
[![Docs.rs](https://docs.rs/holt/badge.svg)](https://docs.rs/holt)
[![CI](https://github.com/feichai0017/holt/actions/workflows/ci.yml/badge.svg)](https://github.com/feichai0017/holt/actions/workflows/ci.yml)
[![MSRV](https://img.shields.io/badge/MSRV-1.82-blue.svg)](https://github.com/feichai0017/holt/blob/main/Cargo.toml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

> A carefully crafted **adaptive radix tree** for path-shaped metadata.

> ⚠️ **Pre-1.0 (v0.3.3 released).** The public API is now narrow and
> SemVer-stable inside a minor release, but minor releases may
> still break source compatibility before 1.0. Pin the exact
> published version in your `Cargo.toml` (`holt = "=0.3.3"`) until 1.0
> stabilises the surface.

`holt` is an embedded Rust library for storing **hierarchical
keys** — file paths, S3 object names, multi-tenant namespaces,
time-bucketed identifiers — with sub-microsecond lookups, per-blob
concurrency, and crash-safe persistence.

It targets workloads where:

- Keys are **hierarchical / path-shaped** (so prefix compression pays).
- The dominant access is **point lookup + prefix range scan**.
- Concurrency is **high** (many readers + writers across disjoint
  subtrees).
- Latency is **micro-critical** — no LSM compaction stalls, no
  single-writer locks.

It is **not** a general-purpose KV store; if you need full-text
or vector similarity, reach for the right tool. For this shape,
holt's strongest wins are metadata-native operations: delimiter
directory rollup and mixed metadata workloads. In the v0.3 Linux
release run, `objstore_list_dir` is **151×** faster than RocksDB
and `fs_list_dir` is **268×** faster; `objstore_metadata_mix` is
**43×** faster than RocksDB and `fs_metadata_mix` is **66×**
faster. Point reads stay ahead through 2 M keys; point writes are
competitive but intentionally not the headline claim — see
[`benches/RESULTS.md`](benches/RESULTS.md).

## Why "holt"?

A **holt** (Old English *holt*) is a small grove or copse — a
self-contained collection of trees on a single piece of ground.
That maps directly to the design: each `holt::Tree` is **one** ART
made of **many** 512 KB blob frames, bounded and self-contained,
grown by repeated `splitBlob`. Short to type, distinct from other
crates, easy to say.

## When to reach for holt

| Engine        | Data structure        | Persistence       | Concurrency        | Notes                                                |
|---------------|-----------------------|-------------------|--------------------|------------------------------------------------------|
| LMDB          | B+tree                | mmap              | Single-writer MVCC | Battle-tested; page chasing for short hot keys.      |
| RocksDB       | LSM                   | SST + WAL         | MVCC               | Compaction stalls; large hot dataset is RAM-heavy.   |
| SQLite        | B-tree                | File              | Single writer      | Convenient, but writer is the bottleneck under load. |
| Sled          | Hybrid LSM            | Log-structured    | Lock-free          | Rust-native, largely unmaintained.                   |
| **holt**      | **Adaptive Radix Tree** | **512 KB blobs** | **Per-blob 3-mode latch** | **Path compression + lookup is O(key.len)** |

ART's lookup cost is `O(key.len)`, not `O(log N)`. For short hot keys
(< 64 bytes), that beats any tree-based competitor. The per-blob
HybridLatch lets N readers traverse disjoint subtrees in parallel
without coordinating.

## Project status

**Pre-1.0, actively maintained.** The v0.3 metadata-engine core is
landed: put / get / delete / rename / range / atomic / compact,
multi-blob crossings, online maintenance gates, persistent store
with `O_DIRECT` and optional Linux `io_uring`, logical WAL
with group commit, sharded buffer manager, 3-thread background
checkpointer, SIMD CRC32 + node scans, and stateful `Tree::range`
with prefix, `start_after`, and S3 delimiter rollup.

See [`CHANGELOG.md`](CHANGELOG.md) for the v0.3 release notes and
[`ROADMAP.md`](ROADMAP.md) for post-0.3 direction.

`cargo bench --manifest-path benches/Cargo.toml --bench main`
runs a side-by-side comparison with RocksDB, SQLite, and sled
across three metadata workload shapes — see
[`benches/README.md`](benches/README.md) for the methodology and
headline numbers.

## Usage

Add holt to your `Cargo.toml`:

```toml
[dependencies]
holt = "=0.3.3"
```

The supported user surface is deliberately small:
`TreeBuilder`, `Tree`, `TreeConfig`, `Storage`, `RangeBuilder`,
`RangeEntry`, `RangeIter`, `KeyRangeBuilder`, `KeyRangeEntry`,
`KeyRangeIter`, `AtomicBatch`, `Record`, `RecordVersion`,
`CheckpointConfig`,
`TreeStats` / related stats structs, `Error` / `Result`, and the
custom-store surface (`BlobStore`, `MemoryBlobStore`,
`FileBlobStore`, `AlignedBlobBuf`, `BlobGuid`). Internal
layout, WAL, walker, and buffer-manager modules are not public API.

### Open a tree

Two storage modes; same `TreeBuilder`, one knob switches between them:

```rust
use holt::TreeBuilder;

// File-backed production mode, Unix-only — Linux `O_DIRECT`,
// macOS `F_NOCACHE`. The directory is created if missing.
let tree = TreeBuilder::new("/var/lib/myapp/meta.holt")
    .buffer_pool_size(128)        // pinned 512 KB blobs (default 64)
    .wal_sync(false)              // default async journal acknowledgement
    .open()?;

// In-memory — volatile, dies with the last handle. The path
// argument becomes informational once `.memory()` flips the
// mode. Good for tests, sidecar caches, ephemeral session stores.
let tree = TreeBuilder::new("scratch").memory().open()?;
```

### Single-key CRUD

Bytes in, bytes out. `put` and `delete` are the hot paths: they write
or tombstone without reading the existing leaf. `rename` is atomic and
errors if `dst` exists unless `force = true`.

```rust
tree.put(b"img/01.jpg", b"rgb_data_blob_id_abc")?;

let value: Option<Vec<u8>> = tree.get(b"img/01.jpg")?;
assert_eq!(value.as_deref(), Some(&b"rgb_data_blob_id_abc"[..]));

let existed: bool = tree.delete(b"img/01.jpg")?;
assert!(existed);

tree.put(b"old/path", b"v")?;
tree.rename(b"old/path", b"new/path", /*force=*/ false)?;
```

### Conditional writes

`RecordVersion` is Holt's lightweight compare-and-set token. It is
not an MVCC timestamp and does not provide historical snapshot
reads; it only says "this live record still has the same leaf seq".

```rust
tree.put_if_absent(b"users/alice", b"{...}")?;
let record = tree.get_record(b"users/alice")?.unwrap();

let updated: bool = tree.compare_and_put(
    b"users/alice",
    record.version,
    b"{\"tier\":\"hot\"}",
)?;
assert!(updated);

let current = tree.get_version(b"users/alice")?.unwrap();
let deleted: bool = tree.delete_if_version(b"users/alice", current)?;
assert!(deleted);

// Point-in-time read helper for rmdir-style metadata checks.
assert!(tree.is_prefix_empty(b"users/alice/")?);
```

### Atomic batch

Multiple ops under one WAL record. Holt preflights rename and
conditional-write / prefix-emptiness guards before mutating,
applies the batch behind the tree-wide mutation gate, then emits
one Batch WAL record.
`Ok(true)` means committed, `Ok(false)` means a conditional guard
failed and nothing was published, and `Err` reports hard failures
such as a missing rename source or destination collision.

```rust
let template = tree.get_record(b"users/template")?.unwrap();
let committed = tree.atomic(|batch| {
    batch.assert_version(b"users/template", template.version);
    batch.put(b"users/alice", b"{...}");
    batch.put(b"users/bob",   b"{...}");
    batch.put_if_absent(b"users/new", b"{...}");
    batch.delete(b"users/legacy-account");
    batch.assert_prefix_empty(b"users/temp/");
    batch.rename(b"users/temp", b"users/permanent", true);
})?;
assert!(committed);
```

### Range scans and S3 delimiter rollup

`Tree::range` yields full records: key, value, and
`RecordVersion`. Use it when the list response needs metadata
bytes. `Tree::range_keys` / `Tree::scan_keys` use the same cursor
and delimiter semantics but skip value materialisation; use them
for name-only directory and object listings. Chain `.prefix()` to
anchor the scan, `.start_after()` for paging, and `.delimiter()`
for S3-style `?delimiter=/` rollup.

```rust
use holt::{KeyRangeEntry, RangeEntry};

// Full prefix scan — `take(50)` for a paged "first 50" view
// when the caller needs metadata bytes.
let first_50: Vec<_> = tree
    .range()
    .prefix(b"users/")
    .into_iter()
    .take(50)
    .map(|r| r.map(|e| match e {
        RangeEntry::Key { key, value, version } => (key, value, version),
        _ => unreachable!("no delimiter set"),
    }))
    .collect::<Result<_, _>>()?;

// S3-style "list one level" without copying values — leaves under
// `/img/` get emitted as Key; deeper paths roll up to a single
// `CommonPrefix` per distinct subdir.
for entry in tree.scan_keys(b"img/").delimiter(b'/') {
    match entry? {
        KeyRangeEntry::Key { key, .. }      => println!("file {key:?}"),
        KeyRangeEntry::CommonPrefix(prefix) => println!("dir  {prefix:?}/"),
        _ => {} // `KeyRangeEntry` is `#[non_exhaustive]`
    }
}
```

### Snapshot reads

`Tree::range` and `Tree::range_keys` are the hot restart-on-conflict
iterators. Use `Tree::view(prefix, |view| { ... })` when multiple
reads must observe one stable prefix snapshot. A view copies the
reachable blob frames for `prefix`, releases the live tree, then
runs point reads and scans against that private frame set.

```rust
tree.view(b"img/", |view| {
    let meta = view.get_record(b"img/01.jpg")?;
    let first_page: Vec<_> = view
        .range_keys()
        .delimiter(b'/')
        .into_iter()
        .take(100)
        .collect::<Result<_, _>>()?;
    Ok(())
})?;
```

### Durability

Per-op writes update the BufferManager cache and append one WAL
record. By default (`wal_sync = false`), foreground writes return
after the journal worker queue accepts the encoded record. Set
`wal_sync = true` to wait for `sync_data`; concurrent writers can
share one fsync through group commit.

Disk-truth advances at:

- **Background checkpoint** — enabled by default for file-backed
  trees. It flushes the journal, writes dirty blobs through to the
  store, syncs the store, applies pending deletes, and truncates the
  WAL once the checkpoint pipeline is clean.
- **`Tree::checkpoint()`** — the same protocol, run synchronously
  when callers explicitly want a checkpoint boundary.
- **WAL auto-flush** — once the WAL writer's pending buffer
  crosses 64 KB it drains to the OS page cache (no `sync_data`).
  Bounds in-memory buffering even if `checkpoint` is rare.

```rust
tree.checkpoint()?;   // flush WAL + write through + truncate
```

See [`examples/`](examples/) for full programs:
[`basic_kv`](examples/basic_kv.rs),
[`filesystem_meta`](examples/filesystem_meta.rs),
[`session_store`](examples/session_store.rs),
[`s3_metadata`](examples/s3_metadata.rs).

## Architecture

See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the deep dive.
The design draws on Leis et al.'s ART paper (ICDE 2013) for the
four-node-size scheme and LeanStore (ICDE 2018) for the HybridLatch
contract.

## Not on the roadmap

holt is **just the metadata engine** — single-node, embed-in-
your-process, Unix-only. Out of scope:

- **Windows**`compile_error!`s the crate (Unix `O_DIRECT` /
  `F_NOCACHE` has no Windows analog worth carrying).
- **Object-storage frontend / S3 layer** — no RPC server, no
  multi-tenant bucket registry, no distributed checkpointer.
- **SQL / vector / full-text** — combine with a domain-appropriate
  engine (`+ FAISS` for vectors, `+ Tantivy` for full-text).
- **Replication / consensus** — build above; we'll expose hooks
  (change feed, snapshot transfer) but won't ship Raft.
- **Network server** — this is a library; wrap it in your own RPC.

## License

Licensed under the [MIT licence](LICENSE).