holt 0.3.2 - Docs.rs

# holt — roadmap

## Where things stand

holt is a single-node, Unix-only ART-over-blobs metadata engine.
The algorithm core, persistent backend, WAL + replay, sharded
buffer manager, 3-thread background checkpointer, and the curated
public API are all in place. The release test surface is unit,
property, crash/replay, failpoint, and integration coverage; the
public benchmark surface is the RocksDB/SQLite metadata comparator
in `benches/main.rs`.

See [ARCHITECTURE.md](ARCHITECTURE.md) for the design and
[CHANGELOG.md](CHANGELOG.md) for what changed when. The fine-
grained shipping log lives in `git log`; this file tracks
milestones, not individual features.

## v0.1 — Usable embedded library (shipped, 2026-05-19)

Goal: build the engine end-to-end so a path-shaped-metadata
workload can use it on a single node.

Delivered: 9-NodeType ART layout pinned at compile time, recursive
walker (insert / lookup / erase / rename), cross-blob `splitBlob`
/ `mergeBlob` / `compactBlob`, `FileBlobStore` (`O_DIRECT`
Linux + `F_NOCACHE` macOS), logical WAL with replay,
`Tree::range` iterator with prefix + start-after + S3-style
delimiter rollup, `Tree::atomic` atomic batches under one WAL
record, four examples (`basic_kv` / `filesystem_meta` /
`session_store` / `s3_metadata`), property-based tests against a
`HashMap` oracle, criterion benches vs RocksDB + SQLite.

## v0.2 — Performance + concurrency upgrades (shipped)

Goal: scope the metadata-engine core for production-shaped
workloads — no new public API surface, the bench numbers from v0.1
are the success criteria.

Delivered: `io_uring` persistent-backend fast path (Linux,
feature-gated), `crc32fast` SIMD CRC32, sharded `BufferManager`
(`DashMap` replacing the v0.1 `Mutex<HashMap>` + `VecDeque` LRU),
cached `Tree.root_pin`, range-iter fast-forward in delimiter
mode, 3-thread background checkpointer (planner + I/O worker +
eviction) under a W2D-strict protocol, adaptive tick-based
eviction, observability (`Tree::stats`, structured `tracing`,
Prometheus text-format renderer behind the `metrics` feature,
silent-pin reads so scrapes don't pollute cache counters),
diagnostics (`Tree::scan`, range-iter tombstone fix,
structured `Error::NodeCorrupt`), PGO docs, `cargo deny check`
in CI, scale-curve + p95/p99 contention benches.

Concurrency primitives: per-blob `HybridLatch` (LeanStore 3-mode,
wait-free optimistic reads) inside one WAL-writer critical section
per write. Slot-versioned cross-blob lock-coupling was considered
for v0.2 but deferred to v0.3 — it needs structural changes the
v0.2 scope didn't budget for, and the per-slot version array is
best designed alongside the cross-blob descent flattening rather
than retro-fitted.

Public API surface closed before crates.io publication: the
v0.1 `pub mod layout / journal / store` exposure shipped the
on-disk layout, WAL codec, and buffer-manager guards as part of
SemVer; v0.2 tightened those to `pub(crate)` so the engine is
free to evolve internally without minor-version breaks. See
[CHANGELOG.md](CHANGELOG.md) for the supported user surface.

## v0.3 — Extreme metadata-engine performance (shipped, 2026-05-21)

v0.3 shipped as the performance ceiling milestone for an
embedded metadata engine. Feature work waits. The goal is to push
the Fractal-ART-inspired kernel as far as practical on modern
Linux/macOS: cache-resident reads, path-shaped writes, prefix/list
walks, WAL durability, checkpoint I/O, and large-tree behavior.

The target design borrows the strongest parts of Fractal ART,
LeanStore, and modern NVMe engines while keeping holt's product
boundary narrow: one embedded library, opaque byte values, no RPC
server, no replication, no distributed object-store layer.

### P0 — Remove the write-path serial choke points (done)

v0.3's concurrency cut is implemented in the codebase:

- **Cross-blob lock coupling.** The old parent-held fallback is
  gone. `insert_multi` / `erase_multi` read the parent `BlobNode`,
  pin the child, release the parent guard, and continue in the
  child. The child blob's own `header.root_slot` is the
  authoritative entry, so child-local splits, collapses, and
  compaction do not require parent rewrites.
- **No slot-version sidecar.** The earlier per-slot version plan
  would have added ~80 KB of out-of-line state per cached blob.
  The shipped shape avoids that memory tax by removing
  `BlobNode.child_entry_ptr` and making the child header the only
  cross-blob root token.
- **Scoped walker instead of unsafe state machine.** The mutation
  walker still uses recursive Rust scopes, but it no longer
  retains ancestor guards across child mutation. That gives the
  same latch lifetime as the flattened state-machine design
  without self-referential guard plumbing.
- **Maintenance is separated.** Foreground writers enter only the
  shared side of a narrow atomic `maintenance_gate` while they may
  cross `BlobNode` boundaries. Deletes and leaf-slot churn enqueue
  blob-local compaction candidates; spillovers enqueue parent-merge
  candidates. `compact()` cold-seeds only when no hints exist,
  drains bounded candidate batches, skips clean stale candidates
  after a shared-latch header check, and `compact()` plus the
  background merge pass enter the exclusive side only around the
  parent edge currently being folded/deleted.
  Point reads also take the shared side, but blob-local access
  remains optimistic; ordinary readers and writers still run
  concurrently with each other.
- **Large-tree puts avoid root write-latch traffic.** Cross-blob
  puts route through the root under a shared latch, acquire the
  child write latch while the edge is stable, then mutate only
  from the child down. Child-only writes also return a precise
  `root_dirty = false`, so the caller does not mark the root dirty
  or take the dirty-map mutex for a blob it did not change.
- **Shape counters are exposed.** `Tree::stats()` now reports
  mutation walker ops, total/average/max blob hops, max
  cross-blob boundary depth, spillovers, merges, and optimistic
  read restarts. Prometheus export includes the same counters.
- **Checkpoint/eviction interlock is closed.** Dirty entries
  drained by a checkpoint round stay protected as in-flight
  `flushing` entries until their snapshotted bytes complete
  `write_through`, so eviction cannot drop the cache image in the
  gap between `snapshot_dirty()` and the planner's byte copy.
  Dirty / flushing / pending-delete bookkeeping is sharded by
  `BlobGuid`, and fresh spillover blobs keep a local `Arc` pin
  alive until their dirty entry is visible, so background eviction
  cannot observe a new child blob as clean before checkpoint can
  flush it.

Still intentionally not in P0: per-op latch wait histograms. The
current `HybridLatch` API has no timed acquisition boundary, and
adding timing to every latch acquire should be driven by a
contention benchmark rather than added speculatively.

### P1 — Journal group commit (done, with one explicit boundary)

Durable group commit is implemented:

- Persistent trees own a dedicated journal worker instead of an
  exposed `Arc<Mutex<WalWriter>>`.
- Writers encode complete WAL records into owned buffers, submit
  them to the worker, and wait outside the commit-publish critical
  section.
- `WalCommit::Sync` writers share one `sync_data` when
  they arrive inside the short group window. `Tree::stats()` and
  Prometheus export journal appends, append batches, and sync
  counts so the batching ratio is observable.
- The old global `commit_lock` is replaced with `CommitGate`:
  foreground writers enter the shared side while they mutate blobs,
  publish dirty state, and submit journal records; checkpoint takes
  the exclusive side only while draining dirty/pending sets,
  flushing the journal, and cloning dirty blob bytes.
- This keeps the W2D proof intact while removing writer-vs-writer
  serialization from the persistent write hot path. Writers on
  disjoint child blobs now contend on per-blob latches and the
  mutation-bookkeeping shard they actually touch, not a global
  commit mutex or global dirty mutex.

### P2 — NVMe-grade checkpoint I/O (done)

The background checkpointer already has planner / I/O / eviction
threads. v0.3 makes the I/O side worth that structure:

- Submit dirty blobs as batches, not one synchronous write at a
  time: `BlobStore::write_blobs` is the checkpoint write-through
  primitive.
- The default Unix backend sorts and coalesces slot-contiguous
  512 KB blob writes with `pwritev`; Linux `io_uring` keeps
  ring-depth batched SQE submission with fixed-file and
  fixed-buffer registration.
- Experimental `SQPOLL` / `IOPOLL` / linked-fsync modes are not
  part of the v0.3 path: on Holt's short checkpoint batches they
  did not improve throughput, and keeping them would make the I/O
  completion path harder to audit. The next real Linux step is a
  dedicated async I/O scheduler, not more ring flags.
- Manifest persistence now uses a base snapshot plus append-only
  `manifest.log` set/delete deltas. Checkpoint rounds append and
  fsync only the current delta batch; full `manifest.bin`
  rewrites happen only as log compaction. Deleted slots become
  reusable only after the manifest delta is durable, and reopen
  reconstructs reusable slots from final manifest holes using
  compact ranges rather than expanding sparse high-water files
  into one `u64` per free slot.
- Backends expose a conservative `needs_flush` hint. Clean manual
  checkpoints and background idle rounds skip the data-file Sync
  only when the backend has no outstanding data or manifest work,
  so a previous failed Sync still forces the next retry.
- The WAL worker tracks whether the current log has uncheckpointed
  records, including records found when reopening a nonempty WAL.
  Clean checkpoints now skip empty WAL flush/truncate work instead
  of paying a `sync_data` or temp-file rename on every idle round.
  It also tracks the durable WAL frontier, so checkpoint does not
  re-fsync records that already passed through durable group
  commit. A nonempty WAL discovered on reopen is treated as
  replayable but not proven fsync-durable, so the first checkpoint
  after replay still forces a WAL flush before backend durability.
  Background checkpoint rounds also handle WAL-only truncate work
  without paying an unrelated backend Sync. WAL truncate itself is
  an in-place `ftruncate` + `sync_data`, avoiding the older
  temp-file write, rename, and fd reopen after every checkpoint.
  Release benchmark coverage stays focused on the public
  RocksDB/SQLite metadata comparison. Checkpoint-specific probes
  should be reintroduced only when they are promoted to either a
  public comparator or a correctness integration test.

### P3 — CPU hot-path work

Most of the structural CPU hot path is already in place:

- `SearchKey` uses a virtual terminator, so point lookups and
  mutations no longer allocate a padded key per operation.
- Same-key insert/delete comparison borrows the stored leaf key
  directly. The API allocates only when it must return the old
  value or materialise split-prefix bytes.
- SIMD is already used for Node16 byte search, longest-common
  prefix, Node48 child scans, and Node256 child scans. CRC32 uses
  `crc32fast`.
- Node bodies stay packed in 512 KB blob frames; the hot path does
  not allocate heap node objects or chase per-node boxes.

Remaining P3 work should be profile-driven only: simplify
branches that show up in the flamegraph, then consider targeted
SIMD/copy-repack kernels where profiles prove the scalar path is
still material.

### P4 — Large-tree shape control

- Recursive occupancy-aware spillover is implemented. The picker
  skips existing `BlobNode` crossings, descends inside overfull
  path branches, and chooses a victim near the target child fill
  band instead of blindly peeling off the largest direct child.
- `BlobNode` inline-prefix divergence split is implemented: a
  bad blob boundary now recovers locally into
  `Prefix? -> Node4 -> {old BlobNode, new Leaf}` instead of
  failing the insert.
- Make merge/rebalance incremental. `compact()` and background
  merge are now candidate-driven and online with respect to
  foreground writers through the atomic `maintenance_gate`; the
  remaining large-tree work is policy quality (when to
  split/merge/rebalance), not basic safety.
- Large-tree release coverage is the public `{20 k, 100 k,
  500 k, 2 M}` scale curve in `benches/main.rs`. Additional
  shape probes should be profile-driven and promoted only if they
  become part of the release comparison surface.

## v0.4 — Scale-stable metadata performance

Goal: keep the v0.3 cache-resident read/list advantages while making
large-tree writes and negative metadata probes stay flat as the
namespace grows from millions to tens or hundreds of millions of
records. v0.4 should be judged by scale curves, not by small-tree
microbench wins.

The main risk at this point is not one more syscall on the durable
path. It is shape drift: overfull hot prefixes, too many cross-blob
hops, negative lookups walking full descents, checkpoint bursts
competing with writers, and bulk creates paying one full ART descent
per key. The v0.4 work is therefore workload-shaped rather than
generic KV tuning.

### P0 — Prefix shape policy and write-stability guardrails

- Add explicit shape-control metrics to release benches:
  blob fill distribution, blob-hop histogram, spillover/merge rate,
  checkpoint dirty backlog, and WAL sync batching ratio at each scale
  point.
- Promote prefix/subtree boundaries deliberately. The generic API
  should be a shape hint such as `promote_prefix(prefix)`, not an
  S3-specific "bucket" concept. Object-store users can map one
  bucket or tenant to a promoted prefix; filesystem users can map hot
  directories or mount roots.
- Make spill/merge policy backlog-aware. When writers are flooding a
  hot prefix, the engine should prefer bounded child splits and defer
  cosmetic merges instead of oscillating between split/compact/merge.
- Add write-stability benches that model untar/rsync/S3-sync bursts:
  sorted path creates, random creates under one hot prefix, mixed
  create/delete churn, and large-prefix list while writes continue.

Success criteria:

- `put` and `atomic` throughput should degrade smoothly with scale:
  no cliff when moving from 2M to larger trees.
- Average blob hops should stay low and p99 blob hops should be
  explainable by explicit promoted boundaries, not accidental shape
  drift.
- Checkpoint backlog should drain without forcing foreground writers
  through long global waits.

### P1 — Negative lookup filter, in-memory first

Metadata workloads are unusually negative-heavy: `open/stat/head`
often ask for keys that do not exist. Holt should exploit that, but
without changing the disk format prematurely.

- Add a per-cached-blob negative-lookup sidecar filter first. It must
  never introduce false negatives: on mutation uncertainty, mark the
  filter stale and fall back to the full walker.
- Use the filter only at cross-blob boundaries and within hot prefixes
  where avoiding another blob hop is measurable.
- Rebuild filters from blob contents on cache fill. Persisting the
  filter in the blob header is a later on-disk-format decision, not
  the first implementation.

Success criteria:

- Negative point lookup latency improves on path-shaped workloads
  without regressing positive lookups.
- Filter memory overhead is bounded and visible in `Tree::stats`.

### P2 — Explicit bulk-load path

Bulk create is a metadata-specific write pattern. Treating an archive
restore as N unrelated `put`s leaves performance on the table.

- Add an explicit sorted bulk-load API or builder rather than hidden
  timing heuristics. Callers that know they are loading a directory,
  bucket prefix, image layer, or archive can opt in.
- Build final ART shapes directly for sorted keys: avoid repeated
  Node4 -> Node16 -> Node48 -> Node256 promotion and repeated descent
  through the same prefix.
- Commit each bulk batch as one atomic WAL envelope with bounded byte
  and key-count limits, so crash replay and memory use stay simple.

Success criteria:

- Mass-create workloads improve by multiples, not percentages.
- Point-write latency for ordinary single-key writes is unchanged.

### P3 — Async checkpoint scheduler, not more io_uring flags

The v0.3 Linux backend already has the useful fixed-file /
fixed-buffer / batched-ring foundation. The next I/O gain is not
`SQPOLL` or linked-fsync by default; those did not help Holt's short
checkpoint batches enough to justify the complexity. The next gain is
execution-model work:

- Keep dirty-byte and dirty-age watermarks so checkpoint starts before
  the dirty set becomes a foreground-write problem.
- Let the I/O worker keep multiple write batches in flight when the
  backend can absorb queue depth, while preserving the WAL-before-data
  and data-before-manifest ordering boundaries.
- Add foreground write throttling only as a last resort, driven by
  dirty backlog and checkpoint lag, not by raw operation count.
- Keep the default Unix `pwritev` path simple and auditable; Linux
  `io_uring` should remain an internal backend optimization with the
  same correctness contract.

Success criteria:

- Under sustained writes, p99 put latency should remain bounded while
  checkpoint makes progress.
- Idle and lightly loaded trees should not pay extra scheduler cost.

## v0.5 — Stronger metadata semantics without MVCC tax

Goal: add the missing semantics that real metadata servers need while
preserving Holt's core design: path-shaped keys, opaque byte values,
single-node embedded library, no always-on MVCC version chains.

### Completed — Scoped snapshot view

- `Tree::view(prefix, |view| ...)` captures the blob frames reachable
  for a prefix, releases the live tree, and serves point reads and
  scans from the private frame set.
- Ordinary `range()` / `range_keys()` remain the hot restart-on-conflict
  iterators; `View` is the explicit stable-read path for list/readdir.

### P1 — WAL tail / change feed with explicit retention

- Expose a change-feed API only after WAL retention is designed.
  Today's checkpoint path truncates WAL, so `subscribe(from_seq)`
  cannot pretend old history is always available.
- The first version should support live-tail subscription and bounded
  retained segments for cache invalidation, audit, and follower-style
  experiments.
- Follower mode or replication should remain outside the core release
  until the single-node durability contract has a crisp tail-retention
  story.

### P2 — Sealed immutable subtree

- Add an optional `seal_prefix(prefix)` mode for image layers,
  archives, and package indexes that become read-only after build.
- Start with semantic enforcement and latch elision. Huge pages,
  mmap tiers, and TLB-focused placement are later profile-driven
  optimizations.

## v0.6+ — Optional adapters and research-grade hardware work

These are useful, but they should not pollute the core engine or the
0.4/0.5 scale-stability work.

- POSIX dirent/inode separation belongs in an adapter/example layer:
  e.g. `d/<path>` for dirents and `i/<inode>` for inode metadata.
- ACL inheritance, S3 bucket semantics, lifecycle policy, and object
  versioning are schema conventions above Holt's opaque byte values.
- Value compression should be an optional codec layer, not a core
  schema-aware dictionary unless benchmarks prove opaque values are
  the limiting factor.
- Prefix interning changes node layout and adds read indirection; do
  it only after path-memory profiles show it beats the existing ART
  prefix compression.
- NUMA, CXL.mem, ZNS, and lockless rename are server-appliance or
  research tracks. They are valid experiments, but not release
  blockers for an embedded metadata engine.

## Non-goals for the core engine

- Do not bake S3 object schemas, inode structs, ACL formats, JSON,
  protobuf, chunk lists, or lifecycle rules into the storage core.
  Holt stores opaque values; higher layers own schema.
- Do not add full MVCC until a concrete workload proves copy-on-list
  and atomic CAS batches are insufficient.
- Do not chase io_uring flags that make the backend harder to audit
  without moving a measured bottleneck.
- Do not add compatibility aliases for renamed public APIs. The crate
  is still young enough that clean API shape matters more than
  transition glue.

## v1.0 — Production-ready

- Post-v0.5 public surface and persistence format stabilized.
- Multi-platform stability across Linux + macOS (optional BSDs
  if anyone needs them).
- Real production deployments + case studies.
- Long-term API stability commitment — `holt::*` surface frozen,
  `#[non_exhaustive]` markers in place so additive changes stay
  non-breaking.

## Not on the roadmap

The library is **a metadata engine**, period. Single-node,
embed-in-your-process, Unix-only. Out of scope:

- **Windows support** — `O_DIRECT` (Linux) and `F_NOCACHE` (macOS)
  have no Windows analog this project wants to maintain. The
  crate `compile_error!`s on Windows targets.
- **Object-storage frontend / S3 layer** — the upstream that
  inspired holt's algorithm core wrapped its ART in an S3-style
  RPC server (PUT/GET/LIST inode handlers, multi-tenant bucket
  registry, RPC worker pool). holt does not reproduce any of
  that. The alignment is bounded to the **metadata engine**: ART
  core, blob layout, WAL, latching, range iterator. `WalOp`
  variants holt journals share wire shape with the upstream so a
  future RPC layer could re-use the format, but holt itself ships
  no multi-root registry, no bucket namespace, no RPC dispatcher.
- **Replication / consensus** — build it above this. Future hooks
  can support it, but holt itself does not implement Raft.
- **Network server** — this is a library. Wrap it in your gRPC /
  HTTP / whatever.
- **SQL** — not the right abstraction for this data shape.
- **Vector search** — combine with a dedicated vector DB.
- **Full-text search** — combine with Tantivy / Lucene-rs.

## Contributing

Early-stage project; design feedback most welcome. PRs welcome
too, but please open an issue first for non-trivial changes —
the architecture is still being shaped and we want to avoid
churn.