coordinode-lsm-tree 5.1.0

Embedded LSM-tree storage engine: BuRR filters, zstd dictionary compression, MVCC, range tombstones, merge operators, K/V separation, AES-256-GCM at rest.
Documentation

coordinode-lsm-tree

CI codecov Benchmarks Crates.io docs.rs MSRV dependency status License

LSM-tree storage engine in Rust. Embedded library; provides keyed point reads, prefix and range scans, MVCC snapshots, compaction, and a block cache. No write-ahead log — durability is the caller's responsibility. Built for CoordiNode; usable standalone.

Status

On-disk format version V5. V5 introduces a wire-format break for filter blocks (BuRR replaces Bloom); V3 and V4 databases are not readable by this version and vice versa. Versioning is single-monotonic — every breaking format change bumps to the next version with explicit migration notes.

Features

Read path

  • Point reads via get / multi_get (batch-optimized).
  • PinnableSlice for zero-copy reads.
  • BurrFilter AMQ filter (Bumped Ribbon Retrieval, Walzer & Dillinger 2022): ~1% memory overhead vs the information-theoretic minimum — ~30% smaller filter blocks than a same-FPR Bloom filter, or ~10× tighter FPR at the same memory budget. Used for both per-key and per-prefix membership checks.
  • Forward and reverse range / prefix iteration.
  • Block cache with size cap.
  • File-descriptor cache to bound fopen syscalls.

Write path

  • WriteBatch with seqno-grouped batch writes — caller-controlled atomic visibility.
  • Single deletion tombstones (remove_weak).
  • Range tombstones (delete_range / delete_prefix).
  • Merge operators for commutative LSM operations.
  • Optional key-value separation (BlobTree) for large-value workloads with automatic garbage collection.

Compaction

  • Leveled, size-tiered, dynamic-leveled, and FIFO strategies.
  • Intra-L0 compaction for overlapping runs.
  • Major compaction (full force flush + merge).
  • Optional compaction filters for custom logic during compactions.
  • Merge-aware compaction resolves operands lazily.

Storage & encoding

  • Block-based tables with optional compression (none / LZ4 / Zstd) and prefix truncation.
  • Per-table data block size policy and per-table compression policy.
  • Optional zstd dictionary compression — trained per-table or per-column for small (4-64 KiB) blocks and blob files.
  • Optional block-level encryption at rest — AES-256-GCM, key supplied by caller.
  • Optional per-table block hash indexes for faster point lookups [3].
  • Optional partitioned block index & filters for better cache efficiency [1].
  • Per-level filter/index block pinning configuration.

Concurrency & API

  • Thread-safe BTreeMap-like API.
  • SequenceNumberGenerator trait — pluggable seqno source.
  • Custom UserComparator for non-lexicographic ordering.
  • MVCC: snapshot reads at a chosen SeqNo.
  • Point-in-time recovery snapshots via Tree::create_checkpoint — hard-link every live SST + blob file into a fresh directory in O(1) per file, zero extra disk until the source files compact away. Compaction continues during the checkpoint (deletions are deferred), and the resulting directory opens as an independent tree.

Internals

  • 100% stable Rust, MSRV 1.92.
  • No FFI: zstd via structured-zstd (pure-Rust), LZ4 via lz4_flex, AES via aes-gcm.
  • Pluggable Fs trait — back the engine on the standard filesystem, on io_uring, on an in-memory MemFs, or on a custom implementation.
  • Pluggable CompressionProvider for third-party codecs.

Incremental scan / CDC

Tree::scan_since_seqno(target) streams every change committed at or after a sequence number as ScanSinceEvents (Insert, MergeOperand, PointTombstone, RangeTombstone), in increasing seqno order. It is a change-data-capture primitive: a downstream consumer (replica, Kafka connector, Debezium-style pipeline) replays the events in order to reconstruct the source's history. Superseded versions are preserved (no MVCC collapse) and tombstones are exposed, so deletes replay faithfully.

With the runtime-toggleable seqno_in_index policy, each SST index entry carries its data block's (seqno_min, seqno_max), and the scan skips any data block whose seqno_max is below the target without reading it. On a sparse-change workload (e.g. 1% of data changed since the last poll) this turns an O(data) scan into O(changed blocks). Trees with a mix of seqno-bounded and legacy SSTs are scanned transparently (legacy blocks fall back to a per-entry filter), so the policy can be enabled on a live tree and takes effect as compaction rewrites SSTs — no migration step, no format-version bump.

Engine CDC granularity Survives compaction Embeddable Block-skip
RocksDB GetUpdatesSince WAL events (lost after compaction) no yes n/a
Pebble SST file (64–128 MB) yes yes no
CockroachDB changefeed SST file yes no (distributed) no
FoundationDB per-event yes no (distributed) n/a
coordinode-lsm-tree data block (4–32 KB) yes yes yes

The trade-off: there is no write-ahead log, so we do not offer WAL-style millisecond tailing of in-flight updates. For arbitrary historical-seqno queries (not just "since X"), pair with Tree::create_checkpoint.

Limits

  • Keys: up to 65,535 bytes (the on-disk encoding caps the key-length field at u16).
  • Values: up to 4,294,967,295 bytes (2³² − 1; the encoding caps the value-length field at u32).
  • Larger keys and values carry a proportional performance cost.

Feature flags

All optional, all off by default. The default build is the minimal core (no compression, no encryption, std filesystem). Every flag below is gated because it pulls in extra dependencies or runtime overhead.

Flag Pulls in Enable when
lz4 lz4_flex Block compression wanted, decompression latency matters more than ratio.
zstd structured-zstd (pure-Rust, no FFI) Block compression wanted, ratio matters more than absolute decompression speed. Supports CompressionType::Zstd and dictionary-mode CompressionType::ZstdDict. Decompression is ~2-3.5× slower than C reference.
encryption aes-gcm, rand_chacha AES-256-GCM block encryption at rest. Keys are caller-managed.
io-uring (linux only) io-uring I/O-bound workload on a modern Linux kernel — adds an io_uring Fs backend.
bytes_1 bytes Consumer already speaks bytes::Bytes (tokio/hyper/tonic stack) and wants zero-copy interop with engine slices.
metrics Production observability or profiling. Compiles in atomic counters around block I/O, filter probes, compaction, and cache hit rates (tree.metrics()). Small but non-zero hot-path cost.
ribbon-serde serde Snapshotting the internal RibbonFilterRepr for debugging or out-of-band transport. Not used by the on-disk format.

Benchmarks

CI runs db_bench on every push to main and on pull requests. Results from main are published to the benchmark dashboard. PRs regressing performance by more than 15% trigger an alert; more than 25% fails CI.

Flamegraphs are generated on every merge to main from instrumented db_bench runs and published under flamegraphs/<commit-sha>/flamegraph.svg on gh-pages.

Local Criterion microbenchmarks:

cargo bench --features lz4

Local flamegraphs:

cd tools/db_bench
cargo run --release --features flamegraph -- \
  --benchmark all --num 100000 --flamegraph
# Folded stacks: target/flamegraphs/all.folded
# Render: cargo install inferno && inferno-flamegraph target/flamegraphs/all.folded > flame.svg

Operational tools

Tool Use
tools/db_bench RocksDB-compatible benchmark suite, also drives the CI perf dashboard.
tools/sst-dump Inspect / verify a single SST file out-of-band. Subcommands: verify (walk every block, check per-block XXH3, exit non-zero on corruption), properties (print the SST's stored metadata: id, key range, KV / tombstone counts, block counts, compression, timestamp), hex <offset> (raw hex dump of a region with optional Header decode; useful for inspecting a specific offset flagged by verify --verbose), index-dump (print TLI entries: end_key + offset + size + seqno per pointed-at block; useful for diagnosing range-read fan-out), dump (stream every KV entry to stdout with optional --from / --to / --max=N / --keys-only filters; full-index SSTs only), filter-stats (print BuRR filter sizing: section bytes, layer count, item count, approximate bits-per-key; single-block filters only, partitioned filters not yet supported).

Manifest hardening

The per-version manifest file (v{N}) is stored as a sequence of standard lsm-tree Blocks — one BlockType::Manifest Block per section, plus a BlockType::ManifestFooter Block at the tail carrying the table of contents and the manifest layout version. Every Block goes through the same XXH3-128 / optional ECC / optional AEAD pipeline data Blocks use, so every protection that applies to a data Block automatically applies to the manifest.

Five layers compose the manifest's integrity surface. L1 and L4 are always on as part of the on-disk format; L2, L3, and L5 are independently configurable (compile-time feature, encryption provider, or runtime config):

Layer Defends against Config knob Default
L1 — Block XXH3-128 Bit-rot detection per section / footer Block Always on (Block invariant) always on
L2 — Page ECC (Reed-Solomon (4, 2)) Bit-rot recovery per Block RuntimeConfig::page_ecc for manifest Blocks (current release) + Config::page_ecc for SST data Blocks (compile-time page_ecc feature gates both) off (opt-in)
L3 — AEAD encryption Tampering detection + confidentiality Config::with_encryption(provider) off
L4 — Footer Block tail hint Reader locates footer without scanning Always on (trailing u32 size hint) always on
L5 — Head footer mirror Partial-write / tail-bit-rot recovery via mirrored copy at offset 0 RuntimeConfig::manifest_footer_mirror on

Manifest-side ECC and the head mirror are reachable through Tree::update_runtime_config (for L2 on manifest Blocks and L5); SST data-block ECC is tree-static via Config::page_ecc at open time and not affected by runtime updates. AEAD is supplied at open via Config::with_encryption (L3). Runtime toggles take effect on the next manifest write, and existing on-disk manifests stay readable in their original format because each Block self-describes via its header.

Failure-mode coverage with the defaults (mirror on, ECC off, AEAD off):

  • Bit-flip inside one section Block → XXH3 surfaces it, other sections + the footer Block still read correctly (per-section isolation; see manifest_blocks::reader::tests::reader_isolates_corruption_to_one_section_other_sections_readable).
  • Tail footer Block corruption → reader falls back to the head mirror at offset 0 and continues.
  • Partial write mid-update on a fresh v{N+1} → the prior version's manifest stays intact in its own v{N} file, and the atomic CURRENT rewrite either lands fully (pointing at v{N+1}) or stays at v{N}. A torn or truncated v{N+1} file is detected at open time: the reader's tail-footer + head-mirror probes both fail and Tree::open surfaces ManifestFooterInvalid. The head mirror inside each v{N} is a copy of THAT file's own tail footer Block (for tail-bit-rot recovery within one version), not a snapshot of a prior version's TOC.
  • Accidental file substitution / mislinking → the CURRENT pointer carries an XXH3-128 of the referenced v{N} manifest's canonical footer content (version_id + TOC entries + per-section XXH3-128 each section Block already carries in its own header); Tree::open recomputes this from the parsed footer on read and refuses to follow a mismatched pointer. The digest binds logical content, not raw bytes — a section bit-flip that per-Block Page ECC heals on read does NOT trip this check (the section's TOC checksum is unchanged), preserving recovery. XXH3-128 is not a cryptographic MAC: an adversary with write access can craft matching content. Enable Config::with_encryption(...) (AEAD per Block) for adversarial tamper resistance.

Turning ECC on (page_ecc = true) upgrades the first three rows from "detect and refuse" to "detect and recover" without any change to call sites — the Block layer transparently emits / consumes the parity trailer per Block.

Manifest recovery modes

Config::manifest_recovery_mode controls how the engine reacts to a malformed MANIFEST record at Tree::open time. Each mode trades a different point on the strictness ↔ availability axis; pick the one whose contract matches the deployment.

Mode Behaviour on corruption When to use
AbsoluteConsistency (default) Any per-record decode mismatch (bad XXH3, invalid tag, truncated TOC entry, declared-count overrun) aborts the open with the original error. No data is silently dropped. Production default. Surfaces every byte of corruption before the tree comes back online; matches what most workloads actually want.
TolerateCorruptedTailRecords If the iteration over tables / blob_files runs out of bytes before the declared count is reached (truncated tail), keep everything that decoded cleanly before the cut and emit a warn! listing the dropped count. Any mid-record error that is NOT a clean tail truncation (bad checksum, etc.) still aborts. Power-loss-at-write-tail salvage. Use when a crash mid-fsync left the MANIFEST tail incomplete and you'd rather come up with the last consistent prefix than refuse to open. Not a general bit-rot tolerance, only "the writer never finished".
PointInTimeRecovery On the first record-decode mismatch inside the tables section, keeps the consistent prefix collected so far (records that decoded cleanly BEFORE the corrupt one in the current run, plus complete earlier runs in the same level, plus complete earlier levels) and drops everything after. "Record-decode mismatch" covers all three failure shapes the per-record loop produces: (a) framing-layer XXH3 mismatch, (b) framing-header structural failure (len > MAX_FRAME_PAYLOAD), and (c) payload-decode failure inside an otherwise-framed-OK record (e.g. InvalidTag from a corrupt checksum_type byte — the framing XXH3 happens to cover the corrupt byte, so the bytes pass the framing layer; the corruption only surfaces at per-entry decode). Analogous treatment on blob_files. Tail-truncation is still tolerated like TolerateCorruptedTailRecords. Post-corruption salvage with LSM invariants intact. Use when a manifest has acquired real bit-rot (not just a truncated write) and you want the largest internally-consistent prefix the engine can still vouch for; matches RocksDB's kPointInTimeRecovery accept-the-prefix rule adapted to the level/run/table nesting.
SkipAnyCorruptedRecords On any per-record decode mismatch (framing-layer XXH3 mismatch, payload-decode failure inside an otherwise-framed-OK record, or framing-header BadHeader), logs the skip and advances past the bad record using the framing-supplied length field. If the framing header itself is corrupt (length field outside the legal range, so the next-record boundary cannot be located), the rest of that section is dropped — there's no safe way to find the next record boundary in that case. Symmetric on tables and blob_files. Maximum-availability forensic mode. Use when you'd rather open the tree with whatever survives than refuse to open at all; pairs with the repair_db tooling tracked in #303 for the cases where even the surviving records aren't enough.

When a non-default mode drops records, the recovery path logs warn! lines describing what was tolerated. Individual table-IDs / blob-file-IDs are NOT enumerated because they were never decoded. Warnings fall into two categories:

Per-condition warns (one warn for each malformed shape encountered, at the point of detection):

  • tables section truncated before the level_count byte → tail-tolerant mode produces 0 levels.
  • tables declared table_count exceeds remaining section payload (count header forged or entries truncated) → loop walks bytes-actually-present and stops at the first EOF.
  • blob_files section truncated before its count header → 0 blob files.
  • blob_files declared count exceeds remaining section capacity → same forged-or-truncated shape, same walk-and-stop fallback.
  • blob_gc_stats payload truncated (power-loss between the blob_files commit and the blob_gc_stats payload landing) → tail-tolerant mode produces an empty FragmentationMap. GC stats are advisory (fragmentation re-accrues on the next compaction pass), so this is a "rebuild on next pass" outcome, not data loss. This is a single in-place warn with no later summary.

Per-section summary warns (at end of section processing, only if the section actually lost records):

  • tables section, emitted only when tables_dropped_to_tail > 0 OR tables_truncated_headers > 0. Reports two counters in one line: declared-but-missing table records (count header said N, only K < N records read before EOF → N-K dropped) and the separate counter for level / run / table_count headers cut mid-byte (no records were supposed to be present yet for those levels / runs, so the headers contribute zero to record loss but the levels / runs themselves are absent).
  • blob_files section, emitted only when blob_dropped_to_tail > 0. Reports the declared-but-missing blob-file records count, analogous to the tables-section record-drop counter. A blob_files section whose only damage was a missing count header surfaces the per-condition warn above but does NOT add a summary line.

Operators wanting a per-record audit trail should pair a tail-tolerant open with an out-of-band integrity scan (see verify::verify_integrity / tools/sst-dump verify).

For workflows where the MANIFEST is unrecoverable even under the lossy modes, the planned repair_db tool (#303) will rebuild the MANIFEST from the SST files themselves.

Support the project

USDT TRC-20 Donation QR Code

USDT (TRC-20): TFDsezHa1cBkoeZT5q2T49Wp66K8t2DmdA

Credits

Originally created by Marvin Blum as part of fjall-rs/lsm-tree; this codebase carries the original copyright (Copyright (c) 2024-present, fjall-rs). The vendored Ribbon filter (src/table/filter/ribbon/) is by William Rågstad — see src/table/filter/ribbon/_vendored/ for the upstream license texts.

License

All source code is licensed under Apache-2.0. Each first-party .rs file carries an SPDX-License-Identifier: Apache-2.0 header alongside the original-author copyright and the maintainer copyright (Structured World Foundation). Contributions are accepted under the same license.

The vendored Ribbon filter (src/table/filter/ribbon/) keeps its upstream layout — it carries William Rågstad's per-module licensing commentary rather than per-file SPDX headers, plus the original LICENSE-APACHE and LICENSE-MIT preserved verbatim in src/table/filter/ribbon/_vendored/. The upstream crate is dual-licensed (MIT OR Apache-2.0); we redistribute the vendored copy only under the Apache-2.0 arm per Apache-2.0 §4.

Maintained by Structured World Foundation.

Footnotes

[1] https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html

[2] https://github.com/facebook/rocksdb/wiki/BlobDB

[3] https://rocksdb.org/blog/2018/08/23/data-block-hash-index.html