coordinode-lsm-tree

LSM-tree storage engine in Rust. Embedded library; provides keyed point reads, prefix and range scans, MVCC snapshots, compaction, and a block cache. No write-ahead log: durability is the caller's responsibility. Built for CoordiNode; usable standalone.

Status

On-disk format version V5. V5 introduces a wire-format break for filter blocks (BuRR replaces Bloom); V3 and V4 databases are not readable by this version and vice versa. Versioning is single-monotonic: every breaking format change bumps to the next version with explicit migration notes.

Features

Read path

Point reads via get / multi_get (batch-optimized).
PinnableSlice for zero-copy reads.
BurrFilter AMQ filter (Bumped Ribbon Retrieval, Walzer & Dillinger 2022): ~1% memory overhead vs the information-theoretic minimum: ~30% smaller filter blocks than a same-FPR Bloom filter, or ~10× tighter FPR at the same memory budget. Used for both per-key and per-prefix membership checks.
Forward and reverse range / prefix iteration.
Block cache with size cap.
File-descriptor cache to bound fopen syscalls.

Write path

WriteBatch with seqno-grouped batch writes: caller-controlled atomic visibility.
Single deletion tombstones (remove_weak).
Range tombstones (delete_range / delete_prefix).
Merge operators for commutative LSM operations.
Optional key-value separation (BlobTree) for large-value workloads with automatic garbage collection.

Compaction

Leveled, size-tiered, dynamic-leveled, and FIFO strategies.
Intra-L0 compaction for overlapping runs.
Major compaction (full force flush + merge).
Optional compaction filters for custom logic during compactions.
Merge-aware compaction resolves operands lazily.

Storage & encoding

Block-based tables with optional compression (none / LZ4 / Zstd) and prefix truncation.
Per-table data block size policy and per-table compression policy.
Optional zstd dictionary compression, trained per-table or per-column for small (4-64 KiB) blocks and blob files.
Optional block-level encryption at rest: AES-256-GCM, key supplied by caller.
Optional per-table block hash indexes for faster point lookups [3].
Optional partitioned block index & filters for better cache efficiency [1].
Per-level filter/index block pinning configuration.

Concurrency & API

Thread-safe BTreeMap-like API.
SequenceNumberGenerator trait: pluggable seqno source.
Custom UserComparator for non-lexicographic ordering.
MVCC: snapshot reads at a chosen SeqNo.
Point-in-time recovery snapshots via Tree::create_checkpoint: hard-link every live SST + blob file into a fresh directory in O(1) per file, zero extra disk until the source files compact away. Compaction continues during the checkpoint (deletions are deferred), and the resulting directory opens as an independent tree.

Internals

100% stable Rust, MSRV 1.92.
No FFI: zstd via structured-zstd (pure-Rust), LZ4 via lz4_flex, AES via aes-gcm.
Pluggable Fs trait: back the engine on the standard filesystem, on io_uring, on an in-memory MemFs, or on a custom implementation.
Pluggable CompressionProvider for third-party codecs.

Incremental scan / CDC

Tree::scan_since_seqno(target) streams every change committed at or after a sequence number as ScanSinceEvents (Insert, MergeOperand, PointTombstone, RangeTombstone), in increasing seqno order. It is a change-data-capture primitive: a downstream consumer (replica, Kafka connector, Debezium-style pipeline) replays the events in order to reconstruct the source's history. Superseded versions are preserved (no MVCC collapse) and tombstones are exposed, so deletes replay faithfully.

With the runtime-toggleable seqno_in_index policy, each SST index entry carries its data block's (seqno_min, seqno_max), and the scan skips any data block whose seqno_max is below the target without reading it. On a sparse-change workload (e.g. 1% of data changed since the last poll) this turns an O(data) scan into O(changed blocks). Trees with a mix of seqno-bounded and legacy SSTs are scanned transparently (legacy blocks fall back to a per-entry filter), so the policy can be enabled on a live tree and takes effect as compaction rewrites SSTs: no migration step, no format-version bump.

Engine	CDC granularity	Survives compaction	Embeddable	Block-skip
RocksDB `GetUpdatesSince`	WAL events (lost after compaction)	no	yes	n/a
Pebble	SST file (64–128 MB)	yes	yes	no
CockroachDB changefeed	SST file	yes	no (distributed)	no
FoundationDB	per-event	yes	no (distributed)	n/a
coordinode-lsm-tree	data block (4–32 KB)	yes	yes	yes

The trade-off: there is no write-ahead log, so we do not offer WAL-style millisecond tailing of in-flight updates. For arbitrary historical-seqno queries (not just "since X"), pair with Tree::create_checkpoint.

Limits

Keys: up to 65,535 bytes (the on-disk encoding caps the key-length field at u16).
Values: up to 4,294,967,295 bytes (2³² − 1; the encoding caps the value-length field at u32).
Larger keys and values carry a proportional performance cost.

Feature flags

All optional, all off by default. The default build is the minimal core (no compression, no encryption, std filesystem). Every flag below is gated because it pulls in extra dependencies or runtime overhead.

Flag	Pulls in	Enable when
`lz4`	`lz4_flex`	Block compression wanted, decompression latency matters more than ratio.
`zstd`	`structured-zstd` (pure-Rust, no FFI)	Block compression wanted, ratio matters more than absolute decompression speed. Supports `CompressionType::Zstd` and dictionary-mode `CompressionType::ZstdDict`. Decompression is ~2-3.5× slower than C reference.
`encryption`	`aes-gcm`, `rand_chacha`	AES-256-GCM block encryption at rest. Keys are caller-managed.
`io-uring` (linux only)	`io-uring`	I/O-bound workload on a modern Linux kernel: adds an `io_uring` `Fs` backend.
`bytes_1`	`bytes`	Consumer already speaks `bytes::Bytes` (tokio/hyper/tonic stack) and wants zero-copy interop with engine slices.
`metrics`	(none)	Production observability or profiling. Compiles in atomic counters around block I/O, filter probes, compaction, and cache hit rates (`tree.metrics()`). Small but non-zero hot-path cost.
`ribbon-serde`	`serde`	Snapshotting the internal `RibbonFilterRepr` for debugging or out-of-band transport. Not used by the on-disk format.

Benchmarks

CI runs db_bench on every push to main and on pull requests. Results from main are published to the benchmark dashboard. PRs regressing performance by more than 15% trigger an alert; more than 25% fails CI.

Flamegraphs are generated on every merge to main from instrumented db_bench runs and published under flamegraphs/<commit-sha>/flamegraph.svg on gh-pages.

Local Criterion microbenchmarks:

cargo bench --features lz4

Local flamegraphs:

cd tools/db_bench
cargo run --release --features flamegraph -- \
  --benchmark all --num 100000 --flamegraph
# Folded stacks: target/flamegraphs/all.folded
# Render: cargo install inferno && inferno-flamegraph target/flamegraphs/all.folded > flame.svg

Operational tools

Tool Use

tools/db_bench RocksDB-compatible benchmark suite, also drives the CI perf dashboard.

tools/sst-dump Inspect / verify a single SST file out-of-band. Subcommands: verify (walk every block, check per-block XXH3, exit non-zero on corruption), properties (print the SST's stored metadata: id, key range, KV / tombstone counts, block counts, compression, timestamp), hex <offset> (raw hex dump of a region with optional Header decode; useful for inspecting a specific offset flagged by verify --verbose), index-dump (print TLI entries: end_key + offset + size + seqno per pointed-at block; useful for diagnosing range-read fan-out), dump (stream every KV entry to stdout with optional --from / --to / --max=N / --keys-only filters; full-index SSTs only), filter-stats (print BuRR filter sizing: section bytes, layer count, item count, approximate bits-per-key; single-block filters only, partitioned filters not yet supported).

Tool	Use
`tools/db_bench`	RocksDB-compatible benchmark suite, also drives the CI perf dashboard.
`tools/sst-dump`	Inspect / verify a single SST file out-of-band. Subcommands: `verify` (walk every block, check per-block XXH3, exit non-zero on corruption), `properties` (print the SST's stored metadata: id, key range, KV / tombstone counts, block counts, compression, timestamp), `hex <offset>` (raw hex dump of a region with optional `Header` decode; useful for inspecting a specific offset flagged by `verify --verbose`), `index-dump` (print TLI entries: end_key + offset + size + seqno per pointed-at block; useful for diagnosing range-read fan-out), `dump` (stream every KV entry to stdout with optional `--from` / `--to` / `--max=N` / `--keys-only` filters; full-index SSTs only), `filter-stats` (print BuRR filter sizing: section bytes, layer count, item count, approximate bits-per-key; single-block filters only, partitioned filters not yet supported).

Data integrity & durability

Every record is checksummed end to end, from the in-memory memtable, through each on-disk block, to the manifest, with optional Reed-Solomon ECC, self-healing scrub, and AAD-bound encryption at rest. Off by default, byte-identical when disabled; turned on, detection becomes recovery.

This is the tamper-evident, silent-corruption protection that clinical and regulated systems require: the integrity controls behind HIPAA §164.312(c), FDA 21 CFR Part 11, and ALCOA+.

See docs/data-integrity.md for the full integrity stack: block + per-KV checksums (including memtable-residence RAM bit-flip detection), the Page ECC spectrum, self-healing scrub, tamper-evident encryption, the five-layer manifest hardening surface, and the manifest recovery modes.

Support the project

USDT TRC-20 Donation QR Code

USDT (TRC-20): TFDsezHa1cBkoeZT5q2T49Wp66K8t2DmdA

Credits

Originally created by Marvin Blum as part of fjall-rs/lsm-tree; this codebase carries the original copyright (Copyright (c) 2024-present, fjall-rs). The vendored Ribbon filter (src/table/filter/ribbon/) is by William Rågstad; see src/table/filter/ribbon/_vendored/ for the upstream license texts.

License

All source code is licensed under Apache-2.0. Each first-party .rs file carries an SPDX-License-Identifier: Apache-2.0 header alongside the original-author copyright and the maintainer copyright (Structured World Foundation). Contributions are accepted under the same license.

The vendored Ribbon filter (src/table/filter/ribbon/) keeps its upstream layout: it carries William Rågstad's per-module licensing commentary rather than per-file SPDX headers, plus the original LICENSE-APACHE and LICENSE-MIT preserved verbatim in src/table/filter/ribbon/_vendored/. The upstream crate is dual-licensed (MIT OR Apache-2.0); we redistribute the vendored copy only under the Apache-2.0 arm per Apache-2.0 §4.

Maintained by Structured World Foundation.