Expand description
Embedded LSM-tree storage engine.
Provides keyed point reads, prefix and range scans, MVCC snapshots, block
and file-descriptor caching, and a configurable compaction subsystem. No
write-ahead log — durability is the caller’s responsibility (flush_active_memtable
forces persistence when needed).
§Highlights
- AMQ filter:
BuRR(Bumped Ribbon Retrieval, Walzer & Dillinger 2022) for per-key and per-prefix membership checks. ~30% smaller filter blocks than a same-FPR Bloom filter, or ~10× tighter FPR at the same memory budget. - Compression: pure-Rust zstd (incl. dictionary mode), LZ4, or none — per-table and per-level policy.
- Encryption at rest: AES-256-GCM block encryption with a caller-supplied key.
- Range tombstones:
delete_range/delete_prefix(SST-encoded; the feature was added in disk format V4 and remains supported in the current V5 format — the V5 break extends the block header for per-block Reed- Solomon Page ECC, not the tombstone encoding). - Merge operators: commutative-merge LSM operations with lazy resolution.
- K/V separation (
BlobTree): large-value workloads with automatic GC. - Pluggable
Fs: standard, in-memory,io_uring, or custom backends. - MVCC: snapshot reads at a chosen
SeqNo, customUserComparator. - Concurrency: thread-safe
BTreeMap-like API.
Keys: up to 65,535 bytes (u16 length field). Values: up to 4,294,967,295
bytes (u32 length field, 2³² − 1). Larger keys and values
carry a proportional performance cost.
§Quick start
use lsm_tree::{AbstractTree, Config, SequenceNumberCounter};
let folder = tempfile::tempdir().unwrap();
let seqno = SequenceNumberCounter::default();
let tree = Config::new(&folder, seqno.clone(), SequenceNumberCounter::default())
.open()
.unwrap();
tree.insert("key", "value", seqno.next());
let value = tree.get("key", lsm_tree::SeqNo::MAX).unwrap();
assert_eq!(value.map(|v| v.to_vec()), Some(b"value".to_vec()));§On-disk format
Current version: V5. V5 introduces the BuRR filter wire format,
per-block Reed-Solomon Page ECC, and per-entry (per-KV) checksum
footers (collapsed into the same version because V5 had not shipped
when they landed): the self-describing block types (Meta / Manifest /
ManifestFooter) gain a block_flags byte whose ECC_PARITY bit marks a
parity trailer and whose KV_CHECKSUM_FOOTER bit marks a per-entry
checksum footer. SST block types (Data / Index / Filter /
RangeTombstone) keep the compact header WITHOUT that byte and derive parity / footer
presence from the per-SST meta descriptors (descriptor#page_ecc,
descriptor#kv_checksum). The block magic is bumped so a pre-V5 reader
rejects V5 blocks immediately at header decode.
A V5 SST written with every optional transform off (no Page ECC, no per-KV
footers) is still NOT byte-identical to a pre-V5 table — the bumped block
magic and the per-SST meta descriptor keys always differ. Any
“byte-identical when off” guarantees in the feature docs are within-V5 and
payload-level (e.g. index entries when seqno_in_index = false), not a
cross-version equivalence.
V3-V4 databases are not readable by this version and vice versa. The
manifest version gate rejects pre-V5 databases at Tree::open time.
V4 introduced range tombstones (still supported).
Re-exports§
pub use storage_stats::ApproximateRangeStats;pub use storage_stats::LevelStats;pub use storage_stats::RangeCardinality;pub use storage_stats::SegmentStats;pub use storage_stats::StorageStatistics;pub use storage_stats::StorageStats;pub use storage_stats::StorageStatus;pub use encryption::EncryptionProvider;pub use encryption::Aes256GcmProvider;pub use repair::RepairReport;pub use config::Config;pub use config::KvSeparationOptions;pub use config::TreeType;pub use backpressure::Backpressure;pub use backpressure::BackpressureThresholds;
Modules§
- backpressure
- Computed write-backpressure verdict (caller-honoured, see
backpressure). Computed write-backpressure verdict. - compaction
- Contains compaction strategies
- config
- Configuration
- ecc
- Shard-based Page ECC (XOR single-parity and Reed-Solomon).
- encryption
- Block-level encryption at rest. Block-level encryption at rest.
- fs
- Pluggable filesystem abstraction for I/O backends. Pluggable filesystem abstraction for I/O backends.
- hash
- Stable hash functions used across the filter and table subsystems.
- heal_
hints - Tree-wide queue of SSTs whose blocks were ECC-corrected on a read that was confirmed persistent (the on-disk bytes are still faulty after a cache-bypassing re-read).
- inspect
- Out-of-band inspection of a single SST file.
- io
- Local I/O trait surface mirroring
std::io::{Read, Write, Seek}. - repair
- Disaster-recovery: rebuild a missing/corrupt manifest from on-disk SSTs.
Last-resort
MANIFESTreconstruction from the SST files on disk. - runtime_
config - Runtime-toggleable configuration (
RuntimeConfig+ atomic-swap handle). Runtime-toggleable configuration. - salvage
- Block-granular SST salvage: recover the readable blocks of an SST whose whole-file verification fails, quarantining the corrupted ones.
- scrub
- ECC patrol scrub: a proactive sweep over Page-ECC-protected SST blocks.
- storage_
stats - Read-only storage introspection: how much is stored, the average shape of a stored entry, and an estimate of how many more entries fit in a byte budget.
- util
- Utility functions
- verify
- Integrity verification for SST and blob files.
Structs§
- Blob
File - A blob file stores large values and is part of the value log
- Blob
Tree - A key-value-separated log-structured merge tree
- Cache
- Cache, in which blocks or blobs are cached in-memory after being retrieved from disk
- Cache
Stats - A point-in-time snapshot of block-cache effectiveness and occupancy.
- Checkpoint
Info - Summary of a checkpoint produced by
AbstractTree::create_checkpoint. - Default
User Comparator - Default comparator using lexicographic byte ordering.
- Descriptor
Table - Caches file descriptors to tables and blob files
- Memtable
- The memtable serves as an intermediary, ephemeral, sorted storage for new items
- Metrics
- Runtime metrics
- Sequence
Number Counter - Thread-safe sequence number generator
- Slice
- An immutable byte slice that can be cloned without additional heap allocation
- System
Clock - The system wall-clock, backed by
std::time::SystemTime. - Tree
- A log-structured merge tree (LSM-tree/LSMT)
- Write
Batch - Batch of write operations applied with a shared seqno.
- Zstd
Dictionary - Pre-trained zstd dictionary for improved compression of small blocks.
Enums§
- AnyIngestion
- Unified ingestion builder over
AnyTree - AnyTree
- May be a standard
Treeor aBlobTree - Compression
Type - Compression algorithm to use
- Error
- Represents errors that can occur in the LSM-tree
- Format
Version - Block / SST disk format version.
- Pinnable
Slice - A value reference that may share the decompressed block buffer.
- Scan
Since Event - A single change event emitted by
Tree::scan_since_seqno. - Value
Type - Value type (regular value or tombstone)
Constants§
- MAX_
SEQNO - The maximum allowed sequence number value.
Traits§
- Abstract
Tree - Generic Tree API
- Clock
- A source of wall-clock time.
- Guard
- Guard to access key-value pairs
- Merge
Operator - A user-defined merge operator for commutative LSM operations.
- Prefix
Extractor - Extracts prefixes from keys for prefix bloom filter indexing.
- Sequence
Number Generator - Trait for custom sequence number generation.
- User
Comparator - Trait for custom user key comparison.
Type Aliases§
- KvPair
- KV-tuple (key + value)
- Memtable
Id - Unique memtable ID
- Result
- Tree result
- SeqNo
- Sequence number - a monotonically increasing counter
- Shared
Comparator - Shared reference to a
UserComparator. - Shared
Sequence Number Generator - A shared, cloneable sequence number generator.
- UserKey
- User defined key (byte array)
- User
Value - User defined data (byte array)