Skip to main content

Crate lsm_tree

Crate lsm_tree 

Source
Expand description

Embedded LSM-tree storage engine.

Provides keyed point reads, prefix and range scans, MVCC snapshots, block and file-descriptor caching, and a configurable compaction subsystem. No write-ahead log — durability is the caller’s responsibility (flush_active_memtable forces persistence when needed).

§Highlights

  • AMQ filter: BuRR (Bumped Ribbon Retrieval, Walzer & Dillinger 2022) for per-key and per-prefix membership checks. ~30% smaller filter blocks than a same-FPR Bloom filter, or ~10× tighter FPR at the same memory budget.
  • Compression: pure-Rust zstd (incl. dictionary mode), LZ4, or none — per-table and per-level policy.
  • Encryption at rest: AES-256-GCM block encryption with a caller-supplied key.
  • Range tombstones: delete_range / delete_prefix (SST-encoded; the feature was added in disk format V4 and remains supported in the current V5 format — the V5 break extends the block header for per-block Reed- Solomon Page ECC, not the tombstone encoding).
  • Merge operators: commutative-merge LSM operations with lazy resolution.
  • K/V separation (BlobTree): large-value workloads with automatic GC.
  • Pluggable Fs: standard, in-memory, io_uring, or custom backends.
  • MVCC: snapshot reads at a chosen SeqNo, custom UserComparator.
  • Concurrency: thread-safe BTreeMap-like API.

Keys: up to 65,535 bytes (u16 length field). Values: up to 4,294,967,295 bytes (u32 length field, 2³² − 1). Larger keys and values carry a proportional performance cost.

§Quick start

use lsm_tree::{AbstractTree, Config, SequenceNumberCounter};

let folder = tempfile::tempdir().unwrap();
let seqno = SequenceNumberCounter::default();
let tree = Config::new(&folder, seqno.clone(), SequenceNumberCounter::default())
    .open()
    .unwrap();

tree.insert("key", "value", seqno.next());
let value = tree.get("key", lsm_tree::SeqNo::MAX).unwrap();
assert_eq!(value.map(|v| v.to_vec()), Some(b"value".to_vec()));

§On-disk format

Current version: V5. V5 introduces the BuRR filter wire format, per-block Reed-Solomon Page ECC, and per-entry (per-KV) checksum footers (collapsed into the same version because V5 had not shipped when they landed): the self-describing block types (Meta / Manifest / ManifestFooter) gain a block_flags byte whose ECC_PARITY bit marks a parity trailer and whose KV_CHECKSUM_FOOTER bit marks a per-entry checksum footer. SST block types (Data / Index / Filter / RangeTombstone) keep the compact header WITHOUT that byte and derive parity / footer presence from the per-SST meta descriptors (descriptor#page_ecc, descriptor#kv_checksum). The block magic is bumped so a pre-V5 reader rejects V5 blocks immediately at header decode. A V5 SST written with every optional transform off (no Page ECC, no per-KV footers) is still NOT byte-identical to a pre-V5 table — the bumped block magic and the per-SST meta descriptor keys always differ. Any “byte-identical when off” guarantees in the feature docs are within-V5 and payload-level (e.g. index entries when seqno_in_index = false), not a cross-version equivalence. V3-V4 databases are not readable by this version and vice versa. The manifest version gate rejects pre-V5 databases at Tree::open time. V4 introduced range tombstones (still supported).

Re-exports§

pub use storage_stats::ApproximateRangeStats;
pub use storage_stats::LevelStats;
pub use storage_stats::RangeCardinality;
pub use storage_stats::SegmentStats;
pub use storage_stats::StorageStatistics;
pub use storage_stats::StorageStats;
pub use storage_stats::StorageStatus;
pub use encryption::EncryptionProvider;
pub use encryption::Aes256GcmProvider;
pub use repair::RepairReport;
pub use config::Config;
pub use config::KvSeparationOptions;
pub use config::TreeType;
pub use backpressure::Backpressure;
pub use backpressure::BackpressureThresholds;

Modules§

backpressure
Computed write-backpressure verdict (caller-honoured, see backpressure). Computed write-backpressure verdict.
compaction
Contains compaction strategies
config
Configuration
ecc
Shard-based Page ECC (XOR single-parity and Reed-Solomon).
encryption
Block-level encryption at rest. Block-level encryption at rest.
fs
Pluggable filesystem abstraction for I/O backends. Pluggable filesystem abstraction for I/O backends.
hash
Stable hash functions used across the filter and table subsystems.
heal_hints
Tree-wide queue of SSTs whose blocks were ECC-corrected on a read that was confirmed persistent (the on-disk bytes are still faulty after a cache-bypassing re-read).
inspect
Out-of-band inspection of a single SST file.
io
Local I/O trait surface mirroring std::io::{Read, Write, Seek}.
repair
Disaster-recovery: rebuild a missing/corrupt manifest from on-disk SSTs. Last-resort MANIFEST reconstruction from the SST files on disk.
runtime_config
Runtime-toggleable configuration (RuntimeConfig + atomic-swap handle). Runtime-toggleable configuration.
salvage
Block-granular SST salvage: recover the readable blocks of an SST whose whole-file verification fails, quarantining the corrupted ones.
scrub
ECC patrol scrub: a proactive sweep over Page-ECC-protected SST blocks.
storage_stats
Read-only storage introspection: how much is stored, the average shape of a stored entry, and an estimate of how many more entries fit in a byte budget.
util
Utility functions
verify
Integrity verification for SST and blob files.

Structs§

BlobFile
A blob file stores large values and is part of the value log
BlobTree
A key-value-separated log-structured merge tree
Cache
Cache, in which blocks or blobs are cached in-memory after being retrieved from disk
CacheStats
A point-in-time snapshot of block-cache effectiveness and occupancy.
CheckpointInfo
Summary of a checkpoint produced by AbstractTree::create_checkpoint.
DefaultUserComparator
Default comparator using lexicographic byte ordering.
DescriptorTable
Caches file descriptors to tables and blob files
Memtable
The memtable serves as an intermediary, ephemeral, sorted storage for new items
Metrics
Runtime metrics
SequenceNumberCounter
Thread-safe sequence number generator
Slice
An immutable byte slice that can be cloned without additional heap allocation
SystemClock
The system wall-clock, backed by std::time::SystemTime.
Tree
A log-structured merge tree (LSM-tree/LSMT)
WriteBatch
Batch of write operations applied with a shared seqno.
ZstdDictionary
Pre-trained zstd dictionary for improved compression of small blocks.

Enums§

AnyIngestion
Unified ingestion builder over AnyTree
AnyTree
May be a standard Tree or a BlobTree
CompressionType
Compression algorithm to use
Error
Represents errors that can occur in the LSM-tree
FormatVersion
Block / SST disk format version.
PinnableSlice
A value reference that may share the decompressed block buffer.
ScanSinceEvent
A single change event emitted by Tree::scan_since_seqno.
ValueType
Value type (regular value or tombstone)

Constants§

MAX_SEQNO
The maximum allowed sequence number value.

Traits§

AbstractTree
Generic Tree API
Clock
A source of wall-clock time.
Guard
Guard to access key-value pairs
MergeOperator
A user-defined merge operator for commutative LSM operations.
PrefixExtractor
Extracts prefixes from keys for prefix bloom filter indexing.
SequenceNumberGenerator
Trait for custom sequence number generation.
UserComparator
Trait for custom user key comparison.

Type Aliases§

KvPair
KV-tuple (key + value)
MemtableId
Unique memtable ID
Result
Tree result
SeqNo
Sequence number - a monotonically increasing counter
SharedComparator
Shared reference to a UserComparator.
SharedSequenceNumberGenerator
A shared, cloneable sequence number generator.
UserKey
User defined key (byte array)
UserValue
User defined data (byte array)