mmdb-2.2.0 has been yanked.

MMDB

A Modern LSM-Tree Storage Engine!

A pure-Rust, synchronous LSM-Tree key-value storage engine. Designed as a native Rust alternative to RocksDB with competitive performance — scan throughput matches or exceeds RocksDB on equivalent configurations.

Performance

Benchmarked via vsdb (500K entries, 128-byte values, 64MB write buffer, 256MB block cache):

Operation	RocksDB	MMDB	Result
scan 100 entries	14.1 us	11.7 us	MMDB 17% faster
scan 1000 entries	112 us	108 us	MMDB 4% faster
scan 10000 entries	1.10 ms	1.09 ms	MMDB 1% faster
full scan 500K	54.5 ms	52.4 ms	MMDB 4% faster
point read	405 ns	168 ns	MMDB 2.4x faster

Features

Feature	Status
Put / Get / Delete	Implemented
WriteBatch (atomic multi-key writes)	Implemented
WAL with group commit & crash recovery	Implemented
SST files with prefix-compressed blocks	Implemented
Bloom filters (per-key + prefix bloom)	Implemented
Block cache (moka LRU) with L0 pinning (insert_pinned)	Implemented
Table cache (open file handles)	Implemented
MANIFEST version tracking + compaction	Implemented
Leveled compaction with trivial move	Implemented
Multi-threaded background compaction	Implemented
Streaming compaction (O(block) memory)	Implemented
MVCC snapshots via sequence numbers	Implemented
Forward/backward/prefix/range iterators	Implemented
Compression: None, LZ4, Zstd (per-level)	Implemented
Write backpressure (slowdown/stop)	Implemented
DeleteRange (range tombstones)	Implemented
CompactRange API (with range filtering)	Implemented
Compaction filter	Implemented
Rate limiter (token bucket, wired to compaction)	Implemented
DB properties/statistics (wired to all paths)	Implemented
Configurable options (RocksDB parity)	Implemented

Architecture

                      +-------------+
                      |   DB (API)  |   get / put / delete / write_batch
                      +------+------+
                             |
                +------------+------------+
                v            v            v
          +----------+  +---------+  +-----------+
          | MemTable |  |  WAL    |  | MANIFEST  |
          | (active) |  | (append)|  | (versions)|
          +----+-----+  +---------+  +-----------+
               | freeze
               v
          +----------+
          | MemTable |
          | (immut.) |
          +----+-----+
               | flush
               v
    +------------------------------------+
    |      SST Layer (L0 .. Ln)          |
    |  L0: overlapping files             |
    |  L1+: sorted, non-overlapping      |
    +------------------------------------+
               ^
               | background compaction (N threads)
    +----------+-----------+
    | Leveled Compaction   |
    +----------------------+

Write Path

Encode as InternalKey (user_key + !pack(sequence, type) — inverted big-endian for descending sequence order)
Append to WAL (group commit: leader batches multiple writers into one fsync)
Insert into active MemTable (concurrent skiplist)
When MemTable exceeds write_buffer_size, freeze and flush to L0 SST
Signal background compaction threads if L0 file count exceeds threshold

Read Path

Active MemTable (newest writes)
Immutable MemTables (newest first)
L0 SST files (newest first, per-file bloom filter + range check)
L1+ SST files (LevelIterator with lazy file opening, binary search by key range)

Iterator Architecture (key performance innovation)

The scan path uses a MergingIterator (min-heap) that merges entries from:

MemTable cursors (O(1) per entry via skiplist level-0 chain)
L0 TableIterators (one per overlapping SST file)
LevelIterators (one per L1+ level — lazy two-level iterator)

Key optimizations that match/exceed RocksDB:

Single-source bypass: When only one source exists (memtable-only reads), MergingIterator bypasses heap machinery entirely — direct delegation with zero overhead (mirrors RocksDB's MergeIteratorBuilder n==1 optimization)
SuperVersion lock-free reads: Read path (get/iter) uses RwLock<Arc<SuperVersion>> snapshot instead of locking the inner mutex, similar to RocksDB's SuperVersion mechanism
Buffer-reusing memtable iteration: MemTableCursorIter::next_into() copies directly into caller buffers via extend_from_slice — after warm-up, zero heap allocation per entry
In-place key truncation: DBIterator::next_visible() truncates internal keys to user keys in-place instead of allocating a new Vec
Zero-copy block cache: Block stores Arc<Vec<u8>> — cache hits avoid memcpy
Cursor-based block iteration: Decodes entries one-at-a-time from raw block data (decode_entry_reuse) — no per-block Vec allocation
Deferred block read: SST index stores first_key per block; Seek positions without reading data blocks
Buffer reuse: IterSource reuses key/value buffers; next_into() decodes directly into caller buffers
Sequential readahead: posix_fadvise(WILLNEED) after detecting sequential block access
L0 metadata pinning: Index entries and first data blocks pinned (non-evictable) for L0 files via insert_pinned; unpinned on compaction
Sweep-line range tombstone tracking: O(1) amortized instead of O(T) per key
Lazy index loading: TableIterator::new() is O(1) — index parsed on first use
Atomic L0 counter: Write-throttle checks use an atomic counter instead of locking inner, eliminating mutex contention on every put

SST Format

Data Block 0: [entries with prefix compression + restart points]
Data Block 1: ...
...
Filter Block: [bloom filter bits]
Prefix Filter Block: [prefix bloom bits]
Metaindex Block: ["filter.bloom" -> handle, "filter.prefix" -> handle]
Index Block: [last_key -> BlockHandle + first_key per block]
Footer (48 bytes): [metaindex_handle, index_handle, magic]

Source Code Structure

src/
+-- lib.rs                  # Public API re-exports
+-- db.rs                   # DB: open/get/put/delete/write/flush/compact/close
+-- options.rs              # DbOptions, ReadOptions, WriteOptions (RocksDB-compatible)
+-- types.rs                # InternalKey, ValueType, SequenceNumber, WriteBatch
+-- error.rs                # Error types
+-- rate_limiter.rs         # Token-bucket rate limiter
+-- stats.rs                # Database statistics
+-- memtable/
|   +-- mod.rs              # MemTable (put/get/iter with approximate_size tracking)
|   +-- skiplist.rs         # OrdInternalKey + skiplist (O(1) iteration)
+-- wal/
|   +-- writer.rs           # WAL append with block-based fragmentation
|   +-- reader.rs           # WAL replay for crash recovery
|   +-- record.rs           # Record format: checksum + length + type + payload
+-- sst/
|   +-- block.rs            # Data block: prefix compression + restart points + seek
|   +-- block_builder.rs    # Block construction
|   +-- table_builder.rs    # SST writer (data + filter + index with first_key + footer)
|   +-- table_reader.rs     # SST reader + TableIterator (cursor-based, deferred block read)
|   +-- filter.rs           # Bloom filter (double hashing)
|   +-- format.rs           # Footer, BlockHandle, CompressionType, IndexEntry encoding
+-- compaction/
|   +-- leveled.rs          # Leveled compaction: streaming merge, trivial move, filter
+-- manifest/
|   +-- version_edit.rs     # VersionEdit encode/decode
|   +-- version.rs          # Version (immutable SST file set snapshot)
|   +-- version_set.rs      # VersionSet (MANIFEST management + version chain)
+-- cache/
|   +-- block_cache.rs      # Block LRU cache (moka) with L0 pinning support
|   +-- table_cache.rs      # SST reader cache
+-- iterator/
    +-- merge.rs            # Min-heap MergingIterator with buffer reuse + prefetch
    +-- db_iter.rs          # DBIterator: dedup, snapshot, tombstone/range-del filtering
    +-- level_iter.rs       # LevelIterator: lazy two-level iterator for L1+
    +-- range_del.rs        # RangeTombstoneTracker: sweep-line O(1) amortized
    +-- bidi_iter.rs        # BidiIterator: bidirectional iteration

Public API

impl DB {
    pub fn open(options: DbOptions, path: impl AsRef<Path>) -> Result<Self>;
    pub fn put(&self, key: &[u8], value: &[u8]) -> Result<()>;
    pub fn put_with_options(&self, options: &WriteOptions, key: &[u8], value: &[u8]) -> Result<()>;
    pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>>;
    pub fn get_with_options(&self, options: &ReadOptions, key: &[u8]) -> Result<Option<Vec<u8>>>;
    pub fn delete(&self, key: &[u8]) -> Result<()>;
    pub fn delete_with_options(&self, options: &WriteOptions, key: &[u8]) -> Result<()>;
    pub fn delete_range(&self, begin: &[u8], end: &[u8]) -> Result<()>;
    pub fn delete_range_with_options(&self, options: &WriteOptions, begin: &[u8], end: &[u8]) -> Result<()>;
    pub fn write(&self, batch: WriteBatch) -> Result<()>;
    pub fn write_with_options(&self, batch: WriteBatch, options: &WriteOptions) -> Result<()>;
    pub fn iter(&self) -> Result<DBIterator>;

    /// Prefix-optimized iteration. Uses prefix bloom filters to skip entire SST
    /// files that don't contain `prefix`. Supports bounds via ReadOptions.
    pub fn iter_with_prefix(&self, prefix: &[u8], options: &ReadOptions) -> Result<DBIterator>;

    /// Full-scan iteration with optional key-range bounds.
    /// WARNING: Does NOT use prefix bloom filters. For prefix-scoped queries,
    /// prefer `iter_with_prefix()` which is typically 10-100x faster.
    pub fn iter_with_range(&self, options: &ReadOptions, lower: Option<&[u8]>, upper: Option<&[u8]>) -> Result<DBIterator>;

    /// RAII snapshot — automatically released on drop.
    pub fn snapshot(&self) -> Snapshot<'_>;

    pub fn flush(&self) -> Result<()>;
    pub fn compact(&self) -> Result<()>;
    pub fn compact_range(&self, begin: Option<&[u8]>, end: Option<&[u8]>) -> Result<()>;
    pub fn get_property(&self, name: &str) -> Option<String>;
    pub fn close(&self) -> Result<()>;
}

impl DBIterator {
    pub fn valid(&mut self) -> bool;
    pub fn key(&mut self) -> Option<&[u8]>;     // None if invalid (no panic)
    pub fn value(&mut self) -> Option<&[u8]>;   // None if invalid (no panic)
    pub fn advance(&mut self);
    pub fn seek(&mut self, target: &[u8]);       // first key >= target
    pub fn seek_for_prev(&mut self, target: &[u8]); // last key <= target
    pub fn seek_to_first(&mut self);
    pub fn seek_to_last(&mut self);
    pub fn prev(&mut self);
}

impl Snapshot<'_> {
    pub fn sequence(&self) -> SequenceNumber;
    pub fn read_options(&self) -> ReadOptions;  // pre-configured for this snapshot
}

struct ReadOptions {
    pub snapshot: Option<u64>,
    pub iterate_lower_bound: Option<Vec<u8>>,  // inclusive
    pub iterate_upper_bound: Option<Vec<u8>>,  // exclusive
    // ... other options
}

Configuration

let opts = DbOptions {
    write_buffer_size: 64 * 1024 * 1024,    // 64 MB memtable
    block_cache_capacity: 256 * 1024 * 1024, // 256 MB block cache
    block_size: 16384,                        // 16 KB blocks
    l0_compaction_trigger: 8,
    compression: CompressionType::Lz4,
    bloom_bits_per_key: 10,
    max_background_compactions: 4,            // parallel compaction
    pin_l0_filter_and_index_blocks_in_cache: true,
    ..Default::default()
};

// Or use a preset profile:
let opts = DbOptions::write_heavy();  // 128MB memtable, 4 compaction threads, LZ4
let opts = DbOptions::read_heavy();   // large cache, 14 bits/key bloom

Build & Test

cargo build
cargo test               # 250+ tests (unit + integration + e2e + proptest)
make all                 # fmt + lint + check + test
make bench               # criterion benchmarks (warm + cold cache scenarios)
cargo bench -- "cold"    # cold-cache benchmarks only
cargo bench -- "warm"    # warm-cache benchmarks only

Benchmarks test both warm-cache (256MB, data in memory) and cold-cache (256KB, I/O-bound) scenarios. Cold-cache tests use smaller datasets to keep runtime reasonable.

License

MIT

mmdb 2.2.0