Why emdb
Bitcask-style architecture on top of fsys:
one fsys-journal-backed append-only log, sharded in-memory hash
index, lock-free reads + lock-free writes via fsys's atomic LSN
reservation. fsys handles the platform-specific durability layer
(NVMe passthrough flush, io_uring on Linux, WRITE_THROUGH on
Windows where appropriate); emdb handles the engine-level concerns.
Performance vs. peers
5 M records, 24-byte random keys, 150-byte random values — same workload
shape as redb's published bench. Lower is better; numbers in
milliseconds. Run on a Windows 11 NVMe consumer box. Reproduce with
cargo bench --bench lmdb_style --features ttl,bench-compare.
| phase | emdb | redb | sled | emdb vs redb |
|---|---|---|---|---|
| bulk load | 13 724 ms | 43 660 ms | 31 116 ms | 3.2× faster |
| individual writes (fsync/op) | 406 ms | 544 ms | 429 ms | 1.3× faster |
| batch writes | 292 ms | 5 970 ms | 1 286 ms | 20.4× faster |
| nosync writes | 127 ms | 1 025 ms | 675 ms | 8.1× faster |
| random reads (1 M) | 322 ms | 2 765 ms | 6 079 ms | 8.6× faster |
| random reads (4 threads) | 703 ms | 11 210 ms | 22 884 ms | 15.9× faster |
| random reads (8 threads) | 511 ms | 13 026 ms | 23 392 ms | 25.5× faster |
| removals | 5 662 ms | 33 348 ms | 25 631 ms | 5.9× faster |
| compaction | 8 268 ms | 12 540 ms | N/A | 1.5× faster |
| uncompacted size | 1.10 GiB | 4.00 GiB | 2.15 GiB | 3.6× smaller |
| compacted size | 508 MiB | 1.64 GiB | N/A | 3.3× smaller |
| random range reads | opt-in | 2 376 ms | 6 133 ms | see note 1 |
emdb now wins every measured phase. The single-thread individual-writes
phase, where v0.8.x was 39× behind redb because each db.flush()
issued one Windows FlushFileBuffers call per record, is now
1.3× faster than redb and 1.06× faster than sled thanks
to the fsys journal substrate (lock-free LSN reservation,
group-commit fsync, NVMe passthrough flush where supported). One
note on the row where the table doesn't tell the whole story:
- Range reads are opt-in, not unsupported. emdb's primary
index is hash-keyed, so the default open does not pay the memory
tax for sorted iteration. Set
EmdbBuilder::enable_range_scans(true) to maintain a parallel lock-free crossbeam_skiplist::SkipMap secondary index per namespace — see the Range scans section below for the API and the memory-cost trade-off. v0.8 added streaming Emdb::range_iter / range_prefix_iter so consumers that only read the first few elements pay only for what they consume.
Read scaling under fan-out
The MT random-read columns above show emdb scaling to 9.94 M
reads/sec aggregate at 8 threads on a 4-core consumer box, while
redb stalls near 347 K/sec past one thread. The lock-free Arc<Mmap>
read path plus the 64-shard hash index keep the hot path contention-
free; past core count, shared memory bandwidth is the only cap.
For more thread-count granularity, run
cargo bench --bench concurrent_reads.
Group commit: multi-threaded per-record durability
FlushPolicy::Group lets concurrent flush() calls share a single
fdatasync. The shape that motivates it is N independent producer
threads each writing one record then calling flush for per-record
durability — a pattern where OnEachFlush pays N syncs even though
one would do.
Run with cargo bench --bench group_commit --features ttl. Default
workload is 8 threads × 200 writes/thread:
| policy | wall time (ms) | writes/sec | speedup |
|---|---|---|---|
| OnEachFlush | 2192 | 730 | 1.00× |
| Group | 272 | 5 880 | 8.06× |
max_batch should be set close to the expected concurrent flusher
count (typically num_cpus). Setting it higher means the leader
waits the full max_wait for followers that can never arrive,
turning batching into pure tail latency.
use Duration;
use ;
let db = builder
.flush_policy
.build?;
# Ok::
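To make the leader/follower idea behind group commit concrete, here is a minimal std-only sketch. It is not emdb's or fsys's implementation: `GroupCommit`, `State`, and `run` are hypothetical names, the sync is simulated with `thread::yield_now()`, and there is no `max_batch` / `max_wait` handling. The first flusher that finds no sync in flight becomes the leader and covers every record appended so far; the others wait on a condition variable.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Toy coordinator state. All names here are hypothetical, not fsys types.
struct State {
    appended: u64,   // highest sequence number handed out so far
    synced: u64,     // highest sequence number covered by a finished sync
    syncing: bool,   // true while a leader is inside the sync
    sync_calls: u64, // number of simulated syncs performed
}

struct GroupCommit {
    state: Mutex<State>,
    wake: Condvar,
}

impl GroupCommit {
    fn new() -> Self {
        GroupCommit {
            state: Mutex::new(State { appended: 0, synced: 0, syncing: false, sync_calls: 0 }),
            wake: Condvar::new(),
        }
    }

    // "Append" one record and return its sequence number.
    fn append(&self) -> u64 {
        let mut s = self.state.lock().unwrap();
        s.appended += 1;
        s.appended
    }

    // Block until record `seq` is covered by a completed sync.
    fn flush(&self, seq: u64) {
        let mut s = self.state.lock().unwrap();
        while s.synced < seq {
            if s.syncing {
                // Follower: a leader is already syncing, wait for it.
                s = self.wake.wait(s).unwrap();
                continue;
            }
            // Leader: sync everything appended up to this point.
            s.syncing = true;
            let target = s.appended;
            drop(s);
            thread::yield_now(); // stand-in for the real fdatasync
            s = self.state.lock().unwrap();
            s.synced = target;
            s.syncing = false;
            s.sync_calls += 1;
            self.wake.notify_all();
        }
    }
}

// Each thread writes one record, then flushes it. Returns (synced, sync_calls).
fn run(threads: u64, writes_per_thread: u64) -> (u64, u64) {
    let gc = Arc::new(GroupCommit::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let gc = Arc::clone(&gc);
            thread::spawn(move || {
                for _ in 0..writes_per_thread {
                    let seq = gc.append();
                    gc.flush(seq);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let s = gc.state.lock().unwrap();
    (s.synced, s.sync_calls)
}

fn main() {
    let (synced, syncs) = run(8, 200);
    println!("{synced} records made durable by {syncs} simulated syncs");
}
```

The number of simulated syncs varies from run to run, but it can never exceed the number of flush calls, and under contention it is usually lower because one leader's sync covers several followers.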
FlushPolicy::WriteThrough: opt-in per-pwrite durability
For workloads where OnEachFlush's per-flush() cost is dominated
by FlushFileBuffers latency (the canonical Windows
single-thread-per-record-durability pain), v0.8.5 adds
FlushPolicy::WriteThrough as a third policy. The file is opened
with FILE_FLAG_WRITE_THROUGH (Windows) / O_SYNC (Unix) so every
pwrite is durable on return; flush() becomes near-free.
The trade-off is real: bulk loads under WriteThrough are slower
because every individual pwrite waits for disk instead of
benefiting from the OS write-back cache. Whether WriteThrough
beats OnEachFlush depends on the workload, the file's existing
size, and the OS's FlushFileBuffers cost on that file. Benchmark
on your actual data to decide.
Reproduce on your machine:
cargo bench --bench write_through --features ttl
use emdb::{Emdb, FlushPolicy};
let db = Emdb::builder()
    .flush_policy(FlushPolicy::WriteThrough)
    .build()?;
# Ok::<(), emdb::Error>(())
See docs/BENCH.md for full run instructions and tuning notes.
Status
v0.9.0. Major architectural change from v0.8.5 — the storage
substrate is now a fsys journal
(lock-free LSN reservation, group-commit fsync, NVMe passthrough
flush, io_uring on Linux). emdb's read path keeps its own
Arc<Mmap> over the journal file for zero-copy lookups; the
write path delegates entirely to fsys. Existing v0.7 / v0.8.x
file formats are not compatible — v0.9 uses fsys's frame format
on the data file and a new <path>.meta sidecar for emdb's
metadata.
The API surface from v0.8.5 carries over: optional at-rest
encryption (AES-256-GCM or ChaCha20-Poly1305, raw key or
Argon2id passphrase); optional sorted-iteration secondary index
via EmdbBuilder::enable_range_scans(true); three flush-policy
variants (OnEachFlush, Group, WriteThrough); streaming
iter / keys / range; cursor-style iter_from / iter_after;
zero-copy get_zerocopy; atomic backup_to(path); point-in-time
stats(); stale-lockfile recovery (lock_holder + break_lock).
Pre-1.0. The remaining work before v1.0 is API stabilisation:
an audit pass for pub vs pub(crate), a cargo-fuzz target
for the record decoder, and a docs/stability.md SemVer
commitment. No further architectural changes are planned before
1.0.
Installation
[dependencies]
emdb = "0.9.0"
# All optional features
emdb = { version = "0.9.0", features = ["ttl", "nested", "encrypt"] }
MSRV: Rust 1.75.
Quick start
use Emdb;
let db = open_in_memory;
db.insert?;
assert_eq!;
# Ok::
Persistence
use Emdb;
let path = temp_dir.join;
let reopened = open?;
assert_eq!;
# let _cleanup = remove_file;
# Ok::
flush() durably writes the record bytes; it does not rewrite the
file header. The header carries a tail_hint that lets the next
open skip past the bulk of the log instead of scanning from byte
4096. Call checkpoint() at quiescent points (after a bulk load,
on graceful shutdown) to update that hint and pay one extra fsync
in exchange for fast reopens. The drop of the last handle attempts
a checkpoint as a backstop; explicit calls are recommended for
long-lived processes that care about reopen latency.
Storage path resolution
Emdb::open(path) is the simplest entry point. For library / app
authors who want platform-aware path resolution, set both app_name
and database_name so your project gets a clearly-scoped subdirectory
under the platform data root.
use emdb::Emdb;
// Resolves to:
// Linux: $XDG_DATA_HOME/hivedb-kv/sessions.emdb
// macOS: ~/Library/Application Support/hivedb-kv/sessions.emdb
// Windows: %LOCALAPPDATA%\hivedb-kv\sessions.emdb
let db = Emdb::builder()
    .app_name("hivedb-kv")
    .database_name("sessions.emdb")
    .build()?;
# Ok::<(), emdb::Error>(())
| builder method | default if unset | notes |
|---|---|---|
| app_name(name) | "emdb" | Single folder name under the platform data root. |
| database_name(name) | "emdb-default.emdb" | Bare filename; no extension auto-added. |
| data_root(path) | platform default | Escape hatch for tests / containers / sandboxes. |
app_name is a single folder name by design — path separators (/,
\), .. components, and the empty string are rejected at build time.
Mixing path() with any of the OS-resolution methods returns
Error::InvalidConfig.
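The app_name rules above can be sketched as a small predicate. This is an illustration of the stated rules only; `is_valid_app_name` is a hypothetical helper, and emdb's actual build-time check may cover more cases.

```rust
// Hypothetical helper mirroring the documented app_name rules:
// no empty string, no path separators, no `..` component.
fn is_valid_app_name(name: &str) -> bool {
    !name.is_empty()
        && !name.contains('/')
        && !name.contains('\\')
        // With separators already rejected, the whole name is the only
        // component, so a `..` component means the name is exactly "..".
        && name != ".."
}

fn main() {
    for candidate in ["hivedb-kv", "", "a/b", "a\\b", ".."] {
        println!("{candidate:?} -> {}", is_valid_app_name(candidate));
    }
}
```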
Bulk loading
For high-volume inserts, prefer insert_many: it packs every record
into a single buffer and does one pwrite, which is the path that beats
redb 3.2× in the bulk-load row of the bench above.
use Emdb;
let db = open_in_memory;
let items: =
.map
.collect;
db.insert_many?;
db.flush?;
# Ok::
Transactions
use Emdb;
let db = open_in_memory;
db.transaction?;
assert_eq!;
# Ok::
Transactions buffer writes and commit them as one bulk insert on
success. Err from the closure drops the buffered writes — nothing
hits disk.
use ;
let db = open_in_memory;
let failed = db.;
assert!;
assert_eq!;
# Ok::
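The buffer-then-commit behaviour described above can be modelled in a few lines. This is a sketch over an in-memory map, not emdb's transaction code; `Store` and `Tx` are hypothetical types, and the "bulk insert" is a single `extend` call.

```rust
use std::collections::HashMap;

// Hypothetical write buffer handed to the transaction closure.
struct Tx {
    pending: Vec<(Vec<u8>, Vec<u8>)>,
}

impl Tx {
    fn insert(&mut self, key: &[u8], value: &[u8]) {
        self.pending.push((key.to_vec(), value.to_vec()));
    }
}

// Hypothetical store standing in for the engine.
struct Store {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl Store {
    fn new() -> Self {
        Store { map: HashMap::new() }
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.map.get(key).map(|v| v.as_slice())
    }

    // Run `f` against a fresh buffer. On Ok the buffered writes are applied
    // in one step; on Err the buffer is dropped and the store is untouched.
    fn transaction<E>(
        &mut self,
        f: impl FnOnce(&mut Tx) -> Result<(), E>,
    ) -> Result<(), E> {
        let mut tx = Tx { pending: Vec::new() };
        f(&mut tx)?;
        self.map.extend(tx.pending);
        Ok(())
    }
}

fn main() {
    let mut store = Store::new();
    let failed: Result<(), String> = store.transaction(|tx| {
        tx.insert(b"k", b"v");
        Err("abort".to_string())
    });
    println!("failed = {}, key present = {}", failed.is_err(), store.get(b"k").is_some());
}
```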
Durability model
Each record is framed with a CRC32. On crash recovery the engine walks
records from header.tail_hint and treats the first bad CRC as the
truncation point. Per-record atomicity is guaranteed; batch
atomicity across a transaction is not — a crash mid-commit leaves a
prefix of the batch durable. Callers that need true all-or-nothing
across N records must layer that on top.
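The recovery rule above (walk the frames, stop at the first bad CRC) can be sketched as follows. The frame layout used here, a 4-byte little-endian length followed by a 4-byte little-endian CRC32 and then the payload, is an assumption made for illustration; it is not fsys's actual frame format, and `append_frame` / `recover` are hypothetical helpers.

```rust
// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn read_u32(buf: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]])
}

// Assumed frame layout: [len: u32 LE][crc32(payload): u32 LE][payload].
fn append_frame(log: &mut Vec<u8>, payload: &[u8]) {
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(&crc32(payload).to_le_bytes());
    log.extend_from_slice(payload);
}

// Walk frames from `start`. Returns the payloads that verify and the offset
// at which the valid prefix ends (the truncation point).
fn recover(log: &[u8], start: usize) -> (Vec<Vec<u8>>, usize) {
    let mut records = Vec::new();
    let mut pos = start;
    while pos + 8 <= log.len() {
        let len = read_u32(log, pos) as usize;
        let crc = read_u32(log, pos + 4);
        let end = pos + 8 + len;
        // A torn tail or a CRC mismatch both end the valid prefix.
        if end > log.len() || crc32(&log[pos + 8..end]) != crc {
            break;
        }
        records.push(log[pos + 8..end].to_vec());
        pos = end;
    }
    (records, pos)
}

fn main() {
    let mut log = Vec::new();
    for payload in [&b"a"[..], &b"bb"[..], &b"ccc"[..]] {
        append_frame(&mut log, payload);
    }
    let last = log.len() - 1;
    log[last] ^= 0xFF; // simulate a partially written final record
    let (records, truncate_at) = recover(&log, 0);
    println!("{} records survive, truncate at byte {}", records.len(), truncate_at);
}
```

Note how this matches the prefix semantics in the text: every record before the damaged frame survives, and nothing after it is trusted.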
Compaction
The append-only log accumulates tombstoned and superseded records over
time. Emdb::compact() rewrites the live records into a sibling file,
truncates to logical size, and atomically swaps it in.
use Emdb;
let path = temp_dir.join;
let db = open?;
db.insert?;
db.remove?; // tombstone added to log
db.compact?; // log now holds only the live records
db.flush?;
# let _cleanup = remove_file;
# let _cleanup2 = remove_file;
# Ok::
Compaction is a heavier operation than flush — call it on maintenance
windows, not on every write. Existing readers holding Arc<Mmap>
snapshots from before the compaction continue reading from the old
inode until they release; new reads see the compacted layout.
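As an illustration of what "rewrites the live records" means, the sketch below compacts an in-memory log. It is a model of the idea, not emdb's compactor: `Entry` and `compact` are hypothetical, and the sibling-file write and atomic rename are left out.

```rust
use std::collections::HashMap;

// Hypothetical log entry: either a write or a tombstone for a key.
#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Tombstone(Vec<u8>),
}

// Return a new log holding only the latest live Put for each key,
// in the order those Puts appeared in the original log.
fn compact(log: &[Entry]) -> Vec<Entry> {
    // Pass 1: remember the last log position that touched each key.
    let mut latest: HashMap<&[u8], usize> = HashMap::new();
    for (i, entry) in log.iter().enumerate() {
        let key = match entry {
            Entry::Put(k, _) | Entry::Tombstone(k) => k.as_slice(),
        };
        latest.insert(key, i);
    }
    // Pass 2: keep an entry only if it is a Put and is the latest for its key.
    log.iter()
        .enumerate()
        .filter(|(i, entry)| match entry {
            Entry::Put(k, _) => latest[k.as_slice()] == *i,
            Entry::Tombstone(_) => false,
        })
        .map(|(_, entry)| entry.clone())
        .collect()
}

fn main() {
    let log = vec![
        Entry::Put(b"a".to_vec(), b"1".to_vec()),
        Entry::Put(b"b".to_vec(), b"1".to_vec()),
        Entry::Put(b"a".to_vec(), b"2".to_vec()), // supersedes the first Put
        Entry::Tombstone(b"b".to_vec()),          // removes "b"
    ];
    println!("{} entries before, {} after", log.len(), compact(&log).len());
}
```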
Range scans
emdb's primary index is a sharded hash, so unsorted iteration is the
default. To support range / prefix queries, opt in at open time with
EmdbBuilder::enable_range_scans(true). The engine maintains a
parallel lock-free crossbeam_skiplist::SkipMap<Vec<u8>, u64>
secondary index per namespace; range queries scan the skiplist and
resolve values through the mmap. Inserts and range iteration are
concurrent-safe without a global lock.
use Emdb;
let db = builder
.enable_range_scans
.build?;
db.insert?;
db.insert?;
db.insert?;
// Half-open range: ["user:", "user;").
let users = db.range?;
assert_eq!;
assert_eq!;
assert_eq!;
// Prefix shorthand: builds the half-open `[prefix, prefix++)` range.
let same = db.range_prefix?;
assert_eq!;
# Ok::
Cost: one Vec<u8> clone of the key per insert plus the skiplist
node overhead — roughly doubles in-memory index size for a typical
workload. Calling db.range(...) without enabling this at open time
returns Error::InvalidConfig.
Namespace::range and Namespace::range_prefix give the same view
scoped to a named namespace.
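The `[prefix, prefix++)` construction can be sketched with the standard library. This example swaps in `std::collections::BTreeMap` for the crossbeam skiplist so it stays dependency-free, and treats the `u64` value as an opaque record locator; `prefix_upper_bound` and `range_prefix` are hypothetical helpers, not emdb's API.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Smallest byte string that sorts after every key starting with `prefix`.
// Returns None when no such bound exists (empty prefix or all 0xFF bytes).
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.pop() {
        if last < 0xFF {
            end.push(last + 1);
            return Some(end);
        }
        // Trailing 0xFF cannot be incremented: drop it and carry leftwards.
    }
    None
}

// Collect every (key, locator) pair whose key starts with `prefix`.
fn range_prefix<'a>(
    index: &'a BTreeMap<Vec<u8>, u64>,
    prefix: &[u8],
) -> Vec<(&'a [u8], u64)> {
    let upper = prefix_upper_bound(prefix);
    let hi = match &upper {
        Some(u) => Bound::Excluded(u.as_slice()),
        None => Bound::Unbounded,
    };
    index
        .range::<[u8], _>((Bound::Included(prefix), hi))
        .map(|(k, &locator)| (k.as_slice(), locator))
        .collect()
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert(b"admin:root".to_vec(), 0u64);
    index.insert(b"user:alice".to_vec(), 1);
    index.insert(b"user:bob".to_vec(), 2);
    index.insert(b"user;x".to_vec(), 3);
    for (key, locator) in range_prefix(&index, b"user:") {
        println!("{} -> {}", String::from_utf8_lossy(key), locator);
    }
}
```

For the prefix `user:` the computed upper bound is `user;`, because `;` is the byte that follows `:`, which is the same half-open range shown in the example above.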
Cargo features
- ttl (default) — per-record expiration and default_ttl.
- nested — dotted-prefix group operations and Focus handles.
- encrypt — AES-256-GCM + ChaCha20-Poly1305 at-rest encryption with raw-key or Argon2id-derived passphrase. Pulls in aes-gcm, chacha20poly1305, argon2, rand_core.
- bench-compare — pulls in redb and sled for the comparative bench (dev-only; not for production builds).
- bench-rocksdb / bench-redis — additional comparative bench peers.
Concurrency
Emdb is Send + Sync and cheap to clone — clones share the same
underlying engine via Arc. Pass clones across threads instead of
synchronising access to a single handle.
Reads scale. A 64-shard parking_lot::RwLock<HashMap>
primary index plus zero-copy slices from a shared Arc<Mmap> keep
the hot path contention-free: the comparative bench above hits
7.66 M reads/sec aggregate at 8 threads on a 4-core consumer box.
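A sharded index of the kind described above can be sketched with the standard library. This version uses `std::sync::RwLock` instead of `parking_lot` and `DefaultHasher` for shard selection, both of which are substitutions made to keep the example dependency-free; `ShardedIndex` and `count_hits` are hypothetical names, and the `u64` value stands in for whatever the real index stores.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};
use std::thread;

const SHARDS: usize = 64;

// Hypothetical sharded index: each shard has its own lock, so readers and
// writers touching different shards never contend with each other.
struct ShardedIndex {
    shards: Vec<RwLock<HashMap<Vec<u8>, u64>>>,
}

impl ShardedIndex {
    fn new() -> Self {
        ShardedIndex {
            shards: (0..SHARDS).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard_of(key: &[u8]) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % SHARDS
    }

    fn insert(&self, key: &[u8], value: u64) {
        let mut shard = self.shards[Self::shard_of(key)].write().unwrap();
        shard.insert(key.to_vec(), value);
    }

    fn get(&self, key: &[u8]) -> Option<u64> {
        let shard = self.shards[Self::shard_of(key)].read().unwrap();
        shard.get(key).copied()
    }
}

// Look up keys 0..keys from several threads and count successful reads.
fn count_hits(index: &Arc<ShardedIndex>, threads: usize, keys: u64) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let index = Arc::clone(index);
            thread::spawn(move || {
                (0..keys)
                    .filter(|i| index.get(&i.to_le_bytes()) == Some(*i))
                    .count()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let index = Arc::new(ShardedIndex::new());
    for i in 0..1000u64 {
        index.insert(&i.to_le_bytes(), i);
    }
    println!("hits across 8 reader threads: {}", count_hits(&index, 8, 1000));
}
```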
Writes scale too. There is no writer mutex on the hot append
path — fsys::JournalHandle reserves the write slot via a single
atomic fetch_add on the next-LSN counter, and concurrent appenders
issue independent pwrites to their reserved byte ranges. Producers
on N threads do not serialise on a global writer lock. Group-commit
durability is handled by fsys's leader/follower fsync coordinator,
so multiple concurrent flush() calls coalesce into one
fdatasync. High-throughput producers should still batch through
db.insert_many(...) or db.transaction(|tx| ...), which route
through fsys's vectored append_batch (one LSN reservation + one
pwrite for the whole batch) — strictly faster than N independent
appends.
use Arc;
use thread;
use Emdb;
let db = new;
db.insert?;
let mut workers = Vecnew;
for i in 0_u32..4
for worker in workers
assert!;
# Ok::
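The slot-reservation idea in the "Writes scale too" paragraph (one atomic `fetch_add`, then independent writes into the reserved range) can be sketched without any locks. `Reserver` and `reserve_concurrently` are hypothetical; the sketch reserves byte offsets directly and assumes every record has the same length, which is a simplification rather than a description of fsys internals.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical reservation counter: the only shared state on the append path.
struct Reserver {
    next: AtomicU64,
}

impl Reserver {
    // Claim `len` bytes and return the start of the exclusive range
    // [start, start + len). No two callers can receive overlapping ranges.
    fn reserve(&self, len: u64) -> u64 {
        self.next.fetch_add(len, Ordering::Relaxed)
    }
}

// Reserve `per_thread` slots of `len` bytes from each of `threads` threads
// and return every start offset, sorted ascending.
fn reserve_concurrently(threads: usize, per_thread: usize, len: u64) -> Vec<u64> {
    let reserver = Arc::new(Reserver { next: AtomicU64::new(0) });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let reserver = Arc::clone(&reserver);
            thread::spawn(move || {
                (0..per_thread)
                    .map(|_| reserver.reserve(len))
                    .collect::<Vec<u64>>()
            })
        })
        .collect();
    let mut offsets: Vec<u64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    offsets.sort_unstable();
    offsets
}

fn main() {
    let offsets = reserve_concurrently(4, 1000, 174);
    println!("{} disjoint slots reserved", offsets.len());
}
```

Because every reservation is a single atomic add, the sorted offsets form a gap-free sequence of multiples of the record length regardless of how the threads interleave.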
Performance tuning
The defaults are tuned for storage-engine workloads — emdb opens its
fsys handle with tune_for(Workload::Database) (8 MiB resident
buffer pool, 256-deep io_uring ring, 4 K-deep batch queue) and
applies WriteLifetimeHint::Long to the journal on Linux so the
SSD groups journal data into long-lived NAND blocks. Bulk inserts
and transactions route through fsys's vectored
JournalHandle::append_batch, which submits the whole batch as one
LSN reservation + one pwrite — strictly faster than calling
append in a tight loop. None of these require caller action.
Two opt-in knobs go past the defaults:
use ;
// Linux io_uring kernel-side SQPOLL submission polling. The kernel
// spawns a polling thread that drains the SQ without requiring
// `io_uring_enter` syscalls; idles after `idle_ms` of no submissions.
// Sustained-throughput WAL writers see measurable wins; bursty
// workloads pay for the polling thread and are better off without it.
// Linux-only. Falls back cleanly to non-SQPOLL on EPERM / unsupported
// kernels — same durability contract, slower path.
let db = builder
.iouring_sqpoll // idle window in milliseconds
.flush_policy // pairs well with group-commit
.build?;
# Ok::
FlushPolicy::Group enables the group-commit coordinator so
concurrent flush() calls share one fdatasync. The default
OnEachFlush is the right choice for single-writer workloads or
when the application already batches durability.
TTL example
#
#
# Ok::
Nested example
#
#
# Ok::
Encryption
#
#
# Ok::
The cipher is creation-time-fixed and stored in the header — reopens
auto-dispatch. Wrong passphrase surfaces as
Error::EncryptionKeyMismatch from a verification block check, not
from a corrupted-data read. Three offline admin functions
(Emdb::enable_encryption, disable_encryption, rotate_encryption_key)
let you toggle encryption or rotate keys on an existing file via
atomic rewrite-then-rename, leaving an .encbak backup.
Goals
- Embedded-first — runs in-process; no separate server, no network.
- High performance — zero-copy reads, allocation-free hot paths, cache-friendly layout, batched writes amortise lock and syscall costs.
- Safe — strict clippy profile, no unwrap in library code, every unsafe block documented with its invariant.
- Small footprint — minimal dependency graph, fast compile times.
- Portable — Linux, macOS, Windows on x86_64 and ARM64.
Non-goals
- Client/server operation (use a dedicated DBMS for that).
- SQL.
- Distributed replication.
- Always-on sorted iteration (the primary index is hash-based; range scans are opt-in via EmdbBuilder::enable_range_scans(true), as described in the Range scans section).
Benchmarking
emdb ships Criterion benches. The comparative bench can include redb,
sled, optionally RocksDB, and optionally Redis.
- Core: benches/kv.rs
- Comparative: benches/comparative.rs
# Just emdb
cargo bench --bench kv --features ttl
# emdb vs sled vs redb
cargo bench --bench comparative --features ttl,bench-compare
# Add RocksDB
cargo bench --bench comparative --features ttl,bench-compare,bench-rocksdb
# Add Redis (set EMDB_REDIS_URL first)
$env:EMDB_REDIS_URL = "redis://127.0.0.1/"
cargo bench --bench comparative --features ttl,bench-compare,bench-redis
Full bench workflow and tuning notes: docs/BENCH.md.
Related projects
emdb is the Rust implementation. Implementations in other languages
(Go, C, etc.) are planned and will live under their own repositories.
License
Licensed under the Apache License, Version 2.0.
Copyright © 2026 James Gober.