wal-db 0.5.0

Write-ahead log primitive for Rust storage engines. Durable, recoverable, lock-free append path. The WAL substrate under lsm-db, txn-db, raft-io, and Hive DB.
  • Append-only durable log of arbitrary byte records
  • Lock-free multi-writer append — many threads append at once with no global lock
  • Group commit — concurrent sync calls coalesce into one fsync, amortising the durability cost
  • Segment rotation — optionally stripe the log across bounded segment files for bounded recovery and archival
  • Explicit durability barriers: append is in-memory-fast; sync is the durability point
  • Platform-correct flush: fdatasync on Linux, FlushFileBuffers on Windows, fcntl(F_FULLFSYNC) on macOS
  • Torn-write detection — a CRC32C checksum per record; recovery stops at the first damaged record
  • Self-healing recovery — a torn tail from a crash mid-append is truncated on open, leaving a clean boundary
  • Fuzz-hardened recovery: arbitrary bytes must never panic or over-allocate; a continuous cargo-fuzz harness checks this
  • Recovery policies — stop at the first damaged record, or skip past it for forensic partial recovery
  • LSN seeking & truncation — replay from any LSN (iter_from); drop everything after one (truncate_after) for compaction
  • Iterator-based replay — walk the log forward to rebuild state
  • Typed records (optional) — serialise any value via pack-io behind a feature; the byte-record API is unchanged when off
  • Pluggable storage backend — file-backed by default; injectable for in-memory testing and custom stores

The durability contract

Two operations, two distinct guarantees. Confusing them is the single most common way to lose data with a WAL, so wal-db keeps them explicit:

  • append returns when the record is in the operating system's page cache. A crash after append but before sync may lose that record.
  • sync returns only when every record appended before it is on stable storage and will survive a power loss.

That flush is not the same call on every platform, and getting it wrong is silent:

Platform   Durability call
Linux      fdatasync
Windows    FlushFileBuffers
macOS      fcntl(F_FULLFSYNC) (not plain fsync, which leaves data in the drive's write cache)
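
The two-step contract can be sketched with the standard library alone. This is not wal-db code: `append_then_sync` is a hypothetical helper, and it relies on `std::fs::File::sync_data` to pick the platform flush call. It only illustrates that the write and the durability barrier are separate steps.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends `record` to the file at `path`, then forces it to stable storage.
/// Hypothetical helper for illustration; not part of the wal-db API.
/// Returns the file length after the append.
fn append_then_sync(path: &Path, record: &[u8]) -> io::Result<u64> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // Step 1: the bytes are handed to the operating system. After this call
    // they may live only in the page cache, so a power loss can still lose them.
    file.write_all(record)?;

    // Step 2: the durability barrier. `sync_data` asks the OS to flush the
    // file's contents to the storage device before returning.
    file.sync_data()?;

    Ok(file.metadata()?.len())
}
```

Acknowledging a write to a client after step 1 but before step 2 is the mistake this section warns about.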

Installation

[dependencies]
wal-db = "0.5"

Quick Start

use wal_db::Wal;

# fn apply(_lsn: wal_db::Lsn, _bytes: &[u8]) -> Result<(), wal_db::WalError> { Ok(()) }
# fn main() -> Result<(), wal_db::WalError> {
// Open (or create) the log.
let wal = Wal::open("/var/lib/myapp/app.wal")?;

// Append returns once the record is in the OS page cache. It does not flush.
let lsn = wal.append(b"a state change")?;

// Sync is the durability barrier: it returns once the record is on stable storage.
wal.sync()?;

// On restart, replay the log from the start to rebuild state.
for entry in wal.iter()? {
    let entry = entry?;
    apply(entry.lsn(), entry.data())?;
}
# let _ = lsn;
# Ok(())
# }

Recovery

Every record carries a CRC32C checksum over its own bytes. On open, the log scans forward and stops at the first record that is incomplete or fails its checksum — a torn write left by a crash mid-append — and truncates that tail. The records before it are kept; the next append continues from a clean boundary with no gap in the sequence numbers. A corrupt length prefix can never trigger a wild allocation: lengths are validated against the configured maximum before a single payload byte is read.

use wal_db::Wal;

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
// After a crash, reopening the log truncates any torn tail automatically.
let wal = Wal::open(&path)?;

// Iteration yields a Result per record; a damaged record surfaces once, then iteration ends.
for entry in wal.iter()? {
    match entry {
        Ok(record) => { /* apply record.data() at record.lsn() */ }
        Err(e) => eprintln!("recovery stopped: {e}"),
    }
}
# Ok(())
# }
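
The scan described above can be sketched in plain Rust. The frame layout used here (a 4-byte little-endian length, a 4-byte CRC32C of the payload, then the payload) is an assumption made for illustration; the crate's actual layout is whatever docs/ON_DISK_FORMAT.md specifies. `frame` and `valid_prefix_len` are hypothetical names.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
/// Slow but dependency-free; real code would use a table or hardware CRC.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Assumed header size: 4-byte length + 4-byte checksum.
const HEADER: usize = 8;

fn read_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Encodes one record in the assumed layout.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Scans forward and returns the length of the longest valid prefix.
/// Everything past that offset is a torn or damaged tail to truncate.
fn valid_prefix_len(log: &[u8], max_record: usize) -> usize {
    let mut pos = 0;
    while log.len() - pos >= HEADER {
        let len = read_u32(&log[pos..]) as usize;
        let stored = read_u32(&log[pos + 4..]);
        // Validate the length before touching the payload, so a corrupt
        // prefix can never drive an oversized read or allocation.
        if len > max_record || len > log.len() - pos - HEADER {
            break;
        }
        let payload = &log[pos + HEADER..pos + HEADER + len];
        if crc32c(payload) != stored {
            break;
        }
        pos += HEADER + len;
    }
    pos
}
```

Truncating the file to the returned offset is what leaves the clean append boundary.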

Configuration

Tunables live on WalConfig, a builder passed to Wal::open_with:

use wal_db::{Wal, WalConfig};

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
let config = WalConfig::new().with_max_record_size(1024 * 1024); // cap records at 1 MiB
let wal = Wal::open_with(&path, config)?;
# let _ = wal;
# Ok(())
# }

Concurrency and group commit

Wal is built for many writers. append is lock-free: each call reserves its byte range with a single atomic step — that range's start offset is the record's LSN — then writes its record without blocking the others. Share one Wal behind an Arc and append from every thread.

Durability is where threads cooperate. When several call sync at once they coalesce into a single fsync — group commit — so the cost of making data durable is amortised across everyone committing together rather than paid N times. append_and_sync does an append and a group-commit-aware sync in one call:

use std::sync::Arc;
use std::thread;
use wal_db::{MemStore, Wal};

# fn main() -> Result<(), wal_db::WalError> {
let wal = Arc::new(Wal::with_store(MemStore::new())?);

let workers: Vec<_> = (0..4)
    .map(|t| {
        let wal = Arc::clone(&wal);
        thread::spawn(move || {
            for i in 0..100 {
                // Each thread appends and commits; the fsyncs coalesce.
                wal.append_and_sync(format!("worker {t} record {i}").as_bytes()).unwrap();
            }
        })
    })
    .collect();
for w in workers {
    w.join().unwrap();
}

assert_eq!(wal.iter()?.count(), 400);
# Ok(())
# }
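
One way such coalescing can work is a leader/follower handshake: the first thread to need a flush performs it on behalf of everything written so far, and threads that arrive meanwhile wait for its result. The sketch below is an assumption about the general technique, not wal-db's implementation; `GroupCommit`, `note_written`, and `sync_upto` are hypothetical names, and the `flush` closure stands in for the real fsync.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Condvar, Mutex};

/// State guarded by the mutex.
struct SyncState {
    durable_upto: u64, // every byte below this offset has been flushed
    flushing: bool,    // a leader is currently inside the flush call
}

/// Hypothetical group-commit coordinator for illustration only.
pub struct GroupCommit {
    written_upto: AtomicU64,
    state: Mutex<SyncState>,
    flush_done: Condvar,
    flushes: AtomicU64,
}

impl GroupCommit {
    pub fn new() -> Self {
        GroupCommit {
            written_upto: AtomicU64::new(0),
            state: Mutex::new(SyncState { durable_upto: 0, flushing: false }),
            flush_done: Condvar::new(),
            flushes: AtomicU64::new(0),
        }
    }

    /// Stand-in for an append: records that bytes below `end` were written.
    pub fn note_written(&self, end: u64) {
        self.written_upto.fetch_max(end, Ordering::AcqRel);
    }

    /// Number of flushes actually performed so far.
    pub fn flush_count(&self) -> u64 {
        self.flushes.load(Ordering::Relaxed)
    }

    /// Blocks until every byte below `target` is durable. Callers must pass
    /// an offset they have already reported through `note_written`.
    pub fn sync_upto(&self, target: u64, flush: impl Fn()) {
        let mut st = self.state.lock().unwrap();
        while st.durable_upto < target {
            if st.flushing {
                // Follower: a leader is already flushing; wait for its result.
                st = self.flush_done.wait(st).unwrap();
                continue;
            }
            // Leader: snapshot how far the log has been written, then flush
            // once for everyone whose records fall below that snapshot.
            st.flushing = true;
            let covers = self.written_upto.load(Ordering::Acquire);
            drop(st);
            flush();
            self.flushes.fetch_add(1, Ordering::Relaxed);
            st = self.state.lock().unwrap();
            st.durable_upto = st.durable_upto.max(covers);
            st.flushing = false;
            self.flush_done.notify_all();
        }
    }
}
```

A caller whose record was already covered by someone else's flush returns without flushing at all, which is where the amortisation comes from.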

LSNs are byte offsets. The LSN returned by append is the record's position in the log — monotonic and unique, but not consecutive. The first record is 0; the next sits at its end. This is what lets the append path reserve with a single atomic and never reorder. See docs/ON_DISK_FORMAT.md.
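
The single-atomic reservation can be sketched with `AtomicU64::fetch_add`. `Reserver` is a hypothetical type, and treating `frame_len` as the full on-disk size of a record (header included) is an assumption; the point is only that one atomic step hands each writer a disjoint byte range whose start doubles as the LSN.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical range reserver; illustrates the idea, not wal-db's internals.
pub struct Reserver {
    next: AtomicU64, // offset of the first unreserved byte in the log
}

impl Reserver {
    pub fn new() -> Self {
        Reserver { next: AtomicU64::new(0) }
    }

    /// Claims `frame_len` bytes and returns the start offset of the claimed
    /// range. `fetch_add` returns the previous value, so concurrent callers
    /// always receive distinct, non-overlapping ranges.
    pub fn reserve(&self, frame_len: u64) -> u64 {
        self.next.fetch_add(frame_len, Ordering::Relaxed)
    }
}
```

With frames of 24, 10, and 5 bytes reserved in that order, the returned offsets are 0, 24, and 34: monotonic and unique, but not consecutive.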

Custom backends

Wal::open uses the file-backed FileStore. Any type implementing the WalStore trait can stand in — an in-memory store for tests, or an alternative storage layer. The crate ships MemStore for the in-memory case:

use wal_db::{MemStore, Wal};

# fn main() -> Result<(), wal_db::WalError> {
let wal = Wal::with_store(MemStore::new())?;
let lsn = wal.append(b"no filesystem involved")?;
assert_eq!(lsn.get(), 0);
# Ok(())
# }
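
To make the idea of an injectable backend concrete, here is a sketch of what a storage abstraction of this kind can look like. The `ByteStore` trait, its methods, and `VecStore` are all hypothetical and deliberately named differently from the crate's types: the real `WalStore` trait may have different methods and signatures, so consult the crate documentation before implementing it.

```rust
use std::io;

/// Hypothetical storage interface for illustration only.
/// This is NOT wal-db's `WalStore` trait.
trait ByteStore {
    fn write_at(&mut self, offset: u64, bytes: &[u8]) -> io::Result<()>;
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;
    fn flush_durable(&mut self) -> io::Result<()>;
    fn size(&self) -> u64;
}

/// An in-memory backend: useful in tests because nothing touches disk.
struct VecStore {
    bytes: Vec<u8>,
}

impl VecStore {
    fn new() -> Self {
        VecStore { bytes: Vec::new() }
    }
}

impl ByteStore for VecStore {
    fn write_at(&mut self, offset: u64, bytes: &[u8]) -> io::Result<()> {
        let start = offset as usize;
        let end = start + bytes.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[start..end].copy_from_slice(bytes);
        Ok(())
    }

    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        let start = (offset as usize).min(self.bytes.len());
        let n = buf.len().min(self.bytes.len() - start);
        buf[..n].copy_from_slice(&self.bytes[start..start + n]);
        Ok(n)
    }

    fn flush_durable(&mut self) -> io::Result<()> {
        Ok(()) // memory has no durable medium to flush to
    }

    fn size(&self) -> u64 {
        self.bytes.len() as u64
    }
}

/// Code written against the trait works with any backend.
fn append_record<S: ByteStore>(store: &mut S, record: &[u8]) -> io::Result<u64> {
    let lsn = store.size();
    store.write_at(lsn, record)?;
    Ok(lsn)
}
```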

Segments

By default a log is a single file. For bounded recovery time and archival, stripe it across fixed-size segment files in a directory instead — Wal::open_segmented. The log stays one continuous byte stream; records span segment boundaries freely (the same scheme PostgreSQL uses), so nothing about the API or the records changes:

use wal_db::Wal;

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
// 16 MiB segments. Old, superseded segment files can be archived or pruned.
let wal = Wal::open_segmented(dir.path(), 16 * 1024 * 1024)?;
wal.append(b"striped across files")?;
wal.sync()?;
# Ok(())
# }
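
Because the log is one continuous byte stream, mapping a log offset to a segment is plain integer arithmetic. The sketch below shows that arithmetic and how a record that crosses a boundary splits into per-segment pieces. `locate` and `split_across_segments` are hypothetical helpers; how wal-db names or lays out its segment files is not assumed here.

```rust
/// Returns (segment index, offset within that segment) for a log byte offset.
fn locate(offset: u64, segment_size: u64) -> (u64, u64) {
    (offset / segment_size, offset % segment_size)
}

/// Splits the byte range [start, start + len) into per-segment pieces,
/// each given as (segment index, offset within segment, piece length).
fn split_across_segments(start: u64, len: u64, segment_size: u64) -> Vec<(u64, u64, u64)> {
    let mut pieces = Vec::new();
    let end = start + len;
    let mut pos = start;
    while pos < end {
        let (segment, within) = locate(pos, segment_size);
        // Take whatever fits in the rest of this segment, or the rest of
        // the record, whichever is smaller.
        let take = (segment_size - within).min(end - pos);
        pieces.push((segment, within, take));
        pos += take;
    }
    pieces
}
```

With 16 MiB segments, a 30-byte record that starts 10 bytes before the end of segment 0 yields two pieces: 10 bytes at the tail of segment 0 and 20 bytes at the head of segment 1.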

Typed records

By default a record is bytes. With the pack-io feature, a record can be any type that derives Serialize/Deserialize: append_typed writes it, Record::decode reads it back. The derives come from the re-exported wal_db::pack_io, so no extra dependency is needed.

[dependencies]
wal-db = { version = "0.5", features = ["pack-io"] }

use wal_db::{MemStore, Wal};
use wal_db::pack_io::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Event { id: u64, name: String }

# fn main() -> Result<(), wal_db::WalError> {
let wal = Wal::with_store(MemStore::new())?;
wal.append_typed(&Event { id: 1, name: "start".into() })?;

let event: Event = wal.iter()?.next().unwrap()?.decode()?;
assert_eq!(event, Event { id: 1, name: "start".into() });
# Ok(())
# }

Recovery policies

Wal::open always truncates a torn tail so the append boundary is clean. For corruption inside an already-recovered log — bit rot, say — a WalConfig recovery policy controls how iteration reacts:

use wal_db::{RecoveryPolicy, Wal, WalConfig};

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
// Default: stop at the first damaged record. Or skip past it for partial recovery:
let config = WalConfig::new().with_recovery_policy(RecoveryPolicy::SkipBadRecords);
let wal = Wal::open_with(&path, config)?;

for entry in wal.iter()? {
    match entry {
        Ok(record) => { /* use it */ }
        Err(e) => eprintln!("skipped a damaged record: {e}"), // iteration continues
    }
}
# Ok(())
# }
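
The difference between the two policies can be sketched over an in-memory byte buffer. As before, the frame layout (4-byte little-endian length, 4-byte CRC32C, payload) is an assumption for illustration, and `Entry`, `frame`, and `replay` are hypothetical names. Skipping relies on the length prefix to find the next record, so the sketch gives up when the prefix itself points past the end of the log.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

const HEADER: usize = 8; // assumed: 4-byte length + 4-byte checksum

fn read_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

#[derive(Debug, PartialEq)]
enum Entry {
    Record(Vec<u8>),
    Damaged { lsn: usize },
}

/// Replays `log`. With `skip_bad == false` the first damaged record ends
/// iteration; with `skip_bad == true` it is reported and then stepped over.
fn replay(log: &[u8], skip_bad: bool) -> Vec<Entry> {
    let mut out = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER {
        let len = read_u32(&log[pos..]) as usize;
        let stored = read_u32(&log[pos + 4..]);
        if len > log.len() - pos - HEADER {
            // The length prefix is unusable, so the next record cannot be
            // located under either policy.
            out.push(Entry::Damaged { lsn: pos });
            break;
        }
        let payload = &log[pos + HEADER..pos + HEADER + len];
        if crc32c(payload) == stored {
            out.push(Entry::Record(payload.to_vec()));
        } else {
            // Damage is always surfaced, never silently dropped.
            out.push(Entry::Damaged { lsn: pos });
            if !skip_bad {
                break;
            }
        }
        pos += HEADER + len;
    }
    out
}
```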

Seeking and compaction

An LSN is a byte offset, so seeking to a checkpoint is O(1): iter_from starts reading at that LSN instead of scanning from the beginning (the replay itself still costs time proportional to the records read). When a consumer has durably applied the log up to some point, truncate_after drops everything after that record and makes the truncation durable; this is the building block of compaction:

use wal_db::Wal;

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
let wal = Wal::open(&path)?;
let _ = wal.append(b"applied")?;
let checkpoint = wal.append(b"also applied")?;
let _ = wal.append(b"not yet applied")?;

// Replay only what came at or after the checkpoint.
for entry in wal.iter_from(checkpoint)? { let _ = entry?; }

// Or compact: keep up to the checkpoint, drop the rest (made durable).
wal.truncate_after(checkpoint)?;
# Ok(())
# }
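
Why byte-offset LSNs make both operations cheap can be seen in a toy in-memory log. This sketch uses a simplified, assumed layout (a 4-byte little-endian length followed by the payload, with no checksum), and its free functions are hypothetical helpers that merely echo the crate's method names; they are not the crate's API.

```rust
const LEN_PREFIX: usize = 4; // assumed header: payload length only

/// Appends a record and returns its LSN, the byte offset where it starts.
fn append(log: &mut Vec<u8>, payload: &[u8]) -> usize {
    let lsn = log.len();
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(payload);
    lsn
}

/// Reads the payload of the record starting at `lsn`, if one is fully present.
fn record_at(log: &[u8], lsn: usize) -> Option<&[u8]> {
    let head = log.get(lsn..lsn + LEN_PREFIX)?;
    let len = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    log.get(lsn + LEN_PREFIX..lsn + LEN_PREFIX + len)
}

/// Replays from `lsn` onward. Reaching the start position is a direct index
/// into the byte stream; no earlier record is read.
fn replay_from(log: &[u8], mut lsn: usize) -> Vec<&[u8]> {
    let mut out = Vec::new();
    while let Some(payload) = record_at(log, lsn) {
        out.push(payload);
        lsn += LEN_PREFIX + payload.len();
    }
    out
}

/// Keeps the record at `lsn` and everything before it; drops the rest.
fn truncate_after(log: &mut Vec<u8>, lsn: usize) {
    if let Some(len) = record_at(log.as_slice(), lsn).map(|p| p.len()) {
        log.truncate(lsn + LEN_PREFIX + len);
    }
}
```

A file-backed log would follow the truncation with a flush so that the new end of the log survives a crash.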

Async

The core is synchronous on purpose — a WAL's calls map to blocking syscalls (write, fsync), and a runtime is the consumer's choice, not the library's. From an async context, offload to a blocking pool:

let wal = wal.clone(); // Arc<Wal>
let lsn = tokio::task::spawn_blocking(move || wal.append_and_sync(b"record")).await??;

A note on RecoveryPolicy::SkipBadRecords (see Recovery policies): skipping is never silent. Each damaged record is still surfaced as an error, and skipping only works while a record's length prefix is intact enough to locate the next one.

Performance

Numbers from the criterion suite (cargo bench) on the development machine, with 256-byte records. They are honest measurements, not marketing — the group-commit figure in particular is bounded by this machine's fsync latency and scales with faster storage and more concurrent writers.

Benchmark                  Result                What it measures
append/single              ~107 ns               the lock-free hot path: framing one record into memory, no I/O
append/multi (8 writers)   ~3.6 M appends/s      many writers appending at once
commit/single              ~0.9 ms               one writer, append + fsync each time (unbatched durability)
commit/group (8 writers)   ~4× the single rate   concurrent append-and-sync, fsyncs coalesced by group commit

Run them yourself:

cargo bench --bench wal_bench

Examples

Example      Run                                            Shows
basic        cargo run --example basic                      the four-call API: open, append, sync, replay
recovery     cargo run --example recovery                   a simulated torn write and self-healing recovery
concurrent   cargo run --example concurrent                 many writers, one log, group commit
typed        cargo run --example typed --features pack-io   typed records via pack-io

Testing

cargo test --all-features                       # unit, integration, doc tests
cargo test --test torn_write                    # torn-write recovery property test
cargo test --test durability                    # durability across a real process restart
cargo test --test segmented                     # segment rotation and spanning records
RUSTFLAGS="--cfg loom" cargo test --test loom_wal  # model-checked concurrency
cargo +nightly fuzz run recover                 # fuzz the recovery path
cargo bench --bench wal_bench                    # append and commit throughput

The loom run model-checks the lock-free append and the group-commit handshake: it explores every meaningful thread interleaving and asserts no overlapping records, no reorder, and at most one fsync per syncer. The fuzz run feeds arbitrary bytes to the recovery path and checks that it never panics or over-allocates; fuzzing gives strong evidence rather than a proof.

Where It Fits

wal-db is the durability substrate. It is consumed by:

  • lsm-db — memtable durability
  • txn-db — transaction log
  • raft-io — Raft log persistence
  • Hive DB — primary write-ahead log

It stays foreign-compatible: usable standalone in any project that needs a durable append-only log.

Cross-Platform Support

Tier 1 Support:

  • Linux (x86_64, aarch64) — fdatasync
  • macOS (x86_64, Apple Silicon) — fcntl(F_FULLFSYNC) for true durability
  • Windows (x86_64) — FlushFileBuffers

Durability semantics are equivalent across platforms; the CI matrix runs the full suite — including the cross-process durability test — on each.

Contributing

Before opening a PR, cargo fmt --all, cargo clippy --all-targets --all-features -- -D warnings, and cargo test --all-features must be clean. Any change touching the durability path requires a torn-write recovery test and a benchmark.