wal-db 1.0.0

Write-ahead log primitive for Rust storage engines. Durable, recoverable, lock-free append path. The WAL substrate under lsm-db, txn-db, raft-io, and Hive DB.
<h1 align="center">
    <img width="99" alt="Rust logo" src="https://raw.githubusercontent.com/jamesgober/rust-collection/72baabd71f00e14aa9184efcb16fa3deddda3a0a/assets/rust-logo.svg">
    <br>
    <b>wal-db</b>
    <br>
    <sub><sup>WRITE-AHEAD LOG PRIMITIVE</sup></sub>
</h1>

<div align="center">
    <a href="https://crates.io/crates/wal-db"><img alt="Crates.io" src="https://img.shields.io/crates/v/wal-db"></a>
    <a href="https://crates.io/crates/wal-db" alt="Download wal-db"><img alt="Crates.io Downloads" src="https://img.shields.io/crates/d/wal-db?color=%230099ff"></a>
    <a href="https://docs.rs/wal-db" title="wal-db Documentation"><img alt="docs.rs" src="https://img.shields.io/docsrs/wal-db"></a>
    <a href="https://github.com/jamesgober/wal-db/actions"><img alt="GitHub CI" src="https://github.com/jamesgober/wal-db/actions/workflows/ci.yml/badge.svg"></a>
    <a href="https://github.com/rust-lang/rfcs/blob/master/text/2495-min-rust-version.md" title="MSRV"><img alt="MSRV" src="https://img.shields.io/badge/MSRV-1.85%2B-blue"></a>
</div>

<br>

<div align="left">
    <p>
        <strong>wal-db</strong> is a <b>write-ahead log primitive</b> for Rust storage engines. It is the durability substrate underneath every database, transaction system, and distributed log in the portfolio: <code>lsm-db</code>, <code>txn-db</code>, <code>raft-io</code>, and Hive DB all build on it. The append path is <b>lock-free</b>, durability is <b>explicit</b> and <b>platform-correct</b> on Linux, macOS, and Windows, recovery from a torn write is <b>provable</b>, and concurrent commits <b>coalesce into a single fsync</b>.
    </p>
    <p>
        A WAL is the workhorse no database can avoid: every state change is appended to a durable log <em>before</em> it is acknowledged, and the log is the source of truth used to rebuild state after a crash. Most Rust databases ship their WAL privately inside the engine; <code>wal-db</code> publishes it as a clean, composable primitive so multiple storage engines (LSM, B-tree, document store) can share a single, well-tested implementation.
    </p>
    <p>
        The common case is four calls — <code>open</code>, <code>append</code>, <code>sync</code>, <code>iter</code>. The core is synchronous; async is left to the consumer, where it belongs.
    </p>
    <br>
    <hr>
    <p>
        <strong>MSRV is 1.85+</strong> (Rust 2024 edition). Lock-free append. Group commit. Explicit fsync. Crash-safe recovery.
    </p>
    <blockquote>
        <strong>Status: <code>1.0</code> — stable.</strong> The public API is frozen until <code>2.0</code> and the <a href="./docs/ON_DISK_FORMAT.md">on-disk format</a> is frozen for the 1.x line. Full feature set — lock-free append, group commit, segment rotation, suffix and prefix compaction — hardened with a fuzz harness, loom model checks, adversarial recovery tests, injected I/O-failure tests, and property tests, and measured against a hand-rolled WAL (<a href="./docs/BENCHMARKS.md">benchmarks</a>). See <a href="./CHANGELOG.md"><code>CHANGELOG.md</code></a> for detail.
    </blockquote>
</div>

<hr>
<br>

<h2>What it does</h2>

- **Append-only durable log** of arbitrary byte records
- **Lock-free multi-writer append** — many threads append at once with no global lock
- **Group commit** — concurrent `sync` calls coalesce into one fsync, amortising the durability cost
- **Segment rotation** — optionally stripe the log across bounded segment files for bounded recovery and archival
- **Explicit durability barriers** — `append` is in-memory-fast; `sync` is the durability point
- **Platform-correct flush** — `fdatasync` on Linux, `FlushFileBuffers` on Windows, `fcntl(F_FULLFSYNC)` on macOS
- **Torn-write detection** — a CRC32C checksum per record; recovery stops at the first damaged record
- **Self-healing recovery** — a torn tail from a crash mid-append is truncated on open, leaving a clean boundary
- **Fuzz-hardened recovery** — arbitrary bytes never panic or over-allocate; a continuous `cargo-fuzz` harness proves it
- **Recovery policies** — stop at the first damaged record, or skip past it for forensic partial recovery
- **LSN seeking & truncation** — replay from any LSN (`iter_from`); drop everything after one (`truncate_after`) or, on a segmented log, reclaim everything before one (`truncate_before`) for compaction
- **Iterator-based replay** — walk the log forward to rebuild state
- **Typed records (optional)** — serialise any value via `pack-io` behind a feature; the byte-record API is unchanged when off
- **Pluggable storage backend** — file-backed by default; injectable for in-memory testing and custom stores

<br>

## The durability contract

Two operations, two distinct guarantees. Confusing them is the single most common way to lose data with a WAL, so `wal-db` keeps them explicit:

- **`append`** returns when the record is in the operating system's page cache. A crash after `append` but before `sync` may lose that record.
- **`sync`** returns only when every record appended before it is on stable storage and will survive a power loss.

That flush is not the same call on every platform, and getting it wrong is silent:

| Platform | Durability call |
|----------|-----------------|
| Linux    | `fdatasync` |
| Windows  | `FlushFileBuffers` |
| macOS    | `fcntl(F_FULLFSYNC)` — **not** plain `fsync`, which leaves data in the drive's write cache |
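
To make the two-step contract concrete outside the crate, here is a minimal std-only sketch. It is not how `wal-db` is implemented; `open_log`, `append_record`, and `make_durable` are hypothetical helper names, and the sketch simply leans on the standard library's portable `File::sync_data` as the flush call.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Open a log file for appending, creating it if needed (hypothetical helper).
fn open_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Step 1: hand the bytes to the operating system. When this returns, the
/// data may exist only in the page cache, so a power loss can still lose it.
fn append_record(file: &mut File, record: &[u8]) -> io::Result<()> {
    file.write_all(record)
}

/// Step 2: the durability barrier. `sync_data` returns once the OS reports
/// that the file's contents have been flushed to the storage device.
fn make_durable(file: &File) -> io::Result<()> {
    file.sync_data()
}
```

A caller that acknowledges a write after `append_record` but before `make_durable` has acknowledged something a crash can still undo, which is exactly the mistake the contract above is meant to prevent.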

<br>

## Installation

```toml
[dependencies]
wal-db = "1.0"
```

<br>

## Quick Start

```rust
use wal_db::Wal;

# fn apply(_lsn: wal_db::Lsn, _bytes: &[u8]) -> Result<(), wal_db::WalError> { Ok(()) }
# fn main() -> Result<(), wal_db::WalError> {
// Open (or create) the log.
let wal = Wal::open("/var/lib/myapp/app.wal")?;

// Append returns once the record is in the OS page cache. It does not flush.
let lsn = wal.append(b"a state change")?;

// Sync is the durability barrier: it returns once the record is on stable storage.
wal.sync()?;

// On restart, replay the log from the start to rebuild state.
for entry in wal.iter()? {
    let entry = entry?;
    apply(entry.lsn(), entry.data())?;
}
# Ok(())
# }
```

<br>

## Recovery

Every record carries a CRC32C checksum over its own bytes. On `open`, the log scans forward and stops at the first record that is incomplete or fails its checksum — a torn write left by a crash mid-append — and truncates that tail. The records before it are kept; the next append continues from a clean boundary with no gap in the sequence numbers. A corrupt length prefix can never trigger a wild allocation: lengths are validated against the configured maximum before a single payload byte is read.

```rust
use wal_db::Wal;

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
// After a crash, reopening the log truncates any torn tail automatically.
let wal = Wal::open(&path)?;

// Iteration yields a Result per record; a damaged record surfaces once as an error, then iteration ends.
for entry in wal.iter()? {
    match entry {
        Ok(record) => { /* apply record.data() at record.lsn() */ }
        Err(e) => eprintln!("recovery stopped: {e}"),
    }
}
# Ok(())
# }
```
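
The scan described above can be sketched without the crate. The frame layout used here (a little-endian `u32` length, a `u32` CRC32C, then the payload) is an assumption made for illustration only; the real layout is specified in `docs/ON_DISK_FORMAT.md`. The names `frame` and `valid_prefix_len` are hypothetical, and the checksum is a plain bitwise CRC32C rather than an optimised one.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones when the low bit is set
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Assumed header size: 4 bytes of length plus 4 bytes of checksum.
const HEADER: usize = 8;

fn read_u32(log: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([log[at], log[at + 1], log[at + 2], log[at + 3]])
}

/// Encode one record in the assumed layout.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Scan forward and return the length of the valid prefix: the offset just
/// past the last record that is complete and passes its checksum.
fn valid_prefix_len(log: &[u8], max_record: usize) -> usize {
    let mut pos = 0;
    while log.len() - pos >= HEADER {
        let len = read_u32(log, pos) as usize;
        let sum = read_u32(log, pos + 4);
        // Validate the length before touching any payload bytes.
        if len > max_record || len > log.len() - pos - HEADER {
            break;
        }
        let payload = &log[pos + HEADER..pos + HEADER + len];
        if crc32c(payload) != sum {
            break;
        }
        pos += HEADER + len;
    }
    pos
}
```

Truncating the file to the returned length is what leaves the clean append boundary: everything before it is intact, and the torn tail is gone.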

<br>

## Configuration

Tunables live on `WalConfig`, a builder passed to `Wal::open_with`:

```rust
use wal_db::{Wal, WalConfig};

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
let config = WalConfig::new().with_max_record_size(1024 * 1024); // cap records at 1 MiB
let wal = Wal::open_with(&path, config)?;
# let _ = wal;
# Ok(())
# }
```

<br>

## Concurrency and group commit

`Wal` is built for many writers. `append` is lock-free: each call reserves its byte range with a single atomic step — that range's start offset *is* the record's LSN — then writes its record without blocking the others. Share one `Wal` behind an `Arc` and append from every thread.

Durability is where threads cooperate. When several call `sync` at once they coalesce into a single fsync — **group commit** — so the cost of making data durable is amortised across everyone committing together rather than paid N times. `append_and_sync` does an append and a group-commit-aware sync in one call:

```rust
use std::sync::Arc;
use std::thread;
use wal_db::{MemStore, Wal};

# fn main() -> Result<(), wal_db::WalError> {
let wal = Arc::new(Wal::with_store(MemStore::new())?);

let workers: Vec<_> = (0..4)
    .map(|t| {
        let wal = Arc::clone(&wal);
        thread::spawn(move || {
            for i in 0..100 {
                // Each thread appends and commits; the fsyncs coalesce.
                wal.append_and_sync(format!("worker {t} record {i}").as_bytes()).unwrap();
            }
        })
    })
    .collect();
for w in workers {
    w.join().unwrap();
}

assert_eq!(wal.iter()?.count(), 400);
# Ok(())
# }
```
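
For intuition about how concurrent syncs can collapse into one flush, here is a simplified sketch. It is not the crate's group-commit handshake. `GroupCommit`, `ticket`, and `commit` are hypothetical names, the `flush` closure stands in for the real fsync, and a plain `Mutex` serialises the committers.

```rust
use std::io;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical group-commit coordinator.
struct GroupCommit {
    issued: AtomicU64,   // commit tickets handed out so far
    durable: Mutex<u64>, // highest ticket covered by a completed flush
    flushes: AtomicU64,  // number of flushes issued, kept for demonstration
}

impl GroupCommit {
    fn new() -> Self {
        GroupCommit {
            issued: AtomicU64::new(0),
            durable: Mutex::new(0),
            flushes: AtomicU64::new(0),
        }
    }

    /// Take a commit ticket. Call this only after your own write has completed.
    fn ticket(&self) -> u64 {
        self.issued.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Return once a flush covering `ticket` has completed.
    fn commit<F>(&self, ticket: u64, flush: F) -> io::Result<()>
    where
        F: FnOnce() -> io::Result<()>,
    {
        let mut durable = self.durable.lock().unwrap();
        if *durable >= ticket {
            return Ok(()); // an earlier flush already covered this caller
        }
        // Every ticket issued before this point belongs to a finished write,
        // so the flush below covers all of them.
        let covered = self.issued.load(Ordering::SeqCst);
        flush()?;
        self.flushes.fetch_add(1, Ordering::Relaxed);
        *durable = covered;
        Ok(())
    }

    fn flush_count(&self) -> u64 {
        self.flushes.load(Ordering::Relaxed)
    }
}
```

Committers that queue on the mutex while a flush is in flight find their ticket already covered when they acquire it and return without flushing again. That is the amortisation: three tickets taken before the first `commit` cost one flush, not three.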

> **LSNs are byte offsets.** The LSN returned by `append` is the record's position in the log — monotonic and unique, but not consecutive. The first record is `0`; the next sits at its end. This is what lets the append path reserve with a single atomic and never reorder. See [`docs/ON_DISK_FORMAT.md`](./docs/ON_DISK_FORMAT.md).
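
The reservation step can be illustrated with a single atomic counter. This is a sketch of the idea only; `Tail` and `reserve` are hypothetical names, and the lengths passed in are assumed to be the full on-disk size of each record.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical tail pointer: the next free byte offset in the log.
struct Tail(AtomicU64);

impl Tail {
    fn new() -> Self {
        Tail(AtomicU64::new(0))
    }

    /// Reserve `len` bytes. The returned start offset doubles as the record's
    /// LSN; no other caller can receive an overlapping range.
    fn reserve(&self, len: u64) -> u64 {
        self.0.fetch_add(len, Ordering::Relaxed)
    }
}
```

Because `fetch_add` returns the previous value, a first reservation of 14 bytes yields LSN `0` and the next caller starts at `14`: unique and monotonic, but not consecutive.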

<br>

## Custom backends

`Wal::open` uses the file-backed `FileStore`. Any type implementing the `WalStore` trait can stand in — an in-memory store for tests, or an alternative storage layer. The crate ships `MemStore` for the in-memory case:

```rust
use wal_db::{MemStore, Wal};

# fn main() -> Result<(), wal_db::WalError> {
let wal = Wal::with_store(MemStore::new())?;
let lsn = wal.append(b"no filesystem involved")?;
assert_eq!(lsn.get(), 0);
# Ok(())
# }
```
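
To show the general shape of a pluggable backend, here is a hypothetical positional-store interface with an in-memory implementation. It is explicitly not the crate's `WalStore` trait, whose methods are not reproduced in this README; `ByteStore` and `VecStore` are invented for this sketch.

```rust
use std::io;
use std::sync::Mutex;

/// Hypothetical positional-store interface (NOT the crate's `WalStore`).
trait ByteStore {
    fn write_at(&self, offset: u64, bytes: &[u8]) -> io::Result<()>;
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;
    fn flush(&self) -> io::Result<()>;
}

/// In-memory backend: a growable byte vector behind a mutex.
struct VecStore {
    bytes: Mutex<Vec<u8>>,
}

impl VecStore {
    fn new() -> Self {
        VecStore { bytes: Mutex::new(Vec::new()) }
    }
}

impl ByteStore for VecStore {
    fn write_at(&self, offset: u64, bytes: &[u8]) -> io::Result<()> {
        let mut data = self.bytes.lock().unwrap();
        let start = offset as usize;
        let end = start + bytes.len();
        if data.len() < end {
            data.resize(end, 0); // grow, zero-filling any gap
        }
        data[start..end].copy_from_slice(bytes);
        Ok(())
    }

    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        let data = self.bytes.lock().unwrap();
        let start = (offset as usize).min(data.len());
        let n = buf.len().min(data.len() - start);
        buf[..n].copy_from_slice(&data[start..start + n]);
        Ok(n) // 0 means the offset is at or past the end
    }

    fn flush(&self) -> io::Result<()> {
        Ok(()) // nothing to flush: memory has no stable storage
    }
}
```

Positional writes are the design point worth noting: a backend that accepts an explicit offset lets writers with disjoint reserved ranges proceed independently, where a single shared cursor would force them to take turns.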

<br>

## Segments

By default a log is a single file. For bounded recovery time and archival, stripe it across fixed-size segment files in a directory instead — `Wal::open_segmented`. The log stays one continuous byte stream; records span segment boundaries freely (the same scheme PostgreSQL uses), so nothing about the API or the records changes:

```rust
use wal_db::Wal;

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
// 16 MiB segments. Old, superseded segment files can be archived or pruned.
let wal = Wal::open_segmented(dir.path(), 16 * 1024 * 1024)?;
wal.append(b"striped across files")?;
wal.sync()?;
# Ok(())
# }
```
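
Treating the log as one continuous byte stream means a segment is just arithmetic on the offset. The sketch below illustrates that mapping under the assumption that segments are numbered from zero and all have the same non-zero size; `locate` and `split_write` are hypothetical names, not crate functions.

```rust
/// Map a global byte offset to (segment index, offset inside that segment).
/// Assumes `segment_size` is non-zero.
fn locate(offset: u64, segment_size: u64) -> (u64, u64) {
    (offset / segment_size, offset % segment_size)
}

/// Split a write of `len` bytes starting at `offset` into per-segment pieces:
/// (segment index, offset inside the segment, piece length).
fn split_write(offset: u64, len: u64, segment_size: u64) -> Vec<(u64, u64, u64)> {
    let mut pieces = Vec::new();
    let mut pos = offset;
    let end = offset + len;
    while pos < end {
        let (segment, within) = locate(pos, segment_size);
        let room = segment_size - within; // bytes left in this segment
        let take = room.min(end - pos);
        pieces.push((segment, within, take));
        pos += take;
    }
    pieces
}
```

With 16-byte segments, a 10-byte record at offset 10 splits into 6 bytes at the end of segment 0 and 4 bytes at the start of segment 1, which is what "records span segment boundaries freely" means in practice.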

<br>

## Typed records

By default a record is bytes. With the `pack-io` feature, a record can be any type that derives `Serialize`/`Deserialize` — `append_typed` writes it, `Record::decode` reads it back. The derives come from the re-exported `wal_db::pack_io`, so no extra dependency is needed.

```toml
[dependencies]
wal-db = { version = "1.0", features = ["pack-io"] }
```

```rust
use wal_db::{MemStore, Wal};
use wal_db::pack_io::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Event { id: u64, name: String }

# fn main() -> Result<(), wal_db::WalError> {
let wal = Wal::with_store(MemStore::new())?;
wal.append_typed(&Event { id: 1, name: "start".into() })?;

let event: Event = wal.iter()?.next().unwrap()?.decode()?;
assert_eq!(event, Event { id: 1, name: "start".into() });
# Ok(())
# }
```

<br>

## Recovery policies

`Wal::open` always truncates a torn tail so the append boundary is clean. For corruption *inside* an already-recovered log — bit rot, say — a `WalConfig` recovery policy controls how iteration reacts:

```rust
use wal_db::{RecoveryPolicy, Wal, WalConfig};

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
// Default: stop at the first damaged record. Or skip past it for partial recovery:
let config = WalConfig::new().with_recovery_policy(RecoveryPolicy::SkipBadRecords);
let wal = Wal::open_with(&path, config)?;

for entry in wal.iter()? {
    match entry {
        Ok(record) => { /* use it */ }
        Err(e) => eprintln!("skipped a damaged record: {e}"), // iteration continues
    }
}
# Ok(())
# }
```

<br>

## Seeking and compaction

An LSN is a byte offset, so seeking to a checkpoint is O(1): `iter_from` starts at the LSN instead of scanning from the beginning. `truncate_after` drops everything *after* a record (rolling back a tail, the way a Raft log does on a conflict). On a segmented log, `truncate_before` reclaims everything *before* a record (prefix compaction, once that prefix has been applied and flushed elsewhere). Both truncations preserve the LSNs of surviving records:

```rust
use wal_db::Wal;

# fn main() -> Result<(), wal_db::WalError> {
# let dir = tempfile::tempdir().map_err(wal_db::WalError::from)?;
# let path = dir.path().join("app.wal");
let wal = Wal::open(&path)?;
let _ = wal.append(b"applied")?;
let checkpoint = wal.append(b"also applied")?;
let _ = wal.append(b"not yet applied")?;

// Replay only what came at or after the checkpoint.
for entry in wal.iter_from(checkpoint)? { let _ = entry?; }

// Or roll back the tail: keep everything up to and including the checkpoint record,
// drop what follows. The truncation itself is made durable.
wal.truncate_after(checkpoint)?;
# Ok(())
# }
```
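
Prefix compaction on a segmented log can be reasoned about with the same offset arithmetic. The sketch below assumes reclamation happens at whole-segment granularity with equally sized, non-empty segments; that is an assumption for illustration, not a statement of how `truncate_before` is implemented. `PrefixPlan` and `plan_prefix_compaction` are hypothetical names.

```rust
/// Outcome of planning a prefix compaction (hypothetical type).
#[derive(Debug, PartialEq)]
struct PrefixPlan {
    /// How many leading segments lie entirely before the LSN.
    reclaimable_segments: u64,
    /// First byte offset that must be kept; offsets are never renumbered.
    first_retained_offset: u64,
}

/// Segment `k` covers bytes `[k * size, (k + 1) * size)`, so it can go only
/// when `(k + 1) * size <= lsn`. Assumes `segment_size` is non-zero.
fn plan_prefix_compaction(lsn: u64, segment_size: u64) -> PrefixPlan {
    let reclaimable_segments = lsn / segment_size;
    PrefixPlan {
        reclaimable_segments,
        first_retained_offset: reclaimable_segments * segment_size,
    }
}
```

Since surviving records keep their byte offsets, a checkpoint LSN saved before compaction still points at the same record afterwards.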

<br>

## Async

The core is synchronous on purpose — a WAL's calls map to blocking syscalls (`write`, `fsync`), and a runtime is the consumer's choice, not the library's. From an async context, offload to a blocking pool:

```rust,ignore
let wal = wal.clone(); // Arc<Wal>
let lsn = tokio::task::spawn_blocking(move || wal.append_and_sync(b"record")).await??;
```

<br>

## Performance

Numbers from the criterion suite on the development machine, 256-byte records. They are honest measurements, not marketing — the commit figures are bounded by this machine's fsync latency and scale with faster storage and more writers. Full detail and method in [`docs/BENCHMARKS.md`](./docs/BENCHMARKS.md).

| Benchmark | Result | What it measures |
|-----------|--------|------------------|
| LSN reservation | ~4 ns | the single atomic that allocates an LSN and reserves a byte range |
| `append/single` | ~105 ns | the lock-free hot path: reserve, frame, write one record to memory, no syscall |
| `append/multi` (8, file) | ~160 K/s | file-backed multi-writer append — syscall-bound (one `pwrite` each) |
| `commit/group` (8 writers) | **~1.9× a hand-rolled inline WAL** | concurrent append-and-sync; group commit coalesces the fsyncs |
| `recovery/replay` (10k) | ~215 K records/s | reopen and replay a file-backed log |

A file-backed append is syscall-bound, not lock-bound — the `pwrite` the durability contract requires dominates the negligible commit-watermark lock — so the throughput lever is **group commit**, which beats the inline WAL an engine hand-rolls before it has batching. Run them yourself:

```bash
cargo bench --bench wal_bench   # append, commit, recovery, reservation
cargo bench --bench compare     # wal-db vs a hand-rolled inline WAL
```

<br>

## Examples

| Example | Run | Shows |
|---------|-----|-------|
| [`basic`](./examples/basic.rs) | `cargo run --example basic` | the four-call API: open, append, sync, replay |
| [`recovery`](./examples/recovery.rs) | `cargo run --example recovery` | a simulated torn write and self-healing recovery |
| [`concurrent`](./examples/concurrent.rs) | `cargo run --example concurrent` | many writers, one log, group commit |
| [`checkpoint`](./examples/checkpoint.rs) | `cargo run --example checkpoint` | replay from a checkpoint (`iter_from`) and truncate back to one (`truncate_after`) |
| [`typed`](./examples/typed.rs) | `cargo run --example typed --features pack-io` | typed records via `pack-io` |

<br>

## Testing

```bash
cargo test --all-features                       # unit, integration, doc tests
cargo test --test torn_write                    # torn-write recovery property test
cargo test --test durability                    # durability across a real process restart
cargo test --test segmented                     # segment rotation and spanning records
RUSTFLAGS="--cfg loom" cargo test --test loom_wal  # model-checked concurrency
cargo +nightly fuzz run recover                 # fuzz the recovery path
cargo bench --bench wal_bench                    # append and commit throughput
```

The `loom` run model-checks the lock-free append and the group-commit handshake: it explores every meaningful thread interleaving and asserts no overlapping records, no reorder, and at most one fsync per syncer. The `fuzz` run feeds arbitrary bytes to the recovery path and proves it never panics or over-allocates.

<hr>
<br>

## Where It Fits

`wal-db` is the durability substrate. It is consumed by:
- [`lsm-db`](https://github.com/jamesgober/lsm-db) — memtable durability
- [`txn-db`](https://github.com/jamesgober/txn-db) — transaction log
- [`raft-io`](https://github.com/jamesgober/raft-io) — Raft log persistence
- Hive DB — primary write-ahead log

It is equally usable standalone, in any project that needs a durable append-only log.

<br>

## Cross-Platform Support

**Tier 1 Support:**
- Linux (x86_64, aarch64) — `fdatasync`
- macOS (x86_64, Apple Silicon) — `fcntl(F_FULLFSYNC)` for true durability
- Windows (x86_64) — `FlushFileBuffers`

Durability semantics are equivalent across platforms; the CI matrix runs the full suite — including the cross-process durability test — on each.

<br>

## Contributing

Before opening a PR, `cargo fmt --all`, `cargo clippy --all-targets --all-features -- -D warnings`, and `cargo test --all-features` must be clean. Any change touching the durability path requires a torn-write recovery test and a benchmark.

<br>

<div id="license">
    <h2>License</h2>
    <p>Licensed under either of</p>
    <ul>
        <li><b>Apache License, Version 2.0</b> — see <a href="./LICENSE-APACHE">LICENSE-APACHE</a></li>
        <li><b>MIT License</b> — see <a href="./LICENSE-MIT">LICENSE-MIT</a></li>
    </ul>
    <p>at your option.</p>
</div>

<div align="center">
  <h2></h2>
  <sup>COPYRIGHT <small>&copy;</small> 2026 <strong>JAMES GOBER.</strong></sup>
</div>