# raft-wal
A minimal append-only WAL (Write-Ahead Log) optimized for Raft consensus.
General-purpose KV stores like sled or RocksDB carry unnecessary overhead for Raft log storage. raft-wal focuses on four operations: **append**, **range read**, **truncate**, and **metadata** — nothing else.
## Features
- **Fast** — ~210ns append (with HW-accelerated CRC32C), ~1ns get (O(1) via `VecDeque`)
- **Minimal dependencies** — only `crc32c` required; `tokio` and `openraft` are optional
- **Sync & async** — `RaftWal` and `AsyncRaftWal` share the same optimized core
- **Raft-correct durability** — metadata (term/vote) is always fsynced; log entries are buffered with opt-in `sync()`
- **Integrity** — every entry is protected by a CRC32C checksum; corrupted or partial writes are detected on recovery
- **Segment-based storage** — log is split into segment files (default 64 MB); `compact()` deletes old segments without rewriting
- **Parallel recovery** — segment files are read and CRC-verified in parallel across CPU cores
- **openraft integration** — optional `RaftLogStorage` trait implementation
- **Cross-platform** — Linux, macOS, Windows
## Usage
```toml
[dependencies]
raft-wal = "0.2"
# For async support:
# raft-wal = { version = "0.2", features = ["tokio"] }
# For openraft integration:
# raft-wal = { version = "0.2", features = ["openraft-storage"] }
```
```rust
use raft_wal::RaftWal;
let mut wal = RaftWal::open("./my-raft-data").unwrap();
// Append log entries
wal.append(1, b"entry-1").unwrap();
wal.append(2, b"entry-2").unwrap();
// Read entries
assert_eq!(wal.get(1), Some(b"entry-1".as_slice()));
let entries: Vec<_> = wal.iter_range(1..=2).collect();
assert_eq!(entries.len(), 2);
// Store Raft metadata (always fsynced)
wal.set_meta("term", b"5").unwrap();
wal.set_meta("vote", b"node-2").unwrap();
// Snapshot compaction — deletes old segment files
wal.compact(1).unwrap(); // discard index <= 1
// Conflict resolution
wal.truncate(2).unwrap(); // discard index >= 2
// Opt-in durable write
wal.append(3, b"entry-3").unwrap();
wal.sync().unwrap(); // fsync to disk
```
### Async
```rust
use raft_wal::AsyncRaftWal;
let mut wal = AsyncRaftWal::open("./my-raft-data").await.unwrap();
wal.append(1, b"entry-1").await.unwrap();
wal.set_meta("term", b"1").await.unwrap();
// Must call close() — tokio can't flush in Drop
wal.close().await.unwrap();
```
### openraft Integration
Enable `openraft-storage` to get `RaftLogStorage` + `RaftLogReader` implementations:
```toml
raft-wal = { version = "0.2", features = ["openraft-storage"] }
```
```rust,ignore
use raft_wal::OpenRaftLogStorage;
let storage = OpenRaftLogStorage::<MyTypeConfig>::open("./raft-data").await?;
```
`C::Entry`, `VoteOf<C>`, and `LogIdOf<C>` must implement `serde::Serialize + serde::Deserialize`.
## Durability
| `set_meta` / `remove_meta` | Always fsynced (Raft election safety) |
| `append` / `append_batch` | Buffered, no fsync |
| `sync()` | Flushes + fsyncs log entries |
| `flush()` | Flushes to OS without fsync |
**Metadata** (term, votedFor) must survive crashes per the Raft paper. `set_meta` writes to a temp file, fsyncs, then atomically renames.
**Log entries** are buffered for performance. Call `sync()` after append if your Raft implementation requires durable entries before acknowledging `AppendEntries`.
## Integrity
Each entry on disk is prefixed with a CRC32C checksum covering the index, payload length, and payload bytes. On recovery, entries with invalid checksums or incomplete writes are silently discarded from the tail — the WAL recovers up to the last good entry.
## Benchmarks
Measured on Linux with 128-byte entries:
| `append` | ~210 ns |
| `append_batch` (10 entries) | ~2.9 µs |
| `get` | ~1 ns |
| `read_range` (100 entries) | ~3.2 µs |
| `recovery` (10k entries, 1 segment) | ~1.2 ms |
| `recovery` (10k entries, multi-segment) | ~2.0 ms |
```sh
cargo bench
cargo bench --bench wal_async --features tokio
```
## Design
- **In-memory index**: `VecDeque<Vec<u8>>` with a base offset — O(1) append and lookup. All entries are held in memory; call `compact()` periodically after snapshots to free memory. Use `estimated_memory()` to monitor usage.
- **Segment files**: the log is split into segment files (`{index}.seg`). When the active segment exceeds `max_segment_size` (default 64 MB), it is sealed and a new segment begins. `compact()` deletes old segments with a file remove — no rewrite needed.
- **Entry format**: `[u32 crc32c LE][u64 index LE][u32 payload_len LE][payload]` — 16-byte header per entry
- **Buffered writes**: 64 KB `BufWriter` (sync) or userspace buffer (async) — syscalls only when the buffer fills
- **Parallel recovery**: segment files are read and CRC-verified concurrently using one thread per CPU core (`std::thread::scope`) or `tokio::spawn` (async)
- **Atomic metadata**: `set_meta` writes to a temp file, fsyncs, then renames — crash-safe
- **CRC32C**: hardware-accelerated via the `crc32c` crate (SSE4.2 on x86, ARM CRC on aarch64, software fallback elsewhere)
## Status
This crate is in early development and has not been battle-tested in production yet. It is planned for use in [Nyx Studio](https://github.com/NyxStudio-Labs) infrastructure. Use at your own risk.
## Contributing
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for commit conventions and guidelines.
## License
Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE) or [MIT License](LICENSE-MIT) at your option.