# Write-Ahead Log (WAL) Format
This document describes the WAL format and architecture for AletheiaDB.
## Architecture Overview
AletheiaDB uses a **Concurrent WAL with Striped Lock-Free Ring Buffers** for high-throughput write operations while maintaining ACID compliance.
### Concurrent WAL Architecture
```
┌─────────────────────┐
│ LSN Allocator │
│ AtomicU64::fetch_add│
└──────────┬──────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Stripe 0 │ │ Stripe 1 │ │ Stripe N │
│ Ring Buffer │ │ Ring Buffer │ │ Ring Buffer │
│ (Lock-free) │ │ (Lock-free) │ │ (Lock-free) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
└───────────────────────┼───────────────────────┘
▼
┌─────────────────────┐
│ Flush Coordinator │
│ - Drains all stripes│
│ - Sorts by LSN │
│ - Writes to segment │
│ - fsync per mode │
└─────────────────────┘
```
**Key Design Principles:**
1. **Lock-free append path**: Multiple threads can append concurrently without mutex contention (see the sketch after this list)
2. **Global LSN ordering**: Single atomic counter ensures total ordering of all operations
3. **Sorted flush**: Entries are sorted by LSN before writing to disk
4. **Same segment format**: On-disk format is identical to sequential WAL
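To make the append path concrete, here is a minimal sketch in the spirit of these principles. The types are hypothetical and a `Mutex`-backed queue stands in for the real lock-free ring buffer; only the flow (allocate LSN → pick stripe → enqueue) is taken from the design above.
```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Hypothetical sketch: a Mutex-backed queue stands in for the real
// lock-free, bounded ring buffer; the flow is the point, not the structure.
struct PendingEntry {
    lsn: u64,
    payload: Vec<u8>,
}

struct ConcurrentWalSketch {
    next_lsn: AtomicU64,
    stripes: Vec<Mutex<VecDeque<PendingEntry>>>,
}

impl ConcurrentWalSketch {
    fn append(&self, payload: Vec<u8>) -> u64 {
        // 1. Allocate a globally ordered LSN (single atomic fetch_add).
        let lsn = self.next_lsn.fetch_add(1, Ordering::SeqCst);
        // 2. Pick a stripe; spreading by LSN avoids a hot stripe.
        let idx = (lsn as usize) % self.stripes.len();
        // 3. Enqueue; the flush coordinator later drains all stripes and
        //    re-sorts by LSN before writing to the segment file.
        self.stripes[idx]
            .lock()
            .unwrap()
            .push_back(PendingEntry { lsn, payload });
        lsn
    }
}
```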
### Performance Characteristics
| Metric | Target | Notes |
|---|---|---|
| Append latency (Async) | <100ns | Lock-free path |
| Throughput (GroupCommit) | 100K+ ops/sec | ACID-compliant |
| Throughput (Async) | 500K+ ops/sec | Eventual consistency |
| Concurrent writers | 64 | Linear scaling |
## ACID Compliance
The concurrent WAL maintains full ACID compliance for Synchronous and GroupCommit durability modes:
### Atomicity ✅
- All operations within a transaction are either fully persisted or not at all
- Flush coordinator writes entries atomically to segment files
- Recovery only replays complete transactions
### Consistency ✅
- LSN ordering ensures operations are applied in correct order
- Checksum verification detects any corruption
- Database invariants are preserved across crashes
### Isolation ✅
- Isolation handled by MVCC layer (not WAL)
- WAL only logs committed operations
- Snapshot isolation semantics unchanged
### Durability ✅ (Mode-Dependent)
| Mode | Durability | ACID-Compliant? |
|---|---|---|
| **Synchronous** | Immediate (fsync on every commit) | ✅ Yes |
| **GroupCommit** | Epoch-based (transactions wait for flush) | ✅ Yes |
| **Async** | Eventual (background flush) | ❌ No |
**Why GroupCommit is ACID-Compliant:**
```
Transaction Flow (GroupCommit):
1. Append operation to stripe buffer (fast, lock-free)
2. Register with epoch N
3. WAIT for epoch N to be flushed ← Blocks here
4. Background thread: drain stripes → sort by LSN → write → fsync
5. Background thread: mark_flushed(epoch N) → wake all waiters
6. Return to caller (data is now durable)
```
The transaction does not return success until the fsync completes, guaranteeing durability.
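A minimal sketch of how such an epoch wait can be built with a `Mutex`/`Condvar` pair. `EpochTracker` and `wait_for_flush` are illustrative names, not the actual coordinator API (which also handles timeouts and shutdown); only `mark_flushed` appears in the flow above.
```rust
use std::sync::{Condvar, Mutex};

// Illustrative epoch tracker for GroupCommit-style waiting.
struct EpochTracker {
    flushed_epoch: Mutex<u64>,
    flushed: Condvar,
}

impl EpochTracker {
    /// Committing transaction: block until `epoch` is durable (step 3).
    fn wait_for_flush(&self, epoch: u64) {
        let mut flushed = self.flushed_epoch.lock().unwrap();
        while *flushed < epoch {
            // The loop condition handles spurious wakeups.
            flushed = self.flushed.wait(flushed).unwrap();
        }
    }

    /// Background flush thread: called after fsync completes (step 5).
    fn mark_flushed(&self, epoch: u64) {
        let mut flushed = self.flushed_epoch.lock().unwrap();
        *flushed = (*flushed).max(epoch);
        self.flushed.notify_all();
    }
}
```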
## WAL Versioning
The WAL uses a versioned binary format to enable future evolution.
### Binary Format
**Segment Header (5 bytes):**
```
[magic: 4 bytes "GWAL"][version: 1 byte]
```
**Entry Format:**
```
[LSN: 8 bytes][timestamp: 8 bytes][checksum: 4 bytes][op_type: 1 byte][operation data...]
```
### Format (Version 1)
- Full serialization of properties (PropertyMap)
- Full serialization of bi-temporal intervals (32 bytes each)
- Labels serialized for all operation types
- CRC32 checksum verification (validation sketched below)
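As an illustration, a hedged sketch of validating a segment on read. Little-endian field encoding, the `crc32fast` crate, and the checksum covering the operation bytes are all assumptions; only the magic, version byte, and CRC32 check come from the format description above.
```rust
// Sketch of segment validation on read (assumptions noted in the lead-in).
const WAL_MAGIC: &[u8; 4] = b"GWAL";
const SUPPORTED_VERSION: u8 = 1;

fn check_segment_header(bytes: &[u8]) -> Result<u8, String> {
    if bytes.len() < 5 {
        return Err("segment too short for 5-byte header".into());
    }
    if bytes[0..4] != *WAL_MAGIC {
        return Err("bad magic: not a GWAL segment".into());
    }
    let version = bytes[4];
    if version > SUPPORTED_VERSION {
        return Err(format!("unsupported WAL version {version}"));
    }
    Ok(version)
}

fn verify_entry_checksum(operation_bytes: &[u8], stored: u32) -> bool {
    // Assumed: the 4-byte checksum field covers the operation bytes.
    crc32fast::hash(operation_bytes) == stored
}
```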
## WAL Recovery
This section describes the comprehensive recovery process for AletheiaDB, including checkpoint-based recovery, WAL replay, crash scenarios, and performance characteristics.
### Current Recovery Path (Important)
- Recovery is implemented in `src/storage/checkpoint.rs` via `CheckpointManager`.
- `StringInterner` is loaded from persisted indexes (`strings/interner.idx`) before WAL replay.
- WAL replay begins at `manifest.lsn + 1` when persisted state exists.
- Legacy mentions of `storage::persistence` / `PersistenceManager` are obsolete.
### Recovery Algorithm
The recovery process follows a **checkpoint-then-replay** strategy to minimize recovery time:
```
┌─────────────────────────────────────────────────────────────────┐
│ Database Startup Flow │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────┐
│ Find Latest │
│ Checkpoint │
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Checkpoint Exists? │
└──────────┬───────────┘
│
┌─────────────┼─────────────┐
│ Yes No │
▼ ▼
┌─────────────────────┐ ┌──────────────────────┐
│ Load Checkpoint │ │ Start with │
│ Metadata │ │ Empty Storage │
│ - LSN │ │ - LSN = 1 │
│ - Node/Edge counts │ └──────────┬───────────┘
│ - Vector config │ │
└─────────┬───────────┘ │
│ │
└────────────┬───────────────┘
▼
┌────────────────────────┐
│ Restore Vector Index │
│ Configuration │
│ (if enabled) │
└────────────┬───────────┘
▼
┌────────────────────────┐
│ Read WAL Entries │
│ from (checkpoint LSN+1)│
└────────────┬───────────┘
▼
┌────────────────────────┐
│ For each WAL entry: │
│ ───────────────────── │
│ 1. Verify checksum │
│ 2. Replay operation │
│ 3. Track max IDs │
│ 4. Update storage │
└────────────┬───────────┘
▼
┌────────────────────────┐
│ Initialize ID │
│ Generators │
│ - NodeId: max_id + 1 │
│ - EdgeId: max_id + 1 │
│ - VersionId: max_id+1 │
└────────────┬───────────┘
▼
┌────────────────────────┐
│ Database Ready │
└────────────────────────┘
```
#### Phase 1: Checkpoint Loading
**File Location:** `src/storage/checkpoint.rs` (`recover`)
Recovery loads the persisted index manifest when available and uses its LSN
as the WAL replay starting point.
**Checkpoint Files (Current):**
```
indexes/
├── manifest.idx # LSN and index registry
├── strings/interner.idx # Interned string table
├── graph/adjacency.idx # Current graph state
├── temporal/versions.idx # Historical versions
└── vector/* # Vector index files
```
**Recovery Point:**
- `CheckpointManager::recover()` loads persisted indexes
- WAL replay starts at `manifest.lsn + 1`
#### Phase 2: Vector Index Restoration
**File Location:** `src/storage/checkpoint.rs`
**CRITICAL:** Vector index configuration MUST be restored BEFORE WAL replay. This ensures that vectors are automatically indexed during node creation, maintaining index consistency.
```rust
// Recovery loads indexes first, then replays WAL from manifest LSN + 1.
let (current, historical, lsn) = manager.recover(&wal)?;
```
**Why This Matters:**
- If vector index is enabled during recovery, all replayed `CreateNode` operations automatically add vectors to the HNSW index
- Without this, vectors would be stored but not indexed, breaking similarity queries
- Recovery fails if vector index restoration fails (maintains data integrity)
#### Phase 3: WAL Replay
**File Location:** `src/storage/checkpoint.rs` (`replay_wal`)
The recovery process replays all WAL entries since the checkpoint LSN:
```rust
let start_lsn = persisted_lsn.map_or(LSN::initial(), |lsn| lsn.next());

// Track max IDs seen in replayed WAL entries
let mut max_node_id: Option<u64> = None;
let mut max_edge_id: Option<u64> = None;
let mut max_version_id: Option<u64> = None;
let mut next_version_id: u64 = 1;

// Replay WAL entries
let wal_entries = wal.read_from(start_lsn)?;
for entry in wal_entries {
    match entry.operation {
        WalOperation::CreateNode { node_id, .. } => {
            max_node_id = Some(max_node_id.map_or(node_id.as_u64(), |m| m.max(node_id.as_u64())));
            // ... apply operation to current + historical ...
        }
        _ => {
            // Similar handling for update/delete and edge operations...
        }
    }
}
```
**Replay Semantics:**
1. **CreateNode/CreateEdge:**
- Generate sequential version IDs during replay
- Intern label strings into global interner
- Insert into current storage (triggers vector indexing if enabled)
- Add version to historical storage with bi-temporal interval
- All operations use `TxId(0)` (recovery transaction ID)
2. **UpdateNode/UpdateEdge:**
- Close previous version's transaction_time in historical storage
- Create new version with update's version_id
- Update current storage (triggers vector re-indexing if embedding changed)
- Track version_id to ensure next operations don't conflict
3. **DeleteNode/DeleteEdge:**
- Close current version's transaction_time
- Create tombstone version with closed valid_time
- Remove from current storage (triggers vector index removal)
- Preserve label and properties in tombstone for historical queries
**Important:** Entries are already in LSN order on disk (sorted during flush), so replay happens in correct temporal order.
#### Phase 4: ID Generator Initialization
**File Location:** `src/storage/checkpoint.rs` (`replay_wal`)
After WAL replay completes, ID generators are initialized to prevent ID conflicts:
```rust
// Initialize ID generators with max_id + 1 (Issue #291)
// If no IDs were replayed for a type, keep existing generator state.
if let Some(max_node_id) = max_node_id {
    current.init_node_id_generator(max_node_id + 1);
}
if let Some(max_edge_id) = max_edge_id {
    current.init_edge_id_generator(max_edge_id + 1);
}
if let Some(max_version_id) = max_version_id {
    current.init_version_id_generator(max_version_id + 1);
}
```
**Why This Matters:**
- Prevents new operations from generating IDs that conflict with recovered entities
- Example: If recovery replays NodeId(100), the next new node will be NodeId(101) (see the sketch below)
- Without this, new nodes could overwrite recovered data
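A sketch of the assumed generator shape behind `init_node_id_generator` and friends. The internals here (an `AtomicU64`) are an assumption; the seed-at-`max_id + 1` semantics match the snippet above.
```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Assumed shape of an ID generator; internals are illustrative.
struct IdGenerator {
    next: AtomicU64,
}

impl IdGenerator {
    /// Recovery seeds the generator at (max replayed ID) + 1.
    fn init(&self, next: u64) {
        self.next.store(next, Ordering::SeqCst);
    }

    /// Normal operation: hand out fresh, monotonically increasing IDs.
    fn allocate(&self) -> u64 {
        self.next.fetch_add(1, Ordering::SeqCst)
    }
}
```
If recovery replays NodeId(100), `init(101)` makes the next `allocate()` return 101, so new entities never collide with recovered ones.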
### Recovery Correctness with Concurrent WAL
The concurrent WAL writes entries to disk **sorted by LSN**, which is identical to the sequential WAL behavior:
```
Concurrent writes: On-disk (after flush):
Thread 1: LSN 3 Entry 1: LSN 1
Thread 2: LSN 1 Entry 2: LSN 2 ← Sorted!
Thread 3: LSN 2 Entry 3: LSN 3
```
**Key Invariant:** Entries are always written to disk in LSN order, regardless of which stripe they originated from. This guarantees deterministic replay.
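Since replay correctness rests on this invariant, it is cheap to assert during replay; a small sketch:
```rust
/// Debug-time check that replayed entries arrive in strictly increasing
/// LSN order — the invariant the flush coordinator's sort guarantees.
fn assert_lsn_order(lsns: impl IntoIterator<Item = u64>) {
    let mut prev: Option<u64> = None;
    for lsn in lsns {
        if let Some(p) = prev {
            debug_assert!(lsn > p, "WAL entries out of LSN order: {p} then {lsn}");
        }
        prev = Some(lsn);
    }
}
```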
### Crash Scenarios and Recovery Behavior
The following table describes recovery behavior for different crash scenarios:
| Scenario | Durability Mode | Recovery Behavior | Data Loss |
|---|---|---|---|
| **Crash during write** | Synchronous | Recover to last committed transaction | None - fsync completed |
| | GroupCommit | Recover to last flushed epoch | None - epoch fsync completed |
| | Async | Recover to last background flush | **Possible** - unflushed commits lost |
| **Crash during checkpoint** | Any | Load previous checkpoint + replay WAL | None - checkpoints are atomic |
| **WAL segment corruption** | Any | Partial recovery to last good entry | **All entries after corruption** |
| **Missing checkpoint file** | Any | Replay entire WAL from beginning | None (slower recovery) |
| **Checkpoint + WAL corruption** | Any | Recovery fails; manual intervention needed | **Depends on extent of corruption** |
| **Power failure mid-fsync** | Synchronous/GroupCommit | OS guarantees atomicity; last entry may be incomplete | None, or last transaction only |
| **Disk full during WAL write** | Any | Recovery stops at last complete entry | Recent uncommitted writes |
| **Vector index config invalid** | Any | **Recovery fails** to maintain integrity | None - prevents inconsistent state |
#### Durability Guarantees by Mode
**Synchronous Mode:**
- **Guarantee:** Every committed transaction is durable (survives any crash)
- **Mechanism:** `fsync()` completes before returning to caller
- **Trade-off:** ~1.5ms latency per commit
**GroupCommit Mode:**
- **Guarantee:** Committed transactions are durable (ACID-compliant)
- **Mechanism:** Caller blocks until epoch is flushed and fsynced
- **Trade-off:** ~10-50ms latency (batched), 100K+/sec throughput
**Async Mode:**
- **Guarantee:** **NO durability guarantee** until background flush completes
- **Mechanism:** Background thread flushes every `flush_interval_ms`
- **Risk:** Crash before flush = **data loss** of unflushed commits
- **Trade-off:** <100ns latency, 500K+/sec throughput
**AsyncBatched Mode:**
- **Guarantee:** Same as Async (no durability until flush)
- **Mechanism:** Background batching + epoch tracking for metrics
- **Use Case:** High-throughput scenarios where some data loss is acceptable
#### Example: GroupCommit Recovery Guarantee
```
Transaction timeline (GroupCommit):
1. tx1.commit() - appends to stripe, registers with epoch 5
2. tx2.commit() - appends to stripe, registers with epoch 5
3. Background thread: drains stripes, sorts by LSN, writes, fsync()
4. Background thread: mark_flushed(epoch 5)
5. tx1.commit() returns success ← Data is durable NOW
6. tx2.commit() returns success ← Data is durable NOW

If a crash happens:
- Before step 3: tx1 and tx2 NOT recovered (never fsynced)
- After step 3 but before step 4: tx1 and tx2 RECOVERED (fsync completed)
- After step 5: tx1 and tx2 RECOVERED (committed)
```
**Key Point:** GroupCommit blocks the caller until fsync completes, guaranteeing durability. This is why it's ACID-compliant despite batching.
### Handling Corrupted Segments
If a segment fails checksum verification during recovery:
```rust
match read_segment(path) {
    Ok(entries) => replay_entries(entries),
    Err(Error::ChecksumMismatch { lsn, expected, actual }) => {
        // Log corruption details
        eprintln!("Segment corrupted at LSN {}", lsn);
        eprintln!("Expected checksum: {}, Actual: {}", expected, actual);
        // Attempt partial recovery up to the corrupted entry
        // (log the count before handing `recovered` to replay)
        let recovered = recover_until_corruption(path, lsn)?;
        eprintln!("Recovered {} entries before corruption", recovered.len());
        replay_entries(recovered);
        // Mark segment as corrupted for manual inspection
        mark_corrupted(path)?;
        // Recovery succeeds with partial data
    }
    Err(e) => return Err(e),
}
```
**Partial Recovery Behavior:**
- Entries before corruption are replayed normally
- Corrupted entry and all following entries in segment are discarded
- Database state is consistent up to last good entry
- Application must re-apply lost transactions from other sources (if available)
- Subsequent segments (if any) are NOT processed to avoid inconsistent state
**Best Practice:** Enable GroupCommit or Synchronous mode for production to minimize corruption risk.
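For reference, the snippet above calls a `recover_until_corruption(path, lsn)` helper; one plausible shape, written generically over a fallible entry iterator rather than the real path-based signature:
```rust
// Hypothetical sketch: collect entries until the first read/checksum
// error, then stop so nothing past the gap is replayed.
fn recover_until_corruption<E, Err>(
    entries: impl Iterator<Item = Result<E, Err>>,
) -> Vec<E> {
    let mut good = Vec::new();
    for entry in entries {
        match entry {
            Ok(e) => good.push(e),
            // The first failure ends recovery for this segment; the
            // corrupted entry and everything after it are discarded.
            Err(_) => break,
        }
    }
    good
}
```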
### API Reference
#### CheckpointManager::recover()
**File Location:** `src/storage/checkpoint.rs`
Recovers database state from persisted indexes plus WAL replay.
```rust
pub fn recover(
    &mut self,
    wal: &ConcurrentWalSystem,
) -> Result<(CurrentStorage, HistoricalStorage, LSN)>
```
**Example:**
```rust
use aletheiadb::storage::checkpoint::{CheckpointConfig, CheckpointManager};
use aletheiadb::storage::wal::concurrent_system::ConcurrentWalSystem;
let wal = ConcurrentWalSystem::new(wal_config)?;
let mut manager = CheckpointManager::new(CheckpointConfig::with_data_dir("data/mydb"))?;
let (current, historical, recovered_lsn) = manager.recover(&wal)?;
```
**Internal Behavior:**
1. Loads persisted string/graph/temporal/vector indexes (if present)
2. Reads checkpoint/manifest LSN
3. Replays WAL entries starting from `lsn + 1`
4. Returns recovered state and final LSN
### Performance Characteristics
Recovery performance depends on checkpoint frequency and WAL size:
```
Recovery Time = Checkpoint Load Time + WAL Replay Time
```
#### Recovery Time vs. Dataset Size
The following table shows typical recovery times for different dataset sizes with default checkpoint settings (checkpoint every 5 minutes or 1000 entries):
| Dataset Size | Checkpoint Load | Replay (1K entries) | Replay (10K entries) | Replay (100K entries) | Total (100K entries) |
|---|---|---|---|---|---|
| 1K nodes | ~1ms | ~5ms | ~50ms | ~500ms | ~500ms |
| 10K nodes | ~10ms | ~5ms | ~50ms | ~500ms | ~510ms |
| 100K nodes | ~100ms | ~5ms | ~50ms | ~500ms | ~600ms |
| 1M nodes | ~1s | ~5ms | ~50ms | ~500ms | ~1.5s |
| 10M nodes | ~10s | ~5ms | ~50ms | ~500ms | ~10.5s |
**Key Insights:**
1. **Checkpoint/index load** is typically much smaller than replay cost
2. **WAL replay time** scales with number of entries since last checkpoint (not total DB size)
3. **Worst case:** Crash immediately before checkpoint creation (max WAL replay)
4. **Best case:** Crash immediately after checkpoint (minimal WAL replay)
**Modern Full-State Path:**
For full-state persistence, prefer index-persistence checkpointing (`storage::checkpoint` +
`storage::index_persistence`) which persists string/graph/temporal/vector state to disk and
replays WAL from manifest LSN + 1.
#### Recovery Time Estimation
```rust
// Estimate recovery time for your workload.
// Rough model from the table above: checkpoint/index load ~1ms per 1K
// nodes, WAL replay ~5µs per entry (edge load is folded into the node term).
use std::time::Duration;

fn estimate_recovery_time(
    node_count: usize,
    _edge_count: usize,
    wal_entries_since_checkpoint: usize,
) -> Duration {
    // Checkpoint/index load (~0.001ms per node, per the table above)
    let checkpoint_load_ms = node_count as f64 * 0.001;
    // WAL replay (~5µs = 0.005ms per entry on average)
    let wal_replay_ms = wal_entries_since_checkpoint as f64 * 0.005;
    Duration::from_millis((checkpoint_load_ms + wal_replay_ms) as u64)
}

// Example: 1M nodes, 2M edges, 100K WAL entries since checkpoint
let recovery_time = estimate_recovery_time(1_000_000, 2_000_000, 100_000);
println!("Estimated recovery time: {:?}", recovery_time); // ~1.5s
```
#### Memory Usage During Recovery
Recovery memory usage is bounded by:
```
Memory = Current Storage + Historical Storage + WAL Replay Buffer
```
**Component Breakdown:**
| Component | Memory Cost | Notes |
|---|---|---|
| **Current Storage** | ~200 bytes/node, ~150 bytes/edge | Grows linearly during replay |
| **Historical Storage** | ~80 bytes/version | One version per create/update/delete |
| **WAL Replay Buffer** | ~1KB per entry | Temporary; released after replay |
| **Vector Index (if enabled)** | ~(4 × dimensions + 64) bytes/node | HNSW index overhead |
| **String Interner** | ~20 bytes/unique string | Shared across all entities |
**Example Memory Calculations:**
```rust
// Dataset: 1M nodes, 2M edges, 500K historical versions
// Vector index: 384 dimensions, enabled on 500K nodes
let node_memory = 1_000_000u64 * 200; // 200 MB
let edge_memory = 2_000_000u64 * 150; // 300 MB
let historical_memory = 500_000u64 * 80; // 40 MB
let vector_memory = 500_000u64 * (4 * 384 + 64); // 800 MB
let total_memory_mb = (node_memory + edge_memory + historical_memory + vector_memory) / 1_000_000;
println!("Total recovery memory: {} MB", total_memory_mb); // ~1340 MB
```
**Memory Optimization Tips:**
1. **Frequent checkpoints** reduce WAL replay memory (fewer entries buffered)
2. **Disable vector indexing** during recovery if not immediately needed (can rebuild later)
3. **Streaming replay** (future enhancement) will reduce replay buffer memory
4. **Tiered storage** (Issue #XYZ) moves cold historical data to disk, reducing RAM usage
#### Benchmark Results
The following benchmarks were measured on a standard development machine (16-core CPU, NVMe SSD):
**Recovery Throughput:**
| Phase | Throughput | Notes |
|---|---|---|
| Checkpoint load (metadata) | ~100K nodes/sec | Sequential read |
| WAL replay (nodes) | ~200K nodes/sec | CreateNode operations |
| WAL replay (edges) | ~300K edges/sec | CreateEdge operations |
| WAL replay (updates) | ~150K updates/sec | UpdateNode operations |
| Vector indexing during replay | ~50K vectors/sec | 384-dim HNSW insertion |
**Real-World Recovery Examples:**
```
Scenario 1: Small database, frequent checkpoints
- 10K nodes, 20K edges
- Checkpoint: 5 minutes old (200 WAL entries)
- Recovery time: ~15ms
- Memory usage: ~3 MB
Scenario 2: Medium database, standard checkpoints
- 100K nodes, 300K edges, vector index enabled
- Checkpoint: 5 minutes old (5K WAL entries)
- Recovery time: ~150ms
- Memory usage: ~85 MB
Scenario 3: Large database, worst-case scenario
- 1M nodes, 3M edges, vector index enabled
- No checkpoint (replay entire WAL: 1M entries)
- Recovery time: ~8 seconds
- Memory usage: ~1.5 GB
Scenario 4: Production database with optimal settings
- 5M nodes, 15M edges, vector index on 2M nodes
- Checkpoint: 2 minutes old (50K WAL entries)
- Recovery time: ~2.5 seconds
- Memory usage: ~4 GB
```
### Performance Tuning Recommendations
To optimize recovery performance:
1. **Checkpoint Frequency:**
```rust
// More frequent checkpoints = faster recovery (but more I/O overhead)
CheckpointConfig {
    checkpoint_interval: Duration::from_secs(120), // 2 minutes (vs default 5)
    min_wal_entries: 500,                          // lower threshold (vs default 1000)
    ..Default::default()
}
```
2. **Durability Mode:**
```rust
DurabilityMode::GroupCommit {
    max_delay_ms: 10,
    max_batch_size: 200,
}
```
3. **WAL Segment Size:**
```rust
ConcurrentWalSystemConfig {
    segment_size: 32 * 1024 * 1024,
    ..Default::default()
}
```
4. **Vector Index Configuration:**
```rust
HnswConfig::new(384, DistanceMetric::Cosine)
    .with_ef_construction(100)
    .with_capacity(expected_node_count)
```
### Known Limitations
1. **Transaction ID Loss:** Original transaction IDs are not preserved during recovery. All replayed operations use `TxId(0)`. This means temporal queries cannot distinguish which operations were part of the same transaction after a crash/restart. (Issue #86 tracks WAL format extension to preserve transaction IDs)
2. **No Incremental Checkpoints Yet:** The current checkpoint system persists full index snapshots rather than incremental diffs. For very large datasets, snapshot writes can take noticeable time.
3. **Single-Threaded Replay:** WAL replay is currently single-threaded. For databases with >100K WAL entries since checkpoint, this can be a bottleneck. Parallel replay (by partitioning entries) is planned for Phase 3.
4. **No Streaming Replay:** All WAL entries since checkpoint are loaded into memory before replay begins. For extremely large WAL files (>1M entries), this can consume significant memory. Streaming replay is planned for future enhancement.
See [docs/ARCHITECTURE.md](ARCHITECTURE.md) for long-term recovery optimization plans.
## Configuration
### ConcurrentWalSystem Configuration
```rust
use std::path::PathBuf;

use aletheiadb::storage::wal::concurrent_system::ConcurrentWalSystemConfig;
use aletheiadb::storage::wal::DurabilityMode;

let config = ConcurrentWalSystemConfig {
    // Directory for WAL segments
    wal_dir: PathBuf::from("data/wal"),
    // Number of stripes (should be a power of 2; default: 16)
    num_stripes: 16,
    // Ring buffer capacity per stripe (default: 1024)
    stripe_capacity: 1024,
    // Maximum segment size before rotation (default: 64 MB)
    segment_size: 64 * 1024 * 1024,
    // Number of segments to retain (default: 10)
    segments_to_retain: 10,
    // Background flush interval in ms (default: 10)
    flush_interval_ms: 10,
    // Durability mode
    durability_mode: DurabilityMode::GroupCommit {
        max_batch_size: 200,
        max_delay_ms: 10,
    },
    // Write buffer size for segment files (default: 64 KB)
    write_buffer_size: 64 * 1024,
};
let wal = ConcurrentWalSystem::new(config)?;
```
### Durability Modes
```rust
pub enum DurabilityMode {
    /// fsync on every commit - maximum durability
    /// Latency: ~1.5ms, Throughput: ~600/sec
    Synchronous,

    /// Batched fsync with epoch-based waiting - ACID-compliant
    /// Latency: ~10-50ms, Throughput: ~100K/sec
    GroupCommit {
        max_delay_ms: u64,     // default: 10
        max_batch_size: usize, // default: 200
    },

    /// Background fsync, commits return immediately - NOT ACID
    /// Latency: <100ns, Throughput: ~500K/sec
    Async {
        flush_interval_ms: u64, // default: 10
    },

    /// Like Async but with epoch tracking (for metrics)
    AsyncBatched {
        max_delay_ms: u64,
        max_batch_size: usize,
    },
}
```
### Production Recommendations
| Parameter | Recommended | Rationale |
|---|---|---|
| `num_stripes` | 16-32 | Match expected concurrent writers |
| `stripe_capacity` | 1024 | Balance memory vs backpressure |
| `segment_size` | 64-128 MB | Balance rotation overhead vs recovery time |
| `segments_to_retain` | 10-20 | Enough for recovery + debugging |
| `durability_mode` | `GroupCommit` | Best balance of ACID + performance |
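Put together, a production-leaning configuration following this table might look like the sketch below; field names mirror the configuration example earlier in this document, and the specific values (32 stripes, 128 MB segments) are one choice within the recommended ranges, not a prescription.
```rust
use std::path::PathBuf;
use aletheiadb::storage::wal::concurrent_system::{ConcurrentWalSystem, ConcurrentWalSystemConfig};
use aletheiadb::storage::wal::DurabilityMode;

// Production-leaning settings per the table above.
let config = ConcurrentWalSystemConfig {
    wal_dir: PathBuf::from("data/wal"),
    num_stripes: 32,                 // sized for ~32 concurrent writers
    stripe_capacity: 1024,
    segment_size: 128 * 1024 * 1024, // 128 MB segments
    segments_to_retain: 20,
    flush_interval_ms: 10,
    durability_mode: DurabilityMode::GroupCommit {
        max_batch_size: 200,
        max_delay_ms: 10,
    },
    write_buffer_size: 64 * 1024,
};
let wal = ConcurrentWalSystem::new(config)?;
```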
## Component Details
### LSN Allocator
The LSN allocator provides globally unique, monotonically increasing sequence numbers:
```rust
pub struct LsnAllocator {
    next_lsn: AtomicU64,
}

impl LsnAllocator {
    /// Allocate a single LSN (atomic operation)
    pub fn allocate(&self) -> LSN {
        LSN(self.next_lsn.fetch_add(1, Ordering::SeqCst))
    }

    /// Set the next LSN (used during recovery)
    pub fn set_next_lsn(&self, lsn: LSN) {
        self.next_lsn.store(lsn.0, Ordering::SeqCst);
    }
}
```
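A small, std-only usage example showing why a single `fetch_add` counter yields unique, gap-free sequence numbers even under contention:
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Four threads allocate 1000 LSNs each; fetch_add guarantees every call
// returns a distinct value, so the union is exactly 1..=4000.
let allocator = Arc::new(AtomicU64::new(1));
let handles: Vec<_> = (0..4)
    .map(|_| {
        let alloc = Arc::clone(&allocator);
        thread::spawn(move || {
            (0..1000)
                .map(|_| alloc.fetch_add(1, Ordering::SeqCst))
                .collect::<Vec<u64>>()
        })
    })
    .collect();
let mut all: Vec<u64> = handles
    .into_iter()
    .flat_map(|h| h.join().unwrap())
    .collect();
all.sort_unstable();
assert_eq!(all, (1..=4000).collect::<Vec<u64>>()); // unique and gap-free
```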
### Flush Coordinator
The flush coordinator manages segment files and coordinates flushing:
```rust
impl FlushCoordinator {
    /// Flush entries to disk
    pub fn flush(&self, mut entries: Vec<PendingEntry>, sync: bool) -> Result<FlushStats> {
        // 1. Sort by LSN to restore global order
        entries.sort_by_key(|e| e.lsn);
        // 2. Write to segment file
        for entry in &entries {
            self.write_entry(entry)?;
        }
        // 3. fsync if required
        if sync {
            self.sync()?;
        }
        // 4. Notify completion handles
        for entry in entries {
            if let Some(notifier) = entry.completion {
                notifier.complete(Ok(()));
            }
        }
        // (FlushStats construction elided in this excerpt)
        Ok(stats)
    }
}
}
```
## Debugging Tools
### Inspecting WAL Contents
```rust
use aletheiadb::storage::wal_reader::read_wal_entries;
// Print all entries
let entries = read_wal_entries(Path::new("data/wal"), LSN(1))?;
for entry in entries {
    println!("LSN {}: {:?}", entry.lsn.0, entry.operation);
}
```
### Checking WAL Metrics
```rust
let wal = ConcurrentWalSystem::new(config)?;
// After some operations...
println!("Total appends: {}", wal.total_appends());
println!("Total flushed: {}", wal.total_flushed());
println!("Current LSN: {:?}", wal.current_lsn());
```
## Adding New WAL Versions
When adding new serialization features:
1. Increment `WAL_VERSION` constant
2. Add version-aware serialization logic
3. Add version-aware deserialization logic
4. Update tests for new version
```rust
// In src/storage/wal.rs
const WAL_VERSION: u8 = 2; // Increment version

fn serialize_entry(entry: &WalEntry, version: u8) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GWAL");
    buf.push(version);
    // ... serialize fields, with version-specific logic
    buf
}
```
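The read side mirrors this: dispatch on the version byte so older segments remain readable after the format evolves. A minimal sketch (the version-2 arm is a placeholder; no version-2 fields are defined here):
```rust
// Read-side counterpart: dispatch on the version byte so older segments
// stay readable after the format evolves.
fn deserialize_header(buf: &[u8]) -> Result<u8, String> {
    if buf.len() < 5 || buf[0..4] != *b"GWAL" {
        return Err("not a GWAL segment".into());
    }
    match buf[4] {
        1 => Ok(1), // version 1: format documented above
        2 => Ok(2), // version 2: add version-specific parsing here
        v => Err(format!("unknown WAL version {v}")),
    }
}
```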
## References
- [ADR-0007: Write-Ahead Log for Durability](adr/0007-wal-durability.md)
- [ADR-0012: Configurable Durability Modes](adr/0012-configurable-durability-modes.md)
- [ADR-0020: Concurrent WAL Architecture](adr/0020-concurrent-wal-architecture.md)
- [Write-Ahead Logging on Wikipedia](https://en.wikipedia.org/wiki/Write-ahead_logging)
- [PostgreSQL WAL Documentation](https://www.postgresql.org/docs/current/wal-intro.html)
- [LMAX Disruptor](https://lmax-exchange.github.io/disruptor/) - Lock-free ring buffer design