aletheiadb 0.1.0

# Write-Ahead Log (WAL) Format


This document describes the WAL format and architecture for AletheiaDB.

## Architecture Overview


AletheiaDB uses a **Concurrent WAL with Striped Lock-Free Ring Buffers** for high-throughput write operations while maintaining ACID compliance.

### Concurrent WAL Architecture


```
                    ┌─────────────────────┐
                    │    LSN Allocator    │
                    │ AtomicU64::fetch_add│
                    └──────────┬──────────┘
                               │
       ┌───────────────────────┼───────────────────────┐
       ▼                       ▼                       ▼
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│   Stripe 0  │         │   Stripe 1  │         │  Stripe N   │
│ Ring Buffer │         │ Ring Buffer │         │ Ring Buffer │
│ (Lock-free) │         │ (Lock-free) │         │ (Lock-free) │
└──────┬──────┘         └──────┬──────┘         └──────┬──────┘
       └───────────────────────┼───────────────────────┘
                               ▼
                    ┌─────────────────────┐
                    │  Flush Coordinator  │
                    │ - Drains all stripes│
                    │ - Sorts by LSN      │
                    │ - Writes to segment │
                    │ - fsync per mode    │
                    └─────────────────────┘
```

**Key Design Principles:**
1. **Lock-free append path**: Multiple threads can append concurrently without mutex contention
2. **Global LSN ordering**: Single atomic counter ensures total ordering of all operations
3. **Sorted flush**: Entries are sorted by LSN before writing to disk
4. **Same segment format**: On-disk format is identical to sequential WAL
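
The four principles above can be sketched end to end with the standard library alone. This is a minimal illustration, not AletheiaDB's implementation: `ToyWal`, `Entry`, `run_writers`, and the `stripe_hint` parameter are hypothetical names, and a `Mutex<Vec<_>>` per stripe stands in for the lock-free ring buffer so the example stays short.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical in-memory entry: an LSN plus an opaque payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub lsn: u64,
    pub payload: String,
}

/// Toy WAL: one atomic LSN counter and N independently locked stripes.
/// ASSUMPTION: a Mutex<Vec<_>> replaces the real lock-free ring buffer.
pub struct ToyWal {
    next_lsn: AtomicU64,
    stripes: Vec<Mutex<Vec<Entry>>>,
}

impl ToyWal {
    pub fn new(num_stripes: usize) -> Self {
        ToyWal {
            next_lsn: AtomicU64::new(1),
            stripes: (0..num_stripes).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    /// Allocate a global LSN, then append to the stripe picked by `stripe_hint`.
    pub fn append(&self, stripe_hint: usize, payload: &str) -> u64 {
        let lsn = self.next_lsn.fetch_add(1, Ordering::SeqCst);
        let stripe = &self.stripes[stripe_hint % self.stripes.len()];
        stripe.lock().unwrap().push(Entry { lsn, payload: payload.to_string() });
        lsn
    }

    /// Drain every stripe and restore the global order by sorting on LSN.
    pub fn drain_sorted(&self) -> Vec<Entry> {
        let mut all = Vec::new();
        for stripe in &self.stripes {
            let mut guard = stripe.lock().unwrap();
            all.append(&mut guard);
        }
        all.sort_by_key(|e| e.lsn);
        all
    }
}

/// Run `writers` threads that each append `per_writer` entries.
pub fn run_writers(wal: Arc<ToyWal>, writers: usize, per_writer: usize) {
    let handles: Vec<_> = (0..writers)
        .map(|w| {
            let wal = Arc::clone(&wal);
            thread::spawn(move || {
                for i in 0..per_writer {
                    wal.append(w, &format!("w{w}-op{i}"));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```

Whatever the thread interleaving, `drain_sorted` returns entries with contiguous, strictly increasing LSNs, which is the property the flush coordinator relies on.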

### Performance Characteristics


| Metric | Value | Notes |
|--------|-------|-------|
| Append latency (async) | <100ns | Lock-free path |
| Throughput (GroupCommit) | 100K+/sec | ACID-compliant |
| Throughput (Async) | 500K+/sec | Eventual consistency |
| Concurrent writers | 64 | Linear scaling |

## ACID Compliance


The concurrent WAL maintains full ACID compliance for Synchronous and GroupCommit durability modes:

### Atomicity ✅

- All operations within a transaction are either fully persisted or not at all
- Flush coordinator writes entries atomically to segment files
- Recovery only replays complete transactions

### Consistency ✅

- LSN ordering ensures operations are applied in correct order
- Checksum verification detects any corruption
- Database invariants are preserved across crashes

### Isolation ✅

- Isolation handled by MVCC layer (not WAL)
- WAL only logs committed operations
- Snapshot isolation semantics unchanged

### Durability ✅ (Mode-Dependent)


| Mode | Durability | ACID Compliant |
|------|------------|----------------|
| **Synchronous** | Immediate (fsync on every commit) | ✅ Yes |
| **GroupCommit** | Epoch-based (transactions wait for flush) | ✅ Yes |
| **Async** | Eventual (background flush) | ❌ No |

**Why GroupCommit is ACID-Compliant:**
```
Transaction Flow (GroupCommit):
1. Append operation to stripe buffer (fast, lock-free)
2. Register with epoch N
3. WAIT for epoch N to be flushed ← Blocks here
4. Background thread: drain stripes → sort by LSN → write → fsync
5. Background thread: mark_flushed(epoch N) → wake all waiters
6. Return to caller (data is now durable)
```

The transaction does not return success until the fsync completes, guaranteeing durability.
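
The blocking behaviour in steps 3 and 5 can be sketched with a `Mutex` and `Condvar`. This is an illustrative sketch under assumptions: `EpochTracker`, `wait_for`, `flushed_epoch`, and `commit_and_wait` are hypothetical names (only `mark_flushed` appears in the flow above), and the actual write and fsync are replaced by a comment.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical epoch tracker: remembers the highest epoch known to be fsynced.
pub struct EpochTracker {
    flushed: Mutex<u64>,
    cond: Condvar,
}

impl EpochTracker {
    pub fn new() -> Self {
        EpochTracker { flushed: Mutex::new(0), cond: Condvar::new() }
    }

    /// Step 3: block the committing thread until `epoch` has been flushed.
    pub fn wait_for(&self, epoch: u64) {
        let mut flushed = self.flushed.lock().unwrap();
        while *flushed < epoch {
            flushed = self.cond.wait(flushed).unwrap();
        }
    }

    /// Step 5: called by the flusher after fsync returns; wakes all waiters.
    /// The flushed epoch never moves backwards.
    pub fn mark_flushed(&self, epoch: u64) {
        let mut flushed = self.flushed.lock().unwrap();
        if epoch > *flushed {
            *flushed = epoch;
        }
        self.cond.notify_all();
    }

    pub fn flushed_epoch(&self) -> u64 {
        *self.flushed.lock().unwrap()
    }
}

/// Spawn `n` committers that wait on `epoch`, flush it, and count how many returned.
pub fn commit_and_wait(tracker: Arc<EpochTracker>, n: usize, epoch: u64) -> usize {
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let t = Arc::clone(&tracker);
            thread::spawn(move || t.wait_for(epoch))
        })
        .collect();
    // Stand-in for the background flusher: drain, sort, write, fsync happen here.
    tracker.mark_flushed(epoch);
    handles.into_iter().map(|h| h.join().unwrap()).count()
}
```

Because `wait_for` re-checks the flushed epoch under the lock, a committer that registers after the flush has already completed returns immediately instead of missing the wake-up.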

## WAL Versioning

The WAL uses a versioned binary format to enable future evolution.

### Binary Format

**Segment Header (5 bytes):**
```
[magic: 4 bytes "GWAL"][version: 1 byte]
```
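
As a rough sketch of how a reader might validate this header: the names `encode_header`, `parse_header`, `HeaderError`, and the `max_supported` parameter are hypothetical, and treating version 0 as invalid is an assumption of this example.

```rust
/// Magic bytes and header length taken from the layout above.
pub const MAGIC: &[u8; 4] = b"GWAL";
pub const HEADER_LEN: usize = 5;

#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u8),
}

/// Build the 5-byte segment header for `version`.
pub fn encode_header(version: u8) -> [u8; HEADER_LEN] {
    let mut buf = [0u8; HEADER_LEN];
    buf[..4].copy_from_slice(MAGIC);
    buf[4] = version;
    buf
}

/// Validate a header and return its version.
/// ASSUMPTION: versions start at 1; `max_supported` is the newest version
/// this (hypothetical) reader understands.
pub fn parse_header(bytes: &[u8], max_supported: u8) -> Result<u8, HeaderError> {
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::TooShort);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    let version = bytes[4];
    if version == 0 || version > max_supported {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    Ok(version)
}
```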

**Entry Format:**
```
[LSN: 8 bytes][timestamp: 8 bytes][checksum: 4 bytes][op_type: 1 byte][operation data...]
```
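
The fixed-size prefix is 8 + 8 + 4 + 1 = 21 bytes. The sketch below round-trips that layout. It makes two assumptions the format line does not state: integers are little-endian, and everything after the prefix belongs to the operation data (no length field is modelled). `RawEntry`, `encode_entry`, and `decode_entry` are hypothetical names.

```rust
/// Fixed-size prefix of an entry: LSN (8) + timestamp (8) + checksum (4) + op_type (1).
pub const ENTRY_PREFIX_LEN: usize = 21;

#[derive(Debug, Clone, PartialEq)]
pub struct RawEntry {
    pub lsn: u64,
    pub timestamp: u64,
    pub checksum: u32,
    pub op_type: u8,
    pub data: Vec<u8>,
}

/// Serialize fields in the documented order.
/// ASSUMPTION: little-endian integers.
pub fn encode_entry(e: &RawEntry) -> Vec<u8> {
    let mut buf = Vec::with_capacity(ENTRY_PREFIX_LEN + e.data.len());
    buf.extend_from_slice(&e.lsn.to_le_bytes());
    buf.extend_from_slice(&e.timestamp.to_le_bytes());
    buf.extend_from_slice(&e.checksum.to_le_bytes());
    buf.push(e.op_type);
    buf.extend_from_slice(&e.data);
    buf
}

/// Copy exactly 8 bytes into a u64 (caller guarantees the slice length).
fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    u64::from_le_bytes(a)
}

/// Copy exactly 4 bytes into a u32 (caller guarantees the slice length).
fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(b);
    u32::from_le_bytes(a)
}

/// Parse one entry; returns None if the buffer is shorter than the prefix.
/// ASSUMPTION: all bytes after the prefix are operation data.
pub fn decode_entry(bytes: &[u8]) -> Option<RawEntry> {
    if bytes.len() < ENTRY_PREFIX_LEN {
        return None;
    }
    Some(RawEntry {
        lsn: le_u64(&bytes[0..8]),
        timestamp: le_u64(&bytes[8..16]),
        checksum: le_u32(&bytes[16..20]),
        op_type: bytes[20],
        data: bytes[ENTRY_PREFIX_LEN..].to_vec(),
    })
}
```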

### Format (Version 1)

- Full serialization of properties (PropertyMap)
- Full serialization of bi-temporal intervals (32 bytes each)
- Labels serialized for all operation types
- CRC32 checksum verification
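
For reference, a bitwise CRC-32 can be written in a few lines. This sketch assumes the common IEEE 802.3 variant (reflected polynomial `0xEDB88320`); the document does not say which CRC32 polynomial AletheiaDB uses or which bytes of an entry the checksum covers, so treat both as assumptions. `crc32` and `verify` are hypothetical helper names.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
/// ASSUMPTION: this is the variant meant by "CRC32" above.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compare a freshly computed checksum with the stored one.
pub fn verify(data: &[u8], stored: u32) -> bool {
    crc32(data) == stored
}
```

The standard check value for this variant is `crc32(b"123456789") == 0xCBF43926`, which is a convenient self-test for any implementation.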

## WAL Recovery

This section describes the recovery process for AletheiaDB: checkpoint-based recovery, WAL replay, crash scenarios, and performance characteristics.

### Current Recovery Path (Important)

- Recovery is implemented in `src/storage/checkpoint.rs` via `CheckpointManager`.
- `StringInterner` is loaded from persisted indexes (`strings/interner.idx`) before WAL replay.
- WAL replay begins at `manifest.lsn + 1` when persisted state exists.
- Legacy mentions of `storage::persistence` / `PersistenceManager` are obsolete.

### Recovery Algorithm

The recovery process follows a **checkpoint-then-replay** strategy to minimize recovery time:

```
┌─────────────────────────────────────────────────────────────────┐
│                     Database Startup Flow                        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                   ┌──────────────────────┐
                   │ Find Latest          │
                   │ Checkpoint           │
                   └──────────┬───────────┘
                              │
                   ┌──────────▼───────────┐
                   │ Checkpoint Exists?   │
                   └──────────┬───────────┘
                              │
                ┌─────────────┼─────────────┐
                │ Yes                   No  │
                ▼                           ▼
    ┌─────────────────────┐     ┌──────────────────────┐
    │ Load Checkpoint     │     │ Start with           │
    │ Metadata            │     │ Empty Storage        │
    │ - LSN               │     │ - LSN = 1            │
    │ - Node/Edge counts  │     └──────────┬───────────┘
    │ - Vector config     │                │
    └─────────┬───────────┘                │
              │                            │
              └────────────┬───────────────┘
                           ▼
              ┌────────────────────────┐
              │ Restore Vector Index   │
              │ Configuration          │
              │ (if enabled)           │
              └────────────┬───────────┘
                           ▼
              ┌────────────────────────┐
              │ Read WAL Entries       │
              │ from (checkpoint LSN+1)│
              └────────────┬───────────┘
                           ▼
              ┌────────────────────────┐
              │ For each WAL entry:    │
              │ ─────────────────────  │
              │ 1. Verify checksum     │
              │ 2. Replay operation    │
              │ 3. Track max IDs       │
              │ 4. Update storage      │
              └────────────┬───────────┘
                           ▼
              ┌────────────────────────┐
              │ Initialize ID          │
              │ Generators             │
              │ - NodeId: max_id + 1   │
              │ - EdgeId: max_id + 1   │
              │ - VersionId: max_id+1  │
              └────────────┬───────────┘
                           ▼
              ┌────────────────────────┐
              │ Database Ready         │
              └────────────────────────┘
```

#### Phase 1: Checkpoint Loading

**File Location:** `src/storage/checkpoint.rs` (`recover`)

Recovery loads the persisted index manifest when available and uses its LSN
as the WAL replay starting point.

**Checkpoint Files (Current):**
```
indexes/
├── manifest.idx          # LSN and index registry
├── strings/interner.idx  # Interned string table
├── graph/adjacency.idx   # Current graph state
├── temporal/versions.idx # Historical versions
└── vector/*              # Vector index files
```

**Recovery Point:**
- `CheckpointManager::recover()` loads persisted indexes
- WAL replay starts at `manifest.lsn + 1`

#### Phase 2: Vector Index Restoration

**File Location:** `src/storage/checkpoint.rs`

**CRITICAL:** Vector index configuration MUST be restored BEFORE WAL replay. This ensures that vectors are automatically indexed during node creation, maintaining index consistency.

```rust
// Recovery loads indexes first, then replays WAL from manifest LSN + 1.
let (current, historical, lsn) = manager.recover(&wal)?;
```

**Why This Matters:**
- If vector index is enabled during recovery, all replayed `CreateNode` operations automatically add vectors to the HNSW index
- Without this, vectors would be stored but not indexed, breaking similarity queries
- Recovery fails if vector index restoration fails (maintains data integrity)

#### Phase 3: WAL Replay


**File Location:** `src/storage/checkpoint.rs` (`replay_wal`)

The recovery process replays all WAL entries since the checkpoint LSN:

```rust
let start_lsn = persisted_lsn.map_or(LSN::initial(), |lsn| lsn.next());

// Track max IDs seen in replayed WAL entries
let mut max_node_id: Option<u64> = None;
let mut max_edge_id: Option<u64> = None;
let mut max_version_id: Option<u64> = None;
let mut next_version_id: u64 = 1;

// Replay WAL entries
let wal_entries = wal.read_from(start_lsn)?;
for entry in wal_entries {
    match entry.operation {
        WalOperation::CreateNode { node_id, .. } => {
            max_node_id = Some(max_node_id.map_or(node_id.as_u64(), |m| m.max(node_id.as_u64())));
            // ... apply operation to current + historical ...
        }
        // Similar handling for update/delete and edge operations...
    }
}
```

**Replay Semantics:**

1. **CreateNode/CreateEdge:**
   - Generate sequential version IDs during replay
   - Intern label strings into global interner
   - Insert into current storage (triggers vector indexing if enabled)
   - Add version to historical storage with bi-temporal interval
   - All operations use `TxId(0)` (recovery transaction ID)

2. **UpdateNode/UpdateEdge:**
   - Close previous version's transaction_time in historical storage
   - Create new version with update's version_id
   - Update current storage (triggers vector re-indexing if embedding changed)
   - Track version_id to ensure next operations don't conflict

3. **DeleteNode/DeleteEdge:**
   - Close current version's transaction_time
   - Create tombstone version with closed valid_time
   - Remove from current storage (triggers vector index removal)
   - Preserve label and properties in tombstone for historical queries

**Important:** Entries are already in LSN order on disk (sorted during flush), so replay happens in correct temporal order.

#### Phase 4: ID Generator Initialization


**File Location:** `src/storage/checkpoint.rs` (`replay_wal`)

After WAL replay completes, ID generators are initialized to prevent ID conflicts:

```rust
// Initialize ID generators with max_id + 1 (Issue #291)
// If no IDs were replayed for a type, keep existing generator state.
if let Some(max_node_id) = max_node_id {
    current.init_node_id_generator(max_node_id + 1);
}
if let Some(max_edge_id) = max_edge_id {
    current.init_edge_id_generator(max_edge_id + 1);
}
if let Some(max_version_id) = max_version_id {
    current.init_version_id_generator(max_version_id + 1);
}
```

**Why This Matters:**
- Prevents new operations from generating IDs that conflict with recovered entities
- Example: if the highest node ID seen during replay is NodeId(100), the next new node will be NodeId(101)
- Without this, new nodes could overwrite recovered data
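
The "max_id + 1" rule can be exercised in isolation. In this sketch `IdGenerator`, `track_max`, and `init_after_replay` are hypothetical stand-ins for the storage-layer generators; only the rule itself comes from the code above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical ID generator: hands out sequential IDs from an atomic counter.
pub struct IdGenerator {
    next: AtomicU64,
}

impl IdGenerator {
    pub fn new() -> Self {
        IdGenerator { next: AtomicU64::new(1) }
    }

    /// Set the next ID to hand out.
    pub fn init(&self, next: u64) {
        self.next.store(next, Ordering::SeqCst);
    }

    pub fn allocate(&self) -> u64 {
        self.next.fetch_add(1, Ordering::SeqCst)
    }
}

/// Fold one replayed ID into the running maximum, as the replay loop does.
pub fn track_max(current: Option<u64>, seen: u64) -> Option<u64> {
    Some(current.map_or(seen, |m| m.max(seen)))
}

/// Apply the "max_id + 1" rule; with no replayed IDs the generator is left untouched.
pub fn init_after_replay(generator: &IdGenerator, replayed_ids: &[u64]) {
    let max_id = replayed_ids.iter().fold(None, |acc, &id| track_max(acc, id));
    if let Some(max_id) = max_id {
        generator.init(max_id + 1);
    }
}
```

Replayed IDs need not arrive in ascending order: the generator is seeded from the maximum, not from the last ID seen.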

### Recovery Correctness with Concurrent WAL


The concurrent WAL writes entries to disk **sorted by LSN**, which is identical to the sequential WAL behavior:

```
Concurrent writes:         On-disk (after flush):
Thread 1: LSN 3           Entry 1: LSN 1
Thread 2: LSN 1           Entry 2: LSN 2  ← Sorted!
Thread 3: LSN 2           Entry 3: LSN 3
```

**Key Invariant:** Entries are always written to disk in LSN order, regardless of which stripe they originated from. This guarantees deterministic replay.
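
A small sketch of the invariant: merge per-stripe batches by sorting on LSN, then check that the result is strictly increasing. `merge_stripes` and `first_order_violation` are hypothetical helpers, and LSNs are modelled as plain `u64` values.

```rust
/// Merge per-stripe batches the way the flush is described: concatenate, then sort by LSN.
pub fn merge_stripes(stripes: &[Vec<u64>]) -> Vec<u64> {
    let mut all: Vec<u64> = stripes.iter().flatten().copied().collect();
    all.sort_unstable();
    all
}

/// Return the index of the first LSN that is not strictly greater than its
/// predecessor, or None if the sequence is in valid replay order.
pub fn first_order_violation(lsns: &[u64]) -> Option<usize> {
    lsns.windows(2).position(|w| w[1] <= w[0]).map(|i| i + 1)
}
```

A check like `first_order_violation` could run during replay as a cheap sanity test, since a duplicate or out-of-order LSN would indicate a damaged or incorrectly written segment.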

### Crash Scenarios and Recovery Behavior


The following table describes recovery behavior for different crash scenarios:

| Crash Scenario | Durability Mode | Recovery Behavior | Data Loss |
|----------------|-----------------|-------------------|-----------|
| **Crash during write** | Synchronous | Recover to last committed transaction | None - fsync completed |
| | GroupCommit | Recover to last flushed epoch | None - epoch fsync completed |
| | Async | Recover to last background flush | **Possible** - unflushed commits lost |
| **Crash during checkpoint** | Any | Load previous checkpoint + replay WAL | None - checkpoints are atomic |
| **WAL segment corruption** | Any | Partial recovery to last good entry | **All entries after corruption** |
| **Missing checkpoint file** | Any | Replay entire WAL from beginning | None (slower recovery) |
| **Checkpoint + WAL corruption** | Any | Recovery fails, manual intervention needed | **Depends on extent of corruption** |
| **Power failure mid-fsync** | Synchronous/GroupCommit | Last entry may be incomplete (torn write); it fails checksum verification and is discarded | None or last (unacknowledged) transaction only |
| **Disk full during WAL write** | Any | Recovery stops at last complete entry | Recent uncommitted writes |
| **Vector index config invalid** | Any | **Recovery fails** to maintain integrity | None - prevents inconsistent state |

#### Durability Guarantees by Mode


**Synchronous Mode:**
- **Guarantee:** Every committed transaction is durable (survives any crash)
- **Mechanism:** `fsync()` completes before returning to caller
- **Trade-off:** ~1.5ms latency per commit

**GroupCommit Mode:**
- **Guarantee:** Committed transactions are durable (ACID-compliant)
- **Mechanism:** Caller blocks until epoch is flushed and fsynced
- **Trade-off:** ~10-50ms latency (batched), 100K+/sec throughput

**Async Mode:**
- **Guarantee:** **NO durability guarantee** until background flush completes
- **Mechanism:** Background thread flushes every `flush_interval_ms`
- **Risk:** Crash before flush = **data loss** of unflushed commits
- **Trade-off:** <100ns latency, 500K+/sec throughput

**AsyncBatched Mode:**
- **Guarantee:** Same as Async (no durability until flush)
- **Mechanism:** Background batching + epoch tracking for metrics
- **Use Case:** High-throughput scenarios where some data loss is acceptable

#### Example: GroupCommit Recovery Guarantee


```
// Transaction timeline with GroupCommit:
1. tx1.commit() - appends to stripe, registers with epoch 5
2. tx2.commit() - appends to stripe, registers with epoch 5
3. Background thread: drains stripes, sorts by LSN, writes, fsync()
4. Background thread: mark_flushed(epoch 5)
5. tx1.commit() returns success ← Data is durable NOW
6. tx2.commit() returns success ← Data is durable NOW

// If crash happens:
- Before step 3: tx1 and tx2 NOT recovered (never fsynced)
- After step 3 but before step 4: tx1 and tx2 RECOVERED (fsync completed)
- After step 5: tx1 and tx2 RECOVERED (committed)
```

**Key Point:** GroupCommit blocks the caller until fsync completes, guaranteeing durability. This is why it's ACID-compliant despite batching.

### Handling Corrupted Segments

If a segment fails checksum verification during recovery:

```rust
match read_segment(path) {
    Ok(entries) => replay_entries(entries),
    Err(Error::ChecksumMismatch { lsn, expected, actual }) => {
        // Log corruption details
        eprintln!("Segment corrupted at LSN {}", lsn);
        eprintln!("Expected checksum: {}, Actual: {}", expected, actual);

        // Attempt partial recovery
        let recovered = recover_until_corruption(path, lsn)?;
        let recovered_count = recovered.len(); // capture before `recovered` is moved
        replay_entries(recovered);

        // Mark segment as corrupted for manual inspection
        mark_corrupted(path)?;

        // Recovery succeeds with partial data
        eprintln!("Recovered {} entries before corruption", recovered_count);
    }
    Err(e) => return Err(e),
}
```

**Partial Recovery Behavior:**
- Entries before corruption are replayed normally
- Corrupted entry and all following entries in segment are discarded
- Database state is consistent up to last good entry
- Application must re-apply lost transactions from other sources (if available)
- Subsequent segments (if any) are NOT processed to avoid inconsistent state
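
The stop-at-first-bad-entry rule can be sketched over an in-memory list of entries. Everything here is hypothetical (`StoredEntry`, `entry`, `ScanResult`, `scan_until_corruption`), and a simple byte sum stands in for CRC32 to keep the example short.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct StoredEntry {
    pub lsn: u64,
    pub data: Vec<u8>,
    pub checksum: u32,
}

/// Stand-in checksum (wrapping byte sum). ASSUMPTION: the real format uses CRC32.
pub fn toy_checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

/// Build an entry whose stored checksum matches its data.
pub fn entry(lsn: u64, data: &[u8]) -> StoredEntry {
    StoredEntry { lsn, data: data.to_vec(), checksum: toy_checksum(data) }
}

#[derive(Debug, PartialEq)]
pub struct ScanResult {
    /// Entries before the first checksum mismatch; safe to replay.
    pub good: Vec<StoredEntry>,
    /// LSN of the first corrupted entry, if any.
    pub corrupted_at: Option<u64>,
    /// Number of entries dropped (the corrupted one and everything after it).
    pub discarded: usize,
}

/// Keep entries up to the first checksum mismatch and discard the rest.
pub fn scan_until_corruption(entries: &[StoredEntry]) -> ScanResult {
    for (i, e) in entries.iter().enumerate() {
        if toy_checksum(&e.data) != e.checksum {
            return ScanResult {
                good: entries[..i].to_vec(),
                corrupted_at: Some(e.lsn),
                discarded: entries.len() - i,
            };
        }
    }
    ScanResult { good: entries.to_vec(), corrupted_at: None, discarded: 0 }
}
```

Entries after the corrupted one are dropped even if their own checksums are valid, because replaying them would skip an operation in the LSN sequence.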

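**Best Practice:** Use GroupCommit or Synchronous mode in production. The durability mode does not prevent on-disk corruption, but it ensures every acknowledged commit was fsynced, which minimizes how much acknowledged data is at risk.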

### API Reference


#### CheckpointManager::recover()


**File Location:** `src/storage/checkpoint.rs`

Recovers database state from persisted indexes plus WAL replay.

```rust
pub fn recover(
    &mut self,
    wal: &ConcurrentWalSystem,
) -> Result<(CurrentStorage, HistoricalStorage, LSN)>
```

**Example:**

```rust
use aletheiadb::storage::checkpoint::{CheckpointConfig, CheckpointManager};
use aletheiadb::storage::wal::concurrent_system::ConcurrentWalSystem;

let wal = ConcurrentWalSystem::new(wal_config)?;
let mut manager = CheckpointManager::new(CheckpointConfig::with_data_dir("data/mydb"))?;
let (current, historical, recovered_lsn) = manager.recover(&wal)?;
```

**Internal Behavior:**

1. Loads persisted string/graph/temporal/vector indexes (if present)
2. Reads checkpoint/manifest LSN
3. Replays WAL entries starting from `lsn + 1`
4. Returns recovered state and final LSN

### Performance Characteristics


Recovery performance depends on checkpoint frequency and WAL size:

```
Recovery Time = Checkpoint Load Time + WAL Replay Time
```

#### Recovery Time vs. Dataset Size


The following table shows typical recovery times for different dataset sizes with default checkpoint settings (checkpoint every 5 minutes or 1000 entries):

| Dataset Size | Checkpoint Load | WAL Replay (1K entries) | WAL Replay (10K entries) | WAL Replay (100K entries) | Total (worst case) |
|--------------|----------------|------------------------|-------------------------|--------------------------|-------------------|
| 1K nodes | ~1ms | ~5ms | ~50ms | ~500ms | ~500ms |
| 10K nodes | ~10ms | ~5ms | ~50ms | ~500ms | ~510ms |
| 100K nodes | ~100ms | ~5ms | ~50ms | ~500ms | ~600ms |
| 1M nodes | ~1s | ~5ms | ~50ms | ~500ms | ~1.5s |
| 10M nodes | ~10s | ~5ms | ~50ms | ~500ms | ~10.5s |

**Key Insights:**

1. **Checkpoint/index load** scales with dataset size: it is negligible next to replay for small datasets but dominates from roughly 1M nodes upward (see the table above)
2. **WAL replay time** scales with number of entries since last checkpoint (not total DB size)
3. **Worst case:** Crash immediately before checkpoint creation (max WAL replay)
4. **Best case:** Crash immediately after checkpoint (minimal WAL replay)

**Modern Full-State Path:**

For full-state persistence, prefer index-persistence checkpointing (`storage::checkpoint` +
`storage::index_persistence`) which persists string/graph/temporal/vector state to disk and
replays WAL from manifest LSN + 1.

#### Recovery Time Estimation


```rust
use std::time::Duration;

// Estimate recovery time for your workload (constants are rough averages):
fn estimate_recovery_time(
    node_count: usize,
    edge_count: usize,
    wal_entries_since_checkpoint: usize,
) -> Duration {
    // Checkpoint/index load: linear in entity count (0.00001 ms per entity)
    let checkpoint_load_ms = (node_count + edge_count) as f64 * 0.00001;

    // WAL replay (~5µs per entry average)
    let wal_replay_ms = wal_entries_since_checkpoint as f64 * 0.005;

    Duration::from_millis((checkpoint_load_ms + wal_replay_ms) as u64)
}

fn main() {
    // Example: 1M nodes, 2M edges, 100K WAL entries
    let recovery_time = estimate_recovery_time(1_000_000, 2_000_000, 100_000);
    // 30 ms load + 500 ms replay with the constants above
    println!("Estimated recovery time: {:?}", recovery_time); // 530ms
}
```

#### Memory Usage During Recovery


Recovery memory usage is bounded by:

```
Memory = Current Storage + Historical Storage + WAL Replay Buffer
```

**Component Breakdown:**

| Component | Memory Usage | Notes |
|-----------|-------------|-------|
| **Current Storage** | ~200 bytes/node, ~150 bytes/edge | Grows linearly during replay |
| **Historical Storage** | ~80 bytes/version | One version per create/update/delete |
| **WAL Replay Buffer** | ~1KB per entry | Temporary, released after replay |
| **Vector Index (if enabled)** | ~(4 * dimensions + 64) bytes/node | HNSW index overhead |
| **String Interner** | ~20 bytes/unique string | Shared across all entities |

**Example Memory Calculations:**

```rust
// Dataset: 1M nodes, 2M edges, 500K historical versions
// Vector index: 384 dimensions, enabled on 500K nodes

let node_memory: u64 = 1_000_000 * 200;                    // 200 MB
let edge_memory: u64 = 2_000_000 * 150;                    // 300 MB
let historical_memory: u64 = 500_000 * 80;                 // 40 MB
let vector_memory: u64 = 500_000 * (4 * 384 + 64);         // 800 MB
let total_memory_mb = (node_memory + edge_memory + historical_memory + vector_memory) / 1_000_000;

println!("Total recovery memory: {} MB", total_memory_mb); // 1340 MB
```

**Memory Optimization Tips:**

1. **Frequent checkpoints** reduce WAL replay memory (fewer entries buffered)
2. **Disable vector indexing** during recovery if not immediately needed (can rebuild later)
3. **Streaming replay** (future enhancement) will reduce replay buffer memory
4. **Tiered storage** (Issue #XYZ) moves cold historical data to disk, reducing RAM usage

#### Benchmark Results


The following benchmarks were measured on a standard development machine (16-core CPU, NVMe SSD):

**Recovery Throughput:**

| Operation | Throughput | Notes |
|-----------|-----------|-------|
| Checkpoint load (metadata) | ~100K nodes/sec | Sequential read |
| WAL replay (nodes) | ~200K nodes/sec | CreateNode operations |
| WAL replay (edges) | ~300K edges/sec | CreateEdge operations |
| WAL replay (updates) | ~150K updates/sec | UpdateNode operations |
| Vector indexing during replay | ~50K vectors/sec | 384-dim HNSW insertion |

**Real-World Recovery Examples:**

```
Scenario 1: Small database, frequent checkpoints
- 10K nodes, 20K edges
- Checkpoint: 5 minutes old (200 WAL entries)
- Recovery time: ~15ms
- Memory usage: ~3 MB

Scenario 2: Medium database, standard checkpoints
- 100K nodes, 300K edges, vector index enabled
- Checkpoint: 5 minutes old (5K WAL entries)
- Recovery time: ~150ms
- Memory usage: ~85 MB

Scenario 3: Large database, worst-case scenario
- 1M nodes, 3M edges, vector index enabled
- No checkpoint (replay entire WAL: 1M entries)
- Recovery time: ~8 seconds
- Memory usage: ~1.5 GB

Scenario 4: Production database with optimal settings
- 5M nodes, 15M edges, vector index on 2M nodes
- Checkpoint: 2 minutes old (50K WAL entries)
- Recovery time: ~2.5 seconds
- Memory usage: ~4 GB
```

### Performance Tuning Recommendations

To optimize recovery performance:

1. **Checkpoint Frequency:**
   ```rust
   // More frequent checkpoints = faster recovery (but more I/O overhead)
   CheckpointConfig {
       checkpoint_interval: Duration::from_secs(120), // 2 minutes (vs default 5)
       min_wal_entries: 500, // Lower threshold (vs default 1000)
       ..Default::default()
   }
   ```

2. **Durability Mode:**
   ```rust
   // Use GroupCommit for production (balance of durability + throughput)
   DurabilityMode::GroupCommit {
       max_delay_ms: 10,
       max_batch_size: 200,
   }
   ```

3. **WAL Segment Size:**
   ```rust
   // Smaller segments = better parallelism during recovery (future enhancement)
   ConcurrentWalSystemConfig {
       segment_size: 32 * 1024 * 1024, // 32MB (vs default 64MB)
       ..Default::default()
   }
   ```

4. **Vector Index Configuration:**
   ```rust
   // Lower ef_construction during recovery for faster indexing
   // Rebuild with higher quality later if needed
   HnswConfig::new(384, DistanceMetric::Cosine)
       .with_ef_construction(100) // Lower quality, faster (vs default 200)
       .with_capacity(expected_node_count) // Pre-allocate to avoid resizing
   ```

### Known Limitations


1. **Transaction ID Loss:** Original transaction IDs are not preserved during recovery. All replayed operations use `TxId(0)`. This means temporal queries cannot distinguish which operations were part of the same transaction after a crash/restart. (Issue #86 tracks WAL format extension to preserve transaction IDs)

2. **No Incremental Checkpoints Yet:** The current checkpoint system persists full index snapshots rather than incremental diffs. For very large datasets, snapshot writes can take noticeable time.

3. **Single-Threaded Replay:** WAL replay is currently single-threaded. For databases with >100K WAL entries since checkpoint, this can be a bottleneck. Parallel replay (by partitioning entries) is planned for Phase 3.

4. **No Streaming Replay:** All WAL entries since checkpoint are loaded into memory before replay begins. For extremely large WAL files (>1M entries), this can consume significant memory. Streaming replay is planned for future enhancement.

See [docs/ARCHITECTURE.md](ARCHITECTURE.md) for long-term recovery optimization plans.

## Configuration


### ConcurrentWalSystem Configuration


```rust
use std::path::PathBuf;

use aletheiadb::storage::wal::concurrent_system::{ConcurrentWalSystem, ConcurrentWalSystemConfig};
use aletheiadb::storage::wal::DurabilityMode;

let config = ConcurrentWalSystemConfig {
    /// Directory for WAL segments
    wal_dir: PathBuf::from("data/wal"),

    /// Number of stripes (should be power of 2, default: 16)
    num_stripes: 16,

    /// Ring buffer capacity per stripe (default: 1024)
    stripe_capacity: 1024,

    /// Maximum segment size before rotation (default: 64MB)
    segment_size: 64 * 1024 * 1024,

    /// Number of segments to retain (default: 10)
    segments_to_retain: 10,

    /// Background flush interval in ms (default: 10)
    flush_interval_ms: 10,

    /// Durability mode
    durability_mode: DurabilityMode::GroupCommit {
        max_batch_size: 200,
        max_delay_ms: 10,
    },

    /// Write buffer size for segment files (default: 64KB)
    write_buffer_size: 64 * 1024,
};

let wal = ConcurrentWalSystem::new(config)?;
```

### Durability Modes


```rust
pub enum DurabilityMode {
    /// fsync on every commit - maximum durability
    /// Latency: ~1.5ms, Throughput: ~600/sec
    Synchronous,

    /// Batched fsync with epoch-based waiting - ACID compliant
    /// Latency: ~10-50ms, Throughput: ~100K/sec
    GroupCommit {
        max_delay_ms: u64,      // default: 10
        max_batch_size: usize,  // default: 200
    },

    /// Background fsync, commits return immediately - NOT ACID
    /// Latency: <100ns, Throughput: ~500K/sec
    Async {
        flush_interval_ms: u64, // default: 10
    },

    /// Like Async but with epoch tracking (for metrics)
    AsyncBatched {
        max_delay_ms: u64,
        max_batch_size: usize,
    },
}
```
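
The mode semantics can be summarized in two small helpers. The enum below is a local mirror of the one above so the sketch compiles on its own. `commit_waits_for_fsync` and `nominal_loss_window_ms` are hypothetical helpers, and the "loss window" is only a nominal figure: it assumes the background flush runs on schedule and ignores the time the flush itself takes.

```rust
/// Local mirror of `DurabilityMode` so this sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub enum DurabilityMode {
    Synchronous,
    GroupCommit { max_delay_ms: u64, max_batch_size: usize },
    Async { flush_interval_ms: u64 },
    AsyncBatched { max_delay_ms: u64, max_batch_size: usize },
}

/// True when a commit only returns after fsync (the ACID-compliant modes).
pub fn commit_waits_for_fsync(mode: &DurabilityMode) -> bool {
    matches!(
        mode,
        DurabilityMode::Synchronous | DurabilityMode::GroupCommit { .. }
    )
}

/// Nominal window (ms) in which an acknowledged commit may still be unflushed.
/// Returns None when commits wait for fsync.
/// ASSUMPTION: the background flush fires on schedule; flush duration is ignored.
pub fn nominal_loss_window_ms(mode: &DurabilityMode) -> Option<u64> {
    match mode {
        DurabilityMode::Synchronous | DurabilityMode::GroupCommit { .. } => None,
        DurabilityMode::Async { flush_interval_ms } => Some(*flush_interval_ms),
        DurabilityMode::AsyncBatched { max_delay_ms, .. } => Some(*max_delay_ms),
    }
}
```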

### Production Recommendations


| Parameter | Recommendation | Rationale |
|-----------|---------------|-----------|
| `num_stripes` | 16-32 | Match expected concurrent writers |
| `stripe_capacity` | 1024 | Balance memory vs backpressure |
| `segment_size` | 64-128 MB | Balance rotation overhead vs recovery time |
| `segments_to_retain` | 10-20 | Enough for recovery + debugging |
| `durability_mode` | `GroupCommit` | Best balance of ACID + performance |
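
A configuration check along these lines could catch bad values early. This is a sketch under assumptions: `validate`, `stripe_for`, `total_buffered_entries`, and `ConfigError` are hypothetical, and the mask-based stripe selection is one common reason to prefer a power-of-two stripe count, not a documented detail of AletheiaDB.

```rust
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    StripesNotPowerOfTwo(usize),
    ZeroCapacity,
}

/// Check the two stripe parameters from the table above.
pub fn validate(num_stripes: usize, stripe_capacity: usize) -> Result<(), ConfigError> {
    // `is_power_of_two` is false for 0, so zero stripes is rejected too.
    if !num_stripes.is_power_of_two() {
        return Err(ConfigError::StripesNotPowerOfTwo(num_stripes));
    }
    if stripe_capacity == 0 {
        return Err(ConfigError::ZeroCapacity);
    }
    Ok(())
}

/// Map a key (for example a thread or transaction id) to a stripe with a bit mask.
/// ASSUMPTION: `num_stripes` is a power of two, so `& (n - 1)` equals `% n`.
pub fn stripe_for(key: u64, num_stripes: usize) -> usize {
    (key as usize) & (num_stripes - 1)
}

/// Maximum number of entries buffered in memory across all stripes.
pub fn total_buffered_entries(num_stripes: usize, stripe_capacity: usize) -> usize {
    num_stripes * stripe_capacity
}
```

With the defaults (16 stripes of 1024 slots), at most 16,384 entries are buffered before writers see backpressure.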

## Component Details


### LSN Allocator


The LSN allocator provides globally unique, monotonically increasing sequence numbers:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub struct LsnAllocator {
    next_lsn: AtomicU64,
}

impl LsnAllocator {
    /// Allocate a single LSN (atomic operation)
    pub fn allocate(&self) -> LSN {
        LSN(self.next_lsn.fetch_add(1, Ordering::SeqCst))
    }

    /// Set next LSN (for recovery)
    pub fn set_next_lsn(&self, lsn: LSN) {
        self.next_lsn.store(lsn.0, Ordering::SeqCst);
    }
}
```
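
The sketch below shows how the allocator might be used around recovery and under contention. `LSN` and `LsnAllocator` are mirrored locally so the example compiles on its own; `new`, `resume_after`, and `allocate_concurrently` are hypothetical additions for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Local mirror of the LSN newtype.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LSN(pub u64);

/// Local mirror of the allocator shown above.
pub struct LsnAllocator {
    next_lsn: AtomicU64,
}

impl LsnAllocator {
    pub fn new() -> Self {
        LsnAllocator { next_lsn: AtomicU64::new(1) }
    }

    pub fn allocate(&self) -> LSN {
        LSN(self.next_lsn.fetch_add(1, Ordering::SeqCst))
    }

    pub fn set_next_lsn(&self, lsn: LSN) {
        self.next_lsn.store(lsn.0, Ordering::SeqCst);
    }
}

/// After recovery, resume allocation just past the last replayed LSN.
pub fn resume_after(allocator: &LsnAllocator, recovered: LSN) {
    allocator.set_next_lsn(LSN(recovered.0 + 1));
}

/// Allocate from several threads and return every LSN handed out, sorted.
pub fn allocate_concurrently(
    allocator: Arc<LsnAllocator>,
    threads: usize,
    per_thread: usize,
) -> Vec<LSN> {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let a = Arc::clone(&allocator);
            thread::spawn(move || (0..per_thread).map(|_| a.allocate()).collect::<Vec<LSN>>())
        })
        .collect();
    let mut all: Vec<LSN> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    all.sort();
    all
}
```

Because `fetch_add` is a single atomic read-modify-write, no two threads can receive the same LSN, and the set of allocated values has no gaps.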

### Flush Coordinator


The flush coordinator manages segment files and coordinates flushing:

```rust
impl FlushCoordinator {
    /// Flush entries to disk
    pub fn flush(&self, mut entries: Vec<PendingEntry>, sync: bool) -> Result<FlushStats> {
        // 1. Sort by LSN to restore global order
        entries.sort_by_key(|e| e.lsn);

        // 2. Write to segment file
        for entry in &entries {
            self.write_entry(entry)?;
        }

        // 3. fsync if required
        if sync {
            self.sync()?;
        }

        // 4. Notify completion handles
        for entry in entries {
            if let Some(notifier) = entry.completion {
                notifier.complete(Ok(()));
            }
        }

        // `stats` (counters gathered while writing) is assumed to be built above; construction elided
        Ok(stats)
    }
}
```

## Debugging Tools


### Inspecting WAL Contents


```rust
use std::path::Path;

use aletheiadb::storage::wal_reader::read_wal_entries;

// Print all entries
let entries = read_wal_entries(Path::new("data/wal"), LSN(1))?;
for entry in entries {
    println!("LSN {}: {:?}", entry.lsn.0, entry.operation);
}
```

### Checking WAL Metrics


```rust
let wal = ConcurrentWalSystem::new(config)?;

// After some operations...
println!("Total appends: {}", wal.total_appends());
println!("Total flushed: {}", wal.total_flushed());
println!("Current LSN: {:?}", wal.current_lsn());
```

## Adding New WAL Versions


When adding new serialization features:

1. Increment `WAL_VERSION` constant
2. Add version-aware serialization logic
3. Add version-aware deserialization logic
4. Update tests for new version

```rust
// In src/storage/wal.rs
const WAL_VERSION: u8 = 2;  // Increment version

fn serialize_entry(entry: &WalEntry, version: u8) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GWAL");
    buf.push(version);
    // ... serialize fields, with version-specific logic
    buf
}
```
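
Step 3 (version-aware deserialization) usually comes down to dispatching on the version byte read from the segment header. The sketch below is purely illustrative: the payload layout, `DecodedOp`, `decode_op`, and the idea that a version 2 appends a transaction ID are all hypothetical, chosen only to show old and new formats being read side by side.

```rust
#[derive(Debug, PartialEq)]
pub struct DecodedOp {
    pub node_id: u64,
    /// Present only in the hypothetical version 2 payload.
    pub tx_id: Option<u64>,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    UnsupportedVersion(u8),
    Truncated,
}

/// Read a little-endian u64 at `offset`, failing if the payload is too short.
fn read_u64(bytes: &[u8], offset: usize) -> Result<u64, DecodeError> {
    let slice = bytes.get(offset..offset + 8).ok_or(DecodeError::Truncated)?;
    let mut arr = [0u8; 8];
    arr.copy_from_slice(slice);
    Ok(u64::from_le_bytes(arr))
}

/// Decode a payload according to the segment's format version.
/// HYPOTHETICAL layout: v1 = [node_id: 8 bytes], v2 = [node_id: 8][tx_id: 8].
pub fn decode_op(version: u8, payload: &[u8]) -> Result<DecodedOp, DecodeError> {
    match version {
        1 => Ok(DecodedOp { node_id: read_u64(payload, 0)?, tx_id: None }),
        2 => Ok(DecodedOp {
            node_id: read_u64(payload, 0)?,
            tx_id: Some(read_u64(payload, 8)?),
        }),
        v => Err(DecodeError::UnsupportedVersion(v)),
    }
}
```

Keeping the version 1 arm intact means segments written before the upgrade remain readable, while unknown versions fail loudly instead of being misparsed.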

## References


- [ADR-0007: Write-Ahead Log for Durability](adr/0007-wal-durability.md)
- [ADR-0012: Configurable Durability Modes](adr/0012-configurable-durability-modes.md)
- [ADR-0020: Concurrent WAL Architecture](adr/0020-concurrent-wal-architecture.md)
- [Write-Ahead Logging on Wikipedia](https://en.wikipedia.org/wiki/Write-ahead_logging)
- [PostgreSQL WAL Documentation](https://www.postgresql.org/docs/current/wal-intro.html)
- [LMAX Disruptor](https://lmax-exchange.github.io/disruptor/) - Lock-free ring buffer design