Module group_commit

Expand description

Cooperative group-commit coordinator for the WAL.

Mirrors PostgreSQL’s XLogFlush waiter logic (src/backend/access/transam/xlog.c). The single-writer commit path used to call wal.sync() once per commit, so N concurrent writers paid N independent fsyncs (~N × 100 µs on SSD). Group commit collapses those into one fsync that covers every byte appended up to the slowest writer’s LSN.

§Algorithm

A writer appends its records under the WAL lock and captures the resulting commit_lsn = wal.current_lsn() after its Commit record.
The writer releases the WAL lock and calls [GroupCommit::commit_at_least(commit_lsn, &wal)].
Inside commit_at_least:
- Fast path: if flushed_lsn >= commit_lsn, the write is already durable from a piggyback on a previous fsync. Return immediately.
- Otherwise take the coordinator state lock. Re-check the flushed LSN (another leader may have raced).
- If a leader is already mid-flush, wait on the condvar until that flush finishes. Re-check flushed_lsn; if the batch now covers commit_lsn, return, otherwise race to become the next leader for the remaining tail.
- If no leader is in progress, become the leader: mark in_progress = true, drop the state lock, take the WAL lock, call wal.sync(), publish the new flushed_lsn, take the state lock again, clear in_progress, and notify all waiters.

§Why this works

Between the first writer’s append and the leader’s wal.sync(), other writers can grab the WAL lock and append more records. When the leader finally calls sync(), it flushes everything that has been appended so far — not just its own records. Writers whose LSN is now covered return immediately; writers that appended after the leader captured target_lsn wake up, see they still need more durability, and one of them becomes the next leader.

So commit_at_least produces one fsync per batch of concurrent writers, not per writer. On a workload with 8 concurrent committers, the throughput goes from ~8 × 100 µs ≈ 1 250 commits/s to ~1 × 100 µs ≈ 10 000 commits/s, an 8× win.

§Correctness

flushed_lsn is monotonic: only the leader writes it, and only after a successful sync().
The state lock + condvar guarantee that exactly one leader is ever in flight, so we never have two parallel fsyncs racing.
Waiters re-check flushed_lsn under the state lock before sleeping, so we never miss a wake-up.
The leader does only the WAL sync() while holding leadership — no extra work — to keep the critical section as short as possible.

Structs§

GroupCommit: Cooperative WAL flush coordinator.