Cooperative group-commit coordinator for the WAL.
Mirrors PostgreSQL’s XLogFlush waiter logic
(src/backend/access/transam/xlog.c). The single-writer commit
path used to call wal.sync() once per commit, so N concurrent
writers paid N independent fsyncs (~N × 100 µs on SSD). Group
commit collapses those into one fsync that covers every byte
appended by the time the sync runs, up to the latest writer’s LSN.
§Algorithm
- A writer appends its records under the WAL lock and captures the resulting commit_lsn = wal.current_lsn() after its Commit record.
- The writer releases the WAL lock and calls GroupCommit::commit_at_least(commit_lsn, &wal).
- Inside commit_at_least:
  - Fast path: if flushed_lsn >= commit_lsn, the write is already durable from a piggyback on a previous fsync. Return immediately.
  - Otherwise take the coordinator state lock. Re-check the flushed LSN (another leader may have raced).
  - If a leader is already mid-flush, wait on the condvar until that flush finishes. Re-check flushed_lsn; if the batch now covers commit_lsn, return, otherwise race to become the next leader for the remaining tail.
  - If no leader is in progress, become the leader: mark in_progress = true, drop the state lock, take the WAL lock, call wal.sync(), publish the new flushed_lsn, take the state lock again, clear in_progress, and notify all waiters.
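The steps above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual implementation: the `Wal` stub (with `append`, `sync` returning the durable LSN, and a `syncs` counter), the `Mutex<Wal>` parameter type, the `flush_done` condvar name, and the use of `AtomicU64::fetch_max` to publish are all assumptions made for the example.

```rust
use std::io;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Condvar, Mutex};

/// Hypothetical stand-in for the real WAL (not the crate's type).
pub struct Wal {
    appended: u64,  // LSN just past the last appended byte
    pub syncs: u64, // simulated fsync count, for inspection only
}

impl Wal {
    pub fn new() -> Self {
        Wal { appended: 0, syncs: 0 }
    }
    /// Append `len` bytes and return the resulting LSN.
    pub fn append(&mut self, len: u64) -> u64 {
        self.appended += len;
        self.appended
    }
    /// Pretend to fsync; report the LSN that is now durable.
    pub fn sync(&mut self) -> io::Result<u64> {
        self.syncs += 1;
        Ok(self.appended)
    }
}

pub struct GroupCommit {
    flushed_lsn: AtomicU64,   // highest LSN known to be durable
    in_progress: Mutex<bool>, // state lock: is a leader mid-flush?
    flush_done: Condvar,      // signalled when a leader finishes
}

impl GroupCommit {
    pub fn new() -> Self {
        GroupCommit {
            flushed_lsn: AtomicU64::new(0),
            in_progress: Mutex::new(false),
            flush_done: Condvar::new(),
        }
    }

    pub fn flushed_lsn(&self) -> u64 {
        self.flushed_lsn.load(Ordering::Acquire)
    }

    pub fn commit_at_least(&self, commit_lsn: u64, wal: &Mutex<Wal>) -> io::Result<()> {
        // Fast path: an earlier fsync already covered this LSN.
        if self.flushed_lsn() >= commit_lsn {
            return Ok(());
        }
        let mut in_progress = self.in_progress.lock().unwrap();
        loop {
            // Re-check under the state lock: a leader may have raced us.
            if self.flushed_lsn() >= commit_lsn {
                return Ok(());
            }
            if !*in_progress {
                break; // nobody is flushing, so this caller leads
            }
            // A leader is mid-flush: sleep until it notifies, then re-check.
            in_progress = self.flush_done.wait(in_progress).unwrap();
        }
        *in_progress = true;
        drop(in_progress); // never hold the state lock across the fsync

        let result = wal.lock().unwrap().sync();
        if let Ok(lsn) = &result {
            // Publish before clearing `in_progress` so woken waiters see it.
            self.flushed_lsn.fetch_max(*lsn, Ordering::AcqRel);
        }
        // Clear leadership even on error so waiters are not stranded.
        *self.in_progress.lock().unwrap() = false;
        self.flush_done.notify_all();
        result.map(|_| ())
    }
}
```

In this sketch a caller would append with `let lsn = wal.lock().unwrap().append(n);`, let the guard drop, and then call `gc.commit_at_least(lsn, &wal)`. A second call with an already-covered LSN returns on the fast path without touching either lock.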
§Why this works
Between the first writer’s append and the leader’s wal.sync(),
other writers can grab the WAL lock and append more records.
When the leader finally calls sync(), it flushes everything
that has been appended so far — not just its own records. Writers
whose LSN is now covered return immediately; writers that appended
after the leader captured target_lsn wake up, see they still need
more durability, and one of them becomes the next leader.
So commit_at_least produces one fsync per batch of concurrent
writers, not per writer. On a workload with 8 concurrent
committers, assuming fsyncs serialize at ~100 µs each, a commit’s
wait drops from ~8 × 100 µs = 800 µs (≈ 1 250 commits/s per writer)
to ~1 × 100 µs (≈ 10 000 commits/s per writer), an 8× win.
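The figures above can be reproduced with a back-of-the-envelope model. The sketch below assumes the numbers are per-writer rates, that fsyncs run strictly one after another, and that fsync cost dominates commit latency. The function names are hypothetical and exist only for this illustration.

```rust
/// Commits per second for one writer whose commit waits for
/// `fsyncs_waited` back-to-back fsyncs of `fsync_micros` each.
/// Panics on a zero argument (division by zero); fine for a sketch.
pub fn commits_per_sec(fsync_micros: u64, fsyncs_waited: u64) -> u64 {
    1_000_000 / (fsync_micros * fsyncs_waited)
}

/// Speed-up from group commit under this model: the rate when a commit
/// waits for one fsync divided by the rate when it waits for `writers`.
pub fn group_commit_speedup(fsync_micros: u64, writers: u64) -> u64 {
    commits_per_sec(fsync_micros, 1) / commits_per_sec(fsync_micros, writers)
}
```

With `fsync_micros = 100`, `commits_per_sec(100, 8)` gives 1250 and `commits_per_sec(100, 1)` gives 10000, so the modelled speed-up equals the number of concurrent writers. Real gains depend on how many writers actually land in each batch.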
§Correctness
- flushed_lsn is monotonic: only the leader writes it, and only after a successful sync().
- The state lock + condvar guarantee that at most one leader is in flight at any time, so we never have two parallel fsyncs racing.
- Waiters re-check flushed_lsn under the state lock before sleeping, so we never miss a wake-up.
- The leader does only the WAL sync() while holding leadership, with no extra work, to keep the critical section as short as possible.
Structs§
- GroupCommit: Cooperative WAL flush coordinator.