Double-write buffer for torn write protection.
NVMe drives typically guarantee atomic writes at sector granularity (commonly 4 KiB) but NOT for larger pages (e.g., 16 KiB). If power fails mid-write on a 16 KiB page, the WAL page can be partially written (torn).
CRC32C detects torn writes during replay, but without the double-write buffer, the record is lost — even though it was acknowledged to the client.
The double-write buffer solves this:
- Before writing to WAL, write the record to the double-write file.
- fsync the double-write file.
- Write to the WAL file.
- fsync the WAL file.
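The write ordering above can be sketched with plain `std::fs` calls. This is a minimal illustration, not this module's implementation: `durable_append` and `open_append` are hypothetical helpers, and both files are opened in append mode for brevity, whereas the real double-write file is a fixed-size circular buffer.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: open (or create) a file for appending.
fn open_append(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Hypothetical helper showing the ordering only:
/// the record reaches the double-write file, and is fsynced there,
/// before any byte of it is written to the WAL.
fn durable_append(dwb: &mut File, wal: &mut File, record: &[u8]) -> io::Result<()> {
    dwb.write_all(record)?; // 1. write the record to the double-write file
    dwb.sync_all()?;        // 2. fsync the double-write file
    wal.write_all(record)?; // 3. write the record to the WAL file
    wal.sync_all()?;        // 4. fsync the WAL file
    Ok(())
}
```

Because step 2 completes before step 3 starts, a crash that tears the WAL write in step 3 always leaves an intact copy in the double-write file.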
On recovery, if a WAL record’s CRC fails:
- Check the double-write buffer for an intact copy (verify CRC).
- If found, use the double-write copy to reconstruct the WAL page.
- If not found, the record is truly lost (pre-fsync crash).
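The recovery decision above can be sketched as follows. The `Record` and `Recovered` types and the `recover` function are hypothetical names for illustration, and the sketch assumes the caller has already located the double-write slot that corresponds to the failing WAL record. The CRC32C routine is a simple bitwise version, not necessarily the one this crate uses.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Hypothetical record: a payload plus the CRC stored alongside it.
struct Record {
    payload: Vec<u8>,
    crc: u32,
}

impl Record {
    fn new(payload: &[u8]) -> Self {
        Record { payload: payload.to_vec(), crc: crc32c(payload) }
    }

    fn is_intact(&self) -> bool {
        crc32c(&self.payload) == self.crc
    }
}

#[derive(Debug, PartialEq)]
enum Recovered {
    FromWal(Vec<u8>),
    FromDwb(Vec<u8>),
    Lost,
}

/// Prefer the WAL copy; fall back to an intact double-write copy; else lost.
fn recover(wal: &Record, dwb: Option<&Record>) -> Recovered {
    if wal.is_intact() {
        return Recovered::FromWal(wal.payload.clone());
    }
    match dwb {
        Some(copy) if copy.is_intact() => Recovered::FromDwb(copy.payload.clone()),
        _ => Recovered::Lost,
    }
}
```

Note that the double-write copy is itself verified before use: a crash during the double-write step can tear that copy instead, in which case the WAL write never started and the record was never acknowledged.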
The double-write file is a fixed-size circular buffer. Only the most recent N records are kept — older ones are overwritten. This is fine because torn writes can only happen on the most recent write.
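A small in-memory model may help show how a fixed-size ring keeps only the most recent N records. The `Ring` type and the `seq % N` slot mapping are assumptions made for this sketch; the actual slot-selection scheme is not described here.

```rust
/// Hypothetical in-memory model of the circular buffer.
/// Each slot remembers which sequence number it currently holds.
struct Ring {
    slots: Vec<Option<(u64, Vec<u8>)>>,
}

impl Ring {
    fn new(n: usize) -> Self {
        assert!(n > 0, "ring needs at least one slot");
        Ring { slots: vec![None; n] }
    }

    /// Slot index for a record, assuming a simple `seq % N` mapping.
    fn index(&self, seq: u64) -> usize {
        (seq % self.slots.len() as u64) as usize
    }

    /// Store a record, overwriting whatever older record occupied the slot.
    fn put(&mut self, seq: u64, payload: &[u8]) {
        let i = self.index(seq);
        self.slots[i] = Some((seq, payload.to_vec()));
    }

    /// Return the payload only if the slot still holds this exact record.
    fn get(&self, seq: u64) -> Option<&[u8]> {
        match &self.slots[self.index(seq)] {
            Some((stored, payload)) if *stored == seq => Some(payload.as_slice()),
            _ => None,
        }
    }
}
```

With three slots, writing records 0 through 4 leaves records 2, 3, and 4 retrievable, while records 0 and 1 have been overwritten.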
§O_DIRECT mode
When the parent WAL uses O_DIRECT, the DWB can also be opened with
O_DIRECT (DwbMode::Direct). This:
- Keeps the page cache free of DWB bytes — the O_DIRECT WAL was specifically designed not to warm the cache, and a buffered DWB undoes that by writing the exact same payload through the cache.
- Surfaces DWB bytes in block-layer iostat traffic alongside the WAL.
The on-disk layout is the same in both modes (one aligned header block followed by fixed-stride slots, all block-aligned) so a DWB written in one mode can be read in the other.
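The layout described above implies simple offset arithmetic. The sketch below assumes a 4096-byte block size and uses hypothetical helper names (`stride_for`, `slot_offset`, `file_len`); it is not the module's `slot_stride` function, whose signature and exact rounding are not shown here.

```rust
/// Assumed block size for this sketch; the real value may differ.
const BLOCK: u64 = 4096;

/// Round `n` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(n: u64, align: u64) -> u64 {
    (n + align - 1) / align * align
}

/// Hypothetical stride: the largest slot payload rounded up to a block multiple.
fn stride_for(max_slot_len: u64) -> u64 {
    align_up(max_slot_len, BLOCK)
}

/// Byte offset of a slot: one header block, then fixed-stride slots.
fn slot_offset(slot: u64, stride: u64) -> u64 {
    BLOCK + slot * stride
}

/// Total file size for a given number of slots.
fn file_len(slots: u64, stride: u64) -> u64 {
    BLOCK + slots * stride
}
```

Since the header and every slot start on a block boundary, the same offsets satisfy the alignment that O_DIRECT requires and remain valid for ordinary buffered reads.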
Structs§
- DoubleWriteBuffer
- Double-write buffer file.
Enums§
- DwbMode
- I/O mode for the double-write buffer file.
Functions§
- slot_stride
- Slot stride in bytes. Exposed for tests and for callers that want to size DWB files ahead of time.
- wal_dwb_bytes_written_total
- Total bytes written to DWB files since process start.