Module snapshot_guarantees

Expand description

§Snapshot Guarantees

This guide covers what d-engine guarantees about snapshot behavior, known limitations, and configuration recommendations for production deployments.

§Guarantees

Write operations are never blocked by snapshot generation. Snapshot creation runs in a background tokio::spawn task, isolating compression I/O from the Raft event loop. Writes continue uninterrupted while a snapshot is being compressed and written to disk.

Committed data is never lost during interrupted transfers. If a snapshot transfer is interrupted mid-way (network drop, leader crash, receiver restart), the stale temporary file is truncated on the next attempt — not appended to — and the transfer restarts cleanly. The leader retries automatically until the follower catches up.

Snapshot files are written atomically. The receiver assembles chunks into a temporary file (temp-snapshot.part.tar.gz) and performs an atomic rename to the final path only after all chunks pass checksum validation. A reader never observes a partially written snapshot file.

§Limitations

P99 write latency may increase during snapshot generation. Compressing the state machine to disk is CPU-bound. Under high write throughput, expect a transient P99 spike of 5–20ms during the compression window.

Interrupted transfers restart from the beginning. d-engine does not support resuming a partial snapshot transfer. If a transfer is interrupted after transferring 90% of a large snapshot, the next attempt retransfers from chunk 0. For snapshots exceeding ~500 MB, ensure stable network conditions.

No cross-datacenter snapshot optimization. Differential or incremental snapshot transfer is out of scope. Each transfer is a full snapshot.

§Operational Boundaries

§When snapshots trigger

A snapshot is triggered when the number of unapplied log entries exceeds max_log_entries_before_snapshot. After the snapshot is created, log entries older than retained_log_entries before the snapshot index are purged.

Any follower whose next_index falls below the purge boundary will receive a full snapshot instead of log entries.

§Chunk timeout

receive_chunk_timeout_in_sec (default: 30s) applies per-chunk on the receiver side. For slow networks or chunks larger than the default 1 KB, increase this value:

[raft.snapshot]
receive_chunk_timeout_in_sec = 60

If this timeout fires, the receiver aborts the current transfer and the leader retries.

§Snapshot retention

cleanup_retain_count (default: 2) controls how many past snapshot files are kept on disk after a new one is created. Keep at least 2 for rollback and debugging headroom.

§Configuration Reference

Field	Default	Description
`max_log_entries_before_snapshot`	1000	Log entries before snapshot triggers
`retained_log_entries`	1	Log entries to retain after snapshot
`chunk_size`	1024 (1 KB)	Size of each transfer chunk in bytes
`receive_chunk_timeout_in_sec`	30	Per-chunk receive timeout on follower
`transfer_timeout_in_sec`	600	Overall transfer timeout
`max_bandwidth_mbps`	1	Transfer bandwidth cap (Mbps)
`cleanup_retain_count`	2	Number of past snapshot files to keep

§Further Reading

Consistency model: consistency_tuning