Module snapshot_guarantees

§Snapshot Guarantees

This guide covers what d-engine guarantees about snapshot behavior, known limitations, and configuration recommendations for production deployments.

§Guarantees

Write operations are never blocked by snapshot generation. Snapshot creation runs in a background tokio::spawn task, isolating compression I/O from the Raft event loop. Writes continue uninterrupted while a snapshot is being compressed and written to disk.

Committed data is never lost during interrupted transfers. If a snapshot transfer is interrupted midway (network drop, leader crash, receiver restart), the next attempt truncates the stale temporary file rather than appending to it, and the transfer restarts cleanly. The leader retries automatically until the follower catches up.

Snapshot files are written atomically. The receiver assembles chunks into a temporary file (temp-snapshot.part.tar.gz) and performs an atomic rename to the final path only after all chunks pass checksum validation. A reader never observes a partially written snapshot file.
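
The truncate-then-rename pattern behind the two receiver-side guarantees above can be sketched with the standard library alone. This is a minimal illustration, not d-engine's implementation: `assemble_snapshot` and its parameters are hypothetical names, and checksum validation is left out for brevity.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper (not part of d-engine's API): writes all chunks to
/// `tmp_path`, then publishes the result at `final_path` with a rename.
pub fn assemble_snapshot(
    chunks: &[Vec<u8>],
    tmp_path: &Path,
    final_path: &Path,
) -> io::Result<()> {
    // truncate(true) discards bytes left behind by an interrupted attempt,
    // so a retry never appends to stale data.
    let mut tmp = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(tmp_path)?;

    for chunk in chunks {
        tmp.write_all(chunk)?;
    }

    // Flush file contents to disk before the file becomes visible
    // under its final name.
    tmp.sync_all()?;
    drop(tmp);

    // On POSIX systems, a rename within one filesystem replaces the
    // destination atomically, so readers see the old file or the new one.
    fs::rename(tmp_path, final_path)
}
```

Because the final path only ever changes through the rename, a crash at any earlier point leaves at most a stale temporary file, which the next attempt truncates.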

§Limitations

P99 write latency may increase during snapshot generation. Writes are not blocked, but compressing the state machine to disk is CPU-bound and runs alongside them. Under high write throughput, expect a transient P99 spike of 5–20 ms during the compression window.

Interrupted transfers restart from the beginning. d-engine does not support resuming a partial snapshot transfer. If a transfer is interrupted at 90% of a large snapshot, the next attempt retransfers from chunk 0. For snapshots exceeding ~500 MB, ensure stable network conditions.

No cross-datacenter snapshot optimization. Differential or incremental snapshot transfer is out of scope. Each transfer is a full snapshot.

§Operational Boundaries

§When snapshots trigger

A snapshot is triggered when the number of unapplied log entries exceeds max_log_entries_before_snapshot. After the snapshot is created, log entries more than retained_log_entries behind the snapshot index are purged.

Any follower whose next_index falls below the purge boundary will receive a full snapshot instead of log entries.
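
The trigger and purge rules above reduce to a few integer comparisons. The sketch below is a hedged illustration: `SnapshotPolicy` and its methods are hypothetical names, and the exact inclusive or exclusive handling of the boundary is an assumption rather than documented d-engine behavior.

```rust
/// Hypothetical mirror of two fields from the configuration reference.
pub struct SnapshotPolicy {
    pub max_log_entries_before_snapshot: u64,
    pub retained_log_entries: u64,
}

impl SnapshotPolicy {
    /// A snapshot triggers once the pending entry count exceeds the threshold.
    pub fn should_snapshot(&self, pending_entries: u64) -> bool {
        pending_entries > self.max_log_entries_before_snapshot
    }

    /// Lowest log index assumed to survive a snapshot taken at `snapshot_index`.
    /// saturating_sub keeps the boundary at 0 for very early snapshots.
    pub fn purge_boundary(&self, snapshot_index: u64) -> u64 {
        snapshot_index.saturating_sub(self.retained_log_entries)
    }

    /// A follower whose next_index is below the boundary cannot be served
    /// from the log and needs a full snapshot instead.
    pub fn needs_full_snapshot(&self, next_index: u64, snapshot_index: u64) -> bool {
        next_index < self.purge_boundary(snapshot_index)
    }
}
```

With the defaults (1000 and 1), a snapshot at index 5000 puts the boundary at 4999 under these assumptions, so a follower at next_index 4998 would receive a snapshot while one at 4999 would still receive log entries.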

§Chunk timeout

receive_chunk_timeout_in_sec (default: 30 s) applies to each chunk individually on the receiver side. For slow networks, or for chunk_size values larger than the default 1 KB, increase this value:

[raft.snapshot]
receive_chunk_timeout_in_sec = 60

If this timeout fires, the receiver aborts the current transfer and the leader retries.
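
To make the per-chunk semantics concrete, here is a small sketch in which the deadline restarts for every chunk instead of covering the whole transfer. It uses a standard-library channel in place of d-engine's actual transport, and `receive_all` and `ReceiveError` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum ReceiveError {
    /// No chunk arrived within the per-chunk window.
    ChunkTimeout,
    /// The sender went away before the last chunk arrived.
    SenderClosed,
}

/// Collects `expected_chunks` chunks, applying `per_chunk` to each one
/// separately. Any failure aborts the attempt and discards partial data.
pub fn receive_all(
    rx: &Receiver<Vec<u8>>,
    expected_chunks: usize,
    per_chunk: Duration,
) -> Result<Vec<u8>, ReceiveError> {
    let mut data = Vec::new();
    for _ in 0..expected_chunks {
        match rx.recv_timeout(per_chunk) {
            Ok(chunk) => data.extend_from_slice(&chunk),
            Err(RecvTimeoutError::Timeout) => return Err(ReceiveError::ChunkTimeout),
            Err(RecvTimeoutError::Disconnected) => return Err(ReceiveError::SenderClosed),
        }
    }
    Ok(data)
}
```

Under this model a slow but steady sender never trips the per-chunk timeout; only a gap longer than the window between two consecutive chunks does.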

§Snapshot retention

cleanup_retain_count (default: 2) controls how many past snapshot files are kept on disk after a new one is created. Keep at least 2 for rollback and debugging headroom.
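
A retention policy of this kind amounts to sorting snapshots by age and dropping everything past the first N. The sketch below assumes snapshots are identified by their log index and that the retain count includes the newest snapshot; both are assumptions, and `snapshots_to_delete` is a hypothetical helper.

```rust
/// Returns the snapshot indexes that fall outside the retention window,
/// newest-first. Assumes a higher index means a newer snapshot.
pub fn snapshots_to_delete(mut indexes: Vec<u64>, retain_count: usize) -> Vec<u64> {
    // Sort descending so the newest snapshots come first.
    indexes.sort_unstable_by(|a, b| b.cmp(a));
    // Keep the first `retain_count`, return the rest for deletion.
    indexes.into_iter().skip(retain_count).collect()
}
```

For example, with snapshots at indexes 100, 200, 300, and 400 and a retain count of 2, the helper keeps 400 and 300 and returns 200 and 100 for deletion.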

§Configuration Reference

| Field | Default | Description |
|---|---|---|
| max_log_entries_before_snapshot | 1000 | Log entries before snapshot triggers |
| retained_log_entries | 1 | Log entries to retain after snapshot |
| chunk_size | 1024 (1 KB) | Size of each transfer chunk in bytes |
| receive_chunk_timeout_in_sec | 30 | Per-chunk receive timeout on follower |
| transfer_timeout_in_sec | 600 | Overall transfer timeout |
| max_bandwidth_mbps | 1 | Transfer bandwidth cap (Mbps) |
| cleanup_retain_count | 2 | Number of past snapshot files to keep |
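
When sizing these values it can help to check how the defaults interact. The sketch below is back-of-the-envelope arithmetic, not d-engine code: it assumes max_bandwidth_mbps means 10^6 bits per second, that the cap applies to snapshot payload bytes, and that transfer_timeout_in_sec bounds the whole transfer. The function names are hypothetical.

```rust
/// Number of chunks needed for a snapshot, rounding up.
pub fn chunk_count(snapshot_bytes: u64, chunk_size: u64) -> u64 {
    (snapshot_bytes + chunk_size - 1) / chunk_size
}

/// Lower bound on transfer time in seconds under a bandwidth cap,
/// assuming 1 Mbps = 1_000_000 bits per second. Rounds up.
pub fn min_transfer_secs(snapshot_bytes: u64, max_bandwidth_mbps: u64) -> u64 {
    let bits = snapshot_bytes * 8;
    let bits_per_sec = max_bandwidth_mbps * 1_000_000;
    (bits + bits_per_sec - 1) / bits_per_sec
}

/// True if the capped transfer can finish within the overall timeout.
pub fn fits_in_timeout(
    snapshot_bytes: u64,
    max_bandwidth_mbps: u64,
    transfer_timeout_in_sec: u64,
) -> bool {
    min_transfer_secs(snapshot_bytes, max_bandwidth_mbps) <= transfer_timeout_in_sec
}
```

Under these assumptions, a 500 MiB snapshot (524,288,000 bytes) splits into 512,000 chunks at the default chunk_size and needs at least 4,195 seconds at the default 1 Mbps cap, well beyond the default 600-second transfer timeout. If the assumptions hold for your deployment, large snapshots would call for raising max_bandwidth_mbps, transfer_timeout_in_sec, or both.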

§Further Reading