pub enum PersistenceStrategy {
DiskFirst,
MemFirst,
}
Defines how Raft log entries are persisted and accessed.
Both strategies use a configurable FlushPolicy to control when memory contents are flushed to disk, affecting write latency and durability guarantees.
Note: Both strategies now fully load all log entries from disk into memory at startup. The in-memory SkipMap serves as the primary data structure for reads in all modes.
§Raft Log Persistence Architecture
This document outlines the design and behavior of the Raft log storage engine. It explains how logs are persisted, how the system handles reads and writes under different strategies, and how consistency is guaranteed across different configurations.
§Overview
Raft logs record the sequence of operations that must be replicated across all nodes in a Raft cluster. Correct and reliable storage of these logs is essential to maintaining the linearizability and safety guarantees of the protocol.
Our log engine supports two persistence strategies:
- DiskFirst: Prioritizes durability.
- MemFirst: Prioritizes performance.
Both strategies support configurable flush policies to control how memory contents are persisted to disk.
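To make the configuration surface concrete, here is a minimal sketch. LogEngineOptions is hypothetical, and the FlushPolicy field types are assumptions; only the two strategy variants and the policy shapes described on this page come from the crate.
```rust
// Local stand-ins for illustration: `LogEngineOptions` is hypothetical, and
// the `FlushPolicy` field types are assumptions; only the strategy variants
// and policy shapes described on this page come from the docs.
pub enum PersistenceStrategy {
    DiskFirst,
    MemFirst,
}

pub enum FlushPolicy {
    Immediate,
    Batch { threshold: usize, interval: u64 }, // interval in milliseconds
}

pub struct LogEngineOptions {
    pub strategy: PersistenceStrategy,
    pub flush_policy: FlushPolicy,
}

fn main() {
    // Durability-first profile: every append is persisted before acknowledgement.
    let _durable = LogEngineOptions {
        strategy: PersistenceStrategy::DiskFirst,
        flush_policy: FlushPolicy::Immediate,
    };

    // Latency-first profile: acknowledge from memory, flush in batches.
    let _fast = LogEngineOptions {
        strategy: PersistenceStrategy::MemFirst,
        flush_policy: FlushPolicy::Batch { threshold: 256, interval: 50 },
    };
}
```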
§Persistence Strategies
§DiskFirst
- Write Path: On append, entries are first synchronously written to disk. Once confirmed, they are cached in memory (see the sketch after this list).
- Read Path: Reads are served from the in-memory SkipMap. Because all entries are loaded at startup, an entry missing from memory is treated as absent.
- Startup Behavior: All log entries are loaded from disk into memory at startup.
- Durability: Ensures strong durability. A log entry is never considered accepted until it is safely written to disk.
- Memory Use: Memory serves as the primary read structure, backed by the durable on-disk log.
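A minimal sketch of the disk-first append and read paths described above. All names here are hypothetical, and a BTreeMap stands in for the crate's concurrent SkipMap:
```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{self, Write};

// Hypothetical disk-first append path; not the crate's actual implementation.
struct DiskFirstLog {
    file: File,                  // append-only log file
    mem: BTreeMap<u64, Vec<u8>>, // index -> entry, fully loaded at startup
}

impl DiskFirstLog {
    fn append(&mut self, index: u64, entry: &[u8]) -> io::Result<()> {
        // 1. Persist first: the entry is not acknowledged until the write
        //    and the fsync both succeed.
        self.file.write_all(entry)?;
        self.file.sync_data()?;
        // 2. Only then make the entry visible to readers in memory.
        self.mem.insert(index, entry.to_vec());
        Ok(())
    }

    fn get(&self, index: u64) -> Option<&Vec<u8>> {
        // Reads come from memory; all entries are preloaded at startup.
        self.mem.get(&index)
    }
}
```
The ordering inside append is the essence of DiskFirst: the in-memory copy is updated only after the entry is durable.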
§MemFirst
- Write Path: Entries are first written to memory and acknowledged immediately. Disk persistence is handled asynchronously in the background (see the sketch after this list).
- Read Path: Reads are served from memory only. If an entry is not present in memory, it is considered nonexistent.
- Startup Behavior: Loads all log entries from disk into memory during startup.
- Durability: Durability is best-effort and depends on the flush policy. Recent entries may be lost if a crash occurs before flushing.
- Memory Use: Memory holds the complete working set of logs.
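A matching sketch of the memory-first path. Again the names are hypothetical; a real implementation would use a concurrent SkipMap and an async runtime rather than a plain channel:
```rust
use std::collections::BTreeMap;
use std::sync::mpsc::Sender;

// Hypothetical memory-first append path; not the crate's actual implementation.
struct MemFirstLog {
    mem: BTreeMap<u64, Vec<u8>>,      // the complete working set of entries
    flush_tx: Sender<(u64, Vec<u8>)>, // consumed by a background flush task
}

impl MemFirstLog {
    fn append(&mut self, index: u64, entry: Vec<u8>) {
        // 1. Make the entry visible in memory and acknowledge immediately.
        self.mem.insert(index, entry.clone());
        // 2. Queue it for the background task, which persists it according
        //    to the configured FlushPolicy. If the process crashes before
        //    the flush completes, this entry is lost.
        let _ = self.flush_tx.send((index, entry));
    }

    fn get(&self, index: u64) -> Option<&Vec<u8>> {
        // Memory only: a miss means the entry does not exist.
        self.mem.get(&index)
    }
}
```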
§Flush Policies
Flush policies control how and when in-memory data is persisted to disk. These are especially relevant in MemFirst
mode, but are also applied in DiskFirst
to control how memory state is flushed (e.g., snapshots, metadata, etc).
§Types
- Immediate
  - Flush to disk immediately after every log write.
  - Ensures maximum durability, but higher I/O latency.
- Batch { threshold, interval }
  - Flush to disk when either:
    - the number of unflushed entries exceeds threshold, or
    - the elapsed time since the last flush exceeds interval milliseconds.
  - Balances performance and durability.
  - May lose recent entries on crash.
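As a sketch, the decision each policy implies looks like the following; the enum's field types are assumptions rather than the crate's actual definition:
```rust
use std::time::{Duration, Instant};

// Field types are assumptions; only the variant shapes come from the docs.
pub enum FlushPolicy {
    Immediate,
    Batch { threshold: usize, interval: u64 }, // interval in milliseconds
}

impl FlushPolicy {
    fn should_flush(&self, unflushed: usize, last_flush: Instant) -> bool {
        match self {
            // Flush after every log write.
            FlushPolicy::Immediate => true,
            // Flush when either limit is exceeded, whichever comes first.
            FlushPolicy::Batch { threshold, interval } => {
                unflushed > *threshold
                    || last_flush.elapsed() > Duration::from_millis(*interval)
            }
        }
    }
}
```
A caller would check should_flush after each append and on each timer tick, flushing the pending batch whenever it returns true.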
§Read & Write Semantics
Operation | DiskFirst | MemFirst |
---|---|---|
Write | Write to disk → cache in memory | Write to memory → async flush |
Read | From memory (SkipMap) | From memory (SkipMap); missing = absent |
Startup | Preload all entries into memory | Preload all entries into memory |
Flush | Controlled via flush_policy | Controlled via flush_policy |
Data loss on crash | No (after disk fsync) | Possible if not flushed |
§Consistency Guarantees
Property | DiskFirst | MemFirst |
---|---|---|
Linearizability | ✅ (strict) | ✅ (with quorum + sync on commit) |
Durability (Post-Commit) | ✅ Always | ❌ Depends on flush policy |
Availability (Under Load) | ❌ Lower | ✅ Higher |
Crash Recovery | ✅ Strong | ❌ Recent entries may be lost |
Startup Readiness | ❌ Slower (full load) | ❌ Slower (full load) |
§Recommended Use Cases
Strategy | Best For |
---|---|
DiskFirst | Systems that require strong durability and consistent recovery (e.g., databases, distributed ledgers) |
MemFirst | Systems that favor latency and availability, and can tolerate recovery from snapshots or re-election (e.g., in-memory caches, ephemeral workloads) |
§Developer Notes
- Log Truncation & Compaction: Logs should be truncated after snapshotting, regardless of strategy.
- Backpressure: In MemFirst, developers should implement backpressure if memory usage exceeds thresholds.
- Lazy Loading: In DiskFirst, avoid head-of-line blocking by prefetching future entries when cache misses occur.
- Flush Daemon: Use a background task to monitor and enforce the flush policy under MemFirst (sketched below).
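A sketch of such a flush daemon, assuming a Batch-style policy. The function and its signature are hypothetical, and println! stands in for the real disk write:
```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical background task enforcing a Batch policy under MemFirst.
fn spawn_flush_daemon(threshold: usize, interval: Duration) -> Sender<Vec<u8>> {
    let (tx, rx) = channel::<Vec<u8>>();
    thread::spawn(move || {
        let mut pending: Vec<Vec<u8>> = Vec::new();
        let mut last_flush = Instant::now();
        loop {
            // Wake on a new entry or when the flush interval expires.
            match rx.recv_timeout(interval) {
                Ok(entry) => pending.push(entry),
                Err(RecvTimeoutError::Timeout) => {}
                Err(RecvTimeoutError::Disconnected) => break, // engine dropped
            }
            let due = pending.len() > threshold || last_flush.elapsed() >= interval;
            if due && !pending.is_empty() {
                // Persist the batch (stand-in for the real disk write).
                println!("flushing {} entries to disk", pending.len());
                pending.clear();
                last_flush = Instant::now();
            }
        }
    });
    tx
}
```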
§Future Improvements
- Snapshot-aware recovery to reduce startup times for MemFirst.
- Tiered storage support (e.g., WAL on SSD, archival on HDD or cloud).
- Intelligent adaptive flush control based on workload.
Variants§
DiskFirst
Disk-first persistence strategy.
- Write path: On append, the log entry is first written to disk. Only after a successful disk write is it acknowledged and stored in the in-memory SkipMap.
- Read path: Reads are always served from the in-memory SkipMap.
- Startup behavior: All log entries are loaded from disk into memory at startup, ensuring consistent access speed regardless of disk state.
- Suitable for systems prioritizing strong durability while still providing in-memory performance for reads.
MemFirst
Memory-first persistence strategy.
- Write path: On append, the log entry is first written to the in-memory SkipMap and acknowledged immediately. Disk persistence happens asynchronously in the background, governed by FlushPolicy.
- Read path: Reads are always served from the in-memory SkipMap.
- Startup behavior: All log entries are loaded from disk into memory at startup, the same as DiskFirst.
- Suitable for systems that favor lower write latency and faster failover, while still retaining a disk-backed log for crash recovery.
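Because PersistenceStrategy implements Serialize and Deserialize (see the trait implementations below), the strategy can be chosen from configuration. A minimal sketch using serde_json, assuming serde's default string representation for unit variants; the LogEngineConfig struct is hypothetical:
```rust
use serde::Deserialize;

// Local stand-in mirroring the variants above; the real enum already derives
// Serialize/Deserialize. Its serialized form depends on the crate's derive
// attributes; plain strings (serde's default for unit variants) are assumed.
#[derive(Debug, Deserialize, PartialEq)]
enum PersistenceStrategy {
    DiskFirst,
    MemFirst,
}

// Hypothetical configuration struct for illustration only.
#[derive(Debug, Deserialize)]
struct LogEngineConfig {
    strategy: PersistenceStrategy,
}

fn main() -> Result<(), serde_json::Error> {
    let cfg: LogEngineConfig = serde_json::from_str(r#"{ "strategy": "MemFirst" }"#)?;
    assert_eq!(cfg.strategy, PersistenceStrategy::MemFirst);
    Ok(())
}
```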
Trait Implementations§
impl Clone for PersistenceStrategy
fn clone(&self) -> PersistenceStrategy
fn clone_from(&mut self, source: &Self)
impl Debug for PersistenceStrategy
impl<'de> Deserialize<'de> for PersistenceStrategy
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error> where __D: Deserializer<'de>
impl PartialEq for PersistenceStrategy
impl Serialize for PersistenceStrategy
impl StructuralPartialEq for PersistenceStrategy
Auto Trait Implementations§
impl Freeze for PersistenceStrategy
impl RefUnwindSafe for PersistenceStrategy
impl Send for PersistenceStrategy
impl Sync for PersistenceStrategy
impl Unpin for PersistenceStrategy
impl UnwindSafe for PersistenceStrategy
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
Source§impl<T> Instrument for T
impl<T> Instrument for T
Source§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
Source§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoRequest<T> for T
impl<T> IntoRequest<T> for T
Source§fn into_request(self) -> Request<T>
fn into_request(self) -> Request<T>
T
in a tonic::Request