This project strictly follows the Single Responsibility Principle (SRP) to ensure modularity and maintainability. Developers extending the codebase must adhere to this principle to preserve clean separation of concerns. Below are key guidelines and examples.
§1. Core Design Philosophy
- Main Loop Responsibility:
  The main_loop (in src/core/raft.rs) only orchestrates event routing. It:
  - Listens for events (e.g., RPCs, timers).
  - Delegates handling to the current role (e.g., LeaderState, FollowerState).
  - Manages role transitions (e.g., Leader → Follower).
- Role-Specific Logic:
  Each role (LeaderState, FollowerState, etc.) owns its state and behavior. For example:
  - LeaderState handles log replication and heartbeat management.
  - FollowerState processes leader requests and election timeouts.
§2. Key Rules for Developers
§Rule 1: No Cross-Role Logic
A role must never directly modify another role’s state or handle its events.
Bad Example (Violates SRP):
// ❌ LeaderState handling Follower-specific logic
impl LeaderState {
async fn handle_append_entries(...) {
if need_step_down {
self.become_follower(); // SRP violation: Leader manages Follower state
self.follower_handle_entries(...); // SRP violation: Leader handles Follower logic
}
}
}
Correct Approach:
// ✅ Leader sends a role transition event to main_loop
impl LeaderState {
async fn handle_append_entries(...) -> Result<()> {
if need_step_down {
role_tx.send(RoleEvent::BecomeFollower(...))?; // Delegate transition
return Ok(()); // Exit immediately
}
// ... (Leader-specific logic)
}
}
§Rule 2: Atomic Role Transitions
When a role changes (e.g., Leader → Follower), the new role immediately takes over event handling.
Example:
// main_loop.rs (simplified)
loop {
let event = event_rx.recv().await;
match current_role {
Role::Leader(leader) => {
leader.handle_event(event).await?;
// If leader sends RoleEvent::BecomeFollower, update `current_role` here
}
Role::Follower(follower) => { ... }
}
}
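How the "update current_role here" comment is realized is internal to the crate; the following is a minimal, self-contained sketch with illustrative types (not the actual d-engine API) showing the swap main_loop performs when it receives the transition event:
struct LeaderState;
struct FollowerState;

enum Role {
    Leader(LeaderState),
    Follower(FollowerState),
}

enum RoleEvent {
    BecomeFollower(Option<u64>), // optionally carries the new leader's id
}

fn apply_role_event(current_role: &mut Role, event: RoleEvent) {
    match event {
        RoleEvent::BecomeFollower(_) => {
            // The old role value is dropped; the new FollowerState owns all subsequent events.
            *current_role = Role::Follower(FollowerState);
        }
    }
}
Because the swap happens in main_loop before the next event is dequeued, no event is ever handled by a stale role.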
§Rule 3: State Isolation
Role-specific data (e.g., LeaderState.next_index) must not leak into other roles.
Bad Example:
// ❌ FollowerState accessing LeaderState internals
impl FollowerState {
fn reset_leader_index(&mut self, leader: &LeaderState) {
self.match_index = leader.next_index; // SRP violation
}
}
Correct Approach:
// ✅ LeaderState persists its state to disk before stepping down
impl LeaderState {
async fn step_down(&mut self, ctx: &RaftContext) -> Result<()> {
ctx.save_state(self.next_index).await?; // Isolate state
Ok(())
}
}
// ✅ FollowerState loads state from disk
impl FollowerState {
async fn load_state(ctx: &RaftContext) -> Result<Self> {
let next_index = ctx.load_state().await?;
Ok(Self { match_index: next_index })
}
}
§3. Folder Structure Alignment
The codebase enforces SRP through role-specific modules:
src/core/raft_role/
├── leader_state.rs # Leader-only logic (e.g., log replication)
├── follower_state.rs # Follower-only logic (e.g., election timeout)
├── candidate_state.rs # Candidate election logic
└── mod.rs # Role transitions and common interfaces
§4. Adding New Features
When extending the project:
- Identify Ownership: Should the logic belong to a role, the main loop, or a utility module?
- Avoid Hybrid Roles: Never create a LeaderFollowerHybridState. Use role transitions instead.
- Example: Adding a “Learner” role (a sketch follows this list):
  - Create learner_state.rs with Learner-specific logic.
  - Update main_loop to handle Role::Learner.
  - Add RoleEvent::BecomeLearner for transitions.
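A hedged skeleton of what the Learner addition could look like; the module path follows the folder layout above, but the type and method names are illustrative rather than the crate's actual API:
// src/core/raft_role/learner_state.rs (illustrative skeleton)
pub struct LearnerState {
    match_index: u64, // replication progress; Learners have no voting rights
}

impl LearnerState {
    // Learner-specific logic only: replicate logs and track progress, never vote.
    pub fn handle_replicated_entries(&mut self, last_index: u64) {
        self.match_index = last_index;
    }
}

// In main_loop, a new match arm delegates events to the Learner:
//   Role::Learner(learner) => learner.handle_event(event).await?,
// and transitions go through an event such as RoleEvent::BecomeLearner.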
§5. Why SRP Matters Here
- Predictability: Each role’s behavior is isolated and testable.
- Safer Changes: Modifying FollowerState won’t accidentally break LeaderState.
- Protocol Compliance: Raft requires strict role separation; SRP enforces this naturally.
By following these rules, your contributions will align with the project’s design philosophy.

This section outlines the core principles for distinguishing between protocol logic errors (expected business failures) and system-level errors (unrecoverable faults) across the entire project. Developers extending this codebase should strictly follow these guidelines to ensure consistency and reliability.
§1. Core Principles
Error Type | Description | Handling Strategy |
---|---|---|
Protocol Logic Errors | Failures dictated by protocol rules (e.g., term mismatches, log inconsistencies). | Return Ok(()) with a protocol-compliant response (e.g., success: false). |
System-Level Errors | Critical failures (e.g., I/O errors, channel disconnections, state corruption). | Return Error to halt normal operation; trigger recovery mechanisms (retry/alert). |
Illegal Operation Errors | Violations of the Raft state machine rules (e.g., invalid role transitions). | Return Error immediately; indicates a bug in code logic that must be fixed. |
§2. Examples
§Example 1: Protocol Logic Error
// Reject stale AppendEntries request (term < current_term)
if req.term < self.current_term() {
let resp = AppendEntriesResponse { success: false, term: self.current_term() };
sender.send(Ok(resp))?; // Protocol error → return Ok(())
return Ok(());
}
§Example 2: System-Level Error
// Fail to persist logs (return Error)
ctx.raft_log().append_entries(&req.entries).await?; // Propagates storage error
§Example 3: Illegal Operation Error
// Follower illegally attempts to become Leader
impl RaftRoleState for FollowerState {
fn become_leader(&self) -> Result<LeaderState> {
warn!("Follower cannot directly become Leader");
Err(Error::IllegalOperation("follower → leader")) // Hard error → return Error
}
}
§3. Best Practices
- Atomic State Changes
  Ensure critical operations (e.g., role transitions) are atomic. If a step fails after partial execution, return Error to avoid inconsistent states.
- Error Classification
  Define explicit error types to distinguish recoverable and non-recoverable failures:

  enum Error {
      Protocol(String),               // Debugging only (e.g., "term mismatch")
      Storage(io::Error),             // Retry or alert
      Network(NetworkError),          // Reconnect logic
      IllegalOperation(&'static str), // Code bug → must fix
  }

- Handle Illegal Operations Strictly
  - Log illegal operations at error! level.
  - Add unit tests to ensure invalid transitions return Error.

  #[test]
  fn follower_cannot_become_leader() {
      let follower = FollowerState::new();
      assert!(matches!(
          follower.become_leader(),
          Err(Error::IllegalOperation("follower → leader"))
      ));
  }
§4. Extending to the Entire Project
These principles apply to all components:
- RPC Clients/Servers:
  - Treat invalid RPC sequences (e.g., duplicate requests) as protocol errors (Ok with response).
  - Treat connection resets as system errors (Error).
- State Machines:
  - Invalid operations (e.g., applying non-committed logs) → Protocol errors.
  - Disk write failures → System errors.
- Cluster Management (a sketch follows this list):
  - Node join conflicts (e.g., duplicate IDs) → Protocol errors.
  - Invalid role transitions (e.g., Follower → Leader) → Illegal Operation errors.
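A minimal sketch of the cluster-management case, using illustrative types rather than the crate's actual API: a duplicate node ID is answered as a protocol error, while a failed persistence step would propagate as a system error.
use std::collections::HashSet;

struct JoinResponse {
    accepted: bool,
    reason: String,
}

fn handle_join(members: &mut HashSet<u64>, node_id: u64) -> Result<JoinResponse, std::io::Error> {
    // Protocol error: a duplicate node ID is answered with a rejection, and the
    // handler itself still returns Ok — the cluster keeps operating normally.
    if members.contains(&node_id) {
        return Ok(JoinResponse { accepted: false, reason: "duplicate node id".into() });
    }
    // System error: a failed persistence step (elided here) would propagate with `?`.
    members.insert(node_id);
    Ok(JoinResponse { accepted: true, reason: String::new() })
}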
By adhering to these rules, your extensions will maintain the same robustness and predictability as the core implementation.
§Raft Roles and Leader Election: Responsibilities and Step-Down Conditions
§Leader
- Heartbeat Authority: Send periodic heartbeats to all followers/learners
- Log Replication: Manage log entry replication pipeline
- Conflict Resolution: Overwrite inconsistent follower logs
- Membership Changes: Process configuration changes (learner promotion/demotion)
- Snapshot Coordination: Trigger snapshot creation and send via InstallSnapshot RPC
§Follower
- Request Processing:
  - Respond to Leader heartbeats
  - Append received log entries
  - Redirect client writes to Leader
- Election Participation: Convert to candidate if election timeout expires
- Log Consistency: Maintain locally persisted log entries
- Snapshot Handling: Apply received snapshots and truncate obsolete logs
§Candidate
- Election Initiation: Start new election by incrementing term
- Vote Solicitation: Send RequestVote RPCs to all nodes
- State Transition:
  - Become Leader if receiving majority votes
  - Revert to Follower if newer term discovered
§Learner
- Bootstrapping:
  - Default initial role for new nodes joining the cluster.
  - Receive initial cluster state via snapshot transfer.
- Log Synchronization:
  - Replicate logs but without voting rights.
  - Eligible for promotion to follower once caught up.
- Failure Adaptation (see the sketch after this list):
  - Followers with excessive log lag (beyond max_lag_threshold) demote themselves to learners.
  - Learners auto-promote back to follower when their match_index reaches a configurable threshold.
- Snapshot Protocol:
  - Leaders send full snapshots triggered by LearnerEvent.
  - Track replication progress separately for learners.
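A minimal sketch of the lag-based demotion and promotion checks described under Failure Adaptation; max_lag_threshold and match_index mirror the prose above, while the struct and the promotion threshold are illustrative assumptions:
struct ReplicationProgress {
    match_index: u64,       // highest log index known to be replicated on this node
    leader_last_index: u64, // leader's latest log index
}

// Follower side: demote to Learner once the lag exceeds max_lag_threshold.
fn should_demote_to_learner(p: &ReplicationProgress, max_lag_threshold: u64) -> bool {
    p.leader_last_index.saturating_sub(p.match_index) > max_lag_threshold
}

// Learner side: promote back to Follower once caught up within the promotion threshold.
fn should_promote_to_follower(p: &ReplicationProgress, promotion_lag_threshold: u64) -> bool {
    p.leader_last_index.saturating_sub(p.match_index) <= promotion_lag_threshold
}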
§Conditions That Cause a Leader to Step Down
A leader must relinquish authority and revert to follower state under these conditions:
- Receives an RPC with a Higher Term
  If the leader receives any RPC (AppendEntries, RequestVote, etc.) with a term greater than its current term, it must step down immediately (see the sketch after this list).
- Grants a Vote to Another Candidate
  If the leader grants a vote to another candidate (usually due to the candidate having a more up-to-date log), it must step down.
- Fails to Contact the Majority for Too Long
  If the leader cannot successfully reach a quorum of nodes for a duration longer than one election timeout, it must assume it has lost authority and step down.
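A hedged sketch of the first condition, reusing the role-event pattern from the SRP guidelines; the helper names (current_term, update_term, role_tx) are illustrative:
// Inside any Leader RPC handler: a higher term forces an immediate step-down.
if req.term > self.current_term() {
    self.update_term(req.term);                      // adopt the newer term first
    role_tx.send(RoleEvent::BecomeFollower(None))?;  // delegate the transition to main_loop
    return Ok(());                                   // stop handling the request as Leader
}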
§Snapshot Module Design
Defines the interface and behaviors for snapshot operations in the d-engine Raft implementation.
§Design Philosophy
- Pluggable strategy: Intentional flexibility for snapshot policies and storage implementations
- Zero-enforced approach: No default snapshot strategy - implement based on your storage needs
- Reference implementation: sled-based demo included (not production-ready)
§Core Concepts
§Snapshot Transfer: Bidirectional Streaming Design
sequenceDiagram
participant Learner
participant Leader
Learner->>Leader: 1. Request snapshot
Leader->>Learner: 2. Stream chunks
Learner->>Leader: 3. Send ACKs
Leader->>Learner: 4. Retransmit missing chunks
- Dual-channel communication:
  - Data channel: Leader → Learner (SnapshotChunk stream)
  - Feedback channel: Learner → Leader (SnapshotAck stream)
- Key benefits:
  - Flow control: ACKs regulate transmission speed
  - Reliability: Per-chunk CRC verification
  - Resumability: Selective retransmission of failed chunks
  - Backpressure: Explicit backoff when receiver lags
- Sequence:
sequenceDiagram
Learner->>Leader: StreamSnapshot(initial ACK)
Leader-->>Learner: Chunk 0 (metadata)
Learner->>Leader: ACK(seq=0, status=Accepted)
Leader-->>Learner: Chunk 1
Learner->>Leader: ACK(seq=1, status=ChecksumMismatch)
Leader-->>Learner: Chunk 1 (retransmit)
Learner->>Leader: ACK(seq=1, status=Accepted)
loop Remaining chunks
Leader-->>Learner: Chunk N
Learner->>Leader: ACK(seq=N)
end
§The bidirectional snapshot transfer is implemented as a state machine, coordinating both data transmission (chunks) and control feedback (ACKs) between Leader and Learner.
stateDiagram-v2
[*] --> Waiting
Waiting --> ProcessingChunk: Data available
Waiting --> ProcessingAck: ACK received
Waiting --> Retrying: Retry timer triggered
ProcessingChunk --> Waiting
ProcessingAck --> Waiting
Retrying --> Waiting
Waiting --> Completed: All chunks sent & acknowledged
Data Flow Summary:
Leader Learner
| |
|--> Stream<SnapshotChunk> ----> |
| |
|<-- Stream<SnapshotAck> <------ |
| |
§Snapshot Policies
Fully customizable through the SnapshotPolicy trait:
pub trait SnapshotPolicy {
fn should_create_snapshot(&self, ctx: &SnapshotContext) -> bool;
}
pub struct SnapshotContext {
pub last_included_index: u64,
pub last_applied_index: u64,
pub current_term: u64,
pub unapplied_entries: usize,
}
Enable a customized snapshot policy:
let node = NodeBuilder::new()
.with_snapshot_policy(Arc::new(TimeBasedPolicy {
interval: Duration::from_secs(3600),
}))
.build();
By default, Size-Based Policy is enabled.
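As an illustration, a minimal size-based policy could be expressed against the SnapshotPolicy trait above; the struct and threshold field are assumptions, not the crate's built-in implementation:
pub struct SizeBasedPolicy {
    pub max_unapplied_entries: usize, // create a snapshot once this many entries have accumulated
}

impl SnapshotPolicy for SizeBasedPolicy {
    fn should_create_snapshot(&self, ctx: &SnapshotContext) -> bool {
        ctx.unapplied_entries >= self.max_unapplied_entries
    }
}
Such a policy would be plugged in through NodeBuilder::with_snapshot_policy exactly as in the TimeBasedPolicy example above.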
§Snapshot policy used when generating a new snapshot
sequenceDiagram
participant Leader
participant Policy
participant StateMachine
Leader->>Policy: Check should_create_snapshot()
Policy-->>Leader: true/false
alt Should create
Leader->>StateMachine: Initiate snapshot creation
end
Common policy types:
- Size-based (default)
- Time-based intervals
- Hybrid approaches
- External metric triggers
§Leader snapshot generation sequence diagram
sequenceDiagram
participant Leader
participant Follower
participant StateMachine
Leader->>Follower: Send chunked stream Stream<SnapshotChunk>
Follower->>StateMachine: Create temporary database instance
loop Chunk processing
Follower->>Temporary DB: Write chunk data
end
Follower->>StateMachine: Atomically replace main database
StateMachine->>sled: ArcSwap::store(new_db)
Follower->>Leader: Return success response
§Transfer Mechanics
- Chunking:
  - Fixed-size chunks (configurable 4-16MB)
  - First chunk contains metadata
  - CRC32 checksum per chunk (see the sketch after this list)
- Rate limiting:
  if config.max_bandwidth_mbps > 0 {
      // MB divided by MB/s gives the minimum transmission time for this chunk.
      let min_duration = Duration::from_secs_f64(chunk_size_mb as f64 / config.max_bandwidth_mbps as f64);
      tokio::time::sleep(min_duration).await;
  }
- Error recovery:
  - Checksum mismatches trigger single-chunk retransmission
  - Out-of-order detection resets stream position
  - 10-second ACK timeout fails entire transfer
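To illustrate the per-chunk CRC32 verification noted under Chunking, here is a hedged sketch using the crc32fast crate; the chunk fields are assumptions, not the actual SnapshotChunk message:
struct Chunk {
    seq: u64,
    data: Vec<u8>,
    crc32: u32, // checksum computed by the sender over `data`
}

// Receiver side: verify the checksum before writing the chunk into the temporary file;
// a mismatch triggers an ACK with a ChecksumMismatch status and a single-chunk retransmit.
fn verify_chunk(chunk: &Chunk) -> bool {
    crc32fast::hash(&chunk.data) == chunk.crc32
}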
§Module Responsibilities
Component | Responsibilities |
---|---|
StateMachineHandler | Chunk validation; temporary file management; ACK generation; error handling; snapshot finalization |
StateMachine | Snapshot application; online state replacement; consistency guarantees |
§Generating a new snapshot:
- [StateMachine] Generate a new DB based on the temporary file provided by the [StateMachineHandler] →
- [StateMachine] Generate the data →
- [StateMachine] Dump the current DB into the new DB →
- [StateMachineHandler] Verify policy conditions, finalize the snapshot, and update the snapshot version.
§Applying a snapshot:
- [StateMachineHandler] Receive and validate snapshot chunks →
- [StateMachineHandler] Write chunks into a temporary file until all succeed →
- [StateMachineHandler] On error, send an error response back to the sender and terminate the process →
- [StateMachineHandler] After all chunks have been successfully processed and validated, finalize the snapshot →
- [StateMachineHandler] Pass the snapshot file to the [StateMachine] →
- [StateMachine] Apply the snapshot and perform online replacement, swapping the old state for the new one based on the snapshot.
§Cleaning up old snapshots:
[StateMachineHandler] automatically maintains old snapshots according to version policies; the StateMachine remains unaware of this process.
§Purge Log Design
§Leader Purge Log State Management
sequenceDiagram
participant Leader
participant StateMachine
participant Storage
Leader->>StateMachine: create_snapshot()
StateMachine->>Leader: SnapshotMeta(last_included_index=100)
Note over Leader,Storage: Critical Atomic Operation
Leader->>Storage: persist_last_purged(100) # 1. Update in-memory last_purged_index<br/>2. Flush to disk atomically
Leader->>Leader: scheduled_purge_upto = Some(100) # Schedule async task
Leader->>Background: trigger purge_task(100)
Background->>Storage: physical_delete_logs(0..100)
Storage->>Leader: notify_purge_complete(100)
Leader->>Storage: verify_last_purged(100) # Optional consistency check
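A hedged sketch of the leader-side sequence above; the method and field names (persist_last_purged, scheduled_purge_upto, purge_task) mirror the diagram and are illustrative, not the crate's actual API:
async fn purge_after_snapshot(&mut self, last_included_index: u64) -> Result<()> {
    // 1. Durability point: persist the purge boundary before deleting anything.
    self.storage.persist_last_purged(last_included_index).await?;
    // 2. Schedule the physical deletion asynchronously; the boundary is already safe on disk.
    self.scheduled_purge_upto = Some(last_included_index);
    tokio::spawn(purge_task(self.storage.clone(), last_included_index));
    Ok(())
}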
§Follower Purge Log State Management
sequenceDiagram
participant Leader
participant Follower
participant FollowerStorage
Leader->>Follower: InstallSnapshot RPC (last_included=100)
Note over Follower,FollowerStorage: Protocol-Required Atomic Update
Follower->>FollowerStorage: 1. Apply snapshot to state machine<br/>2. persist_last_purged(100)
Follower->>Follower: last_purged_index = Some(100) # Volatile state update
Follower->>Follower: pending_purge = Some(100) # Mark background task
Follower->>Background: schedule_log_purge(100)
Background->>FollowerStorage: delete_logs(0..100)
FollowerStorage->>Follower: purge_complete(100)
Follower->>Follower: pending_purge = None # Clear task status
§Raft Node Join & Promotion Architecture
§Business Objectives
- Read Load Distribution — Learners are promoted to Followers to share read traffic.
- High Availability for Mission-Critical Systems — Maintain odd-numbered quorum for safe scaling.
- Offline Analysis / Disaster Recovery — Support non-voting Learners for analytical or backup roles.
§Architecture Principles
Principle | Description |
---|---|
Even-Count Expansion Only | Nodes must be added in pairs to keep the voting member count odd. |
Sequential Join Handling | Leader handles join requests one at a time. |
Atomic Join Semantics | Node either fully joins as ActiveFollower or rolls back on failure. |
Retry on Join Failure | Join attempts include retry logic while waiting for peer readiness. |
OpsAdmin Involvement | Manual intervention support planned for future releases (not implemented yet). |
§Node State Lifecycle
stateDiagram-v2
[*] --> Joining
Joining --> Syncing: Leader accepts join
Syncing --> Ready: State synchronized
Ready --> StandbyLearner: Promotion timeout
Ready --> KeepAsLearner: Analytics role
Ready --> CheckPromotion: Standard role
CheckPromotion --> Promote: Quorum conditions met
CheckPromotion --> Pairing: Needs partner
Pairing --> Promote: Partner ready
Pairing --> StandbyLearner: Pairing timeout
§Core Integration Flow
sequenceDiagram
participant NewNode
participant Leader
participant RaftLog
NewNode->>Leader: JoinRequest
Leader->>RaftLog: ConfChange entry
RaftLog-->>Leader: Commit confirmation
Leader-->>NewNode: JoinResponse
Leader->>NewNode: InstallSnapshot stream
NewNode->>Leader: ApplyComplete
Leader->>NewNode: AppendEntries
NewNode->>Leader: ReadyNotification
Leader->>ReadyQueue: Enqueue node
§Key Architectural Components
§1. Connection Isolation
- Control Plane: Heartbeats and votes with low latency.
- Data Plane: Log replication with high throughput.
- Bulk Plane: Snapshot transfer, which is bandwidth-intensive.
§2. State Synchronization
- Immediate Snapshot: New nodes always receive the latest snapshot first.
- Atomic Replacement: Crash-safe application of state.
- Incremental Logs: Catch-up via incremental log entries after snapshot.
§3. Promotion System
- Quorum-Aware Batching:
  - Maintains odd voter count for quorum safety.
  - FIFO processing of join and promotion requests.
  - Batch multiple promotions when possible.
- Timeout Enforcement:
  - Join Request timeout: 30 seconds.
  - Promotion window: 5 minutes.
  - Pairing wait timeout: 2 minutes.
- Pairing Mechanism:
  - Groups nodes that require partners for promotion.
  - Jointly promotes paired nodes once both are ready.
§4. Failure Handling
- Stale Learner Detection: Periodic status checks trigger automatic downgrade and operator alert.
- Zombie Node Removal: Removes unresponsive nodes through batch proposals and maintenance.
§Quorum Preservation
flowchart LR
A[Current Voters] --> B{Odd Count?}
B -->|Yes| C[Promote Single]
B -->|No| D[Find Pair]
D --> E[Wait for Partner]
E -->|Ready| F[Joint Promotion]
E -->|Timeout| G[Demote to Standby]
§Timeout Enforcement Summary
Scenario | Timeout | Action |
---|---|---|
Join Request | 30s | Abort configuration |
Snapshot Transfer | Dynamic | Resume from last chunk |
Promotion Window | 5min | Demote to StandbyLearner |
Pairing Wait | 2min | Cancel pairing |
§Lifecycle Example
- New Analytics Node:
  Joining → Syncing → Ready → KeepAsLearner
- Standard Node Promotion:
  Joining → Syncing → Ready → Promote
- Paired Promotion:
  Ready → Pairing → Promote (when partner ready)
- Failed Node:
  Syncing → [Timeout] → Zombie → Removed
§Conditions That Cause a Leader to Step Down
A leader will step down in the following situations:
- Receives an RPC with a Higher Term If the leader receives an AppendEntries, RequestVote, or other RPC with a term number greater than its current term, it must step down and revert to a follower.
- Grants a Vote to Another Candidate If the leader receives a RequestVote RPC and grants the vote to another candidate (usually because the candidate has a more up-to-date log), the leader must step down.
- Fails to Contact the Majority for Too Long If the leader cannot reach a quorum (majority of nodes) for longer than one election timeout, it must assume it no longer holds authority and step down.
§Raft Log Persistence Architecture
This document outlines the design and behavior of the Raft log storage engine. It explains how logs are persisted, how the system handles reads and writes under different strategies, and how consistency is guaranteed across different configurations.
§Overview
Raft logs record the sequence of operations that must be replicated across all nodes in a Raft cluster. Correct and reliable storage of these logs is essential to maintaining the linearizability and safety guarantees of the protocol.
Our log engine supports two persistence strategies:
- DiskFirst: Prioritizes durability.
- MemFirst: Prioritizes performance.
Both strategies support configurable flush policies to control how memory contents are persisted to disk.
§Persistence Strategies
§DiskFirst
- Write Path: On append, entries are first synchronously written to disk. Once confirmed, they are cached in memory.
- Read Path: Reads are served from memory. If the requested entry is missing, it is loaded from disk and cached.
- Startup Behavior: Does not preload all entries from disk into memory. Instead, entries are loaded lazily on access.
- Durability: Ensures strong durability. A log is never considered accepted until it is safely written to disk.
- Memory Use: Memory acts as a read-through cache for performance optimization.
§MemFirst
- Write Path: Entries are first written to memory and acknowledged immediately. Disk persistence is handled asynchronously in the background.
- Read Path: Reads are served from memory only. If an entry is not present in memory, it is considered nonexistent.
- Startup Behavior: Loads all log entries from disk into memory during startup.
- Durability: Durability is best-effort and depends on the flush policy. Recent entries may be lost if a crash occurs before flushing.
- Memory Use: Memory holds the complete working set of logs.
§Flush Policies
Flush policies control how and when in-memory data is persisted to disk. These are especially relevant in MemFirst mode, but are also applied in DiskFirst to control how memory state is flushed (e.g., snapshots, metadata, etc.).
§Types
- Immediate
  - Flush to disk immediately after every log write.
  - Ensures maximum durability, but higher I/O latency.
- Batch { threshold, interval }
  - Flush to disk when:
    - The number of unflushed entries exceeds threshold, or
    - The elapsed time since the last flush exceeds interval milliseconds.
  - Balances performance and durability.
  - May lose recent entries on crash.
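A minimal sketch of how these two policies might be expressed as a configuration type; the exact type and field names in d-engine may differ:
use std::time::{Duration, Instant};

pub enum FlushPolicy {
    Immediate,
    Batch { threshold: usize, interval: Duration },
}

impl FlushPolicy {
    // Decide whether the buffered entries should be flushed to disk now.
    pub fn should_flush(&self, unflushed_entries: usize, last_flush: Instant) -> bool {
        match self {
            FlushPolicy::Immediate => true,
            FlushPolicy::Batch { threshold, interval } => {
                unflushed_entries >= *threshold || last_flush.elapsed() >= *interval
            }
        }
    }
}
A background flush daemon (see Developer Notes below) would typically evaluate such a check on a timer and after each append.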
§Read & Write Semantics
Operation | DiskFirst | MemFirst |
---|---|---|
Write | Write to disk → cache in memory | Write to memory → async flush |
Read | From memory; fallback to disk | Memory only; missing = absent |
Startup | Lazy-loading on access | Preload all entries into memory |
Flush | Controlled via flush_policy | Controlled via flush_policy |
Data loss on crash | No (after disk fsync) | Possible if not flushed |
§Consistency Guarantees
Property | DiskFirst | MemFirst |
---|---|---|
Linearizability | ✅ (strict) | ✅ (with quorum + sync on commit) |
Durability (Post-Commit) | ✅ Always | ❌ Depends on flush policy |
Availability (Under Load) | ❌ Lower | ✅ Higher |
Crash Recovery | ✅ Strong | ❌ Recent entries may be lost |
Startup Readiness | ✅ Fast | ❌ Slower (full load) |
§Recommended Use Cases
Strategy | Best For |
---|---|
DiskFirst | Systems that require strong durability and consistent recovery (e.g., databases, distributed ledgers) |
MemFirst | Systems that favor latency and availability, and can tolerate recovery from snapshots or re-election (e.g., in-memory caches, ephemeral workloads) |
§Developer Notes
- Log Truncation & Compaction: Logs should be truncated after snapshotting, regardless of strategy.
- Backpressure: In MemFirst, developers should implement backpressure if memory usage exceeds thresholds.
- Lazy Loading: In DiskFirst, avoid head-of-line blocking by prefetching future entries when cache misses occur.
- Flush Daemon: Use a background task to monitor and enforce the flush policy under MemFirst.
§Future Improvements
- Snapshot-aware recovery to reduce startup times for MemFirst.
- Tiered storage support (e.g., WAL on SSD, archival on HDD or cloud).
- Intelligent adaptive flush control based on workload.