Module architecture

This project strictly follows the Single Responsibility Principle (SRP) to ensure modularity and maintainability. Developers extending the codebase must adhere to this principle to preserve clean separation of concerns. Below are key guidelines and examples.


§1. Core Design Philosophy

  • Main Loop Responsibility:

    The main_loop (in src/core/raft.rs) only orchestrates event routing. It:

    1. Listens for events (e.g., RPCs, timers).
    2. Delegates handling to the current role (e.g., LeaderState, FollowerState).
    3. Manages role transitions (e.g., Leader → Follower).
  • Role-Specific Logic:

    Each role (LeaderState, FollowerState, etc.) owns its state and behavior. For example:

    • LeaderState handles log replication and heartbeat management.
    • FollowerState processes leader requests and election timeouts.

§2. Key Rules for Developers

§Rule 1: No Cross-Role Logic

A role must never directly modify another role’s state or handle its events.

Bad Example (Violates SRP):

// ❌ LeaderState handling Follower-specific logic
impl LeaderState {
    async fn handle_append_entries(...) {
        if need_step_down {
            self.become_follower(); // SRP violation: Leader manages Follower state
            self.follower_handle_entries(...); // SRP violation: Leader handles Follower logic
        }
    }
}

Correct Approach:

// ✅ Leader sends a role transition event to main_loop
impl LeaderState {
    async fn handle_append_entries(...) -> Result<()> {
        if need_step_down {
            role_tx.send(RoleEvent::BecomeFollower(...))?; // Delegate transition
            return Ok(()); // Exit immediately
        }
        // ... (Leader-specific logic)
    }
}

§Rule 2: Atomic Role Transitions

When a role changes (e.g., Leader → Follower), the new role immediately takes over event handling.

Example:

// main_loop.rs (simplified)
loop {
    let Some(event) = event_rx.recv().await else { break };
    match current_role {
        Role::Leader(leader) => {
            leader.handle_event(event).await?;
            // If leader sends RoleEvent::BecomeFollower, update `current_role` here
        }
        Role::Follower(follower) => { ... }
    }
}
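
A minimal, self-contained sketch of how such a transition can be applied, assuming the main loop drains a role-event channel after each handler call; the RoleEvent variants and the role_rx channel name here are illustrative, not the crate’s exact API:

use tokio::sync::mpsc;

// Illustrative types; the real crate's RoleEvent and Role carry more data.
enum RoleEvent {
    BecomeFollower { term: u64 },
    BecomeLearner,
}

enum Role {
    Leader,
    Follower,
    Learner,
}

// After a role handles an event, drain any transition requests it emitted via
// role_tx and swap `current_role` before the next event is processed.
fn apply_role_events(current_role: &mut Role, role_rx: &mut mpsc::UnboundedReceiver<RoleEvent>) {
    while let Ok(event) = role_rx.try_recv() {
        match event {
            RoleEvent::BecomeFollower { term: _ } => *current_role = Role::Follower,
            RoleEvent::BecomeLearner => *current_role = Role::Learner,
        }
    }
}

Because pending transition events are drained before the next loop iteration, the new role owns the very next event, which is exactly what this rule requires.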

§Rule 3: State Isolation

Role-specific data (e.g., LeaderState.next_index) must not leak into other roles.

Bad Example:

// ❌ FollowerState accessing LeaderState internals
impl FollowerState {
    fn reset_leader_index(&mut self, leader: &LeaderState) {
        self.match_index = leader.next_index; // SRP violation
    }
}

Correct Approach:

// ✅ LeaderState persists its state to disk before stepping down
impl LeaderState {
    async fn step_down(&mut self, ctx: &RaftContext) -> Result<()> {
        ctx.save_state(self.next_index).await?; // Isolate state
        Ok(())
    }
}

// ✅ FollowerState loads state from disk
impl FollowerState {
    async fn load_state(ctx: &RaftContext) -> Result<Self> {
        let next_index = ctx.load_state().await?;
        Ok(Self { match_index: next_index })
    }
}

§3. Folder Structure Alignment

The codebase enforces SRP through role-specific modules:

src/core/raft_role/
├── leader_state.rs          # Leader-only logic (e.g., log replication)
├── follower_state.rs        # Follower-only logic (e.g., election timeout)
├── candidate_state.rs       # Candidate election logic
└── mod.rs                   # Role transitions and common interfaces

§4. Adding New Features

When extending the project:

  1. Identify Ownership: Should the logic belong to a role, the main loop, or a utility module?
  2. Avoid Hybrid Roles: Never create a LeaderFollowerHybridState. Use role transitions instead.
  3. Example: Adding a “Learner” role (see the sketch after this list):
    • Create learner_state.rs with Learner-specific logic.
    • Update main_loop to handle Role::Learner.
    • Add RoleEvent::BecomeLearner for transitions.
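
A hedged skeleton of what learner_state.rs might start from; the field names and the promotion threshold are assumptions for illustration, not the project’s actual definitions:

// src/core/raft_role/learner_state.rs (illustrative skeleton)

/// Learner-specific state: replicates logs but has no vote.
pub struct LearnerState {
    pub match_index: u64,
    pub promotion_threshold: u64,
}

impl LearnerState {
    /// Learner-only logic lives here. Transitions are requested by sending a
    /// RoleEvent (e.g. BecomeLearner/BecomeFollower) to main_loop, never by
    /// mutating another role's state directly.
    pub fn is_caught_up(&self, leader_commit: u64) -> bool {
        leader_commit.saturating_sub(self.match_index) <= self.promotion_threshold
    }
}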

§5. Why SRP Matters Here

  • Predictability: Each role’s behavior is isolated and testable.
  • Safer Changes: Modifying FollowerState won’t accidentally break LeaderState.
  • Protocol Compliance: Raft requires strict role separation; SRP enforces this naturally.

By following these rules, your contributions will align with the project’s design philosophy.

§Error Handling Principles

This section outlines the core principles for distinguishing between protocol logic errors (expected business failures) and system-level errors (unrecoverable faults) across the entire project. Developers extending this codebase should strictly follow these guidelines to ensure consistency and reliability.

§1. Core Principles

| Error Type | Description | Handling Strategy |
|---|---|---|
| Protocol Logic Errors | Failures dictated by protocol rules (e.g., term mismatches, log inconsistencies). | Return Ok(()) with a protocol-compliant response (e.g., success: false). |
| System-Level Errors | Critical failures (e.g., I/O errors, channel disconnections, state corruption). | Return Error to halt normal operation; trigger recovery mechanisms (retry/alert). |
| Illegal Operation Errors | Violations of the Raft state machine rules (e.g., invalid role transitions). | Return Error immediately; indicates a bug in code logic that must be fixed. |

§2. Examples

§Example 1: Protocol Logic Error

// Reject stale AppendEntries request (term < current_term)
if req.term < self.current_term() {
    let resp = AppendEntriesResponse { success: false, term: self.current_term() };
    sender.send(Ok(resp))?;  // Protocol error → return Ok(())
    return Ok(());
}

§Example 2: System-Level Error

// Fail to persist logs (return Error)
ctx.raft_log().append_entries(&req.entries).await?; // Propagates storage error

§Example 3: Illegal Operation Error

// Follower illegally attempts to become Leader
impl RaftRoleState for FollowerState {
    fn become_leader(&self) -> Result<LeaderState> {
        warn!("Follower cannot directly become Leader");
        Err(Error::IllegalOperation("follower → leader")) // Hard error → return Error
    }
}

§3. Best Practices


  1. Atomic State Changes

    Ensure critical operations (e.g., role transitions) are atomic. If a step fails after partial execution, return Error to avoid inconsistent states.

  2. Error Classification

    Define explicit error types to distinguish recoverable and non-recoverable failures:

    enum Error {
        Protocol(String),         // Debugging only (e.g., "term mismatch")
        Storage(io::Error),       // Retry or alert
        Network(NetworkError),    // Reconnect logic
        IllegalOperation(&'static str), // Code bug → must fix
    }
  3. Handle Illegal Operations Strictly

    • Log illegal operations at error! level.
    • Add unit tests to ensure invalid transitions return Error.
    #[test]
    fn follower_cannot_become_leader() {
        let follower = FollowerState::new();
        assert!(matches!(
            follower.become_leader(),
            Err(Error::IllegalOperation("follower → leader"))
        ));
    }

§4. Extending to the Entire Project

These principles apply to all components:

  • RPC Clients/Servers:
    • Treat invalid RPC sequences (e.g., duplicate requests) as protocol errors (Ok with response).
    • Treat connection resets as system errors (Error).
  • State Machines:
    • Invalid operations (e.g., applying non-committed logs) → Protocol errors.
    • Disk write failures → System errors.
  • Cluster Management:
    • Node join conflicts (e.g., duplicate IDs) → Protocol errors (see the sketch after this list).
    • Invalid role transitions (e.g., Follower → Leader) → Illegal Operation errors.
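
To make the mapping concrete, here is a hedged sketch of a join-request handler applying these rules; JoinRequest, JoinResponse, and ClusterState are hypothetical placeholder types, not the crate’s API:

use std::collections::HashSet;
use std::io;

// Hypothetical types for illustration only.
struct JoinRequest { node_id: u64 }
struct JoinResponse { accepted: bool, reason: String }
struct ClusterState { members: HashSet<u64> }

fn handle_join(cluster: &ClusterState, req: &JoinRequest) -> Result<JoinResponse, io::Error> {
    // Duplicate node ID: a protocol error, so reply with a rejection and return Ok.
    if cluster.members.contains(&req.node_id) {
        return Ok(JoinResponse { accepted: false, reason: "duplicate node id".into() });
    }
    // Persisting the configuration change may fail at the I/O layer; that is a
    // system-level error and would be propagated with `?` rather than folded
    // into the response (persistence call omitted in this sketch).
    Ok(JoinResponse { accepted: true, reason: String::new() })
}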

By adhering to these rules, your extensions will maintain the same robustness and predictability as the core implementation.

§Raft Roles and Leader Election: Responsibilities and Step-Down Conditions

§Leader

  • Heartbeat Authority: Send periodic heartbeats to all followers/learners
  • Log Replication: Manage log entry replication pipeline
  • Conflict Resolution: Overwrite inconsistent follower logs
  • Membership Changes: Process configuration changes (learner promotion/demotion)
  • Snapshot Coordination: Trigger snapshot creation and send via InstallSnapshot RPC

§Follower

  • Request Processing:
    • Respond to Leader heartbeats
    • Append received log entries
    • Redirect client writes to Leader
  • Election Participation: Convert to candidate if election timeout expires
  • Log Consistency: Maintain locally persisted log entries
  • Snapshot Handling: Apply received snapshots and truncate obsolete logs

§Candidate

  • Election Initiation: Start new election by incrementing term
  • Vote Solicitation: Send RequestVote RPCs to all nodes
  • State Transition:
    • Become Leader if receiving majority votes
    • Revert to Follower if newer term discovered

§Learner

  • Bootstrapping:
    • Default initial role for new nodes joining the cluster.
    • Receive initial cluster state via snapshot transfer.
  • Log Synchronization:
    • Replicate logs but without voting rights.
    • Eligible for promotion to follower once caught up.
  • Failure Adaptation (see the sketch after this list):
    • Followers with excessive log lag (beyond max_lag_threshold) demote themselves to learners.
    • Learners auto-promote back to follower when their match_index reaches a configurable threshold.
  • Snapshot Protocol:
    • Leaders send full snapshots triggered by LearnerEvent.
    • Track replication progress separately for learners.
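
A small sketch of the lag checks described above, assuming the lag is measured against the leader’s commit index; max_lag_threshold comes from the text, the other names are illustrative:

/// Demotion check (sketch): a follower whose log trails the leader's commit
/// index by more than max_lag_threshold steps down to learner.
fn should_demote_to_learner(leader_commit: u64, match_index: u64, max_lag_threshold: u64) -> bool {
    leader_commit.saturating_sub(match_index) > max_lag_threshold
}

/// Promotion check (sketch): a learner whose match_index is back within the
/// configured threshold of the leader's commit index is eligible again.
fn eligible_for_promotion(leader_commit: u64, match_index: u64, promotion_threshold: u64) -> bool {
    leader_commit.saturating_sub(match_index) <= promotion_threshold
}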

§Conditions That Cause a Leader to Step Down

A leader must relinquish authority and revert to follower state under these conditions:

  1. Receives an RPC with a Higher Term

    If the leader receives any RPC (AppendEntries, RequestVote, etc.) with a term greater than its current term, it must step down immediately.

  2. Grants a Vote to Another Candidate

    If the leader grants a vote to another candidate (usually due to the candidate having a more up-to-date log), it must step down.

  3. Fails to Contact the Majority for Too Long

    If the leader cannot successfully reach a quorum of nodes for a duration longer than one election timeout, it must assume it has lost authority and step down.

§Snapshot Module Design

Defines interface and behaviors for snapshot operations in d-engine Raft implementation.

§Design Philosophy

  • Pluggable strategy: Intentional flexibility for snapshot policies and storage implementations
  • Zero-enforced approach: No default snapshot strategy - implement based on your storage needs
  • Reference implementation: sled-based demo included (not production-ready)

§Core Concepts

§Snapshot Transfer: Bidirectional Streaming Design

sequenceDiagram
    participant Learner
    participant Leader

    Learner->>Leader: 1. Request snapshot
    Leader->>Learner: 2. Stream chunks
    Learner->>Leader: 3. Send ACKs
    Leader->>Learner: 4. Retransmit missing chunks

  1. Dual-channel communication:
    • Data channel: Leader → Learner (SnapshotChunk stream)
    • Feedback channel: Learner → Leader (SnapshotAck stream)
  2. Key benefits:
    • Flow control: ACKs regulate transmission speed
    • Reliability: Per-chunk CRC verification
    • Resumability: Selective retransmission of failed chunks
    • Backpressure: Explicit backoff when receiver lags
  3. Sequence:
sequenceDiagram
    Learner->>Leader: StreamSnapshot(initial ACK)
    Leader-->>Learner: Chunk 0 (metadata)
    Learner->>Leader: ACK(seq=0, status=Accepted)
    Leader-->>Learner: Chunk 1
    Learner->>Leader: ACK(seq=1, status=ChecksumMismatch)
    Leader-->>Learner: Chunk 1 (retransmit)
    Learner->>Leader: ACK(seq=1, status=Accepted)
    loop Remaining chunks
        Leader-->>Learner: Chunk N
        Learner->>Leader: ACK(seq=N)
    end

§The bidirectional snapshot transfer is implemented as a state machine, coordinating both data transmission (chunks) and control feedback (ACKs) between Leader and Learner.
stateDiagram-v2
    [*] --> Waiting
    Waiting --> ProcessingChunk: Data available
    Waiting --> ProcessingAck: ACK received
    Waiting --> Retrying: Retry timer triggered
    ProcessingChunk --> Waiting
    ProcessingAck --> Waiting
    Retrying --> Waiting
    Waiting --> Completed: All chunks sent & acknowledged

Data Flow Summary:

Leader                          Learner
  |                                |
  |--> Stream<SnapshotChunk> ----> |
  |                                |
  |<-- Stream<SnapshotAck> <------ |
  |                                |
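
The waiting state of this state machine can be sketched as a tokio::select! loop that multiplexes data, ACKs, and the retry timer; the channel wiring and message types below are assumptions, not the crate’s actual stream plumbing:

use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

// Illustrative message types; the real SnapshotChunk/SnapshotAck carry more fields.
struct SnapshotChunk { seq: u64 }
struct SnapshotAck { seq: u64, accepted: bool }

// Leader-side sketch of the Waiting state: data available -> ProcessingChunk,
// ACK received -> ProcessingAck, retry timer -> Retrying.
async fn transfer_loop(
    mut chunks: mpsc::Receiver<SnapshotChunk>,
    mut acks: mpsc::Receiver<SnapshotAck>,
) {
    let mut retry_timer = interval(Duration::from_secs(10));
    loop {
        tokio::select! {
            Some(chunk) = chunks.recv() => {
                // ProcessingChunk: push the chunk onto the outgoing data stream (omitted).
                let _ = chunk.seq;
            }
            Some(ack) = acks.recv() => {
                // ProcessingAck: mark acknowledged, or queue a retransmit on failure.
                if !ack.accepted { /* schedule retransmission of ack.seq */ }
            }
            _ = retry_timer.tick() => {
                // Retrying: resend unacknowledged chunks; repeated timeouts fail the transfer.
            }
            else => break, // Completed: both channels closed, all chunks acknowledged.
        }
    }
}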

§Snapshot Policies

Fully customizable through SnapshotPolicy trait:

pub trait SnapshotPolicy {
    fn should_create_snapshot(&self, ctx: &SnapshotContext) -> bool;
}

pub struct SnapshotContext {
    pub last_included_index: u64,
    pub last_applied_index: u64,
    pub current_term: u64,
    pub unapplied_entries: usize,
}

Enable customized snapshot policy:

let node = NodeBuilder::new()
    .with_snapshot_policy(Arc::new(TimeBasedPolicy {
        interval: Duration::from_secs(3600),
    }))
    .build();

By default, Size-Based Policy is enabled.
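
For illustration, a hedged sketch of a size-based policy built on the SnapshotPolicy trait and SnapshotContext shown above; the threshold field and its value are assumptions, not the crate’s built-in policy:

/// Sketch of a size-based policy: snapshot once enough entries have been
/// applied since the last snapshot.
pub struct SizeBasedPolicy {
    pub entry_threshold: u64,
}

impl SnapshotPolicy for SizeBasedPolicy {
    fn should_create_snapshot(&self, ctx: &SnapshotContext) -> bool {
        ctx.last_applied_index.saturating_sub(ctx.last_included_index) >= self.entry_threshold
    }
}

It would be registered the same way as the TimeBasedPolicy example above, e.g. .with_snapshot_policy(Arc::new(SizeBasedPolicy { entry_threshold: 10_000 })).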

§Snapshot policy used when generating a new snapshot
sequenceDiagram
    participant Leader
    participant Policy
    participant StateMachine

    Leader->>Policy: Check should_create_snapshot()
    Policy-->>Leader: true/false
    alt Should create
        Leader->>StateMachine: Initiate snapshot creation
    end

Common policy types:

  • Size-based (default)
  • Time-based intervals
  • Hybrid approaches
  • External metric triggers

§Leader snapshot generation sequence diagram

sequenceDiagram
    participant Leader
    participant Follower
    participant StateMachine

    Leader->>Follower: Send chunked stream Stream<SnapshotChunk>
    Follower->>StateMachine: Create temporary database instance
    loop Chunk processing
        Follower->>Temporary DB: Write chunk data
    end
    Follower->>StateMachine: Atomically replace main database
    StateMachine->>sled: ArcSwap::store(new_db)
    Follower->>Leader: Return success response

§Transfer Mechanics

  1. Chunking:
    • Fixed-size chunks (configurable 4-16MB)
    • First chunk contains metadata
    • CRC32 checksum per chunk
  2. Rate limiting:
if config.max_bandwidth_mbps > 0 {
    // Spread each chunk over the minimum duration allowed by the bandwidth cap.
    let min_duration = Duration::from_secs_f64(chunk_size_mb as f64 / config.max_bandwidth_mbps as f64);
    tokio::time::sleep(min_duration).await;
}
  3. Error recovery:
    • Checksum mismatches trigger single-chunk retransmission (see the verification sketch after this list)
    • Out-of-order detection resets stream position
    • 10-second ACK timeout fails entire transfer
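
A hedged sketch of the per-chunk verification on the receiving side, using the crc32fast crate for the CRC32 check; the chunk and ACK field names are illustrative:

// Illustrative chunk layout; the real SnapshotChunk/SnapshotAck differ.
struct SnapshotChunk { seq: u64, crc32: u32, data: Vec<u8> }

enum AckStatus { Accepted, ChecksumMismatch }

/// Verify a received chunk and decide which ACK status to send back.
fn verify_chunk(chunk: &SnapshotChunk) -> AckStatus {
    // crc32fast::hash computes the CRC32 of the payload in one call.
    if crc32fast::hash(&chunk.data) == chunk.crc32 {
        AckStatus::Accepted
    } else {
        // A mismatch asks the sender to retransmit this single chunk only.
        AckStatus::ChecksumMismatch
    }
}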

§Module Responsibilities

| Component | Responsibilities |
|---|---|
| StateMachineHandler | Chunk validation, temporary file management, ACK generation, error handling, snapshot finalization |
| StateMachine | Snapshot application, online state replacement, consistency guarantees |
§Generating a new snapshot:
  1. [StateMachine] Creates a new DB based on the temporary file provided by the [StateMachineHandler] →
  2. [StateMachine] Generates the snapshot data →
  3. [StateMachine] Dumps the current DB into the new DB →
  4. [StateMachineHandler] Verifies policy conditions, finalizes the snapshot, and updates the snapshot version.
§Applying a snapshot:
  1. [StateMachineHandler] Receives and validates snapshot chunks →
  2. [StateMachineHandler] Writes chunks into a temporary file until complete →
  3. [StateMachineHandler] On error, sends an error response back to the sender and terminates the process →
  4. [StateMachineHandler] After all chunks have been successfully processed and validated, finalizes the snapshot →
  5. [StateMachineHandler] Passes the snapshot file to the [StateMachine] →
  6. [StateMachine] Applies the snapshot and performs online replacement, swapping the old state for the new one based on the snapshot.
§Cleaning up old snapshots:

[StateMachineHandler] automatically maintains old snapshots according to version policies; the StateMachine is not aware of this process.

§Purge Log Design

§Leader Purge Log State Management

sequenceDiagram
    participant Leader
    participant StateMachine
    participant Storage

    Leader->>StateMachine: create_snapshot()
    StateMachine->>Leader: SnapshotMeta(last_included_index=100)

    Note over Leader,Storage: Critical Atomic Operation
    Leader->>Storage: persist_last_purged(100) # 1. Update in-memory last_purged_index<br/>2. Flush to disk atomically

    Leader->>Leader: scheduled_purge_upto = Some(100) # Schedule async task
    Leader->>Background: trigger purge_task(100)

    Background->>Storage: physical_delete_logs(0..100)
    Storage->>Leader: notify_purge_complete(100)
    Leader->>Storage: verify_last_purged(100) # Optional consistency check
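
The ordering in the diagram (persist the watermark before any physical deletion) can be sketched as follows; the LogStorage trait and method signatures are assumptions modeled on the calls in the diagram, not the crate’s API:

use std::io;
use std::ops::RangeInclusive;

// Hypothetical storage trait modeled on the calls in the diagram above.
trait LogStorage {
    async fn persist_last_purged(&self, index: u64) -> io::Result<()>;
    async fn physical_delete_logs(&self, range: RangeInclusive<u64>) -> io::Result<()>;
}

// The purge watermark is made durable before any entries are physically
// deleted, so a crash between the two steps can never lose the marker.
async fn purge_logs_up_to(storage: &impl LogStorage, last_included_index: u64) -> io::Result<()> {
    storage.persist_last_purged(last_included_index).await?;
    storage.physical_delete_logs(0..=last_included_index).await?;
    Ok(())
}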

§Follower Purge Log State Management

sequenceDiagram
    participant Leader
    participant Follower
    participant FollowerStorage

    Leader->>Follower: InstallSnapshot RPC (last_included=100)

    Note over Follower,FollowerStorage: Protocol-Required Atomic Update
    Follower->>FollowerStorage: 1. Apply snapshot to state machine<br/>2. persist_last_purged(100)

    Follower->>Follower: last_purged_index = Some(100) # Volatile state update
    Follower->>Follower: pending_purge = Some(100) # Mark background task

    Follower->>Background: schedule_log_purge(100)
    Background->>FollowerStorage: delete_logs(0..100)
    FollowerStorage->>Follower: purge_complete(100)
    Follower->>Follower: pending_purge = None # Clear task status

§Raft Node Join & Promotion Architecture

§Business Objectives

  1. Read Load Distribution — Learners are promoted to Followers to share read traffic.
  2. High Availability for Mission-Critical Systems — Maintain odd-numbered quorum for safe scaling.
  3. Offline Analysis / Disaster Recovery — Support non-voting Learners for analytical or backup roles.

§Architecture Principles

| Principle | Description |
|---|---|
| Even-Count Expansion Only | Nodes must be added in pairs to keep the voting member count odd. |
| Sequential Join Handling | Leader handles join requests one at a time. |
| Atomic Join Semantics | Node either fully joins as ActiveFollower or rolls back on failure. |
| Retry on Join Failure | Join attempts include retry logic while waiting for peer readiness. |
| OpsAdmin Involvement | Manual intervention support planned for future releases (not implemented yet). |

§Node State Lifecycle

stateDiagram-v2
    [*] --> Joining
    Joining --> Syncing: Leader accepts join
    Syncing --> Ready: State synchronized
    Ready --> StandbyLearner: Promotion timeout
    Ready --> KeepAsLearner: Analytics role
    Ready --> CheckPromotion: Standard role
    CheckPromotion --> Promote: Quorum conditions met
    CheckPromotion --> Pairing: Needs partner
    Pairing --> Promote: Partner ready
    Pairing --> StandbyLearner: Pairing timeout

§Core Integration Flow

sequenceDiagram
    participant NewNode
    participant Leader
    participant RaftLog

    NewNode->>Leader: JoinRequest
    Leader->>RaftLog: ConfChange entry
    RaftLog-->>Leader: Commit confirmation
    Leader-->>NewNode: JoinResponse
    Leader->>NewNode: InstallSnapshot stream
    NewNode->>Leader: ApplyComplete
    Leader->>NewNode: AppendEntries
    NewNode->>Leader: ReadyNotification
    Leader->>ReadyQueue: Enqueue node

§Key Architectural Components

§1. Connection Isolation

  • Control Plane: Heartbeats and votes with low latency.
  • Data Plane: Log replication with high throughput.
  • Bulk Plane: Snapshot transfer which is bandwidth intensive.

§2. State Synchronization

  • Immediate Snapshot: New nodes always receive the latest snapshot first.
  • Atomic Replacement: Crash-safe application of state.
  • Incremental Logs: Catch-up via incremental log entries after snapshot.

§3. Promotion System

  • Quorum-Aware Batching:
    • Maintains odd voter count for quorum safety.
    • FIFO processing of join and promotion requests.
    • Batch multiple promotions when possible.
  • Timeout Enforcement:
    • Join Request timeout: 30 seconds.
    • Promotion window: 5 minutes.
    • Pairing wait timeout: 2 minutes.
  • Pairing Mechanism:
    • Groups nodes that require partners for promotion.
    • Jointly promotes paired nodes once both are ready.

§4. Failure Handling

  • Stale Learner Detection: Periodic status checks trigger automatic downgrade and operator alert.
  • Zombie Node Removal: Removes unresponsive nodes through batch proposals and maintenance.

§Quorum Preservation

flowchart LR
    A[Current Voters] --> B{Odd Count?}
    B -->|Yes| C[Promote Single]
    B -->|No| D[Find Pair]
    D --> E[Wait for Partner]
    E -->|Ready| F[Joint Promotion]
    E -->|Timeout| G[Demote to Standby]
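
A literal sketch of the decision in the flowchart above; the names are hypothetical and the parity check simply mirrors the diagram:

/// Decision for the next ready learner (sketch).
enum PromotionPlan {
    PromoteSingle,
    WaitForPair,
}

// Mirrors the flowchart: an odd current voter count promotes a single learner,
// otherwise a partner is found and both are promoted jointly.
fn plan_promotion(current_voter_count: usize) -> PromotionPlan {
    if current_voter_count % 2 == 1 {
        PromotionPlan::PromoteSingle
    } else {
        PromotionPlan::WaitForPair
    }
}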

§Timeout Enforcement Summary

| Scenario | Timeout | Action |
|---|---|---|
| Join Request | 30s | Abort configuration |
| Snapshot Transfer | Dynamic | Resume from last chunk |
| Promotion Window | 5 min | Demote to StandbyLearner |
| Pairing Wait | 2 min | Cancel pairing |

§Lifecycle Example

  1. New Analytics Node:

    Joining → Syncing → Ready → KeepAsLearner

  2. Standard Node Promotion:

    Joining → Syncing → Ready → Promote

  3. Paired Promotion:

    Ready → Pairing → Promote (when partner ready)

  4. Failed Node:

    Syncing → [Timeout] → Zombie → Removed


§Raft Log Persistence Architecture

This document outlines the design and behavior of the Raft log storage engine. It explains how logs are persisted, how the system handles reads and writes under different strategies, and how consistency is guaranteed across different configurations.


§Overview

Raft logs record the sequence of operations that must be replicated across all nodes in a Raft cluster. Correct and reliable storage of these logs is essential to maintaining the linearizability and safety guarantees of the protocol.

Our log engine supports two persistence strategies:

  • DiskFirst: Prioritizes durability.
  • MemFirst: Prioritizes performance.

Both strategies support configurable flush policies to control how memory contents are persisted to disk.


§Persistence Strategies

§DiskFirst

  • Write Path: On append, entries are first synchronously written to disk. Once confirmed, they are cached in memory.
  • Read Path: Reads are served from memory. If the requested entry is missing, it is loaded from disk and cached.
  • Startup Behavior: Does not preload all entries from disk into memory. Instead, entries are loaded lazily on access.
  • Durability: Ensures strong durability. A log is never considered accepted until it is safely written to disk.
  • Memory Use: Memory acts as a read-through cache for performance optimization.

§MemFirst

  • Write Path: Entries are first written to memory and acknowledged immediately. Disk persistence is handled asynchronously in the background.
  • Read Path: Reads are served from memory only. If an entry is not present in memory, it is considered nonexistent.
  • Startup Behavior: Loads all log entries from disk into memory during startup.
  • Durability: Durability is best-effort and depends on the flush policy. Recent entries may be lost if a crash occurs before flushing.
  • Memory Use: Memory holds the complete working set of logs.
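
A minimal sketch contrasting the two write paths described above; the LogEngine type and its fields are illustrative, not the actual d-engine storage types:

use std::io;

// Hypothetical engine sketch; the real d-engine storage types differ.
enum PersistenceStrategy {
    DiskFirst,
    MemFirst,
}

struct LogEngine {
    strategy: PersistenceStrategy,
    mem: Vec<Vec<u8>>, // in-memory entries (cache for DiskFirst, working set for MemFirst)
}

impl LogEngine {
    fn append(&mut self, entry: Vec<u8>) -> io::Result<()> {
        match self.strategy {
            PersistenceStrategy::DiskFirst => {
                // Synchronous disk write first; only then cache in memory.
                self.write_to_disk(&entry)?;
                self.mem.push(entry);
            }
            PersistenceStrategy::MemFirst => {
                // Acknowledge from memory immediately; a background task flushes later.
                self.mem.push(entry);
            }
        }
        Ok(())
    }

    fn write_to_disk(&self, _entry: &[u8]) -> io::Result<()> {
        // Placeholder for the real WAL write + fsync.
        Ok(())
    }
}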

§Flush Policies

Flush policies control how and when in-memory data is persisted to disk. These are especially relevant in MemFirst mode, but are also applied in DiskFirst to control how memory state is flushed (e.g., snapshots, metadata, etc).

§Types

  • Immediate

    • Flush to disk immediately after every log write.
    • Ensures maximum durability, but higher I/O latency.
  • Batch { threshold, interval }

    • Flush to disk when:
      • The number of unflushed entries exceeds threshold, or
      • The elapsed time since last flush exceeds interval milliseconds.
    • Balances performance and durability.
    • May lose recent entries on crash.
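
A hedged sketch of how the flush decision could be expressed; the FlushPolicy enum mirrors the two policies above, and should_flush is an illustrative helper, not the crate’s API:

use std::time::{Duration, Instant};

// Illustrative mirror of the two policies described above.
enum FlushPolicy {
    Immediate,
    Batch { threshold: usize, interval: Duration },
}

/// Decide whether the unflushed buffer should be written to disk now.
fn should_flush(policy: &FlushPolicy, unflushed: usize, last_flush: Instant) -> bool {
    match policy {
        FlushPolicy::Immediate => unflushed > 0,
        FlushPolicy::Batch { threshold, interval } => {
            unflushed >= *threshold || last_flush.elapsed() >= *interval
        }
    }
}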

§Read & Write Semantics

| Operation | DiskFirst | MemFirst |
|---|---|---|
| Write | Write to disk → cache in memory | Write to memory → async flush |
| Read | From memory; fallback to disk | Memory only; missing = absent |
| Startup | Lazy-loading on access | Preload all entries into memory |
| Flush | Controlled via flush_policy | Controlled via flush_policy |
| Data loss on crash | No (after disk fsync) | Possible if not flushed |

§Consistency Guarantees

| Property | DiskFirst | MemFirst |
|---|---|---|
| Linearizability | ✅ (strict) | ✅ (with quorum + sync on commit) |
| Durability (Post-Commit) | ✅ Always | ❌ Depends on flush policy |
| Availability (Under Load) | ❌ Lower | ✅ Higher |
| Crash Recovery | ✅ Strong | ❌ Recent entries may be lost |
| Startup Readiness | ✅ Fast | ❌ Slower (full load) |

| Strategy | Best For |
|---|---|
| DiskFirst | Systems that require strong durability and consistent recovery (e.g., databases, distributed ledgers) |
| MemFirst | Systems that favor latency and availability, and can tolerate recovery from snapshots or re-election (e.g., in-memory caches, ephemeral workloads) |

§Developer Notes

  • Log Truncation & Compaction: Logs should be truncated after snapshotting, regardless of strategy.
  • Backpressure: In MemFirst, developers should implement backpressure if memory usage exceeds thresholds.
  • Lazy Loading: In DiskFirst, avoid head-of-line blocking by prefetching future entries when cache misses occur.
  • Flush Daemon: Use a background task to monitor and enforce flush policy under MemFirst.

§Future Improvements

  • Snapshot-aware recovery to reduce startup times for MemFirst.
  • Tiered storage support (e.g., WAL on SSD, archival on HDD or cloud).
  • Intelligent adaptive flush control based on workload.