This project strictly follows the Single Responsibility Principle (SRP) to ensure modularity and maintainability. Developers extending the codebase must adhere to this principle to preserve clean separation of concerns. Below are key guidelines and examples.
§1. Core Design Philosophy
- Main Loop Responsibility:
  The main_loop (in src/core/raft.rs) only orchestrates event routing. It:
  - Listens for events (e.g., RPCs, timers).
  - Delegates handling to the current role (e.g., LeaderState, FollowerState).
  - Manages role transitions (e.g., Leader → Follower).
- Role-Specific Logic:
  Each role (LeaderState, FollowerState, etc.) owns its state and behavior. For example:
  - LeaderState handles log replication and heartbeat management.
  - FollowerState processes leader requests and election timeouts.
§2. Key Rules for Developers
§Rule 1: No Cross-Role Logic
A role must never directly modify another role’s state or handle its events.
Bad Example (Violates SRP):
// ❌ LeaderState handling Follower-specific logic
impl LeaderState {
async fn handle_append_entries(...) {
if need_step_down {
self.become_follower(); // SRP violation: Leader manages Follower state
self.follower_handle_entries(...); // SRP violation: Leader handles Follower logic
}
}
}
Correct Approach:
// ✅ Leader sends a role transition event to main_loop
impl LeaderState {
async fn handle_append_entries(...) -> Result<()> {
if need_step_down {
role_tx.send(RoleEvent::BecomeFollower(...))?; // Delegate transition
return Ok(()); // Exit immediately
}
// ... (Leader-specific logic)
}
}
§Rule 2: Atomic Role Transitions
When a role changes (e.g., Leader → Follower), the new role immediately takes over event handling.
Example:
// main_loop.rs (simplified)
loop {
let event = event_rx.recv().await;
match current_role {
Role::Leader(leader) => {
leader.handle_event(event).await?;
// If leader sends RoleEvent::BecomeFollower, update `current_role` here
}
Role::Follower(follower) => { ... }
}
}
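How the "update current_role here" comment is realized is internal to the crate; the following is a minimal, self-contained sketch with illustrative types (not the actual d-engine API) showing the swap main_loop performs when it receives the transition event:
struct LeaderState;
struct FollowerState;

enum Role {
    Leader(LeaderState),
    Follower(FollowerState),
}

enum RoleEvent {
    BecomeFollower(Option<u64>), // optionally carries the new leader's id
}

fn apply_role_event(current_role: &mut Role, event: RoleEvent) {
    match event {
        RoleEvent::BecomeFollower(_) => {
            // The old role value is dropped; the new FollowerState owns all subsequent events.
            *current_role = Role::Follower(FollowerState);
        }
    }
}
Because the swap happens in main_loop before the next event is dequeued, no event is ever handled by a stale role.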
§Rule 3: State Isolation
Role-specific data (e.g., LeaderState.next_index) must not leak into other roles.
Bad Example:
// ❌ FollowerState accessing LeaderState internals
impl FollowerState {
fn reset_leader_index(&mut self, leader: &LeaderState) {
self.match_index = leader.next_index; // SRP violation
}
}
Correct Approach:
// ✅ LeaderState persists its state to disk before stepping down
impl LeaderState {
async fn step_down(&mut self, ctx: &RaftContext) -> Result<()> {
ctx.save_state(self.next_index).await?; // Isolate state
Ok(())
}
}
// ✅ FollowerState loads state from disk
impl FollowerState {
async fn load_state(ctx: &RaftContext) -> Result<Self> {
let next_index = ctx.load_state().await?;
Ok(Self { match_index: next_index })
}
}
§3. Folder Structure Alignment
The codebase enforces SRP through role-specific modules:
src/core/raft_role/
├── leader_state.rs # Leader-only logic (e.g., log replication)
├── follower_state.rs # Follower-only logic (e.g., election timeout)
├── candidate_state.rs # Candidate election logic
└── mod.rs # Role transitions and common interfaces
§4. Adding New Features
When extending the project:
- Identify Ownership: Should the logic belong to a role, the main loop, or a utility module?
- Avoid Hybrid Roles: Never create a LeaderFollowerHybridState. Use role transitions instead.
- Example: Adding a “Learner” role (a sketch follows this list):
  - Create learner_state.rs with Learner-specific logic.
  - Update main_loop to handle Role::Learner.
  - Add RoleEvent::BecomeLearner for transitions.
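A hedged skeleton of what the Learner addition could look like; the module path follows the folder layout above, but the type and method names are illustrative rather than the crate's actual API:
// src/core/raft_role/learner_state.rs (illustrative skeleton)
pub struct LearnerState {
    match_index: u64, // replication progress; Learners have no voting rights
}

impl LearnerState {
    // Learner-specific logic only: replicate logs and track progress, never vote.
    pub fn handle_replicated_entries(&mut self, last_index: u64) {
        self.match_index = last_index;
    }
}

// In main_loop, a new match arm delegates events to the Learner:
//   Role::Learner(learner) => learner.handle_event(event).await?,
// and transitions go through an event such as RoleEvent::BecomeLearner.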
§5. Why SRP Matters Here
- Predictability: Each role’s behavior is isolated and testable.
- Safer Changes: Modifying FollowerState won’t accidentally break LeaderState.
- Protocol Compliance: Raft requires strict role separation; SRP enforces this naturally.
By following these rules, your contributions will align with the project’s design philosophy.

This section outlines the core principles for distinguishing between protocol logic errors (expected business failures) and system-level errors (unrecoverable faults) across the entire project. Developers extending this codebase should strictly follow these guidelines to ensure consistency and reliability.
§1. Core Principles
Error Type | Description | Handling Strategy |
---|---|---|
Protocol Logic Errors | Failures dictated by protocol rules (e.g., term mismatches, log inconsistencies). | Return Ok(()) with a protocol-compliant response (e.g., success: false). |
System-Level Errors | Critical failures (e.g., I/O errors, channel disconnections, state corruption). | Return Error to halt normal operation; trigger recovery mechanisms (retry/alert). |
Illegal Operation Errors | Violations of the Raft state machine rules (e.g., invalid role transitions). | Return Error immediately; indicates a bug in code logic that must be fixed. |
§2. Examples
§Example 1: Protocol Logic Error
// Reject stale AppendEntries request (term < current_term)
if req.term < self.current_term() {
let resp = AppendEntriesResponse { success: false, term: self.current_term() };
sender.send(Ok(resp))?; // Protocol error → return Ok(())
return Ok(());
}
§Example 2: System-Level Error
// Fail to persist logs (return Error)
ctx.raft_log().append_entries(&req.entries).await?; // Propagates storage error
§Example 3: Illegal Operation Error
// Follower illegally attempts to become Leader
impl RaftRoleState for FollowerState {
fn become_leader(&self) -> Result<LeaderState> {
warn!("Follower cannot directly become Leader");
Err(Error::IllegalOperation("follower → leader")) // Hard error → return Error
}
}
§3. Best Practices
- Atomic State Changes
  Ensure critical operations (e.g., role transitions) are atomic. If a step fails after partial execution, return Error to avoid inconsistent states.
- Error Classification
  Define explicit error types to distinguish recoverable and non-recoverable failures:

  enum Error {
      Protocol(String),               // Debugging only (e.g., "term mismatch")
      Storage(io::Error),             // Retry or alert
      Network(NetworkError),          // Reconnect logic
      IllegalOperation(&'static str), // Code bug → must fix
  }

- Handle Illegal Operations Strictly
  - Log illegal operations at error! level.
  - Add unit tests to ensure invalid transitions return Error.

  #[test]
  fn follower_cannot_become_leader() {
      let follower = FollowerState::new();
      assert!(matches!(
          follower.become_leader(),
          Err(Error::IllegalOperation("follower → leader"))
      ));
  }
§4. Extending to the Entire Project
These principles apply to all components:
- RPC Clients/Servers:
  - Treat invalid RPC sequences (e.g., duplicate requests) as protocol errors (Ok with response).
  - Treat connection resets as system errors (Error).
- State Machines:
  - Invalid operations (e.g., applying non-committed logs) → Protocol errors.
  - Disk write failures → System errors.
- Cluster Management (a sketch follows this list):
  - Node join conflicts (e.g., duplicate IDs) → Protocol errors.
  - Invalid role transitions (e.g., Follower → Leader) → Illegal Operation errors.
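A minimal sketch of the cluster-management case, using illustrative types rather than the crate's actual API: a duplicate node ID is answered as a protocol error, while a failed persistence step would propagate as a system error.
use std::collections::HashSet;

struct JoinResponse {
    accepted: bool,
    reason: String,
}

fn handle_join(members: &mut HashSet<u64>, node_id: u64) -> Result<JoinResponse, std::io::Error> {
    // Protocol error: a duplicate node ID is answered with a rejection, and the
    // handler itself still returns Ok — the cluster keeps operating normally.
    if members.contains(&node_id) {
        return Ok(JoinResponse { accepted: false, reason: "duplicate node id".into() });
    }
    // System error: a failed persistence step (elided here) would propagate with `?`.
    members.insert(node_id);
    Ok(JoinResponse { accepted: true, reason: String::new() })
}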
By adhering to these rules, your extensions will maintain the same robustness and predictability as the core implementation.
§Raft Roles and Leader Election: Responsibilities and Step-Down Conditions
§Leader
- Heartbeat Authority: Send periodic heartbeats to all followers/learners
- Log Replication: Manage log entry replication pipeline
- Conflict Resolution: Overwrite inconsistent follower logs
- Membership Changes: Process configuration changes (learner promotion/demotion)
- Snapshot Coordination: Trigger snapshot creation and send via InstallSnapshot RPC
§Follower
- Request Processing:
  - Respond to Leader heartbeats
  - Append received log entries
  - Redirect client writes to Leader
- Election Participation: Convert to candidate if election timeout expires
- Log Consistency: Maintain locally persisted log entries
- Snapshot Handling: Apply received snapshots and truncate obsolete logs
§Candidate
- Election Initiation: Start new election by incrementing term
- Vote Solicitation: Send RequestVote RPCs to all nodes
- State Transition:
  - Become Leader if receiving majority votes
  - Revert to Follower if newer term discovered
§Learner
- Bootstrapping:
  - Default initial role for new nodes joining the cluster.
  - Receive initial cluster state via snapshot transfer.
- Log Synchronization:
  - Replicate logs but without voting rights.
  - Eligible for promotion to follower once caught up.
- Failure Adaptation (see the sketch after this list):
  - Followers with excessive log lag (beyond max_lag_threshold) demote themselves to learners.
  - Learners auto-promote back to follower when their match_index reaches a configurable threshold.
- Snapshot Protocol:
  - Leaders send full snapshots triggered by LearnerEvent.
  - Track replication progress separately for learners.
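A minimal sketch of the lag-based demotion and promotion checks described under Failure Adaptation; max_lag_threshold and match_index mirror the prose above, while the struct and the promotion threshold are illustrative assumptions:
struct ReplicationProgress {
    match_index: u64,       // highest log index known to be replicated on this node
    leader_last_index: u64, // leader's latest log index
}

// Follower side: demote to Learner once the lag exceeds max_lag_threshold.
fn should_demote_to_learner(p: &ReplicationProgress, max_lag_threshold: u64) -> bool {
    p.leader_last_index.saturating_sub(p.match_index) > max_lag_threshold
}

// Learner side: promote back to Follower once caught up within the promotion threshold.
fn should_promote_to_follower(p: &ReplicationProgress, promotion_lag_threshold: u64) -> bool {
    p.leader_last_index.saturating_sub(p.match_index) <= promotion_lag_threshold
}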
§Conditions That Cause a Leader to Step Down
A leader must relinquish authority and revert to follower state under these conditions:
- Receives an RPC with a Higher Term
  If the leader receives any RPC (AppendEntries, RequestVote, etc.) with a term greater than its current term, it must step down immediately (see the sketch after this list).
- Grants a Vote to Another Candidate
  If the leader grants a vote to another candidate (usually due to the candidate having a more up-to-date log), it must step down.
- Fails to Contact the Majority for Too Long
  If the leader cannot successfully reach a quorum of nodes for a duration longer than one election timeout, it must assume it has lost authority and step down.
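A hedged sketch of the first condition, reusing the role-event pattern from the SRP guidelines; the helper names (current_term, update_term, role_tx) are illustrative:
// Inside any Leader RPC handler: a higher term forces an immediate step-down.
if req.term > self.current_term() {
    self.update_term(req.term);                      // adopt the newer term first
    role_tx.send(RoleEvent::BecomeFollower(None))?;  // delegate the transition to main_loop
    return Ok(());                                   // stop handling the request as Leader
}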
§Snapshot Module Design
Defines the interface and behaviors for snapshot operations in the d-engine Raft implementation.
§Design Philosophy
- Pluggable strategy: Intentional flexibility for snapshot policies and storage implementations
- Zero-enforced approach: No default snapshot strategy - implement based on your storage needs
- Reference implementation: sled-based demo included (not production-ready)
§Core Concepts
§Snapshot Transfer: Bidirectional Streaming Design
sequenceDiagram
participant Learner
participant Leader
Learner->>Leader: 1. Request snapshot
Leader->>Learner: 2. Stream chunks
Learner->>Leader: 3. Send ACKs
Leader->>Learner: 4. Retransmit missing chunks
- Dual-channel communication:
  - Data channel: Leader → Learner (SnapshotChunk stream)
  - Feedback channel: Learner → Leader (SnapshotAck stream)
- Key benefits:
  - Flow control: ACKs regulate transmission speed
  - Reliability: Per-chunk CRC verification
  - Resumability: Selective retransmission of failed chunks
  - Backpressure: Explicit backoff when receiver lags
- Sequence:
sequenceDiagram
Learner->>Leader: StreamSnapshot(initial ACK)
Leader-->>Learner: Chunk 0 (metadata)
Learner->>Leader: ACK(seq=0, status=Accepted)
Leader-->>Learner: Chunk 1
Learner->>Leader: ACK(seq=1, status=ChecksumMismatch)
Leader-->>Learner: Chunk 1 (retransmit)
Learner->>Leader: ACK(seq=1, status=Accepted)
loop Remaining chunks
Leader-->>Learner: Chunk N
Learner->>Leader: ACK(seq=N)
end
§The bidirectional snapshot transfer is implemented as a state machine, coordinating both data transmission (chunks) and control feedback (ACKs) between Leader and Learner.
stateDiagram-v2
[*] --> Waiting
Waiting --> ProcessingChunk: Data available
Waiting --> ProcessingAck: ACK received
Waiting --> Retrying: Retry timer triggered
ProcessingChunk --> Waiting
ProcessingAck --> Waiting
Retrying --> Waiting
Waiting --> Completed: All chunks sent & acknowledged
Data Flow Summary:
Leader Learner
| |
|--> Stream<SnapshotChunk> ----> |
| |
|<-- Stream<SnapshotAck> <------ |
| |
§Snapshot Policies
Fully customizable through the SnapshotPolicy trait:
pub trait SnapshotPolicy {
fn should_create_snapshot(&self, ctx: &SnapshotContext) -> bool;
}
pub struct SnapshotContext {
pub last_included_index: u64,
pub last_applied_index: u64,
pub current_term: u64,
pub unapplied_entries: usize,
}
Enable a customized snapshot policy:
let node = NodeBuilder::new()
.with_snapshot_policy(Arc::new(TimeBasedPolicy {
interval: Duration::from_secs(3600),
}))
.build();
By default, Size-Based Policy is enabled.
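As an illustration, a minimal size-based policy could be expressed against the SnapshotPolicy trait above; the struct and threshold field are assumptions, not the crate's built-in implementation:
pub struct SizeBasedPolicy {
    pub max_unapplied_entries: usize, // create a snapshot once this many entries have accumulated
}

impl SnapshotPolicy for SizeBasedPolicy {
    fn should_create_snapshot(&self, ctx: &SnapshotContext) -> bool {
        ctx.unapplied_entries >= self.max_unapplied_entries
    }
}
Such a policy would be plugged in through NodeBuilder::with_snapshot_policy exactly as in the TimeBasedPolicy example above.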
§Snapshot policy used when generating a new snapshot
sequenceDiagram
participant Leader
participant Policy
participant StateMachine
Leader->>Policy: Check should_create_snapshot()
Policy-->>Leader: true/false
alt Should create
Leader->>StateMachine: Initiate snapshot creation
end
Common policy types:
- Size-based (default)
- Time-based intervals
- Hybrid approaches
- External metric triggers
§Leader snapshot generation sequence diagram
sequenceDiagram
participant Leader
participant Follower
participant StateMachine
Leader->>Follower: Send chunked stream Stream<SnapshotChunk>
Follower->>StateMachine: Create temporary database instance
loop Chunk processing
Follower->>Temporary DB: Write chunk data
end
Follower->>StateMachine: Atomically replace main database
StateMachine->>sled: ArcSwap::store(new_db)
Follower->>Leader: Return success response
§Transfer Mechanics
- Chunking:
  - Fixed-size chunks (configurable 4-16MB)
  - First chunk contains metadata
  - CRC32 checksum per chunk (see the sketch after this list)
- Rate limiting:
  if config.max_bandwidth_mbps > 0 {
      // MB divided by MB/s gives the minimum transmission time for this chunk.
      let min_duration = Duration::from_secs_f64(chunk_size_mb as f64 / config.max_bandwidth_mbps as f64);
      tokio::time::sleep(min_duration).await;
  }
- Error recovery:
  - Checksum mismatches trigger single-chunk retransmission
  - Out-of-order detection resets stream position
  - 10-second ACK timeout fails entire transfer
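To illustrate the per-chunk CRC32 verification noted under Chunking, here is a hedged sketch using the crc32fast crate; the chunk fields are assumptions, not the actual SnapshotChunk message:
struct Chunk {
    seq: u64,
    data: Vec<u8>,
    crc32: u32, // checksum computed by the sender over `data`
}

// Receiver side: verify the checksum before writing the chunk into the temporary file;
// a mismatch triggers an ACK with a ChecksumMismatch status and a single-chunk retransmit.
fn verify_chunk(chunk: &Chunk) -> bool {
    crc32fast::hash(&chunk.data) == chunk.crc32
}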
§Module Responsibilities
Component | Responsibilities |
---|---|
StateMachineHandler | Chunk validation; temporary file management; ACK generation; error handling; snapshot finalization |
StateMachine | Snapshot application; online state replacement; consistency guarantees |
§Generating a new snapshot:
- [StateMachine] Generate a new DB based on the temporary file provided by the [StateMachineHandler] →
- [StateMachine] Generate the data →
- [StateMachine] Dump the current DB into the new DB →
- [StateMachineHandler] Verify policy conditions, finalize the snapshot, and update the snapshot version.
§Applying a snapshot:
- [StateMachineHandler] Receive and validate snapshot chunks →
- [StateMachineHandler] Write chunks into a temporary file until all succeed →
- [StateMachineHandler] On error, send an error response back to the sender and terminate the process →
- [StateMachineHandler] After all chunks have been successfully processed and validated, finalize the snapshot →
- [StateMachineHandler] Pass the snapshot file to the [StateMachine] →
- [StateMachine] Apply the snapshot and perform online replacement, swapping the old state for the new one based on the snapshot.
§Cleaning up old snapshots:
[StateMachineHandler] automatically maintains old snapshots according to version policies; the StateMachine remains unaware of this process.
§Purge Log Design
§Leader Purge Log State Management
sequenceDiagram
participant Leader
participant StateMachine
participant Storage
Leader->>StateMachine: create_snapshot()
StateMachine->>Leader: SnapshotMeta(last_included_index=100)
Note over Leader,Storage: Critical Atomic Operation
Leader->>Storage: persist_last_purged(100) # 1. Update in-memory last_purged_index<br/>2. Flush to disk atomically
Leader->>Leader: scheduled_purge_upto = Some(100) # Schedule async task
Leader->>Background: trigger purge_task(100)
Background->>Storage: physical_delete_logs(0..100)
Storage->>Leader: notify_purge_complete(100)
Leader->>Storage: verify_last_purged(100) # Optional consistency check
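A hedged sketch of the leader-side sequence above; the method and field names (persist_last_purged, scheduled_purge_upto, purge_task) mirror the diagram and are illustrative, not the crate's actual API:
async fn purge_after_snapshot(&mut self, last_included_index: u64) -> Result<()> {
    // 1. Durability point: persist the purge boundary before deleting anything.
    self.storage.persist_last_purged(last_included_index).await?;
    // 2. Schedule the physical deletion asynchronously; the boundary is already safe on disk.
    self.scheduled_purge_upto = Some(last_included_index);
    tokio::spawn(purge_task(self.storage.clone(), last_included_index));
    Ok(())
}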
§Follower Purge Log State Management
sequenceDiagram
participant Leader
participant Follower
participant FollowerStorage
Leader->>Follower: InstallSnapshot RPC (last_included=100)
Note over Follower,FollowerStorage: Protocol-Required Atomic Update
Follower->>FollowerStorage: 1. Apply snapshot to state machine<br/>2. persist_last_purged(100)
Follower->>Follower: last_purged_index = Some(100) # Volatile state update
Follower->>Follower: pending_purge = Some(100) # Mark background task
Follower->>Background: schedule_log_purge(100)
Background->>FollowerStorage: delete_logs(0..100)
FollowerStorage->>Follower: purge_complete(100)
Follower->>Follower: pending_purge = None # Clear task status
§Raft Node Join & Promotion Architecture
§Business Objectives
- Read Load Distribution — Learners are promoted to Followers to share read traffic.
- High Availability for Mission-Critical Systems — Maintain odd-numbered quorum for safe scaling.
- Offline Analysis / Disaster Recovery — Support non-voting Learners for analytical or backup roles.
§Architecture Principles
Principle | Description |
---|---|
Even-Count Expansion Only | Nodes must be added in pairs to keep the voting member count odd. |
Sequential Join Handling | Leader handles join requests one at a time. |
Atomic Join Semantics | Node either fully joins as ActiveFollower or rolls back on failure. |
Retry on Join Failure | Join attempts include retry logic while waiting for peer readiness. |
OpsAdmin Involvement | Manual intervention support planned for future releases (not implemented yet). |
§Node State Lifecycle
stateDiagram-v2
[*] --> Joining
Joining --> Syncing: Leader accepts join
Syncing --> Ready: State synchronized
Ready --> StandbyLearner: Promotion timeout
Ready --> KeepAsLearner: Analytics role
Ready --> CheckPromotion: Standard role
CheckPromotion --> Promote: Quorum conditions met
CheckPromotion --> Pairing: Needs partner
Pairing --> Promote: Partner ready
Pairing --> StandbyLearner: Pairing timeout
§Core Integration Flow
sequenceDiagram
participant NewNode
participant Leader
participant RaftLog
NewNode->>Leader: JoinRequest
Leader->>RaftLog: ConfChange entry
RaftLog-->>Leader: Commit confirmation
Leader-->>NewNode: JoinResponse
Leader->>NewNode: InstallSnapshot stream
NewNode->>Leader: ApplyComplete
Leader->>NewNode: AppendEntries
NewNode->>Leader: ReadyNotification
Leader->>ReadyQueue: Enqueue node
§Key Architectural Components
§1. Connection Isolation
- Control Plane: Heartbeats and votes with low latency.
- Data Plane: Log replication with high throughput.
- Bulk Plane: Snapshot transfer, which is bandwidth-intensive.
§2. State Synchronization
- Immediate Snapshot: New nodes always receive the latest snapshot first.
- Atomic Replacement: Crash-safe application of state.
- Incremental Logs: Catch-up via incremental log entries after snapshot.
§3. Promotion System
- Quorum-Aware Batching:
  - Maintains odd voter count for quorum safety.
  - FIFO processing of join and promotion requests.
  - Batch multiple promotions when possible.
- Timeout Enforcement:
  - Join Request timeout: 30 seconds.
  - Promotion window: 5 minutes.
  - Pairing wait timeout: 2 minutes.
- Pairing Mechanism:
  - Groups nodes that require partners for promotion.
  - Jointly promotes paired nodes once both are ready.
§4. Failure Handling
- Stale Learner Detection: Periodic status checks trigger automatic downgrade and operator alert.
- Zombie Node Removal: Removes unresponsive nodes through batch proposals and maintenance.
§Quorum Preservation
flowchart LR
A[Current Voters] --> B{Odd Count?}
B -->|Yes| C[Promote Single]
B -->|No| D[Find Pair]
D --> E[Wait for Partner]
E -->|Ready| F[Joint Promotion]
E -->|Timeout| G[Demote to Standby]
§Timeout Enforcement Summary
Scenario | Timeout | Action |
---|---|---|
Join Request | 30s | Abort configuration |
Snapshot Transfer | Dynamic | Resume from last chunk |
Promotion Window | 5min | Demote to StandbyLearner |
Pairing Wait | 2min | Cancel pairing |
§Lifecycle Example
- New Analytics Node:
  Joining → Syncing → Ready → KeepAsLearner
- Standard Node Promotion:
  Joining → Syncing → Ready → Promote
- Paired Promotion:
  Ready → Pairing → Promote (when partner ready)
- Failed Node:
  Syncing → [Timeout] → Zombie → Removed
§Conditions That Cause a Leader to Step Down
A leader will step down in the following situations:
- Receives an RPC with a Higher Term If the leader receives an AppendEntries, RequestVote, or other RPC with a term number greater than its current term, it must step down and revert to a follower.
- Grants a Vote to Another Candidate If the leader receives a RequestVote RPC and grants the vote to another candidate (usually because the candidate has a more up-to-date log), the leader must step down.
- Fails to Contact the Majority for Too Long If the leader cannot reach a quorum (majority of nodes) for longer than one election timeout, it must assume it no longer holds authority and step down.
§Raft Log Persistence Architecture
This document outlines the design and behavior of the Raft log storage engine. It explains how logs are persisted, how the system handles reads and writes under different strategies, and how consistency is guaranteed across different configurations.
§Overview
Raft logs record the sequence of operations that must be replicated across all nodes in a Raft cluster. Correct and reliable storage of these logs is essential to maintaining the linearizability and safety guarantees of the protocol.
Our log engine supports two persistence strategies:
- DiskFirst: Prioritizes durability.
- MemFirst: Prioritizes performance.
Both strategies support configurable flush policies to control how memory contents are persisted to disk.
§Persistence Strategies
§DiskFirst
- Write Path: On append, entries are first synchronously written to disk. Once confirmed, they are cached in memory.
- Read Path: Reads are served from memory. If the requested entry is missing, it is loaded from disk and cached.
- Startup Behavior: Does not preload all entries from disk into memory. Instead, entries are loaded lazily on access.
- Durability: Ensures strong durability. A log is never considered accepted until it is safely written to disk.
- Memory Use: Memory acts as a read-through cache for performance optimization.
§MemFirst
- Write Path: Entries are first written to memory and acknowledged immediately. Disk persistence is handled asynchronously in the background.
- Read Path: Reads are served from memory only. If an entry is not present in memory, it is considered nonexistent.
- Startup Behavior: Loads all log entries from disk into memory during startup.
- Durability: Durability is best-effort and depends on the flush policy. Recent entries may be lost if a crash occurs before flushing.
- Memory Use: Memory holds the complete working set of logs.
§Flush Policies
Flush policies control how and when in-memory data is persisted to disk. These are especially relevant in MemFirst mode, but are also applied in DiskFirst to control how memory state is flushed (e.g., snapshots, metadata, etc.).
§Types
- Immediate
  - Flush to disk immediately after every log write.
  - Ensures maximum durability, but higher I/O latency.
- Batch { threshold, interval }
  - Flush to disk when:
    - The number of unflushed entries exceeds threshold, or
    - The elapsed time since the last flush exceeds interval milliseconds.
  - Balances performance and durability.
  - May lose recent entries on crash.
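A minimal sketch of how these two policies might be expressed as a configuration type; the exact type and field names in d-engine may differ:
use std::time::{Duration, Instant};

pub enum FlushPolicy {
    Immediate,
    Batch { threshold: usize, interval: Duration },
}

impl FlushPolicy {
    // Decide whether the buffered entries should be flushed to disk now.
    pub fn should_flush(&self, unflushed_entries: usize, last_flush: Instant) -> bool {
        match self {
            FlushPolicy::Immediate => true,
            FlushPolicy::Batch { threshold, interval } => {
                unflushed_entries >= *threshold || last_flush.elapsed() >= *interval
            }
        }
    }
}
A background flush daemon (see Developer Notes below) would typically evaluate such a check on a timer and after each append.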
§Read & Write Semantics
Operation | DiskFirst | MemFirst |
---|---|---|
Write | Write to disk → cache in memory | Write to memory → async flush |
Read | From memory; fallback to disk | Memory only; missing = absent |
Startup | Lazy-loading on access | Preload all entries into memory |
Flush | Controlled via flush_policy | Controlled via flush_policy |
Data loss on crash | No (after disk fsync) | Possible if not flushed |
§Consistency Guarantees
Property | DiskFirst | MemFirst |
---|---|---|
Linearizability | ✅ (strict) | ✅ (with quorum + sync on commit) |
Durability (Post-Commit) | ✅ Always | ❌ Depends on flush policy |
Availability (Under Load) | ❌ Lower | ✅ Higher |
Crash Recovery | ✅ Strong | ❌ Recent entries may be lost |
Startup Readiness | ✅ Fast | ❌ Slower (full load) |
§Recommended Use Cases
Strategy | Best For |
---|---|
DiskFirst | Systems that require strong durability and consistent recovery (e.g., databases, distributed ledgers) |
MemFirst | Systems that favor latency and availability, and can tolerate recovery from snapshots or re-election (e.g., in-memory caches, ephemeral workloads) |
§Developer Notes
- Log Truncation & Compaction: Logs should be truncated after snapshotting, regardless of strategy.
- Backpressure: In MemFirst, developers should implement backpressure if memory usage exceeds thresholds.
- Lazy Loading: In DiskFirst, avoid head-of-line blocking by prefetching future entries when cache misses occur.
- Flush Daemon: Use a background task to monitor and enforce the flush policy under MemFirst.
§Future Improvements
- Snapshot-aware recovery to reduce startup times for MemFirst.
- Tiered storage support (e.g., WAL on SSD, archival on HDD or cloud).
- Intelligent adaptive flush control based on workload.