openraft 0.10.0-alpha.18

# Monitoring and Maintenance

This guide explains how to monitor a Raft cluster's health and perform maintenance operations like adding or removing nodes.

## Monitoring Cluster Health

### Accessing Metrics

Use [`Raft::metrics()`] to monitor cluster health. It returns the receiving end of a watch channel that always holds the latest [`RaftMetrics`]:

```ignore
let metrics_rx = raft.metrics();

// Read current state
let metrics = metrics_rx.borrow_watched();
println!("Node state: {:?}", metrics.state);
println!("Current leader: {:?}", metrics.current_leader);
```

### Key Health Indicators

**Node State** ([`RaftMetrics::state`]): Whether the node is [`Leader`], [`Follower`], [`Learner`], or [`Candidate`].

**Current Leader** ([`RaftMetrics::current_leader`]): ID of the current leader, if known.

**Log State**:
- [`RaftMetrics::last_log_index`]: Last appended log index
- [`RaftMetrics::last_applied`]: Last log applied to the state machine
- [`RaftMetrics::snapshot`]: Last log included in the snapshot

**Leader Health** ([`RaftMetrics::last_quorum_acked`]): For leaders only, timestamp of the last quorum acknowledgment. An old timestamp suggests the leader may be partitioned from the cluster.
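
As a rough illustration of how these indicators might be combined, the sketch below derives two numbers from them: the backlog of appended but unapplied entries, and whether the last quorum acknowledgment is older than a chosen limit. `HealthSnapshot`, `apply_backlog`, and `leader_possibly_isolated` are hypothetical names, not part of openraft; the struct only mirrors a few metric values as plain `std` types, and in real code these values would be read from `RaftMetrics`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified copy of a few metric values (not an openraft type).
pub struct HealthSnapshot {
    pub last_log_index: Option<u64>,
    pub last_applied_index: Option<u64>,
    pub last_quorum_acked: Option<Instant>,
}

/// Number of log entries that are appended but not yet applied.
/// `None` means "no entry yet", so indexes are converted to counts first.
pub fn apply_backlog(s: &HealthSnapshot) -> u64 {
    let appended = s.last_log_index.map_or(0, |i| i + 1);
    let applied = s.last_applied_index.map_or(0, |i| i + 1);
    appended.saturating_sub(applied)
}

/// Returns true if no quorum acknowledgment was seen within `max_age`.
/// A missing timestamp is treated as "possibly isolated" (an assumption of this sketch).
pub fn leader_possibly_isolated(s: &HealthSnapshot, now: Instant, max_age: Duration) -> bool {
    match s.last_quorum_acked {
        Some(acked) => now.saturating_duration_since(acked) > max_age,
        None => true,
    }
}
```

A reasonable `max_age` depends on the configured heartbeat interval and election timeout, so the limit is left to the caller here.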

[`Leader`]: `crate::core::ServerState::Leader`
[`Follower`]: `crate::core::ServerState::Follower`
[`Learner`]: `crate::core::ServerState::Learner`
[`Candidate`]: `crate::core::ServerState::Candidate`

### Detecting Node Failures

When this node is the leader, use these metrics to detect follower/learner issues:

**Heartbeat Metrics** ([`RaftMetrics::heartbeat`]): Maps each node ID to the last heartbeat acknowledgment time. Calculate the elapsed time to detect potentially offline nodes:

```ignore
if let Some(heartbeat) = &metrics.heartbeat {
    for (node_id, last_ack) in heartbeat {
        if let Some(ack_time) = last_ack {
            let elapsed = ack_time.elapsed();
            if elapsed > threshold {
                // Node may be offline or unreachable
            }
        }
    }
}
```

**Replication Metrics** ([`RaftMetrics::replication`]): Maps each node ID to the last matched log index. Compare with the leader's log to detect lagging nodes:

```ignore
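if let Some(replication) = &metrics.replication {
    let last_log_index = metrics.last_log_index.unwrap_or(0);
    for (node_id, matched_log) in replication {
        if let Some(matched) = matched_log {
            // `saturating_sub` avoids an underflow panic if the two values are momentarily inconsistent
            let lag = last_log_index.saturating_sub(matched.index);
            if lag > threshold {
                // Node is lagging behind
            }
        }
    }
}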
```

## Maintenance Operations

When monitoring detects issues (offline nodes, excessive lag), perform maintenance operations.

See [`dynamic_membership`] for API details on `add_learner` and `change_membership`.

### When to Add Nodes

Add nodes when:
- Expanding cluster capacity
- Replacing failed nodes
- Increasing replication factor

**Process:**
1. Use [`Raft::add_learner()`] with `blocking=true` to add and wait for catch-up
2. Use [`Raft::change_membership()`] to promote to voter

### When to Remove Nodes

Remove nodes when:
- Decommissioning servers
- Reducing cluster size
- A node has failed permanently

**Process:**
1. Build the new voter set, excluding the node to remove
2. Use [`Raft::change_membership()`] where:
   - `retain=true`: The node becomes a learner (can be re-promoted later)
   - `retain=false`: The node is removed completely from the cluster

**Important:** A removed node can be safely terminated only after the uniform config is committed, that is, after the two-phase membership change completes.
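
Step 1 of the removal process can be sketched with plain `std` collections. The helper `next_voters` is hypothetical and assumes `u64` node IDs; the resulting set is what would be passed to `change_membership`.

```rust
use std::collections::BTreeSet;

/// Build the next voter set: every current voter except `failed`,
/// plus an optional replacement node. Hypothetical helper, not an openraft API.
pub fn next_voters(
    current: &BTreeSet<u64>,
    failed: u64,
    replacement: Option<u64>,
) -> BTreeSet<u64> {
    let mut next: BTreeSet<u64> = current
        .iter()
        .copied()
        .filter(|id| *id != failed)
        .collect();
    // `Option` is iterable, so `None` adds nothing
    next.extend(replacement);
    next
}
```

For example, with voters `{1, 2, 3}`, removing node `3` and adding replacement `4` yields `{1, 2, 4}`; without a replacement the result is `{1, 2}`.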

## Automated Maintenance Cautions

When building automated maintenance systems, follow these safety guidelines:

### 1. Leader Uncertainty

A leader cannot definitively determine if a node has failed. The issue could be:
- Network partition affecting the follower
- Network issues with the leader itself
- Temporary connectivity problems

Don't immediately remove unresponsive nodes. Use multiple checks over time.
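
One possible way to implement "multiple checks over time" is to count consecutive failed checks per node and act only after several in a row. The `SuspicionTracker` type below is a hypothetical helper written for this guide, assuming `u64` node IDs; it is not part of openraft.

```rust
use std::collections::BTreeMap;

/// Counts consecutive failed health checks per node (hypothetical helper).
#[derive(Default)]
pub struct SuspicionTracker {
    misses: BTreeMap<u64, u32>,
}

impl SuspicionTracker {
    /// Record one check result. Returns true once the node has failed
    /// `required` checks in a row; a healthy check resets the counter.
    pub fn observe(&mut self, node_id: u64, healthy: bool, required: u32) -> bool {
        if healthy {
            self.misses.remove(&node_id);
            return false;
        }
        let count = self.misses.entry(node_id).or_insert(0);
        *count += 1;
        *count >= required
    }
}
```

Each periodic monitoring pass would call `observe` once per node, for instance with `healthy` derived from the heartbeat age, and start a replacement only when it returns `true`.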

### 2. Leader Validity

The leader might be stale. Another node could be the current leader with a higher term. The stale leader's maintenance requests will be rejected by the actual leader.

Always check [`RaftMetrics::running_state`] for errors and verify leadership before operations.

### 3. Maintain Quorum

Removing nodes one by one without adding replacements shrinks the cluster and, with it, the number of failures it can tolerate. Eventually only a single node is left, and any failure then makes the cluster unavailable.

**Best practice**: Add a replacement node before removing a failed one.
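
The arithmetic behind this guideline is short. The sketch below assumes simple majority quorums over the voter set; the function names are illustrative only.

```rust
/// Smallest number of voters that forms a majority.
pub fn quorum_size(voters: usize) -> usize {
    voters / 2 + 1
}

/// Number of voters that can fail while a majority is still reachable.
pub fn tolerated_failures(voters: usize) -> usize {
    voters.saturating_sub(1) / 2
}
```

A 5-voter cluster needs 3 acknowledgments and tolerates 2 failures; a 3-voter cluster needs 2 and tolerates 1; with 2 voters or 1 voter, no failure can be tolerated.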

### 4. Only Leaders Perform Maintenance

Only the cluster leader can perform membership changes. Followers must forward requests to the leader or wait until they become leader.

Check `metrics.state == `[`ServerState::Leader`] before attempting maintenance operations.

[`ServerState::Leader`]: `crate::core::ServerState::Leader`

## Example: Automated Node Replacement

```ignore
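use std::collections::BTreeSet;
use std::time::Duration;

const OFFLINE_THRESHOLD: Duration = Duration::from_secs(30);
const MAX_LAG: u64 = 1000;

async fn check_and_maintain(raft: &Raft<TypeConfig>) -> Result<()> {
    // Clone the metrics so that no borrow of the watch channel is held across `.await`
    let metrics = raft.metrics().borrow_watched().clone();

    // Only leader performs maintenance
    if metrics.state != ServerState::Leader {
        return Ok(());
    }

    // Check for offline or lagging nodes
    if let (Some(heartbeat), Some(replication)) = (&metrics.heartbeat, &metrics.replication) {
        let last_log_index = metrics.last_log_index.unwrap_or(0);

        for (node_id, last_ack) in heartbeat {
            // Lag alone is only reported here; it does not trigger a replacement
            let matched_index = replication
                .get(node_id)
                .and_then(|m| m.as_ref())
                .map_or(0, |m| m.index);
            if last_log_index.saturating_sub(matched_index) > MAX_LAG {
                println!("Node {} is lagging behind", node_id);
            }

            // Skip if heartbeat is recent
            if let Some(ack_time) = last_ack {
                if ack_time.elapsed() < OFFLINE_THRESHOLD {
                    continue;
                }
            }

            // Node appears offline (or has never acknowledged a heartbeat) - prepare replacement
            println!("Node {} appears offline", node_id);

            // 1. Add replacement node as learner and wait for it to catch up
            //    (`get_replacement_node` is an application-provided helper)
            let new_node = get_replacement_node().await?;
            raft.add_learner(new_node.id, new_node.addr, true).await?;

            // 2. Build new membership without failed node, including replacement
            let mut new_voters = metrics.membership_config
                .voter_ids()
                .filter(|id| id != node_id)
                .collect::<BTreeSet<_>>();
            new_voters.insert(new_node.id);

            // 3. Change membership; `retain=false` removes the failed node completely
            raft.change_membership(new_voters, false).await?;

            break; // Handle one node at a time
        }
    }

    Ok(())
}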
```

## Monitoring Best Practices

1. **Export metrics** to monitoring systems (Prometheus, etc.) for historical analysis
2. **Set up alerts** for leader changes, node failures, and replication lag
3. **Monitor leader lease** via `last_quorum_acked` to detect leader isolation
4. **Track membership changes** to audit cluster topology changes
5. **Use multiple indicators** before declaring a node failed
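
For the first practice, a minimal sketch of rendering a few values in the Prometheus text exposition format is shown here. The metric names (`raft_last_log_index` and so on) and the `render_gauges` function are made up for illustration; a real exporter would more likely use a Prometheus client library and read the values from `RaftMetrics`.

```rust
/// Render a few gauge samples as Prometheus text exposition lines.
/// Metric names are illustrative, not an established convention.
pub fn render_gauges(
    node_id: u64,
    last_log_index: u64,
    last_applied_index: u64,
    is_leader: bool,
) -> String {
    let samples = [
        ("raft_last_log_index", last_log_index),
        ("raft_last_applied_index", last_applied_index),
        ("raft_is_leader", u64::from(is_leader)),
    ];

    let mut out = String::new();
    for (name, value) in samples {
        // `{{` and `}}` emit literal braces around the label set
        out.push_str(&format!("{name}{{node_id=\"{node_id}\"}} {value}\n"));
    }
    out
}
```

Serving this string from an HTTP endpoint that the monitoring system scrapes is one common setup; alerting on leader changes or replication lag can then be defined on the stored time series.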

## See Also

- [`RaftMetrics`]: Complete metrics structure
- [`dynamic_membership`]: API details for `add_learner` and `change_membership`
- [`node_lifecycle`]: Internal mechanics of node state transitions

[`Raft::metrics()`]: `crate::Raft::metrics`
[`Raft::add_learner()`]: `crate::Raft::add_learner`
[`Raft::change_membership()`]: `crate::Raft::change_membership`
[`RaftMetrics`]: `crate::metrics::RaftMetrics`
[`RaftMetrics::state`]: `crate::metrics::RaftMetrics::state`
[`RaftMetrics::current_leader`]: `crate::metrics::RaftMetrics::current_leader`
[`RaftMetrics::last_log_index`]: `crate::metrics::RaftMetrics::last_log_index`
[`RaftMetrics::last_applied`]: `crate::metrics::RaftMetrics::last_applied`
[`RaftMetrics::snapshot`]: `crate::metrics::RaftMetrics::snapshot`
[`RaftMetrics::last_quorum_acked`]: `crate::metrics::RaftMetrics::last_quorum_acked`
[`RaftMetrics::heartbeat`]: `crate::metrics::RaftMetrics::heartbeat`
[`RaftMetrics::replication`]: `crate::metrics::RaftMetrics::replication`
[`RaftMetrics::running_state`]: `crate::metrics::RaftMetrics::running_state`
[`dynamic_membership`]: `crate::docs::cluster_control::dynamic_membership`
[`node_lifecycle`]: `crate::docs::cluster_control::node_lifecycle`