Module failover

Coordinated zero-RPO failover (issue #833, PRD #819).

Drives a planned primary handover so that no acknowledged write is lost. The flow is the classic coordinated switchover:

  1. Freeze writes on the current primary and capture its frontier LSN at the instant writes stopped. No new LSN is minted after the freeze, so the frontier is a fixed catch-up target.
  2. Wait for the target replica to reach the frontier: poll the target’s acknowledged (durable) frontier until it covers the frozen LSN.
  3. Hand over the term — mint current_term + 1 and stamp it on the target, promoting it to primary.
  4. Demote the old primary to a replica that streams from the new primary under the new term.
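The four steps above can be sketched as a single function over an injected set of cluster operations. This is a minimal illustration, not the module's actual code: the `Cluster` trait, its method names, the `handover` function, the poll bound `max_polls`, and the `FakeCluster` test double are all hypothetical names introduced here, and the LSN is assumed to be a plain `u64`.

```rust
/// Log sequence number; a bare u64 is an assumption of this sketch.
type Lsn = u64;

/// Hypothetical stand-in for the injected cluster operations.
trait Cluster {
    /// Step 1: stop writes on the primary and return its frozen frontier.
    fn freeze_writes(&mut self) -> Lsn;
    /// Undo the freeze when the handover is abandoned.
    fn resume_writes(&mut self);
    /// Step 2: the target's acknowledged (durable) frontier.
    fn target_frontier(&mut self) -> Lsn;
    /// Step 3: stamp the new term on the target, making it primary.
    fn promote_target(&mut self, term: u64);
    /// Step 4: make the old primary stream from the new one under `term`.
    fn demote_old_primary(&mut self, term: u64);
}

/// Runs the four steps; returns the new term, or None if the target never
/// covered the frozen frontier within `max_polls` polls.
fn handover<C: Cluster>(c: &mut C, current_term: u64, max_polls: usize) -> Option<u64> {
    let frontier = c.freeze_writes();
    let caught_up = (0..max_polls).any(|_| c.target_frontier() >= frontier);
    if !caught_up {
        // Nothing was committed on the target, so resuming writes is safe.
        c.resume_writes();
        return None;
    }
    let new_term = current_term + 1;
    c.promote_target(new_term);
    c.demote_old_primary(new_term);
    Some(new_term)
}

/// In-memory fake: the target gains `step` LSNs on every poll.
#[derive(Debug, Default)]
struct FakeCluster {
    primary_lsn: Lsn,
    target_lsn: Lsn,
    step: Lsn,
    frozen: bool,
    target_term: Option<u64>,
    old_primary_follows: Option<u64>,
}

impl Cluster for FakeCluster {
    fn freeze_writes(&mut self) -> Lsn {
        self.frozen = true;
        self.primary_lsn
    }
    fn resume_writes(&mut self) {
        self.frozen = false;
    }
    fn target_frontier(&mut self) -> Lsn {
        self.target_lsn = (self.target_lsn + self.step).min(self.primary_lsn);
        self.target_lsn
    }
    fn promote_target(&mut self, term: u64) {
        self.target_term = Some(term);
    }
    fn demote_old_primary(&mut self, term: u64) {
        self.old_primary_follows = Some(term);
    }
}
```

Because the frontier is captured once, after the freeze, the loop compares against a fixed value; that is what makes the catch-up target stable.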

§Two modes

  • FailoverMode::Coordinated is the zero-RPO path. If the target cannot reach the frontier before catch_up_deadline, the handover aborts: writes resume on the old primary and nothing is committed on the target, so the cluster keeps serving and no acknowledged write is lost (issue #833 criterion 1).
  • FailoverMode::Force is the emergency path. It still tries to reach the frontier, but on timeout it completes the handover anyway, surfacing the skipped catch-up — the un-replicated LSN gap between the frozen frontier and the target’s reached frontier (issue #833 criterion 2).
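The difference between the two modes comes down to one decision, taken once catch-up has either succeeded or timed out. The sketch below illustrates that decision as a pure function. `Mode`, `Decision`, and `decide` are hypothetical names that mirror the description above rather than the module's real types, and LSNs are assumed to be `u64` values whose difference measures the gap.

```rust
/// Local stand-in for the two failover modes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Coordinated,
    Force,
}

/// What to do once catch-up has succeeded or timed out (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// The target covers the frozen frontier: hand over, nothing skipped.
    Commit,
    /// Force mode after a timeout: hand over anyway and report the
    /// un-replicated LSN gap.
    CommitWithGap { skipped: u64 },
    /// Coordinated mode after a timeout: resume writes on the old primary
    /// and commit nothing on the target.
    Abort,
}

fn decide(mode: Mode, frozen_frontier: u64, reached_frontier: u64) -> Decision {
    if reached_frontier >= frozen_frontier {
        return Decision::Commit;
    }
    match mode {
        Mode::Coordinated => Decision::Abort,
        Mode::Force => Decision::CommitWithGap {
            skipped: frozen_frontier - reached_frontier,
        },
    }
}
```

Both modes behave identically when the target catches up in time; they diverge only on timeout, which keeps the zero-RPO guarantee of the coordinated path easy to reason about.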

§Module shape

FailoverCoordinator::run is a pure state machine. The clock and the cluster mutations (freeze, resume, poll, commit) are injected behind FailoverTransport, so the whole flow can be exercised deterministically with a scripted fake: no real clock, no network, no engine dependency. The post-handover roles are returned in the outcome (RoleAssignment) so a caller can assert that the new primary advertises the new term and the old primary streams as a replica (issue #833 criterion 3). Wiring the transport to the real WAL frontier and the gRPC role-swap is left to the transport layer.
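To illustrate why injecting the clock makes the deadline path deterministic, here is a minimal sketch of a scripted fake. It does not reproduce the real FailoverTransport signature, which is not shown on this page: the `Transport` trait, `ScriptedTransport`, and `catch_up` are hypothetical, and the fake clock simply advances by a fixed `tick` on every poll.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical transport surface: an injected clock plus a frontier poll.
trait Transport {
    /// Time elapsed on the injected clock.
    fn now(&self) -> Duration;
    /// The target's acknowledged (durable) frontier.
    fn poll_target(&mut self) -> u64;
}

/// Scripted fake: each poll returns the next scripted frontier and advances
/// the fake clock by `tick`. Once the script runs out, the last value repeats.
struct ScriptedTransport {
    frontiers: VecDeque<u64>,
    last: u64,
    elapsed: Duration,
    tick: Duration,
}

impl ScriptedTransport {
    fn new(frontiers: &[u64], tick: Duration) -> Self {
        Self {
            frontiers: frontiers.iter().copied().collect(),
            last: 0,
            elapsed: Duration::ZERO,
            tick,
        }
    }
}

impl Transport for ScriptedTransport {
    fn now(&self) -> Duration {
        self.elapsed
    }
    fn poll_target(&mut self) -> u64 {
        self.elapsed += self.tick;
        if let Some(f) = self.frontiers.pop_front() {
            self.last = f;
        }
        self.last
    }
}

/// Polls until the target covers `frontier` or the injected clock reaches
/// `deadline`. Ok carries the reached frontier; Err carries how far the
/// target got before the timeout.
fn catch_up<T: Transport>(t: &mut T, frontier: u64, deadline: Duration) -> Result<u64, u64> {
    let mut reached = 0;
    while t.now() < deadline {
        reached = t.poll_target();
        if reached >= frontier {
            return Ok(reached);
        }
    }
    Err(reached)
}
```

With this shape a test can script a slow replica and assert the timeout outcome without sleeping, since time only moves when the fake says so.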

Structs§

FailoverCoordinator
The coordinated zero-RPO failover state machine.
FailoverNode
A node participating in a failover.
FailoverOutcome
The result of a completed handover.
FailoverRequest
A request to hand the primary role from old_primary to target.
RoleAssignment
Post-handover roles of the two nodes, used to assert that the new primary advertises the new term and the old primary streams as a replica (issue #833 criterion 3).

Enums§

FailoverError
Why a coordinated failover could not complete without losing writes.
FailoverMode
How a failover should be executed.
NodeRole
The replication role a node plays after a failover step.

Traits§

FailoverTransport
Cluster mutations and the clock the coordinator drives, injected so the state machine stays pure and deterministically testable.