Coordinated zero-RPO failover (issue #833, PRD #819).
Drives a planned primary handover so that no acknowledged write is lost. The flow is the classic coordinated switchover:
- Freeze writes on the current primary and capture its frontier LSN at the instant writes stopped. No new LSN is minted after the freeze, so the frontier is a fixed catch-up target.
- Wait for the target replica to reach the frontier: poll the target’s acknowledged (durable) frontier until it covers the frozen LSN.
- Hand over the term: mint current_term + 1 and stamp it on the target, promoting it to primary.
- Demote the old primary to a replica that streams from the new primary under the new term.
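The four steps above can be sketched as a single function over an abstract cluster handle. This is a minimal illustration, not the module's actual code: the `Cluster` trait, its method names, `switchover`, and the `max_polls` bound (standing in for a real deadline) are all assumptions made for the example.

```rust
/// Log sequence number; a plain integer in this sketch.
type Lsn = u64;

/// Hypothetical stand-in for the cluster operations the flow needs.
/// None of these names come from the real module.
trait Cluster {
    /// Stop writes on the old primary and return its frontier LSN.
    fn freeze_writes(&mut self) -> Lsn;
    /// Let the old primary accept writes again after an aborted handover.
    fn resume_writes(&mut self);
    /// The target's acknowledged (durable) frontier.
    fn target_acked_frontier(&mut self) -> Lsn;
    /// Stamp `term` on the target and make it the primary.
    fn promote_target(&mut self, term: u64);
    /// Turn the old primary into a replica streaming under `term`.
    fn demote_old_primary(&mut self, term: u64);
}

/// Runs the coordinated flow. Returns the new term on success, or `None`
/// if the target did not cover the frozen frontier within `max_polls` polls.
fn switchover<C: Cluster>(cluster: &mut C, current_term: u64, max_polls: usize) -> Option<u64> {
    // Step 1: freeze writes; the returned frontier is now a fixed target.
    let frozen = cluster.freeze_writes();

    // Step 2: poll the target until its durable frontier covers the frozen LSN.
    let mut caught_up = false;
    for _ in 0..max_polls {
        if cluster.target_acked_frontier() >= frozen {
            caught_up = true;
            break;
        }
    }
    if !caught_up {
        // Abort: nothing was committed on the target, so resuming is safe.
        cluster.resume_writes();
        return None;
    }

    // Step 3: mint the next term and promote the target under it.
    let new_term = current_term + 1;
    cluster.promote_target(new_term);

    // Step 4: the old primary rejoins as a replica under the new term.
    cluster.demote_old_primary(new_term);
    Some(new_term)
}
```

Because the frontier is frozen before polling starts, the comparison in step 2 is against a value that cannot move, which is what makes the catch-up check a sufficient condition for losing no acknowledged write.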
§Two modes
- FailoverMode::Coordinated is the zero-RPO path. If the target cannot reach the frontier before catch_up_deadline, the handover aborts: writes resume on the old primary and nothing is committed on the target, so the cluster keeps serving and no acknowledged write is lost (issue #833 criterion 1).
- FailoverMode::Force is the emergency path. It still tries to reach the frontier, but on timeout it completes the handover anyway, surfacing the skipped catch-up: the un-replicated LSN gap between the frozen frontier and the target’s reached frontier (issue #833 criterion 2).
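The difference between the two modes only shows up when the catch-up deadline passes. A minimal sketch of that decision follows, using hypothetical `Mode` and `TimeoutDecision` types rather than the module's real `FailoverMode` and outcome types; expressing the gap as a simple LSN difference is also an assumption.

```rust
/// Log sequence number; a plain integer in this sketch.
type Lsn = u64;

/// Hypothetical mirror of the two failover modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Coordinated,
    Force,
}

/// What to do once the catch-up deadline has passed (illustrative type).
#[derive(Debug, PartialEq, Eq)]
enum TimeoutDecision {
    /// Resume writes on the old primary; commit nothing on the target.
    Abort,
    /// Complete the handover anyway and report the un-replicated gap.
    CompleteWithGap { skipped_lsns: Lsn },
}

/// Decides the outcome of a timed-out catch-up.
/// `frozen` is the old primary's frozen frontier, `reached` is the
/// target's acknowledged frontier at the moment of the timeout.
fn on_catch_up_timeout(mode: Mode, frozen: Lsn, reached: Lsn) -> TimeoutDecision {
    match mode {
        // Zero-RPO path: never promote a target that is behind.
        Mode::Coordinated => TimeoutDecision::Abort,
        // Emergency path: promote, but surface how far behind the target was.
        Mode::Force => TimeoutDecision::CompleteWithGap {
            skipped_lsns: frozen.saturating_sub(reached),
        },
    }
}
```

With a frozen frontier of 100 and a target that reached 90, the coordinated mode yields `Abort`, while the force mode yields a completed handover with a reported gap of 10.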
§Module shape
FailoverCoordinator::run is a pure state machine. The clock and
the cluster mutations (freeze, resume, poll, commit) are injected
behind FailoverTransport, so the whole flow is exercised
deterministically with a scripted fake — no clock, no network, no
engine dependency. The post-handover roles are returned in the
outcome (RoleAssignment) so a caller can assert that the new
primary advertises the new term and the old primary streams as a
replica (issue #833 criterion 3). Wiring the transport to the real
WAL frontier and the gRPC role-swap is left to the transport layer.
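To illustrate why injecting the clock makes the flow deterministic, here is a small sketch of a catch-up wait driven by a scripted fake. The `Transport` trait below is a simplified, hypothetical stand-in with only two methods; it is not the real FailoverTransport, and `wait_for_catch_up` and `Scripted` are names invented for the example.

```rust
use std::collections::VecDeque;

/// Log sequence number; a plain integer in this sketch.
type Lsn = u64;

/// Hypothetical, reduced version of an injected transport:
/// a clock plus a poll of the target's acknowledged frontier.
trait Transport {
    /// Current time in milliseconds on the injected clock.
    fn now_ms(&mut self) -> u64;
    /// The target's acknowledged (durable) frontier.
    fn poll_target(&mut self) -> Lsn;
}

/// Polls until the target covers `frozen` or the injected clock reaches
/// `deadline_ms`. Returns `Ok(reached)` on success and `Err(reached)`
/// on timeout, where `reached` is the last frontier observed.
fn wait_for_catch_up<T: Transport>(t: &mut T, frozen: Lsn, deadline_ms: u64) -> Result<Lsn, Lsn> {
    loop {
        let reached = t.poll_target();
        if reached >= frozen {
            return Ok(reached);
        }
        if t.now_ms() >= deadline_ms {
            return Err(reached);
        }
    }
}

/// Scripted fake: each poll advances the fake clock by `tick_ms` and
/// returns the next scripted frontier, repeating the last one when the
/// script runs out.
struct Scripted {
    frontiers: VecDeque<Lsn>,
    last: Lsn,
    clock_ms: u64,
    tick_ms: u64,
}

impl Scripted {
    fn new(frontiers: &[Lsn], tick_ms: u64) -> Self {
        Scripted {
            frontiers: frontiers.iter().copied().collect(),
            last: 0,
            clock_ms: 0,
            tick_ms,
        }
    }
}

impl Transport for Scripted {
    fn now_ms(&mut self) -> u64 {
        self.clock_ms
    }

    fn poll_target(&mut self) -> Lsn {
        self.clock_ms += self.tick_ms;
        if let Some(next) = self.frontiers.pop_front() {
            self.last = next;
        }
        self.last
    }
}
```

A test can then script both the replica's progress and the passage of time: `Scripted::new(&[40, 80, 100], 10)` catches up to a frozen LSN of 100 on the third poll, while `Scripted::new(&[40, 50], 10)` with a 30 ms deadline times out having reached 50, with no sleeping and no real network involved.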
Structs§
- FailoverCoordinator - The coordinated zero-RPO failover state machine.
- FailoverNode - A node participating in a failover.
- FailoverOutcome - The result of a completed handover.
- FailoverRequest - A request to hand the primary role from old_primary to target.
- RoleAssignment - Post-handover roles of the two nodes, used to assert that the new primary advertises the new term and the old primary streams as a replica (issue #833 criterion 3).
Enums§
- FailoverError - Why a coordinated failover could not complete without losing writes.
- FailoverMode - How a failover should be executed.
- NodeRole - The replication role a node plays after a failover step.
Traits§
- FailoverTransport - Cluster mutations and the clock the coordinator drives, injected so the state machine stays pure and deterministically testable.