Module failover

Coordinated zero-RPO failover (issue #833, PRD #819).

Drives a planned primary handover so that no acknowledged write is lost. The flow is the classic coordinated switchover:

1. Freeze writes on the current primary and capture its frontier LSN at the instant writes stopped. No new LSN is minted after the freeze, so the frontier is a fixed catch-up target.
2. Wait for the target replica to reach the frontier: poll the target’s acknowledged (durable) frontier until it covers the frozen LSN.
3. Hand over the term: mint current_term + 1 and stamp it on the target, promoting it to primary.
4. Demote the old primary to a replica that streams from the new primary under the new term.
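
The four steps above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `Cluster` trait, its method names, `ToyCluster`, `switchover`, and the `max_polls` bound are all hypothetical, and LSNs are assumed to be plain `u64` values. The real coordinator bounds the wait with a deadline rather than a poll count.

```rust
/// Log sequence number; a plain u64 is an assumption of this sketch.
type Lsn = u64;

/// Hypothetical stand-in for the cluster operations the flow drives.
trait Cluster {
    fn freeze_writes(&mut self) -> Lsn;
    fn resume_writes(&mut self);
    fn target_durable_frontier(&mut self) -> Lsn;
    fn promote_target(&mut self, term: u64);
    fn demote_old_primary(&mut self, term: u64);
}

/// Runs the four steps in order and returns the new term,
/// or `None` if the target never reached the frozen frontier.
fn switchover<C: Cluster>(cluster: &mut C, current_term: u64, max_polls: usize) -> Option<u64> {
    // Step 1: freeze writes; the returned LSN is the fixed catch-up target.
    let frozen = cluster.freeze_writes();

    // Step 2: poll the target's durable frontier until it covers `frozen`.
    let mut caught_up = false;
    for _ in 0..max_polls {
        if cluster.target_durable_frontier() >= frozen {
            caught_up = true;
            break;
        }
    }
    if !caught_up {
        // Nothing was committed on the target, so writes simply resume.
        cluster.resume_writes();
        return None;
    }

    // Step 3: mint the next term and promote the target under it.
    let new_term = current_term + 1;
    cluster.promote_target(new_term);

    // Step 4: the old primary becomes a replica under the new term.
    cluster.demote_old_primary(new_term);
    Some(new_term)
}

/// Toy in-memory cluster: each poll lets the target replay one more LSN.
struct ToyCluster {
    primary_lsn: Lsn,
    target_lsn: Lsn,
    events: Vec<String>,
}

impl Cluster for ToyCluster {
    fn freeze_writes(&mut self) -> Lsn {
        self.events.push("freeze".to_string());
        self.primary_lsn
    }
    fn resume_writes(&mut self) {
        self.events.push("resume".to_string());
    }
    fn target_durable_frontier(&mut self) -> Lsn {
        if self.target_lsn < self.primary_lsn {
            self.target_lsn += 1;
        }
        self.target_lsn
    }
    fn promote_target(&mut self, term: u64) {
        self.events.push(format!("promote@{}", term));
    }
    fn demote_old_primary(&mut self, term: u64) {
        self.events.push(format!("demote@{}", term));
    }
}
```

With a toy cluster whose target is three LSNs behind, `switchover(&mut cluster, 4, 5)` records the events freeze, promote, demote in that order and returns `Some(5)`.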

§Two modes

FailoverMode::Coordinated is the zero-RPO path. If the target cannot reach the frontier before catch_up_deadline, the handover aborts: writes resume on the old primary and nothing is committed on the target, so the cluster keeps serving and no acknowledged write is lost (issue #833 criterion 1).
FailoverMode::Force is the emergency path. It still tries to reach the frontier, but on timeout it completes the handover anyway, surfacing the skipped catch-up — the un-replicated LSN gap between the frozen frontier and the target’s reached frontier (issue #833 criterion 2).
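
The difference between the two modes comes down to what happens once the catch-up wait ends. The sketch below illustrates that decision under stated assumptions: the `FailoverMode` enum here only mirrors the two documented variants, while `Decision`, `decide`, and the representation of the skipped catch-up as a simple `u64` difference are hypothetical. The real module may report the gap differently, for example as a pair of LSNs.

```rust
/// Mirrors the two documented modes; redefined here so the sketch stands alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailoverMode {
    Coordinated,
    Force,
}

/// Hypothetical result of the decision taken when the wait ends.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// Resume writes on the old primary; nothing is committed on the target.
    Abort,
    /// Commit the handover; `skipped` is the un-replicated LSN gap (0 if caught up).
    Commit { skipped: u64 },
}

/// `frozen` is the primary's frontier at freeze time,
/// `reached` is the target's durable frontier when the wait ended.
fn decide(mode: FailoverMode, frozen: u64, reached: u64) -> Decision {
    if reached >= frozen {
        // The target covers the frozen frontier: both modes commit with no gap.
        return Decision::Commit { skipped: 0 };
    }
    match mode {
        // Zero-RPO path: never commit while acknowledged writes are missing.
        FailoverMode::Coordinated => Decision::Abort,
        // Emergency path: commit anyway and surface what was skipped.
        FailoverMode::Force => Decision::Commit { skipped: frozen - reached },
    }
}
```

For a frozen frontier of 100 and a target that only reached 90, `Coordinated` yields `Decision::Abort` while `Force` yields `Decision::Commit { skipped: 10 }`.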

§Module shape

FailoverCoordinator::run is a pure state machine. The clock and the cluster mutations (freeze, resume, poll, commit) are injected behind FailoverTransport, so the whole flow can be exercised deterministically with a scripted fake, with no real clock, network, or engine dependency. The post-handover roles are returned in the outcome (RoleAssignment) so a caller can assert that the new primary advertises the new term and the old primary streams as a replica (issue #833 criterion 3). Wiring the transport to the real WAL frontier and the gRPC role-swap is left to the transport layer.
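
To show why injecting the clock makes the wait deterministic, here is a sketch of a scripted fake driving only the catch-up step. It does not reproduce the real `FailoverTransport` signature, which is not shown on this page: `PollTransport`, `ScriptedTransport`, `wait_for_frontier`, and every method name below are hypothetical, and the fake clock is assumed to advance by a fixed tick on each poll.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical slice of a transport: only the clock and the poll.
trait PollTransport {
    /// Time elapsed since the handover started, as seen by the injected clock.
    fn elapsed(&self) -> Duration;
    /// The target's acknowledged (durable) frontier LSN.
    fn poll_target_frontier(&mut self) -> u64;
}

/// Scripted fake: each poll pops the next reading and advances the fake clock.
struct ScriptedTransport {
    readings: VecDeque<u64>,
    last: u64,
    now: Duration,
    tick: Duration,
}

impl ScriptedTransport {
    fn new(readings: &[u64], tick: Duration) -> Self {
        ScriptedTransport {
            readings: readings.iter().copied().collect(),
            last: 0,
            now: Duration::ZERO,
            tick,
        }
    }
}

impl PollTransport for ScriptedTransport {
    fn elapsed(&self) -> Duration {
        self.now
    }
    fn poll_target_frontier(&mut self) -> u64 {
        self.now += self.tick;
        // Once the script runs out, the target stays at its last frontier.
        if let Some(lsn) = self.readings.pop_front() {
            self.last = lsn;
        }
        self.last
    }
}

/// Polls until the target covers `frozen` or the deadline passes.
/// Returns `Ok(reached)` on catch-up and `Err(reached)` on timeout.
fn wait_for_frontier<T: PollTransport>(
    transport: &mut T,
    frozen: u64,
    deadline: Duration,
) -> Result<u64, u64> {
    loop {
        let reached = transport.poll_target_frontier();
        if reached >= frozen {
            return Ok(reached);
        }
        if transport.elapsed() >= deadline {
            return Err(reached);
        }
    }
}
```

Because both the readings and the clock are scripted, a test needs no sleeps: with readings `[40, 70, 100]` and a 10 ms tick, the wait for LSN 100 succeeds after exactly three polls, and with readings `[40, 70]` and a 30 ms deadline it times out at LSN 70.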

Structs§

FailoverCoordinator: The coordinated zero-RPO failover state machine.
FailoverNode: A node participating in a failover.
FailoverOutcome: The result of a completed handover.
FailoverRequest: A request to hand the primary role from old_primary to target.
RoleAssignment: Post-handover roles of the two nodes, used to assert that the new primary advertises the new term and the old primary streams as a replica (issue #833 criterion 3).

Enums§

FailoverError: Why a coordinated failover could not complete without losing writes.
FailoverMode: How a failover should be executed.
NodeRole: The replication role a node plays after a failover step.

Traits§

FailoverTransport: Cluster mutations and the clock the coordinator drives, injected so the state machine stays pure and deterministically testable.
