Function spawn_catchup_loop 

pub fn spawn_catchup_loop(
    config: FederationConfig,
    db: Db,
    interval: Duration,
) -> JoinHandle<()>

v0.6.0.1 (#320) — post-partition catchup poller.

Previously, a node rejoining the mesh after a SIGSTOP, network blip, or restart received only the writes that arrived after it resumed; anything the other peers wrote during the outage stayed on those peers. The r14 scenario-14 run observed this as node-3 seeing 2 of 20 writes after SIGCONT.

This loop periodically calls GET /api/v1/sync/since?peer=<local> against each configured peer and applies the returned memories via insert_if_newer. The since value is the receiver-side vector clock entry for that peer, so rows that were already applied are never pulled again. The first catchup after a restart runs with since=None and pulls a capped snapshot (limit=500).
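The apply step described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the crate's actual code: the Memory struct, its u64 version field, the HashMap store, and apply_catchup are all hypothetical stand-ins, and the real insert_if_newer presumably operates on Db with its own ordering rule.

```rust
use std::collections::HashMap;

/// Hypothetical row shape; the real memory type is not shown on this page.
#[derive(Debug, Clone, PartialEq)]
pub struct Memory {
    pub id: String,
    /// Assumed stand-in for whatever ordering the real store compares on.
    pub version: u64,
    pub body: String,
}

/// Stores `row` only if no copy exists or the stored copy is strictly older.
/// Returns true when the row was written.
pub fn insert_if_newer(store: &mut HashMap<String, Memory>, row: Memory) -> bool {
    match store.get(&row.id) {
        Some(existing) if existing.version >= row.version => false,
        _ => {
            store.insert(row.id.clone(), row);
            true
        }
    }
}

/// Applies one peer's catchup response and advances the `since` cursor.
/// `since` is None on the first catchup after a restart.
/// Returns (rows actually written, new cursor).
pub fn apply_catchup(
    store: &mut HashMap<String, Memory>,
    rows: Vec<Memory>,
    since: Option<u64>,
) -> (usize, Option<u64>) {
    let mut applied = 0;
    let mut cursor = since;
    for row in rows {
        // Advance the cursor past every row seen, even ones we skip,
        // so the next pull does not ask for them again.
        cursor = Some(cursor.map_or(row.version, |c| c.max(row.version)));
        if insert_if_newer(store, row) {
            applied += 1;
        }
    }
    (applied, cursor)
}
```

In this sketch a stale row from a peer is skipped but still moves the cursor forward, which is one way to get the "never re-pull" property; the real loop may track the clock entry differently.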

The interval is operator-tunable via --catchup-interval-secs; a value of 0 disables the loop. The loop is a best-effort background task: errors are logged but never propagated. In the happy path a partitioned node converges within one interval after resume.
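The interval and error-handling rules above might look like this in outline. It is a hedged, std-only sketch: catchup_interval and run_round are hypothetical names, the pull closure stands in for the HTTP call plus apply step, and eprintln! stands in for whatever logging the real task uses.

```rust
use std::time::Duration;

/// Maps the --catchup-interval-secs value to a polling period.
/// 0 means the loop is disabled, so no period is returned.
pub fn catchup_interval(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Runs one best-effort round over all peers.
/// `pull` returns how many rows it applied, or an error message.
/// A failing peer is logged and skipped; it never aborts the round.
pub fn run_round<F>(peers: &[&str], mut pull: F) -> usize
where
    F: FnMut(&str) -> Result<usize, String>,
{
    let mut applied = 0;
    for &peer in peers {
        match pull(peer) {
            Ok(n) => applied += n,
            // Log and continue: errors are not propagated to the caller.
            Err(e) => eprintln!("catchup from {} failed: {}", peer, e),
        }
    }
    applied
}
```

A caller would check catchup_interval once at startup and skip spawning the task when it returns None, then invoke run_round on each tick. Because one peer's failure does not stop the round, the remaining peers can still bring the node up to date.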

This is deliberately NOT a substitute for the synchronous quorum-write path — it’s a safety net for the tail. Normal writes still fan out via broadcast_store_quorum; catchup only fires for rows that DIDN’T land during the original write deadline.