pub fn spawn_catchup_loop(
    config: FederationConfig,
    db: Db,
    interval: Duration,
) -> JoinHandle<()>
v0.6.0.1 (#320) — post-partition catchup poller.
Previously, a node rejoining the mesh after SIGSTOP, a network blip, or a restart would only receive NEW writes that arrived AFTER resume; anything the other peers wrote during the outage stayed on those peers. In r14 scenario-14, node-3 saw only 2 of 20 writes after SIGCONT.
This loop periodically calls GET /api/v1/sync/since?peer=<local> against
each configured peer, applying returned memories via insert_if_newer.
The since value is the receiver-side vector clock entry for that peer,
so we never re-pull already-applied rows. First catchup after a restart
runs with since=None, pulling a capped snapshot (limit=500).
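The pull cursor and the apply step described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the `VectorClock` and `Memory` types, the `u64` counter used for `since`, the `updated_at` comparison inside `insert_if_newer`, and the exact `since=` / `limit=` query parameter spelling are all assumptions made for the example. Only the endpoint path, the `peer=<local>` parameter, and the 500-row cap come from the description.

```rust
use std::collections::HashMap;

/// Cap for the first pull after a restart (limit=500 in the description).
const SNAPSHOT_LIMIT: u32 = 500;

/// Hypothetical receiver-side vector clock: peer id -> last applied counter.
pub struct VectorClock {
    entries: HashMap<String, u64>,
}

impl VectorClock {
    pub fn new() -> Self {
        VectorClock { entries: HashMap::new() }
    }

    /// Returns None when nothing has been applied from `peer` yet,
    /// which is the state right after a restart in this sketch.
    pub fn get(&self, peer: &str) -> Option<u64> {
        self.entries.get(peer).copied()
    }

    /// Moves the entry forward only; older values are ignored.
    pub fn advance(&mut self, peer: &str, seen: u64) {
        let entry = self.entries.entry(peer.to_string()).or_insert(0);
        if seen > *entry {
            *entry = seen;
        }
    }
}

/// Builds the catchup URL for one peer. The query parameter names other
/// than `peer` are assumptions for illustration.
pub fn catchup_url(peer_base: &str, local_id: &str, since: Option<u64>) -> String {
    let mut url = format!(
        "{}/api/v1/sync/since?peer={}",
        peer_base.trim_end_matches('/'),
        local_id
    );
    match since {
        // Incremental pull: only rows after the recorded clock entry.
        Some(s) => url.push_str(&format!("&since={}", s)),
        // First pull after restart: no cursor, so request a capped snapshot.
        None => url.push_str(&format!("&limit={}", SNAPSHOT_LIMIT)),
    }
    url
}

/// Hypothetical memory row; the real type is not shown in the source.
#[derive(Clone, Debug, PartialEq)]
pub struct Memory {
    pub id: String,
    pub updated_at: u64,
    pub body: String,
}

/// Assumed semantics of insert_if_newer: keep the row with the larger
/// `updated_at`, so re-applying an old row is a no-op. Returns true if
/// the store changed.
pub fn insert_if_newer(store: &mut HashMap<String, Memory>, incoming: Memory) -> bool {
    let is_newer = store
        .get(&incoming.id)
        .map_or(true, |existing| incoming.updated_at > existing.updated_at);
    if is_newer {
        store.insert(incoming.id.clone(), incoming);
    }
    is_newer
}
```

Because the apply step is idempotent under these assumptions, a pull that overlaps rows already delivered by the normal write path does no harm; the vector clock entry only saves bandwidth.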
Interval is operator-tunable via --catchup-interval-secs. 0 disables.
The loop is a best-effort background task: errors are logged but never
propagated. In the happy path a partitioned node converges within one
interval after resume.
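A rough sketch of the "0 disables" rule and the best-effort behaviour is shown below. It is an illustration under assumptions: the helper names `catchup_interval`, `run_catchup_round`, and `spawn_catchup_thread` are hypothetical, the `pull` closure stands in for the HTTP pull plus apply step, and a plain `std::thread` with a stop flag replaces whatever async runtime the real `spawn_catchup_loop` uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Maps the --catchup-interval-secs value to an optional interval.
/// A value of 0 disables the loop.
pub fn catchup_interval(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Runs one catchup round over all peers. `pull` is a hypothetical
/// stand-in that returns the number of rows applied from one peer.
/// Failures are logged and skipped, never returned to the caller.
pub fn run_catchup_round<F>(peers: &[&str], mut pull: F) -> usize
where
    F: FnMut(&str) -> Result<usize, String>,
{
    let mut applied = 0;
    for &peer in peers {
        match pull(peer) {
            Ok(n) => applied += n,
            Err(e) => eprintln!("catchup: pull from {} failed: {}", peer, e),
        }
    }
    applied
}

/// Spawns a background thread that calls `round` once per interval
/// until `stop` is set. Returns None when the loop is disabled.
pub fn spawn_catchup_thread<F>(
    interval: Option<Duration>,
    stop: Arc<AtomicBool>,
    mut round: F,
) -> Option<JoinHandle<()>>
where
    F: FnMut() + Send + 'static,
{
    let interval = interval?;
    Some(thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            round();
            thread::sleep(interval);
        }
    }))
}
```

In this sketch one unreachable peer costs only a log line: the round continues with the remaining peers, and the failed peer is retried on the next tick.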
This is deliberately NOT a substitute for the synchronous quorum-write
path — it’s a safety net for the tail. Normal writes still fan out via
broadcast_store_quorum; catchup only fires for rows that DIDN’T land
during the original write deadline.