Module boot_sentinel

Boot sentinel: auto-rollback to a last-known-good binary when a freshly-swapped binary crash-loops on startup (#582).

Both kanade-backend and kanade-agent are self-replacing Windows services: an update overwrites the running exe and the Service Control Manager restarts it. If the new binary crashes during early boot (exactly what the #573 JetStream regression did to the backend on 2026-06-11), nothing rolls it back — the SCM just restarts the same broken exe forever.

This module gates each boot. The swap step [arm_for_swap] writes a sentinel and snapshots the outgoing (known-good) binary to <exe>.last-good. The sentinel and quarantine files live in the shared data_dir but are namespaced by the exe’s role (.boot-sentinel-<role>.json), so a backend and an agent co-located on the same host keep independent boot state instead of clobbering a single shared file.

Every boot calls [check_on_boot] as the very first thing in main(), before NATS, the DB, or any bootstrap that can fail. The call increments a persisted attempt counter. Once the counter crosses the crash-loop threshold, it restores .last-good over the live exe and quarantines the failed version, so the autonomous self-update path won’t immediately re-deploy it (which would loop rollout↔rollback forever). [confirm_healthy], called once the process is genuinely up, promotes the running exe to the new last-good and clears the sentinel.
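The file naming described above can be illustrated with a minimal sketch. The helper names `sentinel_path` and `last_good_path` are hypothetical and are not taken from the module; only the `.boot-sentinel-<role>.json` and `<exe>.last-good` patterns come from the text.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: the per-role sentinel file inside the shared data dir.
/// Only the `.boot-sentinel-<role>.json` pattern is taken from the docs.
pub fn sentinel_path(data_dir: &Path, role: &str) -> PathBuf {
    data_dir.join(format!(".boot-sentinel-{}.json", role))
}

/// Hypothetical helper: `<exe>.last-good`, built by appending to the full
/// file name (so `svc.exe` becomes `svc.exe.last-good`, not `svc.last-good`).
pub fn last_good_path(exe: &Path) -> PathBuf {
    let mut name = exe.as_os_str().to_owned();
    name.push(".last-good");
    PathBuf::from(name)
}
```

Because the role is part of the file name, a backend and an agent sharing one data_dir resolve to different sentinel files.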

The attempt counter is persisted BEFORE the crashy code runs, so a hard crash still advances it: boots 1..=N each bump the counter and proceed, and boot N+1, the one that crosses the threshold, rolls back, after which the SCM restarts into .last-good.
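A rough sketch of the "persist first, then run risky code" idea follows. It assumes a plain-text counter file for brevity, whereas the real sentinel is a JSON file whose schema is not shown here; `bump_attempts` is a hypothetical name.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical sketch: read the persisted attempt count (0 if the file is
/// missing or unparsable), add one, and flush the new value to disk before
/// returning. The caller runs its crash-prone startup only after this returns.
pub fn bump_attempts(counter_file: &Path) -> io::Result<u32> {
    let previous = fs::read_to_string(counter_file)
        .ok()
        .and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(0);
    let next = previous.saturating_add(1);

    let mut file = fs::File::create(counter_file)?;
    file.write_all(next.to_string().as_bytes())?;
    // Ask the OS to flush to storage so the bump survives an abrupt exit.
    file.sync_all()?;
    Ok(next)
}
```

Since the write completes before any bootstrap code executes, a process that dies immediately afterwards has already recorded its attempt.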

§Windows exe lock

A running exe is locked on Windows (no overwrite), but a rename of the running exe IS allowed. So the rollback renames the live exe aside (<exe>.rollback-bak) and copies .last-good into place, then the caller exits so the SCM relaunches the restored binary. The same rename-then-replace works on Unix and in unit tests (where the “exe” is just a temp file), so the logic is testable everywhere.
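The rename-then-replace sequence might look roughly like the sketch below. The function names are hypothetical and the error handling is simplified; only the `.rollback-bak` and `.last-good` suffixes and the order of operations come from the description above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Append a suffix to the full file name, e.g. `svc.exe` + `.last-good`.
fn with_suffix(path: &Path, suffix: &str) -> PathBuf {
    let mut name = path.as_os_str().to_owned();
    name.push(suffix);
    PathBuf::from(name)
}

/// Hypothetical sketch of the rollback: move the live exe aside, then copy
/// the last-good snapshot into its place.
pub fn restore_last_good(exe: &Path) -> io::Result<()> {
    let last_good = with_suffix(exe, ".last-good");
    let aside = with_suffix(exe, ".rollback-bak");

    // Refuse to move the live exe if there is nothing to restore.
    if !last_good.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            "no last-good snapshot to restore",
        ));
    }

    // Renaming a running exe is permitted on Windows; overwriting it is not.
    fs::rename(exe, &aside)?;

    if let Err(err) = fs::copy(&last_good, exe) {
        // Best effort: put the original back so the service still has an exe.
        let _ = fs::rename(&aside, exe);
        return Err(err);
    }
    Ok(())
}
```

After a successful restore the caller would exit so the service manager relaunches the restored binary; the snippet leaves that step to the caller.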

Structs§

BootSentinel
Per-role boot guard. Construct once at the top of main().

Enums§

BootDecision
What [check_on_boot] decided. On RolledBack the caller MUST exit (non-zero) so the service manager relaunches the restored last-good binary.
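As an illustration of the caller's obligation, the sketch below uses a stand-in enum named `Decision`. It is not the real BootDecision: only the RolledBack variant is named in these docs, so the `Proceed` variant and the exit code 1 are assumptions.

```rust
/// Stand-in for the module's decision type. `Proceed` is an assumed variant;
/// only `RolledBack` appears in the documentation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Proceed,
    RolledBack,
}

/// Map a decision to the exit code the process should terminate with, or
/// `None` when startup should simply continue.
pub fn exit_code_for(decision: Decision) -> Option<i32> {
    match decision {
        Decision::Proceed => None,
        // Non-zero, so the service manager treats the exit as a failure
        // and relaunches the restored binary.
        Decision::RolledBack => Some(1),
    }
}
```

In main(), a caller could pass any `Some(code)` straight to `std::process::exit(code)` before touching the rest of the bootstrap.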

Constants§

DEFAULT_MAX_ATTEMPTS
Crash-loop threshold. Boot attempts 1..=N proceed; attempt N+1 triggers the rollback (the check is attempts <= max). So the default 3 gives a freshly-swapped binary three chances to confirm healthy and rolls back on the fourth boot — enough to ride out a one-off transient (slow disk, flaky first NATS connect) without masking a genuinely broken binary.
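The threshold arithmetic can be checked with a tiny sketch. `MAX_ATTEMPTS` mirrors the documented default of 3, and `may_proceed` is a hypothetical name for the `attempts <= max` check.

```rust
/// Mirrors the documented default threshold of 3.
pub const MAX_ATTEMPTS: u32 = 3;

/// Hypothetical helper: `attempts` is the counter value after this boot's
/// increment. The boot proceeds while `attempts <= max`; otherwise it rolls back.
pub fn may_proceed(attempts: u32, max: u32) -> bool {
    attempts <= max
}
```

With the default, attempts 1, 2 and 3 proceed and attempt 4 is the first one that fails the check.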