Boot sentinel: auto-rollback to a last-known-good binary when a freshly-swapped binary crash-loops on startup (#582).
Both kanade-backend and kanade-agent are self-replacing
Windows services: an update overwrites the running exe and the
Service Control Manager restarts it. If the new binary crashes
during early boot (exactly what the #573 JetStream regression did
to the backend on 2026-06-11), nothing rolls it back — the SCM
just restarts the same broken exe forever.
This module gates each boot. The swap step [arm_for_swap] writes
a sentinel and snapshots the outgoing (known-good) binary to
<exe>.last-good. The sentinel and quarantine files live in the
shared data_dir but are namespaced by the exe’s role
(.boot-sentinel-<role>.json), so a backend and an agent co-located
on the same host keep independent boot state instead of clobbering a
single shared file. Every boot calls [check_on_boot] as the very
first thing in main() — before NATS, the DB, or any bootstrap
that can fail — which increments a persisted attempt counter and,
once it crosses the crash-loop threshold, restores .last-good
over the live exe and quarantines the failed version so the
autonomous self-update path won’t immediately re-deploy it (which
would loop rollout↔rollback forever). [confirm_healthy], called
once the process is genuinely up, promotes the running exe to the
new last-good and clears the sentinel.
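The per-role namespacing described above can be sketched as a small path helper. This is only an illustration: `sentinel_path` is a hypothetical name, and the role strings `"backend"` and `"agent"` are assumed; only the `.boot-sentinel-<role>.json` pattern comes from the description.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: build the sentinel path for one role inside the
/// shared data directory, following the `.boot-sentinel-<role>.json` pattern.
pub fn sentinel_path(data_dir: &Path, role: &str) -> PathBuf {
    data_dir.join(format!(".boot-sentinel-{role}.json"))
}
```

Because the role is part of the file name, a backend and an agent that share one data directory resolve to different sentinel files.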
The attempt counter is persisted BEFORE the crashy code runs, so a
hard crash still advances it: boot 1..N each bump the counter, and
the boot that crosses the threshold rolls back, after which the SCM
restarts into .last-good.
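A minimal sketch of the persist-before-run idea, assuming a plain-text counter file instead of the module's JSON sentinel. `bump_attempts` is a hypothetical helper, not the module's API, and treating a missing or unreadable counter as zero is an assumption made for the example.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: read the persisted attempt count, add one, and flush
/// the new value to disk BEFORE the caller runs any code that might crash.
pub fn bump_attempts(path: &Path) -> io::Result<u32> {
    // A missing file or unparsable contents count as zero previous attempts.
    let previous = match fs::read_to_string(path) {
        Ok(text) => text.trim().parse::<u32>().unwrap_or(0),
        Err(err) if err.kind() == io::ErrorKind::NotFound => 0,
        Err(err) => return Err(err),
    };
    let next = previous.saturating_add(1);

    // Write and sync so the increment survives a hard crash right after return.
    let mut file = fs::File::create(path)?;
    file.write_all(next.to_string().as_bytes())?;
    file.sync_all()?;
    Ok(next)
}
```

Each call stands in for one boot: the value on disk already reflects the current attempt by the time the risky startup code begins.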
§Windows exe lock
A running exe is locked on Windows (no overwrite), but a rename
of the running exe IS allowed. So the rollback renames the live exe
aside (<exe>.rollback-bak) and copies .last-good into place,
then the caller exits so the SCM relaunches the restored binary.
The same rename-then-replace works on Unix and in unit tests (where
the “exe” is just a temp file), so the logic is testable everywhere.
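The rename-then-replace step can be sketched as follows. This is an illustration under assumptions, not the module's code: `restore_last_good` and `with_suffix` are hypothetical names, and the best-effort undo on a failed copy is a choice made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Append a suffix to the whole file name ("svc.exe" -> "svc.exe.last-good").
fn with_suffix(exe: &Path, suffix: &str) -> PathBuf {
    let mut name = exe.as_os_str().to_os_string();
    name.push(suffix);
    PathBuf::from(name)
}

/// Hypothetical rollback: move the live exe aside, then copy the snapshot in.
pub fn restore_last_good(exe: &Path) -> io::Result<()> {
    let last_good = with_suffix(exe, ".last-good");
    let backup = with_suffix(exe, ".rollback-bak");

    // Refuse to touch the live exe if there is nothing to restore.
    if !last_good.exists() {
        return Err(io::Error::new(io::ErrorKind::NotFound, "no last-good snapshot"));
    }

    // Renaming works even while the exe is running on Windows.
    fs::rename(exe, &backup)?;

    // Copy (not move) so the snapshot stays available for a later rollback.
    if let Err(err) = fs::copy(&last_good, exe) {
        // Best effort: put the original back so the path is not left empty.
        let _ = fs::rename(&backup, exe);
        return Err(err);
    }
    Ok(())
}
```

After this returns, the caller would still need to exit so the service manager relaunches the restored binary.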
Structs§
- BootSentinel - Per-role boot guard. Construct once at the top of main().
Enums§
- BootDecision - What [check_on_boot] decided. On RolledBack the caller MUST exit (non-zero) so the service manager relaunches the restored last-good binary.
Constants§
- DEFAULT_MAX_ATTEMPTS - Crash-loop threshold. Boot attempts 1..=N proceed; attempt N+1 triggers the rollback (the check is attempts <= max). So the default 3 gives a freshly-swapped binary three chances to confirm healthy and rolls back on the fourth boot: enough to ride out a one-off transient (slow disk, flaky first NATS connect) without masking a genuinely broken binary.
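The threshold arithmetic can be checked with a small sketch. The `attempts <= max` comparison and the default of 3 come from the description above; the `Decision` enum and `decide` function are hypothetical stand-ins, not the module's actual types.

```rust
/// Default from the description: three attempts may proceed.
pub const DEFAULT_MAX_ATTEMPTS: u32 = 3;

/// Hypothetical stand-in for the boot decision.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Proceed,
    RollBack,
}

/// `attempts` is the counter value AFTER this boot has bumped it.
pub fn decide(attempts: u32, max: u32) -> Decision {
    if attempts <= max {
        Decision::Proceed
    } else {
        Decision::RollBack
    }
}
```

With the default, attempts 1, 2, and 3 proceed and attempt 4 rolls back.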