v0.8.4 #72 — load manager snapshot files with per-manager fault isolation.
§Why
Pre-#72 each of the nine --*-state-file loaders in main.rs used
the from_json(&raw).map_err(|e| format!(...))? pattern: a single
corrupted, truncated, or schema-incompatible snapshot would bubble
Err out of the boot sequence and kill the gateway start-up.
The operator was forced to either restore the file from backup or
manually rm it before the gateway would even bind its listener —
a loud restart-loop that took the entire data-plane down for one
manager’s bad JSON.
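The cascade described above can be illustrated with a minimal, std-only sketch. `DemoManager`, its toy `from_json`, and `boot_fail_fast` are hypothetical stand-ins, not the gateway's real types or its main.rs code; the point is only that the `?` after `map_err` aborts the whole boot on the first bad snapshot.

```rust
/// Hypothetical manager state; the real managers live in the gateway crate.
#[derive(Debug, Default, PartialEq)]
struct DemoManager {
    entries: Vec<String>,
}

impl DemoManager {
    /// Toy stand-in for a real `from_json`: accepts a comma-separated list
    /// wrapped in square brackets and rejects anything else.
    fn from_json(raw: &str) -> Result<Self, String> {
        let inner = raw
            .trim()
            .strip_prefix('[')
            .and_then(|s| s.strip_suffix(']'))
            .ok_or_else(|| "expected [..]".to_string())?;
        let entries = inner
            .split(',')
            .map(|s| s.trim().to_string())
            .filter(|s| !s.is_empty())
            .collect();
        Ok(Self { entries })
    }
}

/// Fail-fast boot (the pre-#72 shape): the first snapshot that fails to
/// parse returns Err, so no later manager is loaded at all.
fn boot_fail_fast(snapshots: &[(&str, &str)]) -> Result<Vec<DemoManager>, String> {
    let mut managers = Vec::new();
    for (name, raw) in snapshots {
        let m = DemoManager::from_json(raw)
            .map_err(|e| format!("{name}: state file parse failed: {e}"))?;
        managers.push(m);
    }
    Ok(managers)
}
```

Calling `boot_fail_fast` with one truncated snapshot among several healthy ones returns `Err` for the whole batch, which is the behavior #72 set out to remove.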
§What changed
load_or_fresh turns the read-side Err/parse-side Err into:
- a tracing::warn! log line carrying the manager name, the file path, and the underlying error (operators grep for "state file parse failed" in logs);
- a bump to the s4_state_file_load_failures_total{manager,reason} Prometheus counter (operators alert on rate(... > 0) so silent boot-time fall-backs surface in dashboards);
- a fresh T::default() manager: the gateway boots with empty in-memory state for the affected manager, and the operator’s snapshot file is left in place for post-mortem inspection (we never touch the operator’s bytes; recovering / re-importing is their call).
Every other manager keeps loading normally. One bad file no longer cascades into a gateway-wide DoS.
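The three outcomes listed above might be sketched as follows. This is a std-only approximation under stated assumptions, not the actual `load_or_fresh` signature: `load_or_fresh_sketch`, `DemoManager`, and `parse_demo` are hypothetical, `eprintln!` stands in for `tracing::warn!`, and a single `AtomicU64` stands in for the labelled Prometheus counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the s4_state_file_load_failures_total counter
/// (a real Prometheus counter would also carry manager/reason labels).
static LOAD_FAILURES: AtomicU64 = AtomicU64::new(0);

/// Hypothetical manager state used only for this sketch.
#[derive(Debug, Default, PartialEq)]
struct DemoManager {
    entries: Vec<String>,
}

/// Toy parser: accepts "[a, b, c]" and rejects anything else.
fn parse_demo(raw: &str) -> Result<DemoManager, String> {
    let inner = raw
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .ok_or_else(|| "expected [..]".to_string())?;
    let entries = inner
        .split(',')
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect();
    Ok(DemoManager { entries })
}

/// On a read-side or parse-side error: log, count, and fall back to
/// `T::default()`. The snapshot file itself is never touched here.
fn load_or_fresh_sketch<T: Default>(
    manager: &str,
    path: &str,
    read: Result<String, String>,
    parse: impl FnOnce(&str) -> Result<T, String>,
) -> T {
    // Tag each failure with a reason so both error sources share one path.
    let outcome = match read {
        Ok(raw) => parse(&raw).map_err(|e| ("parse", e)),
        Err(e) => Err(("read", e)),
    };
    match outcome {
        Ok(state) => state,
        Err((reason, err)) => {
            eprintln!(
                "WARN state file {reason} failed manager={manager} path={path} error={err}"
            );
            LOAD_FAILURES.fetch_add(1, Ordering::Relaxed);
            T::default()
        }
    }
}
```

Because the fallback is decided per call, a caller that loads each manager through such a function gets an empty state for the one bad file while every other manager restores normally.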
§What did NOT change
- --mfa-default-secret-file keeps its fail-closed read path. A missing or unreadable MFA secret means MFA verification cannot succeed; silently booting with no secret would let DELETEs slip past the MFA gate. That call site stays inside the MFA loader block and continues to surface a hard error.
- The on-disk snapshot is never deleted, renamed, or rewritten by the boot path. Operators decide whether to rm the bad file or restore from a known-good copy.
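For contrast with the graceful fallback, a fail-closed read could look roughly like the sketch below. `read_secret_fail_closed` is a hypothetical name, and rejecting an empty file is an assumption of this sketch rather than documented behavior; the only property taken from the text above is that a missing or unreadable secret surfaces as a hard error instead of a default.

```rust
use std::fs;
use std::path::Path;

/// Fail-closed read: every problem is returned as Err so the caller can
/// abort boot. There is deliberately no `Default` fallback here.
fn read_secret_fail_closed(path: &Path) -> Result<String, String> {
    let raw = fs::read_to_string(path)
        .map_err(|e| format!("MFA secret {} unreadable: {e}", path.display()))?;
    let secret = raw.trim();
    // Assumption of this sketch: an empty secret is as unusable as a missing one.
    if secret.is_empty() {
        return Err(format!("MFA secret {} is empty", path.display()));
    }
    Ok(secret.to_string())
}
```

The caller would propagate the `Err` with `?`, so the gateway refuses to start rather than running with the MFA gate effectively disabled.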
Functions§
- load_or_fresh
  v0.8.4 #72: load a manager snapshot with per-manager graceful degradation. See module docs for the contract.
- read_state_file_or_fresh
  Read a --*-state-file <PATH> snapshot, returning Ok(None) for the three “start fresh” cases and Ok(Some(json)) for the actual restore-from-snapshot case: