pub struct Supervisor<R: ProcessRunner = JobRunner> { /* private fields */ }Expand description
Keeps a Command alive: runs it, classifies every exit against the
RestartPolicy and the stop_when predicate, and
restarts it after an exponential-backoff delay until supervision ends.
use processkit::{Command, RestartPolicy, Supervisor};
use std::time::Duration;
let outcome = Supervisor::new(Command::new("my-server").args(["--port", "8080"]))
.restart(RestartPolicy::OnCrash)
.max_restarts(5)
.backoff(Duration::from_millis(200), 2.0)
.stop_when(|res| res.code() == Some(0))
.run()
.await?;
println!("ended after {} restarts: {:?}", outcome.restarts, outcome.stopped);Defaults: OnCrash, unlimited restarts, backoff
200ms × 2.0 capped at 30 s, jitter on, failure-storm guard off (enable
with storm_pause; once enabled, failure-score
half-life 30 s and threshold 5.0).
Runs go through a ProcessRunner — JobRunner by default (each
incarnation in its own private kill-on-drop group). Inject another with
with_runner: a &ProcessGroup supervises every
incarnation inside one shared group, and a
ScriptedRunner makes supervision logic fully
hermetic in tests.
Implementations§
Source§impl Supervisor<JobRunner>
impl Supervisor<JobRunner>
Source§impl<R: ProcessRunner> Supervisor<R>
impl<R: ProcessRunner> Supervisor<R>
Sourcepub fn with_runner<R2: ProcessRunner>(self, runner: R2) -> Supervisor<R2>
pub fn with_runner<R2: ProcessRunner>(self, runner: R2) -> Supervisor<R2>
Run every incarnation through runner instead of the default
JobRunner — e.g. a &ProcessGroup for one shared kill-on-drop
group, or a test double for hermetic supervision tests.
With a shared group, the group’s state applies to every incarnation:
notably, restarting into a suspended group on the Linux cgroup
mechanism spawns the new child frozen (see the
ProcessGroup::suspend docs, process-control feature) — resume the
group before supervising into it.
Sourcepub fn restart(self, policy: RestartPolicy) -> Self
pub fn restart(self, policy: RestartPolicy) -> Self
When to restart (default: OnCrash).
Sourcepub fn max_restarts(self, n: u32) -> Self
pub fn max_restarts(self, n: u32) -> Self
Restart at most n times — n + 1 total runs (default: unlimited).
Sourcepub fn backoff(self, base: Duration, factor: f64) -> Self
pub fn backoff(self, base: Duration, factor: f64) -> Self
Exponential backoff before each restart: the n-th restart (0-based)
waits base × factor^n, capped by max_backoff.
A factor below 1.0 (or non-finite) is treated as 1.0.
Default: 200ms × 2.0.
Sourcepub fn max_backoff(self, cap: Duration) -> Self
pub fn max_backoff(self, cap: Duration) -> Self
Cap any single backoff delay (default: 30 s).
Sourcepub fn jitter(self, enabled: bool) -> Self
pub fn jitter(self, enabled: bool) -> Self
Multiply each backoff delay by a uniform factor in [0.5, 1.5)
(default: on), so a fleet of supervised workers restarted by the
same incident doesn’t stampede back in lockstep. Disable for
deterministic delays.
Sourcepub fn storm_pause(self, pause: Duration) -> Self
pub fn storm_pause(self, pause: Duration) -> Self
Enable the failure-storm guard: when crash-restarts cluster faster
than the failure score can decay (see
failure_decay /
failure_threshold), pause restarts once
for pause — jittered into [0.5, 1.5) of the nominal value per
jitter — then reset the score and resume. Off by
default; this is the master switch, the other two knobs only tune it.
Each failed run adds 1 to a score that halves every
failure_decay: score = score × 0.5^(Δt / failure_decay) + 1. A
service that fails rarely never accumulates past the threshold; a
storm trips it and gets one collective pause instead of hammering
restarts at backoff speed. (Design borrowed from Go’s suture
supervisor — the idea, not the code.)
Only failures feed the score: crashes and spawn errors. A clean exit
restarted under Always is not a failure.
The storm pause stacks with (runs before) the per-restart backoff,
and max_restarts is checked first — a storm
pause never resurrects an exhausted budget. Pauses taken are reported
in SupervisionOutcome::storm_pauses.
Sourcepub fn failure_decay(self, decay: Duration) -> Self
pub fn failure_decay(self, decay: Duration) -> Self
Half-life of the failure score used by the storm guard (default: 30 s):
every decay seconds without a failure, the accumulated score halves.
A zero half-life keeps no history — every failure scores exactly 1,
so the guard trips only with a threshold below 1.0. No effect unless
storm_pause is set.
Sourcepub fn failure_threshold(self, threshold: f64) -> Self
pub fn failure_threshold(self, threshold: f64) -> Self
Failure score above which the storm guard trips (default: 5.0 —
roughly “more than five failures inside one half-life”). A non-finite
threshold never trips. No effect unless
storm_pause is set.
Sourcepub fn stop_when(
self,
predicate: impl Fn(&ProcessResult<String>) -> bool + Send + Sync + 'static,
) -> Self
pub fn stop_when( self, predicate: impl Fn(&ProcessResult<String>) -> bool + Send + Sync + 'static, ) -> Self
End supervision when predicate matches a completed run — checked
before the RestartPolicy on every exit, clean or not. (It never
sees a run that failed to start; spawn errors are classified by the
policy alone.)
Sourcepub async fn run(self) -> Result<SupervisionOutcome>
pub async fn run(self) -> Result<SupervisionOutcome>
Supervise until the policy, the predicate, or the restart budget ends
it, and report the SupervisionOutcome.
§Errors
Returns Err only when the terminating attempt failed to produce a
result at all (a spawn/IO failure when no further restart is allowed) —
there is no final ProcessResult to report in that case. A spawn
failure with restarts remaining counts as a crash and is retried.
§Cancellation
Dropping this future mid-run abandons the in-flight incarnation. With
the default JobRunner it is killed on drop (the incarnation owns a
private group); with a shared-group runner
(with_runner(&group)) the incarnation stays
alive in the caller’s group until the group tears it down.
An incarnation cancelled via its token (Command::cancel_on, with the
cancellation feature) is terminal: supervision returns that
Error::Cancelled immediately, regardless of policy or budget — the
token stays cancelled, so a restart would only be cancelled again.