processkit 0.8.0

# Supervision

[‹ docs index](README.md)

Where [`retry`](timeouts-and-cancellation.md#retries) answers *"run this once,
replaying on failure"*, a `Supervisor` answers the different question *"keep
this alive"*: restart a child per policy whenever it exits, with bounded
restarts, exponential backoff, and jitter — a minimal `runit`/`systemd`-style
keeper, platform-agnostic because it sits entirely on the
[`ProcessRunner` seam](testing.md#the-processrunner-seam).

- [The shape](#the-shape)
- [Policies: what counts as a crash](#policies-what-counts-as-a-crash)
- [Backoff and jitter](#backoff-and-jitter)
- [Failure storms](#failure-storms)
- [Stopping](#stopping)
- [Outcomes](#outcomes)
- [Supervising inside a shared group](#supervising-inside-a-shared-group)
- [Errors and cancellation](#errors-and-cancellation)

## The shape

```rust,no_run
use processkit::{Command, RestartPolicy, Supervisor};
use std::time::Duration;

#[tokio::main]
async fn main() -> processkit::Result<()> {
    let outcome = Supervisor::new(Command::new("my-server").args(["--port", "8080"]))
        .restart(RestartPolicy::OnCrash)           // default
        .max_restarts(5)                           // default: unlimited
        .backoff(Duration::from_millis(200), 2.0)  // default: 200ms × 2.0
        .max_backoff(Duration::from_secs(30))      // default: 30s cap
        .jitter(true)                              // default: on
        .stop_when(|res| res.code() == Some(0))    // optional exit condition
        .run()
        .await?;

    println!(
        "ended after {} restarts, reason: {:?}, last exit: {:?}",
        outcome.restarts, outcome.stopped, outcome.final_result.code(),
    );
    Ok(())
}
```

Each *incarnation* is one full captured run of the command (so the command's
own `timeout`, stdin, env, … all apply per run — with the usual
[one-shot-stdin caveat](commands.md#standard-input) for the second run
onward).

## Policies: what counts as a crash

A **crash** is any run without a clean exit: a non-zero code, a timeout, a
signal-kill, or a spawn failure.

| `RestartPolicy` | Restarts after… |
|---|---|
| `OnCrash` *(default)* | crashes only; a clean exit ends supervision (`PolicySatisfied`) |
| `Always` | every completed run, clean or not — pair it with `stop_when`/`max_restarts` or it loops forever |
| `Never` | nothing: one run, reported as-is |

## Backoff and jitter

The *n*-th restart (0-based) sleeps

```text
delay(n) = min(base × factor^n, max_backoff) × jitter
```

with `jitter` drawn uniformly from `[0.5, 1.5)` per restart. Jitter is **on by
default** so a fleet of supervised workers restarted by the same incident
doesn't stampede back in lockstep; `jitter(false)` gives deterministic delays
(useful in tests with a paused tokio clock). A non-finite or `< 1.0` factor is
treated as `1.0` — constant delay, never a shrinking one.

```text
base=200ms, factor=2.0, cap=30s:
restart #0 → ~200ms   #1 → ~400ms   #2 → ~800ms … #7 → ~25.6s   #8+ → 30s (cap)
```

## Failure storms

Backoff spaces *individual* restarts; `max_restarts` is a *lifetime* cap.
Neither distinguishes a service that fails once a day from one that is
suddenly crash-looping. The opt-in **storm guard** does (a design borrowed
from Go's [`suture`](https://github.com/thejerf/suture) supervisor — the
idea, not the code):

```rust,no_run
use processkit::{Command, Supervisor};
use std::time::Duration;

let outcome = Supervisor::new(Command::new("worker"))
    .storm_pause(Duration::from_secs(15))     // master switch — off by default
    .failure_decay(Duration::from_secs(30))   // score half-life (default 30s)
    .failure_threshold(5.0)                   // trip point (default 5.0)
    .run()
    .await?;

println!("storm pauses taken: {}", outcome.storm_pauses);
```

Each failed run adds `1` to a score that **halves every `failure_decay`**:

```text
score = score × 0.5^(Δt / failure_decay) + 1
```

- *Fails rarely*: the score decays back toward `1` between failures and never
  reaches the threshold — the guard stays out of the way.
- *Failure storm*: failures arrive faster than the half-life drains them, the
  score climbs past `failure_threshold`, and the supervisor takes **one
  collective pause** of `storm_pause` (jittered into `[0.5, 1.5)` like the
  backoff), resets the score, and resumes.

Only failures feed the score — crashes and spawn errors, not clean exits
restarted under `RestartPolicy::Always`. The pause stacks with (runs before)
the per-restart backoff, and the `max_restarts` budget is checked first, so a
storm pause never extends an exhausted budget. Pauses taken are reported in
`SupervisionOutcome::storm_pauses`.

## Stopping

Three gates, checked in this order after every completed run:

1. **`stop_when(predicate)`** — sees the run's `ProcessResult`; returning
   `true` ends supervision *regardless of policy* (→ `StopReason::Predicate`).
   "Exit 0 is done, anything else is a crash" is the classic:
   `stop_when(|res| res.code() == Some(0))` under `RestartPolicy::Always`.
2. **The policy** — `OnCrash` stops on a clean exit (→ `PolicySatisfied`).
3. **`max_restarts(n)`** — at most *n* restarts = *n + 1* total runs; an
   exhausted budget reports the last result (→ `RestartsExhausted`).
   `max_restarts(0)` means exactly one run.

## Outcomes

`run()` resolves to a `SupervisionOutcome`:

```rust,no_run
let outcome = Supervisor::new(Command::new("job")).run().await?;

outcome.final_result; // ProcessResult<String> of the LAST run
outcome.restarts;     // how many restarts happened (not counting run #1)
outcome.stopped;      // StopReason::{Predicate, PolicySatisfied, RestartsExhausted}
outcome.storm_pauses; // failure-storm pauses taken (0 unless storm_pause is set)
```

Note `run()` returning `Ok` does **not** mean the child succeeded — it means
supervision *concluded*. Inspect `final_result` (or `ensure_success()` it) for
the child's own verdict.

## Supervising inside a shared group

The supervisor runs through any `ProcessRunner`. The headline production
variant injects a [`ProcessGroup`](process-groups.md) so every incarnation —
and everything it spawns — lives in one kill-on-drop container:

```rust,no_run
use processkit::{Command, ProcessGroup, RestartPolicy, Supervisor};

let group = ProcessGroup::new()?;

let outcome = Supervisor::new(Command::new("worker"))
    .with_runner(&group)                 // &group is itself a ProcessRunner
    .restart(RestartPolicy::OnCrash)
    .max_restarts(10)
    .run()
    .await?;

// The group outlives supervision: drop it (or shutdown) to reap any strays.
```

Mind one interaction: don't supervise into a group you've
[suspended](process-groups.md#suspending-and-resuming) — under the cgroup
mechanism the restarted child would start frozen (and the spawn itself can
block). Resume first.

The same injection point makes supervision logic **hermetically testable** —
script a sequence of fake results and assert the restart/stop behavior with
no real process; see [Testing your code](testing.md#scripting-replies).

## Errors and cancellation

A run that produces no result at all (spawn/IO failure) can't be judged by
`stop_when`; the policy treats it as a crash and restarts (with backoff)
unless the policy is `Never` or the budget is exhausted — then the error
itself surfaces as `run()`'s `Err`.

With the [`cancellation` feature](timeouts-and-cancellation.md#cancellation),
a cancelled incarnation is **terminal**: `run()` returns
`Err(Error::Cancelled)` immediately. The token never un-cancels, so a restart
could only produce another instantly-cancelled run — the supervisor refuses
the futile loop.

---

Next: [Testing your code](testing.md) ·
[Timeouts, retries & cancellation](timeouts-and-cancellation.md) ·
[Process groups](process-groups.md)