api-debug-lab 0.4.0

Reproducible API troubleshooting fixtures and a Rust diagnostic CLI.
Documentation
# P0 escalation note (template)

For incidents where the customer is *currently* impacted in production
and a credible "fire" is implied. Use sparingly — P0 carries a paging
cost and should not be invoked for diagnostics that can wait until
business hours.

---

## Customer

- **Account / org**: <id>
- **Plan / criticality tier**: <e.g., enterprise, regulated>
- **Reporter**: <name, channel, started at UTC>

## Symptom

One sentence stating the customer-visible failure mode and its scope.

> Example: All webhook deliveries to the customer's `/orders` endpoint
> have been returning 401 since 18:40 UTC, ~3,400 events queued.

## Reproduction

```bash
api-debug-lab replay <case>
```

…or the curl reproduction emitted by `replay`. Confirm reproduction
*before* paging — if the failure cannot be reproduced from the captured
case, treat as a "cannot reproduce" escalation (see
`escalation_template_no_repro.md`).

## Evidence (verbatim from `diagnose`)

Paste the EVIDENCE block from `diagnose <case> --format human`. Do not
summarise — the raw block is the audit trail.

## Likely cause

The rule's `likely_cause`. Add one line of customer-context translation
("their secret rotated 12 minutes before the spike began").

## Time-bounded asks

P0 escalations should request *specific* time-bounded actions, not
open-ended investigation. Example:

- Within 15 min: confirm or rule out a body-mutating proxy on the
  receiver path.
- Within 30 min: rotate the signing secret or revert the most recent
  receiver deploy.
- Within 60 min: drain the queued retry backlog after either fix.

## Hand-off artefacts

The on-call engineer cannot work without:

1. The exact bytes that arrived at the receiver (pcap or proxy access
   log).
2. The exact bytes the sender hashed.
3. Any deploy / rotation events in the last 60 minutes on either side.
4. The wall-clock skew between sender and receiver at the time of the
   first failed request.

Without these the diagnosis is bounded by what the customer reports;
with them, the receiving team can localise to one of: secret, body,
or timestamp prefix.

## Escalation chain

- Primary on-call: <rotation>
- Secondary: <rotation>
- Manager (if no acknowledgement in 10 min): <name>

## Post-incident

A P0 always produces a postmortem. Link the case file (`<fixture_dir>`)
in the post — it is the smallest reproducible artefact for the
incident timeline.