# ADR-0010: Multi-Layer Retry Strategy

## Status

✅ Accepted

## Context

omni-dev's `twiddle` and `check` commands make AI API calls that can fail in several
distinct ways:

- **Batch-level failure.** A batch of multiple commits sent in a single request fails
  entirely — due to a transient network error, a rate limit, or a model-side error. The
  failure says nothing about whether individual commits within the batch would succeed
  alone.

- **Response parse failure.** The AI returns a successful HTTP response but the content is
  not valid YAML, or the YAML does not conform to the expected schema. This occurs
  occasionally when the model produces malformed output despite receiving a valid prompt.

- **Persistent single-commit failure.** A specific commit fails even when sent alone,
  suggesting a commit-specific problem (e.g. a diff that consistently confuses the model)
  rather than a transient infrastructure issue.

A single retry policy cannot address all three failure modes well:

- Retrying a failed batch as a whole risks hitting the same context-window or model issue
  again. It does not isolate which commit(s) caused the failure.
- Immediately surfacing a batch failure to the user abandons commits that might have
  succeeded individually.
- Silently swallowing all failures (or retrying indefinitely) provides no user control.

Three single-policy alternatives were evaluated:

- **No retry.** Simple but fragile. Any transient failure propagates immediately to the
  user, who must re-run the entire command. Transient network and rate-limit errors are
  common enough in production AI API usage that this is not acceptable.

- **Uniform exponential backoff.** Appropriate for rate-limit errors, but adds latency for
  every failure regardless of cause, and does not distinguish between a flaky network and
  a commit-specific model failure.

- **Interactive retry only.** Lets the user decide on every failure. Correct in principle
  but requires user attention for failures that are trivially recoverable automatically,
  making the common case unnecessarily interactive.

## Decision

We will use a **three-layer retry strategy** numbered in execution order from innermost to
outermost — from most-automatic to most-interactive.

### Layer 1 — Parse/request retry (check only)

Implemented in `src/claude/client.rs` as `check_commits_with_retry`.

For the `check` command, each AI request is retried up to two additional times (three total
attempts) when the request fails or the response cannot be parsed as a valid check report. A
warning is printed on each failed attempt. This is the innermost layer: it wraps every
individual AI call, absorbing occasional model output instability without user involvement.

This layer is not applied to `twiddle`. Amendment YAML parsing is structurally simpler than
check-report parsing, and amendment generation failures are handled by layers 2 and 3.
Convergence of retry behaviour between the two commands is deferred until evidence of
amendment parse failures accumulates.
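
The bounded retry described above can be sketched as follows. This is a minimal illustration, not the body of `check_commits_with_retry`: the helper name `retry_up_to`, its signature, and the warning text are assumptions made for the example.

```rust
/// Hypothetical helper (not part of omni-dev): calls `attempt` up to
/// `max_attempts` times and returns the first success, or the last error
/// if every attempt fails. A warning is printed for each failed attempt,
/// and there is no delay between attempts.
fn retry_up_to<T, E: std::fmt::Display>(
    max_attempts: u32,
    mut attempt: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts >= 1, "at least one attempt is required");
    let mut last_err = None;
    for n in 1..=max_attempts {
        match attempt(n) {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Covers both request failures and unparseable responses.
                eprintln!("warning: attempt {} of {} failed: {}", n, max_attempts, err);
                last_err = Some(err);
            }
        }
    }
    Err(last_err.expect("loop ran at least once"))
}
```

Calling `retry_up_to(3, ...)` corresponds to the "two additional times (three total attempts)" policy; the closure would wrap one AI request plus the parse of its response.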

### Layer 2 — Split-and-retry (batch failures)

Implemented in `src/cli/git/twiddle.rs` and `src/cli/git/check.rs`.

When a batch of more than one commit fails (in `check`, only after layer 1 has exhausted
its attempts), the batch is automatically split and each commit is retried as a solo
request. A warning is printed
(`warning: batch of N failed, retrying individually: <error>`), but no user input is
required. This isolates the failure: commits that succeed individually are preserved; commits
that also fail individually fall through to layer 3.

Solo batches that fail are not split further (there is nothing to split); they fall directly
to layer 3.
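
A minimal sketch of the split-and-retry flow, under stated assumptions: `BatchOutcome`, `process_batch`, and the `send` callback are hypothetical names, commits and results are modelled as plain strings, and `send` stands in for one AI request (already wrapped by layer 1 in the `check` case).

```rust
/// Hypothetical outcome type: results for commits that succeeded, and
/// (commit, error) pairs for commits that fall through to layer 3.
#[derive(Debug, Default, PartialEq)]
struct BatchOutcome {
    succeeded: Vec<(String, String)>,
    failed: Vec<(String, String)>,
}

/// Sends `batch` as one request. If a multi-commit batch fails, each commit
/// is retried as a solo request. A failing solo batch is not split further.
fn process_batch(
    batch: &[String],
    mut send: impl FnMut(&[String]) -> Result<Vec<String>, String>,
) -> BatchOutcome {
    let mut outcome = BatchOutcome::default();
    match send(batch) {
        // Assumes `send` returns one result per commit, in order.
        Ok(results) => outcome.succeeded.extend(batch.iter().cloned().zip(results)),
        Err(err) if batch.len() > 1 => {
            eprintln!(
                "warning: batch of {} failed, retrying individually: {}",
                batch.len(),
                err
            );
            for commit in batch {
                match send(std::slice::from_ref(commit)) {
                    Ok(results) => outcome
                        .succeeded
                        .extend(std::iter::once(commit.clone()).zip(results)),
                    Err(solo_err) => outcome.failed.push((commit.clone(), solo_err)),
                }
            }
        }
        Err(err) => {
            // Solo batch: nothing to split, so it goes straight to layer 3.
            for commit in batch {
                outcome.failed.push((commit.clone(), err.clone()));
            }
        }
    }
    outcome
}
```

In this sketch a three-commit batch that fails as a whole costs four requests in total: the original batch plus one per commit.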

### Layer 3 — Interactive retry (persistent failures)

Implemented in `run_interactive_retry_twiddle_check` (`twiddle.rs`) and
`run_interactive_retry_check` (`check.rs`).

Commits that remain failed after layers 1 and 2 are presented to the user one at a time.
The user is prompted to retry the commit individually or skip it. This gives the user
explicit control over commits with persistent failures without abandoning the rest of the
run.
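
The prompt loop might look roughly like the sketch below. It is illustrative only: the function name `interactive_retry`, the `r`/`s` key bindings, the prompt wording, re-prompting after a failed retry, and returning an error when input is closed are all assumptions, not behaviour confirmed by this ADR. Input and output are passed in as generic readers and writers so the loop can be exercised without a terminal.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical sketch: prompts once per failed commit. `Some(result)` means
/// a retry succeeded; `None` means the user chose to skip the commit.
fn interactive_retry<R: BufRead, W: Write>(
    failed: &[String],
    input: &mut R,
    output: &mut W,
    mut retry: impl FnMut(&str) -> Result<String, String>,
) -> io::Result<Vec<(String, Option<String>)>> {
    let mut outcomes = Vec::new();
    for commit in failed {
        let resolved = loop {
            write!(output, "{} failed. [r]etry / [s]kip? ", commit)?;
            output.flush()?;
            let mut line = String::new();
            // Zero bytes read means the input was closed (for example in CI).
            // Assumption: report an error instead of silently skipping.
            if input.read_line(&mut line)? == 0 {
                return Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "input closed during interactive retry",
                ));
            }
            match line.trim() {
                "r" => match retry(commit.as_str()) {
                    Ok(result) => break Some(result),
                    // Retry failed again: report it and ask once more.
                    Err(err) => writeln!(output, "retry failed: {}", err)?,
                },
                "s" => break None,
                _ => writeln!(output, "please answer r or s")?,
            }
        };
        outcomes.push((commit.clone(), resolved));
    }
    Ok(outcomes)
}
```

In a real command the reader would be a locked `std::io::stdin()` and the writer `std::io::stdout()`.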

### Layer interactions

The layers compose from innermost to outermost. For a multi-commit batch in `check`:

1. The batch is sent as a single AI request; layer 1 makes up to three attempts in total.
2. If the batch still fails, each commit is retried individually (layer 2); layer 1 again
   makes up to three attempts for each individual request.
3. For each individual commit that still fails, the user is prompted (layer 3).

For `twiddle`, layer 1 does not apply; a failing batch goes directly to layer 2 (split),
and individual failures go to layer 3.

For a single-commit batch in either command, layer 2 does not apply; failures go directly
to layer 3 (after layer 1 retries for check).
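
To make the composition concrete, the following sketch counts the worst-case number of automatic requests (layers 1 and 2 only, before any interactive retry) for a batch in which every request fails. The `Command` enum and function names are hypothetical; the figures follow from the rules above: three attempts per request in `check`, one in `twiddle`, and a split only for batches of more than one commit.

```rust
/// Hypothetical model of the two commands covered by this ADR.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Command {
    Check,
    Twiddle,
}

/// Attempts layer 1 makes per request: three for `check`, one for `twiddle`
/// (where layer 1 does not apply).
fn attempts_per_request(cmd: Command) -> u32 {
    match cmd {
        Command::Check => 3,
        Command::Twiddle => 1,
    }
}

/// Worst-case requests sent automatically when every request fails.
fn worst_case_automatic_requests(cmd: Command, batch_size: u32) -> u32 {
    if batch_size == 0 {
        return 0;
    }
    let per_request = attempts_per_request(cmd);
    // The batch request itself, attempted `per_request` times.
    let batch_attempts = per_request;
    // Layer 2 only splits batches of more than one commit.
    let solo_attempts = if batch_size > 1 {
        batch_size * per_request
    } else {
        0
    };
    batch_attempts + solo_attempts
}
```

For a four-commit batch this gives 3 + 4 × 3 = 15 requests in `check` and 1 + 4 = 5 in `twiddle`; a single-commit batch gives 3 and 1 respectively.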

## Consequences

**Positive:**

- **Transient failures are absorbed automatically.** Network blips and occasional model
  errors are retried without user involvement. The common case (transient failure of one
  commit in a multi-commit batch) resolves automatically via layer 2.

- **Failure isolation.** A batch failure does not discard the results of all commits in the
  batch. Layer 2 determines which specific commits are problematic, preserving the work
  done on commits that succeed individually.

- **User control for persistent failures.** Layer 3 ensures the user is never left with a
  silent partial result. They can retry individual problematic commits or skip them with
  full awareness.

**Negative:**

- **Asymmetric behaviour between `twiddle` and `check`.** Layer 1 (parse/request retry)
  applies only to `check`. A user who encounters a transient amendment parse failure in
  `twiddle` will reach layer 3 (interactive prompt) sooner than a user who encounters an
  equivalent failure in `check`. This inconsistency is a known limitation to be revisited
  if amendment parse failures become common in practice.

- **No backoff between retries.** Split-and-retry and parse/request retry do not introduce
  delays between attempts. Under sustained rate limiting, rapid retries may compound the
  problem. This is accepted for now given the bounded retry counts; a backoff mechanism
  can be added if rate-limit errors become frequent.

- **Layer 2 adds N API calls on batch failure.** When a batch of N commits fails, the
  split-and-retry sends N additional requests, one per commit; in `check`, layer 1 may
  attempt each of those up to three times. Under rate limits this may temporarily
  increase pressure. The trade-off is accepted because the alternative, discarding all N
  commits, is strictly worse for the user.

**Neutral:**

- **Interactive retry requires stdin.** Layer 3 reads from stdin, which means the command
  cannot be used non-interactively (e.g. in CI) when commits reach layer 3. This is
  intentional: persistent failures require human judgement and are not appropriate to
  silently swallow in automation.