apiforge 0.4.0

Production-grade API release automation CLI. From merged code to healthy pods in production — one command.
Documentation
# ApiForge — Deep Dive

> Production-grade API release automation CLI, written in Rust.
> From merged code to healthy pods in production — one command.

---

## 1. The problem

Releasing an API service usually means a fragile chain of manual steps:
bump the version file, write a changelog, commit, tag, push, build a Docker
image, authenticate against a registry, push image tags, patch the Kubernetes
deployment, watch the rollout, create a GitHub release, check the health
endpoint, tell the team. Every step can fail, every failure leaves the system
in a half-released state, and the recovery procedure lives in someone's head.

ApiForge turns that chain into a single, transactional command:

```bash
apiforge release patch
```

If anything fails midway, everything that already ran is **rolled back
automatically, in reverse order**.

---

## 2. Architecture

```
CLI (clap)
   │
   ▼
Commands (main.rs)  ──── init / doctor / release / rollback / history / status / config validate
   │
   ▼
ReleaseOrchestrator ──── validate all → execute in order → reverse-order rollback on failure
   │
   ▼
Steps (Step trait)  ──── git-preflight · version-bump · changelog · git-commit · git-tag
   │                     git-push · docker-build · docker-push · k8s-update · k8s-rollout
   │                     github-release · health-check · slack/webhook notify
   ▼
Integrations        ──── git2 (libgit2) · bollard (Docker) · kube · AWS SDK (ECR/STS) · octocrab (GitHub)
   │
   ▼
External systems    ──── Git remote · Docker daemon · container registry · K8s cluster · GitHub · Slack
```

Cross-cutting modules:

| Module | Role |
|---|---|
| `config.rs` | `apiforge.toml` model, validation, `${VAR}` env resolution |
| `error.rs` | Typed error hierarchy (`thiserror`) per domain |
| `audit/` | Embedded sled DB: release history with per-step records |
| `output/` | Terminal rendering: spinners on TTY, plain lines in CI, stderr routing for JSON mode |
| `utils/` | semver bumping, Tera templating, retry with exponential backoff + jitter, secret redaction, rollback target selection |

---

## 3. The Step contract

Every pipeline action implements one trait:

```rust
#[async_trait]
pub trait Step: Send + Sync {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    async fn validate(&self, ctx: &StepContext) -> Result<()>;   // preflight
    async fn execute(&self, ctx: &StepContext) -> Result<StepOutput>;
    async fn dry_run(&self, ctx: &StepContext) -> Result<StepOutput>;
    async fn rollback(&self, _ctx: &StepContext) -> Result<()> { Ok(()) }
}
```

Design points:

- **Validation is separated from execution.** The orchestrator validates
  *every* step before executing *any* step — a missing Dockerfile or invalid
  GitHub token aborts the release before the first git mutation.
- **Dry-run is a first-class path**, not a flag check inside execute. Steps
  return structured previews (file diffs, resolved image tags, estimated
  layers, changelog content) with zero side effects.
- **Rollback state is owned by the step.** The version-bump step snapshots
  the original file bytes; the GitHub step remembers the created release ID;
  steps use interior mutability (`RwLock`, atomics) since the orchestrator
  holds them behind shared references.
- **Progress streaming.** Long-running steps (Docker build/push, rollout
  wait, health polling) push live messages through a `ProgressReporter` in
  the context, rendered as terminal spinners.

---

## 4. The orchestrator and the RunReport

```rust
let report: RunReport = orchestrator.run().await;
// report.steps: Vec<(name, StepOutput)>  — includes the failed step
// report.error: Option<ApiForgeError>
// report.rolled_back: bool
```

The orchestrator never hides partial progress. On failure it:

1. records the failing step's sanitized error,
2. rolls back completed steps in reverse order (a failed rollback of one step
   is logged and does not stop the others),
3. returns the full report.

The CLI then writes an audit record with real status
(`success` / `failed` / `rolled_back`), per-step durations, and the error —
and sends notifications for both outcomes.

---

## 5. Rollback semantics

| Step | Undo |
|---|---|
| version-bump | Restore the original file bytes (preserves unrelated local edits) |
| changelog | `git checkout CHANGELOG.md` |
| git-commit | Soft reset to parent (changes stay staged) |
| git-tag | Delete tag |
| git-push | Delete remote + local tag; **commit history is never force-rewritten** |
| k8s-update | Revert deployment to previous ReplicaSet revision (like `kubectl rollout undo`) |
| k8s-rollout | No-op — the update step owns the revert (prevents double-rollback to N−2) |
| github-release | Delete the created release |

Why no force-push on rollback: once a commit reaches a shared remote, peers
and CI may have consumed it. The version-bump commit is harmless on its own —
without a tag it is not a release, no image is published for it, and the next
successful release supersedes it.

---

## 6. Smart rollback (`apiforge rollback`)

Without `--to`, the target is auto-detected:

1. Read the **currently deployed version** from the Kubernetes deployment's
   image tag.
2. Candidates: successful, non-dry-run releases from the **audit history**
   (newest first); fall back to **semver git tags** when no history exists.
3. Pick the newest candidate **strictly older** than the deployed version.
4. Confirm (unless `--yes`), patch the deployment, wait for rollout, then
   **re-run the configured health check** — a rollback that leaves the
   service unhealthy exits non-zero.

`--dry-run` previews the decision without cluster access.

---

## 7. Reliability engineering

- **Timeouts** wrap every network-prone git operation (libgit2 calls run in
  blocking threads under `tokio::time::timeout`); fetch/push/operation
  budgets are configurable per project.
- **Retries with exponential backoff + jitter** wrap AWS, Kubernetes, and
  GitHub calls; error classifiers distinguish transient failures (throttling,
  5xx, timeouts) from permanent ones (auth, not-found) so the latter fail
  immediately.
- **Secret redaction** scrubs AWS account IDs/access keys/ARNs, GitHub
  tokens, URL tokens, and basic-auth credentials from every error path —
  including audit metadata and notification text.
- **Bounded audit storage**: sled DB capped at 10k records with pruning and
  compaction; writes are retried on transient I/O errors and flushed
  explicitly.
- **Env-var resolution at load time**: `${GITHUB_TOKEN}` style references in
  secret fields resolve when the config is read; a missing variable fails
  fast with its name.

---

## 8. UX details

- **Live spinners** (interactive terminals only) show the running step with
  streaming detail: Docker build output lines, `docker-push: tag 2/3`,
  `k8s-rollout: 4/5 replicas ready`, `health-check: attempt 3 (12s of 60s)`.
- **CI-safe degradation**: pipes and CI logs get the exact plain line format;
  no ANSI spinner frames ever leak.
- **Machine output**: `--output json` keeps stdout pure JSON (plan and
  progress go to stderr) — pipe straight into `jq`.
- **Guard rails**: dirty tree, wrong branch, ahead/behind remote, and
  duplicate tags are all blocked in preflight with precise messages.
- `apiforge doctor` checks tooling; `apiforge config validate --verbose`
  runs layered config diagnostics (file → TOML → schema → field semantics).

---

## 9. Language support

The version step reads/writes the native version file per ecosystem:

| Language | File | Mechanism |
|---|---|---|
| Rust | `Cargo.toml` | `toml_edit` (format-preserving) |
| Node | `package.json` | serde JSON (pretty, trailing newline) |
| Python | `pyproject.toml` | Poetry **and** PEP 621 layouts |
| Go | `version.go` / `go.mod` | `Version = "x"` var or comment fallback |
| Java | `pom.xml` | project-level `<version>` (skips `${properties}`) |

---

## 10. Quality gates

- 74 unit/integration tests (orchestrator rollback ordering, audit retention,
  retry edge cases, config env resolution, changelog ordering, rollback
  target selection, version file round-trips)
- End-to-end CLI matrix: 52 scenario tests across every command and flag,
  including failure paths, non-TTY hang checks, and JSON purity
- `clippy -D warnings`, rustfmt, `cargo audit` (documented transitive
  ignores), criterion benchmarks
- CI: test matrix on Linux/macOS/Windows × stable + MSRV 1.91.1; release
  workflow builds 5 targets and publishes to crates.io + GitHub Releases

---

## 11. Extending ApiForge

Adding a step:

1. Create a module under `src/steps/`.
2. Implement `Step` (name, description, validate, execute, dry_run, and
   rollback if the step mutates anything).
3. Register it in `src/steps/mod.rs` and wire it into the pipeline in
   `main.rs`.

The trait keeps concerns local: a step never needs to know about ordering,
rollback sequencing, output rendering, or audit recording.