# Operations Runbook
Operational notes for running the system in production-shaped environments. Covers boot order, expected failure modes, recovery, and how to read metrics during an incident.
---
## Boot order
The components form a strict dependency chain:
```
postgres → migrations (one-shot, exits 0) → {api, worker}
```
`docker-compose.yml` enforces this with `depends_on: { condition: service_healthy }` for Postgres and `service_completed_successfully` for the migrations container. The api and worker will refuse to start until migrations have applied cleanly.
The migrations container runs `job-queue-migrate`, which is idempotent: re-running it against an already-migrated database is a no-op.
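To confirm from `psql` that migrations actually applied (for example, when the api or worker refuses to start), you can inspect the migration tracking table. The sketch below assumes the migration tool is sqlx-based and uses sqlx's default `_sqlx_migrations` table; if `job-queue-migrate` tracks versions elsewhere, substitute that table.
```sql
-- Assumption: sqlx's default tracking table. Each applied migration
-- should appear with success = true.
SELECT version, description, success, installed_on
FROM _sqlx_migrations
ORDER BY version DESC
LIMIT 5;
```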
---
## Worker crash mid-job
The worker can exit at any time (SIGKILL, OOM, hardware failure). When this happens, the row it was processing is left with `status = 'running'` and `locked_at` pointing at the crash time.
### Recovery behavior
On the next worker startup, `recover_stale_at_startup` runs once and re-queues any row whose `status = 'running' AND locked_at < now() - interval '5 minutes'`. Re-queued rows have `status = 'retrying'`, `run_at = now()`, and `locked_at = NULL`. They will be picked up by the next dequeue cycle.
The 5-minute threshold is deliberate: it avoids reprocessing live work where a worker is still actively executing a long job. The cost is that a *truly* crashed worker's job is blocked for up to 5 minutes before recovery picks it up.
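During an incident you can see exactly which rows the recovery sweep would pick up, and how stale they are, with a read-only query (column names are the ones used elsewhere in this runbook):
```sql
-- Rows that recover_stale_at_startup would re-queue right now.
SELECT id, attempts, locked_by, locked_at,
       now() - locked_at AS stale_for
FROM jobs
WHERE status = 'running'
  AND locked_at < now() - interval '5 minutes'
ORDER BY locked_at;
```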
### Manual intervention
If you need to recover a stuck row faster than the 5-minute threshold (e.g., during incident response), you can manually re-queue it:
```sql
UPDATE jobs
SET status = 'retrying',
    run_at = now(),
    locked_at = NULL,
    locked_by = NULL,
    last_error = COALESCE(last_error, 'manually requeued during incident'),
    updated_at = now()
WHERE id = '<job-uuid>'
  AND status = 'running';
```
Always check `attempts < max_attempts` first; if the job has already exhausted its attempts, manually re-queuing will cause `failed_permanent` after the next try anyway.
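A quick way to do that check, and to capture the row's state before touching it:
```sql
-- Confirm the row still has attempts left before re-queuing it.
SELECT id, status, attempts, max_attempts, last_error, locked_by, locked_at
FROM jobs
WHERE id = '<job-uuid>';
```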
---
## Postgres becomes unreachable
### Worker behavior
`queue::fetch_next` returns `sqlx::Error`, the worker logs `fetch_next error; backing off`, sleeps up to the max poll interval (2 s by default), and retries. No data loss: in-flight jobs that completed their `mark_*` call before the disconnection are already durable; in-flight jobs that did not complete will be observed as stale-locked at the next successful connection (recovery sweep at next startup).
### API behavior
Handlers that try to read or write to Postgres return 500 Internal Server Error with the error body `{"error": "internal", "message": "internal server error"}`. The real error detail is logged at `error!` level — search the API container's logs by the request ID surfaced in the response's `x-request-id` header.
### Recovery
When Postgres comes back, the worker reconnects on the next poll cycle (~500 ms - 2 s) without operator intervention. The API resumes serving on the next request after pool reconnect.
---
## Manually re-queueing a job
To force a job back into the queue (e.g., after an external dependency that the job calls has been fixed):
```sql
UPDATE jobs
SET status = 'retrying',
    run_at = now(),
    updated_at = now()
WHERE id = '<job-uuid>'
  AND status IN ('failed_permanent', 'cancelled');
```
Be deliberate. `failed_permanent` means the system gave up after `max_attempts`; re-queuing without first understanding *why* it failed will likely produce another `failed_permanent`.
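When more than one job has gone to `failed_permanent`, it is worth surveying what actually went wrong before re-queuing anything. A grouping query along these lines (the `kind` column is assumed here because `kind` appears as a metric label; the other columns are used elsewhere in this runbook) shows which kinds and errors dominate:
```sql
-- Group permanently failed jobs by error message to spot a common cause.
SELECT kind, last_error, COUNT(*) AS jobs, MAX(updated_at) AS most_recent
FROM jobs
WHERE status = 'failed_permanent'
GROUP BY kind, last_error
ORDER BY jobs DESC;
```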
---
## Reading metrics
Two Prometheus endpoints. Scrape both.
| Component | Endpoint | Port |
|---|---|---|
| API | `GET /metrics` (same port as HTTP) | 8080 |
| Worker | `GET /metrics` (separate listener) | 9091 |
### Worker metrics
- `worker_jobs_started_total{kind}` — counter, total claims by the dequeue path.
- `worker_jobs_completed_total{kind, outcome}` — counter, total `mark_*` operations. `outcome` is one of `succeeded`, `retrying`, `failed_permanent`, `cancelled`, or `error`; the last means the `mark_*` SQL itself failed and was logged as a warning.
- `worker_job_duration_seconds{kind, outcome}` — histogram, time from claim to terminal disposition.
### Useful queries
In-flight jobs at scrape time:
```promql
sum(worker_jobs_started_total) - sum(worker_jobs_completed_total)
```
Failure rate (per kind, last 5 min):
```promql
sum by (kind) (rate(worker_jobs_completed_total{outcome="failed_permanent"}[5m]))
/
sum by (kind) (rate(worker_jobs_completed_total[5m]))
```
Per-kind p99 latency (last 5 min):
```promql
histogram_quantile(0.99, sum by (kind, le) (rate(worker_job_duration_seconds_bucket[5m])))
```
Queue depth (this requires a separate query against Postgres; not exposed as a metric in v0.1):
```sql
SELECT status, COUNT(*) FROM jobs GROUP BY status;
```
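A slightly richer variant that also shows how long the oldest job in each status has been sitting there, which helps spot a stuck backlog:
```sql
-- Per-status depth plus the age of the oldest job in that status.
SELECT status,
       COUNT(*) AS jobs,
       MIN(run_at) AS oldest_run_at,
       now() - MIN(run_at) AS oldest_age
FROM jobs
GROUP BY status
ORDER BY status;
```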
---
## Graceful shutdown semantics
On SIGINT or SIGTERM:
- The api binary stops accepting new connections, drains in-flight requests, and exits when all requests complete.
- The worker binary flips its `CancellationToken`. Worker loops exit between jobs (not mid-job). After a 30-second grace period any still-running tasks are aborted; their rows stay in `status = 'running'` and will be recovered on the next startup (see "Worker crash mid-job" above).
The 30-second grace period is a code-level default rather than an environment variable. To increase it for a heavy-job workload, call `WorkerRuntime::with_shutdown_grace(...)` in the worker bin before `runtime.run(cancel)`.
---
## Log patterns to alert on
Run a log aggregator search / alert against these patterns:
- `failed to mark succeeded` / `failed to mark failed_or_retry` / `failed to finalize cancelled` — the `mark_*` SQL after a successful job execution failed; the row may be in an unexpected state.
- `fetch_next error; backing off` — repeated occurrences indicate Postgres connectivity issues.
- `shutdown grace period expired; aborting remaining workers` — a job took longer than the 30 s grace; investigate that job kind.
- `recovered N stale running jobs from prior shutdown` — non-zero at startup means the previous shutdown was not clean.
---
## Configuration reference
All knobs are environment variables. See [.env.example](../.env.example) for the full list with defaults.
| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | _(required)_ | Postgres connection string |
| `RUST_LOG` | `info,sqlx=warn` | tracing-subscriber filter |
| `RUST_LOG_FORMAT` | _(unset)_ | Set to `json` for structured logs |
| `API_BIND_ADDR` | `0.0.0.0:8080` | API server bind address |
| `WORKER_CONCURRENCY` | `4` | Worker tasks per worker process |
| `WORKER_METRICS_BIND_ADDR` | `0.0.0.0:9091` | Worker Prometheus listener |