# Operations Runbook
Operational notes for running the system in production-shaped environments. Covers boot order, expected failure modes, recovery, and how to read metrics during an incident.
---
## Boot order
The components form a strict dependency chain:
```
postgres → migrations (one-shot, exits 0) → {api, worker}
```
`docker-compose.yml` enforces this with `depends_on: { condition: service_healthy }` for Postgres and `service_completed_successfully` for the migrations container. The api and worker will refuse to start until migrations have applied cleanly.
The migrations container runs `job-queue-migrate`, which is idempotent: re-running it against an already-migrated database is a no-op.
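To confirm from `psql` that migrations actually applied (for example, when the api or worker refuses to start), you can inspect the migration tracking table. The sketch below assumes the migration tool is sqlx-based and uses sqlx's default `_sqlx_migrations` table; if `job-queue-migrate` tracks versions elsewhere, substitute that table.
```sql
-- Assumption: sqlx's default tracking table. Each applied migration
-- should appear with success = true.
SELECT version, description, success, installed_on
FROM _sqlx_migrations
ORDER BY version DESC
LIMIT 5;
```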
---
## Worker crash mid-job
The worker can exit at any time (SIGKILL, OOM, hardware failure). When this happens, the row it was processing is left with `status = 'running'` and `locked_at` pointing at the crash time.
### Recovery behavior
On the next worker startup, `recover_stale_at_startup` runs once and re-queues any row whose `status = 'running' AND locked_at < now() - interval '5 minutes'`. Re-queued rows have `status = 'retrying'`, `run_at = now()`, and `locked_at = NULL`. They will be picked up by the next dequeue cycle.
The 5-minute threshold is deliberate: it avoids reprocessing live work where a worker is still actively executing a long job. The cost is that a *truly* crashed worker's job is blocked for up to 5 minutes before recovery picks it up.
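During an incident you can see exactly which rows the recovery sweep would pick up, and how stale they are, with a read-only query (column names are the ones used elsewhere in this runbook):
```sql
-- Rows that recover_stale_at_startup would re-queue right now.
SELECT id, attempts, locked_by, locked_at,
       now() - locked_at AS stale_for
FROM jobs
WHERE status = 'running'
  AND locked_at < now() - interval '5 minutes'
ORDER BY locked_at;
```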
### Manual intervention
If you need to recover a stuck row faster than the 5-minute threshold (e.g., during incident response), you can manually re-queue it:
```sql
UPDATE jobs
SET status = 'retrying',
    run_at = now(),
    locked_at = NULL,
    locked_by = NULL,
    last_error = COALESCE(last_error, 'manually requeued during incident'),
    updated_at = now()
WHERE id = '<job-uuid>'
  AND status = 'running';
```
Always check `attempts < max_attempts` first; if the job has already exhausted its attempts, manually re-queuing will cause `failed_permanent` after the next try anyway.
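A quick way to do that check, and to capture the row's state before touching it:
```sql
-- Confirm the row still has attempts left before re-queuing it.
SELECT id, status, attempts, max_attempts, last_error, locked_by, locked_at
FROM jobs
WHERE id = '<job-uuid>';
```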
---
## Postgres becomes unreachable
### Worker behavior
`queue::fetch_next` returns `sqlx::Error`, the worker logs `fetch_next error; backing off`, sleeps up to the max poll interval (2 s by default), and retries. No data loss: in-flight jobs that completed their `mark_*` call before the disconnection are already durable; in-flight jobs that did not complete will be observed as stale-locked at the next successful connection (recovery sweep at next startup).
### API behavior
Handlers that try to read or write to Postgres return 500 Internal Server Error with the error body `{"error": "internal", "message": "internal server error"}`. The real error detail is logged at `error!` level — search the API container's logs by the request ID surfaced in the response's `x-request-id` header.
### Recovery
When Postgres comes back, the worker reconnects on the next poll cycle (~500 ms - 2 s) without operator intervention. The API resumes serving on the next request after pool reconnect.
---
## Manually re-queueing a job
To force a job back into the queue (e.g., after an external dependency that the job calls has been fixed):
```sql
UPDATE jobs
SET status = 'retrying',
    run_at = now(),
    updated_at = now()
WHERE id = '<job-uuid>'
  AND status IN ('failed_permanent', 'cancelled');
```
Be deliberate. `failed_permanent` means the system gave up after `max_attempts`; re-queuing without first understanding *why* it failed will likely produce another `failed_permanent`.
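When more than one job has gone to `failed_permanent`, it is worth surveying what actually went wrong before re-queuing anything. A grouping query along these lines (the `kind` column is assumed here because `kind` appears as a metric label; the other columns are used elsewhere in this runbook) shows which kinds and errors dominate:
```sql
-- Group permanently failed jobs by error message to spot a common cause.
SELECT kind, last_error, COUNT(*) AS jobs, MAX(updated_at) AS most_recent
FROM jobs
WHERE status = 'failed_permanent'
GROUP BY kind, last_error
ORDER BY jobs DESC;
```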
---
## Reading metrics
Two Prometheus endpoints. Scrape both.
| Component | Endpoint | Port |
|---|---|---|
| API | `GET /metrics` (same port as HTTP) | 8080 |
| Worker | `GET /metrics` (separate listener) | 9091 |
### Worker metrics
- `worker_jobs_started_total{kind}` — counter, total claims by the dequeue path.
- `worker_jobs_completed_total{kind, outcome}` — counter, total `mark_*` operations. `outcome` is one of `succeeded`, `retrying`, `failed_permanent`, `cancelled`, or `error`; the last means the `mark_*` SQL itself failed and was logged as a warning.
- `worker_job_duration_seconds{kind, outcome}` — histogram, time from claim to terminal disposition.
### Useful queries
In-flight jobs at scrape time:
```promql
sum(worker_jobs_started_total) - sum(worker_jobs_completed_total)
```
Failure rate (per kind, last 5 min):
```promql
sum by (kind) (rate(worker_jobs_completed_total{outcome="failed_permanent"}[5m]))
/
sum by (kind) (rate(worker_jobs_completed_total[5m]))
```
Per-kind p99 latency (last 5 min):
```promql
histogram_quantile(0.99, sum by (kind, le) (rate(worker_job_duration_seconds_bucket[5m])))
```
Queue depth (this requires a separate query against Postgres; not exposed as a metric in v0.1):
```sql
SELECT status, COUNT(*) FROM jobs GROUP BY status;
```
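A slightly richer variant that also shows how long the oldest job in each status has been sitting there, which helps spot a stuck backlog:
```sql
-- Per-status depth plus the age of the oldest job in that status.
SELECT status,
       COUNT(*) AS jobs,
       MIN(run_at) AS oldest_run_at,
       now() - MIN(run_at) AS oldest_age
FROM jobs
GROUP BY status
ORDER BY status;
```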
---
## Graceful shutdown semantics
On SIGINT or SIGTERM:
- The api binary stops accepting new connections, drains in-flight requests, and exits when all requests complete.
- The worker binary flips its `CancellationToken`. Worker loops exit between jobs (not mid-job). After a 30-second grace period any still-running tasks are aborted; their rows stay in `status = 'running'` and will be recovered on the next startup (see "Worker crash mid-job" above).
The 30-second grace period is a code-level default rather than an environment variable. To increase it for a heavy-job workload, call `WorkerRuntime::with_shutdown_grace(...)` in the worker bin before `runtime.run(cancel)`.
---
## Log patterns to alert on
Run a log aggregator search / alert against these patterns:
- `failed to mark succeeded` / `failed to mark failed_or_retry` / `failed to finalize cancelled` — the `mark_*` SQL after a successful job execution failed; the row may be in an unexpected state.
- `fetch_next error; backing off` — repeated occurrences indicate Postgres connectivity issues.
- `shutdown grace period expired; aborting remaining workers` — a job took longer than the 30 s grace; investigate that job kind.
- `recovered N stale running jobs from prior shutdown` — non-zero at startup means the previous shutdown was not clean.
---
## Configuration reference
All knobs are environment variables. See [.env.example](../.env.example) for the full list with defaults.
| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | _(required)_ | Postgres connection string |
| `RUST_LOG` | `info,sqlx=warn` | tracing-subscriber filter |
| `RUST_LOG_FORMAT` | _(unset)_ | Set to `json` for structured logs |
| `API_BIND_ADDR` | `0.0.0.0:8080` | API server bind address |
| `WORKER_CONCURRENCY` | `4` | Worker tasks per worker process |
| `WORKER_METRICS_BIND_ADDR` | `0.0.0.0:9091` | Worker Prometheus listener |