//! Supervisor metrics HTTP server: `/metrics` and `/health`.
//!
//! Why: the unattended supervisor exposes fleet state so an operator or a
//! higher-level fleet manager can poll it without attaching to any session. A
//! tiny axum server serving a JSON snapshot satisfies the "metrics exposed on a
//! health endpoint" acceptance criterion while keeping the surface minimal.
//! What: holds a shared [`MetricsHandle`] (an `Arc<RwLock<FleetMetrics>>`) that
//! the supervisor loop updates after each sweep; [`router`] builds the axum
//! [`Router`]; [`bind`] reserves the port (fail-fast) and [`serve_on`] serves it
//! on the bound listener. Gated behind the `daemon` feature because axum is only
//! a dependency there (per the workspace axum-feature rule).
//! Test: `metrics_endpoint_returns_snapshot`, `health_endpoint_ok` in `super::tests`.
use std::net::SocketAddr;
use std::sync::Arc;
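// Crate paths below are assumed (axum, serde, tokio, tracing); only the item
// names survive in this listing.
use axum::{extract::State, routing::get, Json, Router};
use serde::Serialize;
use tokio::sync::RwLock;
use tracing::info;

use super::FleetMetrics; // assumed location: the parent module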
/// Shared, lock-guarded fleet-metrics snapshot.
///
/// Why: the loop writes a fresh snapshot after each sweep while HTTP handlers read
/// it concurrently; an `Arc<RwLock<…>>` gives many readers / one writer, so readers
/// on the (frequent) read path never block each other and only the brief write
/// after a sweep is exclusive.
/// What: a type alias for the shared handle passed to both the loop and the router.
/// Test: exercised by `metrics_endpoint_returns_snapshot`.
pub type MetricsHandle = Arc<RwLock<FleetMetrics>>;
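/*
Illustration only, not part of this module: a minimal, std-only sketch of the
shared-snapshot pattern described above. It assumes `std::sync::RwLock` in place
of whatever async lock the module really uses, and `DemoMetrics`, `DemoHandle`,
`publish`, `snapshot`, and `demo_concurrent_reads` are hypothetical names.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical stand-in for the real `FleetMetrics`; fields are illustrative only.
#[derive(Clone, Debug, Default, PartialEq)]
struct DemoMetrics {
    sweeps: u64,
    running: u32,
}

type DemoHandle = Arc<RwLock<DemoMetrics>>;

// Writer side: replace the whole snapshot in one short critical section.
fn publish(handle: &DemoHandle, fresh: DemoMetrics) {
    *handle.write().unwrap() = fresh;
}

// Reader side: clone under the read lock so the guard is released immediately.
fn snapshot(handle: &DemoHandle) -> DemoMetrics {
    handle.read().unwrap().clone()
}

// Publishes one snapshot, then reads it from several threads at once.
fn demo_concurrent_reads() -> Vec<DemoMetrics> {
    let handle: DemoHandle = Arc::new(RwLock::new(DemoMetrics::default()));
    publish(&handle, DemoMetrics { sweeps: 1, running: 3 });
    let readers: Vec<_> = (0..4)
        .map(|_| {
            let h = Arc::clone(&handle);
            thread::spawn(move || snapshot(&h))
        })
        .collect();
    readers.into_iter().map(|r| r.join().unwrap()).collect()
}
```

Cloning the snapshot out keeps the read guard's lifetime short, which matters
when the reader goes on to serialize the value.
*/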
/// Build a fresh, empty metrics handle.
///
/// Why: the supervisor needs one shared snapshot created before either the loop or
/// the server starts so both reference the same cell.
/// What: wraps a default [`FleetMetrics`] in `Arc<RwLock<…>>`.
/// Test: used by the loop wiring and `metrics_endpoint_returns_snapshot`.
// Sketch: the item itself is missing from this listing; the name `new_handle`
// is assumed, and `FleetMetrics: Default` follows from the doc comment above.
pub fn new_handle() -> MetricsHandle {
    Arc::new(RwLock::new(FleetMetrics::default()))
}
/// Health-check response body.
///
/// Why: `/health` returns a stable, machine-checkable shape so liveness probes
/// (launchd KeepAlive checks, monitoring) can assert `status == "ok"`.
/// What: a one-field struct serialized as `{"status":"ok"}`.
/// Test: `health_endpoint_ok`.
// Sketch: the struct is missing from this listing; the name `Health` is assumed.
#[derive(Debug, Serialize)]
struct Health {
    status: &'static str,
}
/// Build the supervisor's metrics router over a shared [`MetricsHandle`].
///
/// Why: separating router construction from `serve` lets tests drive the handlers
/// in-process (via `tower::ServiceExt::oneshot`) without binding a socket.
/// What: returns a [`Router`] with `GET /health` and `GET /metrics`, carrying the
/// handle as axum state.
/// Test: `metrics_endpoint_returns_snapshot`, `health_endpoint_ok`.
// Sketch: the body is missing from this listing; the handler names `health`
// and `metrics` are assumed.
pub fn router(handle: MetricsHandle) -> Router {
    Router::new()
        .route("/health", get(health))
        .route("/metrics", get(metrics))
        .with_state(handle)
}
/// `GET /health` — liveness probe.
///
/// Why: an always-on supervisor needs a trivially-cheap endpoint a process
/// supervisor can hit to confirm it is alive.
/// What: returns `{"status":"ok"}` with a 200.
/// Test: `health_endpoint_ok`.
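// Sketch: name and signature assumed; `Health` is the one-field response struct
// documented above.
async fn health() -> Json<Health> {
    Json(Health { status: "ok" })
}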
/// `GET /metrics` — current fleet snapshot.
///
/// Why: the single endpoint a human / fleet manager polls to see counts by
/// lifecycle state, surfaced pending decisions, last activity, and supervisor run
/// stats.
/// What: clones the current [`FleetMetrics`] out from under the read lock and
/// returns it as JSON.
/// Test: `metrics_endpoint_returns_snapshot`.
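// Sketch: name and signature assumed; requires `FleetMetrics: Clone + Serialize`.
async fn metrics(State(handle): State<MetricsHandle>) -> Json<FleetMetrics> {
    // Clone inside the read lock so the guard is dropped before serialization.
    Json(handle.read().await.clone())
}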
/// Bind a `TcpListener` for the metrics server on `addr`.
///
/// Why: the supervisor must fail *fast* if the metrics port is already in use —
/// otherwise it would run for hours with no `/metrics` and only a buried log line.
/// Binding before the loop starts (and propagating the error) turns a silent
/// degradation into an immediate startup failure the operator sees.
/// What: binds and returns a `TcpListener`, surfacing any bind error to the
/// caller with the offending address as context.
/// Test: covered indirectly by `run_supervisor` (bind-before-loop); the failure
/// path mirrors the daemon's own listener bind.
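// Sketch: the signature is assumed, including the `std::io::Result` error type.
pub async fn bind(addr: SocketAddr) -> std::io::Result<tokio::net::TcpListener> {
    tokio::net::TcpListener::bind(addr).await.map_err(|e| {
        // Keep the original error kind and add the offending address as context.
        std::io::Error::new(
            e.kind(),
            format!("failed to bind metrics server on {addr}: {e}"),
        )
    })
}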
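/*
Illustration only, not part of this module: a minimal, std-only sketch of the
fail-fast bind described above. It assumes the blocking `std::net::TcpListener`
rather than an async listener, and `bind_with_context` and `demo_port_collision`
are hypothetical names.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

// Binds `addr`, attaching the address to any error so the failure explains itself.
fn bind_with_context(addr: SocketAddr) -> io::Result<TcpListener> {
    TcpListener::bind(addr).map_err(|e| {
        io::Error::new(
            e.kind(),
            format!("failed to bind metrics server on {addr}: {e}"),
        )
    })
}

// Holds a port open, then binds it again to exercise the collision path.
fn demo_port_collision() -> (SocketAddr, io::Error) {
    // Port 0 asks the OS for any free port, so the demo never clashes by accident.
    let first = bind_with_context("127.0.0.1:0".parse().unwrap()).unwrap();
    let taken = first.local_addr().unwrap();
    // `first` is still listening here, so the second bind is expected to fail.
    let err = bind_with_context(taken).unwrap_err();
    (taken, err)
}
```

Because the error is returned rather than logged, a caller can propagate it with
`?` before starting any long-running loop.
*/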
/// Serve the metrics router on an already-bound listener until the task is dropped.
///
/// Why: separating bind from serve lets the caller bind first (failing fast on a
/// port collision) and only then spawn the serving task, so a bind error can be
/// propagated synchronously rather than lost inside a detached `tokio::spawn`.
/// What: logs the bound address and serves [`router`] on `listener`; returns an
/// error only if serving subsequently fails.
/// Test: handlers are unit-tested via the router; the serve path mirrors the
/// daemon's own `serve_http`.
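// Sketch: the signature is assumed; `axum::serve` assumes axum 0.7 or later.
pub async fn serve_on(
    listener: tokio::net::TcpListener,
    handle: MetricsHandle,
) -> std::io::Result<()> {
    let addr = listener.local_addr()?;
    info!(%addr, "supervisor metrics server listening");
    axum::serve(listener, router(handle)).await
}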
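/*
Illustration only, not part of this module: a minimal, std-only sketch of the
bind-first, serve-later split described above, using a thread in place of a
spawned async task. The hand-written HTTP response and the names `serve_one` and
`probe_health` are assumptions made for the example.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

// Answers exactly one connection with a fixed health body, then returns.
fn serve_one(listener: TcpListener) -> io::Result<()> {
    let (mut conn, _peer) = listener.accept()?;
    let mut request = [0u8; 1024];
    let _ = conn.read(&mut request)?; // request line and headers are ignored
    let body = r#"{"status":"ok"}"#;
    write!(
        conn,
        "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    )
}

// Binds synchronously, so a port collision surfaces through `?` in the caller,
// and only then hands the listener to a background thread.
fn probe_health() -> io::Result<String> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr: SocketAddr = listener.local_addr()?;
    let server = thread::spawn(move || serve_one(listener));

    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"GET /health HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")?;
    let mut response = String::new();
    client.read_to_string(&mut response)?;
    server.join().expect("server thread panicked")?;
    Ok(response)
}
```

The bind error never enters the spawned thread, which is the property the
bind/serve split is meant to guarantee.
*/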