solti_prometheus/lib.rs
//! # solti-prometheus
//!
//! Prometheus metrics for the Solti task-orchestration SDK.
//!
//! Collectors and helpers share one [`prometheus::Registry`]. A single
//! `/metrics` endpoint covers runner internals, supervisor events, API
//! requests, discovery heartbeat, process stats, and build identity.
//!
//! ## Collectors and helpers
//!
//! | Component                     | Prefix                        | Feature    | Implements / emits                                        |
//! |-------------------------------|-------------------------------|------------|-----------------------------------------------------------|
//! | [`PrometheusMetrics`]         | `solti_runner_*`              | —          | [`solti_runner::MetricsBackend`]                          |
//! | [`PrometheusSubscriber`]      | `solti_sv_*` + `solti_ctrl_*` | —          | [`taskvisor::Subscribe`]                                  |
//! | [`PrometheusApiMetrics`]      | `solti_api_*`                 | `api`      | `solti_api::ApiMetricsBackend`                            |
//! | [`PrometheusDiscoverMetrics`] | `solti_discover_*`            | `discover` | `solti_discover::DiscoverMetricsBackend`                  |
//! | [`register_process_collector`]| `process_*`                   | `process`  | Prometheus' default process collector (Linux-only effect) |
//! | [`register_build_info`]       | `solti_build_info`            | —          | Gauge `= 1` carrying constant labels                      |
//! | [`server`]                    | —                             | `server`   | Embedded supervised HTTP task exposing `/metrics`         |
//! | [`PrometheusStateCollector`]  | `solti_sv_tasks_by_phase`     | `state`    | Pull-based collector over `solti_core::TaskState`         |
//!
//! ## Architecture
//!
//! ```text
//! ┌─────────────────────────────────────────────────────────────────────────┐
//! │                             Shared Registry                             │
//! │                                                                         │
//! │  PrometheusMetrics         → solti_runner_*                             │
//! │  PrometheusSubscriber      → solti_sv_* , solti_ctrl_*                  │
//! │  PrometheusApiMetrics      → solti_api_*       [feature: api]           │
//! │  PrometheusDiscoverMetrics → solti_discover_*  [feature: discover]      │
//! │  register_process_collector→ process_*         [feature: process]       │
//! │  register_build_info       → solti_build_info                           │
//! └──────────┬──────────────────────────────────────────────┬───────────────┘
//!            │                                              │
//!            ▼                                              ▼
//!      BuildContext           Supervisor            /metrics HTTP
//!      runners call           event bus             exposed by
//!      record_task_*()        fans events           solti_prometheus::server()
//!                             to on_event()         [feature: server]
//! ```
//!
//! ## Runner metrics (`solti_runner_*`)
//!
//! | Metric                               | Type      | Labels              | Description                    |
//! |--------------------------------------|-----------|---------------------|--------------------------------|
//! | `solti_runner_tasks_started_total`   | Counter   | `runner`            | Task spawn events              |
//! | `solti_runner_tasks_completed_total` | Counter   | `runner`, `outcome` | Task completion events         |
//! | `solti_runner_task_duration_seconds` | Histogram | `runner`, `outcome` | Per-attempt execution duration |
//! | `solti_runner_errors_total`          | Counter   | `runner`, `error`   | Runner setup/teardown errors   |
//!
//! Duration histogram buckets (seconds): `0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 300, 1800, 3600`.
//! The buckets are dense through the `10 ms - 10 s` range, with a sparser long tail out to one hour.
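//!
//! As a rough illustration of how an observation lands in these buckets, the
//! sketch below finds the first bucket whose upper bound covers a duration.
//! `DURATION_BUCKETS` and `first_bucket` are hypothetical names used only
//! here; the real bookkeeping is done by the `prometheus` histogram type.
//!
//! ```rust
//! // Upper bounds in seconds, copied from the bucket list above.
//! const DURATION_BUCKETS: [f64; 15] = [
//!     0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0, 3600.0,
//! ];
//!
//! // Index of the first bucket with `seconds <= le`. `None` means the
//! // observation is only covered by the implicit `+Inf` bucket.
//! fn first_bucket(seconds: f64) -> Option<usize> {
//!     DURATION_BUCKETS.iter().position(|&le| seconds <= le)
//! }
//! ```
//!
//! Prometheus buckets are cumulative, so an observation is also counted in
//! every bucket after the one returned here.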
//!
//! ## Supervision metrics (`solti_sv_*`)
//!
//! | Metric                                   | Type      | Labels   | Description                  |
//! |------------------------------------------|-----------|----------|------------------------------|
//! | `solti_sv_tasks_in_flight`               | Gauge     | —        | Currently executing tasks    |
//! | `solti_sv_task_restarts_total`           | Counter   | —        | Restarts (attempt > 1)       |
//! | `solti_sv_task_backoff_count_total`      | Counter   | `source` | Backoff events               |
//! | `solti_sv_task_backoff_duration_seconds` | Histogram | —        | Backoff delay duration       |
//! | `solti_sv_task_terminal_total`           | Counter   | `reason` | Terminal task states         |
//! | `solti_sv_attempts_to_finalize`          | Histogram | `outcome`| Attempts when task left loop |
//! | `solti_sv_task_timeouts_total`           | Counter   | —        | Timeout events               |
//! | `solti_sv_subscriber_overflow_total`     | Counter   | —        | Queue overflow (lost events) |
//! | `solti_sv_subscriber_panicked_total`     | Counter   | —        | Subscriber panics            |
//! | `solti_sv_tasks_by_phase`                | Gauge     | `phase`  | Current tasks per phase (feature `state`, pull-based snapshot) |
//!
//! ## Controller metrics (`solti_ctrl_*`)
//!
//! | Metric                         | Type      | Labels   | Description                            |
//! |--------------------------------|-----------|----------|----------------------------------------|
//! | `solti_ctrl_submissions_total` | Counter   | —        | Controller submissions                 |
//! | `solti_ctrl_rejections_total`  | Counter   | `reason` | Controller rejections grouped by cause |
//!
//! `reason` values (bounded, classified from `Event.reason`):
//! `slot_full`, `slot_busy`, `add_failed`, `remove_failed`, `queue_failed`, `recovery_failed`, `bus_lagged`, `controller_exited`, `other`, `unknown`.
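//!
//! To show why a bounded label set matters, here is a minimal sketch of a
//! classifier that collapses a free-form reason into one of the labels above.
//! The crate's actual rules are not shown here: `classify_reason`, the exact
//! matching, `unknown` for a missing reason, and `other` for an unrecognized
//! one are all assumptions made for illustration.
//!
//! ```rust
//! // Specific labels copied from the list above (`other` and `unknown`
//! // are the fallbacks handled below).
//! const KNOWN_REASONS: [&str; 8] = [
//!     "slot_full", "slot_busy", "add_failed", "remove_failed",
//!     "queue_failed", "recovery_failed", "bus_lagged", "controller_exited",
//! ];
//!
//! // Hypothetical classifier: the result is always one of ten fixed labels,
//! // so label cardinality stays bounded whatever the input text is.
//! fn classify_reason(reason: Option<&str>) -> &'static str {
//!     match reason {
//!         None => "unknown",
//!         Some(raw) => KNOWN_REASONS
//!             .iter()
//!             .copied()
//!             .find(|known| *known == raw)
//!             .unwrap_or("other"),
//!     }
//! }
//! ```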
//!
//! ## API metrics (`solti_api_*`, feature `api`)
//!
//! | Metric                               | Type      | Labels                                    | Description              |
//! |--------------------------------------|-----------|-------------------------------------------|--------------------------|
//! | `solti_api_requests_total`           | Counter   | `transport`, `method`, `path`, `status`   | Completed requests       |
//! | `solti_api_request_duration_seconds` | Histogram | `transport`, `method`, `path`             | Request duration         |
//! | `solti_api_in_flight_requests`       | Gauge     | `transport`                               | In-flight request count  |
//!
//! Request duration buckets (seconds): `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10`.
//!
//! ## Discovery metrics (`solti_discover_*`, feature `discover`)
//!
//! | Metric                                            | Type      | Labels    | Description                       |
//! |---------------------------------------------------|-----------|-----------|-----------------------------------|
//! | `solti_discover_attempts_total`                   | Counter   | —         | Total sync attempts               |
//! | `solti_discover_outcomes_total`                   | Counter   | `outcome` | `success` / `failure`             |
//! | `solti_discover_duration_seconds`                 | Histogram | `outcome` | Sync call duration                |
//! | `solti_discover_failures_total`                   | Counter   | `reason`  | Failures grouped by cause         |
//! | `solti_discover_last_success_timestamp_seconds`   | Gauge     | —         | UNIX time of last success         |
//! | `solti_discover_holds_total`                      | Counter   | —         | Server-advised retry holds        |
//! | `solti_discover_hold_duration_seconds`            | Histogram | —         | Duration of advised holds         |
//!
//! ## Process metrics (`process_*`, feature `process`)
//!
//! On Linux, [`register_process_collector`] adds the standard Prometheus process collector:
//! - `process_cpu_seconds_total`
//! - `process_resident_memory_bytes`
//! - `process_virtual_memory_bytes`
//! - `process_open_fds`
//! - `process_max_fds`
//! - `process_start_time_seconds`
//!
//! On other targets the function is a no-op (it compiles cleanly and registers nothing).
//!
//! ## Build info
//!
//! [`register_build_info`] registers a `solti_build_info` gauge whose value is always `1`.
//! Its labels (set as Prometheus *constant* labels) carry build-time identity.
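//!
//! For orientation, the sketch below renders the kind of sample line a scrape
//! would show for such a gauge. `build_info_line` is a hypothetical helper that
//! only mimics the text exposition format; the real output is produced by the
//! `prometheus` crate's encoder, and this sketch does not escape label values.
//!
//! ```rust
//! // Hypothetical helper: formats `solti_build_info{k="v",...} 1`.
//! fn build_info_line(labels: &[(&str, &str)]) -> String {
//!     let rendered: Vec<String> = labels
//!         .iter()
//!         .map(|(key, value)| format!("{key}=\"{value}\""))
//!         .collect();
//!     // The value is always 1; the labels carry the information.
//!     format!("solti_build_info{{{}}} 1", rendered.join(","))
//! }
//! ```
//!
//! Calling it with `&[("version", "1.2.3")]` yields `solti_build_info{version="1.2.3"} 1`.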
//!
//! ## Event → metric mapping (supervision + controller)
//!
//! ```text
//! TaskStarting        → tasks_in_flight.inc() (+ task_restarts.inc() if attempt > 1)
//! TaskStopped         → tasks_in_flight.dec()
//! TaskFailed          → tasks_in_flight.dec()
//! TimeoutHit          → task_timeouts.inc()
//! BackoffScheduled    → task_backoff_count{source}.inc() + task_backoff_duration.observe(delay)
//! ActorExhausted      → task_terminal{reason="exhausted"}.inc() + attempts_to_finalize{outcome="exhausted"}.observe(attempt)
//! ActorDead           → task_terminal{reason="fatal"}.inc() + attempts_to_finalize{outcome="fatal"}.observe(attempt)
//! SubscriberOverflow  → subscriber_overflow.inc()
//! SubscriberPanicked  → subscriber_panicked.inc()
//! ControllerSubmitted → controller_submissions.inc()
//! ControllerRejected  → controller_rejections{reason}.inc() (reason classified from Event.reason)
//! ```
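//!
//! The first few rows of this mapping can be sketched with plain integers in
//! place of real collectors. `Event` and `Metrics` below are hypothetical
//! stand-ins, not the `taskvisor` or `prometheus` types, and applying the
//! zero guard from the Notes section to `TaskFailed` as well is an assumption.
//!
//! ```rust
//! // Hypothetical stand-ins for a few event kinds from the mapping above.
//! enum Event {
//!     TaskStarting { attempt: u32 },
//!     TaskStopped,
//!     TaskFailed,
//!     TimeoutHit,
//! }
//!
//! // Plain integers standing in for the Prometheus gauge and counters.
//! #[derive(Debug, Default, PartialEq)]
//! struct Metrics {
//!     tasks_in_flight: u64,
//!     task_restarts: u64,
//!     task_timeouts: u64,
//! }
//!
//! impl Metrics {
//!     fn on_event(&mut self, event: &Event) {
//!         match event {
//!             Event::TaskStarting { attempt } => {
//!                 self.tasks_in_flight += 1;
//!                 // Any attempt after the first counts as a restart.
//!                 if *attempt > 1 {
//!                     self.task_restarts += 1;
//!                 }
//!             }
//!             // `saturating_sub` keeps the gauge at zero on an unmatched stop.
//!             Event::TaskStopped | Event::TaskFailed => {
//!                 self.tasks_in_flight = self.tasks_in_flight.saturating_sub(1);
//!             }
//!             Event::TimeoutHit => self.task_timeouts += 1,
//!         }
//!     }
//! }
//! ```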
//!
//! ## Feature flags
//!
//! | Flag       | Default | Effect                                                                                                            |
//! |------------|---------|-------------------------------------------------------------------------------------------------------------------|
//! | `api`      | off     | Enables [`PrometheusApiMetrics`] (depends on `solti-api`)                                                         |
//! | `discover` | off     | Enables [`PrometheusDiscoverMetrics`] (depends on `solti-discover`)                                               |
//! | `process`  | off     | Makes [`register_process_collector`] register actual `process_*` metrics (Linux); propagates `prometheus/process` |
//! | `server`   | off     | Enables [`server`] - a supervised embedded HTTP task serving `/metrics`                                           |
//! | `state`    | off     | Enables [`PrometheusStateCollector`] - pull-based `solti_sv_tasks_by_phase` snapshot (depends on `solti-core`)    |
//!
//! ## Quick wire
//!
//! ```text
//! use std::sync::Arc;
//! use solti_prometheus::{
//!     PrometheusMetrics, PrometheusSubscriber, Registry,
//!     register_build_info, register_process_collector,
//! };
//! use taskvisor::Subscribe;
//!
//! let registry = Arc::new(Registry::new());
//!
//! // Core collectors.
//! let metrics = PrometheusMetrics::new(registry.clone())?;
//! let subscriber = PrometheusSubscriber::new(registry.clone())?;
//!
//! // Standard extras.
//! register_process_collector(&registry)?;
//! register_build_info(&registry, &[
//!     ("version", env!("CARGO_PKG_VERSION")),
//! ])?;
//!
//! // Wire into solti-runner:
//! let ctx = BuildContext::new(RunnerEnv::default(), Arc::new(metrics));
//! let router = RunnerRouter::new().with_context(ctx);
//!
//! // Wire into solti-core supervisor:
//! let subscribers: Vec<Arc<dyn Subscribe>> = vec![Arc::new(subscriber)];
//! ```
//!
//! For a full agent wiring, including the supervised `/metrics` HTTP task, `ApiMetricsBackend` (HTTP + gRPC), and `DiscoverMetricsBackend`,
//! see the reference agents under `examples/agentd-http` and `examples/agentd-grpc`.
//!
//! ## Notes
//!
//! - The `tasks_in_flight` gauge is guarded against going negative: a
//!   [`TaskStopped`](taskvisor::EventKind::TaskStopped) without a preceding
//!   [`TaskStarting`](taskvisor::EventKind::TaskStarting) is a no-op.
//! - Backoff, discover, and API durations are converted from milliseconds to seconds before histogram observation.
//! - [`PrometheusSubscriber`] uses `queue_capacity = 2048` (2× the taskvisor default) to reduce event loss under high throughput.
//! - All collectors must share a single [`prometheus::Registry`] for a unified `/metrics` endpoint.
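//!
//! The millisecond-to-second conversion mentioned above amounts to a single
//! division. `ms_to_seconds` is a hypothetical helper shown only to make the
//! unit handling concrete; the crate's own conversion code may differ.
//!
//! ```rust
//! // Hypothetical helper: a millisecond count as the seconds value that a
//! // `*_seconds` histogram would observe.
//! fn ms_to_seconds(ms: u64) -> f64 {
//!     ms as f64 / 1000.0
//! }
//! ```
//!
//! With this helper a 250 ms delay is observed as `0.25`.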
//!
//! ## See also
//!
//! - [`solti_runner::MetricsBackend`] - trait backing [`PrometheusMetrics`].
//! - [`taskvisor::Subscribe`] - trait backing [`PrometheusSubscriber`].
//! - `solti_api::ApiMetricsBackend` - trait backing [`PrometheusApiMetrics`] (feature `api`).
//! - `solti_discover::DiscoverMetricsBackend` - trait backing [`PrometheusDiscoverMetrics`] (feature `discover`).

mod register;

mod subscriber;
pub use subscriber::{DEFAULT_QUEUE_CAPACITY, PrometheusSubscriber};

mod process;
pub use process::register_process_collector;

mod backend;
pub use backend::PrometheusMetrics;

mod info;
pub use info::register_build_info;

#[cfg(feature = "discover")]
mod discover;
#[cfg(feature = "discover")]
pub use discover::PrometheusDiscoverMetrics;

#[cfg(feature = "api")]
mod api;
#[cfg(feature = "api")]
pub use api::PrometheusApiMetrics;

#[cfg(feature = "server")]
mod server;
#[cfg(feature = "server")]
pub use server::{METRICS_SERVER_SLOT, server};

#[cfg(feature = "state")]
mod state;
#[cfg(feature = "state")]
pub use state::PrometheusStateCollector;

pub use prometheus::Registry;