//! # solti-prometheus
//!
//! Prometheus metrics for the Solti task-orchestration SDK.
//!
//! Collectors and helpers share one [`prometheus::Registry`]. A single
//! `/metrics` endpoint covers runner internals, supervisor events, API
//! requests, discovery heartbeat, process stats, and build identity.
//!
//! ## Collectors and helpers
//!
//! | Component                     | Prefix                        | Feature    | Implements / emits                                        |
//! |-------------------------------|-------------------------------|------------|-----------------------------------------------------------|
//! | [`PrometheusMetrics`]         | `solti_runner_*`              | —          | [`solti_runner::MetricsBackend`]                          |
//! | [`PrometheusSubscriber`]      | `solti_sv_*` + `solti_ctrl_*` | —          | [`taskvisor::Subscribe`]                                  |
//! | [`PrometheusApiMetrics`]      | `solti_api_*`                 | `api`      | `solti_api::ApiMetricsBackend`                            |
//! | [`PrometheusDiscoverMetrics`] | `solti_discover_*`            | `discover` | `solti_discover::DiscoverMetricsBackend`                  |
//! | [`register_process_collector`]| `process_*`                   | `process`  | Prometheus' default process collector (Linux-only effect) |
//! | [`register_build_info`]       | `solti_build_info`            | —          | Gauge `= 1` carrying constant labels                      |
//! | [`server`]                    | —                             | `server`   | Embedded supervised HTTP task exposing `/metrics`         |
//! | [`PrometheusStateCollector`]  | `solti_sv_tasks_by_phase`     | `state`    | Pull-based collector over `solti_core::TaskState`         |
//!
//! ## Architecture
//!
//! ```text
//!   ┌─────────────────────────────────────────────────────────────────────────┐
//!   │                         Shared Registry                                 │
//!   │                                                                         │
//!   │   PrometheusMetrics         → solti_runner_*                            │
//!   │   PrometheusSubscriber      → solti_sv_* , solti_ctrl_*                 │
//!   │   PrometheusApiMetrics      → solti_api_*          [feature: api]       │
//!   │   PrometheusDiscoverMetrics → solti_discover_*     [feature: discover]  │
//!   │   register_process_collector→ process_*            [feature: process]   │
//!   │   register_build_info       → solti_build_info                          │
//!   └──────────┬──────────────────────────────────────────────┬───────────────┘
//!              │                                              │
//!              ▼                                              ▼
//!         BuildContext          Supervisor                /metrics HTTP
//!         runners call          event bus                 exposed by
//!         record_task_*()       fans events               solti_prometheus::server()
//!                               to on_event()             [feature: server]
//! ```
//!
//! ## Runner metrics (`solti_runner_*`)
//!
//! | Metric                               | Type      | Labels              | Description                    |
//! |--------------------------------------|-----------|---------------------|--------------------------------|
//! | `solti_runner_tasks_started_total`   | Counter   | `runner`            | Task spawn events              |
//! | `solti_runner_tasks_completed_total` | Counter   | `runner`, `outcome` | Task completion events         |
//! | `solti_runner_task_duration_seconds` | Histogram | `runner`, `outcome` | Per-attempt execution duration |
//! | `solti_runner_errors_total`          | Counter   | `runner`, `error`   | Runner setup/teardown errors   |
//!
//! Duration histogram buckets (seconds): `0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 300, 1800, 3600`.
//! Dense through the `10 ms - 10 s` range, with a sparser long tail out to one hour.
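//!
//! As a rough illustration of how an observation lands in these buckets, the sketch
//! below picks the smallest upper bound (`le`) that contains a duration. `RUNNER_BUCKETS`
//! and `smallest_bucket` are hypothetical names used only here, not items of this crate.
//! Exposed buckets are cumulative, so the observation also counts toward every larger bound.
//!
//! ```rust
//! /// Upper bounds (seconds) of the runner duration histogram, as listed above.
//! const RUNNER_BUCKETS: [f64; 15] = [
//!     0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0, 3600.0,
//! ];
//!
//! /// Returns the smallest bound `le` with `seconds <= le`, or `None` when the
//! /// value is only covered by the implicit `+Inf` bucket.
//! /// Illustration only: this helper is an assumption, not the crate's code.
//! fn smallest_bucket(seconds: f64) -> Option<f64> {
//!     RUNNER_BUCKETS.iter().copied().find(|&le| seconds <= le)
//! }
//! ```
//!
//! Under these assumptions a `0.3 s` attempt first counts in the `0.5` bucket, and
//! anything longer than one hour is visible only in `+Inf`.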
//!
//! ## Supervision metrics (`solti_sv_*`)
//!
//! | Metric                                   | Type      | Labels   | Description                  |
//! |------------------------------------------|-----------|----------|------------------------------|
//! | `solti_sv_tasks_in_flight`               | Gauge     | —        | Currently executing tasks    |
//! | `solti_sv_task_restarts_total`           | Counter   | —        | Restarts (attempt > 1)       |
//! | `solti_sv_task_backoff_count_total`      | Counter   | `source` | Backoff events               |
//! | `solti_sv_task_backoff_duration_seconds` | Histogram | —        | Backoff delay duration       |
//! | `solti_sv_task_terminal_total`           | Counter   | `reason` | Terminal task states         |
//! | `solti_sv_attempts_to_finalize`          | Histogram | `outcome`| Attempts when task left loop |
//! | `solti_sv_task_timeouts_total`           | Counter   | —        | Timeout events               |
//! | `solti_sv_subscriber_overflow_total`     | Counter   | —        | Queue overflow (lost events) |
//! | `solti_sv_subscriber_panicked_total`     | Counter   | —        | Subscriber panics            |
//! | `solti_sv_tasks_by_phase`                | Gauge     | `phase`  | Current tasks per phase (feature `state`, pull-based snapshot) |
//!
//! ## Controller metrics (`solti_ctrl_*`)
//!
//! | Metric                         | Type      | Labels   | Description                            |
//! |--------------------------------|-----------|----------|----------------------------------------|
//! | `solti_ctrl_submissions_total` | Counter   | —        | Controller submissions                 |
//! | `solti_ctrl_rejections_total`  | Counter   | `reason` | Controller rejections grouped by cause |
//!
//! `reason` values (bounded, classified from `Event.reason`):
//! `slot_full`, `slot_busy`, `add_failed`, `remove_failed`, `queue_failed`, `recovery_failed`, `bus_lagged`, `controller_exited`, `other`, `unknown`.
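//!
//! Keeping `reason` bounded means free-form text has to be folded into the fixed set
//! above before it becomes a label. The sketch below shows one way that could look.
//! The matching rules (prefix match, empty input mapped to `unknown`) and the names
//! `KNOWN_REASONS` and `classify_reason` are assumptions made for illustration;
//! the crate's actual classifier may differ.
//!
//! ```rust
//! /// The bounded label values listed above, minus the two catch-alls.
//! const KNOWN_REASONS: [&str; 8] = [
//!     "slot_full", "slot_busy", "add_failed", "remove_failed",
//!     "queue_failed", "recovery_failed", "bus_lagged", "controller_exited",
//! ];
//!
//! /// Hypothetical classifier: folds an optional free-form reason into a bounded label.
//! fn classify_reason(raw: Option<&str>) -> &'static str {
//!     match raw.map(str::trim) {
//!         // Missing or blank reasons get their own label.
//!         None | Some("") => "unknown",
//!         // Known prefixes pass through; everything else collapses to `other`.
//!         Some(text) => KNOWN_REASONS
//!             .iter()
//!             .copied()
//!             .find(|known| text.starts_with(*known))
//!             .unwrap_or("other"),
//!     }
//! }
//! ```
//!
//! Whatever the exact rules, the point of the design is that label cardinality stays
//! fixed no matter what text the controller reports.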
//!
//! ## API metrics (`solti_api_*`, feature `api`)
//!
//! | Metric                               | Type      | Labels                                    | Description              |
//! |--------------------------------------|-----------|-------------------------------------------|--------------------------|
//! | `solti_api_requests_total`           | Counter   | `transport`, `method`, `path`, `status`   | Completed requests       |
//! | `solti_api_request_duration_seconds` | Histogram | `transport`, `method`, `path`             | Request duration         |
//! | `solti_api_in_flight_requests`       | Gauge     | `transport`                               | In-flight request count  |
//!
//! Request duration buckets (seconds): `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10`.
//!
//! ## Discovery metrics (`solti_discover_*`, feature `discover`)
//!
//! | Metric                                            | Type      | Labels    | Description                       |
//! |---------------------------------------------------|-----------|-----------|-----------------------------------|
//! | `solti_discover_attempts_total`                   | Counter   | —         | Total sync attempts               |
//! | `solti_discover_outcomes_total`                   | Counter   | `outcome` | `success` / `failure`             |
//! | `solti_discover_duration_seconds`                 | Histogram | `outcome` | Sync call duration                |
//! | `solti_discover_failures_total`                   | Counter   | `reason`  | Failures grouped by cause         |
//! | `solti_discover_last_success_timestamp_seconds`   | Gauge     | —         | UNIX time of last success         |
//! | `solti_discover_holds_total`                      | Counter   | —         | Server-advised retry holds        |
//! | `solti_discover_hold_duration_seconds`            | Histogram | —         | Duration of advised holds         |
//!
//! ## Process metrics (`process_*`, feature `process`)
//!
//! On Linux, [`register_process_collector`] adds the standard Prometheus process collector:
//!  - `process_cpu_seconds_total`
//!  - `process_resident_memory_bytes`
//!  - `process_virtual_memory_bytes`
//!  - `process_open_fds`
//!  - `process_max_fds`
//!  - `process_start_time_seconds`
//!
//! On other targets the function is a no-op: it compiles cleanly and registers nothing.
//!
//! ## Build info
//!
//! [`register_build_info`] registers a `solti_build_info` gauge whose value is always `1`.
//! Its labels (set as Prometheus *constant* labels) carry build-time identity.
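//!
//! To make the "value is always `1`, identity lives in the labels" idea concrete, the
//! sketch below renders such a sample in the Prometheus text exposition format.
//! `escape_label_value` and `render_build_info` are hypothetical helpers written for
//! this illustration; the crate relies on the `prometheus` encoder instead, whose
//! label ordering may differ.
//!
//! ```rust
//! /// Escapes a label value for the text exposition format
//! /// (backslash, double quote, and newline).
//! fn escape_label_value(value: &str) -> String {
//!     value
//!         .replace('\\', "\\\\")
//!         .replace('"', "\\\"")
//!         .replace('\n', "\\n")
//! }
//!
//! /// Renders an info-style sample: constant labels, value fixed at 1.
//! /// Illustration only, not the crate's implementation.
//! fn render_build_info(labels: &[(&str, &str)]) -> String {
//!     let rendered: Vec<String> = labels
//!         .iter()
//!         .map(|(name, value)| format!("{name}=\"{}\"", escape_label_value(value)))
//!         .collect();
//!     format!("solti_build_info{{{}}} 1", rendered.join(","))
//! }
//! ```
//!
//! Because the value never changes, queries typically join on the labels rather than
//! read the sample value itself.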
//!
//! ## Event → metric mapping (supervision + controller)
//!
//! ```text
//!  TaskStarting        → tasks_in_flight.inc() (+ task_restarts.inc() if attempt > 1)
//!  TaskStopped         → tasks_in_flight.dec()
//!  TaskFailed          → tasks_in_flight.dec()
//!  TimeoutHit          → task_timeouts.inc()
//!  BackoffScheduled    → task_backoff_count{source}.inc() + task_backoff_duration.observe(delay)
//!  ActorExhausted      → task_terminal{reason="exhausted"}.inc() + attempts_to_finalize{outcome="exhausted"}.observe(attempt)
//!  ActorDead           → task_terminal{reason="fatal"}.inc()     + attempts_to_finalize{outcome="fatal"}.observe(attempt)
//!  SubscriberOverflow  → subscriber_overflow.inc()
//!  SubscriberPanicked  → subscriber_panicked.inc()
//!  ControllerSubmitted → controller_submissions.inc()
//!  ControllerRejected  → controller_rejections{reason}.inc()  (reason classified from Event.reason)
//! ```
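//!
//! The first few rows of the mapping can be sketched with plain integers standing in
//! for the Prometheus gauge and counters. The `Event` enum and `SupervisionCounters`
//! struct below are simplified stand-ins invented for this example; they are not the
//! `taskvisor` event type or the fields of [`PrometheusSubscriber`].
//!
//! ```rust
//! /// Simplified stand-ins for four of the event kinds listed above.
//! enum Event {
//!     TaskStarting { attempt: u32 },
//!     TaskStopped,
//!     TaskFailed,
//!     TimeoutHit,
//! }
//!
//! /// Plain integers in place of the real gauge and counters.
//! #[derive(Default)]
//! struct SupervisionCounters {
//!     tasks_in_flight: i64,
//!     task_restarts: u64,
//!     task_timeouts: u64,
//! }
//!
//! impl SupervisionCounters {
//!     /// Applies one event following the mapping table.
//!     fn on_event(&mut self, event: &Event) {
//!         match event {
//!             Event::TaskStarting { attempt } => {
//!                 self.tasks_in_flight += 1;
//!                 // Any attempt after the first counts as a restart.
//!                 if *attempt > 1 {
//!                     self.task_restarts += 1;
//!                 }
//!             }
//!             // Clamp at zero so an unmatched stop cannot drive the gauge negative.
//!             Event::TaskStopped | Event::TaskFailed => {
//!                 self.tasks_in_flight = (self.tasks_in_flight - 1).max(0);
//!             }
//!             Event::TimeoutHit => self.task_timeouts += 1,
//!         }
//!     }
//! }
//! ```
//!
//! Feeding `TaskStarting { attempt: 2 }` into this sketch bumps both the in-flight
//! value and the restart count, mirroring the first row of the table.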
//!
//! ## Feature flags
//!
//! | Flag       | Default | Effect                                                                                                            |
//! |------------|---------|-------------------------------------------------------------------------------------------------------------------|
//! | `api`      | off     | Enables [`PrometheusApiMetrics`] (depends on `solti-api`)                                                         |
//! | `discover` | off     | Enables [`PrometheusDiscoverMetrics`] (depends on `solti-discover`)                                               |
//! | `process`  | off     | Makes [`register_process_collector`] register actual `process_*` metrics (Linux); propagates `prometheus/process` |
//! | `server`   | off     | Enables [`server`], a supervised embedded HTTP task serving `/metrics`                                            |
//! | `state`    | off     | Enables [`PrometheusStateCollector`], a pull-based `solti_sv_tasks_by_phase` snapshot (depends on `solti-core`)   |
//!
//! ## Quick wire
//!
//! ```text
//! use std::sync::Arc;
//! use solti_prometheus::{
//!     PrometheusMetrics, PrometheusSubscriber, Registry,
//!     register_build_info, register_process_collector,
//! };
//! use taskvisor::Subscribe;
//! // `BuildContext`, `RunnerEnv`, and `RunnerRouter` are assumed to be in scope (see solti-runner).
//!
//! let registry = Arc::new(Registry::new());
//!
//! // Core collectors.
//! let metrics = PrometheusMetrics::new(registry.clone())?;
//! let subscriber = PrometheusSubscriber::new(registry.clone())?;
//!
//! // Standard extras.
//! register_process_collector(&registry)?;
//! register_build_info(&registry, &[
//!     ("version", env!("CARGO_PKG_VERSION")),
//! ])?;
//!
//! // Wire into solti-runner:
//! let ctx = BuildContext::new(RunnerEnv::default(), Arc::new(metrics));
//! let router = RunnerRouter::new().with_context(ctx);
//!
//! // Wire into solti-core supervisor:
//! let subscribers: Vec<Arc<dyn Subscribe>> = vec![Arc::new(subscriber)];
//! ```
//!
//! For a full agent wiring (including the supervised `/metrics` HTTP task, `ApiMetricsBackend` for HTTP + gRPC, and `DiscoverMetricsBackend`),
//! see the reference agents under `examples/agentd-http` and `examples/agentd-grpc`.
//!
//! ## Notes
//!
//! - The `tasks_in_flight` gauge is guarded against going negative: a
//!   [`TaskStopped`](taskvisor::EventKind::TaskStopped) without a preceding
//!   [`TaskStarting`](taskvisor::EventKind::TaskStarting) is a no-op.
//! - Backoff / discover / API durations are converted from ms → seconds before histogram observation.
//! - [`PrometheusSubscriber`] uses `queue_capacity = 2048` (2× the taskvisor default) to reduce event loss under high throughput.
//! - All collectors must share a single [`prometheus::Registry`] for a unified `/metrics` endpoint.
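//!
//! The millisecond-to-second conversion noted above amounts to a division by 1000.
//! A minimal sketch, assuming durations arrive as whole milliseconds; `millis_to_seconds`
//! is a hypothetical helper used for illustration, not an item exported by this crate:
//!
//! ```rust
//! /// Converts a millisecond count into fractional seconds, the base unit
//! /// used by the `*_seconds` histograms.
//! fn millis_to_seconds(millis: u64) -> f64 {
//!     millis as f64 / 1_000.0
//! }
//! ```
//!
//! A `250 ms` backoff delay would therefore be observed as `0.25`, which falls in the
//! dense part of the bucket layouts listed earlier.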
//!
//! ## See also
//!
//! - [`solti_runner::MetricsBackend`] - trait backing [`PrometheusMetrics`].
//! - [`taskvisor::Subscribe`] - trait backing [`PrometheusSubscriber`].
//! - `solti_api::ApiMetricsBackend` - trait backing [`PrometheusApiMetrics`] (feature `api`).
//! - `solti_discover::DiscoverMetricsBackend` - trait backing [`PrometheusDiscoverMetrics`] (feature `discover`).

mod register;

mod subscriber;
pub use subscriber::{DEFAULT_QUEUE_CAPACITY, PrometheusSubscriber};

mod process;
pub use process::register_process_collector;

mod backend;
pub use backend::PrometheusMetrics;

mod info;
pub use info::register_build_info;

#[cfg(feature = "discover")]
mod discover;
#[cfg(feature = "discover")]
pub use discover::PrometheusDiscoverMetrics;

#[cfg(feature = "api")]
mod api;
#[cfg(feature = "api")]
pub use api::PrometheusApiMetrics;

#[cfg(feature = "server")]
mod server;
#[cfg(feature = "server")]
pub use server::{METRICS_SERVER_SLOT, server};

#[cfg(feature = "state")]
mod state;
#[cfg(feature = "state")]
pub use state::PrometheusStateCollector;

pub use prometheus::Registry;