//! # solti-prometheus
//!
//! Prometheus metrics for the Solti task-orchestration SDK.
//!
//! Collectors and helpers share one [`prometheus::Registry`]. A single
//! `/metrics` endpoint covers runner internals, supervisor events, API
//! requests, discovery heartbeat, process stats, and build identity.
//!
//! ## Collectors and helpers
//!
//! | Component | Prefix | Feature | Implements / emits |
//! |-------------------------------|-------------------------------|------------|-----------------------------------------------------------|
//! | [`PrometheusMetrics`] | `solti_runner_*` | — | [`solti_runner::MetricsBackend`] |
//! | [`PrometheusSubscriber`] | `solti_sv_*` + `solti_ctrl_*` | — | [`taskvisor::Subscribe`] |
//! | [`PrometheusApiMetrics`] | `solti_api_*` | `api` | `solti_api::ApiMetricsBackend` |
//! | [`PrometheusDiscoverMetrics`] | `solti_discover_*` | `discover` | `solti_discover::DiscoverMetricsBackend` |
//! | [`register_process_collector`]| `process_*` | `process` | Prometheus' default process collector (Linux-only effect) |
//! | [`register_build_info`] | `solti_build_info` | — | Gauge `= 1` carrying constant labels |
//! | [`server`] | — | `server` | Embedded supervised HTTP task exposing `/metrics` |
//! | [`PrometheusStateCollector`] | `solti_sv_tasks_by_phase` | `state` | Pull-based collector over `solti_core::TaskState` |
//!
//! ## Architecture
//!
//! ```text
//! ┌─────────────────────────────────────────────────────────────────────────┐
//! │                             Shared Registry                             │
//! │                                                                         │
//! │  PrometheusMetrics          → solti_runner_*                            │
//! │  PrometheusSubscriber       → solti_sv_* , solti_ctrl_*                 │
//! │  PrometheusApiMetrics       → solti_api_*       [feature: api]          │
//! │  PrometheusDiscoverMetrics  → solti_discover_*  [feature: discover]     │
//! │  register_process_collector → process_*         [feature: process]      │
//! │  register_build_info        → solti_build_info                          │
//! └──────────┬──────────────────────────────────────────────┬───────────────┘
//!            │                                              │
//!            ▼                                              ▼
//!      BuildContext         Supervisor              /metrics HTTP
//!      runners call         event bus               exposed by
//!      record_task_*()      fans events             solti_prometheus::server()
//!                           to on_event()           [feature: server]
//! ```
//!
//! ## Runner metrics (`solti_runner_*`)
//!
//! | Metric | Type | Labels | Description |
//! |--------------------------------------|-----------|---------------------|--------------------------------|
//! | `solti_runner_tasks_started_total` | Counter | `runner` | Task spawn events |
//! | `solti_runner_tasks_completed_total` | Counter | `runner`, `outcome` | Task completion events |
//! | `solti_runner_task_duration_seconds` | Histogram | `runner`, `outcome` | Per-attempt execution duration |
//! | `solti_runner_errors_total` | Counter | `runner`, `error` | Runner setup/teardown errors |
//!
//! Duration histogram buckets (seconds): `0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 300, 1800, 3600`.
//! The buckets are dense across the 10 ms to 10 s range, with a sparser long tail out to one hour.
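//!
//! To make the bucket layout concrete, the sketch below finds the smallest bucket an
//! observation lands in. It is a std-only illustration, not the crate's implementation:
//! `DURATION_BUCKETS` and `smallest_bucket` are hypothetical names, and the real
//! histogram is maintained by the `prometheus` crate.
//!
//! ```rust
//! /// Upper bounds in seconds, copied from the list above.
//! const DURATION_BUCKETS: [f64; 15] = [
//!     0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0, 3600.0,
//! ];
//!
//! /// Returns the smallest upper bound `le` with `seconds <= le`, or `None`
//! /// when the observation is only counted in the implicit `+Inf` bucket.
//! fn smallest_bucket(seconds: f64) -> Option<f64> {
//!     DURATION_BUCKETS.iter().copied().find(|&le| seconds <= le)
//! }
//! ```
//!
//! Prometheus histogram buckets are cumulative, so a 0.3 s attempt is counted in the
//! `0.5` bucket and in every larger one; an attempt longer than one hour is counted
//! only in `+Inf`.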
//!
//! ## Supervision metrics (`solti_sv_*`)
//!
//! | Metric | Type | Labels | Description |
//! |------------------------------------------|-----------|----------|------------------------------|
//! | `solti_sv_tasks_in_flight` | Gauge | — | Currently executing tasks |
//! | `solti_sv_task_restarts_total` | Counter | — | Restarts (attempt > 1) |
//! | `solti_sv_task_backoff_count_total` | Counter | `source` | Backoff events |
//! | `solti_sv_task_backoff_duration_seconds` | Histogram | — | Backoff delay duration |
//! | `solti_sv_task_terminal_total` | Counter | `reason` | Terminal task states |
//! | `solti_sv_attempts_to_finalize` | Histogram | `outcome`| Attempts when task left loop |
//! | `solti_sv_task_timeouts_total` | Counter | — | Timeout events |
//! | `solti_sv_subscriber_overflow_total` | Counter | — | Queue overflow (lost events) |
//! | `solti_sv_subscriber_panicked_total` | Counter | — | Subscriber panics |
//! | `solti_sv_tasks_by_phase` | Gauge | `phase` | Current tasks per phase (feature `state`, pull-based snapshot) |
//!
//! ## Controller metrics (`solti_ctrl_*`)
//!
//! | Metric | Type | Labels | Description |
//! |--------------------------------|-----------|----------|----------------------------------------|
//! | `solti_ctrl_submissions_total` | Counter | — | Controller submissions |
//! | `solti_ctrl_rejections_total` | Counter | `reason` | Controller rejections grouped by cause |
//!
//! `reason` values (bounded, classified from `Event.reason`):
//! `slot_full`, `slot_busy`, `add_failed`, `remove_failed`, `queue_failed`, `recovery_failed`, `bus_lagged`, `controller_exited`, `other`, `unknown`.
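//!
//! Keeping `reason` bounded matters because every distinct label value creates a new
//! time series. The sketch below shows one way such a classification could work. The
//! matching rules are assumptions made for illustration (a missing reason maps to
//! `unknown`, a reason starting with a known value keeps that value, anything else maps
//! to `other`); `KNOWN_REASONS` and `classify_reason` are hypothetical names, and the
//! crate's actual rules may differ.
//!
//! ```rust
//! /// Label values listed above, minus the two fallbacks.
//! const KNOWN_REASONS: [&str; 8] = [
//!     "slot_full", "slot_busy", "add_failed", "remove_failed",
//!     "queue_failed", "recovery_failed", "bus_lagged", "controller_exited",
//! ];
//!
//! /// Maps a free-form reason onto the bounded label set.
//! /// Assumed rules: `None` -> "unknown", known prefix -> that value, else "other".
//! fn classify_reason(reason: Option<&str>) -> &'static str {
//!     let Some(raw) = reason else { return "unknown" };
//!     let raw = raw.trim();
//!     KNOWN_REASONS
//!         .iter()
//!         .copied()
//!         .find(|known| raw.starts_with(*known))
//!         .unwrap_or("other")
//! }
//! ```
//!
//! Under these assumptions, `Some("slot_busy: worker-3")` is recorded as `slot_busy`,
//! while an unrecognised message falls into `other` instead of minting a new series.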
//!
//! ## API metrics (`solti_api_*`, feature `api`)
//!
//! | Metric | Type | Labels | Description |
//! |--------------------------------------|-----------|-------------------------------------------|--------------------------|
//! | `solti_api_requests_total` | Counter | `transport`, `method`, `path`, `status` | Completed requests |
//! | `solti_api_request_duration_seconds` | Histogram | `transport`, `method`, `path` | Request duration |
//! | `solti_api_in_flight_requests` | Gauge | `transport` | In-flight request count |
//!
//! Request duration buckets (seconds): `0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10`.
//!
//! ## Discovery metrics (`solti_discover_*`, feature `discover`)
//!
//! | Metric | Type | Labels | Description |
//! |---------------------------------------------------|-----------|-----------|-----------------------------------|
//! | `solti_discover_attempts_total` | Counter | — | Total sync attempts |
//! | `solti_discover_outcomes_total` | Counter | `outcome` | `success` / `failure` |
//! | `solti_discover_duration_seconds` | Histogram | `outcome` | Sync call duration |
//! | `solti_discover_failures_total` | Counter | `reason` | Failures grouped by cause |
//! | `solti_discover_last_success_timestamp_seconds` | Gauge | — | UNIX time of last success |
//! | `solti_discover_holds_total` | Counter | — | Server-advised retry holds |
//! | `solti_discover_hold_duration_seconds` | Histogram | — | Duration of advised holds |
//!
//! ## Process metrics (`process_*`, feature `process`)
//!
//! On Linux, [`register_process_collector`] adds the standard Prometheus process collector:
//! - `process_cpu_seconds_total`
//! - `process_resident_memory_bytes`
//! - `process_virtual_memory_bytes`
//! - `process_open_fds`
//! - `process_max_fds`
//! - `process_start_time_seconds`
//!
//! On other targets the function is a no-op: it compiles cleanly and registers nothing.
//!
//! ## Build info
//!
//! [`register_build_info`] registers a `solti_build_info` gauge whose value is always `1`.
//! Its labels (set as Prometheus *constant* labels) carry build-time identity.
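//!
//! As an illustration of what a scrape returns for such a gauge, the sketch below
//! renders one sample line in the Prometheus text exposition format. It is std-only
//! and does not use the `prometheus` crate's encoder; `render_build_info` and
//! `escape_label_value` are hypothetical helpers.
//!
//! ```rust
//! /// Escapes backslash, double quote, and newline as the text format requires
//! /// for label values.
//! fn escape_label_value(value: &str) -> String {
//!     value
//!         .replace('\\', "\\\\")
//!         .replace('"', "\\\"")
//!         .replace('\n', "\\n")
//! }
//!
//! /// Renders `solti_build_info{k="v",...} 1` from `(name, value)` label pairs.
//! fn render_build_info(labels: &[(&str, &str)]) -> String {
//!     let rendered: Vec<String> = labels
//!         .iter()
//!         .map(|(name, value)| format!("{name}=\"{}\"", escape_label_value(value)))
//!         .collect();
//!     format!("solti_build_info{{{}}} 1", rendered.join(","))
//! }
//! ```
//!
//! With the single label `("version", "1.2.3")` this yields
//! `solti_build_info{version="1.2.3"} 1`. The constant value `1` is the usual
//! info-metric convention: the labels carry the information, not the sample value.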
//!
//! ## Event → metric mapping (supervision + controller)
//!
//! ```text
//! TaskStarting → tasks_in_flight.inc() (+ task_restarts.inc() if attempt > 1)
//! TaskStopped → tasks_in_flight.dec()
//! TaskFailed → tasks_in_flight.dec()
//! TimeoutHit → task_timeouts.inc()
//! BackoffScheduled → task_backoff_count{source}.inc() + task_backoff_duration.observe(delay)
//! ActorExhausted → task_terminal{reason="exhausted"}.inc() + attempts_to_finalize{outcome="exhausted"}.observe(attempt)
//! ActorDead → task_terminal{reason="fatal"}.inc() + attempts_to_finalize{outcome="fatal"}.observe(attempt)
//! SubscriberOverflow → subscriber_overflow.inc()
//! SubscriberPanicked → subscriber_panicked.inc()
//! ControllerSubmitted → controller_submissions.inc()
//! ControllerRejected → controller_rejections{reason}.inc() (reason classified from Event.reason)
//! ```
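//!
//! The first four rows of the mapping can be sketched with plain counters. `Event` and
//! `Counters` below are simplified stand-ins invented for this example, not the
//! `taskvisor` event types or the real collectors. The guard on the decrement follows
//! the behaviour described under Notes.
//!
//! ```rust
//! /// Illustrative subset of the event kinds in the table above.
//! enum Event {
//!     TaskStarting { attempt: u32 },
//!     TaskStopped,
//!     TaskFailed,
//!     TimeoutHit,
//! }
//!
//! /// Plain integers standing in for the Prometheus gauge and counters.
//! #[derive(Debug, Default)]
//! struct Counters {
//!     tasks_in_flight: u64,
//!     task_restarts: u64,
//!     task_timeouts: u64,
//! }
//!
//! impl Counters {
//!     fn on_event(&mut self, event: &Event) {
//!         match event {
//!             Event::TaskStarting { attempt } => {
//!                 self.tasks_in_flight += 1;
//!                 // Any attempt after the first counts as a restart.
//!                 if *attempt > 1 {
//!                     self.task_restarts += 1;
//!                 }
//!             }
//!             Event::TaskStopped | Event::TaskFailed => {
//!                 // Saturating decrement: a stop without a start is a no-op.
//!                 self.tasks_in_flight = self.tasks_in_flight.saturating_sub(1);
//!             }
//!             Event::TimeoutHit => self.task_timeouts += 1,
//!         }
//!     }
//! }
//! ```
//!
//! Feeding `TaskStarting { attempt: 1 }`, `TaskStarting { attempt: 2 }`, `TaskFailed`
//! into a fresh `Counters` leaves one task in flight and one restart recorded.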
//!
//! ## Feature flags
//!
//! | Flag | Default | Effect |
//! |------------|---------|-------------------------------------------------------------------------------------------------------------------|
//! | `api` | off | Enables [`PrometheusApiMetrics`] (depends on `solti-api`) |
//! | `discover` | off | Enables [`PrometheusDiscoverMetrics`] (depends on `solti-discover`) |
//! | `process` | off | Makes [`register_process_collector`] register actual `process_*` metrics (Linux); propagates `prometheus/process` |
//! | `server` | off | Enables [`server`] - a supervised embedded HTTP task serving `/metrics` |
//! | `state` | off | Enables [`PrometheusStateCollector`] — pull-based `solti_sv_tasks_by_phase` snapshot (depends on `solti-core`) |
//!
//! ## Quick wire
//!
//! ```text
//! use std::sync::Arc;
//! use solti_prometheus::{
//!     PrometheusMetrics, PrometheusSubscriber, Registry,
//!     register_build_info, register_process_collector,
//! };
//!
//! let registry = Arc::new(Registry::new());
//!
//! // Core collectors.
//! let metrics = PrometheusMetrics::new(registry.clone())?;
//! let subscriber = PrometheusSubscriber::new(registry.clone())?;
//!
//! // Standard extras.
//! register_process_collector(&registry)?;
//! register_build_info(&registry, &[
//!     ("version", env!("CARGO_PKG_VERSION")),
//! ])?;
//!
//! // Wire into solti-runner:
//! let ctx = BuildContext::new(RunnerEnv::default(), Arc::new(metrics));
//! let router = RunnerRouter::new().with_context(ctx);
//!
//! // Wire into solti-core supervisor:
//! let subscribers: Vec<Arc<dyn Subscribe>> = vec![Arc::new(subscriber)];
//! ```
//!
//! For a full agent wiring, including the supervised `/metrics` HTTP task, `ApiMetricsBackend` (HTTP + gRPC), and `DiscoverMetricsBackend`,
//! see the reference agents under `examples/agentd-http` and `examples/agentd-grpc`.
//!
//! ## Notes
//!
//! - The `tasks_in_flight` gauge is guarded against going negative: a
//!   [`TaskStopped`](taskvisor::EventKind::TaskStopped) without a preceding
//!   [`TaskStarting`](taskvisor::EventKind::TaskStarting) is a no-op.
//! - Backoff / discover / API durations are converted from ms → seconds before histogram observation.
//! - [`PrometheusSubscriber`] uses `queue_capacity = 2048` (2× taskvisor default) to reduce event loss under high throughput.
//! - All collectors must share a single [`prometheus::Registry`] for a unified `/metrics` endpoint.
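//!
//! The unit conversion mentioned above is small enough to show in full. This is a
//! minimal sketch; `millis_to_seconds` is a hypothetical helper name, not a function
//! exported by this crate.
//!
//! ```rust
//! /// Converts a millisecond value to fractional seconds before it is
//! /// observed by a `*_seconds` histogram.
//! fn millis_to_seconds(millis: u64) -> f64 {
//!     millis as f64 / 1000.0
//! }
//! ```
//!
//! A 250 ms backoff delay is therefore observed as `0.25`, which lands in the `0.25`
//! bucket of a histogram that has one.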
//!
//! ## See also
//!
//! - [`solti_runner::MetricsBackend`] - trait backing [`PrometheusMetrics`].
//! - [`taskvisor::Subscribe`] - trait backing [`PrometheusSubscriber`].
//! - `solti_api::ApiMetricsBackend` - trait backing [`PrometheusApiMetrics`] (feature `api`).
//! - `solti_discover::DiscoverMetricsBackend` - trait backing [`PrometheusDiscoverMetrics`] (feature `discover`).
pub use ;
pub use register_process_collector;
pub use PrometheusMetrics;
pub use register_build_info;
pub use PrometheusDiscoverMetrics;
pub use PrometheusApiMetrics;
pub use ;
pub use PrometheusStateCollector;
pub use Registry;