solti-prometheus
Prometheus metrics for the Solti task-orchestration SDK.
Collectors and helpers share one prometheus::Registry.
A single /metrics endpoint covers runner internals, supervisor events, API requests, discovery heartbeat, process stats, and build identity.
What ships
| Component | Prefix | Feature | Implements / emits |
|---|---|---|---|
| PrometheusMetrics | solti_runner_* | — | solti_runner::MetricsBackend |
| PrometheusSubscriber | solti_sv_* + solti_ctrl_* | — | taskvisor::Subscribe |
| PrometheusApiMetrics | solti_api_* | api | solti_api::ApiMetricsBackend |
| PrometheusDiscoverMetrics | solti_discover_* | discover | solti_discover::DiscoverMetricsBackend |
| register_process_collector | process_* | process | Prometheus' default process collector (Linux-only effect) |
| register_build_info | solti_build_info | — | Gauge = 1 carrying constant labels |
| server | — | server | Embedded supervised HTTP task exposing /metrics |
| PrometheusStateCollector | solti_sv_tasks_by_phase | state | Pull-based collector over solti_core::TaskState |
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                             Shared Registry                             │
│                                                                         │
│  PrometheusMetrics         → solti_runner_*                             │
│  PrometheusSubscriber      → solti_sv_*, solti_ctrl_*                   │
│  PrometheusApiMetrics      → solti_api_*         [feature: api]         │
│  PrometheusDiscoverMetrics → solti_discover_*    [feature: discover]    │
│  register_process_collector→ process_*           [feature: process]     │
│  register_build_info       → solti_build_info                           │
└──────────┬──────────────────────────────────────────────┬───────────────┘
           │                                              │
           ▼                                              ▼
     BuildContext           Supervisor          /metrics HTTP
     runners call           event bus           exposed by
     record_task_*()        fans events         solti_prometheus::server()
                            to on_event()       [feature: server]
Runner metrics (solti_runner_*)
| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_runner_tasks_started_total | Counter | runner | Task spawn events |
| solti_runner_tasks_completed_total | Counter | runner, outcome | Task completion events |
| solti_runner_task_duration_seconds | Histogram | runner, outcome | Per-attempt execution duration |
| solti_runner_errors_total | Counter | runner, error | Runner setup/teardown errors |
Duration histogram buckets (seconds): 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 300, 1800, 3600. Dense through the 10 ms – 10 s range, sparser long tail out to one hour.
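To make the bucket layout concrete, here is a minimal sketch of how a Prometheus-style cumulative histogram places an observation: the value counts toward the first bucket whose upper bound (`le`) is greater than or equal to it, and only toward the implicit `+Inf` bucket if it exceeds every bound. The constant and function names are illustrative and not part of this crate's API.

```rust
/// Upper bounds (seconds) copied from the bucket list above.
const RUNNER_DURATION_BUCKETS: [f64; 15] = [
    0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0, 3600.0,
];

/// Returns the smallest bucket bound `le` with `value_secs <= le`,
/// or `None` when the value only fits the implicit `+Inf` bucket.
fn smallest_bucket(value_secs: f64) -> Option<f64> {
    RUNNER_DURATION_BUCKETS
        .iter()
        .copied()
        .find(|le| value_secs <= *le)
}
```

Under these bounds a 42 s attempt lands in the 60 s bucket, while a two-hour attempt is only visible in `+Inf`, so tasks expected to run longer than an hour would lose resolution.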
Supervision metrics (solti_sv_*)
| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_sv_tasks_in_flight | Gauge | — | Currently executing tasks |
| solti_sv_task_restarts_total | Counter | — | Restarts (attempt > 1) |
| solti_sv_task_backoff_count_total | Counter | source | Backoff events |
| solti_sv_task_backoff_duration_seconds | Histogram | — | Backoff delay duration |
| solti_sv_task_terminal_total | Counter | reason | Terminal task states |
| solti_sv_attempts_to_finalize | Histogram | outcome | Attempts when task left loop |
| solti_sv_task_timeouts_total | Counter | — | Timeout events |
| solti_sv_subscriber_overflow_total | Counter | — | Queue overflow (lost events) |
| solti_sv_subscriber_panicked_total | Counter | — | Subscriber panics |
| solti_sv_tasks_by_phase | Gauge | phase | Current tasks per phase (feature state, pull-based snapshot) |
Controller metrics (solti_ctrl_*)
| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_ctrl_submissions_total | Counter | — | Controller submissions |
| solti_ctrl_rejections_total | Counter | reason | Controller rejections grouped by cause |
reason values (bounded, classified from Event.reason): slot_full, slot_busy,
add_failed, remove_failed, queue_failed, recovery_failed, bus_lagged,
controller_exited, other, unknown.
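The classification step can be pictured as folding a free-form reason into the bounded set listed above. The sketch below is hypothetical: it assumes the reason arrives as an optional string and matches it exactly after trimming and lowercasing. The crate's real classifier, and the actual type of `Event.reason`, may differ.

```rust
/// Specific label values listed above ("other" and "unknown" are fallbacks).
const KNOWN_REASONS: [&str; 8] = [
    "slot_full",
    "slot_busy",
    "add_failed",
    "remove_failed",
    "queue_failed",
    "recovery_failed",
    "bus_lagged",
    "controller_exited",
];

/// Maps a free-form reason onto the bounded label set.
/// Assumption: a missing or empty reason becomes "unknown",
/// and anything unrecognized becomes "other".
fn classify_rejection(reason: Option<&str>) -> &'static str {
    let normalized = match reason {
        Some(raw) => raw.trim().to_ascii_lowercase(),
        None => return "unknown",
    };
    if normalized.is_empty() {
        return "unknown";
    }
    KNOWN_REASONS
        .iter()
        .copied()
        .find(|known| *known == normalized)
        .unwrap_or("other")
}
```

Returning `&'static str` from a fixed list is what keeps the `reason` label bounded: no caller-supplied text can reach the label value.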
API metrics (solti_api_*, feature api)
| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_api_requests_total | Counter | transport, method, path, status | Completed requests |
| solti_api_request_duration_seconds | Histogram | transport, method, path | Request duration |
| solti_api_in_flight_requests | Gauge | transport | In-flight request count |
Request duration buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10.
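An in-flight gauge such as solti_api_in_flight_requests has to be decremented on every exit path, including early returns. One common way to get that in Rust is an RAII guard; the sketch below uses a plain atomic instead of a Prometheus gauge, and `InFlightGuard` is a hypothetical name, not something this crate is known to export.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Increments a shared counter on creation and decrements it on drop,
/// so the slot is released however the request handler exits.
struct InFlightGuard {
    gauge: Arc<AtomicI64>,
}

impl InFlightGuard {
    /// Marks one request as in flight.
    fn enter(gauge: &Arc<AtomicI64>) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        Self {
            gauge: Arc::clone(gauge),
        }
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}
```

A handler would create the guard as its first statement and let it fall out of scope when the response is produced.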
Discovery metrics (solti_discover_*, feature discover)
| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_discover_attempts_total | Counter | — | Total sync attempts |
| solti_discover_outcomes_total | Counter | outcome | success / failure |
| solti_discover_duration_seconds | Histogram | outcome | Sync call duration |
| solti_discover_failures_total | Counter | reason | Failures grouped by cause |
| solti_discover_last_success_timestamp_seconds | Gauge | — | UNIX time of last success |
| solti_discover_holds_total | Counter | — | Server-advised retry holds |
| solti_discover_hold_duration_seconds | Histogram | — | Duration of advised holds |
Process metrics (process_*, feature process)
On Linux, register_process_collector adds the standard Prometheus process collector:
- process_cpu_seconds_total
- process_resident_memory_bytes
- process_virtual_memory_bytes
- process_open_fds
- process_max_fds
- process_start_time_seconds
On other targets the function is a no-op.
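The "same function, no-op elsewhere" behaviour is typically achieved with conditional compilation. The following sketch shows the pattern with a stand-in registry type. `FakeRegistry` and `register_process_collector_sketch` are hypothetical names, and the real function additionally depends on the process feature flag.

```rust
/// Stand-in for a metrics registry, used only to keep the sketch dependency-free.
#[derive(Default)]
struct FakeRegistry {
    collectors: Vec<&'static str>,
}

/// Linux: record that a process collector was registered.
#[cfg(target_os = "linux")]
fn register_process_collector_sketch(registry: &mut FakeRegistry) -> bool {
    registry.collectors.push("process");
    true
}

/// Other targets: same signature, nothing is registered.
#[cfg(not(target_os = "linux"))]
fn register_process_collector_sketch(_registry: &mut FakeRegistry) -> bool {
    false
}
```

Because both variants share one signature, callers need no `cfg` attributes of their own.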
Build info
register_build_info registers a solti_build_info gauge whose value is always 1.
Its labels (set as Prometheus constant labels) carry build-time identity.
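For orientation, the sketch below renders what such a gauge looks like as a single sample line in the Prometheus text exposition format. The label names used in the usage note (`version`, `commit`) are assumptions for illustration; the crate's actual label set is not listed here.

```rust
/// Renders one `solti_build_info` sample line in the Prometheus text format.
/// The caller chooses the label names; the value is always 1.
fn render_build_info(labels: &[(&str, &str)]) -> String {
    let rendered: Vec<String> = labels
        .iter()
        .map(|(name, value)| {
            // Escape backslash, double quote, and newline as the text format requires.
            let escaped = value
                .replace('\\', "\\\\")
                .replace('"', "\\\"")
                .replace('\n', "\\n");
            format!("{name}=\"{escaped}\"")
        })
        .collect();
    format!("solti_build_info{{{}}} 1", rendered.join(","))
}
```

Calling it with `[("version", "1.2.3"), ("commit", "abc123")]` yields `solti_build_info{version="1.2.3",commit="abc123"} 1`. Dashboards can then join other series against this one to group by build.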
Event → metric mapping (supervision + controller)
TaskStarting → tasks_in_flight.inc() (+ task_restarts.inc() if attempt > 1)
TaskStopped → tasks_in_flight.dec()
TaskFailed → tasks_in_flight.dec()
TimeoutHit → task_timeouts.inc()
BackoffScheduled → task_backoff_count{source}.inc() + task_backoff_duration.observe(delay)
ActorExhausted → task_terminal{reason="exhausted"}.inc() + attempts_to_finalize{outcome="exhausted"}.observe(attempt)
ActorDead → task_terminal{reason="fatal"}.inc() + attempts_to_finalize{outcome="fatal"}.observe(attempt)
SubscriberOverflow → subscriber_overflow.inc()
SubscriberPanicked → subscriber_panicked.inc()
ControllerSubmitted → controller_submissions.inc()
ControllerRejected → controller_rejections{reason}.inc() (reason classified from Event.reason)
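A subset of the mapping above can be sketched as a single `match` over the event type. This is a simplified model, not the crate's code: the `Event` enum is a reduced stand-in, plain integers replace the Prometheus collectors, and applying the non-negative guard to `TaskFailed` as well as `TaskStopped` is an assumption.

```rust
/// Reduced stand-in for the supervisor events named above.
enum Event {
    TaskStarting { attempt: u32 },
    TaskStopped,
    TaskFailed,
    TimeoutHit,
}

/// Plain integers instead of Prometheus collectors.
#[derive(Default, Debug, PartialEq)]
struct SvMetrics {
    tasks_in_flight: i64,
    task_restarts: u64,
    task_timeouts: u64,
}

impl SvMetrics {
    /// Applies one event to the metric state.
    fn on_event(&mut self, event: &Event) {
        match event {
            Event::TaskStarting { attempt } => {
                self.tasks_in_flight += 1;
                // Any attempt after the first counts as a restart.
                if *attempt > 1 {
                    self.task_restarts += 1;
                }
            }
            // Guarded decrement: the gauge never drops below zero.
            Event::TaskStopped | Event::TaskFailed => {
                if self.tasks_in_flight > 0 {
                    self.tasks_in_flight -= 1;
                }
            }
            Event::TimeoutHit => self.task_timeouts += 1,
        }
    }
}
```

Feeding a stray `TaskStopped` into a fresh `SvMetrics` leaves `tasks_in_flight` at 0, which is the behaviour the guard is there to provide.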
Labels cardinality
| Label | Values | Cardinality |
|---|---|---|
| runner | subprocess, wasm, container | low |
| outcome | success, failure, canceled, timeout (runner); success, failure (discover) | low |
| error | cgroup_prepare_failed, backend_config_failed, spawn_failed, module_load_failed (from solti_runner::RunnerErrorKind) | low |
| source | failure, success | low |
| reason | exhausted, fatal (terminal); slot_full, slot_busy, add_failed, remove_failed, queue_failed, recovery_failed, bus_lagged, controller_exited, other, unknown (controller rejections); connect, timeout, rejected_client, rejected_server, parse, auth, other (discover failures) | low |
| transport | http, grpc | low |
| method | HTTP method for HTTP, RPC method name for gRPC | low |
| path | Templated route (/api/v1/tasks/{id}) for HTTP, full RPC path for gRPC | low (bounded by route set) |
| status | HTTP status code (HTTP), gRPC code number (gRPC) | low |
All label sets have low, bounded cardinality.
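"Bounded" can be turned into a number: the worst-case series count for a metric is the product of the value counts of its labels, and a histogram multiplies that by its bucket count plus three (the `+Inf` bucket, `_sum`, and `_count`). The helper names below are illustrative only.

```rust
/// Upper bound on time series for a counter or gauge: the product of
/// the number of distinct values each label can take.
fn max_series(label_value_counts: &[usize]) -> usize {
    label_value_counts.iter().product()
}

/// A histogram exposes, per label combination, one `_bucket` series per
/// finite bound, one for `+Inf`, plus `_sum` and `_count`.
fn max_histogram_series(label_value_counts: &[usize], finite_buckets: usize) -> usize {
    max_series(label_value_counts) * (finite_buckets + 3)
}
```

With 3 runner values and 4 runner outcomes, solti_runner_tasks_completed_total tops out at 12 series, and solti_runner_task_duration_seconds with its 15 finite buckets at 216.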
Feature flags
| Flag | Default | Effect |
|---|---|---|
| api | off | PrometheusApiMetrics (depends on solti-api) |
| discover | off | PrometheusDiscoverMetrics (depends on solti-discover) |
| process | off | Makes register_process_collector register actual process_* metrics on Linux |
| server | off | server: a supervised embedded HTTP task serving /metrics |
| state | off | PrometheusStateCollector: pull-based solti_sv_tasks_by_phase snapshot (depends on solti-core) |
Example
For a full agent wiring (shared registry, runner metrics, subscriber, supervised /metrics HTTP task, HTTP/gRPC ApiMetricsBackend, and DiscoverMetricsBackend), see the reference agents.
Notes
- The tasks_in_flight gauge is guarded against going negative: a TaskStopped without a preceding TaskStarting is a no-op.
- Backoff / discover / API durations are converted from ms → seconds before histogram observation.
- PrometheusSubscriber defaults queue_capacity to DEFAULT_QUEUE_CAPACITY.
- All collectors must share one prometheus::Registry for a unified /metrics endpoint.
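The millisecond-to-second conversion mentioned in the notes matters because Prometheus conventions and all bucket bounds in this document are expressed in seconds. A minimal sketch, assuming the source value is a whole number of milliseconds (the function name and parameter type are illustrative):

```rust
use std::time::Duration;

/// Converts a millisecond reading into the seconds value a histogram would observe.
fn ms_to_seconds(ms: u64) -> f64 {
    Duration::from_millis(ms).as_secs_f64()
}
```

Observing the raw millisecond value instead would push nearly every sample past the largest bucket bound.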