Crate solti_prometheus

§solti-prometheus

Prometheus metrics for the Solti task-orchestration SDK.

Collectors and helpers share one prometheus::Registry. A single /metrics endpoint covers runner internals, supervisor events, API requests, discovery heartbeat, process stats, and build identity.

§Collectors and helpers

| Component | Prefix | Feature | Implements / emits |
|---|---|---|---|
| PrometheusMetrics | solti_runner_* | | solti_runner::MetricsBackend |
| PrometheusSubscriber | solti_sv_* + solti_ctrl_* | | taskvisor::Subscribe |
| PrometheusApiMetrics | solti_api_* | api | solti_api::ApiMetricsBackend |
| PrometheusDiscoverMetrics | solti_discover_* | discover | solti_discover::DiscoverMetricsBackend |
| register_process_collector | process_* | process | Prometheus’ default process collector (Linux-only effect) |
| register_build_info | solti_build_info | | Gauge = 1 carrying constant labels |
| server | | server | Embedded supervised HTTP task exposing /metrics |
| PrometheusStateCollector | solti_sv_tasks_by_phase | state | Pull-based collector over solti_core::TaskState |

§Architecture

  ┌─────────────────────────────────────────────────────────────────────────┐
  │                         Shared Registry                                 │
  │                                                                         │
  │   PrometheusMetrics         → solti_runner_*                            │
  │   PrometheusSubscriber      → solti_sv_* , solti_ctrl_*                 │
  │   PrometheusApiMetrics      → solti_api_*          [feature: api]       │
  │   PrometheusDiscoverMetrics → solti_discover_*     [feature: discover]  │
  │   register_process_collector→ process_*            [feature: process]   │
  │   register_build_info       → solti_build_info                          │
  └──────────┬──────────────────────────────────────────────┬───────────────┘
             │                                              │
             ▼                                              ▼
        BuildContext          Supervisor                /metrics HTTP
        runners call          event bus                 exposed by
        record_task_*()       fans events               solti_prometheus::server()
                              to on_event()             [feature: server]

§Runner metrics (solti_runner_*)

| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_runner_tasks_started_total | Counter | runner | Task spawn events |
| solti_runner_tasks_completed_total | Counter | runner, outcome | Task completion events |
| solti_runner_task_duration_seconds | Histogram | runner, outcome | Per-attempt execution duration |
| solti_runner_errors_total | Counter | runner, error | Runner setup/teardown errors |

Duration histogram buckets (seconds): [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 300, 1800, 3600]. The buckets are dense across the 10 ms to 10 s range, with a sparser tail out to one hour.
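To make the bucket layout concrete, the following is a minimal, std-only sketch of how a sample lands in these buckets. It does not use the prometheus crate; `DURATION_BUCKETS`, `smallest_bucket`, and `cumulative_counts` are illustrative names, not part of solti-prometheus. It assumes the usual Prometheus convention that bucket bounds are inclusive upper limits and counts are cumulative.

```rust
/// Upper bounds (seconds), copied from the bucket list above.
const DURATION_BUCKETS: [f64; 15] = [
    0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0, 3600.0,
];

/// Returns the smallest bucket bound that contains `seconds`,
/// or `None` when the sample only lands in the implicit `+Inf` bucket.
fn smallest_bucket(seconds: f64) -> Option<f64> {
    DURATION_BUCKETS.iter().copied().find(|&le| seconds <= le)
}

/// Cumulative per-bucket counts: each bucket counts every sample
/// less than or equal to its upper bound.
fn cumulative_counts(samples: &[f64]) -> Vec<u64> {
    DURATION_BUCKETS
        .iter()
        .map(|&le| samples.iter().filter(|&&s| s <= le).count() as u64)
        .collect()
}
```

Under these assumptions, a 300 ms task is first counted in the 0.5 bucket, and anything longer than one hour is visible only in `+Inf`.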

§Supervision metrics (solti_sv_*)

| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_sv_tasks_in_flight | Gauge | | Currently executing tasks |
| solti_sv_task_restarts_total | Counter | | Restarts (attempt > 1) |
| solti_sv_task_backoff_count_total | Counter | source | Backoff events |
| solti_sv_task_backoff_duration_seconds | Histogram | | Backoff delay duration |
| solti_sv_task_terminal_total | Counter | reason | Terminal task states |
| solti_sv_attempts_to_finalize | Histogram | outcome | Attempts when task left loop |
| solti_sv_task_timeouts_total | Counter | | Timeout events |
| solti_sv_subscriber_overflow_total | Counter | | Queue overflow (lost events) |
| solti_sv_subscriber_panicked_total | Counter | | Subscriber panics |
| solti_sv_tasks_by_phase | Gauge | phase | Current tasks per phase (feature state, pull-based snapshot) |

§Controller metrics (solti_ctrl_*)

| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_ctrl_submissions_total | Counter | | Controller submissions |
| solti_ctrl_rejections_total | CounterVec | reason | Controller rejections grouped by cause |

reason values (bounded, classified from Event.reason): [slot_full, slot_busy, add_failed, remove_failed, queue_failed, recovery_failed, bus_lagged, controller_exited, other, unknown].
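Keeping the label set bounded matters because every distinct label value creates a new time series. The sketch below shows one way a free-form reason could be folded into the ten values listed above. The actual classification rules are not documented here, so everything in it is an assumption: `classify_reason` is a hypothetical helper, prefix matching is a guess, and the split between `unknown` (no reason given) and `other` (unrecognized reason) is only one plausible reading.

```rust
/// The named labels from the list above ("other" and "unknown" are fallbacks).
const KNOWN_REASONS: [&str; 8] = [
    "slot_full",
    "slot_busy",
    "add_failed",
    "remove_failed",
    "queue_failed",
    "recovery_failed",
    "bus_lagged",
    "controller_exited",
];

/// Hypothetical classifier: folds a free-form reason into a bounded label.
/// Assumption: a reason is recognized when it starts with a known label.
fn classify_reason(reason: Option<&str>) -> &'static str {
    match reason {
        // Assumption: a missing reason maps to "unknown".
        None => "unknown",
        Some(raw) => KNOWN_REASONS
            .iter()
            .copied()
            .find(|label| raw.starts_with(*label))
            // Assumption: anything unrecognized maps to "other".
            .unwrap_or("other"),
    }
}
```

Whatever the real rules are, the property worth preserving is that the function can only ever return one of a fixed set of static strings.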

§API metrics (solti_api_*, feature api)

| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_api_requests_total | Counter | transport, method, path, status | Completed requests |
| solti_api_request_duration_seconds | Histogram | transport, method, path | Request duration |
| solti_api_in_flight_requests | Gauge | transport | In-flight request count |

Request duration buckets (seconds): 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10.

§Discovery metrics (solti_discover_*, feature discover)

| Metric | Type | Labels | Description |
|---|---|---|---|
| solti_discover_attempts_total | Counter | | Total sync attempts |
| solti_discover_outcomes_total | Counter | outcome | success / failure |
| solti_discover_duration_seconds | Histogram | outcome | Sync call duration |
| solti_discover_failures_total | Counter | reason | Failures grouped by cause |
| solti_discover_last_success_timestamp_seconds | Gauge | | UNIX time of last success |
| solti_discover_holds_total | Counter | | Server-advised retry holds |
| solti_discover_hold_duration_seconds | Histogram | | Duration of advised holds |

§Process metrics (process_*, feature process)

On Linux, register_process_collector adds the standard Prometheus process collector:

  • process_cpu_seconds_total
  • process_resident_memory_bytes
  • process_virtual_memory_bytes
  • process_open_fds
  • process_max_fds
  • process_start_time_seconds

On other targets the function is a no-op (it compiles cleanly and registers nothing).
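The "compiles everywhere, registers only on Linux" behavior is a common conditional-compilation pattern. The sketch below illustrates it with std only. `process_metrics_for_target` is a hypothetical stand-in that returns the names that would be registered rather than touching a registry, and it leaves out the `process` feature gate that the real function also depends on.

```rust
/// Metric names taken from the list above.
const PROCESS_METRICS: [&str; 6] = [
    "process_cpu_seconds_total",
    "process_resident_memory_bytes",
    "process_virtual_memory_bytes",
    "process_open_fds",
    "process_max_fds",
    "process_start_time_seconds",
];

/// Hypothetical stand-in for the Linux path: reports every process metric.
#[cfg(target_os = "linux")]
fn process_metrics_for_target() -> Vec<&'static str> {
    PROCESS_METRICS.to_vec()
}

/// Hypothetical stand-in for every other target: compiles, reports nothing.
#[cfg(not(target_os = "linux"))]
fn process_metrics_for_target() -> Vec<&'static str> {
    Vec::new()
}
```

Because both variants share one signature, callers need no `cfg` attributes of their own, which is what lets a wiring snippet call `register_process_collector` unconditionally.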

§Build info

register_build_info registers a solti_build_info gauge whose value is always 1. Its labels (set as Prometheus constant labels) carry build-time identity.
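For readers unfamiliar with the "info gauge" pattern, the sketch below renders roughly what a scrape would show for such a metric in the Prometheus text exposition format. It is std-only and illustrative: `build_info_line` and `escape_label_value` are hypothetical helpers, not crate APIs, and a real registry may order labels differently.

```rust
/// Escapes a label value per the Prometheus text format:
/// backslash, double quote, and line feed are escaped.
fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical helper: renders one exposition line for a constant-1 info gauge.
fn build_info_line(labels: &[(&str, &str)]) -> String {
    if labels.is_empty() {
        return "solti_build_info 1".to_string();
    }
    let pairs: Vec<String> = labels
        .iter()
        .map(|(key, value)| format!("{}=\"{}\"", key, escape_label_value(value)))
        .collect();
    format!("solti_build_info{{{}}} 1", pairs.join(","))
}
```

The value carries no information. The labels do, which is why queries typically join other series against this metric to attach version or commit identity.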

§Event → metric mapping (supervision + controller)

 TaskStarting        → tasks_in_flight.inc() (+ task_restarts.inc() if attempt > 1)
 TaskStopped         → tasks_in_flight.dec()
 TaskFailed          → tasks_in_flight.dec()
 TimeoutHit          → task_timeouts.inc()
 BackoffScheduled    → task_backoff_count{source}.inc() + task_backoff_duration.observe(delay)
 ActorExhausted      → task_terminal{reason="exhausted"}.inc() + attempts_to_finalize{outcome="exhausted"}.observe(attempt)
 ActorDead           → task_terminal{reason="fatal"}.inc()     + attempts_to_finalize{outcome="fatal"}.observe(attempt)
 SubscriberOverflow  → subscriber_overflow.inc()
 SubscriberPanicked  → subscriber_panicked.inc()
 ControllerSubmitted → controller_submissions.inc()
 ControllerRejected  → controller_rejections{reason}.inc()  (reason classified from Event.reason)
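The mapping above can be read as a small state machine. The following std-only sketch models a subset of it with plain integers and maps in place of Prometheus collectors. `Event` and `SvMetrics` here are simplified, hypothetical types (the real taskvisor events carry more fields), and histogram observations are approximated by pushing raw values into a `Vec`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a subset of the supervisor events listed above.
enum Event {
    TaskStarting { attempt: u32 },
    TaskStopped,
    TaskFailed,
    TimeoutHit,
    ActorExhausted { attempt: u32 },
    ActorDead { attempt: u32 },
}

/// Plain-data stand-in for the supervision collectors.
#[derive(Default)]
struct SvMetrics {
    tasks_in_flight: u64,
    task_restarts: u64,
    task_timeouts: u64,
    task_terminal: HashMap<&'static str, u64>,
    attempts_to_finalize: HashMap<&'static str, Vec<u32>>,
}

impl SvMetrics {
    fn on_event(&mut self, event: &Event) {
        match *event {
            Event::TaskStarting { attempt } => {
                self.tasks_in_flight += 1;
                if attempt > 1 {
                    self.task_restarts += 1;
                }
            }
            // Saturating decrement: a stop without a start leaves the gauge at 0.
            Event::TaskStopped | Event::TaskFailed => {
                self.tasks_in_flight = self.tasks_in_flight.saturating_sub(1);
            }
            Event::TimeoutHit => self.task_timeouts += 1,
            Event::ActorExhausted { attempt } => self.finalize("exhausted", attempt),
            Event::ActorDead { attempt } => self.finalize("fatal", attempt),
        }
    }

    /// Records a terminal state under the same label in both metrics.
    fn finalize(&mut self, label: &'static str, attempt: u32) {
        *self.task_terminal.entry(label).or_insert(0) += 1;
        self.attempts_to_finalize.entry(label).or_default().push(attempt);
    }
}
```

Routing both terminal events through one helper keeps the `reason` and `outcome` label values identical, matching the paired updates in the mapping.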

§Feature flags

| Flag | Default | Effect |
|---|---|---|
| api | off | Enables PrometheusApiMetrics (depends on solti-api) |
| discover | off | Enables PrometheusDiscoverMetrics (depends on solti-discover) |
| process | off | Makes register_process_collector register actual process_* metrics (Linux); propagates prometheus/process |
| server | off | Enables server, a supervised embedded HTTP task serving /metrics |
| state | off | Enables PrometheusStateCollector, a pull-based solti_sv_tasks_by_phase snapshot (depends on solti-core) |

§Quick wire

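use std::sync::Arc;
use solti_prometheus::{
    PrometheusMetrics, PrometheusSubscriber, Registry,
    register_build_info, register_process_collector,
};
// Assumed import paths for the types used below; adjust to the actual re-exports.
use solti_runner::{BuildContext, RunnerEnv, RunnerRouter};
use taskvisor::Subscribe;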

let registry = Arc::new(Registry::new());

// Core collectors.
let metrics = PrometheusMetrics::new(registry.clone())?;
let subscriber = PrometheusSubscriber::new(registry.clone())?;

// Standard extras.
register_process_collector(&registry)?;
register_build_info(&registry, &[
    ("version", env!("CARGO_PKG_VERSION")),
])?;

// Wire into solti-runner:
let ctx = BuildContext::new(RunnerEnv::default(), Arc::new(metrics));
let router = RunnerRouter::new().with_context(ctx);

// Wire into solti-core supervisor:
let subscribers: Vec<Arc<dyn Subscribe>> = vec![Arc::new(subscriber)];

For a full agent wiring, including the supervised /metrics HTTP task, ApiMetricsBackend (HTTP + gRPC), and DiscoverMetricsBackend, see the reference agents under examples/agentd-http and examples/agentd-grpc.

§Notes

  • tasks_in_flight gauge is guarded against going negative: a TaskStopped without a preceding TaskStarting is a no-op.
  • Backoff / discover / API durations are converted from ms → seconds before histogram observation.
  • PrometheusSubscriber uses queue_capacity = 2048 (2× taskvisor default) to reduce event loss under high throughput.
  • All collectors must share a single prometheus::Registry for a unified /metrics endpoint.
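The ms → seconds conversion noted above follows the Prometheus convention of exposing durations in base units. A minimal std-only sketch is below. `millis_to_seconds` is an illustrative helper, and the assumption that the source values arrive as whole milliseconds in a `u64` is not confirmed by this page.

```rust
use std::time::Duration;

/// Converts a millisecond count into fractional seconds, the unit
/// expected by metrics whose names end in `_seconds`.
fn millis_to_seconds(ms: u64) -> f64 {
    Duration::from_millis(ms).as_secs_f64()
}
```

Going through `Duration` rather than dividing by 1000 by hand keeps the unit explicit at the call site.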

§Also

Structs§

PrometheusApiMetrics
Prometheus implementation of ApiMetricsBackend.
PrometheusDiscoverMetrics
Prometheus implementation of DiscoverMetricsBackend.
PrometheusMetrics
Prometheus metrics backend for solti runners.
PrometheusStateCollector
Pull-based Prometheus collector for solti_sv_tasks_by_phase{phase}.
PrometheusSubscriber
Prometheus subscriber for supervision-level metrics.
Registry
A struct for registering Prometheus collectors, collecting their metrics, and gathering them into MetricFamilies for exposition.

Constants§

DEFAULT_QUEUE_CAPACITY
Default subscriber queue capacity.
METRICS_SERVER_SLOT
Logical slot name for the metrics server task.

Functions§

register_build_info
Register a solti_build_info{labels...} gauge with value 1.
register_process_collector
Register the default Prometheus process collector.
server
Builds the metrics HTTP server task and its supervision specification.