solti-prometheus 0.0.1

Solti SDK Prometheus metrics.

solti-prometheus

Prometheus metrics for the solti task execution system.

Architecture

  ┌─────────────────────────────────────────────────────────┐
  │                  Shared Registry                        │
  │                                                         │
  │  PrometheusMetrics          PrometheusSubscriber        │
  │  (MetricsBackend)           (taskvisor::Subscribe)      │
  │  ├─ solti_runner_*          ├─ solti_sv_*               │
  │  └─ runner calls            └─ event bus fans out       │
  │     record_task_*()            events to on_event()     │
  └─────────────────────────────────────────────────────────┘
          │                            │
          ▼                            ▼
       BuildContext                 Supervisor
       └─► runners                  └─► event stream

Metric namespaces

  Subsystem              Prefix          Source
  ─────────              ──────          ──────
  PrometheusMetrics      solti_runner_*  MetricsBackend trait
  PrometheusSubscriber   solti_sv_*      Subscribe trait
  Controller (optional)  solti_ctrl_*    Subscribe trait

Runner metrics (solti_runner_*)

  Metric                              Type       Labels           Description
  ──────                              ────       ──────           ───────────
  solti_runner_tasks_started_total    Counter    runner           Task spawn events
  solti_runner_tasks_completed_total  Counter    runner, outcome  Task completion events
  solti_runner_task_duration_seconds  Histogram  runner, outcome  Per-attempt execution duration
  solti_runner_errors_total           Counter    runner, error    Runner setup/teardown errors

Duration histogram buckets (seconds): 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 120, 300, 600, 1800, 3600.
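
To make the bucket layout concrete, here is a minimal std-only sketch of how an observation relates to these bounds. It assumes standard Prometheus histogram semantics (each bucket is cumulative and its upper bound is inclusive); the names DURATION_BUCKETS, smallest_bucket and cumulative_counts are illustrative and are not part of the crate.

```rust
/// Upper bounds (seconds) of the duration histogram, copied from the list above.
const DURATION_BUCKETS: [f64; 14] = [
    0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0, 600.0, 1800.0, 3600.0,
];

/// Index of the smallest bucket whose upper bound is >= `seconds`.
/// Returns `None` when the value only lands in the implicit `+Inf` bucket.
fn smallest_bucket(seconds: f64) -> Option<usize> {
    DURATION_BUCKETS.iter().position(|&le| seconds <= le)
}

/// Cumulative per-bucket counts: every bucket counts all observations
/// less than or equal to its upper bound, so counts never decrease
/// from one bucket to the next.
fn cumulative_counts(observations: &[f64]) -> Vec<u64> {
    DURATION_BUCKETS
        .iter()
        .map(|&le| observations.iter().filter(|&&v| v <= le).count() as u64)
        .collect()
}
```

Under these assumptions a 30 ms attempt first appears in the 0.05 bucket, while an attempt longer than one hour is visible only in +Inf, so quantile estimates above 3600 seconds are not meaningful.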

Supervision metrics (solti_sv_*)

  Metric                                  Type       Labels  Description
  ──────                                  ────       ──────  ───────────
  solti_sv_tasks_in_flight                Gauge      -       Currently executing tasks
  solti_sv_task_restarts_total            Counter    -       Restarts (attempt > 1)
  solti_sv_task_backoff_count_total       Counter    source  Backoff events
  solti_sv_task_backoff_duration_seconds  Histogram  -       Backoff delay duration
  solti_sv_task_terminal_total            Counter    reason  Terminal task states
  solti_sv_task_timeouts_total            Counter    -       Timeout events
  solti_sv_subscriber_overflow_total      Counter    -       Queue overflow (lost events)
  solti_sv_subscriber_panicked_total      Counter    -       Subscriber panics

Controller metrics (solti_ctrl_*, feature controller)

  Metric                        Type     Labels  Description
  ──────                        ────     ──────  ───────────
  solti_ctrl_submissions_total  Counter  -       Controller submissions
  solti_ctrl_rejections_total   Counter  -       Controller rejections

Event → metric mapping

  taskvisor event          metric update
  ───────────────          ─────────────
  TaskStarting           → tasks_in_flight.inc()
                           + task_restarts.inc()  (if attempt > 1)
  TaskStopped            → tasks_in_flight.dec()
  TaskFailed             → tasks_in_flight.dec()
  TimeoutHit             → task_timeouts.inc()
  BackoffScheduled       → task_backoff_count{source}.inc()
                           + task_backoff_duration.observe(delay)
  ActorExhausted         → task_terminal{reason="exhausted"}.inc()
  ActorDead              → task_terminal{reason="fatal"}.inc()
  SubscriberOverflow     → subscriber_overflow.inc()
  SubscriberPanicked     → subscriber_panicked.inc()
  ControllerSubmitted    → controller_submissions.inc()
  ControllerRejected     → controller_rejections.inc()
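
The mapping above can be sketched as a single match over the event type. This is a std-only illustration, not the crate's implementation: Event and SvMetrics are hypothetical stand-ins for the taskvisor event type and the solti_sv_* collectors, plain integers replace real Prometheus collectors, and the subscriber and controller events are left out for brevity. Clamping the gauge on TaskFailed as well as TaskStopped is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the taskvisor event type; only the fields
/// the mapping needs are modeled.
enum Event {
    TaskStarting { attempt: u32 },
    TaskStopped,
    TaskFailed,
    TimeoutHit,
    BackoffScheduled { source: &'static str, delay_ms: u64 },
    ActorExhausted,
    ActorDead,
}

/// Hypothetical in-memory stand-in for the solti_sv_* collectors.
#[derive(Default)]
struct SvMetrics {
    tasks_in_flight: u64,
    task_restarts: u64,
    task_timeouts: u64,
    backoff_count: HashMap<&'static str, u64>,
    backoff_seconds: Vec<f64>,
    terminal: HashMap<&'static str, u64>,
}

impl SvMetrics {
    fn on_event(&mut self, event: &Event) {
        match event {
            Event::TaskStarting { attempt } => {
                self.tasks_in_flight += 1;
                // Any attempt after the first counts as a restart.
                if *attempt > 1 {
                    self.task_restarts += 1;
                }
            }
            Event::TaskStopped | Event::TaskFailed => {
                // Saturating subtraction: a stop with no matching start
                // leaves the gauge at zero instead of going negative.
                self.tasks_in_flight = self.tasks_in_flight.saturating_sub(1);
            }
            Event::TimeoutHit => self.task_timeouts += 1,
            Event::BackoffScheduled { source, delay_ms } => {
                *self.backoff_count.entry(*source).or_insert(0) += 1;
                // Milliseconds are converted to seconds before observation.
                self.backoff_seconds.push(*delay_ms as f64 / 1000.0);
            }
            Event::ActorExhausted => *self.terminal.entry("exhausted").or_insert(0) += 1,
            Event::ActorDead => *self.terminal.entry("fatal").or_insert(0) += 1,
        }
    }
}
```

Keeping the whole mapping in one match makes it easy to check that every event touches at most two metrics and that the gauge is only ever changed by start and stop events.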

Labels

  Label    Values                                  Cardinality
  ─────    ──────                                  ───────────
  runner   subprocess, wasm, container             low
  outcome  success, failure, canceled, timeout     low
  error    spawn_failed, backend_config_failed, …  low
  source   failure, success                        low
  reason   exhausted, fatal                        low

All label sets have low, bounded cardinality.
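
One way to keep label values bounded is to derive them from closed enums instead of free-form strings. The sketch below is illustrative only: the Runner and Outcome enums, as_label and label_pairs are hypothetical names, and the crate may build its label values differently. The values themselves are taken from the table above.

```rust
/// Hypothetical enum for the `runner` label.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Runner {
    Subprocess,
    Wasm,
    Container,
}

/// Hypothetical enum for the `outcome` label.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Outcome {
    Success,
    Failure,
    Canceled,
    Timeout,
}

impl Runner {
    const ALL: [Runner; 3] = [Runner::Subprocess, Runner::Wasm, Runner::Container];

    fn as_label(self) -> &'static str {
        match self {
            Runner::Subprocess => "subprocess",
            Runner::Wasm => "wasm",
            Runner::Container => "container",
        }
    }
}

impl Outcome {
    const ALL: [Outcome; 4] = [
        Outcome::Success,
        Outcome::Failure,
        Outcome::Canceled,
        Outcome::Timeout,
    ];

    fn as_label(self) -> &'static str {
        match self {
            Outcome::Success => "success",
            Outcome::Failure => "failure",
            Outcome::Canceled => "canceled",
            Outcome::Timeout => "timeout",
        }
    }
}

/// Every (runner, outcome) combination a metric with both labels can emit.
fn label_pairs() -> Vec<(&'static str, &'static str)> {
    let mut pairs = Vec::new();
    for &runner in Runner::ALL.iter() {
        for &outcome in Outcome::ALL.iter() {
            pairs.push((runner.as_label(), outcome.as_label()));
        }
    }
    pairs
}
```

With three runners and four outcomes, a metric labeled by both has at most 12 label combinations, which is the kind of fixed upper bound the table describes.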

Feature flags

  Flag        Default  Effect
  ────        ───────  ──────
  controller  off      Adds solti_ctrl_* metrics for controller submit/reject

Notes

  • The tasks_in_flight gauge is guarded against going negative: a TaskStopped without a preceding TaskStarting is a no-op.
  • Backoff duration is converted from milliseconds to seconds before histogram observation.
  • PrometheusSubscriber uses queue_capacity = 2048 (2x the taskvisor default) to reduce event loss under high throughput.
  • Both collectors must share a single prometheus::Registry for a unified /metrics endpoint.