slokit 0.6.0

SLO and error-budget engine for Rust: compute error budgets and burn rates, and generate multi-window multi-burn-rate Prometheus alert rules from sloth-compatible specs.

slokit


An SLO and error-budget engine for Rust.

slokit does two things the existing tools (all Go or Python) do not do together:

  1. A library core with no serde, YAML, or CLI dependencies, so error-budget and burn-rate math embeds directly inside your services (for example, an Axum handler that reports live budget status).
  2. A generator that reads a sloth-compatible YAML spec and emits Prometheus recording rules, metadata rules, and multi-window multi-burn-rate (MWMBR) page/ticket alerts as a single static binary.

It is drop-in compatible with the sloth prometheus/v1 spec, so existing specs work unchanged, and the generated metrics use the same slo:... names and sloth_* labels, so your Grafana dashboards keep working.

Install

cargo install slokit          # the CLI
cargo add slokit              # the library (add `--no-default-features` for the lean core)

CLI

# Generate Prometheus rules from a spec
slokit generate -i slos.yaml -o rules.yaml

# Generate a Prometheus Operator PrometheusRule instead
slokit generate -i slos.yaml --format operator

# Validate a spec without generating
slokit validate -i slos.yaml

# Lint a spec for advisory issues (100% objective, period shorter than the
# burn-rate windows, alerts missing routing labels, ...). --strict fails CI.
slokit lint -i slos.yaml --strict

# Do the error-budget math from the terminal
slokit calc --objective 99.9 --period 30d --total 1000000 --bad 250

# Check a live Prometheus and report current budget/burn (exits 1 if any SLO breaches)
slokit check -i slos.yaml --url http://localhost:9090 --window 1h

# Check machine-readably, failing the build on warnings too
slokit check -i slos/ --url http://localhost:9090 --output json --fail-on warning

# Generate a Grafana dashboard (JSON) from a spec
slokit dashboard -i slos.yaml -o dashboard.json

Every command that takes -i accepts either a single spec file or a directory of *.yaml/*.yml specs. With a directory, generate merges all rules into one document, check reports across every service, and dashboard emits a JSON array of dashboards.

check exit codes: 0 if healthy, 1 if the --fail-on level was reached (breach by default; warning and never are the other levels), 2 on a runtime error. --output json prints the statuses as a JSON array for piping into other tools.
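The exit-code rule above can be sketched in a few lines of Rust. This is an illustration of the described behavior, not slokit's own code: the `Status` and `FailOn` enums and the `exit_code` function are hypothetical names, and the `Warning` status is an assumption inferred from the `--fail-on warning` option.

```rust
/// Health of one SLO, ordered from best to worst.
/// Hypothetical type; `Warning` is assumed from the `--fail-on warning` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Breach,
}

/// Level at which the check should fail (hypothetical mirror of `--fail-on`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailOn {
    Breach,
    Warning,
    Never,
}

/// Maps a check outcome to a process exit code:
/// 0 healthy, 1 the fail-on level was reached, 2 a runtime error.
pub fn exit_code(outcome: &Result<Vec<Status>, String>, fail_on: FailOn) -> i32 {
    let statuses = match outcome {
        Ok(statuses) => statuses,
        // Any runtime error (for example, Prometheus unreachable) maps to 2.
        Err(_) => return 2,
    };
    // The worst status across all SLOs decides the result;
    // an empty list is treated as healthy.
    let worst = statuses.iter().copied().max().unwrap_or(Status::Ok);
    let failed = match fail_on {
        FailOn::Breach => worst >= Status::Breach,
        FailOn::Warning => worst >= Status::Warning,
        FailOn::Never => false,
    };
    if failed {
        1
    } else {
        0
    }
}
```

Under these assumptions, a run with one breaching SLO returns 1 by default, a run with only warnings returns 1 only under `FailOn::Warning`, and `FailOn::Never` returns 0 unless the run itself errors.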

dashboard emits Grafana dashboard JSON with a block per SLO (error budget remaining, current burn rate, objective, and the SLI error ratio over time), querying the same slo:... metrics the generator produces. It declares a datasource template variable, so it imports into any Grafana with a Prometheus data source.

check evaluates each SLO's SLI directly against Prometheus (no deployed recording rules required) and prints a status table:

service 'myservice' against http://localhost:9090 (current window 1h)

STATUS  SLO                               CONSUMED  REMAINING      BURN
OK      requests-availability               12.30%     87.70%     0.50x
BREACH  requests-latency                   120.00%    -20.00%    15.00x

calc output:

Objective:    99.9% over 30d
Error budget: 0.1000% of events
Total events: 1000000
Allowed bad:  1000.00
Observed bad: 250
Burn rate:    0.25x
Consumed:     25.0000%
Remaining:    75.0000%
Exhausted in: 89d 23h

Burn-rate alert thresholds (error ratio that fires each window):
  page   long=1h   short=5m   factor=14.4  threshold=1.4400%
  page   long=6h   short=30m  factor=6     threshold=0.6000%
  ticket long=1d   short=2h   factor=3     threshold=0.3000%
  ticket long=3d   short=6h   factor=1     threshold=0.1000%
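The arithmetic behind the calc output can be reproduced with plain floating-point math. The sketch below is standard-library Rust and does not use slokit's API; `BudgetReport` and `budget_report` are hypothetical names, and it assumes the event counts cover the whole SLO period, which is why burn rate and consumed budget come out equal.

```rust
/// Error-budget figures derived from raw event counts (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct BudgetReport {
    /// Fraction of events that may fail, e.g. 0.001 for a 99.9% objective.
    pub budget_ratio: f64,
    /// Number of bad events the budget allows for `total` events.
    pub allowed_bad: f64,
    /// Observed error ratio divided by the budget ratio.
    pub burn_rate: f64,
    /// Fraction of the budget already spent.
    pub consumed: f64,
    /// Fraction of the budget left; negative once the budget is overspent.
    pub remaining: f64,
}

/// Computes the budget figures, or `None` for inputs the math cannot handle
/// (an objective outside (0, 100), no events, or a negative bad count).
pub fn budget_report(objective_percent: f64, total: f64, bad: f64) -> Option<BudgetReport> {
    let objective_ok = objective_percent > 0.0 && objective_percent < 100.0;
    if !objective_ok || total <= 0.0 || bad < 0.0 {
        return None;
    }
    let budget_ratio = (100.0 - objective_percent) / 100.0;
    let allowed_bad = total * budget_ratio;
    let burn_rate = (bad / total) / budget_ratio;
    let consumed = bad / allowed_bad;
    Some(BudgetReport {
        budget_ratio,
        allowed_bad,
        burn_rate,
        consumed,
        remaining: 1.0 - consumed,
    })
}
```

Calling `budget_report(99.9, 1_000_000.0, 250.0)` gives roughly 1,000 allowed bad events, a 0.25x burn rate, and 25% consumed, matching the figures shown above up to floating-point rounding. A 100% objective returns `None` because it leaves no budget to divide by.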

Spec format

slokit reads the sloth prometheus/v1 spec, plus slokit extensions: an optional per-SLO period (sloth only offers this as a global flag) and a latency SLI (see below).

version: "prometheus/v1"
service: myservice
labels:
  owner: team-platform
slos:
  - name: requests-availability
    objective: 99.9
    period: 30d            # slokit extension; defaults to 30d
    sli:
      events:
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: MyServiceHighErrorRate
      page_alert:
        labels: { severity: page }
      ticket_alert:
        labels: { severity: ticket }

Each SLO has exactly one of three SLI shapes:

  • events (error_query / total_query): bad events over total events.

  • raw (error_ratio_query): a query that already yields an error ratio.

  • latency (slokit extension): the fraction of requests slower than a histogram bucket threshold. slokit generates the bucket math so you do not hand-write it:

    sli:
      latency:
        histogram_metric: http_request_duration_seconds  # base name, no _bucket/_count suffix
        threshold: "0.3"                                  # the `le` bucket boundary
        selector: job="myservice"                         # optional label matchers, no braces
    

    This generates, at every window:

    1 - (
      sum(rate(http_request_duration_seconds_bucket{job="myservice", le="0.3"}[{{.window}}]))
      /
      sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    )
    

The events and raw query strings must contain the {{.window}} template token; latency is generated and needs none.
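To make the template mechanics concrete, here is a minimal sketch of how a latency expression of the shape shown above could be assembled and how the {{.window}} token could be filled in per window. It illustrates the idea only and is not slokit's generator: `latency_error_ratio`, `has_window_token`, and `render_window` are hypothetical helpers, and plain string replacement is an assumed substitution strategy.

```rust
/// Placeholder that sloth-style queries use for the rate window.
const WINDOW_TOKEN: &str = "{{.window}}";

/// Builds an error-ratio expression for a latency SLI: one minus the share
/// of requests that fell into the `le=threshold` histogram bucket.
/// `selector` holds optional label matchers without braces.
pub fn latency_error_ratio(metric: &str, threshold: &str, selector: Option<&str>) -> String {
    // The `_bucket` series always carries the `le` boundary matcher.
    let bucket = match selector {
        Some(sel) => format!("{{{}, le=\"{}\"}}", sel, threshold),
        None => format!("{{le=\"{}\"}}", threshold),
    };
    // The `_count` series only gets braces when there is a selector.
    let count = match selector {
        Some(sel) => format!("{{{}}}", sel),
        None => String::new(),
    };
    format!(
        "1 - (\n  sum(rate({m}_bucket{b}[{w}]))\n  /\n  sum(rate({m}_count{c}[{w}]))\n)",
        m = metric,
        b = bucket,
        c = count,
        w = WINDOW_TOKEN,
    )
}

/// True if a hand-written query contains the window token.
pub fn has_window_token(query: &str) -> bool {
    query.contains(WINDOW_TOKEN)
}

/// Fills in the window token with a concrete window such as "5m" or "1h".
pub fn render_window(query: &str, window: &str) -> String {
    query.replace(WINDOW_TOKEN, window)
}
```

For example, `render_window(&latency_error_ratio("http_request_duration_seconds", "0.3", Some("job=\"myservice\"")), "5m")` yields the expression shown above with `[5m]` in both rate windows. A check like `has_window_token` is one way a validator could reject events or raw queries that omit the token.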

Library

The core has no serialization or CLI dependencies:

use slokit::{Objective, Slo, BurnRate, Window};

let slo = Slo::new(Objective::percent(99.9).unwrap(), Window::days(30));

// With a million events, 0.1% may fail: ~1,000 allowed failures.
let budget = slo.error_budget(1_000_000.0);
assert!((budget.allowed_bad_events() - 1_000.0).abs() < 1e-6);

// A sustained 1% error rate is a 10x burn against a 99.9% objective.
let burn = BurnRate::from_error_ratio(0.01, &slo);
assert!((burn.value() - 10.0).abs() < 1e-9);

Generation lives behind the default spec feature:

use slokit::generate::generate_rules;
use slokit::spec::Spec;

fn main() -> Result<(), slokit::SlokitError> {
    // Parse the sloth-compatible spec, generate the rules, and print them as Prometheus YAML.
    let spec = Spec::from_path("slos.yaml")?;
    let ruleset = generate_rules(&spec)?;
    println!("{}", ruleset.to_prometheus_yaml()?);
    Ok(())
}

Feature flags

Feature    Default  Pulls in                              Enables
cli        yes      clap, anyhow, spec, check, dashboard  the slokit binary
spec       yes      serde, serde_norway                   spec parsing and rule generation
check      yes      reqwest, serde_json                   live Prometheus querying (PrometheusClient, check_spec)
dashboard  yes      serde_json                            Grafana dashboard generation (dashboard_json)

For the lean math-only core: slokit = { version = "0.6", default-features = false }.

The MWMBR model

slokit implements the burn-rate alerting from the Google SRE Workbook. For a 30-day SLO period:

Severity  Long window  Short window  Burn rate  Budget consumed
Page      1h           5m            14.4       2%
Page      6h           30m           6          5%
Ticket    1d           2h            3          10%
Ticket    3d           6h            1          10%
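
The burn rates in this table follow from the budget each long window is allowed to consume: factor = budget consumed x period / long window. The sketch below works through that arithmetic and the Workbook's two-window firing condition in plain Rust. The function names are hypothetical and this is not slokit's API; it only assumes that an alert fires when both the long and the short window exceed the same threshold, as the Workbook describes.

```rust
/// Burn-rate factor that spends `budget_consumed` (a fraction such as 0.02)
/// of the error budget within `long_window_hours` of a `period_hours` period.
pub fn burn_factor(budget_consumed: f64, period_hours: f64, long_window_hours: f64) -> f64 {
    budget_consumed * period_hours / long_window_hours
}

/// Error ratio above which a window with the given factor is burning too fast.
pub fn threshold(objective_percent: f64, factor: f64) -> f64 {
    factor * (100.0 - objective_percent) / 100.0
}

/// Multi-window condition: the long window shows the burn is significant and
/// the short window shows it is still happening, so both must exceed the threshold.
pub fn fires(
    objective_percent: f64,
    factor: f64,
    long_error_ratio: f64,
    short_error_ratio: f64,
) -> bool {
    let limit = threshold(objective_percent, factor);
    long_error_ratio > limit && short_error_ratio > limit
}
```

With a 30-day period (720 hours), `burn_factor(0.02, 720.0, 1.0)` is about 14.4 and `burn_factor(0.05, 720.0, 6.0)` is about 6, matching the page rows. For a 99.9% objective, `threshold(99.9, 14.4)` is about 0.0144, the 1.44% error ratio at which the fastest page window trips.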

License

Licensed under either of Apache-2.0 or MIT at your option.