slokit
An SLO and error-budget engine for Rust.
slokit does two things the existing tools (all Go or Python) do not do together:
- A library core with no serde, YAML, or CLI dependencies, so error-budget and burn-rate math embeds directly inside your services (for example, an Axum handler that reports live budget status).
- A generator, shipped as a single static binary, that reads a sloth-compatible YAML spec and emits Prometheus recording rules, metadata rules, and multi-window multi-burn-rate (MWMBR) page/ticket alerts.
It is drop-in compatible with the sloth prometheus/v1 spec, so existing
specs work unchanged, and the generated metrics use the same slo:... names and
sloth_* labels, so your Grafana dashboards keep working.
Install
CLI
# Generate Prometheus rules from a spec
# Generate a Prometheus Operator PrometheusRule instead
# Validate a spec without generating
# Lint a spec for advisory issues (100% objective, period shorter than the
# burn-rate windows, alerts missing routing labels, ...). --strict fails CI.
# Do the error-budget math from the terminal
# Check a live Prometheus and report current budget/burn (exits 1 if any SLO breaches)
# Check machine-readably, failing the build on warnings too
# Generate a Grafana dashboard (JSON) from a spec
Every command's -i accepts a single spec file or a directory of
*.yaml/*.yml specs. With a directory, generate merges all rules into one
document, check reports across every service, and dashboard emits a JSON
array of dashboards.
check exits with 0 when every SLO is healthy, 1 when the --fail-on level was
reached (breach by default; warning and never are the other levels), and 2 on
a runtime error. --output json prints the statuses as a JSON array for piping
into other tools.
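The exit-code rule above can be sketched as a small pure function. This is an illustration only: the `Status` and `FailOn` enums and the `exit_code` function are hypothetical names, not slokit's API, and the sketch assumes a three-level status (OK, warning, breach) implied by the --fail-on levels.

```rust
/// Health of one SLO. Hypothetical type; variants are declared from
/// least to most severe so the derived `Ord` ranks them correctly.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Breach,
}

/// The level at which the process should fail (assumed to mirror --fail-on).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailOn {
    Breach,
    Warning,
    Never,
}

/// Returns 0 when healthy and 1 when the worst status reaches `fail_on`.
/// Exit code 2 (runtime error) would be produced elsewhere, for example
/// when the Prometheus query itself fails.
pub fn exit_code(statuses: &[Status], fail_on: FailOn) -> i32 {
    // An empty list is treated as healthy.
    let worst = statuses.iter().copied().max().unwrap_or(Status::Ok);
    let failed = match fail_on {
        FailOn::Never => false,
        FailOn::Warning => worst >= Status::Warning,
        FailOn::Breach => worst >= Status::Breach,
    };
    if failed {
        1
    } else {
        0
    }
}
```

With this rule, a single warning leaves the default run green but fails a run configured to fail on warnings.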
dashboard emits Grafana dashboard JSON with a block per SLO (error budget
remaining, current burn rate, objective, and the SLI error ratio over time),
querying the same slo:... metrics the generator produces. It declares a
datasource template variable, so it imports into any Grafana with a Prometheus
data source.
check evaluates each SLO's SLI directly against Prometheus (no deployed
recording rules required) and prints a status table:
    service 'myservice' against http://localhost:9090 (current window 1h)
    STATUS  SLO                    CONSUMED  REMAINING  BURN
    OK      requests-availability  12.30%    87.70%     0.50x
    BREACH  requests-latency       120.00%   -20.00%    15.00x
calc output:
Objective: 99.9% over 30d
Error budget: 0.1000% of events
Total events: 1000000
Allowed bad: 1000.00
Observed bad: 250
Burn rate: 0.25x
Consumed: 25.0000%
Remaining: 75.0000%
Exhausted in: 89d 23h
Burn-rate alert thresholds (error ratio that fires each window):
page long=1h short=5m factor=14.4 threshold=1.4400%
page long=6h short=30m factor=6 threshold=0.6000%
ticket long=1d short=2h factor=3 threshold=0.3000%
ticket long=3d short=6h factor=1 threshold=0.1000%
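The arithmetic behind the calc output can be reproduced in a few lines. The sketch below is not slokit's API: `BudgetReport`, `budget_report`, and `alert_threshold` are hypothetical names, and it assumes the event counts cover the whole SLO period, which is why burn rate and consumed budget come out equal.

```rust
/// Result of the error-budget arithmetic for one SLO.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BudgetReport {
    /// Fraction of events allowed to fail (0.001 for a 99.9% objective).
    pub budget_ratio: f64,
    /// Number of bad events the budget allows.
    pub allowed_bad: f64,
    /// Observed error ratio divided by the budget ratio.
    pub burn_rate: f64,
    /// Fraction of the budget already spent.
    pub consumed: f64,
    /// Fraction of the budget left (negative once the budget is overspent).
    pub remaining: f64,
}

/// Computes the report, or `None` for inputs with no meaningful budget
/// (objective outside (0, 100), or no events).
pub fn budget_report(
    objective_percent: f64,
    total_events: f64,
    bad_events: f64,
) -> Option<BudgetReport> {
    if !(objective_percent > 0.0 && objective_percent < 100.0) || total_events <= 0.0 {
        return None;
    }
    let budget_ratio = 1.0 - objective_percent / 100.0;
    let allowed_bad = total_events * budget_ratio;
    let error_ratio = bad_events / total_events;
    let burn_rate = error_ratio / budget_ratio;
    let consumed = bad_events / allowed_bad;
    Some(BudgetReport {
        budget_ratio,
        allowed_bad,
        burn_rate,
        consumed,
        remaining: 1.0 - consumed,
    })
}

/// Error ratio at which a burn-rate alert with the given factor fires.
pub fn alert_threshold(objective_percent: f64, factor: f64) -> f64 {
    factor * (1.0 - objective_percent / 100.0)
}
```

Calling `budget_report(99.9, 1_000_000.0, 250.0)` yields roughly 1000 allowed bad events, a 0.25x burn rate, and 75% of the budget remaining, and `alert_threshold(99.9, 14.4)` is about 0.0144, matching the 1.4400% page threshold in the listing.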
Spec format
slokit reads the sloth prometheus/v1 spec, plus slokit extensions: an
optional per-SLO period (sloth only offers this as a global flag) and a
latency SLI (see below).
    version: "prometheus/v1"
    service: myservice
    labels:
      owner: team-platform
    slos:
      - name: requests-availability
        objective: 99.9
        period: 30d # slokit extension; defaults to 30d
        sli:
          events:
            error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
            total_query: sum(rate(http_requests_total[{{.window}}]))
        alerting:
          name: MyServiceHighErrorRate
          page_alert:
            labels:
          ticket_alert:
            labels:
Each SLO has exactly one of three SLI shapes:
- events (error_query/total_query): bad events over total events.
- raw (error_ratio_query): a query that already yields an error ratio.
- latency (slokit extension): the fraction of requests slower than a histogram bucket threshold. slokit generates the bucket math so you do not hand-write it:

      sli:
        latency:
          histogram_metric: http_request_duration_seconds # base name, no _bucket/_count suffix
          threshold: "0.3" # the `le` bucket boundary
          selector: job="myservice" # optional label matchers, no braces

  This generates, at every window:

      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="myservice", le="0.3"}[{{.window}}]))
        /
        sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
      )

The events and raw query strings must contain the {{.window}} template
token; latency is generated and needs none.
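To make the latency expansion and the {{.window}} token concrete, here is a minimal sketch of how such a query could be built and then rendered for one window. The function names `latency_error_ratio_query` and `render_window` are hypothetical, and slokit's real generator may format the expression differently.

```rust
/// Builds the error-ratio expression for a latency SLI: one minus the
/// share of requests that landed in the `le = threshold` bucket.
/// The window is left as the `{{.window}}` template token.
pub fn latency_error_ratio_query(
    histogram_metric: &str,
    threshold: &str,
    selector: Option<&str>,
) -> String {
    // The bucket series needs the `le` matcher in addition to any
    // user-supplied matchers; the count series takes only the latter.
    let (bucket_matchers, count_matchers) = match selector {
        Some(s) if !s.trim().is_empty() => {
            (format!("{}, le=\"{}\"", s, threshold), s.to_string())
        }
        _ => (format!("le=\"{}\"", threshold), String::new()),
    };
    // In a format string `{{` and `}}` are literal braces, so
    // `{{{{.window}}}}` renders as `{{.window}}`.
    format!(
        "1 - ( sum(rate({m}_bucket{{{b}}}[{{{{.window}}}}])) / sum(rate({m}_count{{{c}}}[{{{{.window}}}}])) )",
        m = histogram_metric,
        b = bucket_matchers,
        c = count_matchers
    )
}

/// Substitutes a concrete window such as "5m" for every template token.
pub fn render_window(query: &str, window: &str) -> String {
    query.replace("{{.window}}", window)
}
```

Rendering the generated template with `render_window(&query, "5m")` gives the expression evaluated for the 5-minute window; the same substitution applies to hand-written events and raw queries.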
Library
The core has no serialization or CLI dependencies:
use ;
let slo = new;
// With a million events, 0.1% may fail: ~1,000 allowed failures.
let budget = slo.error_budget;
assert!;
// A sustained 1% error rate is a 10x burn against a 99.9% objective.
let burn = from_error_ratio;
assert!;
Generation lives behind the default spec feature:
use Spec;
use generate_rules;
let spec = from_path?;
let ruleset = generate_rules?;
println!;
Feature flags
| Feature | Default | Pulls in | Enables |
|---|---|---|---|
| cli | yes | clap, anyhow, spec, check, dashboard | the slokit binary |
| spec | yes | serde, serde_norway | spec parsing and rule generation |
| check | yes | reqwest, serde_json | live Prometheus querying (PrometheusClient, check_spec) |
| dashboard | yes | serde_json | Grafana dashboard generation (dashboard_json) |
For the lean math-only core: slokit = { version = "0.1", default-features = false }.
The MWMBR model
slokit implements the burn-rate alerting from the Google SRE Workbook. For a
30-day SLO period:
| Severity | Long window | Short window | Burn rate | Budget consumed |
|---|---|---|---|---|
| Page | 1h | 5m | 14.4 | 2% |
| Page | 6h | 30m | 6 | 5% |
| Ticket | 1d | 2h | 3 | 10% |
| Ticket | 3d | 6h | 1 | 10% |
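The "Budget consumed" column follows from the other columns: a burn rate sustained over the long window spends burn rate × long window / period of the budget. A minimal sketch of that relation, using hypothetical helper names that are not part of slokit:

```rust
/// Fraction of the error budget spent if `burn_rate` is sustained for
/// the whole long window. Both durations must use the same unit (hours here).
pub fn budget_consumed(burn_rate: f64, long_window_hours: f64, period_hours: f64) -> f64 {
    burn_rate * long_window_hours / period_hours
}

/// Inverse relation: the burn rate that spends `budget_fraction` of the
/// budget within the long window.
pub fn burn_rate_for(budget_fraction: f64, long_window_hours: f64, period_hours: f64) -> f64 {
    budget_fraction * period_hours / long_window_hours
}
```

For a 30-day period (720 hours), `budget_consumed(14.4, 1.0, 720.0)` is about 0.02, which is the 2% in the first row, and `budget_consumed(1.0, 72.0, 720.0)` is about 0.10, the 10% in the last row.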
License
Licensed under either of Apache-2.0 or MIT at your option.