<h1 align="center">
<img width="99" alt="Rust logo" src="https://raw.githubusercontent.com/jamesgober/rust-collection/72baabd71f00e14aa9184efcb16fa3deddda3a0a/assets/rust-logo.svg">
<br><b>metrics-lib</b><br>
<sub><sup>API REFERENCE</sup></sub>
</h1>
<div align="center">
<sup>
<a href="../README.md" title="Project Home"><b>HOME</b></a>
<span> │ </span>
<a href="./README.md" title="Documentation"><b>DOCS</b></a>
<span> │ </span>
<span>API</span>
<span> │ </span>
<a href="./GUIDELINES.md" title="Developer Guidelines"><b>GUIDELINES</b></a>
</sup>
</div>
<br>
<h4 id="example-pointers">Example Pointers</h4>
- Quick Tour: `examples/quick_tour.rs` — counter/gauge/timer/ratemeter/system health in one file.
- Async Batch + Timing: `examples/async_batch_timing.rs` — `AsyncTimerExt::time_async` and `AsyncMetricBatch`.
- Token Bucket Limiter: `examples/token_bucket_limiter.rs` — admission control with `RateMeter::tick_if_under_limit`.
- Custom Exporter (OpenMetrics-like): `examples/custom_exporter_openmetrics.rs` — text snapshot.
- Axum Middleware (minimal): `examples/axum_middleware_metrics.rs` — per-request metrics + lightweight endpoint.
- Contention & Admission: `examples/contention_admission.rs` — multi-threaded admission under target rate.
- Health Dashboard: `examples/health_dashboard.rs` — periodic snapshot of CPU/mem/load/threads/FDs/health.
- Cache Hit/Miss: `examples/cache_hit_miss.rs` — counters for hits/misses, ratio, and lookup latency.
- Broker Throughput: `examples/broker_throughput.rs` — producer/consumer RPS via `RateMeter`.
- CPU Stats Overview: `examples/cpu_stats.rs` — system CPU/load and process CPU sampling windows.
- Memory Stats Overview: `examples/memory_stats.rs` — total/used/free MB/GB and percentages (unit auto-detect).
- Axum Registry Integration: `examples/axum_registry_integration.rs` — minimal web service wiring.
- Streaming Rate Window: `examples/streaming_rate_window.rs` — periodic rate sampling demo.
- Benchmark Comparison: `examples/benchmark_comparison.rs` — microbench comparison runner.
- Quick Start: `examples/quick_start.rs` — shortest end-to-end usage.
<br>
Note: To run many non-blocking examples quickly in sequence, use the helper script:
```bash
bash tools/run_examples.sh
```
You can pass a custom comma-separated list via `EXAMPLES`, e.g.:
```bash
EXAMPLES="quick_start,quick_tour,cpu_stats" bash tools/run_examples.sh
```
## Table of Contents
- **[Installation](#installation)**
- **[Examples](#examples)**
- **[Quick Start](#quick-start)**
- **[Public APIs](#public-apis)**
- **[API Safety](#api-safety)**
- [Global initialization](#global-initialization)
- [`MetricsCore`](#metricscore)
- [`Registry`](#registry)
- [`Counter`](#counter)
- [`Gauge`](#gauge)
- [`Timer`](#timer)
- [`RateMeter`](#ratemeter)
- [`Histogram` (v0.9.3)](#histogram)
- [`LabelSet` & Labels (v0.9.3)](#labels)
- [Metric metadata (v0.9.3)](#metric-metadata)
- [`SystemHealth`](#systemhealth)
- [`HealthConfig` & `Step` (v0.9.5)](#health-config)
- [`ScopedRegistry` (v0.9.5)](#scoped-registry)
- [`TokenBucket` (v0.9.5)](#token-bucket)
- [Exporters (v0.9.3)](#exporters)
- [Async support](#async-support)
- [Adaptive controls](#adaptive-controls)
- [`tracing` integration (v0.9.5)](#tracing-ext)
- [Prelude](#prelude)
- **[Deployment Patterns](#deployment-patterns)**
- [Initialization Patterns](#1-initialization-patterns)
- [High-Volume Strategies](#2-high-volume-strategies)
- [Memory Management](#3-memory-management)
- [Multi-Service Patterns](#4-multi-service-patterns)
- [Export and Ingestion](#5-export-and-ingestion)
- [On-Call Diagnostics](#6-on-call-diagnostics)
- [Feature Gating Strategies](#7-feature-gating-strategies)
- **[Real-World Examples](#real-world-examples)**
- [High-Frequency Trading (HFT)](#real-world-high-frequency-trading)
- [Web Service Under Load](#real-world-web-service-under-load)
- [Batch Processing Pipeline](#real-world-batch-processing-pipeline)
- [Token Bucket Rate Limiter](#real-world-token-bucket-rate-limiter)
- [Building a Custom Exporter](#real-world-custom-exporter)
- [Memory Stats: total/used/free + percentages](#real-world-memory-stats)
- [Memory % used for an operation (estimate)](#real-world-memory-percent-operation)
- [CPU Stats: total/used/free + percentages](#real-world-cpu-stats)
- [CPU % used for an operation (estimate)](#real-world-cpu-percent-operation)
- **[Integration Examples](#integration-examples)**
- [1. Web Framework Integration](#web-framework-integration)
- [2. Database Pool Monitoring](#database-pool-monitoring)
- [3. Background Job Processing](#background-job-processing)
- [4. Observability Stack Integration](#observability-stack-integration)
- [5. Correlation with Tracing](#correlation-with-tracing)
- [6. Grafana Dashboard Setup](#grafana-dashboard-setup)
- [7. Message Brokers (Kafka/NATS) Throughput and Lag](#message-brokers-throughput)
- [8. Caches (Redis) Hit/Miss, Pool Metrics, TTL Health](#caches-hit-miss-pool-metrics)
- [9. Serverless (AWS Lambda) Cold-Start and Duration](#serverless-cold-start-and-duration)
- [10. Kubernetes Scraping & Pod-level Dashboards](#kubernetes-scraping)
- [11. OpenTelemetry Export Bridge (example skeleton)](#open-telemetry-export)
- [Example Pointers](#example-pointers)
- [12. NATS-Specific Queue Depth and Consumers](#nats-specific-queue)
- [13. Redis Latency Histogram and Dashboard Queries](#redis-latency-histogram)
- [14. AWS Lambda EMF (Embedded Metric Format) Emission](#aws-lambda-emf)
- [15. Kubernetes Helm Values (Prometheus Scrape Annotations)](#kubernetes-helm-values)
- [16. Full OTLP Exporter Skeleton (tonic)](#otlp-exporter)
- [17. Grafana Panels (Ready-to-Copy JSON)](#grafana-panels)
- [18. Prometheus Operator ServiceMonitor](#prometheus-operator-servicemonitor)
- [19. Full Grafana Dashboard (Ready-to-Import JSON)](#full-grafana-dashboard)
- [20. Prometheus Recording Rules (Latency and Rates)](#prometheus-recording-rules)
- [21. Prometheus Operator ServiceMonitor (Secured Endpoint)](#prometheus-operator-servicemonitor)
- [22. Helm Snippets (kube-prometheus-stack and App Chart)](#helm-snippets)
- **[Notes](#notes)**
<br><br>
## Installation
### Default Installation
#### Install Manually
Add this to your `Cargo.toml`:
```toml
[dependencies]
metrics-lib = "0.9.5"
```
<br>
#### Install via Terminal
```bash
# Basic installation
cargo add metrics-lib
```
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Error handling and panic guarantees
All core metric types provide non-panicking `try_` variants that return `Result<_, MetricsError>` with explicit validation and overflow checks. Prefer these when inputs may be untrusted or when you want to handle errors explicitly.
- `Counter`: `try_inc`, `try_add`, `try_set`, `try_fetch_add`, `try_inc_and_get` — return `MetricsError::Overflow` on arithmetic overflow.
- `Gauge`: `try_set`, `try_add`, `try_sub`, `try_set_max`, `try_set_min` — return `MetricsError::InvalidValue { reason }` for non-finite values and `MetricsError::Overflow` if math overflows.
- `Timer`: `try_record_ns`, `try_record`, `try_record_batch` — overflow-checked on internal counters.
- `RateMeter`: `try_tick`, `try_tick_n`, `try_tick_if_under_limit` — overflow-checked. `try_tick_if_under_limit` returns `Ok(true)` when the event is admitted and `Ok(false)` when it is not; strict policies may instead return `MetricsError::OverLimit`.
Panic guidelines:
- The non-`try_` methods prioritize ultra-low latency and assume valid inputs. They generally do not panic, but they skip validation: arithmetic may wrap or saturate on overflow, and values are accepted as given.
- Use `try_` methods for correctness-critical paths, external inputs, or when building safety-critical systems.
Example:
```rust
use metrics_lib::{init, metrics, MetricsError};

// `?` needs a function that returns `Result`, so the example lives in `main`.
fn main() -> Result<(), MetricsError> {
    init();
    let g = metrics().gauge("cpu_pct");
    g.try_set(87.3)?; // Result<(), MetricsError>
    let r = metrics().rate("api");
    let ok = r.try_tick_if_under_limit(1000.0)?; // Result<bool, MetricsError>
    if ok { /* proceed */ }
    Ok(())
}
```
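To make the difference between checked and unchecked paths concrete, here is a minimal std-only sketch of the kind of validation a `try_` method performs. `SketchError`, `try_add`, and `validate_gauge_value` are hypothetical names for illustration; this is not the crate's implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical error type for this sketch (stands in for `MetricsError`).
#[derive(Debug, PartialEq)]
pub enum SketchError {
    Overflow,
    InvalidValue,
}

/// Checked add: returns the new value, or `Overflow` without modifying the counter.
pub fn try_add(counter: &AtomicU64, amount: u64) -> Result<u64, SketchError> {
    counter
        // The closure returns `None` on overflow, which aborts the update.
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| cur.checked_add(amount))
        .map(|prev| prev + amount)
        .map_err(|_| SketchError::Overflow)
}

/// Validation a gauge-style checked setter might perform: reject NaN and infinities.
pub fn validate_gauge_value(v: f64) -> Result<f64, SketchError> {
    if v.is_finite() {
        Ok(v)
    } else {
        Err(SketchError::InvalidValue)
    }
}
```

On overflow the stored value is left untouched, which is what lets a caller decide how to react instead of silently observing a wrapped or saturated total.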
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Examples
Run these self-contained examples to see the library in action:
- Quick Start
- File: `examples/quick_start.rs`
- Run:
```bash
cargo run --example quick_start --release
```
- Streaming Rate Window
- File: `examples/streaming_rate_window.rs`
- Run:
```bash
cargo run --example streaming_rate_window --release
```
- Axum Registry Integration (minimal web service)
- File: `examples/axum_registry_integration.rs`
- Run:
```bash
cargo run --example axum_registry_integration --release
```
- Endpoints:
- `GET /health` — liveness probe
- `GET /metrics-demo` — updates metrics (counter/gauge/timer/rate)
- `GET /export` — returns a JSON snapshot of selected metrics
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Quick Start
```rust
use metrics_lib::{init, metrics};
fn main() {
    // Initialize once at startup
    init();

    // Counter (ultra-fast)
    metrics().counter("requests").inc();

    // Gauge (atomic f64)
    metrics().gauge("cpu_usage_pct").set(87.3);

    // Timer (nanosecond precision)
    let t = metrics().timer("db_query").start();
    // ... do work ...
    t.stop();

    // Or time a closure and return its result
    let user = metrics().time("fetch_user", || {
        // ... expensive work ...
        42
    });
    assert_eq!(user, 42);
}
```
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Public APIs
### Global initialization
- `init() -> &'static MetricsCore`
- Initializes the global metrics singleton (`METRICS`). Safe to call multiple times; first call wins.
- `metrics() -> &'static MetricsCore`
- Returns the global `MetricsCore`. Panics if `init()` has not been called.
- `static METRICS: OnceLock<MetricsCore>`
- Exposed for advanced embeddings. Prefer `init()`/`metrics()` for normal use.
Example:
```rust
use metrics_lib::{init, metrics};
fn startup() {
init();
metrics().counter("boot").inc();
}
```
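The "first call wins" behaviour described above is what `std::sync::OnceLock` provides. The following is a small sketch of that pattern with a stand-in type; `SketchCore`, `init_sketch`, and `core_sketch` are hypothetical names, not the crate's API.

```rust
use std::sync::OnceLock;

/// Stand-in for the real core type; only here to make the sketch compile.
pub struct SketchCore {
    pub id: u32,
}

static CORE: OnceLock<SketchCore> = OnceLock::new();

/// Idempotent: the first caller constructs the value, later callers get the same reference.
pub fn init_sketch() -> &'static SketchCore {
    CORE.get_or_init(|| SketchCore { id: 1 })
}

/// Accessor that panics when `init_sketch` was never called.
pub fn core_sketch() -> &'static SketchCore {
    CORE.get().expect("call init_sketch() first")
}
```

Because `get_or_init` runs its closure at most once, calling the initializer from several threads or several modules is safe and always yields the same instance.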
<br>
### `MetricsCore`
Source: `src/lib.rs` (`MetricsCore`)
- `MetricsCore::new() -> Self`
- `counter(name: &str) -> Arc<Counter>`
- `gauge(name: &str) -> Arc<Gauge>`
- `timer(name: &str) -> Arc<Timer>`
- `rate(name: &str) -> Arc<RateMeter>`
- `time<T>(name: &str, f: impl FnOnce() -> T) -> T`
- `system() -> &SystemHealth`
- `registry() -> &Registry`
**v0.9.2 note:** `name` is now `&str` (was `&'static str`). String literals
still work unchanged; runtime-derived names (per-route, per-tenant, etc.) work
without `Box::leak`. Repeated lookups of the same name return the same `Arc`
and perform no allocation on the hot path; the first registration allocates
a `String` key inside the registry.
Patterns:
```rust
use metrics_lib::{init, metrics};
init();
// Static name (compile-time string literal).
let c = metrics().counter("requests");
c.inc();
c.add(5);
let g = metrics().gauge("temp_c");
g.set(21.5);
// Runtime-derived name (was previously `Box::leak`'d).
let tenant_id = "acme";
let key = format!("requests.tenant.{tenant_id}");
metrics().counter(&key).inc();
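// Measure work: `time` runs the closure and returns its result.
let answer = metrics().time("compute", || 42);
assert_eq!(answer, 42);
```
<br>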
### `Counter`
Source: `src/counter.rs`
Structs:
- `Counter` (cache-line aligned)
- `CounterStats { value: u64, age: Duration, rate_per_second: f64, total: u64 }`
Core methods (ultra-fast, lock-free):
- `Counter::new()`, `Counter::with_value(initial: u64)`
- `inc()`, `add(amount: u64)`
- `get() -> u64`, `is_zero() -> bool`, `age() -> Duration`, `rate_per_second() -> f64`
- `reset()`, `set(value: u64)`, `compare_and_swap(expected, new) -> Result<u64,u64>`
- `fetch_add(amount) -> u64`, `add_and_get(amount) -> u64`, `inc_and_get() -> u64`
- **v0.9.2:** `add_and_get` and `inc_and_get` now use `wrapping_add` and
will not panic on overflow in debug builds. The returned value wraps
modulo `2^64`, matching `AtomicU64::fetch_add` semantics. Use the
checked variants (`try_inc_and_get`, `try_fetch_add`) when an explicit
`MetricsError::Overflow` is required.
- `saturating_add(amount)`
- **v0.9.2:** internally uses `Relaxed compare_exchange_weak` (no more
`SeqCst`); observable behaviour unchanged.
- `batch_inc(count: usize)`, `inc_if(condition: bool)`, `inc_max(max_value: u64) -> bool`
- **v0.9.2:** `inc_max` switched to `Relaxed` CAS for the same
performance improvement.
- `stats() -> CounterStats`
Example:
```rust
use metrics_lib::{init, metrics};
init();
let c = metrics().counter("jobs_processed");
c.inc();
c.add(10);
// Rate since start
let rps = c.rate_per_second();
let s = c.stats();
println!("jobs={}, rps={:.1}", s.value, s.rate_per_second);
```
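The wrapping and bounded-increment semantics noted for v0.9.2 can be illustrated on a bare `AtomicU64`. This is a sketch under the assumption that the counter is a single atomic word; the free functions below are illustrative, not the crate's code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Add and return the new value; wraps modulo 2^64 like `AtomicU64::fetch_add`.
pub fn add_and_get(counter: &AtomicU64, amount: u64) -> u64 {
    counter.fetch_add(amount, Ordering::Relaxed).wrapping_add(amount)
}

/// Increment only while the value is below `max`; returns whether it incremented.
pub fn inc_max(counter: &AtomicU64, max: u64) -> bool {
    let mut cur = counter.load(Ordering::Relaxed);
    loop {
        if cur >= max {
            return false;
        }
        // `cur < max <= u64::MAX`, so `cur + 1` cannot overflow.
        match counter.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return true,
            // Lost a race or hit a spurious failure: retry with the fresh value.
            Err(seen) => cur = seen,
        }
    }
}
```

`Relaxed` is sufficient here because the counter value is the only shared state; no other memory is published through it.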
<br>
### `Gauge`
Source: `src/gauge.rs`
Structs:
- `Gauge` (atomic f64)
- `GaugeStats { value: f64, age: Duration, updates: Option<u64> }`
Common methods:
- `Gauge::new()`, `Gauge::with_value(initial: f64)`
- `set(v: f64)`, `get() -> f64`
- Arithmetic updates: `add(v)`, `sub(v)`
- Min/Max: `set_max(v)`, `set_min(v)`
- Math utilities: `multiply(factor)`, `divide(divisor)`, `abs()`, `clamp(min, max)`
- EMA: `update_ema(sample, alpha)`
- Introspection: `is_zero()`, `is_positive()`, `is_negative()`, `is_finite()`, `age()`
- CAS: `compare_and_swap(expected, new) -> Result<f64, f64>`
- Stats: `stats() -> GaugeStats`
Example:
```rust
use metrics_lib::{init, metrics};
init();
let cpu = metrics().gauge("cpu_pct");
cpu.set(12.0);
cpu.add(2.5);
println!("cpu now: {}%", cpu.get());
```
Specialized gauges (re-exported as `gauge_specialized`):
- `PercentageGauge`, `MemoryGauge`, etc. See `gauge::specialized` for details.
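For readers wondering how an "atomic f64" can work, one common technique is to store the float's bit pattern in an `AtomicU64` and apply read-modify-write operations through a compare-and-swap loop. The sketch below assumes that technique and the conventional EMA formula; `SketchGauge` is a hypothetical type, and the crate's internals may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of an f64 gauge stored as raw bits inside an `AtomicU64`.
pub struct SketchGauge(AtomicU64);

impl SketchGauge {
    pub fn new(initial: f64) -> Self {
        Self(AtomicU64::new(initial.to_bits()))
    }

    pub fn get(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }

    pub fn set(&self, v: f64) {
        self.0.store(v.to_bits(), Ordering::Relaxed);
    }

    /// Apply `f` atomically via a compare-and-swap retry loop.
    fn update(&self, f: impl Fn(f64) -> f64) {
        let mut cur = self.0.load(Ordering::Relaxed);
        loop {
            let next = f(f64::from_bits(cur)).to_bits();
            match self.0.compare_exchange_weak(cur, next, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return,
                Err(seen) => cur = seen,
            }
        }
    }

    pub fn add(&self, delta: f64) {
        self.update(|v| v + delta);
    }

    /// Conventional EMA (an assumption): `alpha * sample + (1 - alpha) * previous`.
    pub fn update_ema(&self, sample: f64, alpha: f64) {
        self.update(|v| alpha * sample + (1.0 - alpha) * v);
    }
}
```

Plain `set`/`get` are single atomic operations; only the arithmetic helpers need the retry loop.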
<br>
### `Timer`
Source: `src/timer.rs`
Concepts:
- `Timer`: records durations with nanosecond precision.
- `RunningTimer`: RAII guard from `start()`; call `stop()` to record.
Common methods:
- `Timer::new()`
- `start() -> RunningTimer`
- `record(duration: Duration)`
- `record_ns(ns: u64)` — fastest manual record path
- `record_batch(durations: &[Duration])`
- **v0.9.2:** batch totals are summed with `saturating_add` instead of
`+=`. Adversarial inputs that would have panicked in debug builds now
saturate at `u64::MAX` nanoseconds without panicking. The `try_record_batch`
checked variant continues to return `MetricsError::Overflow` instead.
- `count() -> u64`, `total() -> Duration`, `min() -> Duration`, `max() -> Duration`, `average() -> Duration`
- `stats() -> TimerStats { count, total, average, min, max, age, rate_per_second }`
- Helpers: macro/utility functions for timing blocks and functions (see source).
Example:
```rust
use metrics_lib::{init, metrics};
use std::time::Duration;
init();
let t = metrics().timer("encode");
{
let run = t.start();
// ... do work ...
run.stop();
}
// Manual recording
t.record(Duration::from_millis(3));
let s = t.stats();
println!("samples: {} avg: {:?}", s.count, s.average);
```
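The difference between the saturating and checked batch totals can be shown with plain `Duration` arithmetic. This is an illustrative sketch only; the function names are hypothetical, and `None` stands in for the `MetricsError::Overflow` result of the checked variant.

```rust
use std::time::Duration;

/// Clamp a duration to whole nanoseconds in a `u64` (values above `u64::MAX` ns clamp).
fn as_ns(d: &Duration) -> u64 {
    let ns = d.as_nanos();
    if ns > u64::MAX as u128 {
        u64::MAX
    } else {
        ns as u64
    }
}

/// Saturating batch total: never panics, pins at `u64::MAX` on overflow.
pub fn batch_total_saturating(durations: &[Duration]) -> u64 {
    durations.iter().fold(0u64, |acc, d| acc.saturating_add(as_ns(d)))
}

/// Checked batch total: `None` signals overflow.
pub fn batch_total_checked(durations: &[Duration]) -> Option<u64> {
    durations.iter().try_fold(0u64, |acc, d| acc.checked_add(as_ns(d)))
}
```

With well-formed inputs both functions agree; they only diverge once the sum no longer fits in 64 bits of nanoseconds (roughly 584 years).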
<br>
### `RateMeter`
Source: `src/rate_meter.rs`
Concepts:
- Tumbling-window rate calculations (events/sec, minute, hour)
- Optional lightweight rate-limiting helpers
Common methods:
- `RateMeter::new()`
- `tick()` — record an event
- `tick_n(n: u32)` — record multiple events
- `rate() -> f64` — recent events/second (alias: `rate_per_second()`)
- `rate_per_minute() -> f64`, `rate_per_hour() -> f64`
- `total() -> u64`, `reset()`
- `stats() -> RateStats { total_events, per_second, per_minute, per_hour, average_rate, age, window_fill }`
Example:
```rust
use metrics_lib::{init, metrics};
init();
let r = metrics().rate("api_calls");
for _ in 0..100 { r.tick(); }
println!("rate/sec: {:.1}", r.rate());
```
Specialized meters (re-exported as `rate_meter_specialized`):
- `ApiRateLimiter`, `ThroughputMeter`, etc. See `rate_meter::specialized`.
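As a rough illustration of what "tumbling window" means, the sketch below counts events in fixed one-second windows and reports the rate of the most recently completed one. It takes explicit millisecond timestamps so it can be tested deterministically; `TumblingWindow` is a hypothetical type, and the crate's window length and reporting rules are assumptions here.

```rust
/// Tumbling one-second window driven by caller-supplied timestamps (milliseconds).
pub struct TumblingWindow {
    window_start_ms: u64,
    current: u64,  // events in the open window
    previous: u64, // events in the last completed window
    total: u64,
}

impl TumblingWindow {
    pub const WINDOW_MS: u64 = 1_000;

    pub fn new(now_ms: u64) -> Self {
        Self { window_start_ms: now_ms, current: 0, previous: 0, total: 0 }
    }

    /// Close the open window if `now_ms` has moved past its end.
    fn roll(&mut self, now_ms: u64) {
        let elapsed = now_ms.saturating_sub(self.window_start_ms);
        if elapsed >= Self::WINDOW_MS {
            // If two or more windows passed, the last completed window was empty.
            self.previous = if elapsed < 2 * Self::WINDOW_MS { self.current } else { 0 };
            self.current = 0;
            self.window_start_ms += (elapsed / Self::WINDOW_MS) * Self::WINDOW_MS;
        }
    }

    pub fn tick_n(&mut self, now_ms: u64, n: u64) {
        self.roll(now_ms);
        self.current += n;
        self.total += n;
    }

    /// Events per second over the most recently completed window.
    pub fn rate(&mut self, now_ms: u64) -> f64 {
        self.roll(now_ms);
        self.previous as f64 / (Self::WINDOW_MS as f64 / 1_000.0)
    }

    pub fn total(&self) -> u64 {
        self.total
    }
}
```

Unlike a sliding window, a tumbling window resets at each boundary, so the reported rate changes in steps rather than continuously.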
<br>
<h3 id="histogram"><code>Histogram</code></h3>
Source: `src/histogram.rs` — requires the `histogram` Cargo feature.
A bucketed observation type compatible with Prometheus / OpenMetrics histogram
semantics. Each bucket counts observations with value `<= upper_bound`;
exports render the buckets in cumulative form. The implicit `+Inf` bucket
always equals the total observation count. `sum` and `count` are tracked
separately for `_sum` / `_count` companion series.
Construction:
- `Histogram::with_buckets(bounds: impl IntoIterator<Item = f64>)` — explicit upper bounds.
- `Histogram::linear(start, width, count)` — `start, start+width, …, start+(count-1)*width`.
- `Histogram::exponential(start, factor, count)` — `start, start*factor, …`.
- `Histogram::default_seconds()` — the Prometheus default latency-seconds buckets
(`[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]`).
- `DEFAULT_SECONDS_BUCKETS: &[f64]` — re-exported constant for the same buckets.
Observation:
- `observe(value: f64)` — non-finite values are silently dropped.
- `try_observe(value: f64) -> Result<()>` — returns
`Err(MetricsError::InvalidValue)` for NaN / ±∞.
Read accessors:
- `count() -> u64`, `sum() -> f64`, `mean() -> f64`, `age() -> Duration`.
- `quantile(q: f64) -> f64` — clamped to `0.0..=1.0`; bucket-interpolated
estimate, returns `0.0` on empty.
- `snapshot() -> HistogramSnapshot { buckets: Vec<HistogramBucket>, sum, count, age }` —
buckets rendered cumulatively, trailing `+Inf` bucket appended.
- `reset()` — clears all bucket counters and sum/count.
Example:
```rust
# #[cfg(feature = "histogram")]
# {
use metrics_lib::Histogram;
let h = Histogram::with_buckets([0.01, 0.05, 0.1, 0.5, 1.0]);
h.observe(0.005);
h.observe(0.08);
h.observe(0.42);
h.observe(2.0); // +Inf bucket
assert_eq!(h.count(), 4);
assert!(h.quantile(0.5) > 0.0);
let snap = h.snapshot();
assert_eq!(snap.buckets.last().unwrap().upper_bound, f64::INFINITY);
# }
```
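The cumulative bucket semantics are easy to get wrong, so here is a std-only sketch that reproduces them for the same observations as the example above. `SketchHistogram` is a hypothetical, non-atomic stand-in written for illustration; it is not the crate's type.

```rust
/// Per-bucket counts plus one overflow slot for values above the last bound.
pub struct SketchHistogram {
    bounds: Vec<f64>, // ascending upper bounds; `+Inf` is implied
    counts: Vec<u64>, // counts.len() == bounds.len() + 1
}

impl SketchHistogram {
    pub fn with_buckets(bounds: &[f64]) -> Self {
        Self { bounds: bounds.to_vec(), counts: vec![0; bounds.len() + 1] }
    }

    /// Count `value` in the first bucket whose upper bound is `>= value`; drop non-finite input.
    pub fn observe(&mut self, value: f64) {
        if !value.is_finite() {
            return;
        }
        let idx = self
            .bounds
            .iter()
            .position(|b| value <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    /// Cumulative `(upper_bound, count)` pairs, ending with the `+Inf` bucket.
    pub fn cumulative(&self) -> Vec<(f64, u64)> {
        let mut running = 0u64;
        self.bounds
            .iter()
            .copied()
            .chain(std::iter::once(f64::INFINITY))
            .zip(&self.counts)
            .map(|(bound, c)| {
                running += *c;
                (bound, running)
            })
            .collect()
    }
}
```

Storing per-bucket counts and accumulating only at export time keeps `observe` to a single increment, while the exported form still satisfies the rule that the `+Inf` bucket equals the total count.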
Registry integration:
```rust
# #[cfg(feature = "histogram")]
# {
use metrics_lib::{init, metrics, LabelSet};
init();
// Optional: pre-configure buckets for a metric name.
metrics().registry().configure_histogram(
"rpc_duration_seconds",
[0.005, 0.01, 0.025, 0.05, 0.1, 0.5, 1.0],
);
// Labeled histogram. First registration with a given `(name, labels)` tuple
// allocates a new `Arc<Histogram>` using the configured buckets (or the
// Prometheus default seconds layout if none configured).
let labels = LabelSet::from([("route", "/search")]);
let h = metrics().histogram_with("rpc_duration_seconds", &labels);
h.observe(0.087);
# }
```
<br>
<h3 id="labels"><code>LabelSet</code> & labeled metrics</h3>
Source: `src/labels.rs`.
`LabelSet` is a sorted, deduplicated `(key, value)` collection that
distinguishes one metric *instance* from another sharing the same name. The
inner storage is sorted by key so two label sets with the same contents but
different insertion orders hash and compare equal.
Construction:
- `LabelSet::EMPTY` / `LabelSet::new()` — empty set, allocation-free.
- `LabelSet::from([(k, v), ...])` / `FromIterator<(K, V)>` — accepts both
string literals (`&'static str`) and owned `String`s.
- `let mut l = LabelSet::new(); l.add("k", "v");` — incremental build,
builder-style.
- `.with("k", "v")` — consuming variant for chained construction.
Read accessors:
- `len() -> usize`, `is_empty() -> bool`.
- `iter() -> impl Iterator<Item = (&str, &str)>` — sorted by key.
- `get(key) -> Option<&str>`, `remove(key) -> bool`.
- `to_prometheus() -> String` — `{k="v",k="v"}` (used by exporters).
- `to_statsd() -> String` — `|#k:v,k:v` (DogStatsD format).
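The ordering and rendering rules above can be sketched with a sorted `Vec` of pairs. `SketchLabels` is a hypothetical type for illustration; in particular, "last value wins" on duplicate keys and the empty-set rendering are assumptions of this sketch, not documented behaviour of `LabelSet`.

```rust
/// Sorted, deduplicated label pairs. Sketch only: on duplicate keys the last value wins.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Default)]
pub struct SketchLabels(Vec<(String, String)>);

impl SketchLabels {
    pub fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let mut set = Self::default();
        for (k, v) in pairs {
            set.add(k, v);
        }
        set
    }

    /// Insert or overwrite, keeping the vector sorted by key.
    pub fn add(&mut self, key: &str, value: &str) {
        match self.0.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
            Ok(i) => self.0[i].1 = value.to_string(),
            Err(i) => self.0.insert(i, (key.to_string(), value.to_string())),
        }
    }

    /// Render as `{k="v",k="v"}`, escaping backslash, double quote and newline in values.
    pub fn to_prometheus(&self) -> String {
        if self.0.is_empty() {
            return String::new();
        }
        let body: Vec<String> = self
            .0
            .iter()
            .map(|(k, v)| {
                let escaped = v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n");
                format!("{k}=\"{escaped}\"")
            })
            .collect();
        format!("{{{}}}", body.join(","))
    }
}
```

Keeping the storage sorted is what makes equality and hashing independent of insertion order, since two sets with the same contents end up with identical vectors.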
Cardinality control (on the registry):
- `Registry::set_cardinality_cap(usize)` — default `DEFAULT_CARDINALITY_CAP = 10_000`.
- `Registry::cardinality_cap() -> usize`,
`Registry::cardinality_count() -> usize`,
`Registry::cardinality_overflows() -> u64`.
When a fresh `(name, labels)` registration would exceed the cap:
- `try_*_with` returns `Err(MetricsError::CardinalityExceeded)`.
- `*_with` (non-`try`) routes to a process-global per-type overflow sink
(never exported; observable via `cardinality_overflows`).
Labeled lookup methods on `MetricsCore` (each gated on its metric-type
feature):
- `counter_with(name, &LabelSet) -> Arc<Counter>` /
`try_counter_with(...) -> Result<Arc<Counter>>`
- `gauge_with(name, &LabelSet) -> Arc<Gauge>` / `try_gauge_with`
- `timer_with(name, &LabelSet) -> Arc<Timer>` / `try_timer_with`
- `rate_with(name, &LabelSet) -> Arc<RateMeter>` / `try_rate_with`
- `histogram(name) -> Arc<Histogram>` (unlabeled),
`histogram_with(name, &LabelSet) -> Arc<Histogram>` / `try_histogram_with`
Example:
```rust
# #[cfg(feature = "count")]
# {
use metrics_lib::{init, metrics, LabelSet, MetricsError};
init();
let labels = LabelSet::from([("method", "GET"), ("status", "200")]);
metrics().counter_with("http_requests", &labels).inc();
// Tight cap + explicit overflow handling.
metrics().registry().set_cardinality_cap(4);
let bad = LabelSet::from([("trace_id", "deadbeef")]);
match metrics().try_counter_with("http_requests", &bad) {
Ok(c) => c.inc(),
Err(MetricsError::CardinalityExceeded) => { /* drop or downsample */ }
Err(e) => panic!("unexpected error: {e}"),
}
# }
```
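To show how a cardinality cap and an overflow sink can interact, here is a simplified single-threaded sketch. `CappedSeries` and `CardinalityExceeded` are hypothetical names, the label set is represented by its rendered string, and counting an overflow on both the strict and lenient paths is an assumption.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical error returned when a new series would exceed the cap.
#[derive(Debug)]
pub struct CardinalityExceeded;

/// Sketch of a capped series map keyed by `(name, rendered labels)`.
pub struct CappedSeries {
    cap: usize,
    series: HashMap<(String, String), Arc<AtomicU64>>,
    overflow_sink: Arc<AtomicU64>, // shared by every rejected series
    overflows: u64,
}

impl CappedSeries {
    pub fn new(cap: usize) -> Self {
        Self { cap, series: HashMap::new(), overflow_sink: Arc::new(AtomicU64::new(0)), overflows: 0 }
    }

    /// Strict lookup: fails when a fresh series would exceed the cap.
    pub fn try_counter_with(&mut self, name: &str, labels: &str) -> Result<Arc<AtomicU64>, CardinalityExceeded> {
        let key = (name.to_string(), labels.to_string());
        if let Some(existing) = self.series.get(&key) {
            return Ok(Arc::clone(existing));
        }
        if self.series.len() >= self.cap {
            self.overflows += 1;
            return Err(CardinalityExceeded);
        }
        let fresh = Arc::new(AtomicU64::new(0));
        self.series.insert(key, Arc::clone(&fresh));
        Ok(fresh)
    }

    /// Lenient lookup: rejected series share the overflow sink instead of failing.
    pub fn counter_with(&mut self, name: &str, labels: &str) -> Arc<AtomicU64> {
        let sink = Arc::clone(&self.overflow_sink);
        self.try_counter_with(name, labels).unwrap_or(sink)
    }

    pub fn cardinality_count(&self) -> usize {
        self.series.len()
    }

    pub fn cardinality_overflows(&self) -> u64 {
        self.overflows
    }
}
```

Existing series are always returned regardless of the cap; only fresh registrations are rejected, so lowering the cap never invalidates handles that callers already hold.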
<br>
<h3 id="metric-metadata">Metric metadata</h3>
Source: `src/metadata.rs`.
Per-metric metadata (help text, unit, kind) feeds the `# HELP` / `# TYPE` /
`# UNIT` lines in Prometheus / OpenMetrics output, the `description` field
in OTLP, and unit suffixes in StatsD. Metadata is optional — every metric
exports successfully without it.
Types:
- `MetricKind` — `Counter | Gauge | Timer | Rate | Histogram`.
- `Unit` — enumerated standard units (`Seconds`, `Milliseconds`, `Bytes`,
`Percent`, …) plus `Unit::Custom(&'static str)` for free-form unit names.
- `MetricMetadata { help: Cow<'static, str>, unit: Unit, kind: MetricKind }`.
Registry methods:
- `Registry::describe(name, MetricMetadata)` — store/replace metadata for a name.
- `Registry::describe_counter(name, help, unit)`
- `Registry::describe_gauge(name, help, unit)`
- `Registry::describe_timer(name, help, unit)`
- `Registry::describe_rate(name, help, unit)`
- `Registry::describe_histogram(name, help, unit)`
- `Registry::metadata(name) -> Option<MetricMetadata>`.
Example:
```rust
use metrics_lib::{init, metrics, Unit};
init();
metrics().registry().describe_counter(
"http_requests",
"Total HTTP requests handled",
Unit::Custom("1"),
);
metrics().registry().describe_histogram(
"http_request_duration_seconds",
"Request handler latency",
Unit::Seconds,
);
```
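For orientation, this is roughly how such metadata turns into the comment lines of a text exposition. The helper below is a hypothetical sketch (name, parameters, and line order are assumptions), not the exporter's code.

```rust
/// Render the comment lines that precede a metric family in text exposition.
/// `unit` is emitted only when present (`# UNIT` is an OpenMetrics line).
pub fn render_family_header(name: &str, kind: &str, help: Option<&str>, unit: Option<&str>) -> String {
    let mut out = String::new();
    if let Some(help) = help {
        // HELP text escapes backslashes and newlines.
        let escaped = help.replace('\\', "\\\\").replace('\n', "\\n");
        out.push_str(&format!("# HELP {name} {escaped}\n"));
    }
    out.push_str(&format!("# TYPE {name} {kind}\n"));
    if let Some(unit) = unit {
        out.push_str(&format!("# UNIT {name} {unit}\n"));
    }
    out
}
```

A metric without metadata still gets a `# TYPE` line in this sketch, mirroring the statement that metadata is optional.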
<br>
### `SystemHealth`
Source: `src/system_health.rs`
Highlights:
- CPU and memory usage (process/system)
- Load average, threads, file descriptors, health score
Key methods (see `src/system_health.rs` for full details):
- `cpu_used() -> f64`, `cpu_free() -> f64`
- `mem_used_mb() -> f64`, `mem_used_gb() -> f64`
- `process_cpu_used() -> f64`, `process_mem_used_mb() -> f64`
- `load_avg() -> f64`
- `thread_count() -> u32`, `fd_count() -> u32`
- `health_score() -> f64`, `quick_check() -> HealthStatus`
- `update()` (force refresh), `snapshot() -> SystemSnapshot`, `process() -> ProcessStats`
**v0.9.2 — refresh fixes:**
- `maybe_update()` now stores and compares the last-refresh timestamp in a
single time unit (milliseconds). Earlier revisions stored nanoseconds and
compared milliseconds, freezing the throttle so all values were pinned to
their initial reads. After upgrade, `cpu_used()` / `mem_used_mb()` /
`load_avg()` / process metrics refresh on the configured interval as
documented.
- `SystemSnapshot::last_update` now reports **time since last refresh**
(e.g., `Duration::from_millis(0..=interval_ms)`), not "monotonic time at
last refresh" as it incorrectly did before.
- Linux process CPU is now delta-sampled: `((utime+stime) - prev) /
(CLK_TCK * elapsed_s * cores) * 100`, normalized per-core and clamped to
`0..=100`. First sample returns `0.0` and seeds the baseline.
- The non-Linux sysinfo refresh now uses `parking_lot::Mutex` instead of
`std::sync::Mutex`. The redundant manual `unsafe impl Send/Sync` for
`SystemHealth` was removed (the compiler derives both automatically).
Example:
```rust
use metrics_lib::{init, metrics};
init();
let sys = metrics().system();
println!(
"cpu={:.1}% mem_mb={:.1}",
sys.cpu_used(),
sys.mem_used_mb()
);
```
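The delta-sampled process CPU formula from the v0.9.2 notes can be written out as a pure function. This sketch takes the tick counts, `CLK_TCK`, elapsed time, and core count as parameters rather than reading `/proc`; `process_cpu_pct` and `CpuSampler` are hypothetical names.

```rust
/// Delta-sampled process CPU percentage, normalised per core and clamped to 0..=100.
/// `prev_ticks` / `cur_ticks` are cumulative `utime + stime` values in clock ticks.
pub fn process_cpu_pct(prev_ticks: u64, cur_ticks: u64, clk_tck: f64, elapsed_s: f64, cores: f64) -> f64 {
    let denom = clk_tck * elapsed_s * cores;
    if !(denom > 0.0) {
        return 0.0; // zero, negative, or NaN denominator: nothing meaningful to report
    }
    let delta = cur_ticks.saturating_sub(prev_ticks) as f64;
    (delta / denom * 100.0).clamp(0.0, 100.0)
}

/// Keeps the previous sample so the first call only seeds the baseline.
#[derive(Default)]
pub struct CpuSampler {
    prev_ticks: Option<u64>,
}

impl CpuSampler {
    pub fn sample(&mut self, cur_ticks: u64, clk_tck: f64, elapsed_s: f64, cores: f64) -> f64 {
        let pct = match self.prev_ticks {
            Some(prev) => process_cpu_pct(prev, cur_ticks, clk_tck, elapsed_s, cores),
            None => 0.0,
        };
        self.prev_ticks = Some(cur_ticks);
        pct
    }
}
```

For example, 200 ticks consumed over one second at `CLK_TCK = 100` on a 4-core machine is two cores' worth of time spread over four cores, i.e. 50%.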
<h4 id="systemhealth-platform-notes">Platform Notes</h4>
- Linux: Uses `/proc` for system and process sampling (CPU, memory, load, threads, FDs) for maximum performance and fidelity.
- Non‑Linux (macOS/Windows): Uses the `sysinfo` crate for cross‑platform values.
- System CPU, memory, and load are reported via `sysinfo`.
- Process CPU and memory are reported via `sysinfo`.
- Thread count and file descriptor/handle count return defaults (1 and 0 respectively) where not exposed portably.
- Future enhancement: native macOS (sysctl/mach) and Windows (PDH/WMI/WinAPI) backends can be added for per‑platform fidelity (e.g., accurate thread/FD counts) without adding dependencies.
Examples:
- CPU overview (system/process): `examples/cpu_stats.rs`
- Memory overview (system/process): `examples/memory_stats.rs`
<br>
<h5 id="systemhealth-memory-units-note">Memory Units Note</h5>
- Depending on platform and sysinfo version, raw memory values may be reported in KiB or bytes. The provided `examples/memory_stats.rs` auto‑detects units for display (MB/GB) while keeping percentage calculations consistent.
- For production use, prefer using percentages for alerts and apply consistent conversion for display. If you need exact byte precision on macOS or Windows, consider platform APIs (e.g., `sysctl` on macOS, WinAPI on Windows) in a background task, or contribute native backends to `SystemHealth`.
- The example includes a small documented helper `normalize_sysinfo_memory_to_mb(...)` explaining invariants and edge cases; see `examples/memory_stats.rs` (comment block above the function) for details.
<br>
<h3 id="health-config"><code>HealthConfig</code> & <code>Step</code> (v0.9.5)</h3>
Source: `src/system_health.rs`.
`SystemHealth::health_score()` is composed of step-wise penalties applied
to six metrics (system CPU, normalised load average, process CPU, memory
GB, threads, file descriptors). Prior to v0.9.5 the threshold ladder was
hardcoded; v0.9.5 exposes the full ladder as a tunable `HealthConfig`
value so deployments with different operating envelopes can pick their
own boundaries without forking the crate.
**Types:**
- `Step { threshold: f64, penalty: f64 }` — one `(threshold, penalty)`
pair. When the metric exceeds `threshold`, `penalty` is subtracted from
the running 0..=100 score. `Step::new(threshold, penalty)` is `const`.
- `HealthConfig { system_cpu, load_avg, process_cpu, memory_gb, threads,
fds: Vec<Step> }` — one penalty ladder per metric. Within each `Vec`,
steps **must be ordered descending by threshold** (first match wins).
- `HealthConfig::default()` — the v0.9.x defaults, preserved exactly so
existing dashboards do not shift on upgrade.
**Methods:**
- `SystemHealth::with_config(interval: Duration, config: HealthConfig) -> Self`
— new in v0.9.5. Combines a custom refresh interval with a custom
score config. `SystemHealth::with_interval(d)` continues to use the
default config.
- The `load_avg` ladder is interpreted as **multipliers of `num_cpus::get()`**
(e.g. `Step::new(2.0, 25.0)` ⇒ trips when 1-minute load > 2× core
count). Other ladders use the metric's natural unit.
**Examples:**
Tighter CPU thresholds for a CPU-bound service:
```rust
use metrics_lib::{HealthConfig, Step, SystemHealth};
use std::time::Duration;
let cfg = HealthConfig {
system_cpu: vec![
Step::new(70.0, 30.0),
Step::new(50.0, 15.0),
Step::new(30.0, 5.0),
],
..HealthConfig::default()
};
let health = SystemHealth::with_config(Duration::from_millis(500), cfg);
let score = health.health_score(); // 0..=100, lower under load
```
Relax FD thresholds for a connection-heavy service:
```rust
use metrics_lib::{HealthConfig, Step, SystemHealth};
use std::time::Duration;
let cfg = HealthConfig {
fds: vec![
Step::new(100_000.0, 15.0),
Step::new(50_000.0, 8.0),
Step::new(10_000.0, 3.0),
],
..HealthConfig::default()
};
let _ = SystemHealth::with_config(Duration::from_secs(1), cfg);
```
`HealthConfig` derives `serde::Serialize` behind the `serde` feature, so
configurations can be written out as JSON / TOML / YAML if desired; loading
them back additionally requires a `serde::Deserialize` implementation.
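The "first match wins" rule is the reason the ladders must be ordered by descending threshold. The sketch below shows the evaluation with a local `SketchStep` type (hypothetical, mirroring `Step`); clamping the final score at zero is an assumption.

```rust
/// Local stand-in for a `(threshold, penalty)` pair.
#[derive(Clone, Copy)]
pub struct SketchStep {
    pub threshold: f64,
    pub penalty: f64,
}

/// First matching step wins, so the ladder must be ordered by descending threshold.
pub fn penalty_for(value: f64, ladder: &[SketchStep]) -> f64 {
    ladder
        .iter()
        .find(|s| value > s.threshold)
        .map_or(0.0, |s| s.penalty)
}

/// Start from 100 and subtract one penalty per metric, never dropping below 0.
pub fn health_score(readings: &[(f64, &[SketchStep])]) -> f64 {
    let total: f64 = readings
        .iter()
        .map(|(value, ladder)| penalty_for(*value, ladder))
        .sum();
    (100.0 - total).max(0.0)
}
```

If the same ladder were listed in ascending order, a reading of 80 would match the lowest threshold first and receive only the smallest penalty, which is why ordering matters.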
<br>
<h3 id="scoped-registry"><code>ScopedRegistry</code> (v0.9.5)</h3>
Source: `src/registry.rs`.
A `ScopedRegistry` is a thin view over a `Registry` that **prepends a
fixed prefix** to every metric name on lookup / describe / configure.
Useful for tying a metrics namespace to a subsystem (`"http."`,
`"db."`, …) without rewriting every call site.
There is **no separate storage**: a scoped lookup lands in the same
underlying map as the unscoped equivalent, so
`scoped("http.").counter("requests")` and `counter("http.requests")`
return the **same** `Arc<Counter>`.
**Constructors:**
- `Registry::scoped(prefix: impl Into<String>) -> ScopedRegistry<'_>`
- `MetricsCore::scoped(prefix: impl Into<String>) -> ScopedRegistry<'_>`
(shorthand for `self.registry().scoped(prefix)`)
- `ScopedRegistry::scoped(sub_prefix) -> ScopedRegistry<'_>` — nested
scopes compose prefixes: `scoped("a.").scoped("b.")` ≡ `scoped("a.b.")`.
**Methods** (every method delegates to the underlying `Registry` with the
joined name):
- `counter(name)` / `gauge(name)` / `timer(name)` / `rate(name)` /
`histogram(name)` — unlabeled lookups.
- `counter_with(name, &labels)` / `gauge_with` / `timer_with` /
`rate_with` / `histogram_with` — labeled lookups (subject to the
registry's cardinality cap).
- `describe_counter` / `describe_gauge` / `describe_timer` /
`describe_rate` / `describe_histogram` — under the scoped name.
- `configure_histogram(name, buckets)` — bucket-layout pre-configuration
under the scoped name (requires `histogram`).
- `prefix() -> &str`, `registry() -> &Registry` — accessors.
**Examples:**
Subsystem namespacing:
```rust
use metrics_lib::{init, metrics, Unit};
init();
let http = metrics().scoped("http.");
http.describe_counter("requests", "Total HTTP requests", Unit::Custom("1"));
http.counter("requests").inc();
http.gauge("inflight").set(1.0);
// Equivalent unscoped name lookups return the same `Arc<Counter>`:
assert_eq!(metrics().counter("http.requests").get(), 1);
```
Nested scopes compose:
```rust
use metrics_lib::{init, metrics};
init();
let svc = metrics().scoped("svc.");
let db = svc.scoped("db.");
db.counter("queries").inc(); // → "svc.db.queries"
```
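The "no separate storage" property can be demonstrated with a tiny stand-in registry: the scope owns only a prefix string and forwards every lookup to the shared map. `SketchRegistry` and `SketchScope` are hypothetical types, and the `Mutex<HashMap>` storage is a simplification chosen for brevity.

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex};

/// Minimal stand-in for a registry: one shared map from full name to counter.
#[derive(Default)]
pub struct SketchRegistry {
    counters: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl SketchRegistry {
    pub fn counter(&self, name: &str) -> Arc<AtomicU64> {
        let mut map = self.counters.lock().unwrap();
        if let Some(existing) = map.get(name) {
            return Arc::clone(existing); // repeat lookups reuse the same Arc
        }
        let fresh = Arc::new(AtomicU64::new(0));
        map.insert(name.to_string(), Arc::clone(&fresh));
        fresh
    }

    pub fn scoped(&self, prefix: &str) -> SketchScope<'_> {
        SketchScope { registry: self, prefix: prefix.to_string() }
    }
}

/// A view that owns only its prefix; storage stays in the parent registry.
pub struct SketchScope<'a> {
    registry: &'a SketchRegistry,
    prefix: String,
}

impl<'a> SketchScope<'a> {
    pub fn counter(&self, name: &str) -> Arc<AtomicU64> {
        self.registry.counter(&format!("{}{}", self.prefix, name))
    }

    /// Nested scopes simply concatenate prefixes.
    pub fn scoped(&self, sub_prefix: &str) -> SketchScope<'a> {
        SketchScope { registry: self.registry, prefix: format!("{}{}", self.prefix, sub_prefix) }
    }
}
```

Because a scope only joins strings, a scoped lookup and the equivalent fully qualified lookup resolve to the same `Arc`.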
<br>
<h3 id="token-bucket"><code>TokenBucket</code> (v0.9.5)</h3>
Source: `src/token_bucket.rs`.
Strict-admission counterpart to `RateMeter::tick_if_under_limit`. Where
the rate-meter trades correctness for hot-path speed (it has known TOCTOU
overshoot of up to `num_threads − 1` events per window), `TokenBucket`
guarantees that the **capacity is never exceeded** — every `acquire`
goes through a single `compare_exchange_weak` on a packed `(tokens,
last_refill_ms)` `u64`.
Use for: billing, hard-limit admission control, downstream service
protection. For pure observability throttling, `RateMeter` is faster.
**Constructor:**
- `TokenBucket::new(capacity: u32, refill_per_second: f64) -> Self`
- `capacity` — burst size in whole tokens (max tokens the bucket
holds).
- `refill_per_second` — sustained refill rate. `0.0` produces a
static-capacity bucket (no refill). Non-finite or negative inputs
are coerced to `0.0`.
**Methods:**
- `try_acquire(n: u32) -> Result<()>` — atomically remove `n` tokens.
Returns `Ok(())` on success or `Err(MetricsError::WouldBlock)` when
fewer than `n` tokens are available after refill. `n == 0` is a no-op.
- `acquire(n: u32) -> bool` — convenience wrapper returning `true`/`false`.
- `available() -> u32` — approximate current token count (advisory; no
retry semantics).
- `capacity() -> u32`, `refill_per_second() -> f64` — configuration
accessors.
- `reset()` — refills the bucket to full `capacity`.
**Examples:**
Rate-limited request admission:
```rust
use metrics_lib::TokenBucket;

// 50 requests per second sustained, burst up to 100.
let limiter = TokenBucket::new(100, 50.0);

fn handle_request(limiter: &TokenBucket) {
    if limiter.acquire(1) {
        // … serve request …
    } else {
        // … return 429 Too Many Requests …
    }
}

handle_request(&limiter);

// Refill to full capacity, e.g. between benchmark runs.
limiter.reset();
```
Burst acquire (multi-token transactions):
```rust
use metrics_lib::{TokenBucket, MetricsError};
let limiter = TokenBucket::new(50, 10.0);
match limiter.try_acquire(5) {
Ok(()) => { /* admit the batch */ }
Err(MetricsError::WouldBlock) => { /* retry later */ }
Err(_) => unreachable!(),
}
```
Multiple concurrent threads racing for tokens: with capacity 100 and
8 threads each requesting 30, exactly 100 tokens are issued — no
overshoot:
```rust
use metrics_lib::TokenBucket;
use std::sync::Arc;
use std::thread;
let bucket = Arc::new(TokenBucket::new(100, 0.0));
let handles: Vec<_> = (0..8)
.map(|_| {
let b = Arc::clone(&bucket);
thread::spawn(move || {
let mut taken = 0u32;
for _ in 0..30 {
if b.acquire(1) { taken += 1; }
}
taken
})
})
.collect();
let total: u32 = handles.into_iter().map(|h| h.join().unwrap()).sum();
assert_eq!(total, 100);
```
<br>
<h3 id="exporters">Exporters (v0.9.3)</h3>
Five built-in exporters render the registry into popular telemetry formats.
Each is a stateless function (or thin sink for push transports) that
accepts a `&Registry` and produces a backend-specific output.
| Format | Module | Cargo feature | Output |
|---|---|---|---|
| Prometheus text | `metrics_lib::exporters::prometheus` | (always on) | `String` |
| OpenMetrics text | `metrics_lib::exporters::openmetrics` | (always on) | `String` (with trailing `# EOF\n`) |
| JSON snapshot | `metrics_lib::exporters::json` | `serde` | `String` / `RegistrySnapshot` |
| StatsD UDP | `metrics_lib::exporters::statsd` | `statsd` | UDP datagrams via `StatsdSink` |
| OTLP/HTTP+JSON | `metrics_lib::exporters::otlp` | `otlp` (→ `serde`) | `String` POST body |
All exporters honour [`LabelSet`](#labels) and [metric metadata](#metric-metadata):
- Prometheus / OpenMetrics: labels appear as `{k="v",k="v"}`; help/unit
metadata becomes `# HELP` / `# TYPE` / `# UNIT` lines. OpenMetrics adds
the `_total` suffix on counter samples and ends with `# EOF\n`.
- JSON snapshot: labels serialise as nested JSON objects; metadata appears
per-series.
- StatsD: labels become DogStatsD tags (`|#k:v,k:v`); the wire format uses
cumulative gauge mode (`|g`) since StatsD counters are deltas and the
registry stores totals.
- OTLP: labels become `attributes`; mapped to OTLP `Sum` (counters,
monotonic + cumulative), `Gauge`, or `Histogram` (timers + histograms).
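To make the Prometheus label and metadata layout concrete, here is a minimal standalone sketch of the text format for a single counter series. It does not call the library; `render_counter` and `escape_label_value` are hypothetical helpers, and the escaping rules follow the Prometheus text exposition format.

```rust
use std::fmt::Write;

/// Escape a label value per the Prometheus text format:
/// backslash, double quote, and newline are backslash-escaped.
fn escape_label_value(v: &str) -> String {
    let mut out = String::with_capacity(v.len());
    for c in v.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical helper: render one counter series with HELP/TYPE metadata.
fn render_counter(name: &str, help: &str, labels: &[(&str, &str)], value: u64) -> String {
    let mut out = String::new();
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} counter");
    out.push_str(name);
    if !labels.is_empty() {
        let pairs: Vec<String> = labels
            .iter()
            .map(|(k, v)| format!("{k}=\"{}\"", escape_label_value(v)))
            .collect();
        // Doubled braces emit literal `{` and `}` around the label pairs.
        let _ = write!(out, "{{{}}}", pairs.join(","));
    }
    let _ = writeln!(out, " {value}");
    out
}
```

Calling `render_counter("http_requests", "Total HTTP requests", &[("status", "200")], 7)` yields the same `http_requests{status="200"} 7` sample line that the built-in exporter example asserts on.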
Example — Prometheus `/metrics` body:
```rust
# #[cfg(feature = "count")]
# {
use metrics_lib::{init, metrics, LabelSet, Unit};
use metrics_lib::exporters::prometheus;
init();
metrics().registry().describe_counter(
"http_requests",
"Total HTTP requests",
Unit::Custom("1"),
);
let labels = LabelSet::from([("status", "200")]);
metrics().counter_with("http_requests", &labels).add(7);
let body = prometheus::render(metrics().registry());
assert!(body.contains("# HELP http_requests Total HTTP requests"));
assert!(body.contains(r#"http_requests{status="200"} 7"#));
# }
```
Example — JSON snapshot (feature = "serde"):
```rust
# #[cfg(all(feature = "serde", feature = "count"))]
# {
use metrics_lib::{init, metrics};
use metrics_lib::exporters::json;
init();
metrics().counter("hits").inc();
let snap = json::snapshot(metrics().registry());
assert_eq!(snap.schema_version, 1);
assert!(!snap.counters.is_empty());
// Or render directly to a JSON string.
let body = json::render(metrics().registry());
let _v: serde_json::Value = serde_json::from_str(&body).unwrap();
# }
```
Example — StatsD push (feature = "statsd"):
```no_run
# #[cfg(all(feature = "statsd", feature = "count"))]
# {
use metrics_lib::{init, metrics};
use metrics_lib::exporters::statsd::StatsdSink;
init();
metrics().counter("requests").inc();
let sink = StatsdSink::new("127.0.0.1:8125")
.expect("bind UDP")
.with_prefix("svc.");
sink.send(metrics().registry()).expect("statsd push");
# }
```
Example — OTLP/HTTP+JSON (feature = "otlp"):
```rust
# #[cfg(all(feature = "otlp", feature = "count"))]
# {
use metrics_lib::{init, metrics};
use metrics_lib::exporters::otlp;
init();
metrics().counter("requests").inc();
let payload: String = otlp::render(metrics().registry(), "my-service");
// POST `payload` to <collector>/v1/metrics with Content-Type: application/json
let _ = payload;
# }
```
End-to-end runnable examples live in `examples/`:
`labels_demo`, `histogram_latency`, `prometheus_endpoint`, `statsd_push`,
`otlp_push`, `snapshot_serde`.
<br>
### Async support
Source: `src/async_support.rs`
- `AsyncTimerGuard` — RAII timing for async blocks
- `AsyncTimerExt` — extension trait providing `start_async()` and `time_async()`
- `TimedFuture` — `Future` wrapper returned by `time_async()`
- `AsyncMetricBatch` — batch metric updates with `counter_inc`, `gauge_set`,
`timer_record`, `rate_tick`, `flush(&MetricsCore)`.
- **v0.9.2 note:** name arguments are now `impl Into<Cow<'static, str>>`
(was `&'static str`). Both string literals and owned `String`s are
accepted; the enum stores `Cow<'static, str>` internally so static names
cost nothing extra.
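The `impl Into<Cow<'static, str>>` pattern from the v0.9.2 note can be seen in isolation below. `metric_name` is a hypothetical stand-in for any name-taking method, not a function from the crate.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a name-taking API.
/// A string literal becomes `Cow::Borrowed` (no allocation);
/// an owned `String` becomes `Cow::Owned` (the existing buffer is moved in).
fn metric_name(name: impl Into<Cow<'static, str>>) -> Cow<'static, str> {
    name.into()
}
```

A literal such as `metric_name("jobs_done")` stays borrowed, while `metric_name(format!("jobs_{}", 7))` takes ownership of the formatted string without an extra copy.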
Example (Tokio):
```rust
use metrics_lib::{init, metrics, AsyncTimerExt, AsyncMetricBatch};
#[tokio::main]
async fn main() {
init();
// Time an async operation and get its result
let timer = metrics().timer("async_task");
let result: i32 = timer
.time_async(|| async {
// ... async work ...
42
})
.await;
assert_eq!(result, 42);
// RAII guard form
{
let _guard = timer.start_async();
// ... async work interleaved ...
// recorded on drop
}
// Batch updates (flush is synchronous and takes &MetricsCore)
let mut batch = AsyncMetricBatch::new();
batch.counter_inc("jobs_done", 1);
batch.gauge_set("queue_depth", 3.0);
batch.timer_record("async_task", 500_000); // ns
batch.rate_tick("qps");
batch.flush(metrics());
}
```
<br>
### Adaptive controls
Source: `src/adaptive.rs`
- `SamplingStrategy`
  - `Fixed { rate: u32 }`
  - `Dynamic { min_rate, max_rate, target_throughput }`
  - `TimeBased { min_interval: u64 /* ns */ }`
- `AdaptiveSampler::new(strategy)`; `should_sample() -> bool`; `current_rate() -> u32`; `stats()`
- `MetricCircuitBreaker` with `CircuitBreakerConfig { failure_threshold, success_threshold, timeout, half_open_max_calls }`
  - `is_allowed() -> bool`, `record_success()`, `record_failure()`
- `BackpressureController` (re-exported): utilities to reduce work under load
Example (sampling):
```rust
use metrics_lib::{AdaptiveSampler, SamplingStrategy};
let sampler = AdaptiveSampler::new(SamplingStrategy::Dynamic {
min_rate: 1,
max_rate: 1024,
target_throughput: 10_000,
});
if sampler.should_sample() {
// record detailed metrics/logging
}
```
Example (circuit breaker):
```rust
use metrics_lib::MetricCircuitBreaker;
use metrics_lib::adaptive::CircuitBreakerConfig;

let cb = MetricCircuitBreaker::new(CircuitBreakerConfig { ..Default::default() });
if cb.is_allowed() {
    // perform work and then report result
    cb.record_success();
} else {
    // shed load
}
```
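For readers unfamiliar with the pattern, the closed / open / half-open cycle behind a circuit breaker can be sketched as a small state machine. This is a generic illustration under stated assumptions, not `MetricCircuitBreaker`'s implementation: the `Breaker` type, the explicit `now_ms` clock argument, and the omission of `half_open_max_calls` are simplifications.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since_ms: u64 },
    HalfOpen,
}

/// Hypothetical single-threaded breaker; the caller supplies the clock.
struct Breaker {
    state: State,
    failures: u32,
    successes: u32,
    failure_threshold: u32,
    success_threshold: u32,
    timeout_ms: u64,
}

impl Breaker {
    fn new(failure_threshold: u32, success_threshold: u32, timeout_ms: u64) -> Self {
        Self { state: State::Closed, failures: 0, successes: 0, failure_threshold, success_threshold, timeout_ms }
    }

    /// Open breakers start probing (half-open) once the timeout has elapsed.
    fn is_allowed(&mut self, now_ms: u64) -> bool {
        if let State::Open { since_ms } = self.state {
            if now_ms.saturating_sub(since_ms) >= self.timeout_ms {
                self.state = State::HalfOpen;
                self.successes = 0;
            }
        }
        !matches!(self.state, State::Open { .. })
    }

    fn record_success(&mut self) {
        match self.state {
            State::HalfOpen => {
                self.successes += 1;
                if self.successes >= self.success_threshold {
                    self.state = State::Closed;
                    self.failures = 0;
                }
            }
            State::Closed => self.failures = 0,
            State::Open { .. } => {}
        }
    }

    fn record_failure(&mut self, now_ms: u64) {
        match self.state {
            // Any failure while probing reopens the breaker immediately.
            State::HalfOpen => self.state = State::Open { since_ms: now_ms },
            State::Closed => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.state = State::Open { since_ms: now_ms };
                }
            }
            State::Open { .. } => {}
        }
    }
}
```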
<br>
<h3 id="tracing-ext"><code>tracing</code> integration (v0.9.5)</h3>
Source: `src/tracing_ext.rs` — requires the `tracing` Cargo feature.
Opt-in adapters that wrap existing `Timer` operations with a
[`tracing`](https://docs.rs/tracing) span, so a single call site
populates both the metric histogram and the user's tracing subscriber.
Hot paths in the metric types themselves are **unchanged**; enabling the
`tracing` feature does not slow `Counter::inc` / `Gauge::set` /
`Timer::record` / `Histogram::observe`.
**Functions:**
- `time_in_span<T>(name: &'static str, timer: &Timer, f: impl FnOnce() -> T) -> T`
— runs `f` inside both the supplied `Timer` and a
`tracing::info_span!("metric.time", name = name)`. Returns whatever
`f` returns.
- `time_global<T>(name: &'static str, f: impl FnOnce() -> T) -> T`
— shorthand that resolves the timer from the global registry by
`name` and forwards to `time_in_span`.
**Examples:**
Wrap a closure with both a metric timer and a tracing span:
```rust
# #[cfg(all(feature = "timer", feature = "tracing"))]
# {
use metrics_lib::{init, metrics};
use metrics_lib::tracing_ext::time_in_span;
init();
let timer = metrics().timer("db.query");
let result = time_in_span("db.query", &timer, || {
    // ... run the query ...
    42
});
assert_eq!(result, 42);
assert_eq!(timer.count(), 1);
# }
```
Use the global shorthand:
```rust
# #[cfg(all(feature = "timer", feature = "tracing"))]
# {
use metrics_lib::init;
use metrics_lib::tracing_ext::time_global;
init();
let result = time_global("db.query", || 42);
assert_eq!(result, 42);
# }
```
<br>
### Prelude
Import the most common items ergonomically:
```rust
use metrics_lib::prelude::*;
fn main() {
init();
metrics().counter("ready").inc();
}
```
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## API Safety
The library prioritizes performance while preventing common misuse. Several read/return-value APIs are annotated with `#[must_use]`. This means the compiler warns if the return value is ignored. Ignoring these values usually indicates a logic bug or a lost control decision.
Key `#[must_use]` examples:
- `Counter`: `get()`, `stats()`, `age()`, `is_zero()`, `rate_per_second()`
- `Gauge`: `get()`, `stats()`, `age()`, `is_zero()`, `is_positive()`, `is_negative()`, `is_finite()`
- `Timer`: `count()`, `total()`, `average()`, `min()`, `max()`, `stats()`, `age()`, `is_empty()`, `rate_per_second()`, `RunningTimer::elapsed()`
- `RateMeter`: `rate()`, `rate_per_second()`, `rate_per_minute()`, `rate_per_hour()`, `total()`, `exceeds_rate()`, `can_allow()`, `tick_if_under_limit()`, `tick_burst_if_under_limit()`, `stats()`, `age()`, `is_empty()`
Misuse patterns to avoid:
- Dropping results without checking:
```rust
let _ = metrics().rate("api").tick_if_under_limit(1000.0);
```
- Computing values and not using them:
```rust
metrics().rate("api").rate();
```
Prefer explicit handling:
```rust
let r = metrics().rate("api");
if r.tick_if_under_limit(1000.0) {
// admitted
} else {
// throttled
}
let s = r.stats();
log::debug!("rate: {:.1}/s total: {} age: {:?}", s.per_second, s.total_events, s.age);
```
Notes:
- `Result<…>`-returning APIs are not additionally marked with `#[must_use]` since `Result` already carries it.
- Methods that mutate state (e.g., `Counter::inc()`, `Gauge::set()`) intentionally do not return values.
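The effect of `#[must_use]` is easy to reproduce in your own code. The sketch below uses hypothetical functions (`under_limit`, `admit`) rather than the library's types: ignoring the result of `under_limit` triggers an `unused_must_use` warning, while branching on it does not.

```rust
/// Hypothetical admission check; the attribute makes the compiler warn
/// whenever a caller discards the returned decision.
#[must_use = "the admission decision decides whether to serve or throttle"]
fn under_limit(observed_per_sec: f64, limit_per_sec: f64) -> bool {
    observed_per_sec < limit_per_sec
}

/// Explicit handling: every decision leads to a visible outcome.
fn admit(observed_per_sec: f64, limit_per_sec: f64) -> &'static str {
    if under_limit(observed_per_sec, limit_per_sec) {
        "admitted"
    } else {
        "throttled"
    }
}
```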
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Deployment Patterns
This section documents proven deployment approaches for using `metrics-lib` in production systems at scale.
### 1. Initialization Patterns
```rust
// Where to initialize in different app types (Tokio web service example)
use metrics_lib::{init_with_config, Config};
#[tokio::main]
async fn main() {
// Initialize BEFORE spawning workers or background tasks
init_with_config(Config {
max_metrics: 10_000,
enable_system_metrics: true,
..Default::default()
});
// Now safe to use across all threads/tasks
// build_server().await;
}
```
Other patterns:
- CLI/tools: call `init()`/`init_with_config()` at the very start of `main()`.
- Libraries: accept `&MetricsCore` explicitly or rely on the global via `metrics()` when appropriate.
- Tests/benches: initialize once per process; subsequent calls are no-ops.
### 2. High-Volume Strategies
```rust
// Strategy 1: Adaptive Sampling (reduce overhead on hot paths)
use metrics_lib::{metrics, AdaptiveSampler, SamplingStrategy};
let sampler = AdaptiveSampler::new(SamplingStrategy::Dynamic {
min_rate: 1,
max_rate: 1024,
target_throughput: 1_000_000, // target ~1M ops/sec
});
if sampler.should_sample() {
metrics().timer("hot_path").record_ns(250); // fast-path manual ns record
}
```
```rust
// Strategy 2: Batch Collection (amortize costs under bursty load)
use metrics_lib::{metrics, AsyncMetricBatch};
let mut batch = AsyncMetricBatch::new();
batch.counter_inc("requests", 1);
batch.gauge_set("cpu", 82.4);
batch.timer_record("db", 120_000); // ns
batch.rate_tick("qps");
batch.flush(metrics()); // single synchronized flush
```
```rust
// Strategy 3: Thread-Local Aggregation (application-level)
// Aggregate counts locally and flush periodically to reduce contention
thread_local! {
static LOCAL_COUNT: std::cell::Cell<u64> = std::cell::Cell::new(0);
}
fn on_event() {
LOCAL_COUNT.with(|c| c.set(c.get() + 1));
}
fn flush_local() {
let count = LOCAL_COUNT.with(|c| { let v = c.get(); c.set(0); v });
if count > 0 {
metrics_lib::metrics().counter("events").add(count);
}
}
```
Guidelines:
- Prefer `record_ns`/`batch_inc`/`flush` in the hottest paths.
- Sample or downsample high-cardinality metrics.
- Avoid per-op string formatting or allocation; use `&'static str` names.
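As one way to downsample a hot path without the adaptive machinery, a fixed one-in-N sampler needs only a relaxed atomic counter. `OneInN` is a hypothetical helper written for this sketch; it is not part of the crate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical fixed-rate sampler: admits the first call, then every n-th.
struct OneInN {
    calls: AtomicU64,
    n: u64,
}

impl OneInN {
    fn new(n: u64) -> Self {
        // Treat 0 as 1 so the modulo below can never divide by zero.
        Self { calls: AtomicU64::new(0), n: n.max(1) }
    }

    /// One relaxed fetch_add per call; no locks and no allocation.
    fn should_sample(&self) -> bool {
        self.calls.fetch_add(1, Ordering::Relaxed) % self.n == 0
    }
}
```

With `OneInN::new(4)`, 16 calls produce 4 samples, so detailed recording runs on a quarter of the events.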
### 3. Memory Management
- Bounded vs. unbounded: limit `max_metrics` via `Config` for controlled memory use.
- Name cardinality: avoid embedding unbounded values (IDs, UUIDs) in metric names.
- Recycling: reuse metric instances via the `Registry`; avoid creating/dropping in tight loops.
- Cleanup: if dynamic names are required, provide explicit cleanup points (e.g., `Registry::clear()` in test lifecycles).
- Alignment: metrics are 64-byte cache-line aligned; avoid creating excessive distinct metrics to keep cache footprint small.
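The alignment point can be illustrated with `#[repr(align(64))]`. `PaddedCounter` below is a hypothetical type showing the general technique, not the crate's `Counter`: alignment rounds the struct up to a full 64 bytes, so two instances never share a cache line.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical cache-line-aligned counter (assumes 64-byte cache lines).
#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}

impl PaddedCounter {
    const fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}
```

The trade-off is visible in `std::mem::size_of::<PaddedCounter>()`: each 8-byte counter occupies 64 bytes, which is why a large number of distinct metrics inflates the cache footprint.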
### 4. Multi-Service Patterns
- Naming: use service prefixes like `"auth.requests"`, `"billing.latency"`, `"api.v2.error_rate"`.
- Correlation: align metric names/labels with tracing spans or request IDs (in structured logs), not in the metric name itself.
- Boundaries: maintain separate registries per service when embedding `metrics-lib` inside multi-tenant binaries.
- Aggregation: push metrics to a single exporter/collector at service boundaries; keep in-process metrics lock-free and fast.
### 5. Export and Ingestion
`metrics-lib` focuses on ultra-fast in-process metrics. For exporting, consider bridging to your observability stack:
- Push gateway: periodically snapshot internal counters/gauges and send to an external collector.
- File/pipe sink: write snapshots to a file or stdout for sidecar ingestion.
- Structured logs: emit key metrics in JSON logs for log-based analytics.
Example (periodic snapshot skeleton):
```rust
use std::time::Duration;
use tokio::time::interval;
use metrics_lib::metrics;
#[tokio::main]
async fn main() {
metrics_lib::init();
let mut tick = interval(Duration::from_secs(10));
loop {
tick.tick().await;
// Example: read values atomically and ship to a gateway
let requests = metrics().counter("requests").get();
let error_rate = metrics().rate("errors").rate();
// send_to_gateway(requests, error_rate).await?;
}
}
```
Guidelines:
- Keep export paths off the hot path; use async tasks and backpressure-aware queues.
- Bound queue sizes; drop or sample on overload to protect the application.
- Prefer binary formats for high throughput (CBOR, protobuf) when applicable.
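A bounded queue that drops on overload can be built from `std::sync::mpsc::sync_channel` and `try_send`. The sketch below is one possible shape; `Snapshot`, `export_queue`, and `offer` are hypothetical names, and an async runtime would use its own bounded channel instead.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical snapshot payload shipped to the exporter task.
#[derive(Debug)]
struct Snapshot {
    requests: u64,
    error_rate: f64,
}

/// Create a queue that holds at most `capacity` pending snapshots.
fn export_queue(capacity: usize) -> (SyncSender<Snapshot>, Receiver<Snapshot>) {
    sync_channel(capacity)
}

/// Non-blocking enqueue: returns false (and drops the snapshot) when the
/// queue is full or the exporter has gone away, so producers never stall.
fn offer(tx: &SyncSender<Snapshot>, snap: Snapshot) -> bool {
    match tx.try_send(snap) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

Counting the `false` results in a dedicated counter is a cheap way to see how often the exporter falls behind.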
### 6. On-Call Diagnostics
Enable targeted, temporary metrics during incidents without long-term overhead:
- Compile-time flags: feature-gate diagnostic code.
- Runtime toggles: environment variables or admin endpoints enable additional metrics.
Examples:
```rust
// Compile-time gate (Cargo feature)
#[cfg(feature = "diagnostics")]
pub fn diag_tick() {
metrics_lib::metrics().counter("diag.slow_path").inc();
}
```
```rust
// Runtime gate via env var
if std::env::var("METRICS_DIAG").as_deref() == Ok("1") {
metrics_lib::metrics().gauge("diag.queue_depth").set(42.0);
}
```
Guidelines:
- Ensure diagnostic code is zero-overhead when disabled (compile-time or fast runtime checks).
- Use stable, prefixed names (e.g., `diag.*`) and document cleanup/removal plans.
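Reading an environment variable on every call is not a fast runtime check. One option, sketched here with hypothetical names (`DIAG_ENABLED`, `parse_diag_flag`, `set_diag`, `diag_enabled`), is to parse the variable once and cache the result in an `AtomicBool` that hot paths read with a relaxed load.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical process-wide switch for diagnostic metrics.
static DIAG_ENABLED: AtomicBool = AtomicBool::new(false);

/// Only the exact value "1" enables diagnostics, matching the env-var
/// check shown earlier in this section.
fn parse_diag_flag(raw: Option<&str>) -> bool {
    raw == Some("1")
}

/// Flip the switch at startup or from an admin endpoint.
fn set_diag(on: bool) {
    DIAG_ENABLED.store(on, Ordering::Relaxed);
}

/// Hot-path check: a single relaxed atomic load.
#[inline]
fn diag_enabled() -> bool {
    DIAG_ENABLED.load(Ordering::Relaxed)
}
```

At startup this could be wired as `set_diag(parse_diag_flag(std::env::var("METRICS_DIAG").ok().as_deref()))`, after which diagnostic call sites guard themselves with `if diag_enabled() { ... }`.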
### 7. Feature Gating Strategies
Use Cargo features to tailor performance and binary size to environments:
- `default`: minimal footprint; enable heavier components only where needed.
- `async`: include async helpers only when an async runtime is used.
- `bench-tests`: keep benchmark-style tests out of default CI runs to avoid flakiness.
Cargo.toml example:
```toml
[features]
count = [] # Counter metric type
gauge = [] # Gauge metric type
timer = [] # Timer metric type
meter = [] # Rate meter metric type
sample = [] # Statistical sampling
histogram = ["sample"] # Histogram (requires sample)
async = ["dep:tokio"] # Async support (requires Tokio)
serde = ["dep:serde"] # Serde serialization
all = ["count","gauge","timer","meter","sample","histogram"]
full = ["count","gauge","timer","meter","sample","histogram","async","serde"]
minimal = ["count"] # Smallest useful build
default = ["count","gauge","timer"]
bench-tests = [] # Benchmark-style CI tests
```
CI best practices:
- Run unit tests with default features for consumer parity.
- Run all-features in a separate job when validating optional integrations.
- Keep benchmark-style tests gated behind `--features bench-tests -- --ignored`.
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
<h2 id="real-world-examples">Real-World Examples</h2>
<br>
<h3 id="real-world-high-frequency-trading">High-Frequency Trading (HFT)</h3>
Constraints: sub-microsecond hot paths, no allocations, no locks, bounded cardinality.
Key patterns:
- Pre-register metric handles at startup.
- Use counters/gauges inline; export asynchronously off the hot path.
- Avoid per-symbol labels in names; sample or aggregate in fixed windows.
```rust
use metrics_lib::metrics;
// Pre-register at init
pub fn init_metrics() {
let m = metrics();
m.counter("orders_submitted");
m.counter("orders_rejected");
m.timer("match_latency_ns");
m.gauge("orderbook_depth");
}
#[inline(always)]
pub fn on_match(orderbook_depth: u64) {
// Minimal work: record, no allocations
let _t = metrics().timer("match_latency_ns").start();
// ... matching logic ...
metrics().gauge("orderbook_depth").set(orderbook_depth as f64);
}
#[inline(always)]
pub fn submit_ok() { metrics().counter("orders_submitted").inc(); }
#[inline(always)]
pub fn submit_reject() { metrics().counter("orders_rejected").inc(); }
```
Guidance:
- Keep metrics names stable; do not embed symbol/account IDs.
- If symbol-level insight is required, sample 1/N events and export summaries via background task.
- Prefer histogram buckets sized for nanosecond ranges if using histograms.
<br>
<h3 id="real-world-web-service-under-load">Web Service Under Load</h3>
Track throughput, error rate, and tail latency. Use recording rules to reduce dashboard cost.
```rust
use metrics_lib::metrics;
pub async fn handle_request() -> Result<&'static str, anyhow::Error> {
let _t = metrics().timer("http_request_duration_s").start();
metrics().counter("http_requests_total").inc();
// ... work ...
Ok("ok")
}
pub fn on_error() {
metrics().counter("http_errors_total").inc();
}
```
Prometheus queries:
- Rate: `sum(rate(http_requests_total[5m]))` per job/route (avoid high-cardinality routes; use normalized labels or grouping).
- Error ratio: `sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))`.
- p95: `histogram_quantile(0.95, sum(rate(http_request_duration_s_bucket[5m])) by (le))` if using histogram form.
<br>
<h3 id="real-world-batch-processing-pipeline">Batch Processing Pipeline</h3>
Measure per-batch latency, items processed, and failures. Emit gauges for backlogs.
```rust
use metrics_lib::metrics;
pub fn process_batch(batch_size: usize) {
let _t = metrics().timer("batch_duration_s").start();
// ... process ...
metrics().counter("batch_processed_items_total").add(batch_size as u64);
}
pub fn record_failure() { metrics().counter("batch_failures_total").inc(); }
pub fn backlog_set(count: usize) { metrics().gauge("queue_backlog").set(count as f64); }
```
Grafana tips:
- Use dual-axis panel for `rate(batch_processed_items_total[5m])` and backlog gauge.
- Alert if backlog grows while throughput drops.
<br>
<h3 id="real-world-token-bucket-rate-limiter">Token Bucket Rate Limiter</h3>
Use `RateMeter` for observed rate and gauges for bucket level; timers for wait time.
```rust
use metrics_lib::{metrics, RateMeter};
pub struct Limiter {
meter: RateMeter,
capacity: u64,
tokens: u64,
}
impl Limiter {
pub fn allow(&mut self) -> bool {
self.meter.tick();
if self.tokens > 0 { self.tokens -= 1; true } else { false }
}
pub fn report(&self) {
metrics().gauge("ratelimit_tokens").set(self.tokens as f64);
metrics().gauge("ratelimit_capacity").set(self.capacity as f64);
}
}
```
<br>
<h3 id="real-world-custom-exporter">Building a Custom Exporter</h3>
Example skeleton to snapshot internal metrics and ship to a custom sink (file, TCP, UDP, HTTP, etc.) without perturbing hot paths:
```rust
use metrics_lib::metrics;
use std::fmt::Write;
/// Periodically called by a background task
pub fn snapshot_metrics() -> String {
    let reg = metrics().registry();
    let mut out = String::new();
    // Example format: simple "name value" lines (adapt to your collector).
    // Each name is cloned before leaking so `name` stays usable for output.
    // Box::leak never frees the string; if your version of the API accepts
    // a plain `&str`, pass `&name` instead to avoid leaking per snapshot.
    for name in reg.counter_names() {
        let v = metrics().counter(Box::leak(name.clone().into_boxed_str())).get();
        let _ = writeln!(out, "{} {}", name, v);
    }
    for name in reg.gauge_names() {
        let v = metrics().gauge(Box::leak(name.clone().into_boxed_str())).get();
        let _ = writeln!(out, "{} {}", name, v);
    }
    for name in reg.timer_names() {
        let s = metrics().timer(Box::leak(name.clone().into_boxed_str())).stats();
        let _ = writeln!(out, "{}.count {}", name, s.count);
        let _ = writeln!(out, "{}.avg_ns {}", name, s.average.as_nanos());
    }
    for name in reg.rate_meter_names() {
        let r = metrics().rate(Box::leak(name.clone().into_boxed_str()));
        let _ = writeln!(out, "{}.per_sec {:.3}", name, r.rate());
    }
    out
}
```
Guidelines:
- Run exporters on a timer or off a channel queue, not inline with critical work.
- Bound buffers and drop data on overload to protect application throughput.
- Prefer binary formats for high-throughput ingestion.
<br>
<h3 id="real-world-memory-stats">Memory Stats: total/used/free + percentages</h3>
The `SystemHealth` API provides convenient accessors for commonly used memory stats. Convert units as needed.
```rust
use metrics_lib::metrics;
fn fmt_size_mb(mb: f64) -> (f64, &'static str) {
// convert MB → GB/TB simplistically for display
if mb >= 1024.0 * 1024.0 { (mb / (1024.0 * 1024.0), "TB") }
else if mb >= 1024.0 { (mb / 1024.0, "GB") } else { (mb, "MB") }
}
pub fn memory_overview() {
let sys = metrics().system();
let used_mb = sys.mem_used_mb();
// If you need total/free, compute via platform helpers or your own sysinfo; here we display used directly.
let (v, unit) = fmt_size_mb(used_mb);
println!("mem.used: {:.2} {}", v, unit);
println!("mem.used.pct (process): {:.2}%", sys.process_mem_used_mb() / used_mb.max(1.0) * 100.0);
}
```
Notes:
- `mem_used_mb()` and `mem_used_gb()` report current system memory usage; `process_mem_used_mb()` reports this process’s memory.
- If you require precise total/free memory, integrate your platform’s system APIs alongside `SystemHealth` and compute `free = total - used` and percentages accordingly.
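The `free = total - used` arithmetic from the note above fits in a small helper. `mem_breakdown` is a hypothetical function; the total is assumed to come from a platform API outside this crate.

```rust
/// Hypothetical helper: given total and used memory in MB, return
/// `(free_mb, used_pct, free_pct)`, or `None` when the total is not positive.
fn mem_breakdown(total_mb: f64, used_mb: f64) -> Option<(f64, f64, f64)> {
    // `!(x > 0.0)` also rejects NaN totals.
    if !(total_mb > 0.0) {
        return None;
    }
    // Clamp so a stale or racy `used` sample cannot yield negative free memory.
    let used = used_mb.clamp(0.0, total_mb);
    let free = total_mb - used;
    Some((free, used / total_mb * 100.0, free / total_mb * 100.0))
}
```

For a 16 GB host with 4 GB in use, `mem_breakdown(16384.0, 4096.0)` gives 12288 MB free, 25% used, and 75% free.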
<br>
<h3 id="real-world-memory-percent-operation">Memory % used for an operation (estimate)</h3>
Estimate memory consumed by a single operation by sampling process memory before and after. Express as MB/GB and as a percentage of the pre-op process memory.
```rust
use metrics_lib::metrics;
pub fn measure_op_memory<T>(f: impl FnOnce() -> T) -> (T, f64 /* delta_mb */, f64 /* pct of process */) {
let sys = metrics().system();
let before_mb = sys.process_mem_used_mb();
let result = f();
let after_mb = sys.process_mem_used_mb();
let delta_mb = (after_mb - before_mb).max(0.0);
let pct = if before_mb > 0.0 { (delta_mb / before_mb) * 100.0 } else { 0.0 };
(result, delta_mb, pct)
}
```
Notes:
- This is a coarse estimate; allocator behavior and async tasks can skew instantaneous samples. For better accuracy, repeat and average.
<br>
<h3 id="real-world-cpu-stats">CPU Stats: total/used/free + percentages</h3>
`SystemHealth` exposes CPU usage percentages. Display them and convert as needed.
```rust
use metrics_lib::metrics;
pub fn cpu_overview() {
let sys = metrics().system();
let used = sys.cpu_used(); // e.g., 23.5 (percent)
let free = sys.cpu_free(); // e.g., 76.5 (percent)
println!("cpu.used: {:.1}%", used);
println!("cpu.free: {:.1}%", free);
}
```
Notes:
- For per-core or process-specific stats, use `process_cpu_used()` and, if needed, supplement with platform APIs for core counts/affinity.
<br>
<h3 id="real-world-cpu-percent-operation">CPU % used for an operation (estimate)</h3>
Estimate CPU for an operation by sampling process CPU usage and wall time before/after. This yields a coarse percentage useful for relative comparisons.
```rust
use metrics_lib::metrics;
use std::time::Instant;
pub fn measure_op_cpu<T>(f: impl FnOnce() -> T) -> (T, f64 /* cpu_used_delta_pct */, f64 /* wall_ms */) {
let sys = metrics().system();
let start = Instant::now();
let cpu_before = sys.process_cpu_used();
let result = f();
let wall = start.elapsed().as_millis() as f64;
let cpu_after = sys.process_cpu_used();
let cpu_delta = (cpu_after - cpu_before).max(0.0);
(result, cpu_delta, wall)
}
```
Notes:
- Short operations can under-report due to sampling granularity; repeat and average for stability.
- For rigorous accounting, sample over longer windows or use OS-level per-thread CPU accounting.
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Integration Examples
This section shows how to integrate `metrics-lib` with common stacks. These examples are illustrative and may require adapting types to your application framework.
<h3 id="web-framework-integration">1. Web Framework Integration (Axum middleware)</h3>
```rust
use axum::{http::Request, middleware::Next, response::Response};
use metrics_lib::metrics;
pub async fn metrics_middleware<B>(req: Request<B>, next: Next<B>) -> Response {
    // Copy the path out now: `req` is moved into `next.run` below.
    let path = req.uri().path().to_owned();
    let timer = metrics().timer("http.request").start();
    let response = next.run(req).await;
    // Request/Status counters
    metrics().counter("http.requests").inc();
    metrics()
        .counter(match response.status().as_u16() {
            200..=299 => "http.status.2xx",
            300..=399 => "http.status.3xx",
            400..=499 => "http.status.4xx",
            500..=599 => "http.status.5xx",
            _ => "http.status.other",
        })
        .inc();
    // Optional: per-path timer (beware cardinality)
    metrics().timer(&format!("http.request.{}", path)).record(timer.elapsed());
    response
}
```
Guidance:
- Prefer a small, bounded set of status counters over per-path status metrics.
- Use per-path timers sparingly to avoid high-cardinality names.
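If per-path timers are needed, normalizing the path first keeps the name set bounded. The sketch below replaces purely numeric segments with a placeholder; `normalize_path` and the `:id` placeholder are hypothetical choices, and real services may prefer the router's matched route template where the framework exposes one.

```rust
/// Hypothetical normalizer: collapse numeric path segments into `:id` so
/// `/users/1` and `/users/2` map to the same metric name.
fn normalize_path(path: &str) -> String {
    path.split('/')
        .map(|seg| {
            if !seg.is_empty() && seg.chars().all(|c| c.is_ascii_digit()) {
                ":id"
            } else {
                seg
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

This only handles numeric IDs; UUIDs, slugs, and other unbounded segments would need their own rules before the result is safe to embed in a metric name.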
<br>
<h3 id="database-pool-monitoring">2. Database Pool Monitoring</h3>
```rust
use metrics_lib::metrics;
pub struct ConnectionPool {
inner: deadpool_postgres::Pool, // example; adapt to your pool type
}
impl ConnectionPool {
pub async fn get(&self) -> deadpool_postgres::Client {
let _wait = metrics().timer("db.pool.wait").start();
metrics().gauge("db.pool.active").add(1.0);
let client = self.inner.get().await.expect("db conn");
// Update gauges after acquiring (adjust per pool’s API)
metrics().gauge("db.pool.idle").set(self.idle_count() as f64);
client
}
fn idle_count(&self) -> usize {
// Implement based on your pool’s introspection
0
}
}
```
Guidance:
- Keep `db.pool.*` names stable. Prefer gauges for current levels and timers for waits.
- Consider periodic snapshots for totals (e.g., acquired/failed).
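A level gauge such as `db.pool.active` has to be decremented when a connection is returned, and a drop guard makes that hard to forget. The sketch below uses a plain `AtomicI64` in place of the gauge; `ACTIVE`, `ActiveGuard`, and `active` are hypothetical names.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for an "active connections" gauge.
static ACTIVE: AtomicI64 = AtomicI64::new(0);

/// Increments on creation and decrements on drop, so the level stays
/// correct on every exit path, including early returns and panics.
struct ActiveGuard;

impl ActiveGuard {
    fn acquire() -> Self {
        ACTIVE.fetch_add(1, Ordering::Relaxed);
        ActiveGuard
    }
}

impl Drop for ActiveGuard {
    fn drop(&mut self) {
        ACTIVE.fetch_sub(1, Ordering::Relaxed);
    }
}

fn active() -> i64 {
    ACTIVE.load(Ordering::Relaxed)
}
```

In a pool wrapper, the guard would be stored alongside the checked-out client so that returning the client and lowering the gauge happen together.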
<br>
<h3 id="background-job-processing">3. Background Job Processing</h3>
```rust
use metrics_lib::metrics;
pub struct Job { pub kind: &'static str }
pub async fn process_job(job: Job) {
let _guard = metrics().timer(&format!("job.{}.duration", job.kind)).start();
match execute_job(job).await {
Ok(_) => metrics().counter("jobs.success").inc(),
Err(_) => {
metrics().counter("jobs.failed").inc();
// Optional: trip a circuit breaker based on failures
// my_breaker.record_failure();
}
}
}
async fn execute_job(_job: Job) -> Result<(), ()> {
Ok(())
}
```
Guidance:
- Name metrics by job-kind for aggregate SLOs; avoid embedding unbounded IDs in metric names.
- Add a rate meter (e.g., `jobs.rate`) in the worker loop if you need throughput.
<br>
<h3 id="observability-stack-integration">4. Observability Stack Integration (metrics endpoint)</h3>
```rust
use metrics_lib::metrics;
use std::fmt::Write;
/// Expose a simple text endpoint for scraping
pub async fn metrics_endpoint() -> String {
    // Placeholder snapshot API; adapt to your registry access
    let reg = metrics().registry();
    let mut output = String::new();
    // Example formatting; adapt to Prometheus/OpenMetrics as needed
    for name in reg.counter_names() {
        // Clone before leaking so `name` is still available for formatting.
        let v = metrics().counter(Box::leak(name.clone().into_boxed_str())).get();
        let _ = writeln!(output, "# TYPE {} counter", name);
        let _ = writeln!(output, "{} {}", name, v);
    }
    output
}
```
Guidance:
- For Prometheus, prefer an OpenMetrics-compliant format and stable names.
- Keep export off the hot path; run in a separate async task.
<br>
<h3 id="correlation-with-tracing">5. Correlation with Tracing</h3>
```rust
use metrics_lib::metrics;
use std::time::Instant;
async fn do_work() {}
async fn traced_operation() {
// Example using an external tracing system; pseudocode span
// let span = tracing::span!(Level::INFO, "op");
// let _enter = span.enter();
let start = Instant::now();
do_work().await;
let dur = start.elapsed();
metrics().timer("operation").record(dur);
// span.record("timer.duration_ms", dur.as_millis() as i64);
}
```
Guidance:
- Use the same operation names between metrics and spans for easy join in dashboards.
- Record high-level spans and add targeted timers for critical sections.
<br>
<h3 id="grafana-dashboard-setup">6. Grafana Dashboard Setup (via Prometheus)</h3>
High-level steps:
1. Export metrics in a Prometheus/OpenMetrics-compatible format (see "Observability Stack Integration").
2. Configure Prometheus to scrape your service:
```yaml
scrape_configs:
  - job_name: 'metrics-lib-example'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s
```
3. In Grafana, add Prometheus as a data source and create a dashboard:
   - Panel examples:
     - Rate: `rate(http_requests_total[5m])`
     - Latency: `histogram_quantile(0.95, sum(rate(operation_duration_bucket[5m])) by (le))`
     - In-flight: `db_pool_active`
Tips:
- Keep metric names compliant and low-cardinality.
- Consider per-service prefixes, e.g., `auth_*`, `api_*`.
<br>
<h3 id="message-brokers-throughput">7. Message Brokers (Kafka/NATS) Throughput and Lag</h3>
```rust
use metrics_lib::metrics;
pub struct BrokerConsumer;
impl BrokerConsumer {
pub async fn on_batch(&self, batch_size: usize, current_lag: u64) {
// Throughput
metrics().rate("broker.consume").tick_n(batch_size as u32);
metrics().counter("broker.messages").add(batch_size as u64);
// Lag (gauge)
metrics().gauge("broker.lag").set(current_lag as f64);
// Batch processing time
let _t = metrics().timer("broker.batch.duration").start();
// ... process batch ...
}
}
```
Guidance:
- Use `rate` for instantaneous throughput and `counter` for cumulative messages.
- For Kafka consumer lag, prefer a gauge fed by the broker/consumer metrics.
<br>
<h3 id="caches-hit-miss-pool-metrics">8. Caches (Redis) Hit/Miss, Pool Metrics, TTL Health</h3>
```rust
use metrics_lib::metrics;
pub async fn cache_get(key: &str) -> Option<Vec<u8>> {
let _t = metrics().timer("cache.get").start();
// let result = redis.get(key).await?;
let result: Option<Vec<u8>> = None;
match result {
Some(v) => {
metrics().counter("cache.hit").inc();
Some(v)
}
None => {
metrics().counter("cache.miss").inc();
None
}
}
}
pub fn update_pool_metrics(active: usize, idle: usize) {
metrics().gauge("cache.pool.active").set(active as f64);
metrics().gauge("cache.pool.idle").set(idle as f64);
}
pub fn ttl_health(sampled_ttl_secs: u64) {
metrics().gauge("cache.ttl.sample").set(sampled_ttl_secs as f64);
}
```
Guidance:
- Track `hit/miss` counters; derive hit ratio in your dashboard.
- Record pool size as gauges; avoid per-connection metrics.
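When the ratio is needed in-process rather than in a dashboard, it can be derived from the two counter values. `hit_ratio` is a hypothetical helper that returns `None` before any lookups have happened, which avoids reporting a misleading 0% or a NaN.

```rust
/// Hypothetical helper: fraction of lookups served from cache, in [0, 1].
/// Returns `None` when there is no traffic yet or the sum would overflow.
fn hit_ratio(hits: u64, misses: u64) -> Option<f64> {
    let total = hits.checked_add(misses)?;
    if total == 0 {
        None
    } else {
        Some(hits as f64 / total as f64)
    }
}
```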
<br>
<h3 id="serverless-cold-start-and-duration">9. Serverless (AWS Lambda) Cold-Start and Duration</h3>
```rust
use metrics_lib::{init, metrics};
use std::time::Instant;
static START: std::sync::OnceLock<Instant> = std::sync::OnceLock::new();
// Pseudocode handler
pub async fn handler() {
    // Cold start detection: the first successful set of START marks a cold start
    let first = START.set(Instant::now()).is_ok();
    if first {
        // Initialize once per fresh runtime; repeated calls are no-ops.
        init();
        metrics().counter("lambda.cold_start").inc();
    }
    let _t = metrics().timer("lambda.invoke.duration").start();
    // ... handle request ...
}
```
Guidance:
- Cold-start counter increments once per fresh runtime.
- Use percentiles on `lambda.invoke.duration` to track tail latency.
<br>
<h3 id="kubernetes-scraping">10. Kubernetes Scraping & Pod-level Dashboards</h3>
Annotate your Deployment/Pod to expose metrics to Prometheus:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-lib-example
spec:
  replicas: 2
  selector:
    matchLabels: { app: metrics-lib-example }
  template:
    metadata:
      labels: { app: metrics-lib-example }
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: app
          image: your-image:tag
          ports:
            - containerPort: 8080
```
Dashboard tips:
- Per-pod panels: select by `pod` label for debugging noisy neighbors.
- SLO panels: aggregate across pods by `deployment`/`job`.
<br>
<h3 id="open-telemetry-export">11. OpenTelemetry Export Bridge (example skeleton)</h3>
```rust
// Bridge a metrics-lib snapshot into OpenTelemetry metrics (pseudocode)
use metrics_lib::metrics;

pub async fn export_to_otel() {
    // Access registry (adapt based on your API)
    let reg = metrics().registry();
    // Iterate counters
    for name in reg.counter_names() {
        // NOTE: `Box::leak` leaks a new string on every pass; cache the leaked
        // names if this function runs periodically.
        let total = metrics().counter(Box::leak(name.clone().into_boxed_str())).get();
        // otel_meter.u64_counter(name).add(total, &[]);
    }
    // Gauges, timers, and rates would be mapped similarly using OTLP exporters.
}
```
Guidance:
- Prefer push from a periodic task; avoid exporting on the hot path.
- Use OTLP/gRPC exporters and batch processors for efficiency.
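
The bridge skeleton obtains `'static` names through `Box::leak`, which leaks a fresh allocation on every export pass. If the export runs periodically, as recommended, one way to bound that growth is to leak each distinct name at most once. The sketch below assumes the counter lookup really does require `&'static str` (as the skeleton suggests); `intern` is a hypothetical helper, not a metrics-lib API.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Returns a `'static` copy of `name`, leaking each distinct name at most once.
pub fn intern(name: &str) -> &'static str {
    static NAMES: OnceLock<Mutex<HashSet<&'static str>>> = OnceLock::new();

    let mut names = NAMES
        .get_or_init(|| Mutex::new(HashSet::new()))
        .lock()
        // The set stays usable even if another thread panicked while holding the lock.
        .unwrap_or_else(|poisoned| poisoned.into_inner());

    // Reuse the previously leaked string when the name is already known.
    if let Some(&existing) = names.get(name) {
        return existing;
    }

    let leaked: &'static str = Box::leak(name.to_owned().into_boxed_str());
    names.insert(leaked);
    leaked
}
```

Memory use is then proportional to the number of distinct metric names rather than to the number of export passes. The lock is only taken on the export path, never on the metric hot path.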
<br>
<h3 id="nats-specific-queue">12. NATS-Specific Queue Depth and Consumers</h3>
```rust
use metrics_lib::metrics;

pub struct NatsStats {
    pub consumers: u32,
    pub pending: u64,
}

pub fn record_nats_queue(queue: &'static str, stats: NatsStats) {
    // Bounded name patterns per queue
    metrics().gauge(&format!("nats.{}.consumers", queue)).set(stats.consumers as f64);
    metrics().gauge(&format!("nats.{}.pending", queue)).set(stats.pending as f64);
}
```
Guidance:
- Prefer a fixed set of queue names; avoid dynamic/tenant IDs in metric names.
- For shard/partition details, use separate prefixed metrics rather than labels in names.
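
One way to enforce a fixed set of queue names at compile time is to model the queues as an enum that maps to constant metric names. The queue names below (`orders`, `billing`) are hypothetical placeholders, and the `Queue` type is a sketch, not part of metrics-lib.

```rust
/// Hypothetical fixed set of queues; add variants deliberately, never from user input.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Queue {
    Orders,
    Billing,
}

impl Queue {
    /// Maps a runtime queue name onto the fixed set; unknown names are rejected
    /// instead of creating a new metric.
    pub fn parse(name: &str) -> Option<Self> {
        match name {
            "orders" => Some(Self::Orders),
            "billing" => Some(Self::Billing),
            _ => None,
        }
    }

    /// Constant gauge name for the consumer count of this queue.
    pub const fn consumers_metric(self) -> &'static str {
        match self {
            Self::Orders => "nats.orders.consumers",
            Self::Billing => "nats.billing.consumers",
        }
    }

    /// Constant gauge name for the pending-message count of this queue.
    pub const fn pending_metric(self) -> &'static str {
        match self {
            Self::Orders => "nats.orders.pending",
            Self::Billing => "nats.billing.pending",
        }
    }
}
```

Because every name is a string literal, no formatting or allocation happens when recording, and a tenant ID can never leak into a metric name by accident.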
<br>
<h3 id="redis-latency-histogram">13. Redis Latency Histogram and Dashboard Queries</h3>
```rust
use metrics_lib::metrics;
use std::time::Instant;

pub async fn redis_set(key: &str, _val: &[u8]) {
    let start = Instant::now();
    // redis.set(key, val).await?;
    metrics().timer("redis.set").record(start.elapsed());
}

pub async fn redis_get(key: &str) {
    let start = Instant::now();
    // let _ = redis.get::<_, Option<Vec<u8>>>(key).await?;
    metrics().timer("redis.get").record(start.elapsed());
}
```
Grafana query tips (Prometheus examples):
- Hit ratio: `sum(rate(cache_hit[5m])) / (sum(rate(cache_hit[5m])) + sum(rate(cache_miss[5m])))`
- P95 get latency: `histogram_quantile(0.95, sum(rate(redis_get_duration_bucket[5m])) by (le))`
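
The queries use underscore-separated names (`cache_hit`) while the Rust code registers dotted names (`cache.hit`). The exact mapping, including suffixes such as `_duration_bucket`, depends on the exporter you use. As a sketch of the common convention, the hypothetical helper below rewrites a name to fit the Prometheus metric-name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```rust
/// Rewrites `name` so it matches `[a-zA-Z_:][a-zA-Z0-9_:]*`.
///
/// Invalid characters become `_`; a leading digit is kept but prefixed with `_`.
pub fn prometheus_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 1);
    for (i, c) in name.chars().enumerate() {
        let is_digit = c.is_ascii_digit();
        let valid = c.is_ascii_alphabetic() || c == '_' || c == ':' || (i > 0 && is_digit);
        if valid {
            out.push(c);
        } else if i == 0 && is_digit {
            // Digits are not allowed in first position.
            out.push('_');
            out.push(c);
        } else {
            out.push('_');
        }
    }
    out
}
```

Applying the same function in the exporter and when writing dashboard queries keeps the two sides from drifting apart.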
<br>
<h3 id="aws-lambda-emf">14. AWS Lambda EMF (Embedded Metric Format) Emission</h3>
```rust
// Emit selected metrics as EMF JSON to stdout for CloudWatch ingestion (pseudocode)
use metrics_lib::metrics;
use serde_json::json;

pub fn emit_emf() {
    let requests = metrics().counter("requests").get();
    let cold = metrics().counter("lambda.cold_start").get();
    let doc = json!({
        "_aws": {
            "Timestamp": chrono::Utc::now().timestamp_millis(),
            "CloudWatchMetrics": [{
                "Namespace": "metrics_lib",
                "Dimensions": [["service"]],
                "Metrics": [
                    {"Name": "requests", "Unit": "Count"},
                    {"Name": "lambda_cold_start", "Unit": "Count"}
                ]
            }]
        },
        "service": "example",
        "requests": requests,
        "lambda_cold_start": cold
    });
    println!("{}", doc);
}
```
Guidance:
- Keep EMF payloads small; emit periodically, not on every invocation.
- Use CloudWatch Logs subscription filters to forward to other sinks if needed.
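
"Emit periodically" can be implemented as a small gate that is checked on each invocation and only opens once per interval. The sketch below is a hypothetical helper built on `std::time`; the current time is passed in so the behaviour can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical throttle: allows one emission per `interval`.
pub struct EmitGate {
    interval: Duration,
    last_emit: Option<Instant>,
}

impl EmitGate {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_emit: None }
    }

    /// Returns `true` (and records `now`) on the first call, or when at least
    /// `interval` has elapsed since the last allowed emission.
    pub fn should_emit(&mut self, now: Instant) -> bool {
        let due = match self.last_emit {
            None => true,
            // Saturates to zero if `now` is earlier than `last`.
            Some(last) => now.saturating_duration_since(last) >= self.interval,
        };
        if due {
            self.last_emit = Some(now);
        }
        due
    }
}
```

A handler would keep the gate behind a `Mutex` in a `static` and call the EMF emitter only when `should_emit(Instant::now())` returns `true`.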
<br>
<h3 id="kubernetes-helm-values">15. Kubernetes Helm Values (Prometheus Scrape Annotations)</h3>
```yaml
# values.yaml fragment
service:
  port: 8080
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/path: "/metrics"
  # Helm does not render templates inside values.yaml; keep this in sync with service.port.
  prometheus.io/port: "8080"
```
```yaml
# deployment.yaml fragment (scrape annotations belong on the pod template)
spec:
  template:
    metadata:
      annotations:
        {{- toYaml .Values.podAnnotations | nindent 8 }}
```
Guidance:
- Centralize scrape annotations in `values.yaml` to keep templates clean.
- Prefer ServiceMonitors if using the Prometheus Operator.
<br>
<h3 id="otlp-exporter">16. Full OTLP Exporter Skeleton (tonic)</h3>
```rust
// Pseudocode: batch export counters/gauges to an OTLP collector via tonic
use metrics_lib::metrics;
// use opentelemetry_proto::collector::metrics::v1::metrics_service_client::MetricsServiceClient;
// use opentelemetry_proto::metrics::v1::*;

pub async fn export_otlp(_endpoint: &str) -> Result<(), Box<dyn std::error::Error>> {
    // let mut client = MetricsServiceClient::connect(endpoint.to_string()).await?;
    let reg = metrics().registry();
    // Build ResourceMetrics/ScopeMetrics/Metric structures here from registry
    // let request = ExportMetricsServiceRequest { resource_metrics: vec![ ... ] };
    // client.export(request).await?;
    Ok(())
}
```
Guidance:
- Use a background task and a bounded channel to batch and send metrics.
- Prefer gzip compression and delta temporality where supported for efficiency.
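
The bounded-channel advice can be sketched with the standard library's `sync_channel`: producers enqueue without blocking and drop samples when the queue is full, while the background exporter drains whatever is queued into one batch. `Sample`, `offer`, and `drain_batch` are hypothetical names for illustration; an async runtime's bounded channel would follow the same shape.

```rust
use std::sync::mpsc::{Receiver, SyncSender};

/// Hypothetical unit of work handed to the exporter task.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub name: &'static str,
    pub value: f64,
}

/// Non-blocking enqueue. Returns `false` when the queue is full or the
/// exporter has gone away, so the caller never stalls on export pressure.
pub fn offer(tx: &SyncSender<Sample>, sample: Sample) -> bool {
    tx.try_send(sample).is_ok()
}

/// Collects up to `max` queued samples without blocking.
pub fn drain_batch(rx: &Receiver<Sample>, max: usize) -> Vec<Sample> {
    rx.try_iter().take(max).collect()
}
```

Dropping on a full queue trades completeness for predictable latency; counting the rejected `offer` calls in a separate counter makes that loss visible.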
<br>
<h3 id="grafana-dashboard-setup">17. Grafana Panels (Ready-to-Copy JSON)</h3>
These minimal panels assume a Prometheus datasource whose UID is `Prometheus`. Adjust the `datasource` UID/name as needed.
Rate panel (requests per second):
```json
{
  "type": "timeseries",
  "title": "HTTP Requests/s",
  "datasource": { "type": "prometheus", "uid": "Prometheus" },
  "targets": [
    { "expr": "rate(http_requests_total[5m])", "legendFormat": "req/s" }
  ],
  "fieldConfig": { "defaults": { "unit": "req/s" }, "overrides": [] }
}
```
Latency panel (P95 from histogram):
```json
{
  "type": "timeseries",
  "title": "p95 Operation Duration",
  "datasource": { "type": "prometheus", "uid": "Prometheus" },
  "targets": [
    { "expr": "histogram_quantile(0.95, sum(rate(operation_duration_bucket[5m])) by (le))", "legendFormat": "p95" }
  ],
  "fieldConfig": { "defaults": { "unit": "s" }, "overrides": [] }
}
```
Gauge panel (queue depth):
```json
{
  "type": "gauge",
  "title": "Queue Depth",
  "datasource": { "type": "prometheus", "uid": "Prometheus" },
  "targets": [
    { "expr": "nats_myqueue_pending" }
  ],
  "fieldConfig": { "defaults": { "unit": "none" }, "overrides": [] }
}
```
Tip: To embed into an existing dashboard JSON, copy each object into the dashboard `panels` array and position/size them via `gridPos`.
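
Grafana positions panels on a grid that is 24 columns wide, and `gridPos` gives each panel's `x`, `y`, `w`, and `h` in grid units. If you generate dashboards programmatically, a simple left-to-right layout can be computed as in this sketch; `GridPos` and `layout` are hypothetical helpers for illustration only.

```rust
/// Mirrors the fields of a Grafana panel's `gridPos` object.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GridPos {
    pub x: u32,
    pub y: u32,
    pub w: u32,
    pub h: u32,
}

/// Width of the Grafana dashboard grid, in columns.
pub const GRID_COLUMNS: u32 = 24;

/// Lays out `count` equally sized panels left to right, wrapping to a new
/// row when the next panel would exceed the grid width.
pub fn layout(count: usize, w: u32, h: u32) -> Vec<GridPos> {
    // Keep the width within 1..=24 so at least one panel fits per row.
    let w = w.clamp(1, GRID_COLUMNS);
    let per_row = GRID_COLUMNS / w;
    (0..count as u32)
        .map(|i| GridPos {
            x: (i % per_row) * w,
            y: (i / per_row) * h,
            w,
            h,
        })
        .collect()
}
```

For example, three panels of width 12 and height 8 land at `(0, 0)`, `(12, 0)`, and `(0, 8)`.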
<br>
<h3 id="prometheus-operator-servicemonitor">18. Prometheus Operator ServiceMonitor</h3>
If your cluster uses the Prometheus Operator, define a `ServiceMonitor` instead of raw scrape annotations.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metrics-lib-example
  labels:
    release: prometheus # matches your Prometheus helm release selector
spec:
  selector:
    matchLabels:
      app: metrics-lib-example
  namespaceSelector:
    matchNames: ["default"]
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
Example Service to pair with it:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: metrics-lib-example
  labels:
    app: metrics-lib-example
spec:
  selector:
    app: metrics-lib-example
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```
<br>
<h3 id="full-grafana-dashboard">19. Full Grafana Dashboard (Ready-to-Import JSON)</h3>
This compact dashboard includes three panels (Requests/s, p95 latency, Queue depth). Replace the datasource `uid` as needed.
```json
{
  "title": "metrics-lib Example",
  "schemaVersion": 39,
  "panels": [
    {
      "type": "timeseries",
      "title": "HTTP Requests/s",
      "datasource": { "type": "prometheus", "uid": "Prometheus" },
      "targets": [{ "expr": "rate(http_requests_total[5m])", "legendFormat": "req/s" }],
      "fieldConfig": { "defaults": { "unit": "req/s" }, "overrides": [] },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
    },
    {
      "type": "timeseries",
      "title": "p95 Operation Duration",
      "datasource": { "type": "prometheus", "uid": "Prometheus" },
      "targets": [{ "expr": "histogram_quantile(0.95, sum(rate(operation_duration_bucket[5m])) by (le))", "legendFormat": "p95" }],
      "fieldConfig": { "defaults": { "unit": "s" }, "overrides": [] },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
    },
    {
      "type": "gauge",
      "title": "Queue Depth",
      "datasource": { "type": "prometheus", "uid": "Prometheus" },
      "targets": [{ "expr": "nats_myqueue_pending" }],
      "fieldConfig": { "defaults": { "unit": "none" }, "overrides": [] },
      "gridPos": { "h": 8, "w": 6, "x": 0, "y": 8 }
    }
  ],
  "time": { "from": "now-6h", "to": "now" },
  "refresh": "30s"
}
```
<br>
<h3 id="prometheus-recording-rules">20. Prometheus Recording Rules (Latency and Rates)</h3>
Reduce query cost by materializing common expressions.
```yaml
groups:
  - name: metrics-lib.rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:operation_duration:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(operation_duration_bucket[5m]))
          )
      - record: job:broker_consume:rate5m
        expr: sum by (job) (rate(broker_messages_total[5m]))
```
<br>
<h3 id="prometheus-operator-servicemonitor-secured">21. Prometheus Operator ServiceMonitor (Secured Endpoint)</h3>
For endpoints protected by TLS and a bearer token. Assumes a secret `metrics-bearer` with key `token` and a secret `metrics-ca` holding the CA bundle under key `ca.crt`.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metrics-lib-example-secured
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: metrics-lib-example
  namespaceSelector:
    matchNames: ["default"]
  endpoints:
    - port: https
      path: /metrics
      interval: 15s
      scheme: https
      tlsConfig:
        ca:
          secret:
            name: metrics-ca
            key: ca.crt
        insecureSkipVerify: false
      bearerTokenSecret:
        name: metrics-bearer
        key: token
```
<br>
<h3 id="helm-snippets">22. Helm Snippets (kube-prometheus-stack and App Chart)</h3>
- kube-prometheus-stack values: `docs/k8s/helm/kube-prometheus-stack-values.yaml`
  - Includes `additionalServiceMonitors` and `additionalPrometheusRulesMap` for a quick drop-in.
  - Apply:
    - `helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`
    - `helm repo update`
    - `helm upgrade --install monitoring prometheus-community/kube-prometheus-stack -f docs/k8s/helm/kube-prometheus-stack-values.yaml`
- Example application Helm chart templates:
  - Values: `docs/k8s/helm/app-chart/values.yaml`
  - Templates: `docs/k8s/helm/app-chart/templates/servicemonitor.yaml`, `prometheusrule.yaml`
  - Enable via values:
    - `.Values.metrics.serviceMonitor.enabled: true`
    - `.Values.metrics.rules.enabled: true`
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
## Notes
- All hot-path operations are lock-free and allocation-free where possible.
- For best latency, prefer batching (`Counter::batch_inc`, `AsyncMetricBatch`) in bursty workloads.
- Avoid calling `metrics()` before `init()`. In library code, consider taking `&MetricsCore` explicitly.
- For specialized meters/gauges, see the `specialized` submodules re-exported as `gauge_specialized` and `rate_meter_specialized`.
- Keep limiter metrics sparse; avoid per-user limiters unless cardinality is controlled.
- For multi-tenant systems, expose only tier-level or route-level aggregates.
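
The batching advice comes down to accumulating locally and touching shared state once per batch. The sketch below illustrates the idea with a plain `AtomicU64` from the standard library; it is not metrics-lib's implementation, and `count_batched` is a hypothetical helper.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts the items matching `pred` in a local variable and publishes the
/// total with a single atomic add, instead of one atomic add per item.
pub fn count_batched<T>(
    items: &[T],
    pred: impl Fn(&T) -> bool,
    counter: &AtomicU64,
) -> u64 {
    let local = items.iter().filter(|item| pred(*item)).count() as u64;
    if local > 0 {
        // One contended write per batch keeps cache-line traffic low.
        counter.fetch_add(local, Ordering::Relaxed);
    }
    local
}
```

With many threads processing bursts, this reduces contention on the shared counter from once per item to once per batch, at the cost of the shared value lagging until the batch completes.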
<hr>
<br>
<a href="#top">↑ <b>TOP</b></a>
<br>
<div align="center">
<h2></h2>
<sup>COPYRIGHT <small>©</small> 2025 <strong>JAMES GOBER.</strong></sup>
</div>