Resilience and recovery mechanisms for fallible operations.
§Quick Start
Add resilience to fallible operations, such as RPC calls over the network, with just a few lines of code. Retry handles transient failures and Timeout prevents operations from hanging indefinitely:
use std::time::Duration;

use layered::{Execute, Service, Stack};
use seatbelt::retry::Retry;
use seatbelt::timeout::Timeout;
use seatbelt::{RecoveryInfo, ResilienceContext};

// `clock` is a Clock from the `tick` crate (setup elided).
let context = ResilienceContext::new(&clock);

let service = (
    // Retry middleware: automatically retries failed operations
    Retry::layer("retry", &context)
        .clone_input()
        .recovery_with(|output: &String, _| match output.as_str() {
            "temporary_error" => RecoveryInfo::retry(),
            "operation timed out" => RecoveryInfo::retry(),
            _ => RecoveryInfo::never(),
        }),
    // Timeout middleware: cancels operations that take too long
    Timeout::layer("timeout", &context)
        .timeout_output(|_| "operation timed out".to_string())
        .timeout(Duration::from_secs(30)),
    // Your core business logic
    Execute::new(my_string_operation),
)
    .into_service();

let result = service.execute("input data".to_string()).await;

§Why?
Communicating over a network is inherently fraught with problems. The network can go down at any time,
sometimes for only a millisecond or two. The endpoint you’re connecting to may crash or be rebooted,
network configuration may change out from under you, and so on. To deliver a reliable experience to users,
and to achieve five or more nines of availability, it is imperative to implement resilience patterns that
mask these transient failures.
This crate provides production-ready resilience middleware with excellent telemetry for building robust distributed systems that can automatically handle timeouts, retries, and other failure scenarios.
- Production-ready - Battle-tested middleware with sensible defaults and comprehensive configuration options.
- Excellent telemetry - Built-in support for metrics and structured logging to monitor resilience behavior in production.
- Runtime agnostic - Works seamlessly across any async runtime. Use the same resilience patterns across different projects and migrate between runtimes without changes.
§Overview
This crate uses the layered crate for composing middleware. The middleware layers
can be stacked together using tuples and built into a service using the Stack trait.
Resilience middleware also requires Clock from the tick crate for timing
operations like delays, timeouts, and backoff calculations. The clock is passed through
ResilienceContext when creating middleware layers.
§Core Types
- ResilienceContext - Holds shared state for resilience middleware, including the clock.
- RecoveryInfo - Classifies errors as recoverable (transient) or non-recoverable (permanent).
- Recovery - A trait for types that can determine their recoverability.
§Built-in Middleware
This crate provides built-in resilience middleware that you can use out of the box. See the documentation for each module for details on how to use them.
- timeout - Middleware that cancels long-running operations.
- retry - Middleware that automatically retries failed operations.
- hedging - Middleware that reduces tail latency via additional concurrent execution.
- breaker - Middleware that prevents cascading failures.
- fallback - Middleware that replaces invalid output with a user-defined alternative.
§Chaos Testing
The chaos module provides middleware for deliberately injecting faults into a service
pipeline, enabling teams to verify that their systems handle failures gracefully.
- chaos::injection - Middleware that replaces service output with a user-provided value at a configurable probability.
- chaos::latency - Middleware that injects artificial delay before the inner service call at a configurable probability.
§Middleware Ordering
The order in which resilience middleware is composed matters. Layers apply outer to inner (the first layer in the tuple is outermost). A recommended ordering:
Request → [Fallback → [Retry → [Breaker → [Timeout → Operation]]]]
- Fallback (outermost): guarantees a usable response even if every retry is exhausted.
- Retry: retries the entire inner stack; each attempt gets its own timeout.
- Breaker: short-circuits failing calls so retry can back off until the breaker resets.
- Timeout (innermost): bounds each individual attempt.
Keep Timeout inside Retry so that a timed-out attempt is aborted and retried
correctly. If Timeout were outside, a single timeout would govern all attempts combined
and could cancel everything with no chance to recover.
§Tower Compatibility
All resilience middleware are compatible with the Tower ecosystem when the tower-service
feature is enabled. This allows you to use tower::ServiceBuilder to compose middleware stacks:
use std::time::Duration;

use seatbelt::retry::Retry;
use seatbelt::timeout::Timeout;
use seatbelt::{RecoveryInfo, ResilienceContext};
use tower::ServiceBuilder;

// `clock` is a Clock from the `tick` crate (setup elided).
let context: ResilienceContext<String, Result<String, String>> = ResilienceContext::new(&clock);

let service = ServiceBuilder::new()
    .layer(
        Retry::layer("my_retry", &context)
            .clone_input()
            .recovery_with(|result: &Result<String, String>, _| match result {
                Ok(_) => RecoveryInfo::never(),
                Err(_) => RecoveryInfo::retry(),
            }),
    )
    .layer(
        Timeout::layer("my_timeout", &context)
            .timeout(Duration::from_secs(30))
            .timeout_error(|_| "operation timed out".to_string()),
    )
    .service_fn(|input: String| async move { Ok::<_, String>(format!("processed: {input}")) });

§Examples
Examples covering each middleware and common composition patterns:
- timeout: Basic timeout that cancels long-running operations.
- timeout_advanced: Dynamic timeout duration and timeout callbacks.
- retry: Automatic retry with input cloning and recovery classification.
- retry_advanced: Custom input cloning with attempt metadata injection.
- retry_outage: Input restoration from errors when cloning is not possible.
- breaker: Circuit breaker that monitors failure rates.
- hedging: Hedging slow requests with parallel attempts to reduce tail latency.
- fallback: Substitutes default values for invalid outputs.
- resilience_pipeline: Composing retry and timeout with metrics.
- tower: Tower ServiceBuilder integration.
- config: Loading settings from a JSON file.
- chaos_injection: Fault injection with configurable probability.
- chaos_injection_advanced: Simulating an extended outage with dynamic injection rates.
- chaos_latency: Injecting artificial delay with configurable probability.
§Features
This crate provides several optional features that can be enabled in your Cargo.toml:
- timeout - Enables the timeout middleware for canceling long-running operations.
- retry - Enables the retry middleware for automatically retrying failed operations with configurable backoff strategies, jitter, and recovery classification.
- hedging - Enables the hedging middleware for reducing tail latency via additional concurrent requests with configurable delay modes.
- breaker - Enables the breaker middleware for preventing cascading failures.
- fallback - Enables the fallback middleware for replacing invalid output with a user-defined alternative.
- chaos-injection - Enables the chaos::injection middleware for injecting faults with a configurable probability.
- chaos-latency - Enables the chaos::latency middleware for injecting artificial delay with a configurable probability.
- metrics - Exposes the OpenTelemetry metrics API for collecting and reporting metrics.
- logs - Enables structured logging for resilience middleware using the tracing crate.
- serde - Enables serde::Serialize and serde::Deserialize implementations for configuration types.
- tower-service - Enables tower_service::Service trait implementations for all resilience middleware.
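For example, a dependency declaration enabling a subset of these features might look like the following (the version number is a placeholder, not a published release):

```toml
[dependencies]
seatbelt = { version = "x.y", features = ["retry", "timeout", "metrics"] }
```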
Modules§
- breaker - Circuit breaker resilience middleware for preventing cascading failures.
- chaos - Chaos engineering middleware for testing resilience under failure conditions. (Requires the chaos-injection or chaos-latency feature.)
- fallback - Fallback resilience middleware for services, applications, and libraries.
- hedging - Hedging resilience middleware for reducing tail latency via additional concurrent execution.
- retry - Retry resilience middleware for services, applications, and libraries.
- timeout - Timeout resilience middleware for services, applications, and libraries.
- typestates - Type state markers for builder patterns.
Structs§
- Attempt - Tracks the current attempt within a resilience operation.
- RecoveryInfo - The recovery information associated with an operation or condition.
- ResilienceContext - Shared configuration and dependencies for a pipeline of resilience middleware.
Enums§
- RecoveryKind - Kind of recovery that can be attempted.
Traits§
- Recovery - Enables types to indicate their recovery information.