Seatbelt

Resilience and recovery mechanisms for fallible operations.

Quick Start

Add resilience to fallible operations, such as RPC calls over the network, with just a few lines of code. Retry handles transient failures and Timeout prevents operations from hanging indefinitely:

use layered::{Execute, Service, Stack};
use seatbelt::retry::Retry;
use seatbelt::timeout::Timeout;
use seatbelt::{RecoveryInfo, ResilienceContext};

let context = ResilienceContext::new(&clock);
let service = (
    // Retry middleware: Automatically retries failed operations
    Retry::layer("retry", &context)
        .clone_input()
        .recovery_with(|output: &String, _| match output.as_str() {
            "temporary_error" => RecoveryInfo::retry(),
            "operation timed out" => RecoveryInfo::retry(),
            _ => RecoveryInfo::never(),
        }),
    // Timeout middleware: Cancels operations that take too long
    Timeout::layer("timeout", &context)
        .timeout_output(|_| "operation timed out".to_string())
        .timeout(Duration::from_secs(30)),
    // Your core business logic
    Execute::new(my_string_operation),
)
    .into_service();

let result = service.execute("input data".to_string()).await;

Why?

Communicating over a network is inherently fraught with problems. The network can go down at any time, sometimes for a millisecond or two. The endpoint you’re connecting to may crash or be rebooted, network configuration may change from under you, etc. To deliver a robust experience to users, and to achieve 5 or more 9s of availability, it is imperative to implement robust resilience patterns to mask these transient failures.

This crate provides production-ready resilience middleware with excellent telemetry for building robust distributed systems that can automatically handle timeouts, retries, and other failure scenarios.

Production-ready - Battle-tested middleware with sensible defaults and comprehensive configuration options.
Excellent telemetry - Built-in support for metrics and structured logging to monitor resilience behavior in production.
Runtime agnostic - Works seamlessly across any async runtime. Use the same resilience patterns across different projects and migrate between runtimes without changes.

Overview

This crate uses the layered crate for composing middleware. The middleware layers can be stacked together using tuples and built into a service using the Stack trait.

Resilience middleware also requires Clock from the tick crate for timing operations like delays, timeouts, and backoff calculations. The clock is passed through ResilienceContext when creating middleware layers.

Core Types

ResilienceContext - Holds shared state for resilience middleware, including the clock.
RecoveryInfo - Classifies errors as recoverable (transient) or non-recoverable (permanent).
Recovery - A trait for types that can determine their recoverability.

Built-in Middleware

This crate provides built-in resilience middleware that you can use out of the box. See the documentation for each module for details on how to use them.

timeout - Middleware that cancels long-running operations.
retry - Middleware that automatically retries failed operations.
breaker - Middleware that prevents cascading failures.

Features

This crate provides several optional features that can be enabled in your Cargo.toml:

timeout - Enables the timeout middleware for canceling long-running operations.
retry - Enables the retry middleware for automatically retrying failed operations with configurable backoff strategies, jitter, and recovery classification.
breaker - Enables the breaker middleware for preventing cascading failures.
metrics - Exposes the OpenTelemetry metrics API for collecting and reporting metrics.
logs - Enables structured logging for resilience middleware using the tracing crate.