seatbelt 0.4.4

// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

#![cfg_attr(coverage_nightly, feature(coverage_attribute))]
#![cfg_attr(docsrs, feature(doc_cfg))]
#![doc(html_logo_url = "https://media.githubusercontent.com/media/microsoft/oxidizer/refs/heads/main/crates/seatbelt/logo.png")]
#![doc(html_favicon_url = "https://media.githubusercontent.com/media/microsoft/oxidizer/refs/heads/main/crates/seatbelt/favicon.ico")]
#![cfg_attr(
    not(all(
        feature = "retry",
        feature = "timeout",
        feature = "breaker",
        feature = "fallback",
        feature = "hedging",
        feature = "chaos-injection",
        feature = "chaos-latency",
        feature = "metrics",
        feature = "logs"
    )),
    expect(
        rustdoc::broken_intra_doc_links,
        reason = "intra-doc links break when not all features are enabled"
    )
)]

//! Resilience and recovery mechanisms for fallible operations.
//!
//! # Quick Start
//!
//! Add resilience to fallible operations, such as RPC calls over the network, with just a few lines of code.
//! **Retry** handles transient failures and **Timeout** prevents operations from hanging indefinitely:
//!
//! ```rust
//! # #[cfg(all(feature = "retry", feature = "timeout"))]
//! # {
//! # use std::time::Duration;
//! # use tick::Clock;
//! use layered::{Execute, Service, Stack};
//! use seatbelt::retry::Retry;
//! use seatbelt::timeout::Timeout;
//! use seatbelt::{RecoveryInfo, ResilienceContext};
//!
//! # async fn main(clock: Clock) {
//! let context = ResilienceContext::new(&clock);
//! let service = (
//!     // Retry middleware: Automatically retries failed operations
//!     Retry::layer("retry", &context)
//!         .clone_input()
//!         .recovery_with(|output: &String, _| match output.as_str() {
//!             "temporary_error" => RecoveryInfo::retry(),
//!             "operation timed out" => RecoveryInfo::retry(),
//!             _ => RecoveryInfo::never(),
//!         }),
//!     // Timeout middleware: Cancels operations that take too long
//!     Timeout::layer("timeout", &context)
//!         .timeout_output(|_| "operation timed out".to_string())
//!         .timeout(Duration::from_secs(30)),
//!     // Your core business logic
//!     Execute::new(my_string_operation),
//! )
//!     .into_service();
//!
//! let result = service.execute("input data".to_string()).await;
//! # }
//! # async fn my_string_operation(input: String) -> String {
//! #     // Simulate processing that transforms the input string
//! #     format!("processed: {}", input)
//! # }
//! # }
//! ```
//!
//! # Why?
//!
//! Communicating over a network is inherently fraught with problems. The network can go down at any time,
//! sometimes for a millisecond or two. The endpoint you're connecting to may crash or be rebooted,
//! network configuration may change from under you, and so on. To deliver a reliable experience to users,
//! and to achieve five or more nines of availability, it is imperative to implement resilience patterns
//! that mask these transient failures.
//!
//! This crate provides production-ready resilience middleware with excellent telemetry for building
//! robust distributed systems that can automatically handle timeouts, retries, and other failure
//! scenarios.
//!
//! - **Production-ready** - Battle-tested middleware with sensible defaults and comprehensive
//!   configuration options.
//! - **Excellent telemetry** - Built-in support for metrics and structured logging to monitor
//!   resilience behavior in production.
//! - **Runtime agnostic** - Works with any async runtime. Use the same resilience
//!   patterns across different projects and migrate between runtimes without code changes.
//!
//! # Overview
//!
//! This crate uses the [`layered`] crate for composing middleware. The middleware layers
//! can be stacked together using tuples and built into a service using the [`Stack`][layered::Stack] trait.
//!
//! Resilience middleware also requires [`Clock`][tick::Clock] from the [`tick`] crate for timing
//! operations like delays, timeouts, and backoff calculations. The clock is passed through
//! [`ResilienceContext`] when creating middleware layers.
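//!
//! To make the stacking model concrete, the following is a minimal sketch that uses only the
//! standard library. It does not use the real [`layered`] types: `Handler` and `with_label` are
//! hypothetical names invented for this illustration. Each "layer" wraps the next handler, so
//! the outermost layer sees the request first and the response last:
//!
//! ```rust
//! /// Hypothetical stand-in for a service: takes an input, returns an output.
//! type Handler = Box<dyn Fn(String) -> String>;
//!
//! /// Hypothetical layer that tags the request on the way in and the response on the way out.
//! fn with_label(label: &'static str, inner: Handler) -> Handler {
//!     Box::new(move |input: String| {
//!         // Runs before the inner handler (request path).
//!         let output = inner(format!("{input} > {label}"));
//!         // Runs after the inner handler (response path).
//!         format!("{output} < {label}")
//!     })
//! }
//!
//! fn main() {
//!     let operation: Handler = Box::new(|input: String| format!("{input} > op"));
//!     // "outer" wraps "inner", which wraps the operation.
//!     let service = with_label("outer", with_label("inner", operation));
//!     assert_eq!(
//!         service("req".to_string()),
//!         "req > outer > inner > op < inner < outer"
//!     );
//! }
//! ```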
//!
//! ## Core Types
//!
//! - [`ResilienceContext`] - Holds shared state for resilience middleware, including the clock.
//! - [`RecoveryInfo`] - Classifies errors as recoverable (transient) or non-recoverable (permanent).
//! - [`Recovery`] - A trait for types that can determine their recoverability.
//!
//! ## Built-in Middleware
//!
//! This crate provides built-in resilience middleware that you can use out of the box. See the documentation
//! for each module for details on how to use them.
//!
//! - [`timeout`] - Middleware that cancels long-running operations.
//! - [`retry`] - Middleware that automatically retries failed operations.
//! - [`hedging`] - Middleware that reduces tail latency via additional concurrent execution.
//! - [`breaker`] - Middleware that prevents cascading failures.
//! - [`fallback`] - Middleware that replaces invalid output with a user-defined alternative.
//!
//! ## Chaos Testing
//!
//! The [`chaos`] module provides middleware for deliberately injecting faults into a service
//! pipeline, enabling teams to verify that their systems handle failures gracefully.
//!
//! - [`chaos::injection`] - Middleware that replaces service output with a user-provided value
//!   at a configurable probability.
//! - [`chaos::latency`] - Middleware that injects artificial delay before the inner service
//!   call at a configurable probability.
//!
//! # Middleware Ordering
//!
//! The order in which resilience middleware is composed **matters**. Layers apply outer to inner
//! (the first layer in the tuple is outermost). A recommended ordering:
//!
//! ```text
//! Request → [Fallback → [Retry → [Breaker → [Timeout → Operation]]]]
//! ```
//!
//! - **Fallback** (outermost): guarantees a usable response even if every retry is exhausted.
//! - **Retry**: retries the entire inner stack; each attempt gets its own timeout.
//! - **Breaker**: short-circuits failing calls so retry can back off until the breaker resets.
//! - **Timeout** (innermost): bounds each individual attempt.
//!
//! Keep `Timeout` **inside** `Retry` so that a timed-out attempt is aborted and retried
//! correctly. If `Timeout` were outside, a single timeout would govern all attempts combined
//! and could cancel everything with no chance to recover.
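//!
//! The difference between the two orderings can be sketched with a small simulation that uses
//! only the standard library. It is a simplified model, not this crate's implementation:
//! `SimulatedCall`, `Outcome`, and both functions are hypothetical names, latencies are plain
//! numbers rather than real waits, and retry backoff delays are ignored.
//!
//! ```rust
//! use std::time::Duration;
//!
//! /// Hypothetical description of one attempt: how long it takes and whether it succeeds.
//! struct SimulatedCall {
//!     latency: Duration,
//!     ok: bool,
//! }
//!
//! #[derive(Debug, PartialEq)]
//! enum Outcome {
//!     Success,
//!     Failed,
//! }
//!
//! /// Timeout inside retry: every attempt gets its own `per_attempt` budget.
//! fn timeout_inside_retry(calls: &[SimulatedCall], per_attempt: Duration) -> Outcome {
//!     for call in calls {
//!         if call.latency <= per_attempt && call.ok {
//!             return Outcome::Success;
//!         }
//!         // The attempt timed out or failed; move on to the next attempt.
//!     }
//!     Outcome::Failed
//! }
//!
//! /// Timeout outside retry: a single `total` budget is shared by all attempts.
//! fn timeout_outside_retry(calls: &[SimulatedCall], total: Duration) -> Outcome {
//!     let mut elapsed = Duration::ZERO;
//!     for call in calls {
//!         elapsed += call.latency;
//!         if elapsed > total {
//!             // The shared budget ran out; nothing is left to retry with.
//!             return Outcome::Failed;
//!         }
//!         if call.ok {
//!             return Outcome::Success;
//!         }
//!     }
//!     Outcome::Failed
//! }
//!
//! fn main() {
//!     // The first attempt hangs for two minutes; the second would finish in one second.
//!     let calls = [
//!         SimulatedCall { latency: Duration::from_secs(120), ok: true },
//!         SimulatedCall { latency: Duration::from_secs(1), ok: true },
//!     ];
//!     let budget = Duration::from_secs(30);
//!
//!     assert_eq!(timeout_inside_retry(&calls, budget), Outcome::Success);
//!     assert_eq!(timeout_outside_retry(&calls, budget), Outcome::Failed);
//! }
//! ```
//!
//! In this model the hanging first attempt consumes the whole outer budget, so the fast second
//! attempt never gets a chance to run when the timeout wraps the retry.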
//!
//! # Tower Compatibility
//!
//! All resilience middleware is compatible with the Tower ecosystem when the `tower-service`
//! feature is enabled. This allows you to use `tower::ServiceBuilder` to compose middleware stacks:
//!
//! ```rust
//! # use std::time::Duration;
//! # use tick::Clock;
//! use seatbelt::retry::Retry;
//! use seatbelt::timeout::Timeout;
//! use seatbelt::{RecoveryInfo, ResilienceContext};
//! use tower::ServiceBuilder;
//!
//! # async fn example(clock: Clock) {
//! let context: ResilienceContext<String, Result<String, String>> = ResilienceContext::new(&clock);
//!
//! let service = ServiceBuilder::new()
//!     .layer(
//!         Retry::layer("my_retry", &context)
//!             .clone_input()
//!             .recovery_with(|result: &Result<String, String>, _| match result {
//!                 Ok(_) => RecoveryInfo::never(),
//!                 Err(_) => RecoveryInfo::retry(),
//!             }),
//!     )
//!     .layer(
//!         Timeout::layer("my_timeout", &context)
//!             .timeout(Duration::from_secs(30))
//!             .timeout_error(|_| "operation timed out".to_string()),
//!     )
//!     .service_fn(|input: String| async move { Ok::<_, String>(format!("processed: {input}")) });
//! # }
//! ```
//!
//! # Examples
//!
//! Examples covering each middleware and common composition patterns:
//!
//! - [`timeout`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/timeout.rs): Basic timeout that cancels long-running operations.
//! - [`timeout_advanced`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/timeout_advanced.rs): Dynamic timeout duration and timeout callbacks.
//! - [`retry`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/retry.rs): Automatic retry with input cloning and recovery classification.
//! - [`retry_advanced`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/retry_advanced.rs): Custom input cloning with attempt metadata injection.
//! - [`retry_outage`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/retry_outage.rs): Input restoration from errors when cloning is not possible.
//! - [`breaker`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/breaker.rs): Circuit breaker that monitors failure rates.
//! - [`hedging`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/hedging.rs): Hedging slow requests with parallel attempts to reduce tail latency.
//! - [`fallback`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/fallback.rs): Substitutes default values for invalid outputs.
//! - [`resilience_pipeline`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/resilience_pipeline.rs): Composing retry and timeout with metrics.
//! - [`tower`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/tower.rs): Tower `ServiceBuilder` integration.
//! - [`config`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/config.rs): Loading settings from a [JSON file](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/config.json).
//! - [`chaos_injection`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/chaos_injection.rs): Fault injection with configurable probability.
//! - [`chaos_injection_advanced`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/chaos_injection_advanced.rs): Simulating an extended outage with dynamic injection rates.
//! - [`chaos_latency`](https://github.com/microsoft/oxidizer/blob/main/crates/seatbelt/examples/chaos_latency.rs): Injecting artificial delay with configurable probability.
//!
//! # Features
//!
//! This crate provides several optional features that can be enabled in your `Cargo.toml`:
//!
//! - **`timeout`** - Enables the [`timeout`] middleware for canceling long-running operations.
//! - **`retry`** - Enables the [`retry`] middleware for automatically retrying failed operations with
//!   configurable backoff strategies, jitter, and recovery classification.
//! - **`hedging`** - Enables the [`hedging`] middleware for reducing tail latency via additional
//!   concurrent requests with configurable delay modes.
//! - **`breaker`** - Enables the [`breaker`] middleware for preventing cascading failures.
//! - **`fallback`** - Enables the [`fallback`] middleware for replacing invalid output with a
//!   user-defined alternative.
//! - **`chaos-injection`** - Enables the [`chaos::injection`] middleware for injecting faults
//!   with a configurable probability.
//! - **`chaos-latency`** - Enables the [`chaos::latency`] middleware for injecting artificial
//!   delay with a configurable probability.
//! - **`metrics`** - Exposes the OpenTelemetry metrics API for collecting and reporting metrics.
//! - **`logs`** - Enables structured logging for resilience middleware using the `tracing` crate.
//! - **`serde`** - Enables `serde::Serialize` and `serde::Deserialize` implementations for
//!   configuration types.
//! - **`tower-service`** - Enables [`tower_service::Service`] trait implementations for all
//!   resilience middleware.

#[doc(inline)]
pub use recoverable::{Recovery, RecoveryInfo, RecoveryKind};

mod context;
pub use context::ResilienceContext;

pub(crate) mod attempt;
pub use attempt::Attempt;

pub mod typestates;

#[cfg(any(feature = "timeout", test))]
pub mod timeout;

#[cfg(any(feature = "retry", test))]
pub mod retry;

#[cfg(any(feature = "breaker", test))]
pub mod breaker;

#[cfg(any(feature = "fallback", test))]
pub mod fallback;

#[cfg(any(feature = "hedging", test))]
pub mod hedging;

#[cfg(any(feature = "chaos-injection", feature = "chaos-latency", test))]
pub mod chaos;

#[cfg(any(
    feature = "retry",
    feature = "breaker",
    feature = "chaos-injection",
    feature = "chaos-latency",
    test
))]
mod rnd;

#[cfg(any(
    feature = "retry",
    feature = "breaker",
    feature = "timeout",
    feature = "fallback",
    feature = "hedging",
    feature = "chaos-injection",
    feature = "chaos-latency",
    test
))]
pub(crate) mod utils;

#[cfg(any(feature = "metrics", test))]
mod metrics;

#[cfg_attr(coverage_nightly, coverage(off))]
#[cfg(test)]
pub(crate) mod testing;

pub(crate) type TelemetryString = std::borrow::Cow<'static, str>;