tower-resilience
A comprehensive resilience and fault-tolerance toolkit for Tower services, inspired by Resilience4j.
About
Tower-resilience provides composable middleware for building robust distributed systems in Rust. Tower is a library of modular and reusable components for building robust networking clients and servers. This crate extends Tower with resilience patterns commonly needed in production systems.
Inspired by Resilience4j, a fault tolerance library for Java, tower-resilience adapts these battle-tested patterns to Rust's async ecosystem and Tower's middleware model.
Resilience Patterns
- Circuit Breaker - Prevents cascading failures by stopping calls to failing services
- Bulkhead - Isolates resources to prevent system-wide failures
- Time Limiter - Advanced timeout handling with cancellation support
- Retry - Intelligent retry with exponential backoff and jitter
- Rate Limiter - Controls request rate to protect services
- Cache - Response memoization to reduce load
- Chaos - Inject failures and latency for testing resilience (development/testing only)
Features
- Composable - Stack multiple resilience patterns using Tower's ServiceBuilder
- Observable - Event system for monitoring pattern behavior (retries, state changes, etc.)
- Configurable - Builder APIs with sensible defaults
- Async-first - Built on tokio for async Rust applications
- Zero-cost abstractions - Minimal overhead when patterns aren't triggered
Quick Start
[dependencies]
tower-resilience = "0.1"
tower = "0.5"
use tower::ServiceBuilder;
use *;
let service = new
.layer
.layer
.service;
Examples
Circuit Breaker
Prevent cascading failures by opening the circuit when error rate exceeds threshold:
use CircuitBreakerLayer;
use std::time::Duration;
let layer = builder
.name
.failure_rate_threshold // Open at 50% failure rate
.sliding_window_size // Track last 100 calls
.wait_duration_in_open // Stay open 60s
.on_state_transition
.build;
let service = layer.layer;
Full examples: circuitbreaker.rs | circuitbreaker_fallback.rs | circuitbreaker_health_check.rs
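To make the three settings above concrete, here is a minimal, std-only sketch of a count-based breaker. It is not the crate's implementation: the `CircuitBreaker` and `State` types, the method names, the fraction-valued threshold, and the single trial call in the half-open state are all assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical state enum for this sketch (not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

/// Count-based sliding-window breaker: opens once the window is full
/// and the failure rate reaches the threshold.
struct CircuitBreaker {
    failure_rate_threshold: f64, // fraction in 0.0..=1.0 (assumed unit)
    window_size: usize,
    wait_in_open: Duration,
    window: VecDeque<bool>, // true = failed call
    state: State,
}

impl CircuitBreaker {
    fn new(failure_rate_threshold: f64, window_size: usize, wait_in_open: Duration) -> Self {
        Self {
            failure_rate_threshold,
            window_size,
            wait_in_open,
            window: VecDeque::with_capacity(window_size),
            state: State::Closed,
        }
    }

    /// Returns true if a call may proceed at time `now`.
    fn try_acquire(&mut self, now: Instant) -> bool {
        if let State::Open { since } = self.state {
            if now.duration_since(since) >= self.wait_in_open {
                self.state = State::HalfOpen; // allow one trial call
            } else {
                return false;
            }
        }
        true
    }

    /// Records the outcome of a permitted call.
    fn record(&mut self, failed: bool, now: Instant) {
        if self.state == State::HalfOpen {
            // The trial call decides: reopen on failure, close on success.
            self.window.clear();
            self.state = if failed { State::Open { since: now } } else { State::Closed };
            return;
        }
        if self.window.len() == self.window_size {
            self.window.pop_front(); // drop the oldest outcome
        }
        self.window.push_back(failed);
        let failures = self.window.iter().filter(|f| **f).count();
        let rate = failures as f64 / self.window.len() as f64;
        if self.window.len() == self.window_size && rate >= self.failure_rate_threshold {
            self.state = State::Open { since: now };
        }
    }
}
```

Passing `now` explicitly keeps the sketch deterministic in tests; a real middleware would read the clock itself and typically allow more than one trial call.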
Bulkhead
Limit concurrent requests to prevent resource exhaustion:
use BulkheadLayer;
use std::time::Duration;
let layer = builder
.name
.max_concurrent_calls // Max 10 concurrent
.max_wait_duration // Wait up to 5s
.on_call_permitted
.on_call_rejected
.build;
let service = layer.layer;
Full examples: bulkhead.rs | bulkhead_demo.rs
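The idea behind a concurrency cap can be sketched with an atomic counter and an RAII permit. This is a simplified, std-only illustration, not the crate's code: the `Bulkhead` and `Permit` types are hypothetical, and the sketch rejects immediately instead of waiting up to a maximum duration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical bulkhead for illustration: caps the number of in-flight calls.
#[derive(Clone)]
struct Bulkhead {
    in_flight: Arc<AtomicUsize>,
    max_concurrent_calls: usize,
}

/// RAII permit: dropping it frees the slot.
struct Permit {
    in_flight: Arc<AtomicUsize>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

impl Bulkhead {
    fn new(max_concurrent_calls: usize) -> Self {
        Self {
            in_flight: Arc::new(AtomicUsize::new(0)),
            max_concurrent_calls,
        }
    }

    /// Non-blocking acquire; `None` means the call is rejected.
    fn try_acquire(&self) -> Option<Permit> {
        let max = self.max_concurrent_calls;
        self.in_flight
            // Atomically increment only while below the cap.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| Permit { in_flight: Arc::clone(&self.in_flight) })
    }
}
```

Tying the slot to a value's lifetime means a call that panics or returns early still releases its slot.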
Time Limiter
Enforce timeouts on operations with configurable cancellation:
use TimeLimiterLayer;
use std::time::Duration;
let layer = builder
.timeout_duration
.cancel_running_future // Cancel on timeout
.on_timeout
.build;
let service = layer.layer;
Full examples: timelimiter.rs | timelimiter_example.rs
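As a rough, std-only analogy for what a time limit does, the sketch below runs an operation on a worker thread and gives up after a deadline. The `with_timeout` helper is hypothetical. Unlike an async future, which can be cancelled by dropping it, a thread cannot be cancelled, so the worker here is simply left to finish on its own.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `op` on a worker thread and waits at most `timeout` for its result.
/// Returns `None` on timeout. The worker is detached, not cancelled.
fn with_timeout<T, F>(timeout: Duration, op: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller has already given up.
        let _ = tx.send(op());
    });
    rx.recv_timeout(timeout).ok()
}
```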
Retry
Retry failed requests with exponential backoff and jitter:
use RetryLayer;
use std::time::Duration;
let layer = builder
.max_attempts
.exponential_backoff
.on_retry
.on_success
.build;
let service = layer.layer;
Full examples: retry.rs | retry_example.rs
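Exponential backoff with jitter can be sketched in a few std-only functions. The names `backoff_delay`, `with_jitter`, and `retry` are hypothetical and the loop is blocking; the crate's own policy may differ. The jitter helper takes a caller-supplied random value so the sketch needs no external crate.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Applies "full jitter": scales the delay by a random value in [0, 1]
/// so that many clients do not retry in lockstep.
fn with_jitter(delay: Duration, unit_random: f64) -> Duration {
    delay.mul_f64(unit_random.clamp(0.0, 1.0))
}

/// Calls `op` up to `max_attempts` times in total (at least once),
/// sleeping with exponential backoff between failed attempts.
fn retry<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 1 s cap, the un-jittered delays are 100 ms, 200 ms, 400 ms, 800 ms, then 1 s for every later attempt.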
Rate Limiter
Control request rate to protect downstream services:
use RateLimiterLayer;
use std::time::Duration;
let layer = builder
.limit_for_period // 100 requests
.refresh_period // per second
.timeout_duration // Wait up to 500ms
.on_permit_acquired
.build;
let service = layer.layer;
Full examples: ratelimiter.rs | ratelimiter_example.rs
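A limit per refresh period can be sketched as a fixed-window counter. The `RateLimiter` type below is a hypothetical, std-only illustration and not the crate's algorithm. On rejection it reports how long the caller would have to wait, which is the value a wait timeout would be compared against.

```rust
use std::time::{Duration, Instant};

/// Hypothetical fixed-window limiter: `limit_for_period` permits per `refresh_period`.
struct RateLimiter {
    limit_for_period: u32,
    refresh_period: Duration,
    window_start: Instant,
    used: u32,
}

impl RateLimiter {
    fn new(limit_for_period: u32, refresh_period: Duration, now: Instant) -> Self {
        Self { limit_for_period, refresh_period, window_start: now, used: 0 }
    }

    /// Takes one permit if available at `now`; otherwise returns how long
    /// the caller would have to wait for the next refresh.
    fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        if now.duration_since(self.window_start) >= self.refresh_period {
            // The period has elapsed: start a fresh window with a full budget.
            self.window_start = now;
            self.used = 0;
        }
        if self.used < self.limit_for_period {
            self.used += 1;
            Ok(())
        } else {
            Err(self.window_start + self.refresh_period - now)
        }
    }
}
```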
Cache
Cache responses to reduce load on expensive operations:
use ;
use std::time::Duration;
let layer = builder
.max_size
.ttl // 5 minute TTL
.eviction_policy // LRU, LFU, or FIFO
.key_extractor
.on_hit
.on_miss
.build;
let service = layer.layer;
Full examples: cache.rs | cache_example.rs
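How a size bound, a TTL, and LRU eviction interact can be sketched with a plain `HashMap`. The `LruTtlCache` type is hypothetical and uses an O(n) scan for eviction to stay short; it only illustrates the behaviour and says nothing about how the crate stores entries.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// Hypothetical cache with a size bound, a TTL, and LRU eviction.
struct LruTtlCache<K, V> {
    max_size: usize,
    ttl: Duration,
    tick: u64, // logical clock used to order accesses
    entries: HashMap<K, (V, Instant, u64)>, // value, inserted_at, last_used
}

impl<K: Hash + Eq + Clone, V: Clone> LruTtlCache<K, V> {
    fn new(max_size: usize, ttl: Duration) -> Self {
        Self { max_size, ttl, tick: 0, entries: HashMap::new() }
    }

    /// Returns a clone of the value on a hit; expired entries count as a miss.
    fn get(&mut self, key: &K, now: Instant) -> Option<V> {
        self.tick += 1;
        let (tick, ttl) = (self.tick, self.ttl);
        let expired = match self.entries.get_mut(key) {
            Some((value, inserted_at, last_used)) => {
                if now.duration_since(*inserted_at) < ttl {
                    *last_used = tick; // refresh recency on a hit
                    return Some(value.clone());
                }
                true
            }
            None => false,
        };
        if expired {
            self.entries.remove(key);
        }
        None
    }

    /// Inserts a value, evicting the least recently used entry when full.
    fn insert(&mut self, key: K, value: V, now: Instant) {
        if !self.entries.contains_key(&key) && self.entries.len() >= self.max_size {
            let lru = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.2)
                .map(|(k, _)| k.clone());
            if let Some(k) = lru {
                self.entries.remove(&k);
            }
        }
        self.tick += 1;
        self.entries.insert(key, (value, now, self.tick));
    }
}
```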
Chaos (Testing Only)
Inject failures and latency to test your resilience patterns:
use ChaosLayer;
use std::time::Duration;
let chaos = builder
.name
.error_rate // 10% of requests fail
.error_fn
.latency_rate // 20% delayed
.min_latency
.max_latency
.seed // Deterministic chaos
.build;
let service = chaos.layer;
WARNING: Only use in development/testing environments. Never in production.
Full examples: chaos.rs | chaos_example.rs
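Seeded, reproducible fault injection can be sketched with a small PRNG. Everything below is hypothetical (the `Chaos`, `Decision`, and `XorShift64` types, and the choice of xorshift); it only shows why a fixed seed makes a chaos run repeatable.

```rust
use std::time::Duration;

/// Tiny xorshift64 PRNG so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    /// Next pseudo-random value in [0, 1).
    fn next_unit(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// What to do to a single request.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Decision {
    fail: bool,
    delay: Option<Duration>,
}

/// Hypothetical fault injector driven by a seed.
struct Chaos {
    rng: XorShift64,
    error_rate: f64,
    latency_rate: f64,
    min_latency: Duration,
    max_latency: Duration,
}

impl Chaos {
    fn new(
        seed: u64,
        error_rate: f64,
        latency_rate: f64,
        min_latency: Duration,
        max_latency: Duration,
    ) -> Self {
        // xorshift must not start from zero, so clamp the seed to at least 1.
        Self { rng: XorShift64(seed.max(1)), error_rate, latency_rate, min_latency, max_latency }
    }

    /// Draws the fault decision for the next request.
    fn decide(&mut self) -> Decision {
        let fail = self.rng.next_unit() < self.error_rate;
        let delay = if self.rng.next_unit() < self.latency_rate {
            // Pick a delay uniformly between the two latency bounds.
            let span = self.max_latency.saturating_sub(self.min_latency);
            Some(self.min_latency + span.mul_f64(self.rng.next_unit()))
        } else {
            None
        };
        Decision { fail, delay }
    }
}
```

Two injectors built with the same seed and rates produce the same sequence of decisions, which lets a failing test run be replayed.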
Error Handling
Zero-Boilerplate with ResilienceError
When composing multiple resilience layers, use ResilienceError<E> to eliminate manual error conversion code:
use ResilienceError;
// Your application error
// That's it! No From implementations needed
type ServiceError = ;
// All resilience layer errors automatically convert
let service = new
.layer
.layer
.layer
.service;
Benefits:
- Zero boilerplate - no From trait implementations
- Rich error context (layer names, counts, durations)
- Convenient helpers: is_timeout(), is_rate_limited(), etc.
See the Layer Composition Guide for details.
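The general technique behind a unified error type can be sketched as an enum that is generic over the application error, with `From` conversions written once for each layer error. The `UnifiedError` enum, its variants, and the per-layer error structs below are hypothetical stand-ins. Only the helper names `is_timeout` and `is_rate_limited` come from the list above, and the real `ResilienceError<E>` may be shaped differently.

```rust
/// Hypothetical unified error; `E` is the application's own error type.
#[derive(Debug, PartialEq)]
enum UnifiedError<E> {
    Timeout,
    RateLimited,
    CircuitOpen,
    Application(E),
}

impl<E> UnifiedError<E> {
    fn is_timeout(&self) -> bool {
        matches!(self, Self::Timeout)
    }
    fn is_rate_limited(&self) -> bool {
        matches!(self, Self::RateLimited)
    }
}

/// Hypothetical per-layer error types.
#[derive(Debug)]
struct TimeLimiterError;
#[derive(Debug)]
struct RateLimiterError;

// Conversions are written once, generic over every application error `E`.
impl<E> From<TimeLimiterError> for UnifiedError<E> {
    fn from(_: TimeLimiterError) -> Self {
        Self::Timeout
    }
}
impl<E> From<RateLimiterError> for UnifiedError<E> {
    fn from(_: RateLimiterError) -> Self {
        Self::RateLimited
    }
}

/// The application error needs no `From` impls of its own.
#[derive(Debug, PartialEq)]
enum AppError {
    NotFound,
}

type ServiceError = UnifiedError<AppError>;

fn guarded_call(timed_out: bool) -> Result<u32, ServiceError> {
    let limited: Result<u32, TimeLimiterError> =
        if timed_out { Err(TimeLimiterError) } else { Ok(7) };
    let value = limited?; // `?` converts via the generic `From` impl
    if value == 0 {
        return Err(UnifiedError::Application(AppError::NotFound));
    }
    Ok(value)
}
```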
Manual Error Handling
For specific use cases, you can still implement custom error types with manual From conversions. See examples for both approaches.
Pattern Composition
Stack multiple patterns for comprehensive resilience:
use tower::ServiceBuilder;
// Client-side: timeout -> circuit breaker -> retry
let client = new
.layer
.layer
.layer
.service;
// Server-side: rate limit -> bulkhead -> timeout
let server = new
.layer
.layer
.layer
.service;
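Layer order matters: with Tower's ServiceBuilder, the layer added first is the outermost one and sees the request first. The std-only sketch below mimics that ordering with boxed closures. The `Builder`, `Handler`, and `wrap` names are hypothetical and stand in for real Tower layers and services.

```rust
/// A boxed handler standing in for a Tower service in this sketch.
type Handler = Box<dyn Fn(&str, &mut Vec<String>) -> String>;

/// Wraps `inner` in a layer that logs its name before delegating.
fn wrap(name: &'static str, inner: Handler) -> Handler {
    Box::new(move |req: &str, log: &mut Vec<String>| -> String {
        log.push(name.to_string());
        inner(req, log)
    })
}

/// Hypothetical builder that records layers in the order they are added.
struct Builder {
    layers: Vec<&'static str>,
}

impl Builder {
    fn new() -> Self {
        Self { layers: Vec::new() }
    }

    fn layer(mut self, name: &'static str) -> Self {
        self.layers.push(name);
        self
    }

    /// Wraps from the last-added layer outward, so the first-added
    /// layer ends up outermost and sees the request first.
    fn service(self, inner: Handler) -> Handler {
        self.layers
            .into_iter()
            .rev()
            .fold(inner, |svc, name| wrap(name, svc))
    }
}
```

Building with `timeout`, `circuit_breaker`, `retry` in that order yields a call path of timeout, then circuit breaker, then retry, then the inner service, matching the arrows in the client-side comment.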
Performance
Benchmarks measure the overhead of each pattern in the happy path (no failures, circuit closed, permits available):
| Pattern | Time per call | vs Baseline |
|---|---|---|
| Baseline (no middleware) | ~10 ns | 1.0x |
| Retry (no retries) | ~80-100 ns | ~8-10x |
| Time Limiter | ~107 ns | ~10x |
| Rate Limiter | ~124 ns | ~12x |
| Bulkhead | ~162 ns | ~16x |
| Cache (hit) | ~250 ns | ~25x |
| Circuit Breaker (closed) | ~298 ns | ~29x |
| Circuit Breaker + Bulkhead | ~413 ns | ~40x |
Key Takeaways:
- All patterns add < 300 ns of overhead individually
- Overhead is roughly additive when composing patterns (circuit breaker + bulkhead measures ~413 ns, somewhat less than the two figures added together)
- Even the heaviest pattern (circuit breaker) is negligible for most use cases
- Retry and time limiter are the lightest-weight options
Run benchmarks yourself:
Documentation
- API Documentation
- Pattern Guides - In-depth guides on when and how to use each pattern
Examples
Two sets of examples are provided:
- Top-level examples - Simple, getting-started examples matching this README (one per pattern)
- Module examples - Detailed examples in each crate's examples/ directory showing advanced features
Run top-level examples with:
# etc.
Stress Tests
Stress tests validate pattern behavior under extreme conditions (high volume, high concurrency, memory stability). They are opt-in and marked with #[ignore]:
# Run all stress tests
# Run specific pattern stress tests
# Run with output to see performance metrics
Example results:
- 1M calls through circuit breaker: ~2.8s (357k calls/sec)
- 10k fast operations through bulkhead: ~56ms (176k ops/sec)
- 100k cache entries: Fill + hit test validates performance
Stress tests cover:
- High volume (millions of operations)
- High concurrency (thousands of concurrent requests)
- Memory stability (leak detection, bounded growth)
- State consistency (correctness under load)
- Pattern composition (layered middleware)
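For readers unfamiliar with opt-in tests, the following is a minimal sketch of the shape such a test can take. The `hammer` helper and the test name are hypothetical and are not taken from this repository. It checks state consistency by incrementing a shared counter from several threads.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Increments a shared counter from `threads` threads and returns the total.
fn hammer(threads: u64, calls_per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..calls_per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

#[test]
#[ignore] // skipped by default; opt in with `cargo test -- --ignored`
fn stress_counter_is_consistent() {
    assert_eq!(hammer(8, 100_000), 800_000);
}
```

The `#[ignore]` attribute keeps the test out of a normal `cargo test` run, so long-running checks do not slow down the default suite.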
Why tower-resilience?
Tower provides some built-in resilience (timeout, retry, rate limiting), but tower-resilience offers:
- Circuit Breaker - Not available in Tower
- Advanced retry - More backoff strategies and better control
- Bulkhead - True resource isolation with async-aware semaphores
- Unified events - Consistent observability across all patterns
- Builder APIs - Ergonomic configuration with sensible defaults
- Production-ready - Patterns inspired by battle-tested Resilience4j
Minimum Supported Rust Version (MSRV)
This crate's MSRV is 1.64.0, matching Tower's MSRV policy.
We follow Tower's approach:
- MSRV bumps are not considered breaking changes
- When increasing MSRV, the new Rust version must have been released at least 6 months ago
- MSRV is tested in CI to prevent unintentional increases
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contributing
Contributions are welcome! Please see the contributing guidelines for more information.