# Resilience Architecture Guide
**Status**: ✅ Production Ready
**Version**: v0.1.13+
**Audience**: Application Developers, Architects
---
## Overview
AllFrame provides a Clean Architecture-compliant resilience system that separates business logic from infrastructure concerns. This guide explains how to use the new architectural patterns for building resilient applications.
---
## Architecture Principles
### Clean Architecture Layers
```
┌─────────────────────────────────────┐
│  Presentation Layer                 │
│  (REST, GraphQL, gRPC handlers)     │
│  - HTTP status codes, serialization │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Application Layer                  │
│  (Use Cases, Orchestration)         │
│  - Business workflows               │
│  - Transaction coordination         │
│  - Resilience orchestration         │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Domain Layer                       │
│  (Business Logic, Entities)         │
│  - Pure business rules              │
│  - Domain models                    │
│  - Resilience contracts             │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Infrastructure Layer               │
│  (External Dependencies)            │
│  - Retry implementations            │
│  - Circuit breaker state            │
│  - Rate limiting storage            │
│  - External service clients         │
└─────────────────────────────────────┘
```
### Key Benefits
- **Testability**: Domain logic can be tested without infrastructure
- **Flexibility**: Infrastructure can be swapped without changing business logic
- **Maintainability**: Clear separation of concerns
- **Observability**: Resilience metrics and monitoring built-in
---
## Domain Layer: Resilience Contracts
The domain layer declares WHAT resilience is needed, not HOW it's implemented.
### Basic Resilience Policies
```rust
use allframe_core::domain::resilience::{ResiliencePolicy, BackoffStrategy, policies};
// Simple retry policy
let retry_policy = policies::retry(3);
// Circuit breaker policy
let circuit_policy = policies::circuit_breaker(5, 30); // 5 failures, 30s recovery
// Rate limiting policy
let rate_limit_policy = policies::rate_limit(100); // 100 requests/second
// Timeout policy
let timeout_policy = policies::timeout(10); // 10 second timeout
// Combined policies
let combined = policies::combine(vec![
policies::retry(3),
policies::timeout(30),
]);
```
### Custom Policies
```rust
use allframe_core::domain::resilience::{ResiliencePolicy, BackoffStrategy};
// Exponential backoff with custom parameters
let custom_retry = ResiliencePolicy::Retry {
max_attempts: 5,
backoff: BackoffStrategy::Exponential {
initial_delay: std::time::Duration::from_millis(100),
multiplier: 2.0,
max_delay: Some(std::time::Duration::from_secs(30)),
jitter: true,
},
};
// Circuit breaker with custom success threshold
let custom_circuit = ResiliencePolicy::CircuitBreaker {
failure_threshold: 10,
recovery_timeout: std::time::Duration::from_secs(60),
success_threshold: 3,
};
```
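To make the `Exponential` parameters concrete, the following is a minimal sketch of how a delay schedule could be derived from `initial_delay`, `multiplier`, and `max_delay`. The helper `exponential_delay` is hypothetical and is not part of `allframe_core`; it ignores `jitter`, and the framework's actual schedule may differ.

```rust
use std::time::Duration;

/// Hypothetical helper (not an AllFrame API): delay before retry number
/// `attempt` (0-based), computed as initial_delay * multiplier^attempt and
/// capped at `max_delay` when one is given. Jitter is deliberately left out.
fn exponential_delay(
    initial_delay: Duration,
    multiplier: f64,
    max_delay: Option<Duration>,
    attempt: u32,
) -> Duration {
    let uncapped_secs = initial_delay.as_secs_f64() * multiplier.powi(attempt as i32);
    let secs = match max_delay {
        Some(cap) => uncapped_secs.min(cap.as_secs_f64()),
        None => uncapped_secs,
    };
    // Note: without a cap, very large attempt numbers can overflow `Duration`
    // and make `from_secs_f64` panic, which is one reason to always set a cap.
    Duration::from_secs_f64(secs)
}
```

With the values from the example above (100 ms initial delay, multiplier 2.0, 30 s cap), this sketch yields 100 ms, 200 ms, 400 ms, 800 ms, and so on, until the cap is reached.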
### Resilient Operations
Domain entities implement `ResilientOperation` to declare their resilience requirements:
```rust
use allframe_core::domain::resilience::{ResilientOperation, ResiliencePolicy, ResilienceDomainError};
use allframe_core::domain::resilience::policies;
struct PaymentProcessor {
amount: u64,
payment_method: String,
}
impl ResilientOperation<PaymentResult, PaymentError> for PaymentProcessor {
fn resilience_policy(&self) -> ResiliencePolicy {
// Declare resilience requirements based on business rules
match self.payment_method.as_str() {
"credit_card" => policies::combine(vec![
policies::retry(3), // Retry transient failures
policies::circuit_breaker(5, 30), // Protect against service outages
policies::timeout(30), // Don't wait forever
]),
"bank_transfer" => policies::combine(vec![
policies::retry(1), // Bank transfers are usually atomic
policies::timeout(300), // But they can take time
]),
_ => policies::retry(2), // Default policy
}
}
async fn execute(&self) -> Result<PaymentResult, PaymentError> {
// Pure business logic - no infrastructure dependencies
if self.amount > 10000 {
return Err(PaymentError::AmountTooHigh);
}
// Simulate payment processing
process_payment(self.amount, &self.payment_method).await
}
fn operation_id(&self) -> &str {
"payment_processor"
}
fn is_critical(&self) -> bool {
self.amount > 1000 // High-value payments are critical
}
}
```
---
## Application Layer: Orchestration
The application layer coordinates between domain logic and infrastructure implementations.
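The division of labour can be illustrated without any framework code. The sketch below is synchronous and uses only the standard library; `Policy` and `run_with_policy` are hypothetical stand-ins, not AllFrame types. It assumes `max_attempts` counts total calls, including the first one.

```rust
/// Hypothetical, simplified stand-in for a policy declared by the domain.
enum Policy {
    None,
    Retry { max_attempts: u32 },
}

/// Hypothetical application-level runner: it interprets the policy and
/// drives the operation, so the operation itself contains no retry logic.
/// Returns the final result together with the number of calls made.
fn run_with_policy<T, E>(
    policy: &Policy,
    mut op: impl FnMut() -> Result<T, E>,
) -> (Result<T, E>, u32) {
    // Assumption: `max_attempts` is the total number of calls allowed.
    let max_calls = match policy {
        Policy::None => 1,
        Policy::Retry { max_attempts } => (*max_attempts).max(1),
    };
    let mut calls = 0;
    loop {
        calls += 1;
        match op() {
            Ok(value) => return (Ok(value), calls),
            // Out of attempts: surface the last error to the caller.
            Err(err) if calls >= max_calls => return (Err(err), calls),
            // Otherwise try again (a real runner would wait for the backoff).
            Err(_) => continue,
        }
    }
}
```

The operation passed in stays a plain closure that returns `Result`; swapping `Policy::Retry` for `Policy::None` changes behaviour without touching it.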
### Basic Usage
```rust
use allframe_core::application::resilience::{ResilienceOrchestrator, DefaultResilienceOrchestrator};
use allframe_core::domain::resilience::ResilientOperation;
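// Named `handle_payment` so it does not clash with the domain-level
// `process_payment` helper called from `PaymentProcessor::execute`.
async fn handle_payment(payment: PaymentRequest) -> Result<PaymentResult, AppError> {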
// Create orchestrator (typically injected via DI)
let orchestrator = DefaultResilienceOrchestrator::new();
// Create domain operation
let processor = PaymentProcessor {
amount: payment.amount,
payment_method: payment.method,
};
// Execute with resilience - domain stays pure, infrastructure handles resilience
let result = orchestrator.execute_operation(processor).await?;
Ok(result)
}
```
### Manual Policy Execution
```rust
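use allframe_core::application::resilience::{ResilienceOrchestrator, DefaultResilienceOrchestrator};
use allframe_core::domain::resilience::policies;

async fn call_external_api(api_request: ApiRequest) -> Result<ApiResponse, ApiError> {
    let orchestrator = DefaultResilienceOrchestrator::new();

    // Define policy inline
    let policy = policies::combine(vec![
        policies::retry(3),
        policies::circuit_breaker(5, 60),
        policies::timeout(10),
    ]);

    // Execute operation with policy. The closure may run once per attempt,
    // so each attempt gets its own copy of the request
    // (this assumes `ApiRequest` implements `Clone`).
    let result = orchestrator
        .execute_with_policy(policy, || {
            let request = api_request.clone();
            async move { call_external_service(request).await }
        })
        .await?;

    Ok(result)
}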
```
### Custom Orchestrator Configuration
```rust
use allframe_core::application::resilience::DefaultResilienceOrchestrator;
use allframe_core::resilience::{CircuitBreaker, CircuitBreakerConfig, RateLimiter};
let mut orchestrator = DefaultResilienceOrchestrator::new();
// Register named circuit breakers for different services
orchestrator.register_circuit_breaker(
"payment-service".to_string(),
CircuitBreaker::new(
"payment-service",
CircuitBreakerConfig::new(10, std::time::Duration::from_secs(30))
)
);
// Register rate limiters for different endpoints
orchestrator.register_rate_limiter(
"api-calls".to_string(),
RateLimiter::new(1000, 100) // 1000 req/sec with burst capacity
);
```
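The phrase "1000 req/sec with burst capacity" describes behaviour that is commonly implemented as a token bucket. The sketch below shows that technique under stated assumptions: `TokenBucket` is a hypothetical type, it treats the second argument as the burst size, and it takes elapsed time as a parameter to stay deterministic. It is not a description of how `RateLimiter` is implemented.

```rust
/// Hypothetical token bucket (not the AllFrame `RateLimiter`).
/// Elapsed time is passed in explicitly so the example is deterministic.
struct TokenBucket {
    rate_per_sec: f64,
    burst: f64,
    tokens: f64,
}

impl TokenBucket {
    /// Starts full, so an initial burst of `burst` calls is allowed.
    fn new(rate_per_sec: u32, burst: u32) -> Self {
        Self {
            rate_per_sec: rate_per_sec as f64,
            burst: burst as f64,
            tokens: burst as f64,
        }
    }

    /// Adds tokens for `elapsed_secs` of time, never exceeding the burst size.
    fn refill(&mut self, elapsed_secs: f64) {
        self.tokens = (self.tokens + elapsed_secs * self.rate_per_sec).min(self.burst);
    }

    /// Takes one token if available; `false` means the call is rate limited.
    fn try_acquire(&mut self) -> bool {
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In this model the rate sets how quickly capacity returns, while the burst size bounds how many calls can pass back to back after an idle period.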
---
## Infrastructure Layer: Implementation Details
The infrastructure layer provides concrete implementations that the application layer uses.
### Feature Flags
Resilience features are controlled by Cargo feature flags:
```toml
[dependencies]
allframe-core = { version = "0.1.13", features = [
"resilience", # Basic retry, circuit breaker, rate limiting
"resilience-tokio", # Async runtime integrations (future)
] }
```
### Available Implementations
| Pattern | Implementation | Feature Flag |
|---------|----------------|--------------|
| Retry | `RetryExecutor` | `resilience` |
| Circuit Breaker | `CircuitBreaker` | `resilience` |
| Rate Limiting | `RateLimiter` | `resilience` |
| Timeout | `tokio::time::timeout` | `resilience` |
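For readers unfamiliar with the circuit breaker pattern, here is a minimal sketch of the usual closed / open / half-open state machine, driven by the same three knobs that appear in `ResiliencePolicy::CircuitBreaker` (`failure_threshold`, `recovery_timeout`, `success_threshold`). `SketchBreaker` and `BreakerState` are hypothetical names, the recovery timeout is modelled as an explicit method call, and the real `CircuitBreaker` may behave differently in the details.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BreakerState {
    Closed,   // calls pass through, failures are counted
    Open,     // calls are rejected until the recovery timeout elapses
    HalfOpen, // probe calls are allowed to test the dependency
}

/// Hypothetical breaker (not the AllFrame `CircuitBreaker`).
struct SketchBreaker {
    failure_threshold: u32,
    success_threshold: u32,
    state: BreakerState,
    failures: u32,
    successes: u32,
}

impl SketchBreaker {
    fn new(failure_threshold: u32, success_threshold: u32) -> Self {
        Self { failure_threshold, success_threshold, state: BreakerState::Closed, failures: 0, successes: 0 }
    }

    fn on_failure(&mut self) {
        match self.state {
            BreakerState::Closed => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.state = BreakerState::Open;
                }
            }
            // A failed probe sends the breaker straight back to Open.
            BreakerState::HalfOpen => {
                self.state = BreakerState::Open;
                self.successes = 0;
            }
            BreakerState::Open => {}
        }
    }

    fn on_success(&mut self) {
        match self.state {
            // This sketch counts consecutive failures, so a success resets them.
            BreakerState::Closed => self.failures = 0,
            BreakerState::HalfOpen => {
                self.successes += 1;
                if self.successes >= self.success_threshold {
                    self.state = BreakerState::Closed;
                    self.failures = 0;
                    self.successes = 0;
                }
            }
            BreakerState::Open => {}
        }
    }

    /// Stand-in for "the recovery timeout has elapsed".
    fn on_recovery_timeout(&mut self) {
        if self.state == BreakerState::Open {
            self.state = BreakerState::HalfOpen;
            self.successes = 0;
        }
    }
}
```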
### Metrics and Monitoring
The orchestrator automatically collects resilience metrics:
```rust
use allframe_core::application::resilience::ResilienceOrchestrator;
let metrics = orchestrator.metrics();
println!("Total operations: {}", metrics.total_operations);
println!("Successful operations: {}", metrics.successful_operations);
println!("Failed operations: {}", metrics.failed_operations);
println!("Retry attempts: {}", metrics.retry_attempts);
println!("Circuit breaker trips: {}", metrics.circuit_breaker_trips);
println!("Rate limit hits: {}", metrics.rate_limit_hits);
println!("Timeouts: {}", metrics.timeout_count);
```
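Raw counters are usually more useful as ratios. The sketch below derives a success rate and an average retry count from a snapshot; `MetricsSnapshot` is a hypothetical struct that mirrors three of the field names printed above, and the `u64` field types are an assumption.

```rust
/// Hypothetical snapshot mirroring a few of the counters shown above.
/// Field types are assumed; the real metrics struct may differ.
struct MetricsSnapshot {
    total_operations: u64,
    successful_operations: u64,
    retry_attempts: u64,
}

/// Fraction of operations that succeeded, or `None` before any have run.
fn success_rate(m: &MetricsSnapshot) -> Option<f64> {
    if m.total_operations == 0 {
        None
    } else {
        Some(m.successful_operations as f64 / m.total_operations as f64)
    }
}

/// Average number of retries per operation, or `None` before any have run.
fn retries_per_operation(m: &MetricsSnapshot) -> Option<f64> {
    if m.total_operations == 0 {
        None
    } else {
        Some(m.retry_attempts as f64 / m.total_operations as f64)
    }
}
```

Returning `Option` avoids reporting a misleading 0% or `NaN` on a freshly started service.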
---
## Configuration Patterns
### Environment-Based Configuration
```rust
use std::env;
use allframe_core::domain::resilience::{ResiliencePolicy, BackoffStrategy};
fn get_resilience_policy() -> ResiliencePolicy {
let max_retries = env::var("MAX_RETRIES")
.unwrap_or_else(|_| "3".to_string())
.parse()
.unwrap_or(3);
let timeout_secs = env::var("OPERATION_TIMEOUT")
.unwrap_or_else(|_| "30".to_string())
.parse()
.unwrap_or(30);
ResiliencePolicy::Combined {
policies: vec![
ResiliencePolicy::Retry {
max_attempts: max_retries,
backoff: BackoffStrategy::default(),
},
ResiliencePolicy::Timeout {
duration: std::time::Duration::from_secs(timeout_secs),
},
]
}
}
```
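The "read, parse, fall back to a default" steps repeat for every setting, so they can be factored into one helper. The sketch below is one possible shape; `parse_or` and `max_retries_from_env` are hypothetical names. Taking the raw value as `Option<&str>` keeps the parsing logic testable without touching the process environment.

```rust
use std::str::FromStr;

/// Hypothetical helper: parse an optional raw setting, falling back to
/// `default` when the value is missing or malformed.
fn parse_or<T: FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Example of wiring the helper to an environment variable.
fn max_retries_from_env() -> u32 {
    parse_or(std::env::var("MAX_RETRIES").ok().as_deref(), 3)
}
```

As in the example above, a malformed value silently falls back to the default. Logging a warning in that branch may be worth considering, since a typo in a deployment variable would otherwise go unnoticed.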
### Service-Specific Policies
```rust
use allframe_core::domain::resilience::{ResiliencePolicy, policies};
#[derive(Clone)]
pub struct ResilienceConfig {
pub database_policy: ResiliencePolicy,
pub external_api_policy: ResiliencePolicy,
pub cache_policy: ResiliencePolicy,
}
impl ResilienceConfig {
pub fn production() -> Self {
Self {
database_policy: policies::combine(vec![
policies::retry(3),
policies::circuit_breaker(5, 30),
policies::timeout(5),
]),
external_api_policy: policies::combine(vec![
policies::retry(2),
policies::circuit_breaker(3, 60),
policies::timeout(10),
policies::rate_limit(100),
]),
cache_policy: policies::retry(1), // Cache failures are usually fast
}
}
pub fn development() -> Self {
Self {
database_policy: policies::retry(1), // Fail fast in development
external_api_policy: policies::retry(1),
cache_policy: ResiliencePolicy::None, // No resilience for cache in dev
}
}
}
```
---
## Error Handling
### Domain Errors vs Infrastructure Errors
```rust
use allframe_core::domain::resilience::ResilienceDomainError;
use allframe_core::application::resilience::ResilienceOrchestrationError;
// Domain errors represent business logic failures
#[derive(thiserror::Error, Debug)]
pub enum PaymentError {
#[error("Payment amount exceeds limit")]
AmountTooHigh,
#[error("Insufficient funds")]
InsufficientFunds,
#[error("Payment method not supported")]
UnsupportedMethod,
#[error("Infrastructure error: {0}")]
Infrastructure(#[from] ResilienceDomainError),
}
// Application layer maps infrastructure errors to domain errors
impl From<ResilienceOrchestrationError> for PaymentError {
fn from(error: ResilienceOrchestrationError) -> Self {
match error {
ResilienceOrchestrationError::Domain(domain_error) => {
PaymentError::Infrastructure(domain_error)
}
ResilienceOrchestrationError::Infrastructure(msg) => {
PaymentError::Infrastructure(ResilienceDomainError::Infrastructure {
message: msg,
})
}
ResilienceOrchestrationError::Configuration(msg) => {
PaymentError::Infrastructure(ResilienceDomainError::Infrastructure {
message: format!("Configuration error: {}", msg),
})
}
ResilienceOrchestrationError::Cancelled => {
PaymentError::Infrastructure(ResilienceDomainError::Cancelled)
}
}
}
}
```
### Error Classification
```rust
use allframe_core::domain::resilience::ResilienceDomainError;
let error = ResilienceDomainError::CircuitOpen;
// Check error properties
if error.is_retryable() {
println!("This error can be retried");
}
if error.is_service_unavailable() {
println!("Service is currently unavailable");
}
if let Some(retry_after) = error.retry_after() {
println!("Retry after: {:?}", retry_after);
}
```
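To show what such classification helpers typically look like, here is a small sketch on a hypothetical error enum. `SketchError` and its variants are illustrative only, and the choice of which variants count as retryable is an assumption; consult `ResilienceDomainError` itself for the real classification.

```rust
use std::time::Duration;

/// Hypothetical error type (not `ResilienceDomainError`).
#[derive(Debug)]
enum SketchError {
    Timeout,
    CircuitOpen,
    RateLimited { retry_after: Duration },
    Cancelled,
}

impl SketchError {
    /// Assumption: timeouts and rate limiting are transient; an open circuit
    /// and cancellation are not worth retrying immediately.
    fn is_retryable(&self) -> bool {
        matches!(self, SketchError::Timeout | SketchError::RateLimited { .. })
    }

    fn is_service_unavailable(&self) -> bool {
        matches!(self, SketchError::CircuitOpen)
    }

    /// Only rate limiting carries a concrete hint about when to try again.
    fn retry_after(&self) -> Option<Duration> {
        match self {
            SketchError::RateLimited { retry_after } => Some(*retry_after),
            _ => None,
        }
    }
}
```

Keeping these checks as methods on the error type lets handlers map errors to responses (for example a retry hint) without matching on every variant.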
---
## Testing Patterns
### Testing Domain Logic in Isolation
```rust
#[cfg(test)]
mod tests {
use super::*;
use allframe_core::domain::resilience::ResilientOperation;
#[tokio::test]
async fn test_payment_processor_business_logic() {
let processor = PaymentProcessor {
amount: 500,
payment_method: "credit_card".to_string(),
};
// Test business logic without infrastructure
let result = processor.execute().await;
assert!(result.is_ok());
// Test resilience policy declaration
let policy = processor.resilience_policy();
match policy {
ResiliencePolicy::Combined { policies } => {
assert_eq!(policies.len(), 3); // retry, circuit breaker, timeout
}
_ => panic!("Expected combined policy"),
}
}
#[tokio::test]
async fn test_high_value_payment_is_critical() {
let processor = PaymentProcessor {
amount: 5000,
payment_method: "credit_card".to_string(),
};
assert!(processor.is_critical());
}
}
```
### Testing with Orchestration
```rust
#[cfg(test)]
mod integration_tests {
use super::*;
use allframe_core::application::resilience::{ResilienceOrchestrator, DefaultResilienceOrchestrator};
use allframe_core::domain::resilience::policies;
#[tokio::test]
async fn test_resilient_operation_execution() {
let orchestrator = DefaultResilienceOrchestrator::new();
// Mock operation that fails twice then succeeds
struct MockOperation {
call_count: std::sync::Mutex<i32>,
}
impl ResilientOperation<String, ResilienceDomainError> for MockOperation {
fn resilience_policy(&self) -> ResiliencePolicy {
policies::retry(3)
}
async fn execute(&self) -> Result<String, ResilienceDomainError> {
let mut count = self.call_count.lock().unwrap();
*count += 1;
if *count < 3 {
Err(ResilienceDomainError::Infrastructure {
message: "Temporary failure".to_string(),
})
} else {
Ok("Success".to_string())
}
}
}
let operation = MockOperation {
call_count: std::sync::Mutex::new(0),
};
let result = orchestrator.execute_operation(operation).await;
assert_eq!(result, Ok("Success".to_string()));
// Verify metrics
let metrics = orchestrator.metrics();
assert_eq!(metrics.total_operations, 1);
assert_eq!(metrics.successful_operations, 1);
assert_eq!(metrics.retry_attempts, 2); // 2 retry attempts before success
}
}
```
---
## Performance Considerations
### Overhead Measurement
The resilience orchestration adds minimal overhead:
- **No resilience**: ~5ns per operation
- **Retry policy only**: ~50ns per operation
- **Circuit breaker only**: ~25ns per operation
- **Combined policies**: ~100ns per operation
### Optimization Tips
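1. **Reuse orchestrators**: Create an orchestrator once (for example at startup) and share it, since registered circuit breakers and rate limiters keep their state on the orchestrator
2. **Policy caching**: Build frequently used policy configurations once and reuse them instead of reconstructing them on every call
3. **Bulk operations**: Where the workflow allows, group work into one resilient operation so policy overhead is paid once per batch rather than once per item
4. **Async considerations**: Run operations on a properly configured async runtime and avoid blocking calls inside resilient operations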
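One way to follow the first tip is to build the orchestrator lazily and hand out a shared reference. The sketch below uses `std::sync::OnceLock` (Rust 1.70+) with a hypothetical `SketchOrchestrator` stand-in; the `BUILDS` counter exists only to demonstrate that construction happens once. In an application that already uses dependency injection, passing an `Arc` to the handlers achieves the same effect without a global.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts constructions, purely to demonstrate single initialisation.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for an orchestrator that is expensive to set up.
struct SketchOrchestrator {
    name: &'static str,
}

impl SketchOrchestrator {
    fn new() -> Self {
        BUILDS.fetch_add(1, Ordering::SeqCst);
        Self { name: "shared" }
    }
}

/// Returns the process-wide instance, building it on first use only.
fn shared_orchestrator() -> &'static SketchOrchestrator {
    static INSTANCE: OnceLock<SketchOrchestrator> = OnceLock::new();
    INSTANCE.get_or_init(SketchOrchestrator::new)
}
```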
### Benchmarking
```rust
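use allframe_core::application::resilience::{DefaultResilienceOrchestrator, ResilienceOrchestrator};
use allframe_core::domain::resilience::{policies, ResilienceDomainError, ResiliencePolicy};
use criterion::{criterion_group, criterion_main, Criterion};
use tokio::runtime::Runtime;

// Requires criterion's `async_tokio` feature. `to_async` polls each future to
// completion; passing an un-awaited `async` block to `black_box` would only
// measure the cost of constructing the future.
fn benchmark_resilience_orchestration(c: &mut Criterion) {
    let rt = Runtime::new().expect("failed to build Tokio runtime");
    let orchestrator = DefaultResilienceOrchestrator::new();

    c.bench_function("no_resilience", |b| {
        b.to_async(&rt).iter(|| async {
            orchestrator
                .execute_with_policy(ResiliencePolicy::None, || async {
                    Ok::<i32, ResilienceDomainError>(42)
                })
                .await
        })
    });

    c.bench_function("with_retry", |b| {
        b.to_async(&rt).iter(|| async {
            orchestrator
                .execute_with_policy(policies::retry(3), || async {
                    Ok::<i32, ResilienceDomainError>(42)
                })
                .await
        })
    });
}

criterion_group!(benches, benchmark_resilience_orchestration);
criterion_main!(benches);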
```
---
## Migration from Legacy Macros
See the [Migration Guide](MIGRATION_GUIDE.md) for transitioning from the old `#[retry]` macros to the new architectural patterns.
---
## Troubleshooting
### Common Issues
1. **"Resilience features not available"**
- Solution: Enable the `resilience` feature flag in `Cargo.toml`
2. **High latency with complex policies**
- Solution: Simplify policies or use policy caching
3. **Circuit breaker not opening**
- Solution: Check failure threshold and recovery timeout settings
4. **Rate limiting too aggressive**
- Solution: Increase burst capacity or adjust request rate
### Debugging
Enable debug logging to see resilience operations:
```rust
// Assumes the `tracing` and `tracing-subscriber` crates are listed as dependencies.
tracing_subscriber::fmt()
.with_max_level(tracing::Level::DEBUG)
.init();
// Now you'll see logs for:
// - Policy application
// - Retry attempts
// - Circuit breaker state changes
// - Rate limiting decisions
```
---
## Examples
### Complete Payment Service
See `examples/resilient_payment_service.rs` for a complete example of a payment service using the new resilience architecture.
### Microservice Communication
See `examples/microservice_resilience.rs` for examples of resilient inter-service communication patterns.
---
**Next**: [Migration Guide](MIGRATION_GUIDE.md) | [Configuration Reference](CONFIGURATION.md)