hmac_circuit_breaker/
lib.rs

//! # hmac-circuit-breaker
//!
//! An HMAC-protected circuit breaker with **fail-open semantics** for service resilience.
//!
//! ## The Problem
//!
//! Standard circuit breakers that persist failure state to disk introduce an
//! attack surface: an adversary with write access to the state file can trip every
//! circuit, causing a denial of service without touching the services themselves.
//!
//! ## The Solution
//!
//! This crate computes **HMAC-SHA256** over the circuit state and embeds it in the
//! state file. On every reload the HMAC is verified before any state is trusted.
//!
//! ### Why Fail-Open on HMAC Mismatch?
//!
//! When the HMAC doesn't match, the crate **clears** all circuit state (fail-open)
//! rather than blocking all traffic (fail-closed). This is a deliberate security
//! decision:
//!
//! | On-tamper response | What the attacker achieves |
//! |---|---|
//! | **Fail-closed** (block all) | Full self-DoS: attacker writes bad MAC, every circuit trips |
//! | **Fail-open** (clear all) | Temporary removal of protection: worst case is the baseline behaviour *without* a circuit breaker |
//!
//! Fail-open means an attacker can at most *remove* circuit protection for one reload
//! cycle, not weaponise it. The integrity violation is logged as a warning so operators
//! are alerted immediately.
//!
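/*
The fail-open rule described above can be sketched with the standard library
alone. This is a minimal illustration, not the crate's loader: `Status`,
`TagCheck`, `apply_reload` and the free function `is_tripped` are hypothetical
names, and the HMAC-SHA256 comparison itself is abstracted into the `TagCheck`
value (a real check would recompute the tag with an HMAC implementation and
compare it in constant time).

```rust
use std::collections::HashMap;

/// Hypothetical per-service status; stands in for the crate's real state type.
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Closed,
    Tripped,
}

/// Outcome of verifying the state file's tag (assumed to be computed elsewhere).
enum TagCheck {
    /// Tag matched: the parsed file contents can be trusted.
    Valid(HashMap<String, Status>),
    /// Tag did not match: the file may have been tampered with.
    Mismatch,
}

/// Apply one reload to the in-memory map.
/// Returns `true` when the file was trusted, `false` when state was cleared.
fn apply_reload(current: &mut HashMap<String, Status>, check: TagCheck) -> bool {
    match check {
        TagCheck::Valid(loaded) => {
            *current = loaded;
            true
        }
        TagCheck::Mismatch => {
            // Fail-open: forget everything instead of blocking everything.
            current.clear();
            false
        }
    }
}

/// Unknown services count as not tripped, so a cleared map blocks nothing.
fn is_tripped(state: &HashMap<String, Status>, service: &str) -> bool {
    state.get(service).map(|s| *s == Status::Tripped).unwrap_or(false)
}
```

After a `Mismatch` the map is empty, so every lookup falls through to the
`false` default and traffic flows as if no circuit breaker were present.
*/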
//! ## State Machines
//!
//! ### File-based state (written by external producer)
//!
//! ```text
//!                        pass
//! ┌──────────┐  fail   ┌──────┐  fail×N  ┌─────────┐
//! │  Closed  │────────►│ Open │─────────►│ Tripped │
//! │ (normal) │         │      │          │(blocked)│
//! └──────────┘         └──┬───┘          └────┬────┘
//!      ▲                  │ pass               │ pass (next health cycle)
//!      └──────────────────┴────────────────────┘
//! ```
//!
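/*
One way to read the file-based diagram above is as a fold over health-check
results. The sketch below is an assumption-laden illustration: `FileStatus`,
`FileState` and `record` are hypothetical names, and it assumes the threshold
counts all consecutive failures including the first one. The external
producer's actual bookkeeping may differ.

```rust
/// Hypothetical mirror of the three file-based states in the diagram.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileStatus {
    Closed,
    Open,
    Tripped,
}

/// Status plus the failure streak that produced it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileState {
    status: FileStatus,
    consecutive_failures: u32,
}

impl FileState {
    /// A service starts out healthy.
    fn new() -> Self {
        FileState { status: FileStatus::Closed, consecutive_failures: 0 }
    }

    /// Fold one health-check result into the state.
    fn record(self, passed: bool, threshold: u32) -> Self {
        if passed {
            // Any pass returns the circuit to Closed, from Open or Tripped.
            return FileState::new();
        }
        let failures = self.consecutive_failures.saturating_add(1);
        let status = if failures >= threshold {
            FileStatus::Tripped
        } else {
            FileStatus::Open
        };
        FileState { status, consecutive_failures: failures }
    }
}
```

With a threshold of 3, three failed checks in a row walk a service from
Closed through Open to Tripped, and a single pass resets it.
*/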
//! ### In-process runtime state (managed by the axum middleware)
//!
//! ```text
//!                 fail×N           cooldown elapsed
//! ┌──────────┐  ────────►  ┌─────────┐  ──────────►  ┌──────────┐
//! │  Closed  │             │ Tripped │               │ HalfOpen │
//! │ (normal) │  ◄────────  │(blocked)│  ◄──────────  │ (1 probe)│
//! └──────────┘   recover   └─────────┘  probe fail   └────┬─────┘
//!                                                         │ probe ok
//!      ▲──────────────────────────────────────────────────┘
//! ```
//!
//! * **Closed** – no in-process failures; requests pass through.
//! * **Tripped** – `threshold` consecutive 5xx responses from the inner service.
//!   Requests are rejected with 503 until the cooldown elapses.
//! * **HalfOpen** – one probe request is allowed through.  Success → Closed;
//!   failure → Tripped (cooldown restarts).
//!
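/*
The three runtime states listed above can be modelled as a small state machine
driven by an explicit clock. This is a sketch under stated assumptions, not the
middleware's implementation: `Runtime`, `Breaker`, `allow` and `record` are
hypothetical names, `ok` stands for "the inner service did not return 5xx", and
requests that arrive while a probe is in flight are assumed to be rejected.

```rust
use std::time::{Duration, Instant};

/// Hypothetical in-process state for one service.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Runtime {
    Closed { failures: u32 },
    Tripped { since: Instant },
    HalfOpen,
}

struct Breaker {
    state: Runtime,
    threshold: u32,
    cooldown: Duration,
}

impl Breaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Breaker { state: Runtime::Closed { failures: 0 }, threshold, cooldown }
    }

    /// Decide whether a request may pass at `now`.
    /// Moves Tripped to HalfOpen once the cooldown has elapsed.
    fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            Runtime::Closed { .. } => true,
            Runtime::Tripped { since } if now.duration_since(since) >= self.cooldown => {
                self.state = Runtime::HalfOpen;
                true // this request is the single probe
            }
            Runtime::Tripped { .. } => false,
            Runtime::HalfOpen => false, // a probe is already in flight
        }
    }

    /// Record the outcome of a request that was allowed through.
    fn record(&mut self, ok: bool, now: Instant) {
        self.state = match (self.state, ok) {
            (Runtime::Closed { .. }, true) | (Runtime::HalfOpen, true) => {
                Runtime::Closed { failures: 0 }
            }
            // A failed probe restarts the cooldown.
            (Runtime::HalfOpen, false) => Runtime::Tripped { since: now },
            (Runtime::Closed { failures }, false) => {
                if failures + 1 >= self.threshold {
                    Runtime::Tripped { since: now }
                } else {
                    Runtime::Closed { failures: failures + 1 }
                }
            }
            // Late results for an already tripped circuit change nothing.
            (tripped @ Runtime::Tripped { .. }, _) => tripped,
        };
    }
}
```

Passing `now` in as a parameter keeps the transitions deterministic, which
makes the cooldown behaviour easy to exercise in tests without sleeping.
*/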
//! ## Architecture
//!
//! Circuit state is tracked in two complementary layers:
//!
//! 1. **On disk**: a JSON file written by an external producer (health-check
//!    cron, monitoring daemon) with an embedded HMAC-SHA256 tag.
//!    Verified on every reload; mismatch → fail-open.
//! 2. **In memory**: two `Arc<RwLock<HashMap>>` maps:
//!    * `SharedState`: reloaded from disk every *N* seconds; reflects the
//!      external producer's view of each service.
//!    * `RuntimeState`: managed entirely by the axum middleware; trips
//!      immediately when the **current process** observes consecutive failures,
//!      then auto-recovers via half-open probing without waiting for the next
//!      health-check cycle.
//!
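/*
The text above does not spell out how the two in-memory maps are combined when
a request arrives. One plausible reading, shown here purely as an assumption,
is that a request is blocked when either layer reports the service as tripped.
The sketch uses `std::sync::RwLock` and a plain `bool` per service to stay
dependency-free; `Layer` and `should_block` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical simplified layer: service name mapped to "is tripped".
type Layer = Arc<RwLock<HashMap<String, bool>>>;

/// Block when either the file-based or the runtime layer says "tripped".
/// Unknown services and poisoned locks both fall back to "not tripped".
fn should_block(file_layer: &Layer, runtime_layer: &Layer, service: &str) -> bool {
    let tripped = |layer: &Layer| {
        layer
            .read()
            .map(|map| map.get(service).copied().unwrap_or(false))
            .unwrap_or(false)
    };
    tripped(file_layer) || tripped(runtime_layer)
}
```

Treating a missing entry as "not tripped" in both layers keeps the combined
check consistent with the fail-open default used elsewhere in the crate.
*/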
//! ## Quick Start
//!
//! ```rust,no_run
//! use hmac_circuit_breaker::{CircuitBreakerConfig, CircuitBreakerHandle};
//! use std::path::PathBuf;
//! use std::time::Duration;
//!
//! #[tokio::main]
//! async fn main() {
//!     let config = CircuitBreakerConfig::builder()
//!         .state_file(PathBuf::from("/var/run/myapp/circuit_breaker.json"))
//!         .secret("my-hmac-secret")
//!         .threshold(3)
//!         .reload_interval(Duration::from_secs(60))
//!         .build();
//!
//!     let handle = CircuitBreakerHandle::new(config);
//!     handle.spawn_reload(); // background reload every 60 s
//!
//!     // Check before dispatching work
//!     if handle.is_tripped("payment-service").await {
//!         eprintln!("payment-service is currently unavailable");
//!     }
//! }
//! ```
//!
//! ## Features
//!
//! | Feature | Default | Description |
//! |---|---|---|
//! | `reload` | yes | Enables `CircuitBreakerHandle::spawn_reload()` (requires tokio) |
//! | `axum` | no | Enables `circuit_breaker_layer()` axum middleware |

pub mod config;
pub mod integrity;
pub mod loader;
pub mod state;
pub mod writer;

#[cfg(feature = "axum")]
pub mod middleware;

pub use config::{CircuitBreakerConfig, CircuitBreakerConfigBuilder};
pub use state::{AlgorithmCircuitState, CircuitStatus, RuntimeServiceState, RuntimeStatus};
pub use writer::write_state;

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

/// Shared in-memory file-based circuit state, cheaply cloneable.
pub type SharedState = Arc<RwLock<HashMap<String, AlgorithmCircuitState>>>;

/// Shared in-process runtime circuit state, cheaply cloneable.
///
/// Managed entirely by the axum middleware; it is never persisted to disk.
pub type RuntimeState = Arc<RwLock<HashMap<String, RuntimeServiceState>>>;

/// High-level handle that owns the shared state and the config.
///
/// Clone it freely: all clones share the same underlying `Arc`s.
#[derive(Clone)]
pub struct CircuitBreakerHandle {
    pub(crate) state: SharedState,
    pub(crate) runtime: RuntimeState,
    pub(crate) config: Arc<CircuitBreakerConfig>,
}

impl CircuitBreakerHandle {
    /// Create a new handle. The initial in-memory state is empty (all circuits closed).
    pub fn new(config: CircuitBreakerConfig) -> Self {
        Self {
            state: Arc::new(RwLock::new(HashMap::new())),
            runtime: Arc::new(RwLock::new(HashMap::new())),
            config: Arc::new(config),
        }
    }

    /// Load state from the configured file once, verifying HMAC integrity.
    ///
    /// On HMAC mismatch the in-memory state is cleared (fail-open) and a `tracing::warn`
    /// is emitted.  This is safe to call at startup and before spawning the reload task.
    pub async fn load(&self) {
        loader::load_into(&self.state, &self.config).await;
    }

    /// Spawn a background tokio task that reloads the state file immediately
    /// and then again after every `config.reload_interval`, performing the
    /// same verification as [`load`](Self::load).
    ///
    /// Must be called from within a tokio runtime. The task holds its own
    /// `Arc` clones of the state and config, so dropping handles does not stop
    /// it; it runs until the runtime shuts down.
    #[cfg(feature = "reload")]
    pub fn spawn_reload(&self) {
        let state = self.state.clone();
        let config = self.config.clone();
        tokio::spawn(async move {
            loop {
                loader::load_into(&state, &config).await;
                tokio::time::sleep(config.reload_interval).await;
            }
        });
    }
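    /*
    The loop above moves strong `Arc` clones into the task, so the task keeps
    the state alive for as long as it runs. If a background worker should
    instead stop once every handle has been dropped, one common approach is to
    give it a `Weak` reference and exit when the upgrade fails. The sketch
    below illustrates that idea with a std thread and a counter in place of the
    real state; the free function `spawn_reload` here and its signature are
    hypothetical and are not part of this crate.

```rust
use std::sync::{Arc, Mutex, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Start a worker that "reloads" (increments a counter) until the last
/// strong reference to `state` is gone. Returns how many reloads it ran.
fn spawn_reload(state: &Arc<Mutex<u64>>, interval: Duration) -> JoinHandle<u64> {
    // The worker only holds a weak reference, so it does not keep state alive.
    let weak: Weak<Mutex<u64>> = Arc::downgrade(state);
    thread::spawn(move || {
        let mut reloads = 0;
        loop {
            match weak.upgrade() {
                // The temporary strong reference is dropped at the end of the arm.
                Some(strong) => {
                    *strong.lock().unwrap() += 1;
                    reloads += 1;
                }
                // Every owner is gone: stop instead of looping forever.
                None => return reloads,
            }
            thread::sleep(interval);
        }
    })
}
```

    Dropping the last `Arc` makes the next `upgrade` return `None`, so joining
    the worker completes after at most one more sleep interval.
    */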

    /// Returns `true` if the file-based state marks the named service as tripped
    /// (consecutive failures ≥ threshold).  Returns `false` for unknown services
    /// (fail-open default).  The in-process runtime state is not consulted.
    pub async fn is_tripped(&self, service: &str) -> bool {
        let guard = self.state.read().await;
        guard
            .get(service)
            .map(|s| s.status == CircuitStatus::Tripped)
            .unwrap_or(false)
    }

    /// Returns the full file-based circuit state for a service, or `None` if not tracked.
    pub async fn get(&self, service: &str) -> Option<AlgorithmCircuitState> {
        let guard = self.state.read().await;
        guard.get(service).cloned()
    }

    /// Returns a snapshot of the complete file-based in-memory state.
    pub async fn snapshot(&self) -> HashMap<String, AlgorithmCircuitState> {
        self.state.read().await.clone()
    }

    /// Access the raw file-based shared state (e.g. to pass to the axum middleware).
    pub fn shared_state(&self) -> SharedState {
        self.state.clone()
    }

    /// Access the in-process runtime state (e.g. to pass to the axum middleware).
    ///
    /// Pass this alongside [`shared_state`](Self::shared_state) to
    /// [`circuit_breaker_layer`](crate::middleware::circuit_breaker_layer) to
    /// enable in-process failure detection and half-open probing.
    pub fn runtime_state(&self) -> RuntimeState {
        self.runtime.clone()
    }
}