# Health Check HTTP Server
Production-grade health check HTTP server for ELARA Runtime with support for Kubernetes liveness and readiness probes.
## Overview
The health check HTTP server exposes three endpoints for monitoring node health:
- **`/health`** - Overall health status (all checks)
- **`/ready`** - Readiness probe (is the node ready to accept traffic?)
- **`/live`** - Liveness probe (is the node alive and not deadlocked?)
## Features
- ✅ **Production-Ready**: Proper error handling, logging, and graceful shutdown
- ✅ **Non-Blocking**: Uses async/await and Tokio runtime
- ✅ **Fast**: Leverages cached health check results (no expensive checks on request)
- ✅ **Kubernetes Integration**: Native support for liveness and readiness probes
- ✅ **JSON Responses**: Structured JSON responses with detailed status information
- ✅ **Proper HTTP Status Codes**: 200 OK for healthy/degraded, 503 for unhealthy
- ✅ **Comprehensive Testing**: Unit tests and integration tests included
## Quick Start
```rust
use elara_runtime::health::{HealthChecker, MemoryHealthCheck};
use elara_runtime::health_server::{HealthServer, HealthServerConfig};
use std::sync::Arc;
use std::time::Duration;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create health checker
let mut checker = HealthChecker::new(Duration::from_secs(30));
checker.add_check(Box::new(MemoryHealthCheck::new(1800)));
let checker = Arc::new(checker);
// Configure and start health server
let config = HealthServerConfig {
bind_address: "0.0.0.0:8080".parse()?,
};
let server = HealthServer::new(checker, config);
server.serve().await?;
Ok(())
}
```
## Endpoints
### `/health` - Overall Health Status
Returns the overall health status of the node, including all registered health checks.
**Response Codes:**
- `200 OK` - Node is Healthy or Degraded (can serve traffic)
- `503 Service Unavailable` - Node is Unhealthy (should not serve traffic)
**Example Request:**
```bash
**Example Response:**
```json
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"checks": {
"connections": {
"status": "healthy"
},
"memory": {
"status": "healthy"
},
"time_drift": {
"status": "healthy"
},
"state_convergence": {
"status": "healthy"
}
}
}
```
**Degraded Example:**
```json
{
"status": "degraded",
"timestamp": "2024-01-15T10:30:00Z",
"checks": {
"connections": {
"status": "degraded",
"reason": "Only 2 active connections (minimum: 3)"
},
"memory": {
"status": "healthy"
}
}
}
```
**Unhealthy Example:**
```json
{
"status": "unhealthy",
"timestamp": "2024-01-15T10:30:00Z",
"checks": {
"memory": {
"status": "unhealthy",
"reason": "Memory usage 1850MB exceeds limit 1800MB"
}
}
}
```
### `/ready` - Readiness Probe
Kubernetes readiness probe endpoint. Indicates whether the node is ready to accept traffic.
**Response Codes:**
- `200 OK` - Node is ready to accept traffic
- `503 Service Unavailable` - Node is not ready
**Use Case:** A node may be alive but not ready (e.g., still initializing, warming up caches, establishing connections). Kubernetes will not route traffic to pods that fail the readiness probe.
**Example Request:**
```bash
curl http://localhost:8080/ready
```
### `/live` - Liveness Probe
Kubernetes liveness probe endpoint. Indicates whether the node is alive and functioning.
**Response Codes:**
- `200 OK` - Node is alive
- `503 Service Unavailable` - Node is deadlocked or unresponsive
**Use Case:** If this check fails, Kubernetes will restart the pod. Liveness checks should be more lenient than readiness checks to avoid unnecessary restarts. A degraded node is still considered "alive".
**Example Request:**
```bash
curl http://localhost:8080/live
```
## Kubernetes Integration
### Example Deployment Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
name: elara-node
spec:
containers:
- name: elara
image: elara-node:latest
ports:
- containerPort: 8080
name: health
livenessProbe:
httpGet:
path: /live
port: health
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: health
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
```
### Probe Configuration Guidelines
**Liveness Probe:**
- `initialDelaySeconds: 30` - Give the node time to start up
- `periodSeconds: 10` - Check every 10 seconds
- `timeoutSeconds: 5` - Allow 5 seconds for response
- `failureThreshold: 3` - Restart after 3 consecutive failures (30 seconds)
**Readiness Probe:**
- `initialDelaySeconds: 10` - Start checking readiness early
- `periodSeconds: 5` - Check more frequently
- `timeoutSeconds: 3` - Shorter timeout for readiness
- `failureThreshold: 2` - Remove from service after 2 failures (10 seconds)
## Built-in Health Checks
The health server works with the following built-in health checks:
### 1. ConnectionHealthCheck
Monitors the number of active connections to ensure the node is properly connected to the network.
```rust
use elara_runtime::health::ConnectionHealthCheck;
checker.add_check(Box::new(ConnectionHealthCheck::new(node.clone(), 3)));
```
**Status:**
- `Healthy`: Active connections >= min_connections
- `Degraded`: Active connections < min_connections
### 2. MemoryHealthCheck
Monitors process memory usage using the `sysinfo` crate.
```rust
use elara_runtime::health::MemoryHealthCheck;
checker.add_check(Box::new(MemoryHealthCheck::new(1800))); // 1800 MB
```
**Status:**
- `Healthy`: Memory usage < max_memory_mb
- `Unhealthy`: Memory usage >= max_memory_mb
### 3. TimeDriftCheck
Monitors time drift between the local node and network consensus time.
```rust
use elara_runtime::health::TimeDriftCheck;
checker.add_check(Box::new(TimeDriftCheck::new(node.clone(), 100))); // 100ms
```
**Status:**
### 4. StateDivergenceCheck
Monitors the state reconciliation engine to ensure state is converging properly.
```rust
use elara_runtime::health::StateDivergenceCheck;
checker.add_check(Box::new(StateDivergenceCheck::new(node)));
```
**Status:**
- `Healthy`: State is converging normally
- `Degraded`: State convergence is slow or stalled
## Configuration
### Server Configuration
```rust
use elara_runtime::health_server::HealthServerConfig;
let config = HealthServerConfig {
bind_address: "0.0.0.0:8080".parse()?,
};
```
### Health Checker Configuration
```rust
use elara_runtime::health::HealthChecker;
use std::time::Duration;
// Cache health check results for 30 seconds
let checker = HealthChecker::new(Duration::from_secs(30));
```
**Cache TTL Guidelines:**
- **High-frequency polling (< 1s)**: Use 5-10 second cache
- **Normal polling (5-10s)**: Use 30 second cache
- **Low-frequency polling (> 30s)**: Use 60 second cache
## Performance Considerations
### Response Time
The health server is designed to respond quickly:
- **Cached responses**: < 1ms (read lock only)
- **Cache miss**: < 50ms (depends on health checks)
- **Target**: < 10ms for 99th percentile
### Caching
Health check results are cached to avoid excessive checking overhead:
1. First request executes all health checks
2. Subsequent requests within TTL return cached results
3. Expired cache triggers new health check execution
4. Cache updates are atomic and thread-safe
### Resource Usage
- **Memory**: < 1MB for health server
- **CPU**: < 0.1% when idle
- **Network**: Minimal (HTTP responses are small)
## Monitoring and Alerting
### Prometheus Integration
The health endpoints can be monitored by Prometheus using the `blackbox_exporter`:
```yaml
scrape_configs:
- job_name: 'elara-health'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- http://elara-node:8080/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
```
### Alert Rules
Example Prometheus alert rules:
```yaml
groups:
- name: elara_health
rules:
- alert: ElaraNodeUnhealthy
expr: probe_success{job="elara-health"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "ELARA node {{ $labels.instance }} is unhealthy"
description: "Health check has been failing for 2 minutes"
- alert: ElaraNodeDegraded
expr: probe_http_status_code{job="elara-health"} == 200 and probe_http_content_length{job="elara-health"} > 0
for: 5m
labels:
severity: warning
annotations:
summary: "ELARA node {{ $labels.instance }} is degraded"
description: "Node is operational but degraded for 5 minutes"
```
## Load Balancer Integration
### HAProxy Configuration
```haproxy
backend elara_nodes
option httpchk GET /health
http-check expect status 200
server node1 10.0.1.10:8080 check inter 5s fall 3 rise 2
server node2 10.0.1.11:8080 check inter 5s fall 3 rise 2
server node3 10.0.1.12:8080 check inter 5s fall 3 rise 2
```
### NGINX Configuration
```nginx
upstream elara_nodes {
server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
server 10.0.1.12:8080 max_fails=3 fail_timeout=30s;
}
server {
location /health {
proxy_pass http://elara_nodes/health;
proxy_connect_timeout 3s;
proxy_read_timeout 5s;
}
}
```
## Troubleshooting
### Health Check Always Returns Degraded
**Symptom:** `/health` returns 200 OK but status is "degraded"
**Possible Causes:**
1. Connection count below minimum threshold
2. Time drift exceeds threshold
3. State convergence is slow
**Solution:**
- Check individual check results in the JSON response
- Adjust thresholds if they're too strict
- Investigate underlying issues (network, time sync, etc.)
### Health Check Returns 503
**Symptom:** `/health` returns 503 Service Unavailable
**Possible Causes:**
1. Memory usage exceeds threshold
2. Critical component failure
**Solution:**
- Check the `reason` field in the JSON response
- Investigate memory leaks or resource exhaustion
- Review logs for errors
### Health Check Timeout
**Symptom:** Health check requests timeout
**Possible Causes:**
1. Server not running
2. Network issues
3. Server deadlocked
**Solution:**
- Verify server is running: `netstat -an | grep 8080`
- Check server logs for errors
- Restart the node if deadlocked
## Examples
### Running the Example
```bash
cargo run --example health_server
```
### Testing with curl
```bash
# Check overall health
# Check readiness
# Check liveness
# Check HTTP status code
curl -I http://localhost:8080/health
```
### Testing with httpie
```bash
# Pretty-printed JSON
http :8080/health
# Show headers
http -h :8080/health
```
## Best Practices
1. **Set Appropriate Thresholds**: Configure health check thresholds based on your deployment size and requirements
2. **Use Caching**: Set cache TTL to match your polling frequency to avoid excessive checking
3. **Monitor Health Endpoints**: Use Prometheus or similar tools to monitor health check status over time
4. **Separate Liveness and Readiness**: In production, consider implementing separate checks for liveness and readiness
5. **Graceful Degradation**: Design health checks to return "degraded" for non-critical issues rather than "unhealthy"
6. **Log Health Changes**: The server automatically logs when health status changes to degraded or unhealthy
7. **Test Failure Scenarios**: Regularly test that health checks correctly detect failures
8. **Document Thresholds**: Document why specific thresholds were chosen for your deployment
## Security Considerations
1. **No Authentication**: The health endpoints do not require authentication by default. Consider adding authentication if exposing publicly.
2. **Rate Limiting**: Consider adding rate limiting to prevent abuse of health endpoints.
3. **No Sensitive Data**: Health check responses do not include sensitive data (keys, passwords, etc.).
4. **Internal Network**: Deploy health endpoints on an internal network or use firewall rules to restrict access.
## References
- [Kubernetes Liveness and Readiness Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [Health Check API Design](https://microservices.io/patterns/observability/health-check-api.html)
- [Prometheus Blackbox Exporter](https://github.com/prometheus/blackbox_exporter)