elara-runtime 0.2.0

ELARA Protocol - Node runtime with event loop, session management, and stream processing
# Health Check HTTP Server


Production-grade health check HTTP server for ELARA Runtime with support for Kubernetes liveness and readiness probes.

## Overview


The health check HTTP server exposes three endpoints for monitoring node health:

- **`/health`** - Overall health status (all checks)
- **`/ready`** - Readiness probe (is the node ready to accept traffic?)
- **`/live`** - Liveness probe (is the node alive and not deadlocked?)

## Features


- **Production-Ready**: Proper error handling, logging, and graceful shutdown
- **Non-Blocking**: Uses async/await and the Tokio runtime
- **Fast**: Leverages cached health check results (no expensive checks on request)
- **Kubernetes Integration**: Native support for liveness and readiness probes
- **JSON Responses**: Structured JSON responses with detailed status information
- **Proper HTTP Status Codes**: 200 OK for healthy/degraded, 503 for unhealthy
- **Comprehensive Testing**: Unit tests and integration tests included

## Quick Start


```rust
use elara_runtime::health::{HealthChecker, MemoryHealthCheck};
use elara_runtime::health_server::{HealthServer, HealthServerConfig};
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create health checker
    let mut checker = HealthChecker::new(Duration::from_secs(30));
    checker.add_check(Box::new(MemoryHealthCheck::new(1800)));
    let checker = Arc::new(checker);

    // Configure and start health server
    let config = HealthServerConfig {
        bind_address: "0.0.0.0:8080".parse()?,
    };

    let server = HealthServer::new(checker, config);
    server.serve().await?;

    Ok(())
}
```

## Endpoints


### `/health` - Overall Health Status


Returns the overall health status of the node, including all registered health checks.

**Response Codes:**
- `200 OK` - Node is Healthy or Degraded (can serve traffic)
- `503 Service Unavailable` - Node is Unhealthy (should not serve traffic)
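
The mapping from overall status to HTTP code can be sketched as follows (the `HealthStatus` enum here is a stand-in for the crate's actual type, shown for illustration only):

```rust
// Sketch: how the overall health status maps to an HTTP response code.
// `HealthStatus` is a stand-in for the crate's actual enum.
#[derive(Debug, PartialEq)]
enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

fn status_code(status: &HealthStatus) -> u16 {
    match status {
        // Healthy and Degraded nodes can still serve traffic.
        HealthStatus::Healthy | HealthStatus::Degraded => 200,
        // Unhealthy nodes should be taken out of rotation.
        HealthStatus::Unhealthy => 503,
    }
}

fn main() {
    assert_eq!(status_code(&HealthStatus::Degraded), 200);
    assert_eq!(status_code(&HealthStatus::Unhealthy), 503);
}
```

Note that a `degraded` node still returns 200: the distinction between healthy and degraded lives in the JSON body, not the status code.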

**Example Request:**
```bash
curl http://localhost:8080/health | jq
```

**Example Response:**
```json
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "connections": {
      "status": "healthy"
    },
    "memory": {
      "status": "healthy"
    },
    "time_drift": {
      "status": "healthy"
    },
    "state_convergence": {
      "status": "healthy"
    }
  }
}
```

**Degraded Example:**
```json
{
  "status": "degraded",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "connections": {
      "status": "degraded",
      "reason": "Only 2 active connections (minimum: 3)"
    },
    "memory": {
      "status": "healthy"
    }
  }
}
```

**Unhealthy Example:**
```json
{
  "status": "unhealthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "memory": {
      "status": "unhealthy",
      "reason": "Memory usage 1850MB exceeds limit 1800MB"
    }
  }
}
```
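
A monitoring script can branch on the top-level `status` field of these responses. A minimal dependency-free sketch (a real client would use a proper JSON parser such as `serde_json`; this string scan only handles the flat layout shown above):

```rust
// Sketch: extract the top-level "status" value from a /health response body.
// For illustration only; use a JSON parser in real code.
fn extract_status(body: &str) -> Option<String> {
    let key = "\"status\"";
    let idx = body.find(key)?;
    let rest = &body[idx + key.len()..];
    // Skip to the opening quote of the value, then take up to the closing quote.
    let start = rest.find('"')? + 1;
    let end = start + rest[start..].find('"')?;
    Some(rest[start..end].to_string())
}

fn main() {
    let body = r#"{"status": "degraded", "timestamp": "2024-01-15T10:30:00Z"}"#;
    assert_eq!(extract_status(body).as_deref(), Some("degraded"));
}
```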

### `/ready` - Readiness Probe


Kubernetes readiness probe endpoint. Indicates whether the node is ready to accept traffic.

**Response Codes:**
- `200 OK` - Node is ready to accept traffic
- `503 Service Unavailable` - Node is not ready

**Use Case:** A node may be alive but not ready (e.g., still initializing, warming up caches, establishing connections). Kubernetes will not route traffic to pods that fail the readiness probe.

**Example Request:**
```bash
curl http://localhost:8080/ready
```

### `/live` - Liveness Probe


Kubernetes liveness probe endpoint. Indicates whether the node is alive and functioning.

**Response Codes:**
- `200 OK` - Node is alive
- `503 Service Unavailable` - Node is deadlocked or unresponsive

**Use Case:** If this check fails, Kubernetes will restart the pod. Liveness checks should be more lenient than readiness checks to avoid unnecessary restarts. A degraded node is still considered "alive".

**Example Request:**
```bash
curl http://localhost:8080/live
```

## Kubernetes Integration


### Example Deployment Configuration


```yaml
apiVersion: v1
kind: Pod
metadata:
  name: elara-node
spec:
  containers:
  - name: elara
    image: elara-node:latest
    ports:
    - containerPort: 8080
      name: health
    livenessProbe:
      httpGet:
        path: /live
        port: health
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: health
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 2
```

### Probe Configuration Guidelines

**Liveness Probe:**
- `initialDelaySeconds: 30` - Give the node time to start up
- `periodSeconds: 10` - Check every 10 seconds
- `timeoutSeconds: 5` - Allow 5 seconds for response
- `failureThreshold: 3` - Restart after 3 consecutive failures (30 seconds)

**Readiness Probe:**
- `initialDelaySeconds: 10` - Start checking readiness early
- `periodSeconds: 5` - Check more frequently
- `timeoutSeconds: 3` - Shorter timeout for readiness
- `failureThreshold: 2` - Remove from service after 2 failures (10 seconds)
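
The detection windows quoted in parentheses above are simply `periodSeconds × failureThreshold`:

```rust
// Worst-case time for Kubernetes to act on a failing probe,
// ignoring timeoutSeconds and scheduling jitter.
fn detection_window(period_seconds: u32, failure_threshold: u32) -> u32 {
    period_seconds * failure_threshold
}

fn main() {
    // Liveness: 3 failures, 10s apart -> restart after ~30 seconds.
    assert_eq!(detection_window(10, 3), 30);
    // Readiness: 2 failures, 5s apart -> removed from service after ~10 seconds.
    assert_eq!(detection_window(5, 2), 10);
}
```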

## Built-in Health Checks

The health server works with the following built-in health checks:

### 1. ConnectionHealthCheck

Monitors the number of active connections to ensure the node is properly connected to the network.

```rust
use elara_runtime::health::ConnectionHealthCheck;

checker.add_check(Box::new(ConnectionHealthCheck::new(node.clone(), 3)));
```

**Status:**
- `Healthy`: Active connections >= min_connections
- `Degraded`: Active connections < min_connections

### 2. MemoryHealthCheck


Monitors process memory usage using the `sysinfo` crate.

```rust
use elara_runtime::health::MemoryHealthCheck;

checker.add_check(Box::new(MemoryHealthCheck::new(1800))); // 1800 MB
```

**Status:**
- `Healthy`: Memory usage < max_memory_mb
- `Unhealthy`: Memory usage >= max_memory_mb

### 3. TimeDriftCheck


Monitors time drift between the local node and network consensus time.

```rust
use elara_runtime::health::TimeDriftCheck;

checker.add_check(Box::new(TimeDriftCheck::new(node.clone(), 100))); // 100ms
```

**Status:**
- `Healthy`: |time_drift| < max_drift_ms
- `Degraded`: |time_drift| >= max_drift_ms

### 4. StateDivergenceCheck


Monitors the state reconciliation engine to ensure state is converging properly.

```rust
use elara_runtime::health::StateDivergenceCheck;

checker.add_check(Box::new(StateDivergenceCheck::new(node)));
```

**Status:**
- `Healthy`: State is converging normally
- `Degraded`: State convergence is slow or stalled

## Configuration


### Server Configuration


```rust
use elara_runtime::health_server::HealthServerConfig;

let config = HealthServerConfig {
    bind_address: "0.0.0.0:8080".parse()?,
};
```

### Health Checker Configuration


```rust
use elara_runtime::health::HealthChecker;
use std::time::Duration;

// Cache health check results for 30 seconds
let checker = HealthChecker::new(Duration::from_secs(30));
```

**Cache TTL Guidelines:**
- **High-frequency polling (< 1s)**: Use 5-10 second cache
- **Normal polling (5-10s)**: Use 30 second cache
- **Low-frequency polling (> 30s)**: Use 60 second cache

## Performance Considerations


### Response Time


The health server is designed to respond quickly:

- **Cached responses**: < 1ms (read lock only)
- **Cache miss**: < 50ms (depends on health checks)
- **Target**: < 10ms for 99th percentile

### Caching


Health check results are cached to avoid excessive checking overhead:

1. First request executes all health checks
2. Subsequent requests within TTL return cached results
3. Expired cache triggers new health check execution
4. Cache updates are atomic and thread-safe
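
The scheme above can be sketched with std primitives (the real implementation is async and caches a full health report; the names and the `String` result here are illustrative):

```rust
// Sketch of the TTL-based caching scheme described above.
use std::sync::RwLock;
use std::time::{Duration, Instant};

struct CachedHealth {
    result: String, // stands in for the full health report
    checked_at: Instant,
}

struct HealthCache {
    ttl: Duration,
    inner: RwLock<Option<CachedHealth>>,
}

impl HealthCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, inner: RwLock::new(None) }
    }

    // Return the cached result if it is fresh, otherwise run the
    // (expensive) checks and atomically replace the cache entry.
    fn get_or_refresh(&self, run_checks: impl Fn() -> String) -> String {
        if let Some(cached) = self.inner.read().unwrap().as_ref() {
            if cached.checked_at.elapsed() < self.ttl {
                return cached.result.clone(); // fast path: read lock only
            }
        }
        let result = run_checks(); // cache miss or expired: execute checks
        *self.inner.write().unwrap() = Some(CachedHealth {
            result: result.clone(),
            checked_at: Instant::now(),
        });
        result
    }
}

fn main() {
    let cache = HealthCache::new(Duration::from_secs(30));
    let first = cache.get_or_refresh(|| "healthy".to_string());
    // Within the TTL the closure is not re-run; the cached value is returned.
    let second = cache.get_or_refresh(|| "unhealthy".to_string());
    assert_eq!(first, "healthy");
    assert_eq!(second, "healthy");
}
```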

### Resource Usage


- **Memory**: < 1MB for health server
- **CPU**: < 0.1% when idle
- **Network**: Minimal (HTTP responses are small)

## Monitoring and Alerting


### Prometheus Integration


The health endpoints can be monitored by Prometheus using the `blackbox_exporter`:

```yaml
scrape_configs:
  - job_name: 'elara-health'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://elara-node:8080/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

### Alert Rules

Example Prometheus alert rules:

```yaml
groups:
  - name: elara_health
    rules:
      - alert: ElaraNodeUnhealthy
        expr: probe_success{job="elara-health"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "ELARA node {{ $labels.instance }} is unhealthy"
          description: "Health check has been failing for 2 minutes"
      
      - alert: ElaraNodeDegraded
        # Note: blackbox_exporter cannot distinguish "degraded" from "healthy"
        # by status code alone, since both return 200. This rule assumes a
        # second probe module configured with fail_if_body_matches_regexp on
        # '"status":\s*"degraded"', scraped under the job name "elara-degraded".
        expr: probe_success{job="elara-degraded"} == 0 and probe_success{job="elara-health"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ELARA node {{ $labels.instance }} is degraded"
          description: "Node is operational but degraded for 5 minutes"
```

## Load Balancer Integration


### HAProxy Configuration


```haproxy
backend elara_nodes
    option httpchk GET /health
    http-check expect status 200
    server node1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    server node2 10.0.1.11:8080 check inter 5s fall 3 rise 2
    server node3 10.0.1.12:8080 check inter 5s fall 3 rise 2
```

### NGINX Configuration


```nginx
upstream elara_nodes {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    location /health {
        proxy_pass http://elara_nodes/health;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
    }
}
```

## Troubleshooting


### Health Check Always Returns Degraded


**Symptom:** `/health` returns 200 OK but status is "degraded"

**Possible Causes:**
1. Connection count below minimum threshold
2. Time drift exceeds threshold
3. State convergence is slow

**Solution:**
- Check individual check results in the JSON response
- Adjust thresholds if they're too strict
- Investigate underlying issues (network, time sync, etc.)

### Health Check Returns 503


**Symptom:** `/health` returns 503 Service Unavailable

**Possible Causes:**
1. Memory usage exceeds threshold
2. Critical component failure

**Solution:**
- Check the `reason` field in the JSON response
- Investigate memory leaks or resource exhaustion
- Review logs for errors

### Health Check Timeout


**Symptom:** Health check requests timeout

**Possible Causes:**
1. Server not running
2. Network issues
3. Server deadlocked

**Solution:**
- Verify server is running: `netstat -an | grep 8080`
- Check server logs for errors
- Restart the node if deadlocked

## Examples


### Running the Example


```bash
cargo run --example health_server
```

### Testing with curl


```bash
# Check overall health
curl http://localhost:8080/health | jq

# Check readiness
curl http://localhost:8080/ready | jq

# Check liveness
curl http://localhost:8080/live | jq

# Check HTTP status code
curl -I http://localhost:8080/health
```

### Testing with httpie


```bash
# Pretty-printed JSON
http :8080/health

# Show headers
http -h :8080/health
```

## Best Practices


1. **Set Appropriate Thresholds**: Configure health check thresholds based on your deployment size and requirements

2. **Use Caching**: Set cache TTL to match your polling frequency to avoid excessive checking

3. **Monitor Health Endpoints**: Use Prometheus or similar tools to monitor health check status over time

4. **Separate Liveness and Readiness**: In production, consider implementing separate checks for liveness and readiness

5. **Graceful Degradation**: Design health checks to return "degraded" for non-critical issues rather than "unhealthy"

6. **Log Health Changes**: The server automatically logs when health status changes to degraded or unhealthy

7. **Test Failure Scenarios**: Regularly test that health checks correctly detect failures

8. **Document Thresholds**: Document why specific thresholds were chosen for your deployment

## Security Considerations


1. **No Authentication**: The health endpoints do not require authentication by default. Consider adding authentication if exposing publicly.

2. **Rate Limiting**: Consider adding rate limiting to prevent abuse of health endpoints.

3. **No Sensitive Data**: Health check responses do not include sensitive data (keys, passwords, etc.).

4. **Internal Network**: Deploy health endpoints on an internal network or use firewall rules to restrict access.

## References


- [Kubernetes Liveness and Readiness Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [Health Check API Design](https://microservices.io/patterns/observability/health-check-api.html)
- [Prometheus Blackbox Exporter](https://github.com/prometheus/blackbox_exporter)