blvm-node 0.1.3

# High Availability Features

## Overview

blvm-node implements Phase 2 and 3 high availability features for production deployment: Prometheus metrics export, health check endpoints, disk space monitoring, peer reconnection, enhanced rate limiting, and structured logging.

## Metrics Endpoint

### Prometheus Metrics Export

**Endpoint**: `GET /metrics`

**Purpose**: Exports Prometheus-formatted metrics for monitoring.

**Metrics Exported**:
- Block processing metrics (blocks processed, validation time)
- Network metrics (peers connected, bytes sent/received)
- Storage metrics (database size, UTXO count)
- RPC metrics (requests processed, errors)
- Mempool metrics (transaction count, size)

**Example**:
```bash
curl http://localhost:18332/metrics
```

**Response Format**: Prometheus text exposition format

**Usage**: Configure Prometheus to scrape this endpoint for monitoring dashboards.
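For ad hoc checks outside Prometheus, the text format is simple enough to read by hand. The following is a minimal sketch of a parser for plain `name value` sample lines. The metric names in `SAMPLE` are hypothetical and are not taken from blvm-node's actual output; timestamps and escaped label values are not handled.

```python
def parse_prometheus_text(text):
    """Parse simple Prometheus text-format samples into {series: value}.

    Only plain `name{labels} value` lines are handled. Comment lines
    (# HELP, # TYPE) and blank lines are skipped. Lines that carry a
    trailing timestamp are NOT supported by this sketch.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Hypothetical payload: metric names are illustrative only.
SAMPLE = """\
# HELP example_peers_connected Number of connected peers
# TYPE example_peers_connected gauge
example_peers_connected 8
example_rpc_requests_total{method="getblockhash"} 12345
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["example_peers_connected"])
```

In practice the body would come from the `curl` call shown earlier; the parser itself has no network dependency.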

---

## Health Check Endpoints

### Basic Health Check

**Endpoint**: `GET /health`

**Purpose**: Simple health check for load balancers.

**Response**:
```json
{
  "status": "healthy",
  "timestamp": 1234567890
}
```

**Status Codes**:
- `200 OK`: Node is healthy
- `503 Service Unavailable`: Node is unhealthy

---

### Liveness Probe

**Endpoint**: `GET /health/live`

**Purpose**: Kubernetes liveness probe. Indicates whether the node process is running.

**Response**:
```json
{
  "status": "alive"
}
```

**Status Codes**:
- `200 OK`: Process is alive
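- `503 Service Unavailable`: Process is running but unresponsive (a dead process cannot answer at all, so the probe fails with a connection error or timeout instead)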

---

### Readiness Probe

**Endpoint**: `GET /health/ready`

**Purpose**: Kubernetes readiness probe. Indicates whether the node is ready to serve requests.

**Response**:
```json
{
  "status": "ready",
  "chain_initialized": true,
  "storage_available": true
}
```

**Status Codes**:
- `200 OK`: Node is ready
- `503 Service Unavailable`: Node is not ready (e.g., initializing chain)

---

### Detailed Health Check

**Endpoint**: `GET /health/detailed`

**Purpose**: Comprehensive health status for debugging.

**Response**:
```json
{
  "status": "healthy",
  "chain": {
    "initialized": true,
    "height": 123456,
    "tip_hash": "0000..."
  },
  "storage": {
    "available": true,
    "size_bytes": 1234567890
  },
  "network": {
    "peers_connected": 8,
    "peers_max": 100
  },
  "rpc": {
    "enabled": true,
    "requests_processed": 12345
  }
}
```
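When debugging, it can help to reduce the detailed payload to a few headline numbers. The sketch below assumes the response has exactly the shape shown above; `summarize_health` and the `peer_utilization` figure are illustrative helpers and not part of blvm-node.

```python
import json


def summarize_health(payload):
    """Reduce a /health/detailed JSON body to a few headline values."""
    doc = json.loads(payload)
    network = doc["network"]
    return {
        "healthy": doc["status"] == "healthy",
        "height": doc["chain"]["height"],
        # Fraction of the peer slots currently in use (derived value).
        "peer_utilization": network["peers_connected"] / network["peers_max"],
    }


# Sample body copied from the response example above.
SAMPLE = """
{
  "status": "healthy",
  "chain": {"initialized": true, "height": 123456, "tip_hash": "0000..."},
  "storage": {"available": true, "size_bytes": 1234567890},
  "network": {"peers_connected": 8, "peers_max": 100},
  "rpc": {"enabled": true, "requests_processed": 12345}
}
"""

summary = summarize_health(SAMPLE)
print(summary)
```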

---

## Disk Space Monitoring

### Automatic Pruning

blvm-node monitors disk space and automatically prunes old blocks when space is low.

**Configuration**:
```toml
[storage]
pruning_mode = "normal"  # or "aggressive", "custom", "disabled"
pruning_threshold_gb = 100  # Prune when disk usage exceeds this
pruning_target_gb = 80      # Prune down to this size
```

**Pruning Modes**:
- `disabled`: No automatic pruning
- `normal`: Prune old blocks, keep recent blocks
- `aggressive`: Prune aggressively, keep only recent blocks
- `custom`: Custom pruning configuration

**Behavior**:
- Monitors disk space periodically
- Triggers pruning when threshold exceeded
- Prunes to target size
- Logs pruning operations
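The threshold and target together form a simple hysteresis band: nothing happens until usage exceeds the threshold, then data is removed until usage reaches the target. A minimal sketch of that decision follows. `plan_pruning` is a hypothetical helper, and the requirement that the target sit below the threshold is an assumption rather than documented node behavior.

```python
def plan_pruning(usage_gb, threshold_gb=100, target_gb=80):
    """Return how many GB to prune, or 0 if usage is within the threshold.

    Defaults mirror the example configuration above. The check that
    target_gb < threshold_gb is an assumption of this sketch.
    """
    if target_gb >= threshold_gb:
        raise ValueError("pruning_target_gb must be below pruning_threshold_gb")
    if usage_gb <= threshold_gb:
        # At or below the threshold: no pruning is triggered.
        return 0
    # Above the threshold: prune down to the target size.
    return usage_gb - target_gb


for usage in (90, 100, 105):
    print(usage, plan_pruning(usage))
```

Keeping a gap between the two values avoids pruning again immediately after a prune finishes.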

---

## Peer Reconnection

### Automatic Reconnection

blvm-node automatically reconnects to disconnected peers with exponential backoff.

**Features**:
- Exponential backoff: Reconnection attempts with increasing delays
- Quality-based prioritization: Reconnect to high-quality peers first
- Connection queue: Manages reconnection queue
- Max retries: Limits reconnection attempts

**Configuration**:
```toml
[network]
reconnect_enabled = true
reconnect_max_retries = 10
reconnect_initial_delay_secs = 5
reconnect_max_delay_secs = 3600
```

**Behavior**:
- Detects peer disconnections
- Adds peer to reconnection queue
- Attempts reconnection with exponential backoff
- Prioritizes high-quality peers
- Stops after max retries
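To get a feel for how the backoff settings interact, the delay schedule can be computed directly. This sketch assumes the delay doubles on each attempt and is capped at the maximum; the doubling factor and the absence of jitter are assumptions, since the growth rule is not specified here.

```python
def reconnect_delays(initial_secs=5, max_secs=3600, max_retries=10, factor=2):
    """Return the delay before each reconnection attempt, in seconds.

    Defaults mirror the example configuration above. `factor` is a
    hypothetical parameter: the actual growth rate may differ.
    """
    return [
        min(initial_secs * factor**attempt, max_secs)
        for attempt in range(max_retries)
    ]


delays = reconnect_delays()
print(delays)
print(sum(delays))  # total time spent waiting before giving up
```

Under these assumptions the cap of 3600 seconds is never reached with 10 retries; it only takes effect from the eleventh attempt onward.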

---

## Rate Limiting

### Enhanced Rate Limiting

blvm-node implements multi-layer rate limiting for RPC requests.

**Layers**:
1. **Per-IP Rate Limiting**: Limits requests per IP address
2. **Per-User Rate Limiting**: Limits requests per authenticated user
3. **Per-Method Rate Limiting**: Limits requests per RPC method

**Configuration**:
```toml
[rpc.auth]
rate_limit_enabled = true
rate_limit_rate = 100      # Requests per second
rate_limit_burst = 200     # Burst capacity

# Per-method overrides
[rpc.auth.per_method_limits]
getblocktemplate = { rate = 10, burst = 20 }
sendrawtransaction = { rate = 5, burst = 10 }
```

**Rate Limiter**: Token bucket algorithm

**Response**: `429 Too Many Requests` when limit exceeded
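The relationship between `rate_limit_rate` and `rate_limit_burst` is easiest to see in a small token bucket model. The following is a generic sketch of the algorithm and not blvm-node's implementation; the class name and the injectable `clock` parameter are illustrative.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/sec, holds up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable for deterministic tests
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a caller would answer 429 Too Many Requests here


bucket = TokenBucket(rate=100, burst=200)
print(bucket.allow())
```

With the values above, a client can send 200 requests at once, after which it is held to a sustained 100 requests per second.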

---

## Structured Logging

### Request IDs and Tracing

blvm-node uses structured logging with request IDs and tracing spans.

**Features**:
- Request IDs: Unique ID per RPC request
- Tracing spans: Hierarchical tracing context
- Request/response metrics: Logged with each request
- Client address tracking: Logged for each request

**Log Format**:
```text
[2025-01-01T00:00:00Z INFO rpc_request] request_id=abc12345 method=getblockhash client_addr=127.0.0.1:12345 request_size=123
```
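Text-format lines like the one above can be turned back into structured records for quick filtering. This sketch assumes every line follows the `[timestamp LEVEL name] key=value ...` layout shown, with no spaces inside values; calling the third bracketed field `target` is a naming assumption.

```python
import re

# Header: [timestamp LEVEL target], followed by space-separated key=value pairs.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\S+) (?P<level>\S+) (?P<target>\S+)\] (?P<fields>.*)$"
)


def parse_log_line(line):
    """Parse one text-format log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = {key: match.group(key) for key in ("timestamp", "level", "target")}
    for pair in match.group("fields").split():
        key, sep, value = pair.partition("=")
        if sep:  # ignore tokens that are not key=value
            record[key] = value
    return record


LINE = (
    "[2025-01-01T00:00:00Z INFO rpc_request] request_id=abc12345 "
    "method=getblockhash client_addr=127.0.0.1:12345 request_size=123"
)
record = parse_log_line(LINE)
print(record["request_id"], record["method"])
```

With `format = "json"` this step is unnecessary, since each line can be passed to a JSON parser directly.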

**Configuration**:
```toml
[logging]
format = "json"  # or "text"
level = "info"   # trace, debug, info, warn, error
```

---

## Configuration

### Complete HA Configuration

```toml
[network]
reconnect_enabled = true
reconnect_max_retries = 10
reconnect_initial_delay_secs = 5
reconnect_max_delay_secs = 3600

[storage]
pruning_mode = "normal"
pruning_threshold_gb = 100
pruning_target_gb = 80

[rpc]
metrics_enabled = true
health_checks_enabled = true

[rpc.auth]
rate_limit_enabled = true
rate_limit_rate = 100
rate_limit_burst = 200

[logging]
format = "json"
level = "info"
```
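Before deploying, a few cross-field checks can catch settings that contradict each other. The sketch below works on a plain dictionary, such as the result of `tomllib.loads` on Python 3.11 or later. Both rules are assumptions about sensible values and not validation that blvm-node is documented to perform; `check_ha_config` is a hypothetical helper.

```python
def check_ha_config(cfg):
    """Return a list of human-readable problems found in an HA config dict."""
    problems = []

    storage = cfg.get("storage", {})
    threshold = storage.get("pruning_threshold_gb")
    target = storage.get("pruning_target_gb")
    # Assumed rule: the prune target should sit below the trigger threshold.
    if threshold is not None and target is not None and target >= threshold:
        problems.append("storage: pruning_target_gb should be below pruning_threshold_gb")

    network = cfg.get("network", {})
    initial = network.get("reconnect_initial_delay_secs")
    maximum = network.get("reconnect_max_delay_secs")
    # Assumed rule: the first delay should not exceed the cap.
    if initial is not None and maximum is not None and initial > maximum:
        problems.append(
            "network: reconnect_initial_delay_secs should not exceed reconnect_max_delay_secs"
        )

    return problems


# Same values as the configuration above, written as a dict.
CONFIG = {
    "network": {"reconnect_initial_delay_secs": 5, "reconnect_max_delay_secs": 3600},
    "storage": {"pruning_mode": "normal", "pruning_threshold_gb": 100, "pruning_target_gb": 80},
}
print(check_ha_config(CONFIG))
```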

---

## Monitoring Setup

### Prometheus Configuration

```yaml
scrape_configs:
  - job_name: 'blvm-node'
    static_configs:
      - targets: ['localhost:18332']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

### Health Check Configuration

**Kubernetes**:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 18332
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 18332
  initialDelaySeconds: 10
  periodSeconds: 5
```

**Load Balancer**:
- Health check endpoint: `/health`
- Health check interval: 10 seconds
- Unhealthy threshold: 3 failures
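
The unhealthy threshold can be modeled as a counter of failed probes. This sketch assumes the failures must be consecutive and that a single `200 OK` restores the node; real load balancers differ on both points, and `HealthTracker` is an illustrative name.

```python
class HealthTracker:
    """Track probe results and decide whether a node stays in rotation."""

    def __init__(self, unhealthy_threshold=3):
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0

    @property
    def in_rotation(self):
        return self.consecutive_failures < self.unhealthy_threshold

    def record(self, status_code):
        """Record one probe of /health.

        status_code is the HTTP status, or None for a timeout or
        connection error. Returns True while the node stays in rotation.
        """
        if status_code == 200:
            self.consecutive_failures = 0  # assumed: one success resets the count
        else:
            self.consecutive_failures += 1
        return self.in_rotation


tracker = HealthTracker()
results = [tracker.record(code) for code in (200, 503, 503, None, 200)]
print(results)
```

At a 10 second interval, a threshold of 3 means a failing node is removed roughly 30 seconds after its first failed probe.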

---

## Related Documentation

- [RPC Reference](RPC_REFERENCE.md) - Complete RPC API
- [Configuration Guide](CONFIGURATION_GUIDE.md) - Node configuration (this repo)
- [Production mainnet node](https://docs.thebitcoincommons.org/getting-started/first-node.html#production-mainnet-node) ([BLVM Documentation](https://docs.thebitcoincommons.org/))