rollblock 0.4.1

# Observability in Rollblock

This guide covers the observability features built into Rollblock, including metrics collection, health monitoring, and structured logging.

## Overview

Rollblock provides comprehensive observability through:
- **Real-time metrics** - Performance counters and statistics
- **Health monitoring** - Store health status tracking
- **Structured logging** - Tracing integration for detailed insights

## Metrics

### Accessing Metrics

Get metrics from the store facade:

```rust
use rollblock::{SimpleStoreFacade, StoreConfig};

let config = StoreConfig::new("./data", 4, 1000, 1, false)?.enable_remote_server()?;
let store = SimpleStoreFacade::new(config)?;

// Access metrics
if let Some(metrics) = store.metrics() {
    let snapshot = metrics.snapshot();
    println!("Operations: {}", snapshot.operations_applied);
}

// Remote server metrics (only available when `enable_server` was set)
if let Some(remote) = store.remote_server_metrics() {
    println!("Active remote connections: {}", remote.active_connections);
}
```

### Available Metrics

#### Counters
- `operations_applied` - Total number of operations applied
- `set_operations_applied` - Total number of set operations applied
- `zero_value_deletes_applied` - Total number of empty-value (`Value::empty()`) delete operations applied
- `blocks_committed` - Number of blocks successfully committed
- `rollbacks_executed` - Number of rollback operations
- `lookups_performed` - Total read operations
- `failed_operations` - Number of failed operations
- `checksum_errors` - Number of data integrity errors

#### Timing Metrics (microseconds)
- `avg_apply_time_us` - Average time to apply a block
- `avg_rollback_time_us` - Average time to rollback
- `avg_lookup_time_us` - Average time for lookups

#### Percentile Metrics (microseconds)
- `apply_p50_us`, `apply_p95_us`, `apply_p99_us` - Apply latency percentiles
- `rollback_p50_us`, `rollback_p95_us`, `rollback_p99_us` - Rollback latency percentiles

#### State Metrics
- `current_block_height` *(a.k.a. `applied_block_height`)* - Highest block applied **in memory**. In asynchronous durability it can be ahead of the on-disk state.
- `durable_block_height` - Highest block durably persisted to journal + metadata.
- `total_keys_stored` - Number of keys in the store
- `last_operation_secs` - Seconds since last operation

### Metrics Snapshot

Capture a point-in-time snapshot of all metrics:

```rust
let snapshot = metrics.snapshot();

// Serialize to JSON for monitoring systems
let json = serde_json::to_string_pretty(&snapshot)?;
println!("{}", json);
```

Output example:
```json
{
  "operations_applied": 1000,
  "set_operations_applied": 900,
  "zero_value_deletes_applied": 100,
  "blocks_committed": 100,
  "rollbacks_executed": 2,
  "avg_apply_time_us": 15000,
  "apply_p95_us": 25000,
  "current_block_height": 120,
  "durable_block_height": 100,
  "failed_operations": 0
}
```

## Health Monitoring

### Health States

Rollblock reports four health states:

- **Healthy** ✅ - Store operating normally with recent activity
- **Idle** 💤 - Store is functional but has no recent operations
- **Degraded** ⚠️ - Store has experienced errors but is operational
- **Unhealthy** ❌ - Store has critical issues

### Checking Health

```rust
if let Some(health) = store.health() {
    match health.state {
        HealthState::Healthy => {
            println!("✅ Store is healthy");
            println!("Current block: {}", health.current_block);
        }
        HealthState::Degraded => {
            println!("⚠️ Store has issues:");
            println!("  Failed operations: {}", health.failed_operations);
            println!("  Checksum errors: {}", health.checksum_errors);
        }
        HealthState::Unhealthy => {
            println!("❌ Store is unhealthy - investigate immediately");
        }
        HealthState::Idle => {
            println!("💤 Store is idle");
        }
    }
}
```

### Health Criteria

A store is considered:
- **Healthy** if: no failed operations, no checksum errors, and activity in the last 60 seconds.
- **Idle** if: no failed operations or checksum errors, but no recent activity.
- **Degraded** if: any checksum errors have been recorded (even if failures also exist).
- **Unhealthy** if: there are failed operations but no checksum errors (checksum errors always map to `Degraded` first).

## Structured Logging with Tracing

Rollblock uses the `tracing` crate for structured, leveled logging.

### Enabling Tracing

Initialize the tracing subscriber in your application:

```rust
fn main() {
    // Basic setup
    tracing_subscriber::fmt::init();

    // Or with environment filter
    tracing_subscriber::fmt()
        .with_env_filter(
            tracing_subscriber::EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info"))
        )
        .init();

    // Your code here
}
```

### Log Levels

Control verbosity with the `RUST_LOG` environment variable:

```bash
# Info level (default) - Major operations only
RUST_LOG=rollblock=info cargo run

# Debug level - Detailed operation traces
RUST_LOG=rollblock=debug cargo run

# Trace level - Very detailed, including lookups
RUST_LOG=rollblock=trace cargo run
```

### Log Output Examples

**INFO level** - Block operations:
```
2025-11-07T22:47:07.328991Z INFO apply_operations: Block applied successfully 
  block_height=1 operations=10 modified_keys=10
```

**DEBUG level** - Internal steps:
```
2025-11-07T22:47:07.328991Z DEBUG apply_operations: Acquiring mutation mutex
2025-11-07T22:47:07.329234Z DEBUG apply_operations: Preparing journal
2025-11-07T22:47:07.329256Z DEBUG apply_operations: Journal prepared
2025-11-07T22:47:07.329567Z DEBUG apply_operations: State committed operations=10 modified_keys=10
```

**TRACE level** - Individual lookups:
```
2025-11-07T22:47:07.364123Z TRACE fetch_many: Lookup batch executed
  key_count=1 duration_us=42
```

### Background Pruner Telemetry

The journal pruner emits structured logs every iteration so operators can confirm
that retention stays within policy:

- `journal_prune_success` (level `INFO`): includes `pruned_through`,
  `chunks_removed`, `entries_removed`, `bytes_freed`, `duration_ms`, and whether
  the iteration was triggered by the background loop, an internal/manual
  `prune_now()` (used in tests), or crash recovery.
- `journal_pruner_nop` (level `INFO`): fired when no sealed chunks qualified for
  pruning. Seeing repeated no-ops usually means you just reached the configured
  window and should expect deletions once more blocks accumulate.
- `pruner_clamped_by_snapshot` (level `WARN`): indicates the most recent snapshot
  watermark forced the pruner to stop early. Either take a new snapshot or widen
  the rollback window if you routinely see this event.
- `journal_pruner_iteration_failed` (level `ERROR`): captures any IO/metadata
  failure along with the error. The `gc_watermark` entry remains in LMDB until
  the work is retried successfully.

You can inspect the `metadata/gc_watermark` tree with tooling like `mdb_dump` to
see whether a plan is in flight. On restart the pruner replays that plan before
writers come back online, so a non-empty watermark combined with the absence of
`journal_prune_success` logs usually points to a crash in the middle of a prune.

### Structured Fields

All logs include structured fields that can be parsed by log aggregators:

- `block_height` - Block number (`apply_operations` spans)
- `ops_count` - Number of operations provided to the block
- `operations` - Operations applied after planning
- `modified_keys` - Keys affected
- `duration_ms` / `duration_us` - Timing information (apply/rollback vs lookups)
- `target_block` - Rollback target in `revert_to`

### JSON Output

For production environments, output logs as JSON:

```rust
tracing_subscriber::fmt()
    .json()
    .with_env_filter("rollblock=info")
    .init();
```

Output:
```json
{
  "timestamp": "2025-11-07T22:47:07.328991Z",
  "level": "INFO",
  "fields": {
    "message": "Block applied successfully",
    "block_height": 1,
    "operations": 10,
    "modified_keys": 10,
    "duration_ms": 25
  },
  "target": "rollblock::orchestrator"
}
```

## Integration with Monitoring Systems

### Prometheus

Export metrics to Prometheus:

```rust
use rollblock::metrics::MetricsSnapshot;

fn export_prometheus(snapshot: &MetricsSnapshot) -> String {
    format!(
        r#"
# HELP rollblock_operations_total Total operations applied
# TYPE rollblock_operations_total counter
rollblock_operations_total {{}} {}

# HELP rollblock_blocks_total Total blocks committed
# TYPE rollblock_blocks_total counter
rollblock_blocks_total {{}} {}

# HELP rollblock_apply_duration_microseconds Block apply duration
# TYPE rollblock_apply_duration_microseconds summary
rollblock_apply_duration_microseconds{{quantile="0.5"}} {}
rollblock_apply_duration_microseconds{{quantile="0.95"}} {}
rollblock_apply_duration_microseconds{{quantile="0.99"}} {}
"#,
        snapshot.operations_applied,
        snapshot.blocks_committed,
        snapshot.apply_p50_us,
        snapshot.apply_p95_us,
        snapshot.apply_p99_us
    )
}
```

### Health Check Endpoint

Implement a health check endpoint:

```rust
use rollblock::metrics::HealthState;

fn health_check(store: &SimpleStoreFacade) -> (u16, &'static str) {
    match store.health() {
        Some(health) => match health.state {
            HealthState::Healthy => (200, "OK"),
            HealthState::Idle => (200, "OK - Idle"),
            HealthState::Degraded => (503, "Service Degraded"),
            HealthState::Unhealthy => (503, "Service Unhealthy"),
        },
        None => (500, "Metrics Unavailable"),
    }
}
```

## Performance Considerations

### Metrics Overhead

Metrics collection has minimal overhead:
- Counter increments: ~5-10 nanoseconds
- Timestamp recording: ~50-100 nanoseconds
- Percentile calculation: Only on snapshot (not per-operation)

### Tracing Overhead

Tracing overhead varies by level:
- **INFO**: <1% overhead (only major operations)
- **DEBUG**: ~1-2% overhead
- **TRACE**: ~5-10% overhead (logs every lookup)

For production, use INFO or DEBUG level only.

### Best Practices

1. **Sample traces in production** - Use `RUST_LOG=rollblock=info` by default
2. **Export metrics periodically** - Take snapshots every 10-60 seconds
3. **Monitor health continuously** - Check health status every 5-10 seconds
4. **Alert on degradation** - Set up alerts for `Degraded` or `Unhealthy` states
5. **Track percentiles** - P95 and P99 latencies reveal performance issues

## Example: Complete Observability Setup

```rust
use rollblock::{SimpleStoreFacade, StoreConfig};
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize tracing
    tracing_subscriber::fmt()
        .with_env_filter("rollblock=info")
        .init();

    // Create store
    let config = StoreConfig::new("./data", 4, 1000, 1, false)?;
    let store = SimpleStoreFacade::new(config)?;

    // Start metrics exporter thread
    let store_clone = store.clone();
    std::thread::spawn(move || {
        loop {
            if let Some(metrics) = store_clone.metrics() {
                let snapshot = metrics.snapshot();
                println!("Metrics: ops={} blocks={} p95={}μs",
                    snapshot.operations_applied,
                    snapshot.blocks_committed,
                    snapshot.apply_p95_us);
            }
            std::thread::sleep(Duration::from_secs(10));
        }
    });

    // Start health checker thread
    let store_clone = store.clone();
    std::thread::spawn(move || {
        loop {
            if let Some(health) = store_clone.health() {
                if health.state != rollblock::metrics::HealthState::Healthy {
                    eprintln!("⚠️ Health alert: {:?}", health);
                }
            }
            std::thread::sleep(Duration::from_secs(5));
        }
    });

    // Your application logic here
    Ok(())
}
```

## Troubleshooting

### No Metrics Available

If `store.metrics()` returns `None`, check:
- You're using `SimpleStoreFacade` (custom orchestrators may not have metrics)
- The store was initialized correctly

### Unexpected Health State

**Degraded state:**
- Check `checksum_errors` counter
- Verify disk integrity
- Review journal files for corruption

**Unhealthy state:**
- Check `failed_operations` counter
- Review error logs with `RUST_LOG=debug`
- Ensure sufficient disk space

### High Latencies

If P95/P99 latencies are high:
1. Check operation batch sizes
2. Verify shard count is appropriate
3. Consider enabling parallelism
4. Monitor disk I/O performance

## See Also

- [Architecture Documentation](architecture.md)
- [Examples](examples.md)
- [Observability Example](../examples/observability.rs)