# Observability Guide
This guide covers how to enable and use observability features in duroxide for production monitoring and debugging.
## Overview
Duroxide provides production-grade observability through:
- **Structured Logging**: Correlated logs with instance_id, execution_id, orchestration_name, and more
- **Metrics Facade**: Counters, histograms, and gauges via the `metrics` crate - install your preferred exporter
- **Log Analytics Integration**: Ready-to-use patterns for Elasticsearch, Loki, CloudWatch, and Azure Monitor
## Quick Start
### Basic Structured Logging (Text to stdout)
The simplest setup uses the Compact format:
```rust
use duroxide::runtime::{Runtime, RuntimeOptions, ObservabilityConfig, LogFormat};

let options = RuntimeOptions {
    observability: ObservabilityConfig {
        log_format: LogFormat::Compact,
        log_level: "info".to_string(),
        ..Default::default()
    },
    ..Default::default()
};

let rt = Runtime::start_with_options(store, activities, orchestrations, options).await;
```
**Output Format**:
```
2025-11-01T17:22:04.494Z INFO duroxide::runtime Orchestration started instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:04.596Z INFO duroxide::runtime Activity started instance_id=order-123 execution_id=1 activity_name=ValidatePayment activity_id=3 worker_id=work-1-cd54
2025-11-01T17:22:04.698Z INFO duroxide::runtime Activity completed instance_id=order-123 execution_id=1 activity_name=ValidatePayment worker_id=work-1-cd54 outcome="success" duration_ms=102
2025-11-01T17:22:04.806Z INFO duroxide::orchestration Payment validated successfully instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54
2025-11-01T17:22:05.010Z INFO duroxide::runtime Orchestration completed instance_id=order-123 execution_id=1 worker_id=orch-0-cd54 history_events=9
```
Session-related log events include:
```
... Session claimed/refreshed on fetch session_id=sess-abc owner_id=my-pod
... Session locks renewed sessions_renewed=3
... Session capacity exceeded after fetch (race), abandoning work item session_id=sess-xyz
```
Format: `timestamp level module message field1=value1 field2=value2 ...`
All correlation fields are included directly on each log line (flat structure, no span nesting).
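Because the fields are flat `key=value` pairs, they are easy to recover mechanically. The following is a minimal, illustrative sketch of pulling correlation fields out of a line in the sample format above; the parser is not part of duroxide, only the field names come from this guide:

```rust
// Sketch: extract correlation fields from a flat key=value log line.
// Tokens without an '=' (timestamp, level, module, message words) are skipped.
use std::collections::HashMap;

fn parse_fields(line: &str) -> HashMap<&str, &str> {
    line.split_whitespace()
        .filter_map(|tok| tok.split_once('='))
        .collect()
}

fn main() {
    let line = "2025-11-01T17:22:04.494Z INFO duroxide::runtime Orchestration started \
                instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder worker_id=orch-0-cd54";
    let fields = parse_fields(line);
    assert_eq!(fields["instance_id"], "order-123");
    assert_eq!(fields["execution_id"], "1");
    assert_eq!(fields["orchestration_name"], "ProcessOrder");
}
```

A real pipeline would use the JSON format plus a log shipper instead, but the same field names apply.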
## Log Formats
### Compact Format
Best for development and debugging with minimal noise:
```rust
ObservabilityConfig {
    log_format: LogFormat::Compact,
    log_level: "info".to_string(),
    ..Default::default()
}
```
Output: `timestamp level module [instance_id] [worker_id] message`
### Pretty Format
Full structured fields for detailed debugging:
```rust
ObservabilityConfig {
    log_format: LogFormat::Pretty,
    log_level: "debug".to_string(),
    ..Default::default()
}
```
Output includes all fields as `key=value` pairs:
```
2024-10-30T10:15:23.456Z INFO duroxide::runtime instance_id=order-123 execution_id=1 orchestration_name=ProcessOrder version=1.0.0 worker_id=orch-0-a3f9: Orchestration started
```
### JSON Format
For log aggregators and analytics platforms:
```rust
ObservabilityConfig {
    log_format: LogFormat::Json,
    log_level: "info".to_string(),
    ..Default::default()
}
```
Output: One JSON object per line with all structured fields
## Correlation Fields
Every log entry includes these fields for correlation:
- **instance_id** - Unique orchestration instance identifier
- **execution_id** - Execution number (1, 2, 3... for ContinueAsNew)
- **orchestration_name** - Name of the orchestration
- **orchestration_version** - Semantic version
- **activity_name** - Activity name (in activity context)
- **activity_id** - Event ID of the activity
- **worker_id** - Dispatcher worker ID (format: `work-{index}-{runtime_id}` or `orch-{index}-{runtime_id}`)
- **timestamp** - ISO8601 timestamp
### Example: Tracing an Orchestration
All logs for instance `order-123`:
```bash
# In Elasticsearch
instance_id:"order-123"

# In CloudWatch Insights
fields @timestamp, message | filter instance_id = "order-123" | sort @timestamp asc

# In Loki (LogQL)
{job="duroxide"} | json | instance_id="order-123"
```
## User Orchestration Logging
Use `ctx.trace_*()` methods in your orchestrations for replay-safe logging:
```rust
async fn process_order(ctx: OrchestrationContext, order: String) -> Result<String, String> {
    ctx.trace_info("Processing order started");

    let validated = ctx.schedule_activity("ValidateOrder", order.clone())
        .await?;
    ctx.trace_info(format!("Validation result: {}", validated));

    if validated == "invalid" {
        ctx.trace_error("Order validation failed");
        return Err("validation_failed".to_string());
    }

    ctx.trace_info("Order processing complete");
    Ok("processed".to_string())
}
```
**Output**:
```
2024-10-30T10:15:23.690Z INFO duroxide::orchestration [order-123] Processing order started
2024-10-30T10:15:23.750Z INFO duroxide::orchestration [order-123] Validation result: valid
2024-10-30T10:15:23.890Z INFO duroxide::orchestration [order-123] Order processing complete
```
**Key Points**:
- All `ctx.trace_*()` calls are deterministic (replay-safe)
- Context fields automatically included (instance_id, execution_id, etc.)
- Only logged on first execution (not during replay)
- Creates SystemCall events in history for determinism
## Metrics
Duroxide emits metrics via the [`metrics`](https://crates.io/crates/metrics) facade crate. To export metrics, install a recorder before starting the runtime:
```rust
// Option 1: Prometheus (direct scraping)
metrics_exporter_prometheus::PrometheusBuilder::new()
    .with_http_listener(([0, 0, 0, 0], 9090))
    .install()?;

// Option 2: OpenTelemetry (via metrics-exporter-opentelemetry)
// metrics_exporter_opentelemetry::Recorder::builder("my-service").install_global()?;

// Option 3: None - metrics become zero-cost no-ops (default)
```
### Available Metrics
All metrics follow Prometheus naming conventions with `_total` for counters and `_seconds` for time measurements.
### Metrics Summary Table
| Metric | Type | Category | Labels | Purpose |
|--------|------|----------|--------|---------|
| **Orchestration Lifecycle** | | | | |
| `duroxide_orchestration_starts_total` | Counter | Orchestration | orchestration_name, version, initiated_by | Track which orchestrations start |
| `duroxide_orchestration_completions_total` | Counter | Orchestration | orchestration_name, version, status, final_turn_count | Success rate, completion tracking |
| `duroxide_orchestration_failures_total` | Counter | Orchestration | orchestration_name, version, error_type, error_category | Failure analysis |
| `duroxide_orchestration_duration_seconds` | Histogram | Orchestration | orchestration_name, version, status | E2E duration, p95/p99 latency |
| `duroxide_orchestration_history_size` | Histogram | Orchestration | orchestration_name | Detect history growth issues |
| `duroxide_orchestration_turns` | Histogram | Orchestration | orchestration_name | Identify multi-turn orchestrations |
| `duroxide_orchestration_continue_as_new_total` | Counter | Orchestration | orchestration_name, execution_id | Track CAN operations |
| `duroxide_active_orchestrations` | Gauge | Orchestration | _(none)_ | **Current** active count |
| **Activity Execution** | | | | |
| `duroxide_activity_executions_total` | Counter | Activity | activity_name, outcome, retry_attempt | Track success/failure rates |
| `duroxide_activity_duration_seconds` | Histogram | Activity | activity_name, outcome | Activity latency p95/p99 |
| `duroxide_activity_errors_total` | Counter | Activity | activity_name, error_type, retryable | Detailed error tracking |
| **Sub-Orchestration** | | | | |
| `duroxide_suborchestration_calls_total` | Counter | Sub-Orch | parent_orchestration, child_orchestration, outcome | Track sub-orch calls |
| `duroxide_suborchestration_duration_seconds` | Histogram | Sub-Orch | parent_orchestration, child_orchestration, outcome | Sub-orch latency |
| **Provider (Storage)** | | | | |
| `duroxide_provider_operation_duration_seconds` | Histogram | Provider | operation, status | Database operation latency |
| `duroxide_provider_errors_total` | Counter | Provider | operation, error_type | Storage layer errors |
| **Client Operations** | | | | |
| `duroxide_client_orchestration_starts_total` | Counter | Client | orchestration_name | Client-initiated starts |
| `duroxide_client_external_events_raised_total` | Counter | Client | event_name | External events raised |
| `duroxide_client_cancellations_total` | Counter | Client | _(none)_ | Cancellation requests |
| `duroxide_client_wait_duration_seconds` | Histogram | Client | _(none)_ | Client wait latency |
| **Internal (Dispatcher)** | | | | |
| `duroxide.orchestration.dispatcher.items_fetched` | Counter | Internal | _(none)_ | Items fetched by orch dispatcher |
| `duroxide.orchestration.dispatcher.processing_duration_ms` | Histogram | Internal | _(none)_ | Orch processing time |
| `duroxide.worker.dispatcher.items_fetched` | Counter | Internal | _(none)_ | Items fetched by worker dispatcher |
| `duroxide.worker.dispatcher.execution_duration_ms` | Histogram | Internal | _(none)_ | Worker execution time |
**Note**: Internal dispatcher metrics use dot notation and milliseconds (not exported to Prometheus by default).
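Several counters above avoid label-cardinality explosions by bucketing raw values into fixed label strings (e.g. `final_turn_count` as 1-5|6-10|11-50|50+, `retry_attempt` as 0|1|2|3+). A minimal sketch of that bucketing, using the ranges listed in this guide (the function itself is illustrative, not duroxide's source):

```rust
// Sketch: collapse a raw turn count into the low-cardinality
// `final_turn_count` label values documented above.
fn final_turn_count_label(turns: u32) -> &'static str {
    match turns {
        0..=5 => "1-5",
        6..=10 => "6-10",
        11..=50 => "11-50",
        _ => "50+",
    }
}

fn main() {
    assert_eq!(final_turn_count_label(1), "1-5");
    assert_eq!(final_turn_count_label(7), "6-10");
    assert_eq!(final_turn_count_label(120), "50+");
}
```

Keeping label values to a handful of fixed strings keeps the time-series count bounded no matter how many turns an orchestration takes.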
#### Orchestration Lifecycle Metrics
**`duroxide_orchestration_starts_total`** (Counter)
- **Description**: Total orchestrations started
- **Labels**: `orchestration_name`, `version`, `initiated_by` (client|suborchestration|continueAsNew)
- **Use**: Track which orchestrations are running, identify version distribution
**`duroxide_orchestration_completions_total`** (Counter)
- **Description**: Orchestrations that completed (successfully or failed)
- **Labels**: `orchestration_name`, `version`, `status` (success|failed|cancelled), `final_turn_count` (1-5|6-10|11-50|50+)
- **Use**: Success rate tracking, identify orchestrations with high turn counts
**`duroxide_orchestration_failures_total`** (Counter)
- **Description**: Orchestration failures with detailed error classification
- **Labels**: `orchestration_name`, `version`, `error_type` (app_error|infrastructure_error|config_error), `error_category`
- **Use**: Root cause analysis, distinguish infrastructure vs application errors
**`duroxide_orchestration_duration_seconds`** (Histogram)
- **Description**: End-to-end orchestration execution time
- **Labels**: `orchestration_name`, `version`, `status`
- **Buckets**: `[0.1, 0.5, 1, 2, 5, 10, 30, 60, 300, 600, 1800, 3600]` seconds
- **Use**: Identify slow orchestrations, track p50/p95/p99 latency
**`duroxide_orchestration_history_size`** (Histogram)
- **Description**: History event count at orchestration completion
- **Labels**: `orchestration_name`
- **Buckets**: `[10, 50, 100, 500, 1000, 5000, 10000]` events
- **Use**: Identify orchestrations with unbounded history growth
**`duroxide_orchestration_turns`** (Histogram)
- **Description**: Number of turns to orchestration completion
- **Labels**: `orchestration_name`
- **Buckets**: `[1, 2, 5, 10, 20, 50, 100, 200, 500]`
- **Use**: Detect orchestrations requiring many turns (potential optimization targets)
**`duroxide_orchestration_continue_as_new_total`** (Counter)
- **Description**: Continue-as-new operations performed
- **Labels**: `orchestration_name`, `execution_id`
- **Use**: Verify continue-as-new is working, identify long-running actors
**`duroxide_active_orchestrations`** (Gauge) 🆕
- **Description**: Current number of orchestration instances that are actively running (not completed/failed)
- **Value**: Count of orchestrations currently in progress
- **Use**: Track concurrent orchestrations, detect leaks, capacity planning
- **Critical**: This is a GAUGE (can increase/decrease), unlike counters that only increase
- **Note**: Continue-as-new does NOT change this count (orchestration stays active)
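The gauge semantics above can be sketched in a few lines. This is an illustration of the increment/decrement behavior (including the continue-as-new no-op), not duroxide's implementation:

```rust
// Sketch: gauge semantics for active orchestrations. Unlike counters, the
// value moves both ways; continue-as-new neither increments nor decrements.
use std::sync::atomic::{AtomicI64, Ordering};

static ACTIVE: AtomicI64 = AtomicI64::new(0);

fn on_start()           { ACTIVE.fetch_add(1, Ordering::Relaxed); }
fn on_complete()        { ACTIVE.fetch_sub(1, Ordering::Relaxed); }
fn on_continue_as_new() { /* instance stays active: no change */ }

fn main() {
    on_start();            // active = 1
    on_start();            // active = 2
    on_continue_as_new();  // active = 2 (unchanged)
    on_complete();         // active = 1
    assert_eq!(ACTIVE.load(Ordering::Relaxed), 1);
}
```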
#### Activity Execution Metrics
**`duroxide_activity_executions_total`** (Counter)
- **Description**: Activity execution attempts
- **Labels**: `activity_name`, `outcome` (success|app_error|infra_error|config_error), `retry_attempt` (0|1|2|3+)
- **Use**: Identify flaky activities, track retry rates
**`duroxide_activity_duration_seconds`** (Histogram)
- **Description**: Activity execution time (wall clock)
- **Labels**: `activity_name`, `outcome`
- **Buckets**: `[0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]` seconds
- **Use**: Identify slow activities, set appropriate timeouts
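Prometheus histograms are cumulative: one observation increments every bucket whose upper bound (`le`) is at least the observed value, plus the implicit `+Inf` bucket. A small sketch using the activity-duration bounds listed above (the helper is illustrative, not library code):

```rust
// Sketch: which cumulative buckets a single observed duration increments,
// given the bucket upper bounds documented for activity duration.
fn incremented_buckets(duration_s: f64, bounds: &[f64]) -> Vec<f64> {
    bounds.iter().copied().filter(|&le| duration_s <= le).collect()
}

fn main() {
    let bounds = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0];
    // A 0.7s activity lands in the 1s bucket and every wider one.
    let hit = incremented_buckets(0.7, &bounds);
    assert_eq!(hit.first(), Some(&1.0));
    assert_eq!(hit.len(), 8);
}
```

This cumulative layout is what makes `histogram_quantile()` over `_bucket` series work in the Prometheus queries later in this guide.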
**`duroxide_activity_errors_total`** (Counter)
- **Description**: Detailed activity error tracking
- **Labels**: `activity_name`, `error_type` (app_error|infrastructure_error|config_error), `retryable` (true|false)
- **Use**: Distinguish transient vs permanent errors
#### Sub-Orchestration Metrics
**`duroxide_suborchestration_calls_total`** (Counter)
- **Description**: Sub-orchestration invocations
- **Labels**: `parent_orchestration`, `child_orchestration`, `outcome`
- **Use**: Understand orchestration composition, trace call graphs
**`duroxide_suborchestration_duration_seconds`** (Histogram)
- **Description**: Sub-orchestration execution time
- **Labels**: `parent_orchestration`, `child_orchestration`, `outcome`
- **Buckets**: `[0.1, 0.5, 1, 5, 10, 30, 60, 300, 600]` seconds
#### Provider Metrics
**`duroxide_provider_operation_duration_seconds`** (Histogram)
- **Description**: Database operation latency
- **Labels**: `operation` (fetch|ack|save_event|create_instance), `status` (success|error)
- **Buckets**: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]` seconds
- **Use**: Database performance monitoring, identify slow queries
**`duroxide_provider_errors_total`** (Counter)
- **Description**: Provider/storage layer errors
- **Labels**: `operation`, `error_type` (timeout|connection|deadlock|corruption)
- **Use**: Database health monitoring, alerting
#### Client Metrics
**`duroxide_client_orchestration_starts_total`** (Counter)
- **Description**: Orchestrations started via client
- **Labels**: `orchestration_name`
**`duroxide_client_external_events_raised_total`** (Counter)
- **Description**: External events raised via client
- **Labels**: `event_name`
**`duroxide_client_cancellations_total`** (Counter)
- **Description**: Orchestration cancellations via client
**`duroxide_client_wait_duration_seconds`** (Histogram)
- **Description**: Client wait operation duration
- **Buckets**: `[0.1, 0.5, 1, 2, 5, 10, 30, 60, 120]` seconds
## Error Classification
Duroxide categorizes errors into three types for better observability:
### Infrastructure Errors
Provider failures, data corruption, network issues:
- Transaction failures
- Lock timeouts
- Serialization errors
- Database connectivity issues
**Metrics**: failure counters labeled `error_type="infrastructure_error"`
### Configuration Errors
Deployment/registration issues:
- Unregistered orchestrations/activities
- Missing versions
- Nondeterminism detected
**Metrics**: failure counters labeled `error_type="config_error"`
### Application Errors
Business logic failures:
- Activity returns `Err(...)`
- User-initiated cancellations
- Business validation failures
**Metrics**: Activity/orchestration failure counters with `app_error` outcome
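A sketch of how failures might be mapped onto these three classes before being attached as metric labels. The enum and the string-matching classifier are hypothetical; only the category names come from this guide:

```rust
// Sketch: classify a failure into the three error classes above so it can
// be labeled on metrics. Illustrative only - not duroxide's classifier.
#[derive(Debug, PartialEq)]
enum ErrorClass {
    Infrastructure, // provider/network: usually transient, retryable
    Configuration,  // registration/nondeterminism: needs a deploy fix
    Application,    // business logic: surfaced to the orchestration
}

fn classify(err: &str) -> ErrorClass {
    match err {
        e if e.contains("lock timeout") || e.contains("connection") => ErrorClass::Infrastructure,
        e if e.contains("unregistered") || e.contains("nondeterminism") => ErrorClass::Configuration,
        _ => ErrorClass::Application,
    }
}

fn main() {
    assert_eq!(classify("connection refused"), ErrorClass::Infrastructure);
    assert_eq!(classify("nondeterminism detected"), ErrorClass::Configuration);
    assert_eq!(classify("validation_failed"), ErrorClass::Application);
}
```

The payoff is in alerting: infrastructure and configuration errors are actionable by operators, while application errors belong to the business logic's owners.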
## Log Analytics Integration
### 1. Elasticsearch / OpenSearch
**Setup**:
```yaml
# otel-collector-config.yaml
exporters:
  elasticsearch:
    endpoints: ["http://elasticsearch:9200"]
    logs_index: duroxide-logs
```
**Queries**:
```
# All logs for an orchestration
instance_id:"order-123"
# Failed activities
activity_name:* AND outcome:"app_error"
# Nondeterminism errors
error_type:"nondeterminism" AND level:"ERROR"
```
### 2. Grafana Loki
**Setup**:
```yaml
# promtail-config.yaml
scrape_configs:
  - job_name: duroxide
    static_configs:
      - targets:
          - localhost
        labels:
          job: duroxide-worker
```
**LogQL**:
```logql
# Instance timeline
{job="duroxide"} | json | instance_id="order-123"

# Error rate by orchestration
sum by (orchestration_name) (
  rate({job="duroxide"} | json | level="error" [5m])
)
```
### 3. AWS CloudWatch Logs
**Setup**: Use CloudWatch Logs agent or OTLP exporter
**CloudWatch Insights Queries**:
```
# Orchestration trace
fields @timestamp, instance_id, orchestration_name, message
| filter instance_id = "order-123"
| sort @timestamp asc

# Activity success rate
fields activity_name, outcome
| filter activity_name != ""
| stats count() by activity_name, outcome
```
### 4. Azure Monitor / Application Insights
**Setup**: Use Azure Monitor OTLP exporter
**KQL Queries**:
```kql
// Full orchestration timeline
traces
| where tostring(customDimensions.instance_id) == "order-123"
| project timestamp, severityLevel, message
| order by timestamp asc
```
## Debugging Workflows
### Trace a Single Orchestration
1. Note the instance_id when starting: `order-123`
2. Query logs by instance_id
3. Filter by execution_id if using ContinueAsNew
4. Trace activity execution within the orchestration
### Find Slow Activities
Query activity duration metrics:
```promql
# Prometheus query - p99 latency
histogram_quantile(0.99,
  sum by (le, activity_name) (
    rate(duroxide_activity_duration_seconds_bucket[5m])
  )
)

# Activity success rate
sum by (activity_name) (
  rate(duroxide_activity_executions_total{outcome="success"}[5m])
)
/
sum by (activity_name) (
  rate(duroxide_activity_executions_total[5m])
)
```
### Identify Nondeterminism Issues
Search logs for configuration errors:
```
error_type:"nondeterminism" AND level:"ERROR"
```
These indicate scheduling order mismatches that need code fixes.
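Conceptually, the detector compares each action the replaying code schedules against what history recorded at the same position. A toy sketch of that check (illustrative only, not duroxide's actual detector):

```rust
// Sketch: the shape of a nondeterminism check. On replay, each scheduled
// action is compared against history at the same position; a mismatch means
// the code path changed between executions.
fn check_replay(recorded: &[&str], replayed: &[&str]) -> Result<(), String> {
    for (i, (rec, rep)) in recorded.iter().zip(replayed).enumerate() {
        if rec != rep {
            return Err(format!(
                "nondeterminism at position {i}: history has {rec}, code scheduled {rep}"
            ));
        }
    }
    Ok(())
}

fn main() {
    // Same schedule order: replay is consistent.
    assert!(check_replay(
        &["ValidatePayment", "ShipOrder"],
        &["ValidatePayment", "ShipOrder"]
    ).is_ok());
    // Order changed (e.g. branching on wall-clock time or a random value): flagged.
    assert!(check_replay(
        &["ValidatePayment", "ShipOrder"],
        &["ShipOrder", "ValidatePayment"]
    ).is_err());
}
```

Typical culprits are reading system time, random numbers, or external state directly in orchestration code instead of through the context.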
## Performance Impact
### Without a Metrics Recorder (default)
- Zero overhead from metrics collection (no-op fast path)
- Minimal logging overhead (only structured field extraction)
- Recommended for development and when metrics export is not needed
### With a Metrics Recorder Installed
- ~2-5% overhead for metrics collection
- Lock-free atomic operations
- Background export (no blocking)
- Recommended for production monitoring
### Log Format Impact
- **Compact**: Lowest overhead
- **Pretty**: Slight overhead for formatting
- **JSON**: Minimal additional overhead
## Environment Variables
Override configuration via environment variables:
```bash
# Logging level
RUST_LOG=info,duroxide=debug
# OpenTelemetry (if using OTLP exporter in your application)
# These are consumed by opentelemetry crates, not duroxide directly
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_SERVICE_NAME=duroxide-worker
OTEL_SERVICE_VERSION=1.0.0
```
## Example Prometheus Queries
### Orchestration Success Rate
```promql
sum by (orchestration_name) (
  rate(duroxide_orchestration_completions_total{status="success"}[5m])
)
/
sum by (orchestration_name) (
  rate(duroxide_orchestration_completions_total[5m])
)
```
### P95 Orchestration Duration
```promql
histogram_quantile(0.95,
  sum by (le, orchestration_name) (
    rate(duroxide_orchestration_duration_seconds_bucket[5m])
  )
)
```
### Activity Error Rate
```promql
sum by (activity_name) (
  rate(duroxide_activity_executions_total{outcome!="success"}[5m])
)
```
### Top 5 Slowest Activities
```promql
topk(5,
  histogram_quantile(0.95,
    sum by (le, activity_name) (
      rate(duroxide_activity_duration_seconds_bucket[5m])
    )
  )
)
```
### Infrastructure Errors (Actionable)
```promql
sum by (orchestration_name) (
  rate(duroxide_orchestration_failures_total{error_type="infrastructure_error"}[5m])
)
```
### Orchestrations Requiring Many Turns
```promql
# Using completion labels (bucketed ranges)
sum by (orchestration_name) (
  duroxide_orchestration_completions_total{final_turn_count="50+"}
)

# Using turns histogram (p95 turn count)
histogram_quantile(0.95,
  sum by (le, orchestration_name) (
    rate(duroxide_orchestration_turns_bucket[5m])
  )
)

# Orchestrations that took >100 turns
sum(rate(duroxide_orchestration_turns_bucket{le="+Inf"}[5m]))
-
sum(rate(duroxide_orchestration_turns_bucket{le="100"}[5m]))
```
### Currently Active Orchestrations (Gauge)
```promql
# Total active right now
duroxide_active_orchestrations

# Detect orchestration leaks (if this keeps growing, you have a leak)
delta(duroxide_active_orchestrations[1h]) > 100
```
## Best Practices
1. **Use Compact format in development** for readable logs
2. **Use JSON format in production** for log aggregation
3. **Include context in user traces**: Use descriptive messages
4. **Monitor infrastructure_errors** closely - indicates system issues
5. **Track configuration_errors** - may require deployment fixes
6. **Use instance_id** as primary correlation field
7. **Query by orchestration_name** to find patterns across instances
8. **Enable metrics in production** for performance insights
9. **Use histogram metrics** for p95/p99 latency tracking, not just averages
10. **Alert on error_type labels** to distinguish actionable errors from business logic failures
## Troubleshooting
### Logs Not Showing
Check the log level:
```rust
ObservabilityConfig {
    log_level: "debug".to_string(), // Lower for more logs
    ..Default::default()
}
```
### Missing Context Fields
Ensure logs are within an orchestration or activity span. Framework logs automatically include context.
### Metrics Not Exporting
Verify:
1. `observability` feature flag is enabled
2. OTLP endpoint is accessible
3. Export interval allows time for collection
## Example: Full Production Configuration
```rust
use duroxide::runtime::{RuntimeOptions, ObservabilityConfig, LogFormat};

let options = RuntimeOptions {
    observability: ObservabilityConfig {
        // Metrics
        metrics_enabled: true,
        metrics_export_endpoint: Some("http://otel-collector:4317".to_string()),
        metrics_export_interval_ms: 10000,

        // Logging
        log_format: LogFormat::Json,
        log_level: "info".to_string(),

        // Service identification
        service_name: "payment-processor".to_string(),
        service_version: Some(env!("CARGO_PKG_VERSION").to_string()),
    },
    orchestration_concurrency: 4,
    worker_concurrency: 8,
    ..Default::default()
};
```
## Grafana Dashboard
A complete Grafana dashboard with 11 panels is available at:
- [examples/grafana-dashboard.json](../examples/grafana-dashboard.json)
Import this dashboard into Grafana to get:
- Orchestration success rate by name
- P95/P99 duration tracking
- Activity execution rates and latency
- Error classification
- Continue-as-new operations
- History size distribution
- Provider operation latency
- Turn count distribution
## See Also
- [Telemetry Implementation Summary](telemetry-implementation-summary.md) - Complete metrics specification
- [Provider Observability Guide](provider-observability.md) - For custom provider implementors
- [Library Observability Guide](library-observability.md) - For library developers
- [examples/with_observability.rs](../examples/with_observability.rs) - Working example
- [examples/grafana-dashboard.json](../examples/grafana-dashboard.json) - Production dashboard