---
title: Monitoring and Observability Guide
layout: documentation
description: Comprehensive monitoring and observability setup for Caxton multi-agent systems using OpenTelemetry, Prometheus, Jaeger, and custom dashboards.
---
# Monitoring and Observability Guide
This guide covers setting up comprehensive monitoring and observability for Caxton multi-agent systems, including metrics collection, distributed tracing, log aggregation, alerting, and performance monitoring.
## Observability Architecture
Caxton's observability stack is built on industry-standard tools:
- **OpenTelemetry**: Unified telemetry collection and export
- **Prometheus**: Metrics storage and alerting
- **Jaeger**: Distributed tracing
- **Grafana**: Visualization and dashboards
- **Fluentd/Loki**: Log aggregation
- **AlertManager**: Alert routing and management
## OpenTelemetry Integration
### Configuration
Caxton has built-in OpenTelemetry support that can be configured through the main configuration file:
```toml
# caxton.toml
[observability]
# Enable OpenTelemetry instrumentation
enable_tracing = true
enable_metrics = true
enable_logging = true
# OTLP export configuration
otlp_endpoint = "http://otel-collector:4317"
otlp_timeout = "10s"
otlp_headers = { "api-key" = "your-api-key" }
# Sampling configuration
trace_sample_rate = 0.1 # Sample 10% of traces
metrics_export_interval = "10s"
# Resource attributes
service_name = "caxton-runtime"
service_version = "0.2.0"
deployment_environment = "production"
```
### OpenTelemetry Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
resource:
attributes:
- key: environment
value: production
action: upsert
tail_sampling:
decision_wait: 10s
num_traces: 100
expected_new_traces_per_sec: 10
policies:
- name: error_sampling
type: status_code
status_code: {status_codes: [ERROR]}
- name: slow_requests
type: latency
latency: {threshold_ms: 1000}
- name: random_sampling
type: probabilistic
probabilistic: {sampling_percentage: 10}
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
namespace: caxton
const_labels:
environment: production
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
loki:
endpoint: http://loki:3100/loki/api/v1/push
tenant_id: caxton
service:
pipelines:
traces:
receivers: [otlp]
processors: [resource, tail_sampling, batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [resource, batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [resource, batch]
exporters: [loki]
```
## Metrics Collection
### Core Metrics
Caxton automatically exposes the following metrics categories:
#### Runtime Metrics
- `caxton_agents_total`: Total number of agents
- `caxton_agents_active`: Currently active agents
- `caxton_agent_executions_total`: Total agent executions
- `caxton_agent_execution_duration_seconds`: Agent execution time
- `caxton_wasm_memory_usage_bytes`: WASM memory usage per agent
- `caxton_runtime_memory_usage_bytes`: Runtime memory usage
- `caxton_runtime_cpu_usage_ratio`: CPU utilization
#### Message Protocol Metrics
- `caxton_messages_sent_total`: FIPA messages sent
- `caxton_messages_received_total`: FIPA messages received
- `caxton_message_processing_duration_seconds`: Message processing time
- `caxton_message_queue_size`: Current message queue size
- `caxton_message_errors_total`: Message processing errors
#### HTTP API Metrics
- `caxton_http_requests_total`: HTTP requests by method/status
- `caxton_http_request_duration_seconds`: HTTP request duration
- `caxton_http_active_connections`: Active HTTP connections
- `caxton_websocket_connections`: Active WebSocket connections
#### System Resource Metrics
- `caxton_disk_usage_bytes`: Disk space usage
- `caxton_network_bytes_total`: Network I/O
- `caxton_file_descriptors`: Open file descriptors
### Custom Metrics Configuration
```toml
[observability.metrics]
# Custom histogram buckets for latency metrics
http_duration_buckets = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
agent_execution_buckets = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0]
# Metric labels to include
include_labels = ["agent_type", "agent_id", "message_type"]
exclude_labels = ["sensitive_data"]
# Export configuration
export_interval = "15s"
export_timeout = "5s"
```
### Prometheus Configuration
```yaml
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "caxton_rules.yml"
scrape_configs:
- job_name: 'caxton-runtime'
static_configs:
- targets: ['caxton-runtime:9090']
scrape_interval: 10s
metrics_path: /metrics
- job_name: 'caxton-kubernetes'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: (+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8889']
```
## Distributed Tracing
### Trace Context Propagation
Caxton automatically propagates trace context through:
- HTTP headers (W3C Trace Context)
- FIPA message metadata
- Internal agent communications
- Database operations
### Custom Spans
```rust
// Example: Adding custom spans in agent code
use opentelemetry::trace::Tracer;
#[tracing::instrument(name = "agent.execute_task")]
async fn execute_task(&self, task: Task) -> Result<TaskResult> {
let span = tracer.start("task.processing");
span.set_attribute("task.type", task.task_type.clone());
span.set_attribute("task.priority", task.priority as i64);
// Process task
let result = self.process_task_internal(task).await;
match &result {
Ok(_) => span.set_status(Status::Ok),
Err(e) => {
span.set_status(Status::Error {
description: e.to_string().into()
});
}
}
result
}
```
### Jaeger Configuration
```yaml
# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.45
ports:
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP collector
- containerPort: 14250 # gRPC collector
- containerPort: 6831 # UDP agent
env:
- name: COLLECTOR_OTLP_ENABLED
value: "true"
- name: SPAN_STORAGE_TYPE
value: elasticsearch
- name: ES_SERVER_URLS
value: http://elasticsearch:9200
resources:
requests:
memory: 512Mi
cpu: 250m
limits:
memory: 1Gi
cpu: 500m
```
### Trace Sampling Strategies
```json
{
"service_strategies": [
{
"service": "caxton-runtime",
"type": "probabilistic",
"param": 0.1,
"max_traces_per_second": 100,
"operation_strategies": [
{
"operation": "agent.execute",
"type": "probabilistic",
"param": 0.2
},
{
"operation": "message.process",
"type": "ratelimiting",
"param": 50
}
]
}
],
"default_strategy": {
"type": "probabilistic",
"param": 0.05
}
}
```
## Log Aggregation
### Structured Logging Configuration
```toml
[observability.logging]
level = "info"
format = "json"
output = "stdout"
# Log correlation
include_trace_id = true
include_span_id = true
# Field configuration
timestamp_format = "rfc3339"
level_key = "level"
message_key = "message"
trace_id_key = "trace_id"
span_id_key = "span_id"
# Sampling for high-volume logs
debug_sample_rate = 0.01
trace_sample_rate = 0.1
```
### Fluentd Configuration
```yaml
# fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
data:
fluent.conf: |
<source>
@type tail
@id caxton_logs
path /var/log/containers/caxton-*.log
pos_file /var/log/fluentd-caxton.log.pos
tag caxton.*
format json
read_from_head true
</source>
<filter caxton.**>
@type parser
key_name message
reserve_data true
<parse>
@type json
json_parser_error_class JSONParserError
</parse>
</filter>
<filter caxton.**>
@type record_transformer
<record>
service "caxton-runtime"
environment "production"
cluster "#{ENV['CLUSTER_NAME']}"
</record>
</filter>
<match caxton.**>
@type copy
<store>
@type loki
url http://loki:3100
tenant ""
extra_labels {"service":"caxton"}
line_format json
<label>
level
service
environment
agent_id
</label>
</store>
<store>
@type elasticsearch
host elasticsearch
port 9200
index_name caxton-logs
type_name _doc
include_timestamp true
</store>
</match>
```
### Loki Configuration
```yaml
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
common:
path_prefix: /tmp/loki
storage:
filesystem:
chunks_directory: /tmp/loki/chunks
rules_directory: /tmp/loki/rules
replication_factor: 1
ring:
instance_addr: 127.0.0.1
kvstore:
store: inmemory
query_range:
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 100
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://alertmanager:9093
limits_config:
ingestion_rate_mb: 16
ingestion_burst_size_mb: 24
max_streams_per_user: 10000
max_line_size: 256KB
```
## Alerting Rules
### Prometheus Alert Rules
```yaml
# caxton_rules.yml
groups:
- name: caxton.rules
rules:
# Agent health alerts
- alert: CaxtonAgentHighFailureRate
expr: rate(caxton_agent_executions_total{status="error"}[5m]) / rate(caxton_agent_executions_total[5m]) > 0.05
for: 2m
labels:
severity: warning
service: caxton
annotations:
summary: "High agent execution failure rate"
description: "Agent execution failure rate is {{ $value | humanizePercentage }} for {{ $labels.agent_type }}"
- alert: CaxtonAgentMemoryLeak
expr: increase(caxton_wasm_memory_usage_bytes[30m]) > 100*1024*1024
for: 5m
labels:
severity: critical
service: caxton
annotations:
summary: "Potential memory leak in agent"
description: "Agent {{ $labels.agent_id }} memory usage increased by {{ $value | humanizeBytes }} in 30 minutes"
# System resource alerts
- alert: CaxtonHighCPUUsage
expr: caxton_runtime_cpu_usage_ratio > 0.8
for: 5m
labels:
severity: warning
service: caxton
annotations:
summary: "High CPU usage"
description: "CPU usage is {{ $value | humanizePercentage }}"
- alert: CaxtonHighMemoryUsage
expr: caxton_runtime_memory_usage_bytes / 1024/1024/1024 > 16
for: 5m
labels:
severity: critical
service: caxton
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value | humanizeBytes }}"
# API performance alerts
- alert: CaxtonHighResponseTime
expr: histogram_quantile(0.95, rate(caxton_http_request_duration_seconds_bucket[5m])) > 2.0
for: 2m
labels:
severity: warning
service: caxton
annotations:
summary: "High HTTP response time"
description: "95th percentile response time is {{ $value }}s"
- alert: CaxtonServiceDown
expr: up{job="caxton-runtime"} == 0
for: 1m
labels:
severity: critical
service: caxton
annotations:
summary: "Caxton service is down"
description: "Caxton runtime service is not responding"
# Message processing alerts
- alert: CaxtonMessageQueueBacklog
expr: caxton_message_queue_size > 1000
for: 5m
labels:
severity: warning
service: caxton
annotations:
summary: "Large message queue backlog"
description: "Message queue size is {{ $value }} messages"
- alert: CaxtonMessageProcessingErrors
expr: rate(caxton_message_errors_total[5m]) > 10
for: 2m
labels:
severity: critical
service: caxton
annotations:
summary: "High message processing error rate"
description: "Message processing error rate is {{ $value }}/second"
```
### AlertManager Configuration
```yaml
# alertmanager.yml
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alerts@example.com'
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
group_by: ['alertname', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
service: caxton
receiver: 'caxton-team'
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://127.0.0.1:5001/'
- name: 'critical-alerts'
email_configs:
- to: 'oncall@example.com'
subject: '[CRITICAL] Caxton Alert'
body: |
Alert: {{ .GroupLabels.alertname }}
Summary: {{ .CommonAnnotations.summary }}
Description: {{ .CommonAnnotations.description }}
slack_configs:
- channel: '#alerts-critical'
title: 'Critical Alert: {{ .GroupLabels.alertname }}'
text: '{{ .CommonAnnotations.summary }}'
- name: 'caxton-team'
slack_configs:
- channel: '#caxton-alerts'
title: 'Caxton Alert: {{ .GroupLabels.alertname }}'
text: '{{ .CommonAnnotations.summary }}'
```
## Dashboard Setup
### Grafana Dashboards
#### Runtime Overview Dashboard
```json
{
"dashboard": {
"title": "Caxton Runtime Overview",
"panels": [
{
"title": "Active Agents",
"type": "stat",
"targets": [
{
"expr": "caxton_agents_active",
"legendFormat": "Active Agents"
}
]
},
{
"title": "Agent Execution Rate",
"type": "graph",
"targets": [
{
"expr": "rate(caxton_agent_executions_total[5m])",
"legendFormat": "Executions/sec"
}
]
},
{
"title": "Memory Usage by Agent Type",
"type": "graph",
"targets": [
{
"expr": "sum by (agent_type) (caxton_wasm_memory_usage_bytes)",
"legendFormat": "{{ agent_type }}"
}
]
},
{
"title": "Response Time Percentiles",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(caxton_http_request_duration_seconds_bucket[5m]))",
"legendFormat": "95th percentile"
},
{
"expr": "histogram_quantile(0.50, rate(caxton_http_request_duration_seconds_bucket[5m]))",
"legendFormat": "50th percentile"
}
]
}
]
}
}
```
### Performance Monitoring Dashboard
Key performance indicators to monitor:
1. **Throughput Metrics**:
- Requests per second
- Agent executions per second
- Message processing rate
2. **Latency Metrics**:
- Request response time
- Agent execution time
- Message processing delay
3. **Resource Utilization**:
- CPU usage
- Memory consumption
- Disk I/O
- Network I/O
4. **Error Rates**:
- HTTP error responses
- Agent execution failures
- Message processing errors
### Custom Grafana Plugins
Install useful plugins for enhanced monitoring:
```bash
# Install Grafana plugins
grafana-cli plugins install grafana-polystat-panel
grafana-cli plugins install grafana-worldmap-panel
grafana-cli plugins install grafana-piechart-panel
```
## Performance Monitoring
### Benchmarking
Set up automated performance benchmarks:
```yaml
# benchmark-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: caxton-benchmark
spec:
schedule: "0 2 * * *" # Run daily at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: benchmark
image: caxton/benchmark:latest
command:
- /bin/sh
- -c
- |
/benchmark --target http://caxton-service:8080 \
--duration 300s \
--concurrent-users 100 \
--report-to-prometheus http://prometheus:9090
restartPolicy: OnFailure
```
### Load Testing
Example load test configuration:
```javascript
// k6-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 }, // Ramp up to 100 users
{ duration: '5m', target: 100 }, // Stay at 100 users
{ duration: '2m', target: 200 }, // Ramp up to 200 users
{ duration: '5m', target: 200 }, // Stay at 200 users
{ duration: '2m', target: 0 }, // Ramp down
],
thresholds: {
http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
http_req_failed: ['rate<0.1'], // Error rate under 10%
},
};
export default function() {
const response = http.post('http://caxton.local:8080/api/agents', {
agent_type: 'test-agent',
config: { test: true }
});
check(response, {
'status is 200': (r) => r.status === 200,
'response time < 1000ms': (r) => r.timings.duration < 1000,
});
sleep(1);
}
```
### Continuous Profiling
Enable continuous profiling in production:
```toml
[profiling]
enabled = true
endpoint = "http://pyroscope:4040"
application_name = "caxton-runtime"
server_address = "0.0.0.0:6060"
sample_rate = 100 # Hz
profile_types = ["cpu", "alloc_objects", "alloc_space", "inuse_objects", "inuse_space"]
```
## Troubleshooting Monitoring
### Common Issues
1. **Missing Metrics**:
- Check OpenTelemetry collector configuration
- Verify network connectivity
- Review Prometheus scrape configuration
2. **High Cardinality**:
- Limit label values
- Use recording rules for pre-aggregation
- Implement metric sampling
3. **Trace Sampling Issues**:
- Adjust sampling rates
- Check trace context propagation
- Verify Jaeger storage capacity
### Debug Commands
```bash
# Check metrics endpoint
# Validate OpenTelemetry export
curl -X POST http://otel-collector:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{"test": "data"}'
# Query Prometheus
curl 'http://prometheus:9090/api/v1/query?query=up'
# Check Jaeger traces
curl http://jaeger:16686/api/traces?service=caxton-runtime
```