# Metrics Integration Guide

## Overview
This guide documents Caxton's metrics aggregation and monitoring strategy, built on Prometheus and OpenTelemetry, covering instrumentation, collection, dashboards, and alerting across all components.

## Architecture

### Metrics Pipeline
```
Agents → OpenTelemetry Collector → Prometheus → Grafana
                    │
                    └─→ Alternative Backends
                        (Datadog, New Relic, etc.)
```

## Prometheus Integration

### Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'caxton-orchestrator'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'

  - job_name: 'caxton-agents'
    static_configs:
      # Prometheus does not support port ranges in static_configs;
      # list one target per agent port (9091 through 9099) or use service discovery
      - targets: ['localhost:9091', 'localhost:9092', 'localhost:9093']
    metrics_path: '/metrics'

  - job_name: 'opentelemetry-collector'
    static_configs:
      - targets: ['localhost:8888']
```
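
Each scrape target above must expose the Prometheus text format at `/metrics`. A minimal sketch using the `prometheus` crate's text encoder (`render_metrics` is an illustrative name; the HTTP wiring is left to whatever server the process already runs):

```rust
use prometheus::{Encoder, TextEncoder};

// Render every metric in the default registry as Prometheus text format;
// serve the returned string in response to GET /metrics.
fn render_metrics() -> String {
    let metric_families = prometheus::gather();
    let mut buffer = Vec::new();
    TextEncoder::new()
        .encode(&metric_families, &mut buffer)
        .expect("encoding the default registry succeeds");
    String::from_utf8(buffer).expect("text format is valid UTF-8")
}
```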

### Key Metrics

#### Orchestrator Metrics
```rust
use lazy_static::lazy_static;
use prometheus::{
    register_counter, register_gauge, register_histogram, Counter, Gauge, Histogram,
};

// Core orchestrator metrics, registered once on first access
lazy_static! {
    pub static ref MESSAGES_PROCESSED: Counter = register_counter!(
        "caxton_messages_processed_total",
        "Total number of messages processed"
    ).expect("metric can be registered");

    pub static ref MESSAGE_LATENCY: Histogram = register_histogram!(
        "caxton_message_latency_seconds",
        "Message processing latency in seconds"
    ).expect("metric can be registered");

    pub static ref ACTIVE_AGENTS: Gauge = register_gauge!(
        "caxton_active_agents",
        "Number of currently active agents"
    ).expect("metric can be registered");

    pub static ref AGENT_MEMORY_USAGE: Gauge = register_gauge!(
        "caxton_agent_memory_bytes",
        "Memory usage per agent in bytes"
    ).expect("metric can be registered");
}
```
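
A short usage sketch for the statics above. The handler and dispatch function here are hypothetical stand-ins, not Caxton API:

```rust
// Hypothetical message handler showing how the statics above are updated.
fn handle_message(payload: &[u8]) {
    // Time the full processing path; observe_duration() records elapsed seconds
    let timer = MESSAGE_LATENCY.start_timer();
    route_message(payload); // stand-in for Caxton's real dispatch logic
    timer.observe_duration();
    MESSAGES_PROCESSED.inc();
}

fn route_message(_payload: &[u8]) {}

// Gauges move in both directions as agents start and stop.
fn on_agent_started() {
    ACTIVE_AGENTS.inc();
}

fn on_agent_stopped() {
    ACTIVE_AGENTS.dec();
}
```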

#### Agent Metrics
```rust
// Per-agent metrics (same imports and lazy_static pattern as above)
lazy_static! {
    pub static ref TASK_DURATION: Histogram = register_histogram!(
        "caxton_task_duration_seconds",
        "Task execution duration in seconds"
    ).expect("metric can be registered");

    pub static ref TASK_SUCCESS_RATE: Gauge = register_gauge!(
        "caxton_task_success_rate",
        "Task success rate (0-1)"
    ).expect("metric can be registered");

    pub static ref AGENT_CPU_USAGE: Gauge = register_gauge!(
        "caxton_agent_cpu_usage_percent",
        "CPU usage percentage per agent"
    ).expect("metric can be registered");
}
```
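
Because `caxton_task_success_rate` is a 0-1 gauge, something has to maintain it from running counts. A sketch of one way to do that (the atomic counters are illustrative local state, not Caxton API; the gauge is the one registered above):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static TASKS_OK: AtomicU64 = AtomicU64::new(0);
static TASKS_TOTAL: AtomicU64 = AtomicU64::new(0);

// Called after each task completes; recomputes the success-rate gauge.
fn record_task_outcome(succeeded: bool) {
    if succeeded {
        TASKS_OK.fetch_add(1, Ordering::Relaxed);
    }
    let total = TASKS_TOTAL.fetch_add(1, Ordering::Relaxed) + 1;
    let ok = TASKS_OK.load(Ordering::Relaxed);
    TASK_SUCCESS_RATE.set(ok as f64 / total as f64);
}
```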

### Metric Labels and Cardinality

#### Best Practices
- Keep cardinality under control: each label should take a small, bounded set of values (roughly fewer than 10 distinct values per label)
- Use consistent label names across metrics
- Avoid high-cardinality labels (user IDs, request IDs, conversation IDs)

#### Standard Labels
```rust
pub struct StandardLabels {
    pub agent_id: String,        // Agent identifier
    pub agent_type: String,      // Agent type/capability
    pub conversation_id: String, // Conversation correlation (high cardinality: attach to traces, not metric labels)
    pub environment: String,     // dev/staging/prod
    pub version: String,         // Software version
}
```
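
A sketch of attaching a bounded subset of these labels using the `prometheus` crate's vector types (the metric name and helper below are illustrative, not Caxton API). Label names are fixed at registration time; only their values vary per observation:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_counter_vec, IntCounterVec};

lazy_static! {
    static ref TASKS_COMPLETED: IntCounterVec = register_int_counter_vec!(
        "caxton_tasks_completed_total",
        "Tasks completed, partitioned by agent and environment",
        &["agent_id", "agent_type", "environment", "version"]
    ).expect("metric can be registered");
}

// Illustrative helper: keeping each label to a bounded value set controls cardinality.
fn record_task_completed(agent_id: &str, agent_type: &str, environment: &str, version: &str) {
    TASKS_COMPLETED
        .with_label_values(&[agent_id, agent_type, environment, version])
        .inc();
}
```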

## OpenTelemetry Collector Configuration

### Collector Setup
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'caxton-metrics'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:9090']

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  resource:
    attributes:
      - key: service.name
        value: "caxton"
        action: upsert
      - key: service.version
        from_attribute: version
        action: upsert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  logging:
    loglevel: debug

  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, logging]

    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [jaeger, logging]
```

## Grafana Dashboard Configuration

### Core Dashboards

#### System Overview Dashboard
```json
{
  "dashboard": {
    "title": "Caxton System Overview",
    "panels": [
      {
        "title": "Message Throughput",
        "targets": [
          {
            "expr": "rate(caxton_messages_processed_total[5m])"
          }
        ]
      },
      {
        "title": "Active Agents",
        "targets": [
          {
            "expr": "caxton_active_agents"
          }
        ]
      },
      {
        "title": "Message Latency (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(caxton_errors_total[5m])"
          }
        ]
      }
    ]
  }
}
```

#### Agent Performance Dashboard
```json
{
  "dashboard": {
    "title": "Agent Performance",
    "panels": [
      {
        "title": "Task Success Rate by Agent",
        "targets": [
          {
            "expr": "caxton_task_success_rate{}"
          }
        ]
      },
      {
        "title": "Agent Memory Usage",
        "targets": [
          {
            "expr": "caxton_agent_memory_bytes{}"
          }
        ]
      },
      {
        "title": "Task Duration Distribution",
        "targets": [
          {
            "expr": "histogram_quantile(0.5, rate(caxton_task_duration_seconds_bucket[5m]))"
          }
        ]
      }
    ]
  }
}
```

## Alert Rules

### Critical Alerts
```yaml
groups:
  - name: caxton_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(caxton_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: OrchestratorDown
        expr: up{job="caxton-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator is down"

      - alert: HighMemoryUsage
        expr: caxton_agent_memory_bytes > 1073741824  # 1 GiB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_id }} high memory usage"
```

### Performance Alerts
```yaml
groups:
  - name: caxton_performance
    interval: 1m
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High message processing latency"
          description: "95th percentile latency is {{ $value }}s"

      - alert: LowThroughput
        expr: rate(caxton_messages_processed_total[5m]) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low message throughput"
          description: "Processing only {{ $value }} messages/sec"
```

## Custom Metrics Implementation

### Adding New Metrics
```rust
use lazy_static::lazy_static;
use prometheus::{register_counter, Counter};

// Register custom metrics once, on first access
lazy_static! {
    static ref CUSTOM_METRIC: Counter = register_counter!(
        "caxton_custom_metric_total",
        "Description of custom metric"
    ).unwrap();
}

// Use in code, e.g. inside the relevant handler
fn on_custom_event() {
    CUSTOM_METRIC.inc();
}
```

### Metric Types Guide
- **Counter**: For monotonically increasing values (requests, errors)
- **Gauge**: For values that go up and down (memory, connections)
- **Histogram**: For distributions (latency, sizes); see the bucket sketch below
- **Summary**: For pre-calculated quantiles (generally not recommended, since client-side quantiles cannot be aggregated across instances)
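
For histograms in particular, the default buckets rarely match a real latency profile. A sketch of registering explicit buckets with the `prometheus` crate (the metric name and boundary values are illustrative and should be tuned to observed latencies):

```rust
use lazy_static::lazy_static;
use prometheus::{register_histogram, Histogram};

lazy_static! {
    // Buckets chosen to bracket the expected latency range (values illustrative)
    static ref REQUEST_LATENCY: Histogram = register_histogram!(
        "caxton_request_latency_seconds",
        "Request latency in seconds",
        vec![0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
    ).expect("metric can be registered");
}

fn record_request(seconds: f64) {
    REQUEST_LATENCY.observe(seconds);
}
```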

## Backend Alternatives

### Datadog Integration
```yaml
# For Datadog backend
exporters:
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
    hostname: caxton-orchestrator
```

### New Relic Integration
```yaml
# For New Relic backend
exporters:
  newrelic:
    apikey: ${NEW_RELIC_API_KEY}
    timeout: 30s
```

### CloudWatch Integration
```yaml
# For AWS CloudWatch
exporters:
  awscloudwatchmetrics:
    namespace: Caxton
    region: us-west-2
```

## Performance Considerations

### Metric Collection Overhead
- Keep scrape intervals reasonable (15-30s for most metrics)
- Use histograms sparingly (higher storage cost)
- Batch metric updates where possible (see the sketch after this list)
- Consider sampling for high-volume metrics
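
To illustrate the batching point above, a sketch of buffering hot-path increments in a process-local atomic and flushing them periodically (all names are illustrative; the flush cadence would come from whatever scheduler the process already has):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use lazy_static::lazy_static;
use prometheus::{register_counter, Counter};

lazy_static! {
    static ref EVENTS_TOTAL: Counter = register_counter!(
        "caxton_events_total",
        "Total events observed"
    ).expect("metric can be registered");
}

// The hot path touches only a plain atomic...
static PENDING_EVENTS: AtomicU64 = AtomicU64::new(0);

fn on_event() {
    PENDING_EVENTS.fetch_add(1, Ordering::Relaxed);
}

// ...and a periodic task moves the accumulated batch into the Prometheus counter.
fn flush_pending_events() {
    let batch = PENDING_EVENTS.swap(0, Ordering::Relaxed);
    EVENTS_TOTAL.inc_by(batch as f64);
}
```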

### Storage and Retention
```bash
# Prometheus storage and retention are set via command-line flags, not prometheus.yml
prometheus \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --storage.tsdb.wal-compression
```

### Query Optimization
- Use recording rules for expensive queries
- Implement query result caching
- Optimize label cardinality
- Use downsampling for long-term storage

## Debugging Metrics Issues

### Common Problems and Solutions

#### Missing Metrics
```bash
# Check that the target's metrics endpoint is reachable (adjust host:port per target)
curl http://localhost:9090/metrics

# Inspect scrape target health via the Prometheus API
curl http://localhost:9090/api/v1/targets

# Check collector logs
docker logs otel-collector
```

#### High Cardinality
```promql
# Count time series per metric name to find high-cardinality metrics
count by (__name__)({__name__=~".+"})

# Identify problematic labels (substitute a real metric and label name)
count by (label_name) (metric_name)
```

#### Performance Issues
```bash
# Profile Prometheus
curl http://localhost:9090/debug/pprof/profile?seconds=30 > profile.pb.gz

# Check TSDB stats
curl http://localhost:9090/api/v1/status/tsdb
```

## Best Practices Summary

1. **Use standard metrics libraries** - OpenTelemetry SDK preferred
2. **Keep cardinality low** - < 100k unique series
3. **Document all metrics** - Include unit and meaning
4. **Version metric names** - Suffix v1, v2, etc. when making breaking changes
5. **Test alerts locally** - Use Prometheus unit tests
6. **Monitor the monitoring** - Meta-metrics for observability stack
7. **Regular cleanup** - Remove unused metrics and dashboards

## References
- [Prometheus Best Practices](https://prometheus.io/docs/practices/)
- [OpenTelemetry Collector Docs](https://opentelemetry.io/docs/collector/)
- [Grafana Dashboard Best Practices](https://grafana.com/docs/grafana/latest/best-practices/)
- [ADR-0001: Observability-First Architecture](../adr/0001-observability-first-architecture.md)