# Clustering and Distributed Operations

This guide covers running Caxton as a multi-node cluster for high availability and scalability.

## Overview

Caxton uses a **coordination-first architecture** that requires no external dependencies like databases or message queues. Each Caxton instance:

- Maintains its own local state using embedded SQLite
- Coordinates with other instances via the SWIM gossip protocol
- Automatically discovers and routes messages to agents across the cluster
- Handles network partitions gracefully with degraded mode operation
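
Because all state is node-local, a single binary is a complete deployment. A quick sanity check (using only flags shown later in this guide) is to bootstrap a one-node "cluster":

```bash
# One node needs no external services: state lives in embedded SQLite,
# and the SWIM layer simply has no peers to gossip with yet.
caxton server start --node-id solo --bootstrap &
caxton cluster members   # should list exactly one alive node
```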

For architectural details, see:
- [ADR-0014: Coordination-First Architecture](../adr/0014-coordination-first-architecture.md)
- [ADR-0015: Distributed Protocol Architecture](../adr/0015-distributed-protocol-architecture.md)

## Starting a Cluster

### Bootstrap First Node

The first node acts as the seed for cluster formation:

```bash
# Start the seed node
caxton server start \
  --node-id node-1 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --bootstrap
```
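
Before joining more nodes, it is worth confirming the seed is actually listening on the gossip port. This uses generic tooling rather than a Caxton command; SWIM implementations typically use both TCP and UDP on the same port:

```bash
# Expect both TCP and UDP listeners on 7946
ss -tulpn | grep 7946
```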

### Join Additional Nodes

Other nodes join by connecting to the seed:

```bash
# On node 2
caxton server start \
  --node-id node-2 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --join node-1.example.com:7946

# On node 3
caxton server start \
  --node-id node-3 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --join node-1.example.com:7946,node-2.example.com:7946
```

### Verify Cluster Status

```bash
# Check cluster membership
caxton cluster members

# Example output:
NODE-ID    STATUS    ADDRESS           AGENTS    CPU    MEMORY
node-1     alive     10.0.1.10:7946    42        15%    2.1GB
node-2     alive     10.0.1.11:7946    38        12%    1.8GB
node-3     alive     10.0.1.12:7946    40        18%    2.3GB
```

## Configuration

### Cluster Configuration File

Create `/etc/caxton/cluster.yaml`:

```yaml
coordination:
  cluster:
    # SWIM protocol settings
    bind_addr: 0.0.0.0:7946
    advertise_addr: ${HOSTNAME}:7946

    # Seed nodes for joining
    seeds:
      - caxton-1.example.com:7946
      - caxton-2.example.com:7946
      - caxton-3.example.com:7946

    # Gossip parameters
    gossip_interval: 200ms
    gossip_fanout: 3
    probe_interval: 1s
    probe_timeout: 500ms

  # Partition handling
  partition:
    detection_timeout: 5s
    quorum_size: 2
    degraded_mode: true
    queue_writes: true
```
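
Note that `advertise_addr` is the address peers will dial, so `${HOSTNAME}` must resolve to something reachable from every other node. A quick check from a peer (again generic tooling, not a Caxton command):

```bash
# Verify the advertised name resolves and the gossip port is reachable
getent hosts caxton-1.example.com
nc -vz caxton-1.example.com 7946
```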

### Security Configuration

Enable mTLS for secure inter-node communication:

```yaml
security:
  cluster:
    mtls:
      enabled: true
      ca_cert: /etc/caxton/ca.crt
      node_cert: /etc/caxton/certs/node.crt
      node_key: /etc/caxton/certs/node.key
      verify_peer: true
```

See [ADR-0016: Security Architecture](../adr/0016-security-architecture.md) for details.
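
The certificates referenced above can come from any PKI. As one illustrative path, a self-managed CA with `openssl` (file paths match the config above; key sizes and validity periods are arbitrary choices):

```bash
# Create a cluster CA (keep ca.key offline)
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -subj "/CN=caxton-cluster-ca" \
  -keyout ca.key -out /etc/caxton/ca.crt

# Issue a certificate for one node
openssl req -newkey rsa:4096 -nodes \
  -subj "/CN=node-1.example.com" \
  -keyout /etc/caxton/certs/node.key -out node-1.csr
openssl x509 -req -in node-1.csr -days 825 \
  -CA /etc/caxton/ca.crt -CAkey ca.key -CAcreateserial \
  -out /etc/caxton/certs/node.crt
```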

## Agent Distribution

Agents are automatically distributed across the cluster:

```bash
# Deploy an agent (automatically placed on optimal node)
caxton deploy agent.wasm --name my-agent

# Deploy with placement preferences
caxton deploy agent.wasm \
  --name my-agent \
  --placement-strategy least-loaded \
  --prefer-nodes node-1,node-2

# Force deployment to specific node
caxton deploy agent.wasm \
  --name my-agent \
  --target-node node-3
```

### Agent Discovery

Agents can communicate regardless of which node they're on:

```bash
# Send message to agent (routing handled automatically)
caxton message send \
  --to remote-agent \
  --content "Hello from anywhere in the cluster!"

# The cluster automatically:
# 1. Discovers which node hosts 'remote-agent'
# 2. Routes the message through the cluster
# 3. Delivers to the target agent
```

## High Availability

### Automatic Failover

When a node fails, its agents are automatically redistributed:

```bash
# Monitor failover behavior
caxton cluster watch

# Example during node failure:
[INFO] Node node-2 detected as failed
[INFO] Redistributing 38 agents from node-2
[INFO] Agent 'processor-1' migrated to node-1
[INFO] Agent 'worker-5' migrated to node-3
[INFO] All agents successfully redistributed (2.3s)
```

### Network Partition Handling

Caxton handles network partitions gracefully:

#### Majority Partition
Nodes in the majority partition continue normal operations:

```bash
# On majority side (2 of 3 nodes)
caxton cluster status
# Status: HEALTHY (majority partition)
# Operations: READ-WRITE
# Nodes: 2/3 active
```

#### Minority Partition
Nodes in the minority enter degraded mode:

```bash
# On minority side (1 of 3 nodes)
caxton cluster status
# Status: DEGRADED (minority partition)
# Operations: READ-ONLY
# Nodes: 1/3 active
# Queued writes: 42
```

When the partition heals, queued operations are replayed automatically.
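
You can watch the replay from the formerly minority node: the queued-write counter reported by `caxton cluster status` should drain back to zero. A simple poll with generic shell tooling:

```bash
# Poll the queued-write counter once per second while the backlog drains
watch -n1 'caxton cluster status | grep -i "queued writes"'
```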

## Monitoring

### Cluster Metrics

Key metrics to monitor:

```bash
# Cluster health metrics
curl http://localhost:9090/metrics | grep caxton_cluster

# Key metrics:
caxton_cluster_nodes_total          3
caxton_cluster_nodes_alive          3
caxton_cluster_agents_total         120
caxton_cluster_gossip_latency_ms    0.8
caxton_cluster_convergence_time_ms  423
```
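
These metrics plug into standard Prometheus tooling. For example, the following query returns no series while every node is alive, making it a natural alert condition (the Prometheus address is an assumption about your monitoring setup):

```bash
# Returns a series (i.e. fires) only when at least one node is down
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=caxton_cluster_nodes_total - caxton_cluster_nodes_alive > 0'
```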

### Performance Monitoring

Monitor cluster performance against targets:

```bash
# Check performance against requirements
caxton cluster performance

# Output:
METRIC                    TARGET      ACTUAL    STATUS
Message routing P50       100μs       87μs      ✓
Message routing P99       1ms         0.9ms     ✓
Agent startup P50         10ms        8.2ms     ✓
Gossip convergence        <5s         2.1s      ✓
```

See [ADR-0017: Performance Requirements](../adr/0017-performance-requirements.md) for targets.

## Operations

### Rolling Upgrades

Perform zero-downtime upgrades:

```bash
# Start upgrade process
caxton cluster upgrade --version v1.2.0

# The cluster will:
# 1. Select a canary node
# 2. Drain traffic from canary
# 3. Upgrade canary node
# 4. Monitor for 24 hours
# 5. Roll out to remaining nodes
```

See [ADR-0018: Operational Procedures](../adr/0018-operational-procedures.md) for details.

### Backup and Recovery

Each node maintains its own state, but cluster-wide backups are coordinated:

```bash
# Create cluster-wide backup
caxton cluster backup --dest s3://backups/caxton/

# Restore from backup
caxton cluster restore --from s3://backups/caxton/2024-01-15/
```
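
To follow the "Regular Backups" practice listed under Best Practices, schedule the backup command, for example via cron (schedule and destination are illustrative):

```bash
# Install a nightly backup job at 03:00
echo '0 3 * * * root caxton cluster backup --dest s3://backups/caxton/' \
  | sudo tee /etc/cron.d/caxton-backup
```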

### Scaling

#### Adding Nodes

```bash
# Add new node to running cluster
caxton server start \
  --node-id node-4 \
  --join <any-existing-node>:7946

# Agents automatically rebalance
caxton cluster rebalance --strategy even-distribution
```

#### Removing Nodes

```bash
# Gracefully remove a node
caxton cluster leave --node node-2 --drain-timeout 60s

# Force remove failed node
caxton cluster remove --node node-2 --force
```

## Troubleshooting

### Common Issues

#### Nodes Not Joining

```bash
# Check network connectivity
caxton cluster ping node-2

# Verify gossip encryption keys match
caxton cluster verify-auth

# Check firewall rules (port 7946 must be open)
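# e.g. probe the gossip port from the joining node (SWIM commonly uses
# both TCP and UDP on 7946)
nc -vz node-1.example.com 7946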
```

#### Split Brain Detection

```bash
# Check for split brain
caxton cluster detect-partition

# If split brain detected:
WARNING: Potential split brain detected
Partition 1: [node-1, node-2] (majority)
Partition 2: [node-3] (minority)
Action: Node-3 entering degraded mode
```

#### Performance Issues

```bash
# Analyze cluster performance
caxton cluster analyze

# Suggestions:
- High gossip latency: Reduce gossip_fanout
- Slow convergence: Decrease gossip_interval
- Message delays: Check network latency between nodes
```

## Best Practices

1. **Odd Number of Nodes**: Deploy 3, 5, or 7 nodes so any partition has a clear majority, avoiding split-brain ties
2. **Geographic Distribution**: Spread nodes across availability zones
3. **Resource Monitoring**: Monitor CPU, memory, and network usage
4. **Regular Backups**: Schedule automated backups
5. **Security**: Always enable mTLS in production
6. **Capacity Planning**: Provision for roughly 2x expected peak load to leave headroom

## Advanced Topics

### Multi-Region Deployment

For global deployments:

```yaml
coordination:
  cluster:
    regions:
      - name: us-east
        nodes: [node-1, node-2, node-3]
      - name: eu-west
        nodes: [node-4, node-5, node-6]

    # Cross-region settings
    cross_region:
      latency_aware_routing: true
      prefer_local_region: true
      max_cross_region_latency: 100ms
```

### Custom Partition Strategies

Implement custom partition handling:

```yaml
partition:
  strategy: custom
  custom_handler: /usr/local/bin/partition-handler
  decisions:
    - condition: "nodes < quorum"
      action: "read-only"
    - condition: "nodes == 1"
      action: "local-only"
    - condition: "critical_agents_present"
      action: "continue-critical"
```
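
The handler's invocation contract is not specified here. As a minimal sketch, assuming (purely for illustration) that Caxton passes the live node count and quorum size via environment variables and reads the decision from stdout:

```bash
#!/usr/bin/env bash
# /usr/local/bin/partition-handler -- illustrative sketch only.
# CAXTON_ALIVE_NODES and CAXTON_QUORUM_SIZE are assumed variable names,
# as is the stdout-based decision protocol.
set -euo pipefail

if [ "${CAXTON_ALIVE_NODES:?}" -eq 1 ]; then
  echo "local-only"
elif [ "${CAXTON_ALIVE_NODES}" -lt "${CAXTON_QUORUM_SIZE:?}" ]; then
  echo "read-only"
else
  echo "continue"
fi
```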

## Performance Tuning

### SWIM Protocol Tuning

```yaml
# For small clusters (< 10 nodes)
gossip_interval: 100ms
gossip_fanout: 3

# For medium clusters (10-50 nodes)
gossip_interval: 200ms
gossip_fanout: 4

# For large clusters (> 50 nodes)
gossip_interval: 500ms
gossip_fanout: 5
```
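
As a rough rule of thumb (an approximation, not a documented formula), epidemic dissemination reaches all N nodes in about log(N)/log(fanout) rounds, so expected convergence is on the order of gossip_interval × log(N)/log(fanout):

```bash
# Back-of-envelope estimate for 50 nodes, fanout 4, 200ms interval
awk 'BEGIN { n = 50; fanout = 4; interval_ms = 200;
  rounds = log(n) / log(fanout);
  printf "~%.1f rounds, ~%.0f ms to converge\n", rounds, rounds * interval_ms }'
# -> ~2.8 rounds, ~564 ms to converge
```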

### Network Optimization

```yaml
# Use QUIC for better performance
transport:
  type: quic
  congestion_control: bbr
  max_streams: 100
```
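
QUIC runs over UDP, and many QUIC stacks recommend raising the kernel's default UDP buffer ceilings; this is general QUIC tuning, not a Caxton-specific requirement:

```bash
# Raise UDP buffer limits (values illustrative; persist them in sysctl.conf)
sudo sysctl -w net.core.rmem_max=7500000
sudo sysctl -w net.core.wmem_max=7500000
```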

## Next Steps

- [Production Deployment Guide](../operations/production-deployment.md)
- [Security Best Practices](../operations/devops-security-guide.md)
- [Performance Benchmarking](../benchmarks/performance-benchmarking-guide.md)
- [Monitoring Integration](../monitoring/metrics-integration-guide.md)