rswarm 0.1.8

A Rust implementation of the Swarm framework
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
# Phase 2: Persistence and Production Hardening — Task Decomposition

> **Source**: PRD Section 11.2 — Persistence and Production Hardening  
> **Related PRD Sections**: 3.2, 3.3, 6.3, 6.4, 7.1-7.3  
> **Reference**: `docs/phase1_task_decomposition.md`  
> **Created**: 2026-03-30  
> **Database**: `docs/agent_context.db`

---

## 1. Overview

### 1.1 Phase 2 Goals (from PRD)

Deliver production-ready hardening on top of the Phase 1 harness:
- **Durable state** with **SQLite-backed structured persistence**
- **Pluggable memory backends** for local vector retrieval and future remote stores
- **Checkpointing and resume** for interrupted sessions
- **Comprehensive guardrails** for prompt injection, PII, policy enforcement, and runaway execution
- **Operational observability** with metrics export, dashboards, alerts, and replayable traces

### 1.2 Success Criteria (PRD 11.2)

- [ ] A completed run can be **persisted, resumed, and audited**
- [ ] The default durable backend works with **SQLite only**, with no external service required
- [ ] All safety limits are **configurable, observable, and enforced before failure cascades**
- [ ] Tool/provider failures degrade through **timeouts, quotas, and circuit breakers**
- [ ] Metrics are exportable to **Prometheus/OpenTelemetry** and have **Grafana-ready dashboards**
- [ ] Incident response has **documented alert rules and runbooks**

---

## 2. Current State Assessment

### 2.1 What Exists

| Component | Location | Status |
|-----------|----------|--------|
| In-memory memory trait | `src/memory.rs` | ✅ Functional |
| SlidingWindowMemory | `src/memory.rs` | ✅ In-memory only |
| Event model and subscribers | `src/event.rs` | ✅ Functional |
| Prompt injection and PII helpers | `src/guardrails.rs` | ✅ Basic pattern matching |
| Max loop iterations | `src/types.rs`, `src/core.rs` | ✅ Basic limit support |
| Typed request/message/tool validation | `src/types.rs`, `src/tool.rs`, `src/validation.rs` | ✅ Functional |
| Local example app | `rswarm_examples/` | ✅ Builds against local crate |

### 2.2 Gaps to Address

| Gap | PRD Reference | Priority |
|-----|---------------|----------|
| Durable structured storage | 3.2.2, 11.2.1 | Critical |
| Checkpointing and session resume | 3.3.1, 3.3.2 | Critical |
| Pluggable vector retrieval backend | 3.2.1, 11.2.1 | High |
| Event-driven memory hooks | 3.2.3 | High |
| Configurable runtime budgets | 7.3.1-7.3.3, 11.2.2 | Critical |
| Tool/provider circuit breakers | 7.3.3, 11.2.2 | High |
| Policy-grade guardrails and audit hooks | 7.1, 7.2, 11.2.2 | High |
| Prometheus/OpenTelemetry export hardening | 6.3, 11.2.3 | High |
| Dashboards, alert rules, and runbooks | 6.4, 11.2.3 | Medium |

---

## 3. Task Breakdown

### 3.1 Durable Storage and Checkpointing

| ID | Title | Priority | Dependencies | Location |
|----|-------|----------|--------------|----------|
| #28 | Define persistence backend traits for sessions, events, checkpoints, and memories | critical | none | `src/persistence.rs` (new) |
| #29 | Create SQLite schema and migration runner | critical | #28 | `src/persistence/sqlite.rs` (new), `migrations/` (new) |
| #30 | Implement SqliteSessionStore for runs, messages, tool calls, and summaries | high | #28, #29 | `src/persistence/sqlite.rs` |
| #31 | Define versioned CheckpointData and serialization envelope | critical | #28 | `src/checkpoint.rs` (new) |
| #32 | Persist checkpoints from loop boundaries | critical | #31 | `src/core.rs`, `src/phase.rs` |
| #33 | Add resume API with checkpoint compatibility validation | critical | #31, #32 | `src/core.rs`, `src/checkpoint.rs` |
| #34 | Add retention, pruning, and archival policy for persisted sessions | medium | #29, #30 | `src/persistence/sqlite.rs`, `src/config.rs` (new or extend) |

#### Task #28: Define persistence backend traits

**Depends on**: none  
**Acceptance criteria**:
- Define traits for session metadata, event append/read, checkpoint save/load, and memory persistence
- Keep the default backend single-process and async-friendly
- Backend interfaces are storage-agnostic so SQLite is an implementation, not a hardcoded dependency

**Estimate**: medium (45 min)

---

#### Task #29: Create SQLite schema and migration runner

**Depends on**: #28  
**Acceptance criteria**:
- Create SQLite DDL for sessions, messages, events, checkpoints, memories, and migrations
- Migration runner applies idempotent versioned migrations
- SQLite is the default embedded persistence option with zero required external services

**Estimate**: medium (60 min)

---

#### Task #30: Implement SqliteSessionStore

**Depends on**: #28, #29  
**Acceptance criteria**:
- Store and query session metadata, message history, tool call history, and loop outcomes
- Support reads by session id, date range, and trace id
- Add indexes for common filters used by replay and debugging workflows

**Estimate**: medium (60 min)

---

#### Task #31: Define versioned CheckpointData

**Depends on**: #28  
**Acceptance criteria**:
- Versioned checkpoint envelope includes messages, context variables, current agent, loop counters, and pending work
- Serialization/deserialization is explicit and migration-ready
- Checkpoint format is suitable for persistence and resume validation

```rust
pub struct CheckpointEnvelope {
    pub version: u32,
    pub session_id: String,
    pub created_at: DateTime<Utc>,
    pub payload: CheckpointData,
}
```

**Estimate**: medium (45 min)

---

#### Task #32: Persist checkpoints from loop boundaries

**Depends on**: #31  
**Acceptance criteria**:
- Checkpoints can be emitted per iteration, per N iterations, or on explicit boundaries
- Checkpointing hooks are tied to phase or loop transitions, not ad hoc call sites
- Failures to checkpoint are observable and classified without corrupting in-memory state

**Estimate**: medium (60 min)

---

#### Task #33: Add resume API with compatibility validation

**Depends on**: #31, #32  
**Acceptance criteria**:
- `Swarm` can resume a session from a persisted checkpoint
- Resume validates checkpoint version, agent availability, and required state
- Incompatible or corrupt checkpoints fail with structured errors and graceful fallback

**Estimate**: medium (60 min)

---

#### Task #34: Add retention, pruning, and archival policy

**Depends on**: #29, #30  
**Acceptance criteria**:
- Configurable retention policy by age, status, or storage budget
- Archive path keeps replay/debugging data while pruning hot tables
- Maintenance operations are documented and safe to run repeatedly

**Estimate**: small (30 min)

---

### 3.2 Vector Memory and Event-Driven Updates

| ID | Title | Priority | Dependencies | Location |
|----|-------|----------|--------------|----------|
| #35 | Define VectorMemory trait and retrieval policy types | high | #28 | `src/memory/vector.rs` (new) |
| #36 | Implement sqlite-vss-backed vector memory adapter | high | #35, #29 | `src/memory/sqlite_vss.rs` (new) |
| #37 | Add optional Qdrant adapter behind feature flag | medium | #35 | `src/memory/qdrant.rs` (new), `Cargo.toml` |
| #38 | Emit event-driven memory hooks for summaries, tool results, and explicit writes | high | #28, #30, #35 | `src/core.rs`, `src/event.rs`, `src/memory/` |

#### Task #35: Define VectorMemory trait

**Depends on**: #28  
**Acceptance criteria**:
- Abstract embedding-backed store/retrieve/search operations
- Retrieval policy captures top-k, score threshold, and recency weighting
- Trait supports embedded and remote implementations without changing caller code

**Estimate**: medium (45 min)

---

#### Task #36: Implement sqlite-vss adapter

**Depends on**: #35, #29  
**Acceptance criteria**:
- Embedded vector retrieval works with SQLite + sqlite-vss or equivalent local extension
- Memory insert and semantic search APIs are tested
- Fallback behavior is clear when vector extension support is unavailable

**Estimate**: medium (75 min)

---

#### Task #37: Add Qdrant adapter behind feature flag

**Depends on**: #35  
**Acceptance criteria**:
- `qdrant` feature flag gates remote vector dependency
- Adapter implements the same retrieval interface as the local backend
- Configuration supports host, collection, auth, and timeout settings

**Estimate**: medium (60 min)

---

#### Task #38: Emit event-driven memory hooks

**Depends on**: #28, #30, #35  
**Acceptance criteria**:
- Memory persistence is triggered from lifecycle events, not polling loops
- Conversation completion, tool results, and explicit memory writes flow through a shared hook surface
- Hook behavior is observable through structured events

**Estimate**: medium (60 min)

---

### 3.3 Guardrails, Limits, and Policy Enforcement

| ID | Title | Priority | Dependencies | Location |
|----|-------|----------|--------------|----------|
| #39 | Extend runtime config with token, time, depth, and quota limits | critical | none | `src/types.rs`, `src/core.rs` |
| #40 | Implement cumulative budget enforcer for iterations, tokens, and wall-clock time | critical | #39 | `src/core.rs`, `src/validation.rs` |
| #41 | Add tool and provider circuit breaker state machines | high | #39 | `src/tool.rs`, `src/provider.rs`, `src/core.rs` |
| #42 | Make prompt injection handling policy-driven and auditable | high | none | `src/guardrails.rs`, `src/event.rs` |
| #43 | Add data classification tags and redaction pipeline for logs/events | high | #42 | `src/guardrails.rs`, `src/event.rs` |
| #44 | Add content policy hook and violation audit path | medium | #42 | `src/guardrails.rs`, `src/core.rs` |
| #45 | Add output verification for tool calls and structured LLM responses | high | #41 | `src/validation.rs`, `src/provider.rs`, `src/tool.rs` |
| #46 | Add heuristic escalation triggers for hallucinations and repeated failures | medium | #40, #45 | `src/core.rs`, `src/guardrails.rs` |

#### Task #39: Extend runtime config with limits

**Depends on**: none  
**Acceptance criteria**:
- `SwarmConfig` supports per-task, per-session, and per-agent limits
- Limits cover iterations, recursion/depth, token budgets, tool call quotas, and wall-clock time
- Defaults are safe and overridable

**Estimate**: medium (45 min)

---

#### Task #40: Implement cumulative budget enforcer

**Depends on**: #39  
**Acceptance criteria**:
- Enforcement occurs before runaway execution, not after the fact
- Exhaustion produces structured errors and partial-response behavior where appropriate
- Token, iteration, and elapsed-time counters are exported as observable metrics

**Estimate**: medium (60 min)

---

#### Task #41: Add tool and provider circuit breakers

**Depends on**: #39  
**Acceptance criteria**:
- Consecutive failure thresholds open a breaker for tools and providers
- Breakers support closed, open, and half-open states with reset policy
- Breaker state changes emit events and are included in debug output

```rust
pub enum CircuitState {
    Closed,
    Open { opened_at: Instant },
    HalfOpen,
}
```

**Estimate**: medium (60 min)

---

#### Task #42: Make prompt injection handling policy-driven

**Depends on**: none  
**Acceptance criteria**:
- Support explicit actions: `warn`, `sanitize`, `reject`
- Policy is configurable at runtime
- All detections are auditable through structured events and persisted traces

**Estimate**: small (30 min)

---

#### Task #43: Add data classification and redaction pipeline

**Depends on**: #42  
**Acceptance criteria**:
- Sensitive fields can be tagged or inferred before writing events/logs
- Redaction applies consistently to logs, persisted events, and replay output
- Policies distinguish redaction, masking, and drop-on-write behaviors

**Estimate**: medium (45 min)

---

#### Task #44: Add content policy hook

**Depends on**: #42  
**Acceptance criteria**:
- Hook surface exists for request/response policy checks
- Violations produce structured audit events
- Default implementation is simple and overridable

**Estimate**: small (30 min)

---

#### Task #45: Add output verification

**Depends on**: #41  
**Acceptance criteria**:
- Structured outputs and tool call parameters are validated before execution
- Validation failures feed back through retry/escalation paths instead of panicking
- Unknown tools, wrong parameter types, and malformed structured responses are classified cleanly

**Estimate**: medium (60 min)

---

#### Task #46: Add heuristic escalation triggers

**Depends on**: #40, #45  
**Acceptance criteria**:
- Detect repeated failures, non-existent tool calls, and circular reasoning patterns
- Escalation triggers are observable and configurable
- Escalation path can stop execution, request human review, or inject a continuation warning

**Estimate**: medium (45 min)

---

### 3.4 Observability, Metrics, and Operations

| ID | Title | Priority | Dependencies | Location |
|----|-------|----------|--------------|----------|
| #47 | Harden OpenTelemetry export and trace propagation | high | #30, #41 | `src/observability.rs` (new), `src/event.rs` |
| #48 | Add Prometheus metrics endpoint and registry wiring | high | #47 | `src/observability.rs`, `src/bin/metrics.rs` (new or feature-gated) |
| #49 | Create Grafana dashboards, alert rules, and runbooks | medium | #47, #48 | `deploy/grafana/` (new), `deploy/prometheus/` (new), `docs/runbooks/` (new) |

#### Task #47: Harden OpenTelemetry export

**Depends on**: #30, #41  
**Acceptance criteria**:
- Traces, logs, and metrics can be exported with environment-driven configuration
- Span context propagates through tool calls and provider requests
- Correlation IDs remain available for future multi-agent scenarios

**Estimate**: medium (60 min)

---

#### Task #48: Add Prometheus metrics endpoint

**Depends on**: #47  
**Acceptance criteria**:
- Counters/histograms/gauges cover latency, token usage, tool outcomes, breaker states, and guardrail events
- Metrics endpoint or exporter can be enabled without changing core business logic
- Naming and labels are stable enough for dashboards and alert rules

**Estimate**: medium (60 min)

---

#### Task #49: Create dashboards, alerts, and runbooks

**Depends on**: #47, #48  
**Acceptance criteria**:
- Grafana dashboard JSON or provisioning files exist for core agent metrics
- Alerting rules cover latency spikes, error-rate jumps, breaker openings, and budget exhaustion
- Runbooks document common failure modes and first-response actions

**Estimate**: medium (75 min)

---

## 4. Execution Order

### Phase 2A — Persistence Foundations

```
┌─────────────────────────────────────────────────────────────┐
│  MAX PARALLELISM: 4 tasks                                   │
│  Establish durable interfaces before concrete backends      │
└─────────────────────────────────────────────────────────────┘

  #28 Define persistence backend traits       [critical]  45m
  #31 Define versioned CheckpointData         [critical]  45m
  #35 Define VectorMemory trait               [high]      45m
  #39 Extend runtime config with limits       [critical]  45m
```

### Phase 2B — Embedded Default Backend

```
┌─────────────────────────────────────────────────────────────┐
│  MAX PARALLELISM: 5 tasks                                   │
│  Build the SQLite-first default production path             │
└─────────────────────────────────────────────────────────────┘

  #29 Create SQLite schema and migrations     [critical]  60m  → needs #28
  #30 Implement SqliteSessionStore            [high]      60m  → needs #28,#29
  #32 Persist checkpoints from loop edges     [critical]  60m  → needs #31
  #33 Add resume API and validation           [critical]  60m  → needs #31,#32
  #38 Emit event-driven memory hooks          [high]      60m  → needs #28,#30,#35
```

### Phase 2C — Guardrails and Enforcement

```
┌─────────────────────────────────────────────────────────────┐
│  MAX PARALLELISM: 6 tasks                                   │
│  Turn the harness into a bounded, auditable runtime         │
└─────────────────────────────────────────────────────────────┘

  #40 Implement cumulative budget enforcer    [critical]  60m  → needs #39
  #41 Add tool/provider circuit breakers      [high]      60m  → needs #39
  #42 Prompt injection policy actions         [high]      30m  → independent
  #43 Data classification and redaction       [high]      45m  → needs #42
  #44 Add content policy hook                 [medium]    30m  → needs #42
  #45 Add output verification                 [high]      60m  → needs #41
```

### Phase 2D — Retrieval and Telemetry Expansion

```
┌─────────────────────────────────────────────────────────────┐
│  MAX PARALLELISM: 5 tasks                                   │
│  Add retrieval backends and production observability        │
└─────────────────────────────────────────────────────────────┘

  #36 Implement sqlite-vss adapter            [high]      75m  → needs #35,#29
  #37 Add Qdrant adapter                      [medium]    60m  → needs #35
  #47 Harden OpenTelemetry export             [high]      60m  → needs #30,#41
  #48 Add Prometheus metrics endpoint         [high]      60m  → needs #47
  #34 Add retention and archival policy       [medium]    30m  → needs #29,#30
```

### Phase 2E — Operations and Escalation

```
┌─────────────────────────────────────────────────────────────┐
│  MAX PARALLELISM: 3 tasks                                   │
│  Finish operational workflows and incident readiness        │
└─────────────────────────────────────────────────────────────┘

  #46 Add heuristic escalation triggers       [medium]    45m  → needs #40,#45
  #49 Create dashboards, alerts, runbooks     [medium]    75m  → needs #47,#48
```

---

## 5. Tracking Progress

### 5.1 Database Commands

```bash
# View Phase 2 tasks (assuming numbering continues from Phase 1)
sqlite3 docs/agent_context.db "SELECT id, title, priority, status FROM items WHERE id BETWEEN 28 AND 49 ORDER BY id"

# Mark a task in progress
sqlite3 docs/agent_context.db "UPDATE items SET status = 'in_progress', updated_at = datetime('now') WHERE id = 28"

# Mark complete
sqlite3 docs/agent_context.db "UPDATE items SET status = 'complete', updated_at = datetime('now') WHERE id = 28"

# Log progress
sqlite3 docs/agent_context.db "INSERT INTO entries (session_id, entry_type, content) VALUES (2, 'progress', 'Completed persistence backend trait definitions')"
```

### 5.2 Status Values

- `pending` — Not started
- `in_progress` — Currently being worked on
- `complete` — Finished and verified
- `blocked` — Waiting on dependency
- `deferred` — Postponed to later phase

---

## 6. File Structure (Target)

```text
src/
├── core.rs
├── event.rs
├── guardrails.rs
├── lib.rs
├── memory.rs
├── phase.rs
├── tool.rs
├── validation.rs
│
├── checkpoint.rs            # NEW: checkpoint envelope + resume validation
├── persistence.rs           # NEW: persistence traits
├── persistence/
│   └── sqlite.rs            # NEW: SQLite-backed session/event/checkpoint store
├── memory/
│   ├── vector.rs            # NEW: VectorMemory trait + retrieval policy
│   ├── sqlite_vss.rs        # NEW: embedded vector backend
│   └── qdrant.rs            # NEW: optional remote vector backend
├── observability.rs         # NEW: OTEL + Prometheus wiring
│
└── tests/                   # extend with persistence, resume, guardrail, and ops coverage

migrations/                  # NEW: SQLite DDL and migration metadata
deploy/
├── grafana/                 # NEW: dashboards/provisioning
└── prometheus/              # NEW: alert rules / scrape config examples
docs/
└── runbooks/                # NEW: incident response docs
```

---

## 7. Notes

### 7.1 Embedded-First Discipline

Phase 2 should keep the **embedded SQLite path as the default production story**:
- No mandatory external database
- No mandatory vector service
- Remote stores are optional adapters behind feature flags

### 7.2 Backward Compatibility

Phase 2 should preserve the simple Phase 1 developer experience:
- Existing in-memory workflows remain valid
- Durable persistence is opt-in through config
- Guardrails default to safe behavior but remain configurable
- Observability exporters are feature/config driven rather than hard-wired

### 7.3 Testing Expectations

Phase 2 needs broader verification than Phase 1:
- Migration tests for schema upgrades and rollback safety
- Resume tests for valid and incompatible checkpoints
- Guardrail tests for reject/sanitize/audit paths
- Circuit breaker tests for closed/open/half-open transitions
- Metrics export tests for stable names and labels

---

## 8. Changelog

| Date | Change |
|------|--------|
| 2026-03-30 | Initial decomposition for Phase 2 created with tasks #28-#49 |