# Phase 2: Persistence and Production Hardening — Task Decomposition
> **Source**: PRD Section 11.2 — Persistence and Production Hardening
> **Related PRD Sections**: 3.2, 3.3, 6.3, 6.4, 7.1-7.3
> **Reference**: `docs/phase1_task_decomposition.md`
> **Created**: 2026-03-30
> **Database**: `docs/agent_context.db`
---
## 1. Overview
### 1.1 Phase 2 Goals (from PRD)
Deliver production-ready hardening on top of the Phase 1 harness:
- **Durable state** with **SQLite-backed structured persistence**
- **Pluggable memory backends** for local vector retrieval and future remote stores
- **Checkpointing and resume** for interrupted sessions
- **Comprehensive guardrails** for prompt injection, PII, policy enforcement, and runaway execution
- **Operational observability** with metrics export, dashboards, alerts, and replayable traces
### 1.2 Success Criteria (PRD 11.2)
- [ ] A completed run can be **persisted, resumed, and audited**
- [ ] The default durable backend works with **SQLite only**, with no external service required
- [ ] All safety limits are **configurable, observable, and enforced before failure cascades**
- [ ] Tool/provider failures degrade through **timeouts, quotas, and circuit breakers**
- [ ] Metrics are exportable to **Prometheus/OpenTelemetry** and have **Grafana-ready dashboards**
- [ ] Incident response has **documented alert rules and runbooks**
---
## 2. Current State Assessment
### 2.1 What Exists
| In-memory memory trait | `src/memory.rs` | ✅ Functional |
| SlidingWindowMemory | `src/memory.rs` | ✅ In-memory only |
| Event model and subscribers | `src/event.rs` | ✅ Functional |
| Prompt injection and PII helpers | `src/guardrails.rs` | ✅ Basic pattern matching |
| Max loop iterations | `src/types.rs`, `src/core.rs` | ✅ Basic limit support |
| Typed request/message/tool validation | `src/types.rs`, `src/tool.rs`, `src/validation.rs` | ✅ Functional |
| Local example app | `rswarm_examples/` | ✅ Builds against local crate |
### 2.2 Gaps to Address
| Durable structured storage | 3.2.2, 11.2.1 | Critical |
| Checkpointing and session resume | 3.3.1, 3.3.2 | Critical |
| Pluggable vector retrieval backend | 3.2.1, 11.2.1 | High |
| Event-driven memory hooks | 3.2.3 | High |
| Configurable runtime budgets | 7.3.1-7.3.3, 11.2.2 | Critical |
| Tool/provider circuit breakers | 7.3.3, 11.2.2 | High |
| Policy-grade guardrails and audit hooks | 7.1, 7.2, 11.2.2 | High |
| Prometheus/OpenTelemetry export hardening | 6.3, 11.2.3 | High |
| Dashboards, alert rules, and runbooks | 6.4, 11.2.3 | Medium |
---
## 3. Task Breakdown
### 3.1 Durable Storage and Checkpointing
| #28 | Define persistence backend traits for sessions, events, checkpoints, and memories | critical | none | `src/persistence.rs` (new) |
| #29 | Create SQLite schema and migration runner | critical | #28 | `src/persistence/sqlite.rs` (new), `migrations/` (new) |
| #30 | Implement SqliteSessionStore for runs, messages, tool calls, and summaries | high | #28, #29 | `src/persistence/sqlite.rs` |
| #31 | Define versioned CheckpointData and serialization envelope | critical | #28 | `src/checkpoint.rs` (new) |
| #32 | Persist checkpoints from loop boundaries | critical | #31 | `src/core.rs`, `src/phase.rs` |
| #33 | Add resume API with checkpoint compatibility validation | critical | #31, #32 | `src/core.rs`, `src/checkpoint.rs` |
| #34 | Add retention, pruning, and archival policy for persisted sessions | medium | #29, #30 | `src/persistence/sqlite.rs`, `src/config.rs` (new or extend) |
#### Task #28: Define persistence backend traits
**Depends on**: none
**Acceptance criteria**:
- Define traits for session metadata, event append/read, checkpoint save/load, and memory persistence
- Keep the default backend single-process and async-friendly
- Backend interfaces are storage-agnostic so SQLite is an implementation, not a hardcoded dependency
**Estimate**: medium (45 min)
---
#### Task #29: Create SQLite schema and migration runner
**Depends on**: #28
**Acceptance criteria**:
- Create SQLite DDL for sessions, messages, events, checkpoints, memories, and migrations
- Migration runner applies idempotent versioned migrations
- SQLite is the default embedded persistence option with zero required external services
**Estimate**: medium (60 min)
---
#### Task #30: Implement SqliteSessionStore
**Depends on**: #28, #29
**Acceptance criteria**:
- Store and query session metadata, message history, tool call history, and loop outcomes
- Support reads by session id, date range, and trace id
- Add indexes for common filters used by replay and debugging workflows
**Estimate**: medium (60 min)
---
#### Task #31: Define versioned CheckpointData
**Depends on**: #28
**Acceptance criteria**:
- Versioned checkpoint envelope includes messages, context variables, current agent, loop counters, and pending work
- Serialization/deserialization is explicit and migration-ready
- Checkpoint format is suitable for persistence and resume validation
```rust
pub struct CheckpointEnvelope {
pub version: u32,
pub session_id: String,
pub created_at: DateTime<Utc>,
pub payload: CheckpointData,
}
```
**Estimate**: medium (45 min)
---
#### Task #32: Persist checkpoints from loop boundaries
**Depends on**: #31
**Acceptance criteria**:
- Checkpoints can be emitted per iteration, per N iterations, or on explicit boundaries
- Checkpointing hooks are tied to phase or loop transitions, not ad hoc call sites
- Failures to checkpoint are observable and classified without corrupting in-memory state
**Estimate**: medium (60 min)
---
#### Task #33: Add resume API with compatibility validation
**Depends on**: #31, #32
**Acceptance criteria**:
- `Swarm` can resume a session from a persisted checkpoint
- Resume validates checkpoint version, agent availability, and required state
- Incompatible or corrupt checkpoints fail with structured errors and graceful fallback
**Estimate**: medium (60 min)
---
#### Task #34: Add retention, pruning, and archival policy
**Depends on**: #29, #30
**Acceptance criteria**:
- Configurable retention policy by age, status, or storage budget
- Archive path keeps replay/debugging data while pruning hot tables
- Maintenance operations are documented and safe to run repeatedly
**Estimate**: small (30 min)
---
### 3.2 Vector Memory and Event-Driven Updates
| #35 | Define VectorMemory trait and retrieval policy types | high | #28 | `src/memory/vector.rs` (new) |
| #36 | Implement sqlite-vss-backed vector memory adapter | high | #35, #29 | `src/memory/sqlite_vss.rs` (new) |
| #37 | Add optional Qdrant adapter behind feature flag | medium | #35 | `src/memory/qdrant.rs` (new), `Cargo.toml` |
| #38 | Emit event-driven memory hooks for summaries, tool results, and explicit writes | high | #28, #30, #35 | `src/core.rs`, `src/event.rs`, `src/memory/` |
#### Task #35: Define VectorMemory trait
**Depends on**: #28
**Acceptance criteria**:
- Abstract embedding-backed store/retrieve/search operations
- Retrieval policy captures top-k, score threshold, and recency weighting
- Trait supports embedded and remote implementations without changing caller code
**Estimate**: medium (45 min)
---
#### Task #36: Implement sqlite-vss adapter
**Depends on**: #35, #29
**Acceptance criteria**:
- Embedded vector retrieval works with SQLite + sqlite-vss or equivalent local extension
- Memory insert and semantic search APIs are tested
- Fallback behavior is clear when vector extension support is unavailable
**Estimate**: medium (75 min)
---
#### Task #37: Add Qdrant adapter behind feature flag
**Depends on**: #35
**Acceptance criteria**:
- `qdrant` feature flag gates remote vector dependency
- Adapter implements the same retrieval interface as the local backend
- Configuration supports host, collection, auth, and timeout settings
**Estimate**: medium (60 min)
---
#### Task #38: Emit event-driven memory hooks
**Depends on**: #28, #30, #35
**Acceptance criteria**:
- Memory persistence is triggered from lifecycle events, not polling loops
- Conversation completion, tool results, and explicit memory writes flow through a shared hook surface
- Hook behavior is observable through structured events
**Estimate**: medium (60 min)
---
### 3.3 Guardrails, Limits, and Policy Enforcement
| #39 | Extend runtime config with token, time, depth, and quota limits | critical | none | `src/types.rs`, `src/core.rs` |
| #40 | Implement cumulative budget enforcer for iterations, tokens, and wall-clock time | critical | #39 | `src/core.rs`, `src/validation.rs` |
| #41 | Add tool and provider circuit breaker state machines | high | #39 | `src/tool.rs`, `src/provider.rs`, `src/core.rs` |
| #42 | Make prompt injection handling policy-driven and auditable | high | none | `src/guardrails.rs`, `src/event.rs` |
| #43 | Add data classification tags and redaction pipeline for logs/events | high | #42 | `src/guardrails.rs`, `src/event.rs` |
| #44 | Add content policy hook and violation audit path | medium | #42 | `src/guardrails.rs`, `src/core.rs` |
| #45 | Add output verification for tool calls and structured LLM responses | high | #41 | `src/validation.rs`, `src/provider.rs`, `src/tool.rs` |
| #46 | Add heuristic escalation triggers for hallucinations and repeated failures | medium | #40, #45 | `src/core.rs`, `src/guardrails.rs` |
#### Task #39: Extend runtime config with limits
**Depends on**: none
**Acceptance criteria**:
- `SwarmConfig` supports per-task, per-session, and per-agent limits
- Limits cover iterations, recursion/depth, token budgets, tool call quotas, and wall-clock time
- Defaults are safe and overridable
**Estimate**: medium (45 min)
---
#### Task #40: Implement cumulative budget enforcer
**Depends on**: #39
**Acceptance criteria**:
- Enforcement occurs before runaway execution, not after the fact
- Exhaustion produces structured errors and partial-response behavior where appropriate
- Token, iteration, and elapsed-time counters are exported as observable metrics
**Estimate**: medium (60 min)
---
#### Task #41: Add tool and provider circuit breakers
**Depends on**: #39
**Acceptance criteria**:
- Consecutive failure thresholds open a breaker for tools and providers
- Breakers support closed, open, and half-open states with reset policy
- Breaker state changes emit events and are included in debug output
```rust
pub enum CircuitState {
Closed,
Open { opened_at: Instant },
HalfOpen,
}
```
**Estimate**: medium (60 min)
---
#### Task #42: Make prompt injection handling policy-driven
**Depends on**: none
**Acceptance criteria**:
- Support explicit actions: `warn`, `sanitize`, `reject`
- Policy is configurable at runtime
- All detections are auditable through structured events and persisted traces
**Estimate**: small (30 min)
---
#### Task #43: Add data classification and redaction pipeline
**Depends on**: #42
**Acceptance criteria**:
- Sensitive fields can be tagged or inferred before writing events/logs
- Redaction applies consistently to logs, persisted events, and replay output
- Policies distinguish redaction, masking, and drop-on-write behaviors
**Estimate**: medium (45 min)
---
#### Task #44: Add content policy hook
**Depends on**: #42
**Acceptance criteria**:
- Hook surface exists for request/response policy checks
- Violations produce structured audit events
- Default implementation is simple and overridable
**Estimate**: small (30 min)
---
#### Task #45: Add output verification
**Depends on**: #41
**Acceptance criteria**:
- Structured outputs and tool call parameters are validated before execution
- Validation failures feed back through retry/escalation paths instead of panicking
- Unknown tools, wrong parameter types, and malformed structured responses are classified cleanly
**Estimate**: medium (60 min)
---
#### Task #46: Add heuristic escalation triggers
**Depends on**: #40, #45
**Acceptance criteria**:
- Detect repeated failures, non-existent tool calls, and circular reasoning patterns
- Escalation triggers are observable and configurable
- Escalation path can stop execution, request human review, or inject a continuation warning
**Estimate**: medium (45 min)
---
### 3.4 Observability, Metrics, and Operations
| #47 | Harden OpenTelemetry export and trace propagation | high | #30, #41 | `src/observability.rs` (new), `src/event.rs` |
| #48 | Add Prometheus metrics endpoint and registry wiring | high | #47 | `src/observability.rs`, `src/bin/metrics.rs` (new or feature-gated) |
| #49 | Create Grafana dashboards, alert rules, and runbooks | medium | #47, #48 | `deploy/grafana/` (new), `deploy/prometheus/` (new), `docs/runbooks/` (new) |
#### Task #47: Harden OpenTelemetry export
**Depends on**: #30, #41
**Acceptance criteria**:
- Traces, logs, and metrics can be exported with environment-driven configuration
- Span context propagates through tool calls and provider requests
- Correlation IDs remain available for future multi-agent scenarios
**Estimate**: medium (60 min)
---
#### Task #48: Add Prometheus metrics endpoint
**Depends on**: #47
**Acceptance criteria**:
- Counters/histograms/gauges cover latency, token usage, tool outcomes, breaker states, and guardrail events
- Metrics endpoint or exporter can be enabled without changing core business logic
- Naming and labels are stable enough for dashboards and alert rules
**Estimate**: medium (60 min)
---
#### Task #49: Create dashboards, alerts, and runbooks
**Depends on**: #47, #48
**Acceptance criteria**:
- Grafana dashboard JSON or provisioning files exist for core agent metrics
- Alerting rules cover latency spikes, error-rate jumps, breaker openings, and budget exhaustion
- Runbooks document common failure modes and first-response actions
**Estimate**: medium (75 min)
---
## 4. Execution Order
### Phase 2A — Persistence Foundations
```
┌─────────────────────────────────────────────────────────────┐
│ MAX PARALLELISM: 4 tasks │
│ Establish durable interfaces before concrete backends │
└─────────────────────────────────────────────────────────────┘
#28 Define persistence backend traits [critical] 45m
#31 Define versioned CheckpointData [critical] 45m
#35 Define VectorMemory trait [high] 45m
#39 Extend runtime config with limits [critical] 45m
```
### Phase 2B — Embedded Default Backend
```
┌─────────────────────────────────────────────────────────────┐
│ MAX PARALLELISM: 5 tasks │
│ Build the SQLite-first default production path │
└─────────────────────────────────────────────────────────────┘
#29 Create SQLite schema and migrations [critical] 60m → needs #28
#30 Implement SqliteSessionStore [high] 60m → needs #28,#29
#32 Persist checkpoints from loop edges [critical] 60m → needs #31
#33 Add resume API and validation [critical] 60m → needs #31,#32
#38 Emit event-driven memory hooks [high] 60m → needs #28,#30,#35
```
### Phase 2C — Guardrails and Enforcement
```
┌─────────────────────────────────────────────────────────────┐
│ MAX PARALLELISM: 6 tasks │
│ Turn the harness into a bounded, auditable runtime │
└─────────────────────────────────────────────────────────────┘
#40 Implement cumulative budget enforcer [critical] 60m → needs #39
#41 Add tool/provider circuit breakers [high] 60m → needs #39
#42 Prompt injection policy actions [high] 30m → independent
#43 Data classification and redaction [high] 45m → needs #42
#44 Add content policy hook [medium] 30m → needs #42
#45 Add output verification [high] 60m → needs #41
```
### Phase 2D — Retrieval and Telemetry Expansion
```
┌─────────────────────────────────────────────────────────────┐
│ MAX PARALLELISM: 5 tasks │
│ Add retrieval backends and production observability │
└─────────────────────────────────────────────────────────────┘
#36 Implement sqlite-vss adapter [high] 75m → needs #35,#29
#37 Add Qdrant adapter [medium] 60m → needs #35
#47 Harden OpenTelemetry export [high] 60m → needs #30,#41
#48 Add Prometheus metrics endpoint [high] 60m → needs #47
#34 Add retention and archival policy [medium] 30m → needs #29,#30
```
### Phase 2E — Operations and Escalation
```
┌─────────────────────────────────────────────────────────────┐
│ MAX PARALLELISM: 3 tasks │
│ Finish operational workflows and incident readiness │
└─────────────────────────────────────────────────────────────┘
#46 Add heuristic escalation triggers [medium] 45m → needs #40,#45
#49 Create dashboards, alerts, runbooks [medium] 75m → needs #47,#48
```
---
## 5. Tracking Progress
### 5.1 Database Commands
```bash
# View Phase 2 tasks (assuming numbering continues from Phase 1)
sqlite3 docs/agent_context.db "SELECT id, title, priority, status FROM items WHERE id BETWEEN 28 AND 49 ORDER BY id"
# Mark a task in progress
sqlite3 docs/agent_context.db "UPDATE items SET status = 'in_progress', updated_at = datetime('now') WHERE id = 28"
# Mark complete
sqlite3 docs/agent_context.db "UPDATE items SET status = 'complete', updated_at = datetime('now') WHERE id = 28"
# Log progress
sqlite3 docs/agent_context.db "INSERT INTO entries (session_id, entry_type, content) VALUES (2, 'progress', 'Completed persistence backend trait definitions')"
```
### 5.2 Status Values
- `pending` — Not started
- `in_progress` — Currently being worked on
- `complete` — Finished and verified
- `blocked` — Waiting on dependency
- `deferred` — Postponed to later phase
---
## 6. File Structure (Target)
```text
src/
├── core.rs
├── event.rs
├── guardrails.rs
├── lib.rs
├── memory.rs
├── phase.rs
├── tool.rs
├── validation.rs
│
├── checkpoint.rs # NEW: checkpoint envelope + resume validation
├── persistence.rs # NEW: persistence traits
├── persistence/
│ └── sqlite.rs # NEW: SQLite-backed session/event/checkpoint store
├── memory/
│ ├── vector.rs # NEW: VectorMemory trait + retrieval policy
│ ├── sqlite_vss.rs # NEW: embedded vector backend
│ └── qdrant.rs # NEW: optional remote vector backend
├── observability.rs # NEW: OTEL + Prometheus wiring
│
└── tests/ # extend with persistence, resume, guardrail, and ops coverage
migrations/ # NEW: SQLite DDL and migration metadata
deploy/
├── grafana/ # NEW: dashboards/provisioning
└── prometheus/ # NEW: alert rules / scrape config examples
docs/
└── runbooks/ # NEW: incident response docs
```
---
## 7. Notes
### 7.1 Embedded-First Discipline
Phase 2 should keep the **embedded SQLite path as the default production story**:
- No mandatory external database
- No mandatory vector service
- Remote stores are optional adapters behind feature flags
### 7.2 Backward Compatibility
Phase 2 should preserve the simple Phase 1 developer experience:
- Existing in-memory workflows remain valid
- Durable persistence is opt-in through config
- Guardrails default to safe behavior but remain configurable
- Observability exporters are feature/config driven rather than hard-wired
### 7.3 Testing Expectations
Phase 2 needs broader verification than Phase 1:
- Migration tests for schema upgrades and rollback safety
- Resume tests for valid and incompatible checkpoints
- Guardrail tests for reject/sanitize/audit paths
- Circuit breaker tests for closed/open/half-open transitions
- Metrics export tests for stable names and labels
---
## 8. Changelog
| 2026-03-30 | Initial decomposition for Phase 2 created with tasks #28-#49 |