task-graph-mcp 0.5.0

MCP server for agent task workflows with phases, prompts, gates, and multi-agent coordination
# Workflow Experiments System


> **Version:** 1.0
> **Date:** 2026-01-31
> **Status:** Design Proposal

## Motivation


We have five workflow configurations (solo, swarm, relay, hierarchical, push) that represent different multi-agent coordination patterns. We have opinions about when each works best, but no empirical data. This design describes how to use the existing task-graph infrastructure to run controlled experiments that measure real tradeoffs between coordination patterns.

### What We Want to Learn


1. **Push vs Pull coordination:** Does having a coordinator assign every task (push) add overhead that outweighs better load balancing? Or does self-selection (pull/swarm) cause enough contention and poor distribution to offset its simplicity?

2. **Hierarchical vs Flat:** Does a lead agent decomposing work and monitoring progress justify the cost of that lead's token budget and the serialization bottleneck it introduces?

3. **Specialist vs Generalist:** When specialist agents (relay: designer -> implementer -> reviewer -> tester) produce higher quality through handoff gates, does the sequential bottleneck and handoff overhead cost more in wall-clock time than generalists working in parallel?

4. **Granularity effects:** Do fine-grained atomic tasks (swarm-style) outperform coarse-grained tasks (solo-style) at different team sizes? Where is the crossover?

5. **Scaling behavior:** How does each workflow degrade as agent count increases? Which patterns have linear scaling and which hit coordination walls?

---

## Existing Infrastructure


The current codebase provides most of what an experiment system needs. Here is what exists and what role it plays.

### Templates (`src/db/template.rs`)


Templates are Snapshot-format JSON files that define reusable task structures. They support:

- **Identical starting conditions:** `instantiate_template()` creates fresh copies with new IDs, resetting all status, timestamps, and runtime fields. Two experiments can start from the exact same task graph.
- **ID remapping:** Every instantiation generates unique petname IDs via `remap_snapshot()`, so multiple experiment runs never collide.
- **Parent attachment:** Entry points can be attached to a parent task, allowing an experiment root task to own all instantiated work.
- **Extra tags:** `InstantiateOptions::with_extra_tags()` can stamp every task with an experiment identifier (e.g., `["exp-swarm-3agent-run1"]`).
- **Title prefix:** `with_title_prefix("Run-1")` disambiguates tasks from different runs in the same database.

Templates serve as the **control variable** -- the identical project shape that remains constant while we vary the workflow.

### Workflows (`src/config/workflows.rs`, `config/workflow-*.yaml`)


Workflow configs define the **independent variable** -- the coordination pattern under test. Each workflow specifies:

- **States and transitions:** What states exist and which transitions are allowed.
- **Transition prompts:** Instructions that shape agent behavior at each state change (the mechanism by which coordination patterns are enforced).
- **Roles:** Who can do what (`lead` vs `worker`, `designer` vs `implementer`).
- **Role prompts:** Behavioral instructions per role (claiming strategy, handoff protocol, failure handling).
- **Gates:** Required artifacts before transitions (the relay workflow requires a design spec gate before implementation can proceed).
- **Overlays:** Additive modifications via `apply_overlay()` for cross-cutting concerns (git workflow, troubleshooting) without modifying the base workflow.

Five workflow configs exist: `solo`, `swarm`, `relay`, `hierarchical`, `push`. The push workflow already includes an `experiment_metrics` section documenting what to capture.

### Metrics Tracking (`src/tools/tracking.rs`, `src/db/state_transitions.rs`)


The metrics infrastructure provides the **dependent variables** -- what we measure:

- **Automatic time tracking:** `record_state_transition()` closes the previous transition with an `end_timestamp` and accumulates `time_actual_ms` for timed states. Every status change is recorded in `task_sequence`.
- **`log_metrics` tool:** Agents call this to record `cost_usd` and up to 8 integer metric slots (token counts, custom values). Values are aggregated (added to existing).
- **`get_metrics` tool:** Retrieves metrics for one or more tasks, supporting aggregation across task groups.
- **`task_history` tool:** Returns the full state transition sequence for a task with computed durations, time-per-status, and time-per-agent breakdowns.
- **`project_history` tool:** Returns project-wide transition data with time range filters, transition counts by status and agent, and total tracked time.
- **`get_stats`:** Aggregation queries returning `tasks_by_status`, `total_points`, `completed_points`, `total_time_estimate_ms`, `total_time_actual_ms`, `total_cost_usd`, and the 8-slot metrics array.

### Export/Import Pipeline (`src/export/mod.rs`, `src/db/export.rs`, `src/db/import.rs`)


The export system captures experiment results:

- **Snapshot format:** JSON with all tables (`tasks`, `dependencies`, `attachments`, `task_tags`, `task_sequence`) ordered deterministically for diff-friendly output.
- **Full state history:** The `task_sequence` table is exported, preserving the complete transition timeline for post-hoc analysis.
- **Gzip support:** Large experiments can export compressed.
- **Schema versioning:** Exports include `schema_version` and `export_version` for forward compatibility.

---

## Experiment Protocol


### Overview


An experiment compares two or more workflow configurations running the same template project with the same agent count. The process has four phases: prepare, execute, collect, analyze.

### Phase 1: Prepare


#### 1a. Create the Template Project


Build a representative task graph that exercises the dimensions you want to test. Export it as a template.

```
# Work in a scratch database to build the template

task-graph export --output templates/medium-feature.json
```

Template design guidelines:

- **Realistic structure:** 15-30 tasks with a mix of independent and dependent work. Include at least one critical path (chain of `follows` dependencies) and at least one parallel fan-out (multiple independent siblings under a parent).
- **Tag the tasks:** Use `needed_tags` to indicate specialist requirements (e.g., `designer`, `implementer`, `tester`). Generalist workflows will ignore these; specialist workflows will route by them.
- **Include estimates:** Set `time_estimate_ms` and `points` on tasks so metrics can compute estimate accuracy and weighted throughput.
- **Variety:** Mix task sizes. Some 1-point quick tasks, some 5-point substantial ones. This tests whether the workflow handles heterogeneous work.

#### 1b. Define the Experiment Matrix


Decide which variables to test and hold constant.

| Variable | Role | Example Values |
|----------|------|----------------|
| Workflow config | Independent | `swarm`, `hierarchical`, `push` |
| Agent count | Control (or 2nd independent) | 3 |
| Template | Control | `medium-feature.json` |
| Model | Control | Same model for all agents |
| Run count | Replication | 3 runs per configuration |

#### 1c. Create the Experiment Manifest


A YAML file describing the experiment:

```yaml
# experiments/push-vs-pull.yaml

name: push-vs-pull
description: Compare push coordination overhead against pull self-selection
template: templates/medium-feature.json
agent_count: 3
runs_per_config: 3

configs:
  - workflow: swarm
    description: Pure pull - agents self-select from ready queue
  - workflow: push
    description: Pure push - coordinator assigns every task
  - workflow: hierarchical
    description: Hybrid - lead decomposes, workers pull subtasks

metrics:
  - wall_clock_time_ms
  - total_cost_usd
  - rework_rate
  - agent_utilization_pct
  - dependency_wait_time_ms
  - tasks_per_hour
  - coordination_overhead_pct
```
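The manifest expands into a flat run list, one entry per (workflow, run index) pair. The following is a minimal sketch, assuming the manifest has already been parsed into a dict (for example by a YAML library). The `expand_runs` helper and the export path scheme are illustrative, not part of the codebase; the tag and title-prefix formats follow the Phase 2 example.

```python
from typing import Dict, List

# Manifest as a plain dict mirroring the YAML above (parsing is out of scope).
manifest = {
    "name": "push-vs-pull",
    "template": "templates/medium-feature.json",
    "agent_count": 3,
    "runs_per_config": 3,
    "configs": [
        {"workflow": "swarm"},
        {"workflow": "push"},
        {"workflow": "hierarchical"},
    ],
}


def expand_runs(manifest: Dict) -> List[Dict]:
    """Return one run descriptor per (workflow, run index) pair."""
    runs = []
    for config in manifest["configs"]:
        workflow = config["workflow"]
        for i in range(1, manifest["runs_per_config"] + 1):
            runs.append({
                "workflow": workflow,
                "run": i,
                # Hypothetical naming scheme, modeled on the Phase 2 example.
                "extra_tags": [f"exp:{manifest['name']}", f"run:{workflow}-{i}"],
                "title_prefix": f"{workflow}-r{i}",
                "export_path": f"results/{manifest['name']}/{workflow}-run{i}.json",
            })
    return runs


runs = expand_runs(manifest)
print(len(runs), runs[0]["extra_tags"])
```

With three configurations and three runs each, this yields nine run descriptors.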

### Phase 2: Execute

Each run follows this sequence:

1. **Instantiate the template** with experiment-specific tags:
   ```
   instantiate_template(
     template: "medium-feature.json",
     options: {
       parent_task_id: "experiment-root",
       extra_tags: ["exp:push-vs-pull", "run:swarm-1"],
       title_prefix: "swarm-r1"
     }
   )
   ```

2. **Load the workflow config** for this run. The workflow's prompts and role definitions shape agent behavior -- this is the independent variable.

3. **Connect agents** with appropriate tags. For specialist workflows (relay), connect agents with role-specific tags (`designer`, `implementer`, `tester`). For generalist workflows (swarm), all agents get `worker` tags.

4. **Run to completion.** Agents follow the workflow's prompts, claiming and completing tasks. The system automatically records:
   - Every state transition in `task_sequence` (with timestamps)
   - Accumulated `time_actual_ms` on each task
   - `cost_usd` and `metrics` via `log_metrics` calls
   - File coordination events via `mark_file`/`unmark_file`

5. **Export the results:**
   ```
   task-graph export --output results/push-vs-pull/swarm-run1.json
   ```

#### Isolation Between Runs

Each run should use a **fresh database** to avoid cross-contamination. The simplest approach:

```bash
# Set a unique DB path for each run
TASK_GRAPH_DB_PATH=experiments/swarm-run1.db task-graph-mcp
```

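Alternatively, rely on the template system's ID remapping to run multiple experiments in one database and filter by experiment tags during analysis. Separate databases are cleaner, though, because no filtering is needed and a failed run cannot leave state behind for the next one.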
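A runner could enforce the fresh-database rule when it builds each run's environment. This is a sketch under assumptions: `run_environment` is a hypothetical helper, and the only detail taken from this document is the `TASK_GRAPH_DB_PATH` variable.

```python
import os
from pathlib import Path


def run_environment(db_dir, run_id, base_env=None):
    """Build an environment with a unique TASK_GRAPH_DB_PATH for one run."""
    db_path = Path(db_dir) / f"{run_id}.db"
    if db_path.exists():
        # Reusing a database would mix state from an earlier run into this one.
        raise FileExistsError(f"database already exists: {db_path}")
    env = dict(os.environ if base_env is None else base_env)
    env["TASK_GRAPH_DB_PATH"] = str(db_path)
    return env


env = run_environment("experiments", "swarm-run1", base_env={})
print(env["TASK_GRAPH_DB_PATH"])
```

The returned mapping can be passed as `env=` to `subprocess.Popen` when launching the server process.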

### Phase 3: Collect


After all runs complete, gather metrics from each exported snapshot.

#### Metrics to Collect


| Metric | Source | Computed From |
|--------|--------|---------------|
| **Wall-clock time** | `tasks` table | `MAX(completed_at) - MIN(created_at)` across all tasks in the experiment |
| **Total cost** | `tasks.cost_usd` | `SUM(cost_usd)` across all tasks |
| **Active work time** | `tasks.time_actual_ms` | `SUM(time_actual_ms)` across all tasks |
| **Dependency wait time** | `task_sequence` | Total time tasks spent in `pending` state after their creation (excluding initial pending before first claim) |
| **Agent utilization** | `task_sequence` | Per agent: `time_in_working / wall_clock_time`. Measures how much of the experiment duration each agent spent doing productive work vs waiting. |
| **Rework rate** | `task_sequence` | Tasks with more than one `working` period divided by total tasks |
| **Throughput** | `tasks` | `completed_tasks / wall_clock_hours` |
| **Coordination overhead** | `task_sequence` | `(time_in_pending + time_in_assigned) / (time_in_pending + time_in_assigned + time_in_working)` |
| **Dispatch latency** (push only) | `task_sequence` | Time from task creation to `assigned` transition |
| **Pickup latency** | `task_sequence` | Time from `assigned` (or `pending`) to `working` transition |
| **Failure rate** | `tasks` | Tasks ending in `failed` / total tasks |
| **Estimate accuracy** | `tasks` | `AVG(time_actual_ms / time_estimate_ms)` for tasks with estimates |
| **Points throughput** | `tasks` | `SUM(points) / wall_clock_hours` for weighted throughput |
| **Per-metric slots** | `tasks.metric_0..7` | Token counts, custom counters logged via `log_metrics` |
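
Three of the `task_sequence` metrics above (coordination overhead, rework rate, agent utilization) can be sketched as follows. The row layout is an assumption: only `end_timestamp` is named in this document, while `task_id`, `status`, `agent`, and `timestamp` are hypothetical field names, and all times are taken to be milliseconds.

```python
from collections import defaultdict

# Hypothetical task_sequence rows; field names other than end_timestamp are assumed.
sequence = [
    {"task_id": "a", "status": "pending", "agent": None, "timestamp": 0, "end_timestamp": 1000},
    {"task_id": "a", "status": "working", "agent": "w1", "timestamp": 1000, "end_timestamp": 4000},
    {"task_id": "a", "status": "pending", "agent": None, "timestamp": 4000, "end_timestamp": 5000},
    {"task_id": "a", "status": "working", "agent": "w2", "timestamp": 5000, "end_timestamp": 7000},
    {"task_id": "b", "status": "pending", "agent": None, "timestamp": 0, "end_timestamp": 2000},
    {"task_id": "b", "status": "working", "agent": "w2", "timestamp": 2000, "end_timestamp": 5000},
]


def time_by_status(rows):
    """Total closed-transition time per status."""
    totals = defaultdict(int)
    for r in rows:
        if r["end_timestamp"] is not None:  # skip transitions that are still open
            totals[r["status"]] += r["end_timestamp"] - r["timestamp"]
    return totals


def coordination_overhead(rows):
    """(pending + assigned) / (pending + assigned + working)."""
    t = time_by_status(rows)
    waiting = t["pending"] + t["assigned"]
    total = waiting + t["working"]
    return waiting / total if total else 0.0


def rework_rate(rows):
    """Share of tasks with more than one working period."""
    working_periods = defaultdict(int)
    task_ids = set()
    for r in rows:
        task_ids.add(r["task_id"])
        if r["status"] == "working":
            working_periods[r["task_id"]] += 1
    reworked = sum(1 for n in working_periods.values() if n > 1)
    return reworked / len(task_ids) if task_ids else 0.0


def agent_utilization(rows, wall_clock_ms):
    """Per agent: time in working divided by wall-clock time."""
    busy = defaultdict(int)
    for r in rows:
        if r["status"] == "working" and r["end_timestamp"] is not None:
            busy[r["agent"]] += r["end_timestamp"] - r["timestamp"]
    return {agent: ms / wall_clock_ms for agent, ms in busy.items()}


print(coordination_overhead(sequence), rework_rate(sequence))
print(agent_utilization(sequence, wall_clock_ms=7000))
```

In the sample data, task `a` has two working periods, so it counts as reworked while task `b` does not.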

#### Collection Method


A post-hoc analysis script loads each exported snapshot and computes the metrics above. No runtime instrumentation is needed beyond what the system already records.

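```python
# Sketch of metric extraction from one exported snapshot
import json

with open("results/push-vs-pull/swarm-run1.json", encoding="utf-8") as f:
    snapshot = json.load(f)

tasks = snapshot["tables"]["tasks"]
sequence = snapshot["tables"]["task_sequence"]

# Wall-clock time: first task creation to last task completion
completed = [t["completed_at"] for t in tasks if t.get("completed_at")]
wall_clock = max(completed) - min(t["created_at"] for t in tasks)

# Total cost: treat a missing cost_usd as zero
total_cost = sum(t.get("cost_usd") or 0 for t in tasks)
# ... remaining metrics follow the table above
```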
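
Because exports can be gzip-compressed, the analysis script needs a loader that accepts both forms. The following is a minimal sketch, assuming a compressed export is an ordinary gzip stream wrapping the same JSON. This `load_snapshot` is an illustrative helper, not a function from the codebase.

```python
import gzip
import json
from pathlib import Path


def load_snapshot(path):
    """Load a snapshot, transparently handling gzip-compressed exports."""
    path = Path(path)
    # Detect gzip by its magic bytes rather than trusting the file extension.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Checking the magic bytes keeps the loader working even if a compressed export was saved without a `.gz` suffix.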

### Phase 4: Analyze


Compare metrics across configurations in four ways:

1. **Tabular comparison:** One row per configuration, columns for each metric, averaged across runs with standard deviation.

2. **Key ratios** (these require a `solo` baseline run on the same template):
   - Speedup: `wall_clock(solo) / wall_clock(workflow)` -- how much faster is multi-agent?
   - Cost ratio: `total_cost(workflow) / total_cost(solo)` -- how much more expensive?
   - Efficiency: `speedup / agent_count` -- is each additional agent paying for itself?

3. **Distribution analysis:** Per-agent utilization histograms to detect load imbalance. Dependency wait time distributions to find bottleneck tasks.

4. **Scaling curves:** If running experiments at multiple agent counts, plot throughput vs agent count per workflow to identify scaling limits.
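
The key ratios and the per-configuration summary can be sketched as below. The numbers are placeholders for illustration, not measured results, and `summarize` and `key_ratios` are hypothetical helper names.

```python
from statistics import mean, stdev

# Placeholder per-run values; not real measurements.
runs = {
    "solo": {"wall_clock_ms": [600_000, 620_000, 580_000], "cost_usd": [1.0, 1.1, 0.9]},
    "swarm": {"wall_clock_ms": [300_000, 310_000, 290_000], "cost_usd": [1.5, 1.6, 1.4]},
}


def summarize(values):
    """Mean and sample standard deviation across runs."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


def key_ratios(runs, workflow, agent_count, baseline="solo"):
    """Speedup, cost ratio, and efficiency relative to the baseline workflow."""
    base_time = mean(runs[baseline]["wall_clock_ms"])
    base_cost = mean(runs[baseline]["cost_usd"])
    speedup = base_time / mean(runs[workflow]["wall_clock_ms"])
    return {
        "speedup": speedup,
        "cost_ratio": mean(runs[workflow]["cost_usd"]) / base_cost,
        "efficiency": speedup / agent_count,
    }


print(summarize(runs["swarm"]["wall_clock_ms"]))
print(key_ratios(runs, "swarm", agent_count=3))
```

An efficiency near 1.0 would mean each agent contributes a full unit of speedup; values well below 1.0 suggest coordination cost is absorbing the extra capacity.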

---

## Control Variables


To produce meaningful comparisons, these must remain constant across experiment arms:

| Variable | Why | How to Control |
|----------|-----|----------------|
| **Template project** | Same task graph structure, same dependencies, same estimates | Use `instantiate_template` from a single JSON file |
| **Agent count** | Same parallelism budget | Start the same number of agent processes per run |
| **Model** | Same underlying capability | Configure all agents to use the same model and temperature |
| **Prompt foundation** | Same base instructions | Vary only workflow-specific prompts; keep system prompts constant |
| **Hardware/network** | Same latency environment | Run experiments on the same machine or cloud instance |
| **Task content** | Same actual work to perform | Templates define task descriptions; agents follow them |

The only thing that changes between arms is the **workflow configuration file**, which controls:
- State transition prompts (agent behavioral instructions)
- Role definitions and role prompts
- Gates (required artifacts)
- Coordination model (encoded in prompts and roles)

---

## What Is Missing


The existing infrastructure covers about 80% of what a full experiment system needs. Here is what would need to be built, roughly ordered by priority.

### High Priority (Needed for First Experiment)


#### 1. Experiment Runner Script


A CLI command or script that automates the execute phase:

```bash
task-graph experiment run --manifest experiments/push-vs-pull.yaml
```

This would:
- Create a fresh database per run
- Instantiate the template with proper tags
- Start the MCP server with the specified workflow
- Signal agents to connect (or wait for manual connection)
- Wait for all tasks to reach terminal states
- Export results
- Repeat for each configuration and run

**Scope:** Shell script or Rust CLI subcommand. The task-graph binary already has a CLI (`src/cli/mod.rs`) that could host an `experiment` subcommand.

**Complexity:** Medium. The core logic (instantiate + export) exists. The orchestration (starting servers, waiting for completion) is new.

#### 2. Metrics Extraction Script


A script that reads exported snapshots and computes the metrics table from Phase 3.

```bash
task-graph experiment analyze --results-dir results/push-vs-pull/
```

Output: A comparison table (markdown or CSV) with one row per configuration, columns for each metric.

**Scope:** Python or Rust. Could use the existing `Snapshot` struct for parsing. The SQL queries from `docs/METRICS.md` translate directly to Rust `rusqlite` queries or Python `sqlite3` queries.

**Complexity:** Low-medium. The queries are straightforward. The main work is wiring them together with nice output formatting.
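
The output formatting step is small. A sketch of a markdown renderer for the comparison table follows; `markdown_table` is a hypothetical helper and the values are placeholders, not results.

```python
def markdown_table(rows, columns):
    """Render a list of dicts as a markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    lines = [header, divider]
    for row in rows:
        cells = []
        for col in columns:
            value = row.get(col, "")
            # Floats get two decimals; everything else is printed as-is.
            cells.append(f"{value:.2f}" if isinstance(value, float) else str(value))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder values for illustration only.
rows = [
    {"workflow": "swarm", "wall_clock_h": 1.25, "total_cost_usd": 3.5, "rework_rate": 0.1},
    {"workflow": "push", "wall_clock_h": 1.5, "total_cost_usd": 4.0, "rework_rate": 0.05},
]
columns = ["workflow", "wall_clock_h", "total_cost_usd", "rework_rate"]
print(markdown_table(rows, columns))
```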

#### 3. Template Library


A curated set of template projects at different scales:

| Template | Tasks | Deps | Description |
|----------|-------|------|-------------|
| `tiny-feature.json` | 5 | 3 | Minimal: one parent, four subtasks |
| `medium-feature.json` | 15-20 | 10-15 | Realistic: mixed deps, critical path, parallel fan-out |
| `large-refactor.json` | 40-60 | 30+ | Stress test: deep hierarchy, many dependencies |
| `independent-batch.json` | 20 | 0 | All independent tasks, tests pure parallelism |
| `pipeline.json` | 10 | 9 | Fully sequential chain, tests handoff efficiency |

These should be checked into `templates/` and documented. They need specialist tags on tasks so relay/hierarchical workflows can route properly, and they should have estimates for throughput analysis.

**Scope:** JSON files created by hand or exported from real projects.

**Complexity:** Low. Just careful task graph design.

### Medium Priority (Improves Experiment Quality)


#### 4. Automatic Completion Detection


Currently there is no built-in way to know when "the experiment is done." The runner script would need to poll `get_stats()` and check if all tasks are in terminal states (`completed`, `failed`, `cancelled`).

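A simple approach: count the tasks whose status is still in the blocking states list. When the count reaches zero, the experiment is done. In the query below, `experiment_tasks` is a placeholder for whatever selects this run's task IDs (for example, a filter on the experiment tag); with one database per run, the subquery can be dropped.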

```sql
SELECT COUNT(*) FROM tasks
WHERE id IN (SELECT id FROM experiment_tasks)
  AND status IN ('pending', 'assigned', 'working');
```

**Scope:** A polling loop in the experiment runner. Could also be a `--wait` flag on the export CLI command.

**Complexity:** Low.
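
A polling loop along these lines could sit in the runner. This sketch assumes the runner can read the run's SQLite database directly, that each run has its own database (so every task belongs to the experiment), and that `tasks` has a `status` column as the query above implies. The function names are hypothetical.

```python
import sqlite3
import time

NON_TERMINAL = ("pending", "assigned", "working")


def remaining_tasks(conn):
    """Count tasks that have not reached a terminal state."""
    placeholders = ",".join("?" for _ in NON_TERMINAL)
    query = f"SELECT COUNT(*) FROM tasks WHERE status IN ({placeholders})"
    return conn.execute(query, NON_TERMINAL).fetchone()[0]


def wait_for_completion(conn, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll until no task is in a non-terminal state; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if remaining_tasks(conn) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Demo against an in-memory stand-in for the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)", [("a", "completed"), ("b", "working")]
)
print(remaining_tasks(conn))
```

Note that an empty database also reports zero remaining tasks, so the loop should only start after the template has been instantiated. The timeout guards against a run that stalls with a task stuck in `working`.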

#### 5. Experiment Tagging in the Database


Add first-class support for experiment metadata. Currently, experiment identity is encoded via `extra_tags` on tasks. A cleaner approach would be an `experiment_runs` table:

```sql
CREATE TABLE experiment_runs (
  id TEXT PRIMARY KEY,
  experiment_name TEXT NOT NULL,
  workflow_name TEXT NOT NULL,
  agent_count INTEGER,
  template_name TEXT,
  started_at INTEGER,
  completed_at INTEGER,
  config_snapshot TEXT  -- JSON of the full workflow config used
);
```

Tasks would reference their experiment run via a tag or a dedicated column. This makes querying cleaner than filtering by tag prefix.

**Scope:** Migration + schema change + minor tool updates.

**Complexity:** Medium. Touches the schema, import/export, and stats queries.
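
To illustrate how the runner might use such a table, here is a sketch with Python's `sqlite3`. The DDL is the one proposed above; `start_run` and `finish_run` are hypothetical helpers, and storing timestamps as epoch milliseconds is an assumption.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS experiment_runs (
  id TEXT PRIMARY KEY,
  experiment_name TEXT NOT NULL,
  workflow_name TEXT NOT NULL,
  agent_count INTEGER,
  template_name TEXT,
  started_at INTEGER,
  completed_at INTEGER,
  config_snapshot TEXT
)
"""


def start_run(conn, run_id, experiment, workflow, agent_count, template, config):
    """Insert a run row with its start time and the workflow config used."""
    conn.execute(
        "INSERT INTO experiment_runs (id, experiment_name, workflow_name, "
        "agent_count, template_name, started_at, config_snapshot) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, experiment, workflow, agent_count, template,
         int(time.time() * 1000), json.dumps(config, sort_keys=True)),
    )


def finish_run(conn, run_id):
    """Stamp the completion time on an existing run row."""
    conn.execute(
        "UPDATE experiment_runs SET completed_at = ? WHERE id = ?",
        (int(time.time() * 1000), run_id),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
start_run(conn, "swarm-run1", "push-vs-pull", "swarm", 3,
          "medium-feature.json", {"name": "swarm"})
finish_run(conn, "swarm-run1")
print(conn.execute("SELECT id, workflow_name FROM experiment_runs").fetchall())
```

Serializing the config with `sort_keys=True` keeps `config_snapshot` stable across runs, which makes it easy to confirm that two runs used the same workflow definition.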

#### 6. Warm-Up and Cool-Down Handling


The first task in any experiment run has disproportionate startup cost (agent initialization, context loading). The last task may have cleanup overhead. Metrics should support excluding warm-up and cool-down tasks, either by:
- Marking them with a tag (`warmup`, `cooldown`)
- Excluding the first/last N tasks from aggregation
- Using time-based trimming (exclude first/last M minutes)

**Scope:** Metric extraction script feature.

**Complexity:** Low.
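
The first two options can be combined in one filter. This sketch assumes each task dict carries a `tags` list and a `completed_at` timestamp; in an exported snapshot the tags live in the separate `task_tags` table, so a real script would join them first. `trim_tasks` is a hypothetical helper.

```python
def trim_tasks(tasks, exclude_tags=("warmup", "cooldown"), skip_first=0, skip_last=0):
    """Drop tagged tasks, then drop the first/last N by completion time."""
    excluded = set(exclude_tags)
    kept = [t for t in tasks if not excluded & set(t.get("tags", []))]
    kept.sort(key=lambda t: t["completed_at"])
    end = len(kept) - skip_last
    return kept[skip_first:end]


# Placeholder tasks for illustration.
tasks = [
    {"id": "t1", "completed_at": 100, "tags": ["warmup"]},
    {"id": "t2", "completed_at": 200, "tags": []},
    {"id": "t3", "completed_at": 300, "tags": []},
    {"id": "t4", "completed_at": 400, "tags": []},
    {"id": "t5", "completed_at": 500, "tags": ["cooldown"]},
]
print([t["id"] for t in trim_tasks(tasks)])
print([t["id"] for t in trim_tasks(tasks, skip_first=1, skip_last=1)])
```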

### Low Priority (Nice to Have)


#### 7. Live Dashboard During Experiments


The web dashboard (`src/dashboard/`) already shows task status. Extending it with experiment-specific views (per-workflow progress, live metric counters, agent utilization gauges) would help monitor experiments in real time.

**Scope:** Dashboard template changes.

**Complexity:** Medium-high. The dashboard exists but adding experiment-specific views requires new routes and templates.

#### 8. Statistical Significance Testing


When comparing metrics across configurations, report whether differences are statistically significant. With 3+ runs per configuration, compute:
- Mean and standard deviation per metric per configuration
- p-values from t-tests or Mann-Whitney U tests
- Confidence intervals

**Scope:** Analysis script feature (likely Python with scipy).

**Complexity:** Low once the data extraction works.
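
As a dependency-free illustration, the sketch below uses an exact permutation test on the difference of means. This is a swapped-in technique, not the t-test or Mann-Whitney U test listed above, chosen because it needs only the standard library. The sample values are placeholders, not results.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference of means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n, k = len(pooled), len(a)
    extreme = total = 0
    # Enumerate every way to relabel k of the n pooled values as group A.
    for idx in combinations(range(n), k):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Small tolerance so float rounding does not exclude ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Placeholder wall-clock times in minutes; not real measurements.
swarm = [30.0, 31.0, 29.0]
push = [40.0, 42.0, 41.0]
print(permutation_p_value(swarm, push))
```

With 3 runs per configuration there are only 20 possible relabelings, so the smallest two-sided p-value this test can report is 2/20 = 0.1, even when the two groups do not overlap at all. Reaching a 0.05 threshold with a rank or permutation test therefore needs more than 3 runs per arm.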

#### 9. Automated Report Generation


Generate a markdown or HTML report from experiment results with tables, charts, and narrative summaries. Could use the analysis script output plus a template.

**Scope:** Script that formats analysis output.

**Complexity:** Low-medium.

---

## Example: First Experiment


Here is a concrete plan for the first experiment to validate the system.

### Goal


Compare swarm (pull) vs push coordination with 3 agents on a medium-sized feature.

### Setup


1. Create `templates/medium-feature.json` with ~20 tasks:
   - 1 root task ("Build Widget System")
   - 4 top-level subtasks (Design, Implement Core, Implement UI, Test)
   - Each top-level subtask has 3-5 leaf tasks
   - `follows` dependencies between Design -> Implement -> Test
   - Parallel paths between Implement Core and Implement UI
   - `needed_tags`: `designer` on design tasks, `implementer` on impl tasks, `tester` on test tasks

2. Run 3 iterations of each:
   - **Swarm:** All 3 agents connect as `worker` generalists. Pull coordination.
   - **Push:** 1 agent connects as `coordinator`, 2 as `worker`. Push coordination.

3. Collect: Export each run's database.

4. Analyze: Compare wall-clock time, total cost, rework rate, agent utilization.

### Expected Outcomes


- Swarm should complete faster (3 parallel workers vs 1 coordinator + 2 workers)
- Push should have lower rework rate (coordinator can route tasks to best-fit worker)
- Push should have higher coordination overhead (coordinator's token budget)
- Swarm should have more variable agent utilization (luck of the draw on claiming)

### Success Criteria for the Experiment System


The experiment system works if:
1. Templates instantiate identically across runs (verified by task count and dependency count)
2. Metrics are automatically captured without manual intervention beyond `log_metrics`
3. The analysis script produces a comparison table from exported snapshots
4. Results are reproducible: multiple runs of the same config produce metrics within reasonable variance
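
Criteria 1 and 4 lend themselves to automated checks. The sketch below assumes snapshots are loaded as dicts with a `tables` key holding `tasks` and `dependencies`, as in the export format described earlier. The helper names are hypothetical, and what counts as "reasonable variance" is left to the experimenter.

```python
from statistics import mean, stdev


def same_shape(snapshots):
    """True if every snapshot has the same task and dependency counts."""
    shapes = {
        (len(s["tables"]["tasks"]), len(s["tables"]["dependencies"]))
        for s in snapshots
    }
    return len(shapes) == 1


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean; lower is more reproducible."""
    return stdev(values) / mean(values)


# Placeholder snapshots and timings for illustration.
run_a = {"tables": {"tasks": [{}] * 20, "dependencies": [{}] * 12}}
run_b = {"tables": {"tasks": [{}] * 20, "dependencies": [{}] * 12}}
print(same_shape([run_a, run_b]))
print(coefficient_of_variation([100, 110, 90]))
```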

---

## Relationship to Existing Docs


- **`docs/METRICS.md`** defines the full metrics catalog with SQL queries. This design uses those metrics as dependent variables.
- **`docs/WORKFLOW_TOPOLOGIES.md`** describes the workflow dimension space. This design turns those qualitative comparisons into quantitative experiments.
- **`config/workflow-push.yaml`** already includes an `experiment_metrics` section. This design generalizes that approach across all workflows.

---

## Summary


The task-graph system already has the three pillars needed for experiments:

1. **Templates** provide identical starting conditions (control)
2. **Workflows** provide the coordination patterns to compare (independent variable)
3. **Metrics tracking** provides automatic measurement (dependent variables)

What needs to be built is the orchestration layer that ties them together: a runner script, a metrics extraction script, and a library of template projects. This is a scripting and tooling effort, not an architectural change. The core infrastructure is ready.