task-graph-mcp 0.3.0

MCP server for agent task workflows with phases, prompts, gates, and multi-agent coordination
# Experiment Framework Design


> **Version:** 1.0
> **Date:** 2026-01-31
> **Status:** Design Proposal
> **Task:** 019c024f-f126-7cb2-8434-b62a0313436e

---

## 1. Goals


The experiment framework enables systematic comparison of multi-agent coordination patterns by providing repeatable experiment execution, automated metric collection, and structured result reporting.

### What an Experiment Looks Like


An experiment answers a question like:

> "Does running task decomposition with the swarm workflow (pure pull) vs the hierarchical workflow (push) produce faster completion times, lower token cost, or better task distribution for the same workload?"

Concretely, an experiment:

1. **Defines a hypothesis** about a coordination pattern (e.g., "hybrid push/pull achieves higher throughput than pure push").
2. **Loads a task template** (e.g., `browser-parallel.json`) into a fresh database to create an identical workload across runs.
3. **Configures agents** with a specific workflow, role assignments, and agent count.
4. **Executes the workload** while the task-graph system automatically tracks state transitions, time, and cost.
5. **Collects metrics** from the database (wall-clock time, tokens, cost, blocking ratio, rework rate, etc.).
6. **Compares results** across experiment variants (e.g., swarm vs hierarchical vs hybrid) using the same template and agent count.

### Key Design Principles


- **Grounded in existing infrastructure.** The task-graph already tracks state transitions (`task_sequence`), metrics (`log_metrics`/`get_metrics`), templates (`experiments/templates/`), workflows, and export/import. The experiment framework orchestrates these existing capabilities rather than replacing them.
- **Declarative experiment definitions.** Experiments are YAML files (building on the existing `experiments/*.yaml` pattern) that declare what to run, not how to run it.
- **Isolation via database snapshotting.** Each experiment run uses a fresh database (or a `Replace`-mode import) so runs do not contaminate each other. The existing `ImportMode::Replace` and `ImportMode::Fresh` modes support this.
- **Automated metric extraction.** Metrics are computed from SQL queries against the task-graph database, reusing the `query` tool and `task_sequence` table that already exist.

---

## 2. Components


### 2.1 Experiment Definition Schema


Each experiment is a YAML file under `experiments/`. Three experiment definitions already exist in the codebase (`pure-pull.yaml`, `push-experiment.yaml`, `experiment-hybrid.yaml`). The framework formalizes and extends this existing pattern.

```yaml
# experiments/example-experiment.yaml

experiment:
  id: "swarm-vs-hierarchical-001"
  name: "Swarm vs Hierarchical"
  version: "1.0.0"
  created: "2026-01-31"
  hypothesis: >
    Pure pull (swarm) coordination achieves higher throughput than
    hierarchical (push) for independent tasks, but hierarchical
    produces lower rework rates due to lead oversight.

# What task set to use
template:
  source: "experiments/templates/browser-parallel.json"
  instantiate_options:
    reset_status: true
    extra_tags: ["experiment:swarm-vs-hierarchical-001"]

# Variants to compare (each is a separate run)
variants:
  - name: "swarm-4"
    workflow: swarm
    agents:
      count: 4
      config:
        tags: [worker, implementer, code]
        max_claims: 1

  - name: "hierarchical-4"
    workflow: hierarchical
    agents:
      count: 5  # 1 lead + 4 workers
      definitions:
        - id: lead
          tags: [lead, coordinator]
          workflow: hierarchical
        - id: "worker-{n}"
          count: 4
          tags: [worker, implementer, code]
          workflow: hierarchical

# Metrics to collect (references docs/METRICS.md categories)
metrics:
  primary:
    - wall_clock_duration_ms
    - total_cost_usd
    - tasks_per_hour
    - completion_rate_pct
  coordination:
    - blocking_ratio_pct
    - avg_queue_wait_ms
    - rework_rate_pct
  token_efficiency:
    - total_billable_tokens
    - tokens_per_completed_task
    - cache_hit_rate_pct

# Success criteria
success_criteria:
  minimum:
    completion_rate_pct: ">= 80"
  target:
    tasks_per_hour: "> baseline"

# Execution settings
execution:
  timeout_seconds: 7200
  poll_interval_seconds: 30
  output_dir: "experiments/results/{experiment.id}/{variant.name}"
```

**Relationship to existing files:** The three experiment YAMLs already in the repository (`pure-pull.yaml`, `push-experiment.yaml`, `experiment-hybrid.yaml`) follow a similar but ad-hoc structure. This schema formalizes the common elements across them.
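
The `success_criteria` strings (`">= 80"`, `"> baseline"`) need a small interpreter in the runner. The following is a minimal sketch of one way to evaluate them, not part of the existing codebase: the function names `check_criterion` and `evaluate` are hypothetical, and the grammar (one comparison operator followed by a number or the word `baseline`) is an assumption inferred from the examples above.

```python
import operator
import re

# Comparison operators allowed in criteria strings such as ">= 80".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}
# Two-character operators are listed first so ">=" is not read as ">".
_CRITERION = re.compile(r"^\s*(>=|<=|>|<)\s*(\S+)\s*$")


def check_criterion(expr, value, baseline=None):
    """Return True if `value` satisfies a criterion string like ">= 80".

    The right-hand side is either a number or the word "baseline", in
    which case the baseline variant's value for the same metric is used.
    """
    match = _CRITERION.match(expr)
    if match is None:
        raise ValueError(f"unrecognized criterion: {expr!r}")
    op, rhs = match.groups()
    if rhs == "baseline":
        if baseline is None:
            raise ValueError("criterion refers to baseline, but none was given")
        threshold = baseline
    else:
        threshold = float(rhs)
    return _OPS[op](value, threshold)


def evaluate(criteria, metrics, baseline_metrics=None):
    """Map each metric named in `criteria` to a pass/fail boolean."""
    baseline_metrics = baseline_metrics or {}
    return {
        name: check_criterion(expr, metrics[name], baseline_metrics.get(name))
        for name, expr in criteria.items()
    }
```

Keeping the grammar this narrow means a malformed criterion fails loudly at parse time instead of silently passing.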

### 2.2 Experiment Runner


A CLI tool (or script) that automates the experiment lifecycle:

```
experiment run <experiment.yaml> [--variant <name>] [--output <dir>]
experiment compare <result-dir-1> <result-dir-2> [--output <dir>]
experiment report <result-dir>
```

#### Run Lifecycle


```
1. SETUP
   |-- Parse experiment YAML
   |-- For each variant:
   |     |-- Create fresh database (ImportMode::Fresh or Replace)
   |     |-- Import template via instantiate_template()
   |     |     (uses existing src/db/template.rs machinery)
   |     |-- Apply extra_tags for experiment tracking
   |     |-- Record experiment metadata as an attachment on root task
   |
2. LAUNCH
   |-- Start task-graph-mcp server (existing binary)
   |-- For each agent definition:
   |     |-- Launch agent process (e.g., `claude --task "..."`)
   |     |-- Agent calls `connect(workflow=..., tags=...)` (existing tool)
   |     |-- Agents work autonomously per their workflow
   |
3. MONITOR
   |-- Poll database via `project_history` or `query` tool
   |-- Track: pending count, working count, completed count
   |-- Detect completion: all tasks in terminal state
   |-- Detect timeout: wall-clock exceeds timeout_seconds
   |-- Detect stale agents: heartbeat tracking (existing)
   |
4. COLLECT
   |-- Export database snapshot via export_tables() (existing)
   |-- Run metric queries against the database
   |-- Compute derived metrics (throughput, ratios, Gini)
   |-- Write results to output directory:
   |     |-- tasks.db          (SQLite database copy)
   |     |-- snapshot.json     (full Snapshot export)
   |     |-- metrics.json      (computed metrics)
   |     |-- summary.md        (human-readable report)
   |     |-- experiment.yaml   (copy of experiment definition)
   |
5. CLEANUP
   |-- Disconnect all agents (existing disconnect tool)
   |-- Stop task-graph server
```
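
The MONITOR step can be sketched as a polling loop over the `tasks` table. This is an illustrative sketch only: it reads the SQLite file directly rather than going through the `query` tool, the helper names are hypothetical, and the set of terminal states is an assumption (the real set depends on the workflow's state machine).

```python
import sqlite3
import time

# Assumed terminal states; the real set depends on the workflow's state machine.
TERMINAL_STATES = ("completed", "failed", "cancelled")


def count_by_status(conn):
    """Return {status: count} for the tasks table."""
    rows = conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status")
    return dict(rows.fetchall())


def wait_for_completion(conn, timeout_seconds, poll_interval_seconds,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll until every task is terminal ("completed") or time runs out ("timeout")."""
    deadline = clock() + timeout_seconds
    while True:
        counts = count_by_status(conn)
        open_tasks = sum(n for status, n in counts.items()
                         if status not in TERMINAL_STATES)
        # An empty table is treated as "not started yet", not as done.
        if counts and open_tasks == 0:
            return "completed"
        if clock() >= deadline:
            return "timeout"
        sleep(poll_interval_seconds)
```

The `clock` and `sleep` parameters exist so the loop can be tested without waiting; a real runner would pass `timeout_seconds` and `poll_interval_seconds` from the `execution` block.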

### 2.3 Metric Collection

Metrics are collected from existing data sources in the task-graph database. No new tables or columns are needed.

#### Existing Data Sources

| Source | What It Provides | Location |
|--------|-----------------|----------|
| `tasks` table | Status, cost_usd, metrics[0..7], points, timestamps | `src/db/tasks.rs` |
| `task_sequence` table | State transition history with timestamps | `src/db/state_transitions.rs` |
| `workers` table | Agent registrations, heartbeats, tags | `src/db/agents.rs` |
| `log_metrics` tool | Per-task cost and 8 integer metric slots | `src/tools/tracking.rs` |
| `get_metrics` tool | Aggregated metrics across tasks | `src/tools/tracking.rs` |
| `project_history` tool | Project-wide state transition stats | `src/tools/tracking.rs` |
| `task_history` tool | Per-task time-per-status breakdown | `src/tools/tracking.rs` |
| `export_tables()` | Full database snapshot for archival | `src/db/export.rs` |

#### Metric Queries

The metric collection layer runs SQL queries against the database using the existing `query` tool (read-only SQL). Queries are defined in the experiment YAML or referenced from `docs/METRICS.md`. Key metrics and their queries:

| Metric | Query Source |
|--------|-------------|
| `wall_clock_duration_ms` | `MAX(completed_at) - MIN(started_at) FROM tasks` |
| `total_cost_usd` | `SUM(cost_usd) FROM tasks` |
| `tasks_per_hour` | Derived: `completed_count / (wall_clock_ms / 3600000)` |
| `completion_rate_pct` | `100.0 * SUM(status='completed') / COUNT(*) FROM tasks` (the `100.0` avoids SQLite integer division) |
| `blocking_ratio_pct` | Share of tracked time spent in `pending`/`assigned`, computed from `task_sequence` |
| `rework_rate_pct` | Percentage of tasks with more than one `working` period in `task_sequence` |
| `total_billable_tokens` | `SUM(metric_0 + metric_1 + metric_3) FROM tasks` (per METRICS.md convention) |
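
To make the table concrete, here is a minimal sketch that runs the scalar queries against an in-memory SQLite database and then derives `tasks_per_hour`. The table definition is a stand-in, not the real schema: only the column names used in the queries above are included, timestamps are assumed to be integer milliseconds, and `collect_metrics` is a hypothetical helper.

```python
import sqlite3

QUERIES = {
    "wall_clock_duration_ms":
        "SELECT MAX(completed_at) - MIN(started_at) FROM tasks",
    "total_cost_usd":
        "SELECT SUM(cost_usd) FROM tasks",
    "completed_count":
        "SELECT COUNT(*) FROM tasks WHERE status = 'completed'",
    # 100.0 forces floating-point division in SQLite.
    "completion_rate_pct":
        "SELECT 100.0 * SUM(status = 'completed') / COUNT(*) FROM tasks",
}


def collect_metrics(conn):
    """Run each scalar query, then derive tasks_per_hour from the results."""
    metrics = {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}
    wall_ms = metrics["wall_clock_duration_ms"]
    # Guard against an empty table (NULL) or a zero-length run.
    metrics["tasks_per_hour"] = (
        metrics["completed_count"] / (wall_ms / 3_600_000) if wall_ms else None
    )
    return metrics


# Tiny stand-in for the tasks table (assumed columns, millisecond timestamps).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT, status TEXT, cost_usd REAL,"
             " started_at INTEGER, completed_at INTEGER)")
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?, ?)", [
    ("t1", "completed", 0.50, 0, 600_000),
    ("t2", "completed", 0.25, 300_000, 1_800_000),
    ("t3", "failed", 0.25, 600_000, None),
    ("t4", "completed", 1.00, 0, 1_200_000),
])
metrics = collect_metrics(conn)
```

With this sample data the run spans 30 minutes and completes three of four tasks, so `tasks_per_hour` is 6.0 and `completion_rate_pct` is 75.0.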

#### The metrics[0..7] Convention

The `log_metrics` tool provides 8 integer slots per task. The established convention from `docs/METRICS.md` and the existing experiment YAMLs is:

| Slot | Meaning |
|------|---------|
| `metric_0` | Input tokens |
| `metric_1` | Output tokens |
| `metric_2` | Cached tokens |
| `metric_3` | Thinking tokens |
| `metric_4` | Image tokens |
| `metric_5` | Audio tokens |
| `metric_6` | (available) |
| `metric_7` | (available) |
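
A runner script could encode this convention once instead of scattering slot numbers through its code. The sketch below is one possible encoding; `MetricSlot`, `billable_tokens`, and `accumulate` are hypothetical names, and the billable set (input, output, thinking) mirrors the `total_billable_tokens` query.

```python
from enum import IntEnum


class MetricSlot(IntEnum):
    """Slot indices for metrics[0..7], per the table above."""
    INPUT_TOKENS = 0
    OUTPUT_TOKENS = 1
    CACHED_TOKENS = 2
    THINKING_TOKENS = 3
    IMAGE_TOKENS = 4
    AUDIO_TOKENS = 5


# Slots counted as billable by the total_billable_tokens query.
BILLABLE_SLOTS = (
    MetricSlot.INPUT_TOKENS,
    MetricSlot.OUTPUT_TOKENS,
    MetricSlot.THINKING_TOKENS,
)


def billable_tokens(metrics):
    """Sum input, output, and thinking tokens from an 8-slot metrics list."""
    if len(metrics) != 8:
        raise ValueError("expected exactly 8 metric slots")
    return sum(metrics[slot] for slot in BILLABLE_SLOTS)


def accumulate(existing, reported):
    """Add a new report to the stored values, slot by slot."""
    return [old + new for old, new in zip(existing, reported)]
```

`accumulate` models the additive behavior described for `log_metrics`: each report is added to the stored values, not written over them.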

### 2.4 Baseline Comparison

Comparing experiment variants requires:

1. **Identical starting state.** Each variant imports the same template into a fresh database. The `instantiate_template()` function (in `src/db/template.rs`) already supports this with ID remapping and status reset.

2. **Aligned metric collection.** All variants collect the same metric set defined in the experiment YAML.

3. **Comparison report.** A report tool that loads `metrics.json` from each variant's output directory and produces:
   - Side-by-side metric table (variant name as columns)
   - Delta columns (absolute and percentage difference from baseline)
   - Color-coded direction indicators (green for improvements, red for regressions)
   - Statistical significance notes (for experiments with multiple runs per variant)

#### Comparison Output Example

```markdown
## Experiment: Swarm vs Hierarchical

| Metric                    | swarm-4     | hierarchical-4 | Delta     |
|---------------------------|-------------|-----------------|-----------|
| wall_clock_duration_ms    | 1,245,000   | 1,890,000       | -34.1%    |
| total_cost_usd            | $2.34       | $3.67           | -36.2%    |
| tasks_per_hour            | 38.5        | 25.3            | +52.2%    |
| completion_rate_pct       | 95.0%       | 100.0%          | -5.0pp    |
| blocking_ratio_pct        | 8.2%        | 22.1%           | -13.9pp   |
| rework_rate_pct           | 12.0%       | 3.0%            | +9.0pp    |
```
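
The Delta column in this example mixes two units: relative percent for raw quantities and percentage points (pp) for metrics that are already percentages, with `hierarchical-4` acting as the baseline. A minimal sketch of that rule follows. The `delta` function is hypothetical, and using the `_pct` suffix to choose the unit is an assumption based on the metric names in this document.

```python
def delta(metric, value, baseline):
    """Format the change of `value` against `baseline` for one metric.

    Metrics ending in `_pct` are already percentages, so their change is
    reported in percentage points (pp); all others as a relative percent.
    """
    if metric.endswith("_pct"):
        return f"{value - baseline:+.1f}pp"
    if baseline == 0:
        return "n/a"
    return f"{100.0 * (value - baseline) / baseline:+.1f}%"
```

For the rows above, `delta("tasks_per_hour", 38.5, 25.3)` yields `+52.2%` and `delta("rework_rate_pct", 12.0, 3.0)` yields `+9.0pp`.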

### 2.5 Result Reporting


Each experiment run produces a results directory:

```
experiments/results/swarm-vs-hierarchical-001/
  swarm-4/
    tasks.db              # SQLite database (complete state)
    snapshot.json         # Snapshot export (version-controllable)
    metrics.json          # Computed metrics
    summary.md            # Human-readable run summary
    experiment.yaml       # Experiment definition (frozen copy)
  hierarchical-4/
    tasks.db
    snapshot.json
    metrics.json
    summary.md
    experiment.yaml
  comparison.md           # Cross-variant comparison report
  comparison.json         # Machine-readable comparison data
```

The `snapshot.json` file uses the existing `Snapshot` format from `src/export/mod.rs`, which means results are importable back into any task-graph database for further analysis.
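
As a sketch of how a runner might lay out this directory, the snippet below expands the `output_dir` pattern from the experiment YAML and writes `metrics.json`. Both helper names are hypothetical, and expanding the `{experiment.id}` and `{variant.name}` placeholders with `str.format` is an assumption about how the pattern would be interpreted.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace


def resolve_output_dir(pattern, experiment_id, variant_name):
    """Expand {experiment.id} and {variant.name} placeholders in output_dir."""
    return Path(pattern.format(
        experiment=SimpleNamespace(id=experiment_id),
        variant=SimpleNamespace(name=variant_name),
    ))


def write_metrics(root, pattern, experiment_id, variant_name, metrics):
    """Write metrics.json into the variant's results directory under `root`."""
    out_dir = Path(root) / resolve_output_dir(pattern, experiment_id, variant_name)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "metrics.json"
    # Sorted keys keep the file stable across runs, which helps diffing.
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return path
```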

---

## 3. Integration with Existing Infrastructure


### 3.1 Templates


The experiment framework uses templates from `experiments/templates/` and `task-graph/templates/`. Two templates already exist:

- **`browser-parallel.json`**: 22 tasks in a parallel (fan-out/fan-in) structure. Designed for swarm workflow testing with maximum concurrency.
- **`browser-phased.json`**: 40 tasks organized by component, each with explore/design/implement/test subtasks. Designed for relay workflow testing with phase-based handoffs and `needed_tags` for role matching.

Templates are imported via `Database::instantiate_template()` (`src/db/template.rs`) which handles:
- ID remapping (fresh petname IDs per run)
- Status reset to `pending`
- Timestamp reset to current time
- Optional parent task attachment
- Optional extra tags (useful for tagging all tasks with the experiment ID)
- Merge-mode import (no conflicts with existing data)
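
The first steps in that list can be illustrated with a simplified Python model. This is not the Rust implementation in `src/db/template.rs`: the task shape (`id`, `parent_id`, `tags`, `status`) is assumed for illustration, random hex IDs stand in for petname IDs, and timestamp reset and merge handling are left out.

```python
import uuid


def instantiate(template_tasks, extra_tags=(), new_id=lambda: uuid.uuid4().hex[:8]):
    """Return fresh copies of template tasks with remapped IDs.

    Each task dict is assumed to have "id", "parent_id", and "tags" keys.
    """
    # Build the full old-to-new map first so parents resolve in any order.
    id_map = {task["id"]: new_id() for task in template_tasks}
    fresh = []
    for task in template_tasks:
        fresh.append({
            **task,
            "id": id_map[task["id"]],
            # Parent references must follow the remapping too.
            "parent_id": id_map.get(task["parent_id"]),
            "status": "pending",
            "tags": sorted(set(task["tags"]) | set(extra_tags)),
        })
    return fresh
```

Because every run gets new IDs, two runs of the same template never collide, while the parent/child structure stays identical across variants.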

### 3.2 Workflows


Experiments reference workflows by name via the `connect` tool's `workflow` parameter. The workflow system (`src/config/workflows.rs`) supports:

- **Named workflows** (e.g., `swarm`, `hierarchical`, `relay`, `solo`) loaded from `workflow-*.yaml` files
- **Role definitions** with tag matching, max_claims, and can_assign permissions
- **Phase workflows** defining per-phase state machines
- **Overlays** that modify workflow behavior dynamically (e.g., `git`, `troubleshooting`)
- **Prompts** delivered to agents based on their matched role

Experiments specify which workflow each variant uses. The runner passes the workflow name to each agent, which includes it in its `connect()` call.

### 3.3 Metrics Tracking


The existing metrics system requires no changes:

- **`log_metrics` tool** (`src/tools/tracking.rs`): Agents call this to report cost_usd and up to 8 integer metric values per task. Values are aggregated (added to existing).
- **`get_metrics` tool** (`src/tools/tracking.rs`): Retrieves aggregated metrics for one or more tasks.
- **`task_history` tool** (`src/tools/tracking.rs`): Returns per-task state transition history with time-per-status and time-per-agent breakdowns.
- **`project_history` tool** (`src/tools/tracking.rs`): Returns project-wide transition statistics with date/time range filters.
- **`time_actual_ms`** on tasks: Automatically accumulated when tasks exit timed states (e.g., `working`).

### 3.4 Export/Import


The experiment framework uses the existing export/import pipeline:

- **`Database::export_tables()`** (`src/db/export.rs`): Exports all project data tables (tasks, dependencies, attachments, tags, task_sequence) in deterministic order. Excludes ephemeral tables (workers, file_locks).
- **`Database::import_snapshot()`** (`src/db/import.rs`): Imports a `Snapshot` with support for `Fresh`, `Replace`, and `Merge` modes. Handles foreign key ordering and FTS rebuild.
- **`Snapshot` struct** (`src/export/mod.rs`): The portable JSON format for database state. Includes schema_version, export metadata, and all table data.

For experiments, the workflow is:
1. **Setup**: `import_snapshot` with `ImportMode::Fresh` or `ImportMode::Replace` to load the template.
2. **Collect**: `export_tables` after experiment completion to capture the full result state.
3. **Archive**: Save the `Snapshot` JSON alongside the SQLite database for both machine and human consumption.

### 3.5 Agent Connection


Agents connect via the `connect` tool (`src/tools/agents.rs`) which:
- Registers the worker with ID, tags, and workflow name
- Returns workflow configuration, role information, and role-specific prompts
- Supports overlay application for dynamic behavior modification

The experiment runner launches agent processes that call `connect()` with the workflow and tags specified in the experiment variant.

### 3.6 Query Tool


The `query` tool (`src/tools/query.rs`) provides read-only SQL access to the database. The metric collection phase uses this to run the SQL queries defined in the experiment's metric section, extracting computed values from the raw data.

---

## 4. CLI Commands / Tools


### 4.1 New MCP Tools (Future)


These tools would extend the existing tool set in `src/tools/`:

| Tool | Description |
|------|-------------|
| `experiment_run` | Start an experiment from a YAML definition. Sets up database, imports template, returns experiment ID. |
| `experiment_status` | Check experiment progress (pending/working/completed/failed counts, elapsed time, estimated completion). |
| `experiment_metrics` | Compute and return metrics for a completed experiment run. |
| `experiment_compare` | Compare metrics across two or more experiment result directories. |

### 4.2 CLI Script Commands


Until these tools are built into the server, a Python runner script (`scripts/run_experiment.py`, planned for Phase 1) would provide the same functionality. The existing experiment YAMLs already reference this script:

```bash
# Run a single experiment variant
python scripts/run_experiment.py \
  --config experiments/pure-pull.yaml \
  --variant pull-4 \
  --output experiments/results/pure-pull/pull-4

# Wait for completion and export
python scripts/run_experiment.py \
  --wait --poll-interval 30 --timeout 7200 \
  --output experiments/results/pure-pull/pull-4

# Compare results across variants
python scripts/compare_experiments.py \
  experiments/results/pure-pull/pull-4/tasks.db \
  experiments/results/push/push-4/tasks.db \
  experiments/results/hybrid/hybrid-4/tasks.db \
  --labels "pull,push,hybrid" \
  --output experiments/results/comparison
```

### 4.3 Dashboard Integration


The existing web dashboard (`src/dashboard/`) includes a metrics page (`templates/metrics.html`). Experiment results could be surfaced there by:
- Adding an experiment selector dropdown
- Loading metrics.json from the results directory
- Rendering comparison charts

This is a future enhancement, not part of the initial framework.

---

## 5. Example Experiment Definition


This is a complete experiment definition in the proposed schema. It relies only on existing infrastructure: an existing template, existing workflows, and queries against existing tables.

```yaml
# experiments/decomposition-strategy-001.yaml
#
# Question: Does a coordinator that decomposes tasks into smaller subtasks
# before workers claim them produce better throughput and lower cost than
# letting workers claim coarse tasks and decompose themselves?

experiment:
  id: "decomposition-strategy-001"
  name: "Pre-decomposition vs Self-decomposition"
  version: "1.0.0"
  created: "2026-01-31"
  hypothesis: >
    Pre-decomposing tasks via a coordinator reduces total token cost
    (workers spend less time on planning) but increases wall-clock time
    (coordinator is a serial bottleneck). Self-decomposition by workers
    achieves higher throughput but at higher per-task token cost.

template:
  source: "experiments/templates/browser-parallel.json"
  # browser-parallel has coarse leaf tasks suitable for further decomposition
  instantiate_options:
    reset_status: true
    extra_tags: ["experiment:decomposition-strategy-001"]

variants:
  # Variant A: Coordinator pre-decomposes, workers execute atomic tasks
  - name: "pre-decomposed"
    description: >
      Lead decomposes all tasks into fine-grained subtasks using
      create_tree() before workers start. Workers only execute.
    workflow: hierarchical
    agents:
      count: 5
      definitions:
        - id: lead
          tags: [lead, coordinator, designer]
          prompt: >
            You are the lead. Before workers start, decompose every leaf
            task into 2-4 subtasks using create_tree(). Once decomposition
            is complete, workers will pull subtasks. Monitor progress.
        - id: "worker-{n}"
          count: 4
          tags: [worker, implementer, code]
          prompt: >
            Wait for the lead to finish decomposition (subtasks will
            appear with ready=true). Then pull and execute subtasks.

  # Variant B: Workers claim coarse tasks and self-decompose
  - name: "self-decomposed"
    description: >
      No coordinator. Workers claim coarse tasks, decompose into
      subtasks themselves, and execute.
    workflow: swarm
    agents:
      count: 4
      definitions:
        - id: "swarm-{n}"
          count: 4
          tags: [worker, implementer, code]
          prompt: >
            Claim a task. If it is large (>5 points), decompose it into
            subtasks via create_tree(), then work the subtasks. If small,
            execute directly.

  # Variant C: Hybrid -- coordinator decomposes top-level only
  - name: "hybrid-decomposed"
    description: >
      Lead decomposes root into top-level components (push), workers
      decompose their assigned component further (self-decompose within scope).
    workflow: hierarchical
    agents:
      count: 5
      definitions:
        - id: lead
          tags: [lead, coordinator]
          prompt: >
            Decompose the root task into top-level components only.
            Push each component to a worker. Workers handle further
            decomposition within their assigned scope.
        - id: "worker-{n}"
          count: 4
          tags: [worker, implementer, code]
          prompt: >
            Wait for assignment from lead. Once assigned, decompose
            your component into subtasks and execute them.

metrics:
  primary:
    - name: wall_clock_duration_ms
      query: "SELECT MAX(completed_at) - MIN(started_at) FROM tasks WHERE status = 'completed'"
      compare: lower_is_better

    - name: total_cost_usd
      query: "SELECT SUM(cost_usd) FROM tasks WHERE deleted_at IS NULL"
      compare: lower_is_better

    - name: tasks_per_hour
      derived: "completed_count / (wall_clock_duration_ms / 3600000)"
      compare: higher_is_better

    - name: completion_rate_pct
      query: >
        SELECT 100.0 * SUM(CASE WHEN status='completed' THEN 1 ELSE 0 END) / COUNT(*)
        FROM tasks WHERE deleted_at IS NULL
      compare: higher_is_better

  coordination:
    - name: decomposition_time_ms
      description: "Time the lead spends decomposing before workers start"
      query: >
        SELECT MAX(ts.timestamp) - MIN(ts.timestamp)
        FROM task_sequence ts
        JOIN tasks t ON ts.task_id = t.id
        WHERE ts.worker_id IN ('lead') AND ts.status = 'working'
      compare: lower_is_better

    - name: avg_subtask_count
      description: "Average number of subtasks per original task"
      query: >
        SELECT AVG(child_count) FROM (
          SELECT d.from_task_id, COUNT(*) as child_count
          FROM dependencies d WHERE d.dep_type = 'contains'
          GROUP BY d.from_task_id
        )
      compare: neutral

    - name: blocking_ratio_pct
      description: "Fraction of tracked time in pending/assigned vs working"
      query: >
        SELECT 100.0 * SUM(CASE WHEN status IN ('pending','assigned') THEN
          COALESCE(end_timestamp, CAST(strftime('%s','now') AS INTEGER)*1000) - timestamp
        ELSE 0 END) / SUM(
          COALESCE(end_timestamp, CAST(strftime('%s','now') AS INTEGER)*1000) - timestamp
        ) FROM task_sequence
      compare: lower_is_better

  quality:
    - name: rework_rate_pct
      query: >
        SELECT 100.0 * COUNT(CASE WHEN cnt > 1 THEN 1 END) / COUNT(*)
        FROM (SELECT task_id, COUNT(*) as cnt FROM task_sequence
              WHERE status = 'working' GROUP BY task_id)
      compare: lower_is_better

  efficiency:
    - name: tokens_per_completed_task
      query: >
        SELECT AVG(metric_0 + metric_1 + metric_3)
        FROM tasks WHERE status = 'completed' AND deleted_at IS NULL
      compare: lower_is_better

    - name: cost_per_point
      query: >
        SELECT SUM(cost_usd) / NULLIF(SUM(points), 0)
        FROM tasks WHERE status = 'completed' AND deleted_at IS NULL
      compare: lower_is_better

comparison:
  baseline: "self-decomposed"
  hypotheses:
    throughput: >
      Pre-decomposed should have lower wall_clock if decomposition is fast,
      but the serial decomposition phase may dominate. Hybrid should balance.
    cost: >
      Pre-decomposed should have lower tokens_per_completed_task (workers
      do less planning). Self-decomposed should have higher per-task cost
      but may have lower total cost if throughput is significantly better.
    quality: >
      Pre-decomposed should have lower rework_rate (lead catches issues
      during decomposition). Self-decomposed may have more rework as
      workers discover issues late.

success_criteria:
  minimum:
    completion_rate_pct: ">= 80"
    wall_clock_duration_ms: "< 7200000"  # 2 hours
  target:
    tasks_per_hour: "> 20"
    rework_rate_pct: "< 10"

execution:
  timeout_seconds: 7200
  poll_interval_seconds: 30
  output_dir: "experiments/results/{experiment.id}/{variant.name}"
```
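
The two `task_sequence` queries are easier to check against a small, hand-computable dataset. The sketch below runs them on an in-memory SQLite table. The table definition is a stand-in with only the columns the queries reference, timestamps are assumed to be integer milliseconds, and the blocking-ratio query is simplified to closed intervals (no `COALESCE` to the current time) so the result is deterministic.

```python
import sqlite3

REWORK_RATE_SQL = """
SELECT 100.0 * COUNT(CASE WHEN cnt > 1 THEN 1 END) / COUNT(*)
FROM (SELECT task_id, COUNT(*) AS cnt FROM task_sequence
      WHERE status = 'working' GROUP BY task_id)
"""

# Closed intervals only, so the result does not depend on the current time.
BLOCKING_RATIO_SQL = """
SELECT 100.0 * SUM(CASE WHEN status IN ('pending', 'assigned')
                        THEN end_timestamp - timestamp ELSE 0 END)
             / SUM(end_timestamp - timestamp)
FROM task_sequence WHERE end_timestamp IS NOT NULL
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_sequence (task_id TEXT, status TEXT,"
             " timestamp INTEGER, end_timestamp INTEGER)")
conn.executemany("INSERT INTO task_sequence VALUES (?, ?, ?, ?)", [
    ("t1", "pending", 0, 1_000),
    ("t1", "working", 1_000, 4_000),
    ("t2", "pending", 0, 2_000),
    ("t2", "working", 2_000, 3_000),
    ("t2", "pending", 3_000, 4_000),   # sent back to the queue: rework
    ("t2", "working", 4_000, 6_000),
])
rework_rate_pct = conn.execute(REWORK_RATE_SQL).fetchone()[0]
blocking_ratio_pct = conn.execute(BLOCKING_RATIO_SQL).fetchone()[0]
```

Here one of two tasks has two `working` periods, giving a rework rate of 50.0, and 4,000 of 10,000 tracked milliseconds are spent waiting, giving a blocking ratio of 40.0.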

---

## 6. Implementation Roadmap


This is a design document. Implementation should proceed in phases:

### Phase 1: Runner Script (Python)

- Implement `scripts/run_experiment.py` that reads experiment YAML
- Template import via CLI (`task-graph import --file <template> --mode fresh`)
- Agent launch via subprocess
- Polling loop for completion detection
- Metric query execution and JSON export
- Comparison report generation

### Phase 2: MCP Tool Integration

- Add `experiment_run` tool to `src/tools/` that orchestrates setup and monitoring
- Add `experiment_metrics` tool for in-server metric computation
- Add `experiment_compare` tool for cross-run comparison

### Phase 3: Dashboard and Visualization

- Extend the web dashboard with experiment result viewing
- Add chart rendering for metric comparisons
- Timeline visualization for agent activity

### Phase 4: Statistical Rigor

- Support multiple runs per variant (N=3+) for statistical significance
- Confidence interval computation
- Automated hypothesis testing (two-sample t-test on key metrics)
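
As a rough sketch of the hypothesis-testing step, the function below computes the statistic and degrees of freedom for Welch's two-sample t-test using only the standard library. Welch's variant is an assumption: this document does not say which t-test is intended, and Welch's is chosen here because it does not require equal variances. Converting the result to a p-value needs a t-distribution, which the standard library lacks, so that step is omitted.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    if len(sample_a) < 2 or len(sample_b) < 2:
        raise ValueError("each sample needs at least two runs")
    # Squared standard error of each sample mean.
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    if var_a + var_b == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, df
```

Each sample would hold one metric value per run of a variant, for example three `tasks_per_hour` readings from three runs.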

---

## 7. Open Questions


1. **Agent launch mechanism.** The framework needs to start multiple agent processes. Should this use `claude --task` subprocesses, a dedicated agent launcher, or assume agents are started externally?

2. **Database isolation.** Should each variant get its own SQLite file (simplest, most isolated) or its own server instance? Separate files are simpler for archival and comparison.

3. **Template parameterization.** Should templates support variable substitution (e.g., `{agent_count}` in task descriptions) for experiments that vary task content across runs?

4. **Live monitoring.** The existing dashboard shows live task state. Should the experiment framework integrate with it, or is post-hoc analysis sufficient for the initial version?

5. **Metric slot convention.** The `metrics[0..7]` slots have an informal convention (documented in `docs/METRICS.md`). Should the experiment framework enforce this convention or allow per-experiment custom slot assignments?

---

## Appendix: Existing Infrastructure Summary


| Component | Location | Role in Experiments |
|-----------|----------|-------------------|
| Task templates | `experiments/templates/*.json` | Define workload for experiment runs |
| Experiment definitions | `experiments/*.yaml` | Declare experiment parameters (3 exist) |
| Template instantiation | `src/db/template.rs` | Import templates with ID remapping and status reset |
| State transition tracking | `src/db/state_transitions.rs` | Automatic time tracking per state per task |
| Metrics logging | `src/tools/tracking.rs` (`log_metrics`) | Per-task cost and token tracking (8 slots) |
| Metrics retrieval | `src/tools/tracking.rs` (`get_metrics`) | Aggregated metric queries |
| Project history | `src/tools/tracking.rs` (`project_history`) | Project-wide transition statistics |
| Database export | `src/db/export.rs` | Snapshot creation for archival |
| Database import | `src/db/import.rs` | Fresh/Replace/Merge import modes |
| Snapshot format | `src/export/mod.rs` | Portable JSON format for database state |
| Workflow system | `src/config/workflows.rs` | Named workflows, roles, overlays |
| Agent connection | `src/tools/agents.rs` (`connect`) | Workflow-aware agent registration |
| Read-only SQL | `src/tools/query.rs` | Metric query execution |
| Web dashboard | `src/dashboard/` | Live monitoring (future integration) |
| Metrics documentation | `docs/METRICS.md` | Metric definitions and SQL examples |
| Workflow topologies | `docs/WORKFLOW_TOPOLOGIES.md` | Topology dimension documentation |