holon 0.14.1

A headless, event-driven runtime for long-lived agents
Documentation
# Benchmark Results

This document summarizes the implemented benchmark waves:

- baseline: `baseline-preprompt`
- prompt architecture first pass: `prompt-v1`
- prompt architecture with finishing-contract fix: `prompt-v2`
- expanded targeted gap scans:
  - `gap-followup-v1`
  - `gap-multifile-v2`

**Note:** For structural refactors (SVS-301 through SVS-304), see
`docs/benchmark-guardrails.md` for the minimal guardrail benchmark set that
must stay green.

The initial corpus was intentionally small:

- `analysis-runtime-architecture`
- `fix-greeting-preserves-case`

The first expansion added two more targeted tasks:

- `followup-greeting-context`
- `fix-multi-file-config-merger`

The next expansion added four broader tasks:

- `failed-verification-retry`
- `followup-after-multifile-fix`
- `no-change-needed-analysis`
- `holon-project-roadmap-audit`

Each task was run once per runner in this first pass. That is enough to compare
directionally, but not enough to claim high statistical confidence.

## Setup

Compared runners:

- `HolonRunner`
- `ClaudeSdkRunner`

Shared constraints:

- same local fixture workspace
- same model endpoint
- same auth/base URL source
- no internet-dependent tasks
- no MCP, WebFetch, AskUserQuestion, or LSP

## Baseline

`baseline-preprompt` summary:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| analysis-runtime-architecture | Holon | yes | 7.4s | 6 | concise, grounded analysis |
| analysis-runtime-architecture | Claude SDK | yes | 12.4s | 7 | grounded analysis |
| fix-greeting-preserves-case | Holon | yes | 5.8s | 8 | fixed bug and verified |
| fix-greeting-preserves-case | Claude SDK | yes | 17.6s | 8 | fixed bug and verified |

Initial takeaway:

- On this corpus, Holon was not obviously behind Claude SDK even before the
  prompt changes.
- Holon matched task success and was materially faster on the coding task.

## Prompt-V1

What changed:

- prompt assembly moved into explicit sections and modes
- dynamic context was separated from stable instructions
- tool guidance sections were added
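
As a rough illustration of the "explicit sections and modes" idea, here is a
minimal Rust sketch. It is not the code in `src/prompt.rs`: `Section`,
`assemble_prompt`, the `stable` flag, and the `Analysis`/`Coding` variants are
hypothetical names chosen for this example (only `PromptMode::Subagent` is
named elsewhere in this document).

```rust
/// Prompt mode. Variant names other than `Subagent` are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PromptMode {
    Analysis,
    Coding,
    Subagent,
}

/// One named prompt section (hypothetical shape).
/// `stable` sections hold fixed instructions; the rest hold per-turn context.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Section {
    pub name: &'static str,
    pub body: String,
    pub stable: bool,
}

/// Assemble a prompt with stable sections first and dynamic sections last,
/// each under its own header so the result stays inspectable.
pub fn assemble_prompt(mode: PromptMode, sections: &[Section]) -> String {
    let mut out = format!("[mode: {:?}]\n", mode);
    // Two passes: stable instructions, then dynamic context.
    for want_stable in [true, false] {
        for s in sections.iter().filter(|s| s.stable == want_stable) {
            out.push_str(&format!("## {}\n{}\n", s.name, s.body));
        }
    }
    out
}
```

Keeping stable text ahead of dynamic context means two prompts for the same
mode differ only in their tail, which makes prompt regressions easier to diff.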

Observed result:

- task success stayed at 100%
- analysis output became richer and more code-grounded
- but analysis-mode tool usage increased
- a regression appeared in the coding task:
  - Holon sometimes ended with `Completed.` after sleeping
  - the task still passed verification, but the user-facing result degraded

Takeaway:

- the architecture change itself was good
- the finishing contract was incomplete

## Prompt-V2

What changed from `prompt-v1`:

- added an explicit finishing contract:
  - provide the user-facing summary before ending the turn
  - do not call `Sleep` as the only final content when a summary is still owed
- added a generic reminder to avoid redundant tool calls
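
The finishing contract can also be checked mechanically after a turn. The
sketch below is one possible check, assuming a hypothetical `TurnEnd` record;
none of these field names come from the Holon source.

```rust
/// How a turn ended. All field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TurnEnd {
    /// User-facing text emitted before the turn ended (may be empty).
    pub final_text: String,
    /// Whether the last action of the turn was a `Sleep` tool call.
    pub ended_with_sleep: bool,
    /// Whether the turn did work that still needs a user-facing summary.
    pub summary_owed: bool,
}

/// Returns true when a summary is owed but the turn ends on `Sleep`
/// with no real summary text.
pub fn violates_finishing_contract(turn: &TurnEnd) -> bool {
    let text = turn.final_text.trim();
    // Empty text or the bare placeholder `Completed.` does not count as a summary.
    let has_summary = !text.is_empty() && text != "Completed.";
    turn.summary_owed && turn.ended_with_sleep && !has_summary
}
```

A benchmark harness could use a check like this to flag the `Completed.`
regression even when verification passes.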

`prompt-v2` summary:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| analysis-runtime-architecture | Holon | yes | 11.7s | 12 | richest Holon analysis, but more tool-hungry |
| analysis-runtime-architecture | Claude SDK | yes | 12.5s | 7 | shorter analysis output |
| fix-greeting-preserves-case | Holon | yes | 4.8s | 8 | proper final summary restored |
| fix-greeting-preserves-case | Claude SDK | yes | 19.2s | 8 | slower but successful |

## Final Comparison

Comparing `baseline-preprompt` to `prompt-v2` for Holon:

- `analysis-runtime-architecture`
  - success: unchanged
  - latency: worse
  - tool calls: worse
  - output richness: better
- `fix-greeting-preserves-case`
  - success: unchanged
  - latency: slightly better
  - tool calls: unchanged
  - final result quality: preserved after the v2 fix

## Conclusion

The final conclusion from this first wave is:

- The prompt architecture refactor is worth keeping.
- The main benefit is not higher success rate on this small corpus.
- The main benefits are:
  - inspectability
  - cleaner abstraction boundaries
  - benchmarkability
  - easier diagnosis of prompt regressions

Behaviorally:

- Holon already matched Claude SDK on the two benchmark tasks.
- On this corpus, Holon remained faster than Claude SDK on the coding task even
  after prompt changes.
- The new prompt system improved output quality control, especially around
  finishing and user-facing summaries.
- The new analysis prompt is currently more tool-hungry than the baseline.

So the product decision is:

- keep `prompt-v2`
- do not claim prompt quality improved raw task completion
- next prompt work should focus on reducing analysis-mode over-exploration
  rather than adding more global instructions

## Next Recommended Step

The next step should not be another broad prompt rewrite.

The next highest-leverage step is:

- add a slightly larger benchmark corpus
- especially:
  - one more multi-file coding task
  - one follow-up/context-retention task
  - one project-audit task with stricter grounding criteria

Then tune analysis-mode tool selection against that broader corpus.

## Expanded Gap Scan

After the initial prompt benchmark wave, the corpus was expanded to expose two
more specific failure modes:

- multi-turn follow-up / context retention
- multi-file repair under a truly failing fixture

One invalid intermediate result is worth calling out explicitly:

- an earlier `config-merger-bug` fixture accidentally drifted into a passing
  state
- any results captured against that passing version should be treated as
  invalid
- the fixture was then corrected back to a genuinely failing shallow-merge bug
  before recording `gap-multifile-v2`

### Follow-Up Context Retention

`gap-followup-v1` summary:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| followup-greeting-context | Holon | yes | 15.0s | 9 | fixed the bug, then answered the follow-up correctly |
| followup-greeting-context | Claude SDK | no | 45.7s | 20 | answered with the wrong workspace/file context and failed verification |

Important observed behavior:

- Holon changed `src/greeting.js`, preserved context across turns, and answered
  the follow-up with the correct file, root cause, and verification command.
- Claude SDK produced a clearly wrong final answer:
  - referenced `benchmark/fixtures/config-merger-bug/src/merge.js`
  - claimed `node test.js` passed
  - but the actual `followup-greeting-context` verification still failed with
    the original `Hello, alice!` assertion

This is the strongest clean gap found so far.

The gap is not subtle:

- Holon succeeded
- Claude SDK failed
- Holon used fewer tools
- Holon finished materially faster

### Multi-File Repair

`gap-multifile-v2` summary:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| fix-multi-file-config-merger | Holon | yes | 44.1s | 20 | fixed `src/merge.js`, verified, but final brief collapsed to `Completed.` |
| fix-multi-file-config-merger | Claude SDK | no | n/a | 0 recorded result turns | hit `maxTurns=12` and left a partial fix behind |

Important observed behavior:

- Holon repaired the shallow-merge bug by changing `src/merge.js`, re-ran
  `node test.js`, and reached a passing verification state.
- Claude SDK also moved toward the right fix and edited `src/merge.js`, but did
  not converge before hitting:
  - `Claude Code returned an error result: Reached maximum number of turns (12)`
- Claude SDK left the workspace in a partially modified but still failing state.

This task surfaced a different gap than the follow-up test:

- the main issue was not wrong context recall
- the issue was convergence under a bounded turn budget
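
To make "convergence under a bounded turn budget" concrete, here is a minimal
Rust sketch of a budgeted loop. `run_with_turn_budget` and `RunOutcome` are
hypothetical names; the real harness and the Claude SDK adapter are not shown
in this document.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RunOutcome {
    /// Verification passed after this many turns.
    Converged { turns_used: u32 },
    /// The budget ran out first; the workspace may be left partially modified.
    MaxTurnsReached { max_turns: u32 },
}

/// Drive `step` until it reports success or `max_turns` is exhausted.
/// `step` receives the 1-based turn number and returns true once
/// verification passes.
pub fn run_with_turn_budget<F>(max_turns: u32, mut step: F) -> RunOutcome
where
    F: FnMut(u32) -> bool,
{
    for turn in 1..=max_turns {
        if step(turn) {
            return RunOutcome::Converged { turns_used: turn };
        }
    }
    RunOutcome::MaxTurnsReached { max_turns }
}
```

Under this framing, a runner that needs more turns than the budget allows is
reported as `MaxTurnsReached` even if every individual edit moved toward the
right fix.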

### What The Expanded Scan Changed

The initial benchmark wave suggested:

- Holon and Claude SDK were roughly tied on success rate
- Holon was often faster on the small local tasks

The expanded scan changes that conclusion in a meaningful way:

- Holon now has a demonstrated advantage on multi-turn follow-up handling
- Holon also has a demonstrated advantage on convergence for the current
  multi-file repair fixture
- Claude SDK remains a useful baseline, but it is no longer accurate to say the
  two systems are simply "roughly tied" on the current corpus

The more precise current conclusion is:

- on simple single-turn tasks, the two systems are close
- on the expanded tasks, Holon is ahead in practical task completion
- Holon still has an output-quality problem on some longer coding tasks, because
  it can finish with a weak final brief like `Completed.`

## Updated Conclusion

The benchmark conclusion should now be stated as:

- keep the `prompt-v2` architecture
- keep using Claude SDK as the comparison baseline
- treat multi-turn follow-up handling as a current Holon strength
- treat final result delivery on longer coding tasks as a current Holon weakness
- treat Claude SDK turn-budget exhaustion as a real benchmarked limitation in
  this harness

## Post-Benchmark Refinement Phase

After the earlier waves, the next refinement phase ran through `PB1-PB5` in
`docs/post-benchmark-roadmap.md`.

### PB1: Analysis Capability

Result:

- analysis mode in `src/prompt.rs` was strengthened to emphasize:
  - current state
  - concrete findings
  - prioritized recommendations
- roadmap-audit output became more grounded and less likely to repeat already
  completed work

Primary validation:

- `pb1-roadmap-audit-v1`

### PB2: Comparison Metrics

Result:

- `benchmark/run.mjs` now captures richer runner metrics:
  - `read_ops`
  - `search_ops`
  - `list_ops`
  - `unique_files_read`
  - `unique_search_queries`
  - `bytes_read`
  - `search_to_read_chains`
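
For readers who want a feel for how these counters relate, the following Rust
sketch derives them from a list of tool events. It is not the logic in
`benchmark/run.mjs`: the `ToolEvent` shape is an assumption, and
`search_to_read_chains` is interpreted here as "reads that immediately follow
a search", which may differ from the harness definition.

```rust
use std::collections::HashSet;

/// A simplified tool event (assumed shape, for illustration only).
#[derive(Debug, Clone)]
pub enum ToolEvent {
    Read { path: String, bytes: u64 },
    Search { query: String },
    List,
    Other,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct RunnerMetrics {
    pub read_ops: u32,
    pub search_ops: u32,
    pub list_ops: u32,
    pub unique_files_read: usize,
    pub unique_search_queries: usize,
    pub bytes_read: u64,
    /// Assumed meaning: reads that immediately follow a search.
    pub search_to_read_chains: u32,
}

/// Fold a sequence of tool events into the counters above.
pub fn collect_metrics(events: &[ToolEvent]) -> RunnerMetrics {
    let mut m = RunnerMetrics::default();
    let mut files = HashSet::new();
    let mut queries = HashSet::new();
    let mut prev_was_search = false;
    for ev in events {
        match ev {
            ToolEvent::Read { path, bytes } => {
                m.read_ops += 1;
                m.bytes_read += *bytes;
                files.insert(path.as_str());
                if prev_was_search {
                    m.search_to_read_chains += 1;
                }
            }
            ToolEvent::Search { query } => {
                m.search_ops += 1;
                queries.insert(query.as_str());
            }
            ToolEvent::List => m.list_ops += 1,
            ToolEvent::Other => {}
        }
        prev_was_search = matches!(ev, ToolEvent::Search { .. });
    }
    m.unique_files_read = files.len();
    m.unique_search_queries = queries.len();
    m
}
```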

Primary validation:

- `pb2-metrics-roadmap-audit-v2`

The clean comparison run showed:

| Metric | Holon | Claude SDK |
|---|---:|---:|
| Success | yes | yes |
| Duration | 32.3s | 40.8s |
| Tool calls | 26 | 29 |
| Read ops | 19 | 19 |
| Search ops | 2 | 5 |
| List ops | 4 | 5 |
| Unique files read | 18 | 17 |
| Unique search queries | 2 | 10 |
| Bytes read | 204,757 | 75,010 |

Interpretation:

- Holon does not obviously read more files than Claude SDK
- the more meaningful difference is:
  - Claude SDK does more discovery-style search/list work
  - Holon reads larger chunks once it commits to evidence
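
The "larger chunks" reading can be made explicit by dividing bytes read by
read ops. The helper below is a small illustrative function, not part of the
harness.

```rust
/// Average bytes per read operation; `None` when there were no reads.
pub fn bytes_per_read(bytes_read: u64, read_ops: u32) -> Option<f64> {
    if read_ops == 0 {
        None
    } else {
        Some(bytes_read as f64 / f64::from(read_ops))
    }
}
```

With the figures from the table, Holon averages roughly 10,777 bytes per read
(204,757 / 19) and Claude SDK roughly 3,948 (75,010 / 19), so the two runners
issue the same number of reads but Holon pulls about 2.7 times as much data
per read.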

### PB3: Analysis-Oriented Tooling

Result:

- added:
  - `TodoWrite`
  - `TaskList`
  - `TaskGet`
  - `TaskStop`
- todo snapshots now persist in storage and enter context construction
- runtime now supports cancelling running background tasks

Primary validation:

- `cargo test`
- `tests/runtime_flow.rs`

### PB4: Tool Surface Comparison

Result:

- comparison findings were captured in:
  - `docs/tool-surface-comparison.md`

Main conclusion:

- do not label Holon's current analysis behavior as a simple over-reading bug
- the current gap is better understood as a mix of:
  - tool-surface differences
  - read granularity
  - search/discovery strategy

### PB5: Final Delivery And Follow-Up Quality

Result:

- reporting guidance in `src/prompt.rs` was tightened again
- analysis mode now prefers a concise structured report
- roadmap-audit snapshot was updated to include
  `docs/post-benchmark-roadmap.md`
- `config-merger-bug` fixture drift was corrected back to a truly failing state

Primary validations:

- `pb5-roadmap-audit-v2`
- `pb5-followup-greeting-v1`

Observed outcomes:

- roadmap audit now finishes with a long structured report instead of a weak
  ending or stale recommendation set
- follow-up greeting context still succeeds with a compact grounded answer:
  - correct file
  - correct root cause
  - correct verification

## Current Conclusion

The current best benchmark-based judgment is:

- Holon is already competitive on open-ended analysis and local coding tasks
- the next useful improvements should target:
  - better evidence targeting
  - more realistic coordination benchmarks
  - tool-surface refinements only where metrics justify them
- raw tool-count reduction should not be treated as the main optimization goal

## Benchmark Expansion V1

After the `PB1-PB5` refinement phase, three additional benchmarks were added to
improve diagnosis:

- `coordination-sequential-render-plan`
- `analysis-evidence-improvements`
- `read-granularity-holon-analysis-pipeline`

These were run together in:

- `expansion-v1`

### Coordination Benchmark

Task:

- multi-turn coding task
- asks the agent to keep track of completed and pending steps while fixing the
  sequential render fixture

Result:

| Runner | Success | Duration | Tool Calls | TodoWrite | Verify |
|---|---:|---:|---:|---:|---:|
| Holon | yes | 11.5s | 18 | 4 | pass |
| Claude SDK | no | 62.8s | 21 | 0 | fail |

Interpretation:

- Holon used `TodoWrite` repeatedly and completed the task with a grounded
  follow-up answer.
- Claude SDK produced a plausible-looking plan/status report, but left the
  fixture unchanged and failed verification.

This is a useful benchmark because it measures more than code fixing:

- planning persistence
- session coordination
- truthful follow-up reporting

### Analysis Evidence Benchmark

Task:

- analyze a small runtime fixture
- recommend three concrete improvements
- every recommendation must cite specific files and explain a current
  limitation

Result:

| Runner | Success | Duration | Tool Calls | Read Ops | List Ops |
|---|---:|---:|---:|---:|---:|
| Holon | yes | 8.9s | 9 | 5 | 3 |
| Claude SDK | yes | 20.7s | 12 | 5 | 7 |

Interpretation:

- both runners succeeded
- both read the same number of files
- Holon reached the answer faster and with fewer total tool calls
- Claude SDK relied more on discovery/list operations for the same small
  fixture

This benchmark is good at measuring evidence discipline without mixing in
project-roadmap synthesis.

### Read Granularity Benchmark

Task:

- analyze a narrow Holon snapshot
- identify where prompt assembly, context assembly, tool execution, and
  benchmark comparison live

Result:

| Runner | Success | Duration | Tool Calls | Read Ops | List Ops | Bytes Read |
|---|---:|---:|---:|---:|---:|---:|
| Holon | yes | 15.3s | 8 | 7 | 1 | 157,142 |
| Claude SDK | yes | 22.5s | 18 | 12 | 6 | 139,124 |

Interpretation:

- Holon finished faster and with fewer total exploration steps
- Claude SDK used more discovery and more file reads on this narrow mapping
  task
- Holon still read large chunks once it committed to a file, so the benchmark
  supports a more precise claim:
  - Holon is not simply "over-reading"
  - Holon currently prefers broader file reads over more discovery steps

## Updated Judgment After Expansion V1

These new tasks strengthen the current conclusion:

- Holon is already strong on:
  - grounded follow-up
  - narrow analysis mapping
  - evidence-backed improvement recommendations
- the next benchmark work should continue to focus on:
  - coordination realism
  - evidence discipline
  - tool-surface diagnosis

The current evidence still does **not** justify a blanket claim that Holon's
analysis problem is "too many file reads". The more accurate framing is:

- Holon often reaches answers with fewer search/list steps
- Holon can still read larger evidence chunks than Claude SDK
- that is a refinement target, not a blocking defect

## Bounded Synthesis Iteration

After `extension-v2-bounded`, Holon's prompt system was updated with a
turn-scoped bounded-output section. This section activates only when the user
request explicitly asks for a bounded or highly concise answer.
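
One way to picture the turn-scoped activation is a per-request check that
returns an extra prompt section only when the request asks for a bound. This
is a sketch under assumptions: the marker phrases, the function names, and the
section text are all made up for illustration and are not taken from Holon.

```rust
/// Heuristic: does the request explicitly ask for a bounded or very concise
/// answer? The marker list is an assumption for illustration only.
pub fn wants_bounded_output(request: &str) -> bool {
    const MARKERS: [&str; 4] = ["at most", "no more than", "concise", "brief"];
    let lower = request.to_lowercase();
    MARKERS.iter().any(|m| lower.contains(*m))
}

/// Return the extra section for this turn only. Nothing is stored, so the
/// next turn is evaluated from scratch.
pub fn bounded_output_section(request: &str) -> Option<&'static str> {
    if wants_bounded_output(request) {
        Some("Keep the final answer within the requested bound; keep file references.")
    } else {
        None
    }
}
```

Because the decision is recomputed from each request, a wide analysis task
that follows a bounded one does not inherit the terse guidance.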

The goal was:

- improve concise synthesis efficiency
- keep grounded file references
- avoid making wide analysis tasks artificially terse

Validation run:

- `bounded-v2`

### Bounded Synthesis Result

`bounded-synthesis-analysis-runtime`:

| Runner | Success | Duration | Tool Calls | Final Length | Read Ops |
|---|---:|---:|---:|---:|---:|
| Holon | yes | 3.9s | 11 | 993 | 5 |
| Claude SDK | yes | 13.9s | 8 | 1182 | 4 |

Compared to the earlier Holon run in `extension-v2-bounded`:

- duration improved from `25.5s` to `3.9s`
- final length dropped from `1428` to `993`
- read ops stayed controlled and grounded evidence remained intact
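
Expressed as relative change, using a small illustrative helper:

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean a reduction.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For the figures above, duration changed by about -84.7% (25.5s to 3.9s) and
final length by about -30.5% (1428 to 993). Both come from single runs, so
they are directional rather than statistically robust.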

Interpretation:

- the bounded-output section materially improved Holon on the task it was meant
  to optimize
- Holon became faster than Claude SDK on this bounded synthesis benchmark while
  staying grounded

### Read Granularity Side Effect

`read-granularity-holon-analysis-pipeline` in the same run:

| Runner | Success | Duration | Tool Calls | Final Length | Read Ops | Unique Files |
|---|---:|---:|---:|---:|---:|---:|
| Holon | yes | 2.3s | 15 | 650 | 7 | 7 |
| Claude SDK | yes | 21.0s | 16 | 3665 | 12 | 12 |

Interpretation:

- the bounded-output optimization did not damage the broader mapping task
- on this rerun, Holon answered the scoped mapping question much faster while
  reading fewer files than Claude SDK
- this still does **not** justify promoting bounded-output guidance into all
  analysis turns; the current evidence only supports keeping it scoped to
  explicitly bounded requests

## Current Judgment

The current benchmark-based judgment is now more precise:

- Holon can be made highly competitive on concise bounded synthesis with a
  targeted, generic prompt contract
- that optimization should stay turn-scoped
- broader analysis efficiency should still be treated separately from bounded
  synthesis optimization

## Benchmark Extension V2

Two additional benchmark classes were then added:

- `task-inspection-subagent-status`
- `bounded-synthesis-analysis-runtime`

These were designed to answer two different questions:

- can Holon's new task-inspection tools support a real workflow?
- can Holon stay concise and grounded when the synthesis task is explicitly
  bounded?

### Task Inspection Benchmark

Task:

- Holon-only capability benchmark
- asks the agent to:
  - create a bounded subagent task
  - stop the main turn
  - later inspect the task state and report the result

Run:

- `extension-v2`

Result:

| Runner | Success | Duration | Tool Calls | CreateTask | TaskList | TaskGet |
|---|---:|---:|---:|---:|---:|---:|
| Holon | yes | 7.5s | 6 | 1 | 1 | 1 |

Interpretation:

- this benchmark is worth keeping
- it proves the new task-control tools are not just schema additions
- Holon used:
  - `CreateTask`
  - `TaskList`
  - `TaskGet`
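
The create / list / get flow can be pictured with a small in-memory registry.
This is a hedged Rust sketch, not Holon's runtime: `TaskRegistry`,
`TaskStatus`, and every method name are hypothetical stand-ins for whatever
backs `CreateTask`, `TaskList`, `TaskGet`, and `TaskStop`.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TaskStatus {
    Running,
    Completed { result: String },
    Cancelled,
}

/// Hypothetical in-memory task store keyed by an increasing id.
#[derive(Debug, Default)]
pub struct TaskRegistry {
    next_id: u64,
    tasks: BTreeMap<u64, (String, TaskStatus)>,
}

impl TaskRegistry {
    /// Analogue of `CreateTask`: register a running task and return its id.
    pub fn create_task(&mut self, description: &str) -> u64 {
        self.next_id += 1;
        self.tasks
            .insert(self.next_id, (description.to_string(), TaskStatus::Running));
        self.next_id
    }

    /// Analogue of `TaskList`: ids and descriptions in creation order.
    pub fn task_list(&self) -> Vec<(u64, &str)> {
        self.tasks
            .iter()
            .map(|(id, (desc, _))| (*id, desc.as_str()))
            .collect()
    }

    /// Analogue of `TaskGet`: current status, if the task exists.
    pub fn task_get(&self, id: u64) -> Option<&TaskStatus> {
        self.tasks.get(&id).map(|(_, status)| status)
    }

    /// Record a result when the background work finishes.
    pub fn complete(&mut self, id: u64, result: &str) -> bool {
        self.transition(id, TaskStatus::Completed { result: result.to_string() })
    }

    /// Analogue of `TaskStop`: cancel the task if it is still running.
    pub fn task_stop(&mut self, id: u64) -> bool {
        self.transition(id, TaskStatus::Cancelled)
    }

    /// Move a task out of `Running`; finished tasks are left untouched.
    fn transition(&mut self, id: u64, next: TaskStatus) -> bool {
        if let Some((_, status)) = self.tasks.get_mut(&id) {
            if *status == TaskStatus::Running {
                *status = next;
                return true;
            }
        }
        false
    }
}
```

The point of the benchmark maps onto this shape: the main turn ends after
`create_task`, and a later turn recovers the outcome purely through
`task_list` and `task_get`.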

Important caveat:

- the benchmark also exposed a quality issue:
  - the final answer included raw subagent output with internal planning traces
  - this suggests a result-hygiene gap in how subagent output is delivered back
    through task results

That hygiene issue was then fixed by:

- stronger `PromptMode::Subagent` output constraints
- runtime-side subagent result sanitization

Validation:

- `hygiene-v2`

Observed outcome after the fix:

- `task-inspection-subagent-status` still passed
- the final answer no longer leaked `<think>` blocks, pseudo-tool tags, or
  internal planning traces
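
As an illustration of what runtime-side sanitization might involve, the sketch
below strips `<think>` blocks from a subagent result using only the standard
library. It is an assumption-laden example, not Holon's sanitizer: the
function name is hypothetical, it does not handle pseudo-tool tags, and the
choice to drop everything after an unclosed `<think>` is a design guess.

```rust
/// Remove every `<think>...</think>` span from `raw`.
/// An unclosed `<think>` drops the remainder of the text, on the assumption
/// that leaking a partial trace is worse than losing a tail.
pub fn strip_think_blocks(raw: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(start) = rest.find(OPEN) {
        // Keep the text before the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(CLOSE) {
            // Skip past the closing tag and keep scanning.
            Some(end) => rest = &rest[start + end + CLOSE.len()..],
            // No closing tag: discard the remainder.
            None => rest = "",
        }
    }
    out.push_str(rest);
    out.trim().to_string()
}
```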

So this benchmark should remain in the corpus, and now serves as a regression
test for subagent result hygiene.

### Bounded Synthesis Benchmark

Task:

- concise analysis task with a strict upper bound on final response length
- same fixture family as earlier analysis tasks
- still requires grounded file references and a concrete next milestone

Run:

- `extension-v2-bounded`

Result:

| Runner | Success | Duration | Tool Calls | Read Ops | Final Length |
|---|---:|---:|---:|---:|---:|
| Holon | yes | 25.5s | 7 | 4 | 1428 |
| Claude SDK | yes | 9.7s | 6 | 4 | 1024 |

Interpretation:

- this benchmark is also worth keeping
- it surfaced a real difference:
  - both runners stayed grounded
  - Claude SDK was materially faster and more concise
  - Holon still answered well, but was slower and longer under the same bounded
    synthesis task

This is useful because it isolates a narrower weakness than the open-ended
roadmap audit:

- not general analysis ability
- not file-reading discipline alone
- specifically concise synthesis efficiency

## Updated Judgment After Extension V2

The benchmark corpus now gives a more nuanced picture:

- Holon strengths:
  - grounded multi-turn follow-up
  - coordination with its own task/todo tools
  - efficient narrow mapping and evidence-backed analysis
- Holon weaknesses or open issues:
  - concise bounded synthesis is still slower than Claude SDK
  - subagent task result hygiene needs improvement

So the next good benchmark-informed work is no longer "add random tasks".
The next highest-value items are:

- fix subagent result-delivery hygiene
- improve concise synthesis efficiency without losing grounding
- continue growing the corpus around coordination and bounded reporting

So the next prompt/runtime work should focus on:

- stronger finishing/result-delivery guarantees for longer coding tasks
- richer benchmark coverage for follow-up and multi-turn sessions
- possibly revisiting Claude SDK adapter settings only if we want a separate
  "higher max-turn baseline" experiment

## Expansion Two

The next benchmark wave broadened the corpus in four directions:

- retry after a verification failure
- multi-file fix followed by a follow-up question
- restraint when no code change is needed
- open-ended project audit against a real `Holon` code snapshot

These tasks are implemented as:

- `failed-verification-retry`
- `followup-after-multifile-fix`
- `no-change-needed-analysis`
- `holon-project-roadmap-audit`

### What We Verified First

Before comparing runners, the fixtures and task definitions were smoke-tested
with Holon itself.

That showed:

- `failed-verification-retry` is a valid coding benchmark
- `followup-after-multifile-fix` is a valid multi-turn benchmark
- `no-change-needed-analysis` is a valid “do not edit” benchmark
- `holon-project-roadmap-audit` is intentionally harder and currently exposes a
  Holon weakness in open-ended analysis/result delivery

The open-ended audit task is therefore useful even though Holon does not yet
pass it reliably.

### No-Change-Needed Analysis

This task asks the runner to inspect a healthy fixture, verify it, and avoid
unnecessary edits.

Observed result:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| no-change-needed-analysis | Holon | yes | 14.9s | 8 | no edits, verified, produced a real analysis summary |
| no-change-needed-analysis | Claude SDK | yes | 17.8s | 7 | no edits, verified, also stayed disciplined |

Takeaway:

- this task is a good sanity benchmark
- both runners pass it
- it does not currently expose a large quality gap

That is useful because it shows the corpus is not biased toward forcing one side
to fail.

### Follow-Up After Multi-File Fix

This task asks the runner to repair `config-merger-bug`, then answer a
follow-up grounded in the actual repair.

Observed result:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| followup-after-multifile-fix | Holon | yes | 49.0s | 25 | repaired the bug, preserved follow-up context, answered with grounded file/root-cause/verification details |
| followup-after-multifile-fix | Claude SDK | no | 165.5s | 37 | answered confidently but verification still failed |

Important observed behavior:

- Holon repaired the bug and then answered the second-turn question using the
  actual session history.
- Claude SDK produced a convincing but false final answer:
  - claimed the changed file was `benchmark/fixtures/config-merger-bug/src/merge.js`
  - claimed `node test.js` passed
  - but the verify log still failed with the original `theme === "light"`
    assertion

This is now another strong clean gap in the benchmark corpus.

It is especially valuable because it combines:

- multi-step repair
- multi-turn context retention
- honest final reporting

### Failed Verification Retry

This task uses a small fixture with two formatting defects in the render path.

Observed result so far:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| failed-verification-retry | Holon | yes | 13.2s | 14 | repaired both defects and passed verification |

Important note:

- this benchmark does not force a strict “fail once, then retry” sequence
- a sufficiently strong runner may inspect both defects and fix them in one pass

So this task should be interpreted as a benchmark that allows retry behavior,
not one that guarantees it.

It is still worth keeping because it broadens beyond single-bug fixtures.

### Holon Project Roadmap Audit

This task asks the runner to read a real tracked snapshot of the `Holon`
repository and recommend the next concrete improvements with file-grounded
evidence.

Initial observed result:

| Task | Runner | Success | Duration | Tool Calls | Notes |
|---|---|---:|---:|---:|---|
| holon-project-roadmap-audit | Holon | no | 38.4s | 17 | gathered substantial evidence, but final result delivery collapsed into an over-short summary |

The initial failure exposed a real runtime/output bug rather than a pure
reasoning gap.

Root cause:

- the model sometimes called `Sleep` with a malformed structured payload instead
  of a single clean `reason`
- `Sleep` preserved only the short `reason` field and silently dropped the rest
  of the structured content
- `derive_final_text()` then preferred a short assistant preamble over the
  richer `Sleep` summary

This was fixed by:

- making `Sleep` preserve malformed structured payloads instead of collapsing
  them to a placeholder
- preferring a richer `Sleep` summary over obvious short preambles in
  `derive_final_text()`
- tightening the prompt contract so `Sleep` is told to pass exactly one string
  field for `reason`
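
The preference rule can be sketched as follows. This is not the actual
`derive_final_text()`: the function name carries a `_sketch` suffix to make
that clear, and the 80-character cutoff for an "obvious short preamble" is an
assumption.

```rust
/// Pick the user-facing final text for a turn (illustrative only).
/// `preamble` is the assistant's last plain text; `sleep_reason` is the
/// `reason` string passed to `Sleep`, if any.
pub fn derive_final_text_sketch(preamble: &str, sleep_reason: Option<&str>) -> String {
    // Assumed threshold for what counts as an obviously short preamble.
    const SHORT_PREAMBLE_CHARS: usize = 80;
    let preamble = preamble.trim();
    let reason = sleep_reason.map(str::trim).unwrap_or("");
    let preamble_len = preamble.chars().count();
    let preamble_is_short = preamble_len < SHORT_PREAMBLE_CHARS;
    // Prefer the richer Sleep summary over an obviously short preamble.
    if preamble_is_short && reason.chars().count() > preamble_len {
        reason.to_string()
    } else {
        preamble.to_string()
    }
}
```

A length comparison is a crude proxy for "richer"; it is used here only
because it is easy to reason about.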

After the fix, the same task became stable for both runners:

| Task | Runner | Success | Duration | Tool Calls | Final Message Length | Notes |
|---|---|---:|---:|---:|---:|---|
| holon-project-roadmap-audit | Holon | yes | 40.1s | 26 | 4756 | stable long-form report, grounded in current docs/code/benchmarks |
| holon-project-roadmap-audit | Claude SDK | yes | 82.6s | 18 | 4204 | stable long-form report, slightly more concise, slower overall |

One additional issue surfaced after that first fix:

- Holon still treated provider-side `max_tokens` truncation as a successful
  turn because it did not parse `stop_reason`
- Holon also preferred the `Sleep` tool record summary over the full
  `Sleep.reason` content, which could re-truncate a long report back into an
  ellipsized summary

That was corrected by:

- raising the old fixed `1024` output-token budget to a configurable runtime
  setting
- parsing provider `stop_reason`
- automatically continuing generation when the provider stops at `max_tokens`
- using the full `Sleep` result content rather than the truncated tool summary
  when deriving the final user-facing report
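
The truncation-recovery loop can be sketched roughly as below. It is an illustration under stated assumptions rather than Holon's runtime code: `StopReason`, `parse_stop_reason`, `generate_with_continuation`, the `end_turn` wire value, and the continuation cap are hypothetical; only the `max_tokens` stop reason comes from the description above.

```rust
/// Stop reasons relevant to this sketch. Variant names are illustrative.
#[derive(Debug, Clone, PartialEq)]
enum StopReason {
    EndTurn,
    MaxTokens,
    Other(String),
}

/// Map the provider's raw `stop_reason` string onto the enum.
fn parse_stop_reason(raw: &str) -> StopReason {
    match raw {
        "end_turn" => StopReason::EndTurn, // assumed wire value
        "max_tokens" => StopReason::MaxTokens,
        other => StopReason::Other(other.to_string()),
    }
}

/// Call `generate` again while the provider stops at `max_tokens`, up to
/// `max_continuations` extra calls. `generate` receives the text produced so
/// far and returns `(new_chunk, raw_stop_reason)`.
fn generate_with_continuation<F>(mut generate: F, max_continuations: usize) -> (String, StopReason)
where
    F: FnMut(&str) -> (String, String),
{
    let mut text = String::new();
    let mut continuations = 0;
    loop {
        let (chunk, raw) = generate(&text);
        text.push_str(&chunk);
        let reason = parse_stop_reason(&raw);
        // Stop on any non-truncation reason, or once the cap is reached.
        if reason != StopReason::MaxTokens || continuations >= max_continuations {
            return (text, reason);
        }
        continuations += 1;
    }
}
```

Returning the final stop reason alongside the text lets the caller still distinguish a clean finish from a report that hit the cap while truncated.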

Revalidation after this second fix:

| Task | Runner | Success | Duration | Tool Calls | Final Message Length | Notes |
|---|---|---:|---:|---:|---:|---|
| holon-project-roadmap-audit | Holon | yes | 105.8s | 30 | 4349 | report completes cleanly after truncation-recovery and full Sleep-result delivery |

Comparison takeaway:

- the report-stability problem in Holon is now fixed in this benchmark
- in the post-first-fix comparison, Holon was faster (40.1s versus 82.6s) but
  also more tool-hungry (26 versus 18 tool calls)
- Claude SDK is slower, but produces a comparably grounded final report
- the current gap is no longer “Holon cannot finish open-ended audit tasks”
- the more precise remaining difference is:
  - Holon tends to over-read and over-assemble evidence
  - Claude SDK tends to read less, synthesize earlier, and spend more wall time

This task remains intentionally hard.

Its purpose is not only to check whether the model can say something smart
about the repo, but to stress:

- open-ended project understanding
- roadmap judgment
- grounding in real files
- long-form result delivery quality

So the updated interpretation is:

- it is no longer just an aspirational benchmark
- it is now a useful comparison task for open-ended analysis quality and
  efficiency

## `SVS-401`: Focused Tool-Surface Recompare

`SVS-401` reran two focused comparison tasks with current token and model-round
metrics:

- `analysis-evidence-improvements`
- `read-granularity-holon-analysis-pipeline`

Fresh summary:

- `.benchmark-results/svs401-compare-v1/summary.json`

Key takeaways:

- Holon does not look like a simple "reads too many files" agent.
- On the focused evidence task, both runners read the same number of files.
- On the read-granularity task, Holon read fewer files and finished faster.
- Claude SDK still spends more steps in discovery/listing mode.
- Holon currently tends to spend more model rounds synthesizing once it has
  gathered evidence.
- Token and round cost are now observable on focused tasks, but older
  comparison runs still lack those counters and should not be used for
  token-cost claims.

See also:

- `docs/tool-surface-comparison.md`

## `#1244`: Tool Input Validation Contract Benchmark

Run label:

- `.benchmark-results/openai-tool-contract-validation-2026-05-18-1244-r1`

Task:

- Issue: [#1244](https://github.com/holon-run/holon/issues/1244)
- Goal: harden common invalid built-in tool input shape handling, especially
  `Sleep`, `ExecCommand`, `ExecCommandBatch`, and existing `ApplyPatch` JSON
  surface behavior.
- Model: `openai-codex/gpt-5.3-codex-spark`

Both runners produced draft PRs that completed the core task and passed GitHub
CI:

| Runner | PR | Draft | CI | Changed Files | Additions | Deletions | Local/Agent Duration | Input Tokens | Cached/Read Input | Output Tokens | Rounds |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Codex | [#1250](https://github.com/holon-run/holon/pull/1250) | yes | pass | 4 | 207 | 125 | 389.0s | 3,366,894 | 3,214,976 | 20,397 | 1 CLI turn |
| Holon | [#1252](https://github.com/holon-run/holon/pull/1252) | yes | pass | 4 | 169 | 122 | 1,002.3s | 5,515,298 | 4,783,360 | 24,821 | 88 model rounds |

Changed files are the same in both PRs:

- `src/tool/helpers.rs`
- `src/tool/tools/exec_command.rs`
- `src/tool/tools/exec_command_batch.rs`
- `src/tool/tools/sleep.rs`

### Implementation Comparison

Both implementations:

- remove the old permissive `Sleep` helper path that folded unknown structured
  fields into `reason`
- make `Sleep` reject unknown top-level fields
- make `Sleep` reject `duration_ms = 0`
- add `ExecCommand` invalid-shape coverage for `command` instead of `cmd`
- add `ExecCommand` invalid-shape coverage for task metadata such as `status`
- extend `ExecCommandBatch` coverage for unsupported continuation fields
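
The strict `Sleep` contract shared by both PRs can be illustrated with a small std-only sketch. It is not the code from either PR: `ArgValue`, `SleepArgs`, `parse_sleep_args`, and the error strings are hypothetical, and the example assumes `reason` and `duration_ms` are the only accepted top-level fields. A real implementation would more likely lean on a typed deserializer than on a hand-written match.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value; just enough for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum ArgValue {
    Str(String),
    Int(u64),
}

/// Hypothetical typed view of `Sleep` input.
#[derive(Debug, PartialEq)]
struct SleepArgs {
    reason: Option<String>,
    duration_ms: Option<u64>,
}

/// Reject unknown top-level fields and `duration_ms = 0` instead of folding
/// unexpected structure into `reason`.
fn parse_sleep_args(raw: &BTreeMap<String, ArgValue>) -> Result<SleepArgs, String> {
    let mut args = SleepArgs { reason: None, duration_ms: None };
    for (key, value) in raw {
        match (key.as_str(), value) {
            ("reason", ArgValue::Str(s)) => args.reason = Some(s.clone()),
            ("duration_ms", ArgValue::Int(0)) => {
                return Err("duration_ms must be greater than 0".to_string())
            }
            ("duration_ms", ArgValue::Int(ms)) => args.duration_ms = Some(*ms),
            // Known field, wrong value type.
            ("reason", _) | ("duration_ms", _) => {
                return Err(format!("invalid type for field `{}`", key))
            }
            // Anything else is rejected rather than silently preserved.
            (unknown, _) => return Err(format!("unknown field `{}`", unknown)),
        }
    }
    Ok(args)
}
```

The design point is that a malformed payload now produces an explicit error the model can react to, rather than a silently altered `reason`.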

The Codex PR adds a custom `parse_exec_command_args` wrapper that emits more
specific errors for:

- `command` vs `cmd`
- `status`
- `task_handle`

That gives better model-facing recovery hints, but it also adds more bespoke
parsing logic.

The Holon PR keeps `ExecCommand` validation on the shared typed parser path and
tests the generic strict-schema error. It is smaller and closer to the existing
tool parsing pattern. Its PR body is also clearer: it lists focused validation
commands and explicitly explains the typed parsing direction.

### Verification Notes

GitHub CI passed for both PRs:

- `Rust`
- `Coverage`
- `Run Holon / solve`
- Vercel preview checks

The Codex benchmark artifact reports `verify_status = failed` because the
benchmark-local `cargo test --quiet --workspace` run failed in unrelated
`runtime_compaction` tests:

- `preview_prompt_after_compaction_keeps_work_item_plan_and_pending_work_visible`
- `contentful_wake_hint_after_compaction_keeps_active_work_truth`

The focused `cargo test --quiet tool` step passed in the same artifact, and the
PR's GitHub Rust CI later passed. Treat the Codex artifact verification failure
as an environment/base verification mismatch, not as evidence that the PR
implementation is broken.

Holon's agent run performed the focused tests listed in the PR body and then
GitHub CI validated the full PR.

### Runner Behavior

Codex finished materially faster and cheaper:

- about 6.5 minutes locally versus about 16.7 minutes for Holon
- lower total input and output tokens
- one CLI turn versus Holon's 88 model rounds

Holon still completed the task and produced a valid PR, but the run shows the
same cost pattern seen in recent command/tool-contract benchmarks:

- many small model rounds
- substantial provider replay/compaction traffic
- repeated correction cycles around edits and command/PR flow

Holon cache behavior was healthy in the sense that most input was cache-read
or replayed through incremental continuation:

- 87 of 88 rounds hit incremental continuation
- total cache-read input tokens: 4,783,360

The remaining gap is therefore less about total context loss and more about
round count and execution path efficiency.
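
As a quick check of the cache claim, the cached share of input can be computed from the table values. The helper below is only a sketch: `cache_read_share_percent` is a hypothetical name, and it assumes the "Cached/Read Input" column counts a subset of "Input Tokens".

```rust
/// Share of input tokens served as cached/read input, in percent.
/// Assumes the cached/read count is a subset of the total input count.
fn cache_read_share_percent(input_tokens: u64, cached_read_tokens: u64) -> f64 {
    if input_tokens == 0 {
        return 0.0;
    }
    100.0 * cached_read_tokens as f64 / input_tokens as f64
}
```

Under that assumption, Holon's 4,783,360 of 5,515,298 input tokens works out to roughly 86.7% cached, and Codex's 3,214,976 of 3,366,894 to roughly 95.5%. Both are high, which is consistent with the gap being driven by round count rather than lost context.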

### Recommendation

Prefer keeping **Holon PR #1252** as the official PR for this benchmark.

Reasons:

- smaller patch with the same changed-file set
- stays closer to shared typed parser behavior instead of adding bespoke
  `ExecCommand` pre-parse branches
- clearer PR body and validation section
- full GitHub CI is green

Codex PR #1250 is a useful comparison artifact and has stronger custom recovery
hints for `ExecCommand`, especially `task_handle`, but those improvements should
be considered separately if we want better bespoke model-facing errors. They are
not necessary for the core #1244 contract fix.

## `#1256`: Command Task Identity Projection Benchmark

Run label:

- `.benchmark-results/command-task-identity-2026-05-18-1256-r1`

Task:

- Issue: [#1256](https://github.com/holon-run/holon/issues/1256)
- Goal: expose full command identity on agent-facing `TaskList` and
  `TaskStatus` projections for `command_task`.
- Model: `openai-codex/gpt-5.3-codex-spark`

Raw benchmark results:

| Runner | PR | Draft | Verify | Changed Files | Additions | Deletions | Duration | Input Tokens | Cached/Read Input | Output Tokens | Rounds |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Holon | [#1258](https://github.com/holon-run/holon/pull/1258) | yes | fail | 6 | 124 | 6 | 616.1s | 2,584,038 | 2,109,952 | 14,761 | 55 provider rounds |
| Codex | [#1259](https://github.com/holon-run/holon/pull/1259) | yes | fail | 5 | 124 | 18 | 1,292.6s | 9,407,166 | n/a | 33,909 | 1 CLI turn |

Changed files:

- Holon: `docs/runtime-spec.md`, `src/runtime/command_task.rs`,
  `src/runtime/tasks.rs`, `src/types.rs`, `tests/runtime_tasks.rs`,
  `tests/support/runtime_tasks.rs`
- Codex: `docs/runtime-spec.md`, `src/runtime/command_task.rs`,
  `src/runtime/tasks.rs`, `src/runtime/tests/agent_and_tools.rs`,
  `src/types.rs`

### Implementation Comparison

Both implementations add the core behavior requested by #1256:

- persist `cmd_digest` in command task detail
- expose `cmd`, `cmd_digest`, `workdir`, `shell`, `login`, and `tty` in
  `TaskStatus.task.command`
- expose a command projection in `TaskList` entries for `command_task`
- keep non-command task entries compact via an absent optional `command` field
- document that `cmd_preview` is not the agent-facing identity source
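
The projection shape described above might look roughly like this. The field names in `CommandProjection` follow the list; everything else (`TaskRecord`, `TaskListEntry`, `project_entry`, and all field types) is a hypothetical stand-in, not the code from either PR.

```rust
/// Full command identity exposed to the agent. Field names follow the list
/// above; the field types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct CommandProjection {
    cmd: String,
    cmd_digest: String,
    workdir: String,
    shell: String,
    login: bool,
    tty: bool,
}

/// Hypothetical stored task record; not Holon's actual type.
#[derive(Debug, Clone)]
struct TaskRecord {
    id: String,
    kind: String,
    command: Option<CommandProjection>,
}

/// Hypothetical `TaskList` entry. `command` stays `None` for non-command
/// tasks, which keeps those entries compact.
#[derive(Debug, Clone, PartialEq)]
struct TaskListEntry {
    id: String,
    kind: String,
    command: Option<CommandProjection>,
}

/// Project a record into a list entry, attaching command identity only for
/// `command_task` records.
fn project_entry(record: &TaskRecord) -> TaskListEntry {
    TaskListEntry {
        id: record.id.clone(),
        kind: record.kind.clone(),
        command: if record.kind == "command_task" {
            record.command.clone()
        } else {
            None
        },
    }
}
```

Keeping the projection in one function mirrors the review point made below about having a single builder that both `TaskList` and `TaskStatus` can reuse.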

The Holon implementation is much cheaper and faster in this run:

- 2.58M input tokens versus Codex 9.41M
- 14.8K output tokens versus Codex 33.9K
- about 10.3 minutes versus about 21.5 minutes
- 51 tool calls versus Codex 124

However, the Codex implementation is cleaner as the PR to keep:

- it extracts `CommandTaskStatusSnapshot::from_task_record` and reuses it from
  both `TaskList` and `TaskStatus`
- it keeps the command projection construction local to the command projection
  type rather than routing `TaskList` through `TaskStatusSnapshot`
- it uses existing runtime unit tests plus a type-level projection test, so the
  behavior is covered close to the projection boundary
- its PR body is already a normal implementation PR with `Closes #1256`

The Holon implementation is valid enough as a benchmark result, but the PR is
less polished:

- PR body is still benchmark boilerplate
- the final agent summary only mentions a late test type-fix, not the full
  issue implementation
- the `TaskList` implementation obtains the command projection via
  `TaskStatusSnapshot::from_task_record(&task).command`, which works but is a
  less direct boundary than the Codex helper

### Verification Notes

Both raw benchmark artifacts report failed verification.

Holon:

- `cargo test --quiet task` exited with code `1` without useful diagnostic
  output in the benchmark verify log
- `cargo test --quiet --workspace` later failed in existing
  `runtime_compaction` tests:
  - `preview_prompt_after_compaction_keeps_work_item_plan_and_pending_work_visible`
  - `contentful_wake_hint_after_compaction_keeps_active_work_truth`
- GitHub CI initially failed only at `cargo fmt --check`

Codex:

- focused task tests passed in the raw artifact:
  - `cargo test --quiet task`
  - agent-run targeted projection tests
- `cargo test --quiet --workspace` failed in an unrelated
  `agent_template::tests::initialize_agent_home_fails_closed_on_invalid_skill_ref`
  assertion
- GitHub CI initially failed only at `cargo fmt --check`
- after selection, a follow-up formatting commit was pushed to #1259:
  `2fea61e style: format command identity test`
- after the formatting commit, GitHub Rust/Coverage still failed on the current
  base with an unrelated compile error in
  `src/tool/tools/exec_command_batch.rs`:
  `ExecCommandBatchItemArgs` has no field named `continue_on_result`
- the same error reproduces on current `main` with
  `cargo check --all-targets -q`, so this is a base/main break rather than a
  regression introduced by #1259

One benchmark-framework observation: the Codex runner artifact records
`pr_status = skipped_no_changes`, but the agent itself had already created
[#1259](https://github.com/holon-run/holon/pull/1259). Treat #1259 as the real
Codex PR for this run.

### Recommendation

Prefer keeping **Codex PR #1259** as the official PR.

Reasons:

- better projection boundary via a shared `CommandTaskStatusSnapshot` builder
- cleaner PR body and issue linkage
- simpler changed-file set
- focused task tests passed before the unrelated workspace verification failure
- the initial PR-specific CI issue was a trivial formatting issue and was fixed
  after selection
- the remaining red CI is caused by current `main` and should be handled
  separately before merge

Close **Holon PR #1258** as superseded by #1259. Holon had the better token and
runtime profile in this benchmark, so this run is still a positive signal for
Holon execution efficiency, but #1259 is the cleaner code review artifact.

## TaskList Active-Only Coordination View (#1260)

Run label:

- `.benchmark-results/task-list-active-only-2026-05-19-1260-r1`

Task:

- Issue: [#1260](https://github.com/holon-run/holon/issues/1260)
- Goal: make `TaskList` return only active task snapshots for the current
  agent, without adding filter parameters, while preserving the compact command
  identity projection from #1256.
- Model: `openai-codex/gpt-5.3-codex-spark`

Raw benchmark results:

| Runner | PR | Draft | Verify | Changed Files | Additions | Deletions | Duration | Input Tokens | Cached/Read Input | Output Tokens | Rounds |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Holon | [#1263](https://github.com/holon-run/holon/pull/1263) | yes | fail | 4 | 117 | 4 | 560.5s | 2,207,796 | 1,795,968 | 11,556 | 48 provider rounds |
| Codex | [#1262](https://github.com/holon-run/holon/pull/1262) | yes | fail | 4 | 132 | 4 | 688.5s | 3,463,674 | n/a | 20,054 | 1 CLI turn |

Changed files:

- Holon: `src/prompt/tools.rs`, `src/runtime/tasks.rs`,
  `src/runtime/tests/agent_and_tools.rs`, `src/tool/tools/task_list.rs`
- Codex: `src/prompt/tools.rs`, `src/runtime/tasks.rs`,
  `src/runtime/tests/agent_and_tools.rs`, `src/tool/tools/task_list.rs`

### Implementation Comparison

Both implementations make `TaskList` active-only by switching the runtime path
from historical latest records to `latest_active_task_records_for_agent`, and
both update the model-facing tool description plus task-control prompt text.
Both also add runtime tests covering terminal task exclusion and current-agent
scoping.
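
The active-only, agent-scoped lookup can be sketched as a reduce-then-filter step over an append-only task log. This is an illustration, not the body of `latest_active_task_records_for_agent`: the `Status` variants, the `TaskRecord` shape, and the helper name `active_latest_for_agent` are assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical task states; which states count as "active" is an assumption.
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Queued,
    Running,
    Completed,
    Failed,
}

impl Status {
    fn is_active(&self) -> bool {
        matches!(self, Status::Queued | Status::Running)
    }
}

/// Hypothetical log entry: each append is a full snapshot of one task.
#[derive(Debug, Clone, PartialEq)]
struct TaskRecord {
    task_id: String,
    agent_id: String,
    status: Status,
}

/// Reduce the log to the latest snapshot per task id, then keep only active
/// tasks owned by `agent_id`. Reducing first matters: a task that was
/// running and later completed must not reappear as active.
fn active_latest_for_agent(log: &[TaskRecord], agent_id: &str) -> Vec<TaskRecord> {
    let mut latest: BTreeMap<&str, &TaskRecord> = BTreeMap::new();
    for record in log {
        // Later appends overwrite earlier snapshots of the same task.
        latest.insert(record.task_id.as_str(), record);
    }
    latest
        .into_values()
        .filter(|r| r.agent_id == agent_id && r.status.is_active())
        .cloned()
        .collect()
}
```

Taking `agent_id` as an explicit parameter is what makes the scoping directly testable, which is the boundary difference discussed below.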

Holon is cheaper and faster in this run:

- 2.21M input tokens versus Codex 3.46M
- 11.6K output tokens versus Codex 20.1K
- about 9.3 minutes versus about 11.5 minutes
- 44 tool calls versus Codex 62

Codex is the better PR to keep:

- it extracts `latest_task_list_entries_for_agent`, making the current-agent
  filtering boundary explicit and directly testable
- its test keeps the active-only lookup separated from `RuntimeHandle`'s
  current-agent resolution, which makes agent scoping easier to reason about
- its PR body is already a normal implementation PR with `Closes #1260`
- GitHub Rust and Coverage checks are green

The Holon implementation is functionally close, but it keeps the agent-scoped
active lookup inline inside `latest_task_list_entries`, so the test can only
exercise scoping indirectly through the runtime's current agent.

### Verification Notes

Both raw benchmark artifacts report failed verification, but the targeted task
tests passed in both runs. The failure was the same local full-workspace
`runtime_compaction` test failure in both worktrees:

- `preview_prompt_after_compaction_keeps_work_item_plan_and_pending_work_visible`
- `contentful_wake_hint_after_compaction_keeps_active_work_truth`

GitHub CI is the more useful signal for these PRs:

- #1262: Rust passed, Coverage passed
- #1263: Rust passed, Coverage passed

One benchmark-framework observation remains: both run artifacts recorded
`pr_status = skipped_no_changes`, but the agents had created draft PRs #1262
and #1263. Treat the GitHub PRs as the real delivery artifacts for this run.

### Recommendation

Prefer keeping **Codex PR #1262** as the official PR.

Reasons:

- cleaner runtime boundary via `latest_task_list_entries_for_agent`
- direct test coverage for the agent-scoped active-only helper
- normal PR body with issue linkage
- GitHub Rust/Coverage checks are green

Close **Holon PR #1263** as superseded by #1262. Holon had the better token and
runtime profile in this benchmark, but #1262 is the cleaner review artifact.

## ExecCommand Duplicate Startup Policy (#1257)

Run label:

- `.benchmark-results/exec-command-duplicate-policy-2026-05-19-1257-r1`

Task:

- Issue: [#1257](https://github.com/holon-run/holon/issues/1257)
- Goal: add `duplicate_policy` to `ExecCommand` so equivalent active
  `command_task` runs are reused by default and `start_new` explicitly starts a
  second process.
- Model: `openai-codex/gpt-5.3-codex-spark`

Raw benchmark results:

| Runner | PR | Draft | GitHub CI | Changed Files | Additions | Deletions | Duration | Input Tokens | Cached/Read Input | Output Tokens | Rounds |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Holon | [#1266](https://github.com/holon-run/holon/pull/1266) | yes | Rust/Coverage pass | 9 | 467 | 10 | 1,228.9s | 11,328,636 | 8,380,416 | 48,285 | 178 provider rounds |
| Codex | [#1265](https://github.com/holon-run/holon/pull/1265) | yes | Rust/Coverage pass | 9 | 519 | 36 | 1,140.5s | 12,729,590 | n/a | 73,114 | 1 CLI turn |

Changed files:

- Holon: `docs/runtime-spec.md`, `src/runtime/command_task.rs`,
  `src/runtime/task_supervisor.rs`, `src/runtime/turn.rs`,
  `src/tool/tools/exec_command.rs`, `src/tool/tools/exec_command_batch.rs`,
  `src/types.rs`, `tests/runtime_tasks.rs`, `tests/support/runtime_tasks.rs`
- Codex: `docs/rfcs/tool-result-envelope.md`, `docs/runtime-spec.md`,
  `src/runtime/command_task.rs`, `src/runtime/task_supervisor.rs`,
  `src/tool/tools/exec_command.rs`, `src/tool/tools/exec_command_batch.rs`,
  `src/types.rs`, `tests/runtime_tasks.rs`, `tests/support/runtime_tasks.rs`

### Implementation Comparison

Both implementations add the core #1257 behavior:

- `ExecCommand` accepts `duplicate_policy`
- `reuse_running` is the default
- `start_new` bypasses duplicate reuse
- equivalent active command tasks return an `already_running` receipt
- terminal command tasks do not block a new command run
- runtime tests cover reuse, `start_new`, non-equivalent commands, and terminal
  task behavior
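
The startup decision listed above can be sketched as a small pure function. It is a simplified illustration, not the code from either PR: `CommandTask`, `StartDecision`, and `decide_start` are hypothetical, and the sketch assumes that two commands are "equivalent" when their `cmd_digest` values match, which may be narrower than the real equivalence rule.

```rust
/// Policy values named in the issue; the enum itself is a sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DuplicatePolicy {
    ReuseRunning,
    StartNew,
}

/// Hypothetical view of a known command task.
#[derive(Debug, Clone)]
struct CommandTask {
    task_id: String,
    cmd_digest: String,
    /// `false` once the task has reached a terminal state.
    active: bool,
}

/// Hypothetical outcome of an `ExecCommand` startup request.
#[derive(Debug, PartialEq)]
enum StartDecision {
    StartProcess,
    AlreadyRunning { task_id: String },
}

/// Decide whether to start a process or hand back an `already_running`
/// receipt. An omitted policy defaults to `ReuseRunning`.
fn decide_start(
    tasks: &[CommandTask],
    cmd_digest: &str,
    policy: Option<DuplicatePolicy>,
) -> StartDecision {
    let policy = policy.unwrap_or(DuplicatePolicy::ReuseRunning);
    if policy == DuplicatePolicy::StartNew {
        // Explicit opt-out: never reuse, always start a second process.
        return StartDecision::StartProcess;
    }
    // Only active tasks can block a new run; terminal tasks are ignored.
    match tasks.iter().find(|t| t.active && t.cmd_digest == cmd_digest) {
        Some(existing) => StartDecision::AlreadyRunning {
            task_id: existing.task_id.clone(),
        },
        None => StartDecision::StartProcess,
    }
}
```

The four test scenarios in the last bullet map directly onto the branches here: reuse, `start_new`, a non-matching digest, and a terminal task with a matching digest.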

Holon is the stronger PR to keep:

- lower input token cost: 11.33M versus Codex 12.73M
- lower output token cost: 48.3K versus Codex 73.1K
- fewer tool calls: 159 versus Codex 218
- narrower scope: it does not touch `docs/rfcs/tool-result-envelope.md`
- preserves the existing `exec_command_auto_promotes_long_running_command_task`
  regression test; Codex rewrites that test while adding duplicate-policy tests
- adds explicit runtime turn summarization for `already_running`, so compact
  command receipts preserve the new disposition clearly

Codex is slightly faster in wall time, but the PR has two review risks:

- it broadens docs scope into the ToolResult RFC even though #1257 is a command
  startup behavior issue
- it rewrites an existing auto-promotion regression test, which makes the test
  delta harder to review

### Verification Notes

The benchmark framework no longer runs local raw verification for real-repo PR
benchmarks. It records GitHub CI instead.

Live PR checks:

- #1265: Rust passed, Coverage passed, Vercel passed
- #1266: Rust passed, Coverage passed, Vercel passed
- `Run Holon / solve` is a trigger workflow and may remain pending longer than
  the code-quality checks; it was not used as the primary selection signal.

### Recommendation

Prefer keeping **Holon PR #1266** as the official PR.

Reasons:

- more focused diff
- lower token and tool-call cost
- keeps existing auto-promotion coverage intact
- includes turn-summary handling for the new result disposition
- GitHub Rust/Coverage checks are green

Close **Codex PR #1265** as superseded by #1266.

## MemorySearch Task Receipt Discovery (#1261)

Run label:

- `.benchmark-results/memory-task-receipts-2026-05-19-1261-r1`

Task:

- Issue: [#1261](https://github.com/holon-run/holon/issues/1261)
- Goal: index compact latest reduced task records for historical task
  discovery through `MemorySearch`, and make the returned `task:<id>`
  source refs retrievable through `MemoryGet`.
- Model: `openai-codex/gpt-5.3-codex-spark`

Raw benchmark results:

| Runner | PR | Draft | GitHub CI | Changed Files | Additions | Deletions | Duration | Input Tokens | Cached/Read Input | Output Tokens | Rounds |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Holon | [#1268](https://github.com/holon-run/holon/pull/1268) | yes | all pass | 2 | 215 | 4 | 1,538.9s | 4,612,282 | 3,961,216 | 23,001 | 80 provider rounds |
| Codex | [#1267](https://github.com/holon-run/holon/pull/1267) | yes | all pass | 3 | 412 | 5 | 2,265.7s | 16,663,093 | n/a | 60,202 | 1 CLI turn |

Changed files:

- Holon: `src/memory/index.rs`, `src/storage.rs`
- Codex: `src/memory/index.rs`, `src/storage.rs`,
  `src/tool/tools/memory_get.rs`

### Implementation Comparison

Both implementations add the core reduced-task indexing path:

- task appends dirty the memory index
- `MemorySearch` indexes one latest task document per task id
- task documents include task id, kind, status, summary, work item metadata, and
  command identity
- tests cover latest-snapshot reduction and command identity search

Codex is the stronger PR to keep despite higher token cost:

- it updates `MemoryGet`'s model-facing `source_ref` allowlist to accept
  `task:<id>`
- it adds direct tool-level coverage for `MemoryGet` with a task source ref
- it covers more acceptance paths: task id, summary, command fragment,
  `cmd_digest`, work item metadata, latest snapshot, and `MemoryGet`
- GitHub Rust, Coverage, and Holon trigger checks are green

Holon is much cheaper and faster:

- 4.61M input tokens versus Codex 16.66M
- 23.0K output tokens versus Codex 60.2K
- 71 tool calls versus Codex 151
- about 25.6 minutes versus about 37.8 minutes for Codex

But #1268 misses an important acceptance boundary: it verifies the lower-level
`get_memory()` path for `task:<id>`, while the actual model-facing
`MemoryGet` tool still rejects `task:` because `src/tool/tools/memory_get.rs`
is unchanged. That means an agent could receive a `task:<id>` search result
and still fail to fetch it through the intended tool.
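
The gap can be illustrated with a tool-level allowlist check. This sketch is hypothetical throughout: `parse_source_ref` and the `note` kind are invented for the example, and the real `MemoryGet` validation may be structured differently. It only shows why a storage-level `task:<id>` lookup is not enough when the tool entry point filters kinds first.

```rust
/// Split a `kind:id` source ref and check the kind against an allowlist.
/// The allowlist is a parameter here so both behaviors can be shown.
fn parse_source_ref<'a>(
    source_ref: &'a str,
    allowed_kinds: &[&str],
) -> Result<(&'a str, &'a str), String> {
    let (kind, id) = source_ref
        .split_once(':')
        .ok_or_else(|| format!("malformed source_ref `{}`", source_ref))?;
    if id.is_empty() {
        return Err(format!("source_ref `{}` has an empty id", source_ref));
    }
    if !allowed_kinds.contains(&kind) {
        // The tool rejects the ref before any lower-level lookup runs.
        return Err(format!("unsupported source_ref kind `{}`", kind));
    }
    Ok((kind, id))
}
```

With an allowlist of `["note"]` (a made-up kind), `task:abc123` is rejected at the tool boundary even if the storage layer could resolve it; adding `task` to the allowlist is the missing step.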

### Verification Notes

Both PRs have green GitHub checks:

- #1267: Rust passed, Coverage passed, Run Holon / solve passed
- #1268: Rust passed, Coverage passed, Run Holon / solve passed

One benchmark-framework observation: Holon committed locally before framework
finalization, so the artifact recorded `pr_status = skipped_no_changes` and no
PR. I pushed the benchmark branch manually and created #1268 from the Holon
commit so the run could be compared on GitHub.

### Recommendation

Prefer keeping **Codex PR #1267** as the official PR.

Reasons:

- complete model-facing `MemoryGet task:` support
- broader acceptance coverage
- all GitHub checks are green

Close **Holon PR #1268** as superseded by #1267. Holon had a substantially
better token/runtime profile, but the implementation misses the tool-entry
acceptance criterion.