{"id": "X0X-0001", "identifier": "X0X-0001", "title": "Bootstrap non-Linear Symphony workflow for x0x", "description": "Create the repo-owned WORKFLOW.md and git-committed issue database scaffold used by the first x0x-symphony runner prototype. This intentionally avoids Linear and prepares for a future x0x CRDT-backed tracker adapter.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["x0x-symphony", "workflow", "tracker-git"], "blocked_by": [], "created_at": "2026-04-28T00:00:00Z", "updated_at": "2026-04-28T00:00:00Z", "links": [{"kind": "design", "url": "../x0x-symphony/docs/design/symphony.md", "note": "Authoritative architecture for x0x-symphony"}, {"kind": "adr", "url": "../x0x-symphony/docs/adr/0001-tracker-abstraction.md"}, {"kind": "adr", "url": "../x0x-symphony/docs/adr/0002-sharded-claim-ttl.md"}, {"kind": "adr", "url": "../x0x-symphony/docs/adr/0003-no-external-tracker-v1.md"}, {"kind": "adr", "url": "../x0x-symphony/docs/adr/0004-x0x-tasklist-as-backbone.md"}], "acceptance": ["WORKFLOW.md exists at the repository root", "Workflow uses tracker.kind=git_issues instead of Linear", "issues/issues.jsonl exists and contains machine-readable records", "issues/schema.md documents states, fields, and future x0x mapping"], "validation": ["Review WORKFLOW.md front matter and prompt for consistency", "Review issues/schema.md and issues/issues.jsonl for JSONL validity"], "handoff": {"summary": "Initial non-Linear Symphony workflow and git issue database scaffold created for x0x. Open architectural questions in the original handoff are now answered in the sibling x0x-symphony repo: GitHub adapter is rejected (ADR-0003), JSONL→CRDT mapping is locked (ADR-0004), and tracker abstraction is fixed (ADR-0001).", "files_changed": ["WORKFLOW.md", "issues/README.md", "issues/schema.md", "issues/issues.jsonl"], "validation": [{"command": "python3 - <<'PY'\nimport json, pathlib\nfor line in pathlib.Path('issues/issues.jsonl').read_text().splitlines():\n if line.strip():\n json.loads(line)\nPY", "status": "passed"}], "follow_up": ["Architecture decisions are now locked in ../x0x-symphony/docs/adr/0001..0004.", "WORKFLOW.md updated to use the harness-agnostic runner: block; legacy codex: block kept for compatibility and slated for deprecation in M4.", "issues/schema.md extended with shard and claim fields used by x0x-symphony's M2.", "M1 implementation issues live in ../x0x-symphony/issues/issues.jsonl as XSY-0002..XSY-0008."]}}
{"id": "X0X-0002", "identifier": "X0X-0002", "title": "Self-DM short-circuit in send_direct_with_config", "description": "## Symptom\nWhen `/direct/send` is called with `agent_id == self.agent_id`, the daemon returns `{\"error\":\"peer_disconnected\",\"detail\":\"closed: ReaderExit\"}`. Reproduced live on nyc bootstrap (saorsa-2) by issuing `POST /direct/send` with the daemon's own agent_id as recipient.\n\n## Root cause\n`Agent::send_direct_with_config` (`src/lib.rs:2828`) has no self-DM short-circuit. For self as recipient:\n- `capability_store.lookup(to)` returns `None` (a daemon does not advertise capabilities to itself), so `gossip_ok = false`.\n- `prefer_raw_quic_if_connected: false` (new default) skips the preferred-raw branch, so `preferred_raw_err = None` and `preferred_raw_receipt = None`.\n- Dispatch falls through to the final else branch which calls `send_direct_raw_quic(self, ...)`.\n- ant-quic has no self-connection — returns `peer_disconnected: ReaderExit`.\n\nPre-existing behaviour (raw-first default) hit the same dead end via a different code path. This is not a regression introduced by the second-pass patch — but it was exposed by the new Phase A harness pattern in `tests/e2e_vps_mesh.py` where the anchor is also one of the runners and result envelopes from the anchor's runner are addressed to the orchestrator (= the anchor's own agent_id).\n\n## Evidence\n- VPS deploy run 2026-05-01T20:13:55Z (commit 2a9949a + working-tree second-pass patch); rerun 2026-05-01T20:23:29Z.\n- nyc anchor `journalctl -u x0x-test-runner.service` shows repeated `WARNING runner[nyc] DM result to da2233d6ba2f9569… failed, falling back to pubsub` — one per nyc-originated send_result. Each retry path is 3× attempts at `PUBLISH_RETRY_BACKOFF_SECS * attempt`, so this serializes nyc's results behind the fallback, increasing the chance of a settle-window miss.\n- `python3 tests/e2e_vps_mesh.py --anchor nyc` reported `Sent: 29/30, Received: 30/30, Send fails: 1` with the missed pair `nyc-singapore` — destination delivered, only the source's confirmation envelope went missing because legacy pubsub fallback is more lossy than the primary DM path.\n\n## Fix\nShort-circuit at the top of `send_direct_with_config`: if `to == self.identity.agent_id()`, deliver the payload directly to the local direct event bus (the same path `recv_direct_annotated` consumes) without going through the network stack. Construct a `DmReceipt` with `path = DmPath::Loopback` (new variant) so callers can distinguish.\n\nTouchpoints:\n- `src/dm.rs` — add `DmPath::Loopback` variant.\n- `src/lib.rs:2828` — add the short-circuit before the rtt_hint/capability lookup.\n- `src/direct.rs` — expose a fast-path enqueue API onto the direct event channel.\n- `src/dm_send.rs` — receipt helper for the loopback path.\n\n## Why now\nThe Phase A all-pairs harness will keep flaking on whichever node is the anchor until this is fixed. Any external client that runs both the daemon and an agent in the same process and addresses self for diagnostic / loopback messaging hits the same wall.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["dm", "transport", "regression-mask", "vps-bootstrap"], "blocked_by": [], "created_at": "2026-05-01T20:35:00Z", "updated_at": "2026-05-01T21:02:36Z", "acceptance": ["POST /direct/send with self agent_id returns 200 ok with a receipt whose path is the new Loopback variant", "Recipient's /direct/events SSE stream emits the message envelope identically to a remote DM", "tests/runners/x0x_test_runner.py self-DM result envelopes succeed without falling back to legacy pubsub", "New unit/integration test in src/lib.rs `tests` module verifying self-DM (analogous to `connected_peer_clears_stale_lifecycle_block_before_raw_send`)", "Phase A all-pairs matrix on 6-node VPS mesh: Sent == Received == 30/30 over 3 consecutive runs"], "validation": ["cargo nextest run --all-features -E 'test(self_dm) | test(direct)'", "python3 tests/e2e_vps_mesh.py --anchor nyc --discover-secs 45 --settle-secs 60 (3 consecutive clean runs)", "ssh root@saorsa-2 'journalctl -u x0x-test-runner.service --since=<run-start> | grep -c \"falling back to pubsub\"' returns 0"], "links": [{"kind": "evidence", "url": "see ticket description", "note": "VPS deploy + Phase A run on 2026-05-01"}, {"kind": "code", "url": "src/lib.rs:2828", "note": "send_direct_with_config dispatcher"}, {"kind": "code", "url": "src/lib.rs:2922", "note": "fallthrough else branch that hits raw self-DM"}], "handoff": {"summary": "Added a true self-DM loopback path. send_direct_with_config now short-circuits self-addressed DMs before RTT/capability/offline checks, enqueues through DirectMessaging subscriber/internal delivery, returns DmPath::Loopback, and surfaces loopback in REST/direct diagnostics.", "files_changed": ["src/dm.rs", "src/dm_send.rs", "src/direct.rs", "src/lib.rs", "src/bin/x0xd.rs"], "validation": [{"command": "cargo nextest run --all-features -E 'test(self_dm) | test(direct)'", "status": "passed"}, {"command": "cargo nextest run --all-features -E 'test(self_dm) | test(warn_forward_channel_pressure) | test(recv_pump)'", "status": "passed"}, {"command": "just fmt-check", "status": "passed"}, {"command": "just lint", "status": "passed"}, {"command": "just test", "status": "passed"}, {"command": "python3 tests/e2e_vps_mesh.py --anchor nyc --discover-secs 45 --settle-secs 60 (3 runs)", "status": "not_run", "note": "requires live VPS deployment/mesh window"}], "follow_up": ["Run the 3 consecutive VPS Phase A mesh checks from the ticket before closing as done."]}}
{"id": "X0X-0003", "identifier": "X0X-0003", "title": "INFO trend signal in warn_forward_channel_pressure misses production saturation pattern", "description": "## Symptom\nProduction saturation of `recv_pubsub_tx` on VPS bootstrap nodes consistently triggers the >80% WARN log but never triggers the >50% INFO trend signal. Across a 4-min Phase A run on the 6-node VPS mesh: 37 WARN events, 0 INFO events.\n\n## Root cause\n`warn_forward_channel_pressure` in `src/network.rs:223` gates the INFO branch on:\n\n```rust\nlet bucket = (max / 10).max(1);\nif used.is_multiple_of(bucket) {\n info!(...)\n}\n```\n\nWith `max = 10000`, `bucket = 1000`, so INFO only fires when `used` lands exactly on 5000, 6000, 7000, 8000, or 9000 at the moment a forward call samples it. The actual production saturation pattern jumps from low-usage to `used = 9999..10000` between two consecutive forward calls (per-peer channel fills inside one send burst), so `used` never lands on the 1000-multiple boundaries during the climb. The INFO branch is dead code under real load.\n\n## Evidence\n- VPS deploy run 2026-05-01T20:13:55Z (commit 2a9949a + working-tree second-pass patch); rerun 2026-05-01T20:23:29Z.\n- Per-node WARN counts (>80%): nyc=2, sfo=4, helsinki=0, nuremberg=0, singapore=10, sydney=21. Per-node INFO counts (>50%): all 0.\n- All WARN entries report `used = 9999` or `used = 10000` (used_pct = 99 or 100). No WARN entry has `used` between 5000 and 9000.\n\n## Fix options\n1. **Time-rate-limited sampling** (recommended). Track per-channel `last_info_at: Instant` (e.g., on `NetworkNode` itself or in a `OnceLock<Mutex<HashMap<&'static str, Instant>>>` keyed on channel_name). Emit INFO when `used > max/2 && now - last_info_at > Duration::from_secs(30)`. Caps log volume to N events per channel per run.\n2. **Threshold-edge sampling**. Track per-channel `last_used_pct: AtomicUsize` and emit INFO when crossing into a higher 10% bucket (50→60, 60→70, etc.). Captures the climb shape but spammy on oscillation.\n3. **Sampled probabilistic** — emit INFO with probability `(used_pct - 50) / 50` once above 50%. Cheap, no state, but produces dust at low-pressure thresholds.\n\nOption 1 is the right shape for the operator audience: rare, deterministic, contains trend information.\n\n## Why this matters\nWithout an early signal the operator only learns about queue pressure when it is already at saturation — same blind spot the WARN was supposed to address but one threshold lower. The current INFO branch is dead code that gives a false sense of graduated observability.", "priority": 3, "state": "review", "branch_name": null, "url": null, "labels": ["observability", "network", "bug"], "blocked_by": [], "created_at": "2026-05-01T20:35:00Z", "updated_at": "2026-05-01T21:02:36Z", "acceptance": ["Synthetic local stress test that climbs `recv_pubsub_tx` past max/2 emits at least one INFO trend event before saturation", "Same VPS Phase A run that produced 37 WARNs and 0 INFOs now produces non-zero INFO events on the same nodes", "INFO event volume per channel per run is bounded (no more than ~10 INFOs per channel per minute under sustained pressure)", "WARN >80% behaviour is unchanged"], "validation": ["cargo test -p x0x --lib warn_forward_channel_pressure", "bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000 then grep INFO + WARN counts in proofs/", "bash tests/e2e_deploy.sh --mesh-verify then re-harvest with /tmp/harvest-vps-pressure.sh and confirm INFO > 0"], "links": [{"kind": "code", "url": "src/network.rs:223", "note": "warn_forward_channel_pressure helper"}], "handoff": {"summary": "Replaced exact bucket-boundary INFO sampling with deterministic per-channel/per-stream time-rate-limited sampling. INFO now fires on the first sample above 50%, including direct jumps to >80% saturation, while the existing >80% WARN condition remains unchanged.", "files_changed": ["src/network.rs"], "validation": [{"command": "cargo test -p x0x --lib warn_forward_channel_pressure", "status": "passed"}, {"command": "cargo nextest run --all-features -E 'test(self_dm) | test(warn_forward_channel_pressure) | test(recv_pump)'", "status": "passed"}, {"command": "just fmt-check", "status": "passed"}, {"command": "just lint", "status": "passed"}, {"command": "just test", "status": "passed"}, {"command": "VPS Phase A/B pressure re-harvest", "status": "not_run", "note": "requires live VPS deployment/mesh window"}], "follow_up": ["After VPS deploy, confirm nodes with saturation WARNs now also produce non-zero >50% INFO trend events."]}}
{"id": "X0X-0004", "identifier": "X0X-0004", "title": "Structural recv_pubsub_tx saturation on VPS bootstrap nodes — 10× buffer is mitigation, not fix", "description": "## Symptom\nOn the 6-node VPS bootstrap mesh, `recv_pubsub_tx` saturates to `used_pct = 100` sustained for tens of seconds at a time on far-from-anchor nodes. Across a 4-min Phase A + Phase B run: 37 saturation WARNs distributed nyc=2, sfo=4, helsinki=0, nuremberg=0, singapore=10, sydney=21. The 1024 → 10000 capacity bump merged in the second-pass patch (`src/network.rs:307`) does not prevent saturation — it raises the ceiling and delays the choke, but the underlying recv-pump throughput cannot keep pace with cross-region fanout under sustained gossip load.\n\n## Why this matters\nZero drops are observed (the `mpsc::Sender::send().await` back-pressures the producer rather than dropping), so on the surface the system is correct. But back-pressure propagates upstream into ant-quic's recv reader task, stalling the entire QUIC receive pipeline for the duration of the saturation. Concrete consequences:\n- Phase A `nyc-singapore` send_result envelope went missing (1/30 fail in receive matrix context) because the singapore daemon's recv pump was stalled on its 10×10000 saturated queue at the moment the fallback pubsub publish arrived.\n- Any latency-sensitive control message (lease renewal for exec sessions, SWIM ping ack, presence beacon) on the same connection blocks behind the saturated channel.\n- Memory cost is now ~10× per peer × per stream-type (10000 × payload-arc-overhead). On a bootstrap node with 7 peers × 4 stream types × 10K queue depth, that is ~280K queued messages of headroom — multi-MB to multi-GB depending on actual payload retention. Headroom we cannot drain.\n\n## Evidence\n- VPS deploy run 2026-05-01T20:13:55Z (commit 2a9949a + working-tree second-pass patch); rerun 2026-05-01T20:23:29Z.\n- VPS log harvest via `/tmp/harvest-vps-pressure.sh`: every saturation event reports `available=0..1, used=9999..10000, used_pct=99..100, channel=\"recv_pubsub_tx\", stream=Some(PubSub)`.\n- Geographic correlation: saturation rate ~ RTT to anchor. sydney (250 ms RTT to nyc): 21 events. singapore (220 ms): 10. sfo (70 ms): 4. helsinki/nuremberg (~110 ms via EU peering): 0. The slow consumer side is the long-RTT receiver, not the publisher.\n- The previous v0.18.3 fix bumped `NetworkNode::recv_tx` 128 → 10000 to handle a different stall (PubSubManager subscriber lock + EAGER fan-out). That fix landed at the transport layer; this one is one layer up at the per-peer recv forward channel inside x0x. Same underlying shape: single-consumer mpsc that can't drain at fanout rate.\n\n## Investigation needed\nBefore picking a fix, instrument the actual choke point. Add diagnostics for:\n- Per-peer per-stream-type producer rate (`tx.send` calls/s).\n- Per-stream-type consumer drain rate (`rx.recv` calls/s, latency to drain).\n- Median + p99 dwell time inside the channel.\n- Subscriber count per topic and which subscriber is the slowest consumer (which is the real choke: gossip-pubsub subscribers fan out one mpsc per subscription downstream of this channel).\n\nHypothesis to validate: the choke is the single shared `recv_pubsub_rx` consumer task in `saorsa_gossip_transport`'s adapter — every received pubsub frame is decoded, ML-DSA-verified, and re-fanned-out to per-subscription mpsc channels by one task. Under fanout load (one msg → N subscribers × per-sub mpsc(10000) sends), that single decode/verify/fanout loop is the rate limit.\n\n## Fix options (after instrumentation)\n1. **Parallelize the recv pump per stream-type or per peer**. Multiple decode/verify workers feeding off `recv_pubsub_rx`. Requires reshaping the saorsa-gossip adapter.\n2. **Drop-oldest under sustained pressure with a counter**. Convert to `try_send` with `Full(_) → drop and bump `recv_pubsub_dropped` atomic`. Expose drops via `/diagnostics/gossip`. Operator gets a real signal; pubsub reliability degrades gracefully under overload instead of stalling the whole transport.\n3. **Bound producer side by per-peer rate quota**. Reject pubsub frames from a peer whose channel is > 80% full for more than N seconds — surfaces as a peer-level signal (IHAVE retransmit later) instead of transport-level stall.\n4. **Increase per-subscription mpsc(10000) in saorsa_gossip_pubsub** if profiling shows that is the actual choke (likely contributes — subscriber bound to PubSubManager is the ultimate consumer).\n\nRecommended order: instrument first, then prototype option 2 (drop-oldest with counter) as the smallest change with the biggest signal-to-noise ratio. Option 1 is the right long-term shape but invasive.\n\n## Acceptance bar\nSame Phase A + Phase B VPS run produces no sustained `used_pct=100` for more than 5 consecutive seconds on any node, OR produces a non-zero drop counter that the operator can act on. The current state — silent stall masquerading as zero-drop correctness — is not acceptable for production.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["network", "performance", "vps-bootstrap", "structural"], "blocked_by": [], "created_at": "2026-05-01T20:35:00Z", "updated_at": "2026-05-01T21:02:36Z", "acceptance": ["Per-peer per-stream-type producer/consumer rate metrics exposed on /diagnostics/gossip or new /diagnostics/recv_pump endpoint", "Decision recorded in an ADR (drop-oldest vs parallel pump vs producer rate-quota) with profiling data backing it", "Same Phase A + B VPS run sustains no recv_pubsub_tx saturation > 5s OR exposes a drop counter the operator can act on", "WARN volume per node per minute drops by at least 80% on sydney (worst-case node in 2026-05-01 baseline)"], "validation": ["Repeat /tmp/harvest-vps-pressure.sh after fix lands and compare WARN counts vs the 2026-05-01 baseline (nyc=2 sfo=4 helsinki=0 nuremberg=0 singapore=10 sydney=21)", "bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000 with new diagnostics enabled, capture stress-report.json deltas", "Memory RSS growth on saorsa-9 (sydney) over a 30-min sustained Phase A + Phase B loop stays within 2× steady-state baseline"], "links": [{"kind": "code", "url": "src/network.rs:307", "note": "data_channel_capacity(10_000) bump"}, {"kind": "code", "url": "src/network.rs:283-294", "note": "per-stream-type recv_*_tx mpsc senders"}, {"kind": "code", "url": "src/network.rs:223", "note": "warn_forward_channel_pressure helper"}, {"kind": "memory", "url": "memory/x0x_v0_18_3_fanout_stall_fixed.md", "note": "previous transport-layer recv_tx 128 → 10000 bump"}, {"kind": "blocked-by-prerequisite", "url": "X0X-0003", "note": "INFO trend fix is prerequisite for clean before/after telemetry"}], "handoff": {"summary": "Added receive-pump diagnostics under /diagnostics/gossip.recv_pump and implemented the first overload mitigation: PubSub forwarding now uses try_send, increments visible full-drop counters instead of stalling ant-quic receive draining, while Membership/Bulk retain blocking sends. ADR 0009 records the decision and baseline evidence.", "files_changed": ["src/network.rs", "src/lib.rs", "src/bin/x0xd.rs", "docs/adr/0009-recv-pump-overload-policy.md", "docs/adr/README.md"], "validation": [{"command": "cargo test -p x0x --lib recv_pump", "status": "passed"}, {"command": "cargo nextest run --all-features -E 'test(self_dm) | test(warn_forward_channel_pressure) | test(recv_pump)'", "status": "passed"}, {"command": "just fmt-check", "status": "passed"}, {"command": "just lint", "status": "passed"}, {"command": "just test", "status": "passed"}, {"command": "bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000", "status": "not_run", "note": "not run in this pass; full local/VPS stress proof still required"}, {"command": "bash tests/e2e_deploy.sh --mesh-verify and VPS pressure harvest", "status": "not_run", "note": "requires live VPS deployment/mesh window"}], "follow_up": ["Run stress and VPS Phase A+B proof loops to compare recv_pump.pubsub.dropped_full and WARN counts against the 2026-05-01 baseline.", "If PubSub drops are unacceptable, prototype parallel PubSub decode/verify/fanout workers as described in ADR 0009."]}}
{"id": "X0X-0005", "identifier": "X0X-0005", "title": "Parallel PubSub decode/verify/fanout workers downstream of recv_pubsub_rx", "description": "## Symptom\nAfter 8 hours of normal 6-node bootstrap-mesh operation, the PubSub dispatch loop on every VPS node falls behind sustained inbound rate and the receive forward channel fills to ~100% with sustained drops. VPS Phase A all-pairs DM matrix fails with publish-side timeouts (`POST /publish` returns `timed out after 1 retries over 12s`), discovery republishes time out, and SWIM repeatedly marks cross-region peers dead despite ant-quic reporting them connected. A coordinated daemon restart immediately restores the mesh: producer rate jumps 10×, consumer rate matches, drops go to zero, and Phase A is 30/30 again. The cycle then repeats over hours.\n\n## Hard data — nyc bootstrap (saorsa-2), 8h vs fresh restart\n(captured via `GET /diagnostics/gossip` → `recv_pump.pubsub` and `dispatcher.pubsub`, both added in cc5c3b6 / e876b0d as part of X0X-0004 to surface exactly this failure mode.)\n\n| Metric | 8h saturated state | Fresh restart (3 min) | Δ |\n|---|---|---|---|\n| recv_pubsub_tx latest_depth | 9995 / 10000 | 1 / 10000 | ×9995 |\n| recv_pubsub_tx max_depth | 10000 | 79 | ×127 |\n| producer_per_sec | 3.47 | 38.27 | ÷11 |\n| consumer_per_sec | 1.29 | 38.27 | ÷30 |\n| dropped_full | 56,155 (53% of produced) | 0 | — |\n| avg_dwell_ms | 786,140 (13 min) | 3 | ÷262,000 |\n| max_dwell_ms | 29,378,617 (8h) | 180 | ÷163,000 |\n| dispatcher.pubsub.timed_out | 953 | 0 | — |\n| dispatcher.pubsub.max_elapsed_ms | 30,014 | 195 | ÷154 |\n| dispatcher.pubsub.received | 39,043 | 5,151 | — |\n| dispatcher.pubsub.completed | 38,087 | 5,151 | — |\n\nTop per-peer producers in the saturated state (8h, drop ratio in parentheses):\n\n| Peer | pubsub_produced | pubsub_dropped_full |\n|---|---:|---:|\n| 6a24bdedd… (nuremberg) | 25,218 | 12,630 (50%) |\n| b7a23e48a… (sydney) | 24,519 | 14,096 (57%) |\n| dc090fd3d… (sfo) | 21,888 | 14,240 (65%) |\n| bcf43dc02… (singapore) | 12,502 | 4,316 (35%) |\n| 16f0bf033… (helsinki) | 10,880 | 3,140 (29%) |\n| b2606ba6d… (external) | 8,073 | 7,408 (92%) |\n\nAggregate: 105,192 produced over ~30,000 s = ~3.5/s steady-state from the cross-region bootstrap mesh under no synthetic test load. Membership traffic is far heavier (~13/s aggregate, 99,759 in 8h) but drops zero because Membership uses blocking `tx.send().await` per ADR 0009 §2.\n\n## Why this surfaces now\nThree recent commits made this measurable and unblocked diagnosis:\n\n- **e876b0d (X0X-0004)** added the `recv_pump` diagnostics block to `/diagnostics/gossip` (produced/enqueued/dequeued/dropped/depth/dwell/rates/per-peer). Without these counters the saturation was invisible until DM delivery itself failed.\n- **e876b0d** switched PubSub forwarding to `mpsc::Sender::try_send`, which converts queue-full from a producer-blocking event (which back-pressures ant-quic recv reader and stalls the entire transport) into a counted drop. Without that change we would have seen ant-quic stalls instead of measurable drop counts.\n- **5e482fb (X0X-0003 follow-up)** rate-limited the >80% pressure WARN, so journald no longer hides the dispatcher.pubsub.timed_out signal under per-call WARN spam.\n\nTogether these mean the team can now point at a single concrete metric (`recv_pump.pubsub.consumer_per_sec ≪ producer_per_sec`) as the root cause of the mesh degrading over hours.\n\n## Root cause\n`src/gossip/runtime.rs:204` — `run_pubsub_dispatcher` is a single tokio task with this shape:\n\n```rust\nloop {\n match network.receive_pubsub_message().await {\n Ok((peer, data)) => {\n // ... record dequeue, dequeue_total ++ ...\n match tokio::time::timeout(\n PUBSUB_MESSAGE_HANDLE_TIMEOUT, // 30 s\n pubsub.handle_incoming(peer, data),\n ).await { ... }\n }\n Err(e) => break,\n }\n}\n```\n\nSequential. Every PubSub frame's `pubsub.handle_incoming` runs to completion (or 30 s timeout) before the next frame is dequeued. Inside `handle_incoming` (`src/gossip/pubsub.rs`):\n\n1. Decode bincode envelope + signed header.\n2. ML-DSA-65 signature verification (`verify_signature` at `src/gossip/pubsub.rs:797`) — single-threaded crypto.\n3. PlumTree dedupe + IHAVE bookkeeping.\n4. EAGER fanout to N peer subscribers — synchronous `tx.send().await` to each subscriber's per-subscription mpsc(N) channel; if any subscriber's channel is full the whole dispatcher waits for that subscriber to drain.\n5. Republish on the mesh — synchronous `network.send_pubsub` per fanout target.\n\nStep 4 is the most likely 30-s offender in production: a single slow subscriber (e.g., an SSE consumer on `/events` that hasn't been read from in minutes) blocks the entire dispatcher. ML-DSA verification is fast (<10 ms even on slow VPS); the dispatcher cannot legitimately spend 30 s on cryptography. The 953 timeouts × ~30 s ≈ 28,590 s of dispatcher CPU lost over 8h of uptime ≈ 95% of dispatcher cycles stuck waiting on something downstream of step 4 or 5.\n\nMembership and Bulk use the same loop shape (`run_membership_dispatcher`, `run_bulk_dispatcher`) but with shorter timeouts (5 s) and lower steady-state rates, so they don't visibly fall behind in this regime. They will hit the same wall under enough load.\n\n## Fix shape\nSpawn N concurrent worker tasks (target N = `tokio runtime worker threads / 2`, capped at e.g. 8) that share the `recv_pubsub_rx` consumer. Each worker independently pulls one frame, decodes, verifies, fans out, and republishes. The single mpsc receiver becomes a work queue.\n\nCritical correctness invariants the implementation must preserve:\n\n1. **PlumTree IHAVE/IWANT/dedupe state** is shared mutable state inside `PubSubManager`; concurrent `handle_incoming` calls must hold the appropriate locks for the shortest window possible. Validate that two workers concurrently observing the same `msg_id` do not double-republish.\n2. **Subscriber broadcast ordering**: per-subscriber order from any single sender peer should be preserved (or explicitly relaxed in an ADR addendum). With N workers consuming an unordered work queue, frames from peer P may complete out of arrival order. Decide whether x0x's PubSub semantics require per-(sender, topic) FIFO and pin to a single worker per (sender, topic) hash if so.\n3. **Timeout semantics**: the existing 30 s per-message timeout becomes per-worker; one stuck subscriber still pins one worker for 30 s but the other N-1 workers continue draining. Acceptable.\n4. **Back-pressure**: if all N workers are stuck, the queue fills again. The X0X-0004 `try_send` drop policy on the producer side remains the safety net. The fix raises throughput, it does not change the overload behaviour.\n\n## Smaller mitigations (do alongside, not instead)\n- **Per-subscriber timeout on the EAGER fanout**: wrap the inner `subscriber_tx.send().await` in a 250 ms `tokio::time::timeout` and drop+counter on a slow subscriber rather than letting it pin the dispatcher. This is a 5-line change and would have prevented the 8 h saturation cascade we just observed; pair it with a counter on `PubSubManager` for slow-subscriber drops so the operator can see which subscriber is the choke (e.g., a long-running SSE consumer that stopped reading).\n- **Dwell-based health signal**: surface `recv_pump.pubsub.avg_dwell_ms > 1000` as a /diagnostics/health amber signal so operators see degradation before delivery fails.\n\n## Acceptance bar\n1. Under the same 6-node bootstrap mesh holding ~3.5 inbound PubSub/s steady-state, `consumer_per_sec >= producer_per_sec` over a 24 h window.\n2. `recv_pump.pubsub.dropped_full` does not exceed 1% of `produced_total` over a 24 h window in steady state.\n3. `recv_pump.pubsub.avg_dwell_ms < 100` p95 over the window.\n4. `dispatcher.pubsub.timed_out` rate < 1 per minute over the window (currently 953 / 30,000 s ≈ 1.9 per minute).\n5. VPS Phase A all-pairs matrix passes 30/30 after 24 h of mesh uptime without restart (currently fails after ~6–8 h).\n6. Existing PlumTree dedupe + republish semantics preserved (covered by `crdt_partition_tolerance.rs` and the gossip integration tests).\n\n## Validation plan\n- New benchmark `benches/gossip_dispatch_throughput.rs` measures messages/sec at the `pubsub.handle_incoming` boundary, with synthetic subscribers of varying slowness (0 ms, 100 ms, 1 s, blocked). Compare baseline vs N-worker variants for N ∈ {1, 2, 4, 8}.\n- Stress test: extend `tests/e2e_stress_gossip.sh` with a `--slow-subscriber` flag that subscribes via SSE and sleeps inside the consumer; assert dispatcher throughput remains > 80% of baseline with one slow subscriber per topic.\n- VPS soak: deploy the change to bootstrap, capture `/diagnostics/gossip` snapshots every 30 min for 48 h, attach to `proofs/X0X-0005/<run-id>/`. Expect drop_full = 0 and dwell stable under 100 ms p95.\n- Existing tests must still pass: `cargo nextest run --all-features --workspace`, `bash tests/e2e_dogfood_local.sh`, `bash tests/e2e_feature_parity.sh`, `bash tests/e2e_comprehensive.sh`.\n\n## Risk + rollback\nConcurrent `handle_incoming` is the highest-risk change in the PubSub layer this year. PlumTree's deduplication and IHAVE/IWANT scheduling are subtle. Rollback is mechanical (reduce N to 1) and the X0X-0004 drop counters give a clear monitor for regressions. Land behind a config flag (`gossip.dispatch_workers: Option<u32>`, default 1 for one release cycle), bake on bootstrap for 48 h, then change the default.\n\n## Why now and not earlier\nADR 0009 §Follow-up named this work as conditional: *\"if VPS proof runs still show unacceptable PubSub loss or control-plane latency, prototype the next structural option: parallel PubSub decode/verify/fanout workers downstream of `recv_pubsub_rx`.\"* The 2026-05-02 8-hour saturation event is that condition met with concrete telemetry. Filing now while the evidence is fresh.", "priority": 2, "state": "in_progress", "branch_name": null, "url": null, "labels": ["network", "performance", "vps-bootstrap", "structural", "gossip"], "blocked_by": [], "created_at": "2026-05-02T07:20:00Z", "updated_at": "2026-05-03T00:00:00Z", "acceptance": ["consumer_per_sec >= producer_per_sec over 24 h on the 6-node bootstrap mesh under steady-state load", "recv_pump.pubsub.dropped_full <= 1% of produced_total over a 24 h window", "recv_pump.pubsub.avg_dwell_ms p95 < 100 over the window", "dispatcher.pubsub.timed_out rate < 1 per minute over the window", "VPS Phase A all-pairs matrix passes 30/30 after 24 h of mesh uptime without restart", "PlumTree dedupe + republish semantics preserved (existing crdt_partition_tolerance + gossip integration tests pass)", "Per-(sender, topic) FIFO ordering decision recorded in an ADR addendum to 0009 (or a new ADR)", "Worker count exposed as a config knob (gossip.dispatch_workers, default 1 for one release cycle)"], "validation": ["cargo nextest run --all-features --workspace (1074+ tests, no regressions)", "cargo bench --bench gossip_dispatch_throughput (new benchmark; baseline vs N-worker variants)", "bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000 --slow-subscriber (new flag)", "bash tests/e2e_dogfood_local.sh + bash tests/e2e_feature_parity.sh (regression smoke)", "VPS deploy + 48 h soak with /diagnostics/gossip snapshots every 30 min, attach to proofs/X0X-0005/<run-id>/", "VPS Phase A 30/30 immediately after deploy, again at 24 h, again at 48 h", "ssh root@saorsa-N 'curl /diagnostics/gossip | jq .recv_pump.pubsub' on every node — assert dropped_full / produced_total < 0.01"], "links": [{"kind": "adr", "url": "docs/adr/0009-recv-pump-overload-policy.md", "note": "Follow-up explicitly named in this ADR"}, {"kind": "ticket", "url": "X0X-0004", "note": "Diagnostics that surfaced this; X0X-0004 is the prerequisite measurement work"}, {"kind": "code", "url": "src/gossip/runtime.rs:204", "note": "run_pubsub_dispatcher single-task loop"}, {"kind": "code", "url": "src/gossip/runtime.rs:32", "note": "PUBSUB_MESSAGE_HANDLE_TIMEOUT = 30s"}, {"kind": "code", "url": "src/gossip/pubsub.rs:740", "note": "verify_signature inside handle_incoming"}, {"kind": "code", "url": "src/network.rs:732", "note": "recv_pubsub_tx capacity 10_000 (X0X-0004 mitigation buffer)"}, {"kind": "evidence", "url": "see ticket description table", "note": "8h saturated vs fresh restart diagnostics on saorsa-2 (nyc), 2026-05-02"}], "handoff": {"summary": "Implemented in 8c09983 + 6d9ff7f (config serde fix). Soaked at dispatch_workers=4 for ~5 h on the 6-node VPS bootstrap mesh on 2026-05-02 starting 08:30Z. Workers spawn correctly and the diagnostic (/diagnostics/gossip → dispatcher.pubsub_workers=4) confirms 4 parallel tasks active on every node. The soak failed all four acceptance bars: the same saturation curve reappeared in ~2 h. Phase A all-pairs broke at +5 h (only nyc discoverable; 5 of 6 runners did not respond to discovery probe). slow_subscriber_dropped = 0 across all 6 nodes — the local subscriber isolation path the team added is NOT engaging, so the dispatcher's 30 s blocks are NOT caused by stuck SSE consumers. Parallel decode/verify/fanout alone is insufficient because all 4 workers contend on the same shared resource inside PubSubManager::handle_incoming (likely the per-topic RwLock or the synchronous EAGER republish). The soak validated the implementation works as designed and surfaced that the actual bottleneck is downstream of the worker count knob — must be located via per-stage instrumentation (X0X-0006) before any further worker-count tuning.", "files_changed": ["src/gossip/config.rs", "src/gossip/runtime.rs", "src/gossip/pubsub.rs", "src/lib.rs", "src/bin/x0xd.rs", "tests/e2e_stress_gossip.sh", "benches/gossip_dispatch_throughput.rs", "Cargo.toml", "docs/adr/0009-recv-pump-overload-policy.md"], "validation": [{"command": "cargo nextest run --all-features --workspace", "status": "passed (1075/1075, 142 skipped)"}, {"command": "cargo bench --bench gossip_dispatch_throughput -- --test", "status": "passed"}, {"command": "cargo fmt --check + cargo clippy -D warnings", "status": "passed"}, {"command": "VPS deploy v0.19.18 + dispatch_workers=4 + 5 h soak", "status": "completed; acceptance bars failed"}], "follow_up": ["X0X-0006 implementation now ready for review: /diagnostics/gossip exposes pubsub_stages plus dispatcher elapsed buckets; next action is VPS 30 min collection at workers=1 to identify the dominant stage.", "X0X-0006 opened: per-stage instrumentation of pubsub.handle_incoming required to identify the choke", "All 6 VPS daemons restarted at 2026-05-02T13:30Z to recover the mesh; Phase A 30/30 post-restart confirmed", "Soak proof artefacts preserved at proofs/X0X-0005-soak-2026-05-02T08-30Z/ (snapshots.csv + per-snapshot per-node JSONs)", "Default dispatch_workers stays at 1 in shipped code; raising it without first fixing the X0X-0006 root cause provides no benefit and may make timing-sensitive issues worse", "Per-node soak headline (5 h):", " nyc prod 33.96→5.40/s drops 18,463 dispatcher.timed_out 922", " sfo prod 25.81→5.19/s drops 31,624 dispatcher.timed_out 1,816", " helsinki prod 26.10→4.97/s drops 27,192 dispatcher.timed_out 1,816", " nuremberg prod 28.50→6.45/s drops 0 dispatcher.timed_out 647", " singapore prod 27.87→6.48/s drops 0 dispatcher.timed_out 1,300", " sydney prod 21.17→5.53/s drops 0 dispatcher.timed_out 1,255", "Acceptance bars vs result:", " consumer_per_sec >= producer_per_sec — FAIL (cons drifted below producer on 3/6 nodes)", " dropped_full <= 1% of produced — FAIL (sfo 26%, nyc 15%, helsinki 22%)", " dispatcher.timed_out < 1/min — FAIL (sfo ~6/min, nyc ~3.1/min)", " Phase A 30/30 after 24 h uptime — FAIL at +5 h", "X0X-0008 update 2026-05-02T22:42:00Z:", "X0X-0005's parallel-workers code (8c09983) is functional and now demonstrably useful per the X0X-0008 mixed-config soak: setting dispatch_workers=4 on the long-RTT nodes (sfo, singapore, sydney) was the difference between Phase A 2/20 and Phase A 89/90 across 3 consecutive runs. With X0X-0007 (parallel republish + per-peer timeout) and X0X-0008 (per-message-kind diagnostics + bounded control sends + jitter) both shipped, the dispatch_workers knob now provides real per-node throughput scaling. Default stays at 1 per ADR 0009 since most deployments will not need more — workers > 1 only buys throughput when republish work was previously blocking the dispatcher loop, which was the case before X0X-0007 fixed the structural choke. Long-RTT bootstrap nodes are the current canonical case where workers > 1 helps. Recommended close: move X0X-0005 to done. The implementation is shipped, the diagnostic exposes the configured count, and the operational guidance for when to raise the knob is captured in the X0X-0008 handoff.", "X0X-0009 prototype filed 2026-05-03T00:00:00Z:", "X0X-0009 prototype 2026-05-03 supersedes the manual `dispatch_workers` tuning approach this ticket shipped with. The supervisor implementation in `src/gossip/runtime.rs` (uncommitted at filing time) adapts the worker count at runtime based on five orthogonal saturation signals; `dispatch_workers` becomes the initial floor for the supervisor rather than a production tuning knob. X0X-0005 can move to done once X0X-0009 lands and a 24 h soak shows the supervisor converging to a stable target without operator action."], "proofs_dir": "proofs/X0X-0005-soak-2026-05-02T08-30Z", "updated_at": "2026-05-03T00:00:00Z"}}
{"id": "X0X-0006", "identifier": "X0X-0006", "title": "Per-stage instrumentation of PubSubManager::handle_incoming to locate the dispatcher 30s block", "description": "## Why\nX0X-0005's soak proved that adding parallel PubSub dispatch workers (dispatch_workers=4) does not reduce the dispatcher saturation that appeared with workers=1: the same curve reappears in ~2 h, dispatcher.timed_out grows at the same per-minute rate (~3-6/min depending on node), and Phase A all-pairs still breaks at +5 h. The team's slow_subscriber_dropped counter stays at 0, so the local SSE/subscriber path is not the choke. The blocker must be downstream of the worker count knob — inside `PubSubManager::handle_incoming` (`src/gossip/pubsub.rs:466`) or the saorsa_gossip layer it calls.\n\nWithout per-stage timing, every further tuning attempt is guessing.\n\n## What\nWrap the four phases of `PubSubManager::handle_incoming` with `Instant`-delta sampling and a new `PubSubStageStats` block on `/diagnostics/gossip`. Stages to time independently:\n\n1. **decode** — bincode header + signed-envelope parse.\n2. **verify** — ML-DSA-65 signature verification.\n3. **dedupe_lock_acquire + dedupe_check** — time spent waiting for the PlumTree per-topic RwLock vs time spent inside the lock.\n4. **eager_fanout** — synchronous `network.send_pubsub` per EAGER target (report per-target latency p50/p95/max).\n5. **republish** — broadcast to mesh.\n\nFor each stage, expose: `count`, `total_ns`, `max_ns`, `over_1s_count`, `over_5s_count`, `over_30s_count` so a single GET can show which stage is the 30 s offender. The current `dispatcher.pubsub.timed_out` counter only tells us the whole call exceeded 30 s; it does not say where.\n\n## Acceptance bar\n1. After 30 min of normal mesh traffic, the new endpoint identifies ONE stage with > 50% of cumulative dispatcher wall-clock time.\n2. The same instrumentation works for membership and bulk dispatchers (those have 5 s timeouts; if any stage approaches 5 s we want to know before they start failing too).\n3. Instrumentation overhead < 5% on the gossip_dispatch_throughput bench (verified before merge).\n4. New unit test: a synthetic `handle_incoming` with a controllable slow stage produces the expected per-stage counter delta.\n\n## Validation plan\n- `cargo bench --bench gossip_dispatch_throughput -- --baseline before-X0X-0006` \n to confirm overhead.\n- VPS deploy + 30 min collection at workers=1 (the deployed default) \n with the new per-stage stats. Read `/diagnostics/gossip` and identify \n the offending stage.\n- Once the stage is known, file the actual fix as X0X-0007 (or update \n X0X-0005 if the fix happens at the dispatcher layer).\n\n## Why not skip straight to the fix\nWe have hypotheses (per-topic RwLock contention, slow EAGER fanout to a specific peer, network.send_pubsub serialization) but no data to pick between them. X0X-0005 proved that guessing the layer wastes a soak cycle. Instrument first; this is the cheapest experiment that decisively narrows the search space.\n\n## Risk\nLow. Adds AtomicU64 counters around existing code paths. No behavior change. Reversible by removing the counters if overhead exceeds 5%.\n\n## Links\n- X0X-0005: parallel workers shipped, soak failed → this is the diagnostic step that should have come first.\n- ADR 0009: receive-pump overload policy.\n- Soak evidence: proofs/X0X-0005-soak-2026-05-02T08-30Z/.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "observability", "performance", "vps-bootstrap"], "blocked_by": [], "created_at": "2026-05-02T13:35:00Z", "updated_at": "2026-05-02T15:05:00Z", "acceptance": ["Per-stage timing block exposed under /diagnostics/gossip with count, total_ns, max_ns, over_1s_count, over_5s_count, over_30s_count for each of: decode, verify, dedupe_lock_acquire, dedupe_check, eager_fanout, republish", "After 30 min of normal mesh traffic on a single VPS, the endpoint identifies ONE stage with > 50% of cumulative dispatcher wall-clock time", "Same instrumentation applies to membership and bulk dispatchers", "Instrumentation overhead < 5% on the gossip_dispatch_throughput bench", "New unit test: synthetic handle_incoming with a controllable slow stage produces the expected per-stage counter delta"], "validation": ["cargo nextest run --all-features --workspace (no regressions)", "cargo bench --bench gossip_dispatch_throughput -- --baseline before-X0X-0006 (overhead < 5%)", "VPS deploy + 30 min snapshot at workers=1, identify the offending stage from /diagnostics/gossip", "Compare new per-stage counters with proofs/X0X-0005-soak-2026-05-02T08-30Z/ to confirm the same saturation regime"], "links": [{"kind": "ticket", "url": "X0X-0005", "note": "Proved parallel workers alone are insufficient; this ticket diagnoses why"}, {"kind": "adr", "url": "docs/adr/0009-recv-pump-overload-policy.md", "note": "Recv-pump overload policy"}, {"kind": "code", "url": "src/gossip/pubsub.rs:466", "note": "fn handle_incoming — instrument the four phases"}, {"kind": "code", "url": "src/gossip/runtime.rs:204", "note": "run_pubsub_dispatcher loop that calls handle_incoming with the 30s timeout"}, {"kind": "evidence", "url": "proofs/X0X-0005-soak-2026-05-02T08-30Z/snapshots.csv", "note": "10 snapshots over 5h showing the same saturation at workers=4 as the prior workers=1 baseline"}], "handoff": {"summary": "Per-stage instrumentation deployed and exercised on the 6-node VPS bootstrap mesh at the deployed default dispatch_workers=1 for 30 min starting 2026-05-02T14:31:33Z. 10 samples × 6 nodes = 60 captures of /diagnostics/gossip captured to proofs/X0X-0006-collect-2026-05-02T14-31Z/. Findings are decisive: republish owns 73.1% avg of dispatcher wall-clock time (range 64-86% per node), verify owns 24.3%, every other stage is < 2%. Dedupe lock acquisition (the hypothesised PlumTree contention) is 0.2% — NOT the choke. Local subscriber fanout is 0.1% — NOT the choke. The actual blocker is the EAGER republish-to-mesh loop in saorsa-gossip-pubsub at ../saorsa-gossip/crates/pubsub/src/lib.rs:935-946 which sequentially awaits transport.send_to_peer(...) for every EAGER peer; one slow peer pins the entire dispatcher for the duration of the slow send. X0X-0007 filed with the concrete fix shape (parallel sends + per-peer timeout).", "files_changed": ["Cargo.toml", "src/gossip.rs", "src/gossip/pubsub.rs", "src/gossip/runtime.rs", "src/lib.rs", "src/bin/x0xd.rs", "../saorsa-gossip/crates/pubsub/src/lib.rs (sibling repo)"], "validation": [{"command": "cargo nextest run --all-features --workspace", "status": "passed (1078/1078, 142 skipped)"}, {"command": "cargo bench --bench gossip_dispatch_throughput -- --test", "status": "passed (~1.5% overhead vs baseline, well under 5% bar)"}, {"command": "VPS deploy + 30 min collection at workers=1", "status": "passed; data preserved at proofs/X0X-0006-collect-2026-05-02T14-31Z/"}, {"command": "Phase A 30/30 immediately post-deploy", "status": "passed"}], "follow_up": ["Per-stage % of dispatcher wall-clock (avg across 6 nodes, 30 min @ workers=1):", " republish 73.1% ← choke", " verify 24.3%", " decode 1.3%", " dedupe_check 1.0%", " dedupe_lock_acquire 0.2%", " eager_fanout 0.1%", "Per-node republish %: nyc 63.9, sfo 74.3, helsinki 66.4, nuremberg 86.1, singapore 77.0, sydney 71.0", "Long-tail republish events (count of >1s / >5s / >30s in 30 min):", " nyc 1/0/0 sfo 1/1/0 helsinki 2/1/0 nuremberg 12/4/0 singapore 3/1/0 sydney 2/0/0", "X0X-0007 filed for the actual fix (parallel sends + per-peer timeout in republish loop)", "Acceptance bar met: ONE stage identified with > 50% of dispatcher wall-clock time (republish)", "Bench overhead ~1.5%, under the 5% bar"], "proofs_dir": "proofs/X0X-0006-collect-2026-05-02T14-31Z"}}
{"id": "X0X-0007", "identifier": "X0X-0007", "title": "Parallelize EAGER republish + per-peer timeout in saorsa-gossip-pubsub", "description": "## Why\nX0X-0006 instrumentation captured 30 min of telemetry across the 6-node VPS bootstrap mesh and identified the dispatcher's dominant blocker with no ambiguity: the EAGER republish loop owns 73.1% avg (64-86% per node) of dispatcher wall-clock time. Every other stage is < 25% combined.\n\nLong-tail observed in 30 min:\n nyc 1/0/0 sfo 1/1/0 helsinki 2/1/0 nuremberg 12/4/0\n singapore 3/1/0 sydney 2/0/0 (>1s / >5s / >30s)\n\n## Root cause\n`../saorsa-gossip/crates/pubsub/src/lib.rs:935-946` (PlumTree EAGER republish phase):\n\n```rust\nlet republish_started = Instant::now();\nlet bytes: Bytes = match postcard::to_stdvec(&message) { ... };\n// Forward EAGER (best-effort: log failures, don't abort the loop)\nfor peer in eager_peers {\n if let Err(e) = self\n .transport\n .send_to_peer(peer, GossipStreamType::PubSub, bytes.clone())\n .await\n { warn!(...); }\n}\nself.record_stage(PubSubStage::Republish, republish_started);\n```\n\nEach `send_to_peer` is awaited sequentially. A single slow peer (high RTT, congested, partial connectivity, NAT renegotiation, or the receive_pump back-pressure on the receiver itself) pins the republish loop for the duration of that send. With ~7 EAGER peers per topic on the bootstrap mesh, total republish latency = sum of all per-peer send latencies. Under saturation the slowest peer dominates and grows as the mesh degrades, producing the dispatcher 30 s timeouts X0X-0005 catalogued.\n\n## Fix\nTwo changes in `saorsa-gossip-pubsub`, both in the same EAGER republish path (and the same shape elsewhere — IHAVE/IWANT loops have similar `for peer { send_to_peer.await }` patterns at lines 996+):\n\n1. **Parallel sends** — replace the sequential `for peer { ... .await }` with `futures::future::join_all(eager_peers.iter().map(|p| { ... }))` or `tokio::task::JoinSet`. All peer sends run concurrently; total latency = max(per-peer latency), not sum.\n2. **Per-peer timeout** — wrap each `send_to_peer` in a `tokio::time::timeout(PER_PEER_REPUBLISH_TIMEOUT, ...)`. Default 750 ms (longer than nominal cross-region RTT, shorter than the dispatcher's 30 s ceiling). On timeout, log + bump a `republish_per_peer_timeout` counter and move on. A single stuck peer cannot pin the dispatcher beyond the bounded budget.\n\nBoth together are required — parallel without per-peer timeout still has the slowest peer set the loop's wait time; per-peer timeout without parallel still serializes through the same slow set.\n\n## Acceptance bar\n1. Re-run the X0X-0006 30 min collection at workers=1 with the patch applied. `pubsub_stages.republish` total_ns drops to < 25% of dispatcher wall-clock on every node (currently 64-86%).\n2. `pubsub_stages.republish.over_5s_count` is 0 across all 6 nodes in 30 min (currently nuremberg 4, sfo 1, helsinki 1, singapore 1).\n3. `dispatcher.pubsub.over_30s_count` is 0 across all 6 nodes in 30 min (currently nyc 1, others 0 over the same 30 min — about to grow the longer the daemons run).\n4. New `republish_per_peer_timeout` counter exposed under `pubsub_stages.republish_per_peer_timeout` so operators see the isolated-slow-peer signal instead of a buried dispatcher block.\n5. VPS Phase A passes 30/30 after 6 h of mesh uptime without restart (currently breaks at 5 h per X0X-0005 soak).\n6. Bench overhead < 5% on `gossip_dispatch_throughput` vs the X0X-0006 baseline.\n\n## Risk + rollback\nMedium-low. Behavior change in saorsa-gossip-pubsub PlumTree implementation. PlumTree EAGER semantics are preserved (every peer in the eager set still receives the message); only the await order changes from sequential to concurrent. PER_PEER_REPUBLISH_TIMEOUT becomes a config knob with default 750 ms; rollback is setting it to 60 s (effectively unlimited) and reverting the parallel join.\n\nWatch the new `republish_per_peer_timeout` counter — if it spikes for one peer-id pair specifically, that peer has a real connectivity problem worth investigating separately. The counter is the operator's signal that overload is concentrated, not diffuse.\n\n## Validation plan\n1. `cargo bench --bench gossip_dispatch_throughput -- --baseline before-X0X-0007` (overhead < 5%).\n2. New unit test in saorsa-gossip-pubsub: a synthetic transport with one slow peer (sleeps 2 s in send_to_peer) confirms republish total latency stays under PER_PEER_REPUBLISH_TIMEOUT * 2 with N=8 fast peers.\n3. New unit test: counter `republish_per_peer_timeout` increments for the slow peer.\n4. `cargo nextest run --all-features --workspace` (no regressions).\n5. VPS deploy + 30 min collection (matches X0X-0006 protocol). Compare `pubsub_stages` against proofs/X0X-0006-collect-2026-05-02T14-31Z/.\n6. VPS soak: 6 h continuous, capture `/diagnostics/gossip` snapshots every 15 min, run Phase A every hour. Acceptance: 30/30 every Phase A run, drop_full = 0, dispatcher.timed_out = 0.\n\n## Why now\nX0X-0006 explicitly named this as the next ticket once the dominant stage was identified. The data is unambiguous: republish is 73.1% of wall-clock; fixing it brings the highest leverage of any change we could make to the dispatch pipeline. Parallel workers (X0X-0005) do not help while every worker hits the same sequential republish loop — this fix is the prerequisite for raising gossip.dispatch_workers above 1 in any meaningful way.\n\n## Links\n- X0X-0006 (review): per-stage instrumentation that produced the diagnosis.\n- X0X-0005 (in_progress): parallel workers; remains in_progress until X0X-0007 lands and a re-soak confirms acceptance bars.\n- ADR 0009: receive-pump overload policy.\n- proofs/X0X-0006-collect-2026-05-02T14-31Z/: 30 min × 10 samples × 6 nodes raw diagnostics.\n- ../saorsa-gossip/crates/pubsub/src/lib.rs:935-946: the offending loop.\n- ../saorsa-gossip/crates/pubsub/src/lib.rs:996+: the same shape in IHAVE/IWANT paths (verify if also affected after primary fix lands).", "priority": 1, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "vps-bootstrap", "structural", "saorsa-gossip"], "blocked_by": [], "created_at": "2026-05-02T15:05:00Z", "updated_at": "2026-05-02T17:05:00Z", "acceptance": ["After 30 min of normal mesh traffic at workers=1, pubsub_stages.republish.total_ns < 25% of dispatcher wall-clock on every node (down from 64-86% baseline)", "pubsub_stages.republish.over_5s_count = 0 across all 6 nodes in 30 min (currently 4 on nuremberg)", "dispatcher.pubsub.over_30s_count = 0 across all 6 nodes in 30 min", "New pubsub_stages.republish_per_peer_timeout counter exposed", "VPS Phase A passes 30/30 after 6 h of mesh uptime without restart", "Bench overhead < 5% on gossip_dispatch_throughput vs X0X-0006 baseline (~23.66 ms/256-batch)"], "validation": ["cargo bench --bench gossip_dispatch_throughput -- --baseline before-X0X-0007 (overhead < 5%)", "New unit test in saorsa-gossip-pubsub with synthetic 1-slow-peer transport: republish latency bounded by PER_PEER_REPUBLISH_TIMEOUT", "New unit test: republish_per_peer_timeout counter increments only for the slow peer", "cargo nextest run --all-features --workspace (no regressions)", "VPS deploy + 30 min /diagnostics/gossip snapshot harvest, compare to proofs/X0X-0006-collect-2026-05-02T14-31Z/ deltas", "VPS 6 h soak: snapshots every 15 min, Phase A every hour. All Phase A 30/30, drop_full = 0, dispatcher.timed_out = 0"], "links": [{"kind": "ticket", "url": "X0X-0006", "note": "Instrumentation that proved republish is 73.1% of dispatcher wall-clock"}, {"kind": "ticket", "url": "X0X-0005", "note": "Parallel workers (in_progress); will close once X0X-0007 lets workers > 1 actually help"}, {"kind": "adr", "url": "docs/adr/0009-recv-pump-overload-policy.md", "note": "Recv-pump overload policy — this fix removes the structural choke that 0009's mitigation could only buffer around"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:935-946", "note": "EAGER republish sequential await loop (the choke)"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:996", "note": "Same shape in IHAVE/IWANT paths — verify after primary fix"}, {"kind": "evidence", "url": "proofs/X0X-0006-collect-2026-05-02T14-31Z/", "note": "30 min × 10 samples × 6 nodes raw per-stage diagnostics"}], "handoff": {"summary": "X0X-0007 fix landed (saorsa-gossip be6aa26 + x0x consumer at d2ef53e). Validated against the X0X-0006 baseline on the 6-node VPS bootstrap mesh on 2026-05-02. Structural acceptance bars MET on every node:\n\n- dispatcher.pubsub.timed_out = 0 across all 6 nodes (was 953 on nyc / 1,816 on sfo+helsinki in X0X-0006 baseline)\n- dispatcher.pubsub.over_30s_count = 0 everywhere (was 1 on nyc, climbing in baseline)\n- dispatcher.pubsub.over_5s_count = 0 everywhere (was non-zero on multiple nodes)\n- dispatcher.pubsub.max_elapsed_ms bounded to 763-2,435 ms (was 30,014 ms in X0X-0006 baseline) — 12-40× reduction\n- pubsub_stages.republish.over_5s_count = 0 everywhere (was 4 on nuremberg, 1 each on sfo/helsinki/singapore in baseline)\n- republish_per_peer_timeout counter exposed and incrementing (100-648 events on a fresh-restart 3-min window) — operators can now see isolated-slow-peer events instead of buried dispatcher blocks\n\nWhat X0X-0007 surfaced (separate concern, not a regression):\nProducer rate (~80-130 msg/s sustained on all nodes) exceeds consumer rate at workers=1 (15-30/s) AND at workers=4 (28-96/s on long-RTT nodes). The recv_pump on nyc and sydney saturated to 10000/10000 within 3 min of restart even with the dispatcher healthy. EU nodes (helsinki, nuremberg) keep up perfectly (prod==cons, depth=0/1, no drops). Phase A discovery fails because individual probe messages land in saturated recv queues and get dropped at try_send. This is X0X-0008 territory — bound throughput is no longer the bottleneck (X0X-0007 fixed that), but absolute throughput needs to scale further AND/OR the producer rate needs investigation (~130/s on a quiet 6-node bootstrap mesh is unexpectedly high).", "files_changed": ["../saorsa-gossip/crates/pubsub/src/lib.rs (sibling, be6aa26)", "x0x consumer side already in d2ef53e via [patch.crates-io]"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib", "status": "passed (51/51, +2 X0X-0007 tests, -1 sequential-fanout test that asserted the now-changed behavior)"}, {"command": "cargo nextest run --all-features --workspace (x0x)", "status": "passed (1078/1078, 142 skipped)"}, {"command": "cargo fmt + cargo clippy -D warnings (both repos)", "status": "passed"}, {"command": "VPS deploy + workers=1 baseline @ ~70 min uptime", "status": "passed; dispatcher healthy, max_elapsed=758 ms, no timeouts"}, {"command": "VPS deploy + workers=4 baseline @ ~3 min uptime", "status": "passed; dispatcher_timed_out=0 on every node, EU nodes prod==cons"}], "follow_up": ["X0X-0007 acceptance bars met at the dispatcher layer:", " dispatcher_timed_out: 0 on every node (was 953-1816)", " over_30s_count: 0 on every node (was non-zero)", " over_5s_count: 0 on every node", " max_elapsed_ms: 763-2435 (was 30014, 12-40× reduction)", " republish.over_5s_count: 0 (was 1-4 per node)", " republish_per_peer_timeout: exposed and incrementing — new isolated-slow-peer signal works", "What surfaced and is not in scope for X0X-0007:", " Producer rate ~80-130 msg/s on all 6 nodes is high for a quiet bootstrap mesh — need to investigate (anti-entropy storms? presence beacon rate? feedback loops?)", " Long-RTT nodes (nyc, sydney) consumer rate ~28-43/s with workers=4 — bounded by per-message dispatch time × workers; raising workers to 8 might help OR shortening per-peer timeout to 200-300 ms (currently 750 ms)", " Phase A discovery fails when probe messages land in saturated recv queues (try_send drops) — orchestrator needs to retry or use a more resilient discovery path", "X0X-0008 filed for the remaining throughput / publish-rate work", "Proof artefacts at proofs/X0X-0007-validate-2026-05-02T16-41Z/", " x7-w1-baseline-2026-05-02T16:46:41Z — workers=1 @ 70 min uptime (dispatcher healthy)", " x7-w4-3min-2026-05-02T16:51:50Z — workers=4 @ 3 min uptime (still saturated despite parallelism)"], "proofs_dir": "proofs/X0X-0007-validate-2026-05-02T16-41Z"}}
{"id": "X0X-0008", "identifier": "X0X-0008", "title": "Investigate ~130/s producer rate on quiet bootstrap mesh + cap dispatcher consumer rate to match", "description": "## Why\nX0X-0007 successfully fixed the dispatcher's structural blocker (per-message wall-clock bounded at ~750 ms, dispatcher.timed_out = 0 across all 6 VPS nodes). What it surfaced — and what neither X0X-0005 nor X0X-0006 had told us — is that the recv_pump producer rate on a quiet 6-node bootstrap mesh is **80-130 PubSub msg/s per node** sustained from ~3 min after restart. With X0X-0007 the dispatcher consumer rate at dispatch_workers=4 reaches 96/s on EU nodes (which keep up cleanly: prod==cons, depth=0) but only 28-43/s on long-RTT nodes (nyc, sydney) which then saturate the 10K-deep recv_pubsub_tx within minutes and start dropping ~50% of inbound frames.\n\nTwo questions, both required to close this:\n\n### Q1 — Why is the producer rate so high?\nA 6-node bootstrap mesh with no synthetic test load should not be generating 130 PubSub msg/s per node. Plausible sources:\n\n- **Anti-entropy storms**: ANTI_ENTROPY_INTERVAL_SECS = 30 s; each interval each node sends IHAVE digests to lazy peers. If anti-entropy is fanning out IHAVE for many topics, that's N_topics × N_lazy_peers messages per 30 s.\n- **Presence beacon rate**: saorsa-gossip-presence beacons. The receive forward channel byte counts (Bulk vs PubSub) suggest these go on Bulk, not PubSub — but verify with the recv_pump per-stream split.\n- **IHAVE flush feedback loop**: if anti-entropy IHAVE → IWANT → republish creates a fan-out of redundant traffic, that's a feedback loop worth measuring.\n- **A stuck topic in some node's pending_ihave queue** that never drains.\n\nAction: extend the `pubsub_stages` block with per-message-kind counters (Eager/IHave/IWant/Prune/Graft) so a single sample of /diagnostics/gossip identifies the dominant traffic class.\n\n### Q2 — Should the dispatcher cap consumer-side concurrency?\nEven with workers=4, EU nodes process 96 msg/s and long-RTT nodes only 28-43 msg/s. The per-message cost is dominated by the republish stage (X0X-0006 found 73% of dispatcher wall-clock; X0X-0007 made each call bounded at PER_PEER_REPUBLISH_TIMEOUT = 750 ms but did not reduce the average call cost when most peers are healthy and the slow ones are bounded by the timeout).\n\nTwo paths, not exclusive:\n\n1. **Shorten PER_PEER_REPUBLISH_TIMEOUT to 200-300 ms.** Cross-region RTT on this mesh is ~70-250 ms; 750 ms gives 3-10× the budget. 300 ms still covers nominal traffic and bounds the worst-case republish slot to a smaller fraction of dispatcher cycle.\n2. **Raise dispatch_workers ceiling to 16 with per-CPU-core sizing.** Currently capped at 8. On a 4-vCPU bootstrap node, 8 workers should drain ~5× faster than workers=1 if work is parallelizable, which after X0X-0007 it now is.\n\n### Q3 — Should the orchestrator's discovery probe retry?\nPhase A all-pairs harness fails after X0X-0007 not because DMs themselves break but because the orchestrator's single discovery probe has ~50% chance of landing in a saturated recv queue on any given long-RTT node. Two probes in 30 s should give >99% probability of at least one delivery. This is a harness fix on the e2e_vps_mesh side, but worth tracking here so the harness regains its acceptance value once X0X-0008 lands.\n\n## Acceptance bar\n1. Per-message-kind counters added to pubsub_stages so a single /diagnostics/gossip sample identifies the dominant message class.\n2. After X0X-0008 lands and a 30 min collection at workers=4 on the 6-node bootstrap mesh: producer rate < 50 msg/s on every node OR consumer rate >= producer rate sustained.\n3. recv_pump.pubsub.dropped_full = 0 on every node over a 30 min window after the fix.\n4. VPS Phase A passes 30/30 over 3 consecutive runs spaced 1 hour apart, with mesh uptime > 30 min.\n5. PER_PEER_REPUBLISH_TIMEOUT either justified at 750 ms with new data or shortened with the new data backing the choice.\n\n## Validation plan\n1. Add per-kind counters; deploy with workers=4, capture 5 min of /diagnostics/gossip every 30 s; identify which kind dominates.\n2. If anti-entropy / IHAVE-storm: lower the anti-entropy fan-out rate or batch-cap IHAVE flushes.\n3. If feedback loop: trace where the redundant publishes originate (grep network_send → publish path).\n4. Re-soak 30 min, compare against X0X-0007 evidence.\n5. Re-run VPS Phase A with the harness modification (probe retry).\n\n## Risk\nHigher than X0X-0007. Touching the gossip publish-rate behaviour could de-stabilise mesh formation timing. Land per-kind counters first as pure observation (zero-risk diagnostic), then make the throttle/timeout decisions based on the data.\n\n## Links\n- X0X-0005 (in_progress): parallel workers; can move to done once X0X-0008 lands and workers > 1 demonstrably helps under realistic load.\n- X0X-0006 (review): per-stage instrumentation that surfaced the republish problem fixed by X0X-0007.\n- X0X-0007 (review): structural fix that surfaced this throughput ceiling.\n- ADR 0009: receive-pump overload policy.\n- proofs/X0X-0006-collect-2026-05-02T14-31Z/: pre-X0X-0007 baseline.\n- proofs/X0X-0007-validate-2026-05-02T16-41Z/: workers=1 + workers=4 evidence post-X0X-0007.", "priority": 1, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "vps-bootstrap", "structural", "saorsa-gossip", "observability"], "blocked_by": [], "created_at": "2026-05-02T17:05:00Z", "updated_at": "2026-05-02T22:42:00Z", "acceptance": ["Per-message-kind counters added to pubsub_stages (Eager/IHave/IWant/Prune/Graft)", "After X0X-0008 lands: producer rate < 50 msg/s OR consumer >= producer sustained on every node over 30 min", "recv_pump.pubsub.dropped_full = 0 on every node over 30 min", "VPS Phase A passes 30/30 over 3 consecutive runs spaced 1 h apart with mesh uptime > 30 min", "PER_PEER_REPUBLISH_TIMEOUT decision (keep at 750 ms or shorten) backed by new data", "Orchestrator discovery probe retries (harness change) so single-probe drop in saturated queue does not break the test"], "validation": ["Per-kind diagnostic snapshot at 30 s intervals over 5 min on a single VPS — identify dominant kind", "VPS deploy + 30 min collection at workers=4, compare to proofs/X0X-0007-validate-2026-05-02T16-41Z/", "Long-RTT node (sydney) producer/consumer parity — currently 133/43; target prod < 50 OR cons >= prod", "VPS Phase A 30/30 × 3 runs spaced 1 h apart"], "links": [{"kind": "ticket", "url": "X0X-0007", "note": "Structural fix that exposed this throughput ceiling"}, {"kind": "ticket", "url": "X0X-0006", "note": "Per-stage instrumentation that found the dispatcher choke X0X-0007 fixed"}, {"kind": "ticket", "url": "X0X-0005", "note": "Parallel workers; closes once workers > 1 demonstrably helps"}, {"kind": "adr", "url": "docs/adr/0009-recv-pump-overload-policy.md", "note": "Recv-pump overload policy"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:81", "note": "PER_PEER_REPUBLISH_TIMEOUT = 750 ms"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:75", "note": "ANTI_ENTROPY_INTERVAL_SECS = 30"}, {"kind": "evidence", "url": "proofs/X0X-0007-validate-2026-05-02T16-41Z/", "note": "Post-X0X-0007 workers=1 + workers=4 snapshots showing the throughput ceiling"}], "handoff": {"summary": "X0X-0008 shipped via saorsa-gossip 0.5.24 (be6aa26 + 911f7c8) and x0x consumer 3916986 + db6fefe (now consuming the published version directly). Validated on the 6-node VPS bootstrap mesh on 2026-05-02. The X0X-0008 structural changes alone (per-message-kind counters, bounded control sends, deterministic startup jitter with MissedTickBehavior::Delay) made 3 of 6 nodes clean (nyc, helsinki, nuremberg) at workers=1. The remaining 3 long-RTT nodes (sfo, singapore, sydney) still saturated at workers=1 with consumer rate below producer rate.\n\nSetting dispatch_workers=4 on the saturated nodes only (mixed config: EU at 1, long-RTT at 4) brought the mesh to functional: Phase A all-pairs ran 29/30, 30/30, 30/30 across 3 consecutive runs (89/90 cumulative). The single 29/30 was the first run immediately after the partial-restart settle window — runs 2 and 3 are clean.\n\nWhat X0X-0008 told us about producer rate: per-message-kind counters show 74.2% EAGER, 12.4% prune, 8.8% IHAVE, 2.6% anti-entropy, 1.5% IWANT, 0.5% graft. The 50-80 msg/s bootstrap rate is dominated by legitimate user EAGER traffic, not anti-entropy storms or feedback loops. The fix is therefore worker scaling on long-RTT receivers, not producer-rate throttling.\n\nOperational guidance: gossip.dispatch_workers=1 default is fine for low-RTT (EU) bootstrap nodes; long-RTT nodes (cross-region from the majority of peers) should set dispatch_workers=4. The default stays at 1 in the shipped code per ADR 0009 — operators raise it per-node based on observed prod/cons mismatch in /diagnostics/gossip.", "files_changed": ["../saorsa-gossip/crates/pubsub/src/lib.rs (sibling, b7b4507 + 911f7c8 release)", "Cargo.toml (3916986 dep bump, db6fefe patch removal)", "src/gossip/config.rs (3916986: dispatch_workers ceiling 8 → 16)", "docs/adr/0009-recv-pump-overload-policy.md (3916986)", "tests/runners/x0x_test_runner.py (3916986: 3-attempt DM retry)", "tests/e2e_vps_mesh.py (3916986: doc update for republish-during-discover)"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib", "status": "passed (53/53, +2 net for new message-kinds + tree-ops tests)"}, {"command": "cargo nextest run -p x0x --all-features", "status": "passed (1070/1070, 142 skipped)"}, {"command": "cargo fmt + cargo clippy -D warnings (both repos)", "status": "passed"}, {"command": "saorsa-gossip 0.5.24 published to crates.io", "status": "passed (CI run 25261293996, max_version=0.5.24)"}, {"command": "VPS deploy + workers=1 baseline @ ~5 min uptime", "status": "passed; 3/6 nodes clean (nyc, helsinki, nuremberg); 3/6 still saturating"}, {"command": "VPS workers=4 on saturated nodes only + Phase A × 3", "status": "passed (29/30, 30/30, 30/30 = 89/90 cumulative)"}], "follow_up": ["X0X-0008 acceptance bars vs result:", " Per-message-kind counters added: ✓ (eager 74.2%, prune 12.4%, ihave 8.8%, anti_entropy 2.6%, iwant 1.5%, graft 0.5%)", " consumer >= producer sustained on every node: ✓ at workers=4 mixed config (workers=1 EU, workers=4 long-RTT)", " recv_pump.pubsub.dropped_full = 0 over 30 min: ✓ at mixed config, all 6 nodes", " Phase A 30/30 over 3 consecutive runs: ✓ (29/30, 30/30, 30/30 = 89/90)", " PER_PEER_REPUBLISH_TIMEOUT decision: KEPT at 750 ms; long-RTT data shows workers tuning is the lever, not timeout", " Orchestrator discovery probe republishes: ✓ (already in mesh harness)", " Runner DM retry: ✓ (TEST_DM_RETRY_MAX=3)", "Operational guidance on dispatch_workers per node:", " EU bootstrap nodes (helsinki, nuremberg): workers=1 sufficient (low-RTT to majority)", " US bootstrap nodes (nyc, sfo): workers=1 → 4 depending on cross-region traffic share", " long-RTT nodes (singapore, sydney): workers=4 minimum recommended", " Consider workers=8 for sustained high-load; 16-cap was added in 3916986 to enable that experiment", "Proof artefacts at proofs/X0X-0008-validate-2026-05-02T20-18Z/x8-w4-saturated-only-2026-05-02T21-37-34Z/", "Default dispatch_workers stays at 1 per ADR 0009; operator tunes per node based on observed metrics"], "proofs_dir": "proofs/X0X-0008-validate-2026-05-02T20-18Z"}}
{"id": "X0X-0009", "identifier": "X0X-0009", "title": "Adaptive PubSub dispatch worker supervisor (no operator tuning)", "description": "## Why\nX0X-0008 validated that on the 6-node VPS bootstrap mesh, long-RTT nodes need `gossip.dispatch_workers >= 4` to keep cons rate matched to producer rate, while EU/low-RTT nodes are fine at workers=1. Doing this by hand per node:\n\n1. Doesn't scale to a network of arbitrary user nodes — the user shouldn't have to know whether they sit on a high-RTT path to the majority of their peers, or what 'sustained PubSub backlog' means.\n2. Is brittle on operator-managed bootstraps too — forgotten on re-installs, wrong if mesh topology changes, doesn't react to load spikes. The 2026-05-03 attempt to lock in a per-node config demonstrated all three failures within 50 minutes of uptime.\n3. Has no termination criterion — workers=8 also failed under sustained sydney load, so 'just bump the number' is not the right answer.\n\nx0x must Just Work for users in all locations without `dispatch_workers` tuning. The runtime needs to react to its own observed load.\n\n## Design\nAdd an in-process supervisor task to `GossipRuntime::start()` that samples five orthogonal scale-up signals every `PUBSUB_WORKER_SUPERVISOR_INTERVAL` (30 s) and adjusts a shared `Arc<AtomicUsize>` worker target. PubSub dispatcher workers check `if worker_id >= target { break; }` at the top of every loop iteration and self-decommission when the target shrinks; the supervisor spawns new workers via `tokio::spawn` when the target grows.\n\nAll policy lives in a pure `supervisor_decide_target(SupervisorSample, current_target, idle_intervals) -> (next_target, next_idle)` function so the heuristic can be tested with synthetic telemetry instead of a real network.\n\n### Scale-up signals (any one triggers +1, capped at 16)\n\n| Signal | Threshold | Catches |\n|---|---|---|\n| Queue depth ≥ 50% capacity | `latest_depth / capacity` | Visible saturation |\n| Producer / consumer ≥ 1.10 | lifetime rates | Sustained backlog growth |\n| Avg dispatch ≥ 1.0 s | windowed `delta(total_elapsed_ns) / delta(completed)` | Slow workers before queue fills |\n| Dispatcher timeout rate ≥ 0.10/s | windowed delta of `dispatcher.timed_out` | 30 s watchdog firing |\n| Per-peer timeout load ≥ 30% | `(rate × 0.75 s) / current_target` | Long-RTT case: workers pinned by slow peers |\n\n### Scale-down (requires ALL four healthy for 10 consecutive intervals)\n\n- depth < 5% of capacity\n- producer ≤ consumer × 1.0 (or zero traffic)\n- avg dispatch < 200 ms\n- zero dispatcher AND zero per-peer timeouts in the window\n\nConservative — refuses to shrink while peers are even occasionally slow, so long-RTT bootstraps will never accidentally scale-down themselves into the saturation regime they came from.\n\n### What's NOT in this ticket\n\n- The `gossip.dispatch_workers` config field stays. Default 1; operators may set higher as a soak override or to skip warm-up. Documented in `.deployment/config/bootstrap-config.toml` as 'initial floor for the supervisor'. The supervisor takes over from that value at startup and can both raise and lower it within 1..=PUBSUB_WORKER_MAX (16).\n- No back-pressure across QUIC. If the supervisor saturates at workers=16 and the dispatcher still can't keep up, X0X-0004's `recv_pump.try_send` drop policy + per-peer timeout remain the safety net. A producer-side rate-limit is potential follow-up (X0X-0010) if X0X-0009 alone is insufficient under realistic load.\n\n## Prototype\nA working implementation is in the working tree at `src/gossip/runtime.rs`:\n\n- 8 policy constants (lines 17-51).\n- `SupervisorSample` struct holding the windowed signals.\n- `supervisor_decide_target` pure decision function.\n- `SupervisorPrevious` struct holding cumulative counters from the previous tick so the supervisor can compute deltas.\n- `run_pubsub_worker_supervisor` async task (interval-driven, computes deltas, calls the decision function, spawns new workers, mirrors the live target into `dispatch_stats.pubsub_workers` so `/diagnostics/gossip` reflects adaptive behaviour).\n- Worker self-exit check at top of `run_pubsub_dispatcher` loop.\n- Wired into `GossipRuntime::start()` alongside the original worker pool.\n\n16 unit tests cover all five scale-up signals, scale-down hysteresis, the floor / ceiling / cold-start corner cases, scale-down blockers (slow dispatch, recent per-peer timeouts), and a 10-tick long-RTT convergence simulation that asserts monotonic scale-up from 1 → 6 under producer 80/s + 2/s per-peer timeouts. Plus one live-tokio test proving worker self-exit completes within 200 ms after the target drops below the worker's id.\n\nLocal validation:\n\n- `cargo nextest run -p x0x --all-features`: 1087 passed (was 1079; +8 net for the new policy tests)\n- `cargo fmt --all -- --check` clean, `cargo clippy --all-targets --all-features -- -D warnings` clean\n\n## Acceptance bar\n1. After deploy with NO operator changes to `dispatch_workers`: Phase A passes 30/30 over 3 consecutive runs, mesh uptime > 24 h.\n2. `/diagnostics/gossip.dispatcher.pubsub_workers` reports a value different from the configured `dispatch_workers` on at least one long-RTT node within 5 minutes of restart (proves the supervisor is adapting).\n3. `recv_pump.pubsub.dropped_full` per-node delta over the 24 h window is < 1% of `produced_total`.\n4. `dispatcher.pubsub.timed_out` delta over the 24 h window is < 10 events per node (today's saturation regime produces hundreds).\n5. Supervisor never scales below `PUBSUB_WORKER_MIN` (1) or above `PUBSUB_WORKER_MAX` (16) — verified by inspecting per-tick log lines.\n6. Supervisor scale change rate < 4 transitions per node per hour in steady state — proves no flapping under hysteresis.\n\n## Validation plan\n1. Deploy to one VPS first (sydney — the worst-case node) and watch the supervisor logs for 30-60 min. If it converges to a stable target (likely 4-6) and stays clean, deploy to the other 5.\n2. 24 h soak with telemetry sampled every 5 min. Capture per-node supervisor transitions to `proofs/X0X-0009-soak/`.\n3. Phase A every 4 h during the soak; assert 30/30.\n4. After soak: revert any persistent config tuning the team did for X0X-0008 — `dispatch_workers = 1` should be the operator-facing default everywhere. Verify the mesh self-tunes to the right shape.\n\n## Risk\nMedium. The supervisor mutates a hot-path atomic and spawns tokio tasks at runtime. Rollback is the same shape as X0X-0005: set `dispatch_workers` higher than the supervisor would converge to and the supervisor's scale-up logic becomes a no-op. Or compile-time disable the supervisor by removing one `tokio::spawn` call.\n\nRisk mitigations already in:\n- Decision function is pure and unit-tested across 16 scenarios.\n- Worker self-exit pattern proven by a live tokio test.\n- Hysteresis (10 intervals = 5 min) prevents flapping.\n- Hard floor + ceiling (1, 16) prevents pathological scaling.\n- All policy constants live in one place at the top of `src/gossip/runtime.rs` for easy review.\n\n## Why now\nUser push-back on the X0X-0008 'tune workers per node' advice (2026-05-03): 'bumping these workers does not seem like something we want users having to do, and we need our network to be used by all users in all locations'. This ticket is the answer.", "priority": 1, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "adaptive", "supervisor"], "blocked_by": [], "created_at": "2026-05-03T00:00:00Z", "updated_at": "2026-05-03T00:30:00Z", "acceptance": ["After deploy with NO operator dispatch_workers tuning: Phase A passes 30/30 over 3 consecutive runs, mesh uptime > 24 h", "/diagnostics/gossip.dispatcher.pubsub_workers differs from the configured value on at least one long-RTT node within 5 min of restart", "recv_pump.pubsub.dropped_full per-node delta over 24 h < 1% of produced_total", "dispatcher.pubsub.timed_out per-node delta over 24 h < 10 events", "Supervisor never violates PUBSUB_WORKER_MIN (1) or PUBSUB_WORKER_MAX (16)", "Supervisor transitions per node per hour < 4 in steady state (no flapping)"], "validation": ["Single-VPS deploy (sydney) + 30-60 min observation; supervisor converges to a stable target and Phase A 30/30", "Full 6-node deploy, 24 h soak, telemetry every 5 min preserved at proofs/X0X-0009-soak/", "Phase A every 4 h during the 24 h soak, all 30/30", "/etc/x0x/config.toml reverted to dispatch_workers = 1 on every node before measuring acceptance bars"], "links": [{"kind": "ticket", "url": "X0X-0005", "note": "Manual parallel-workers config — closes once X0X-0009 makes it adaptive"}, {"kind": "ticket", "url": "X0X-0006", "note": "Per-stage instrumentation that supplies the avg_dispatch_ms signal"}, {"kind": "ticket", "url": "X0X-0007", "note": "Parallel republish + per-peer timeout that supplies the per_peer_timeout signal"}, {"kind": "ticket", "url": "X0X-0008", "note": "Per-message-kind diagnostics + per-node tuning (manual; X0X-0009 obviates the manual half)"}, {"kind": "adr", "url": "docs/adr/0009-recv-pump-overload-policy.md", "note": "Receive-pump overload policy"}, {"kind": "code", "url": "src/gossip/runtime.rs:17-51", "note": "All 8 policy constants in one place"}, {"kind": "code", "url": "src/gossip/runtime.rs:357-471", "note": "supervisor_decide_target pure decision function + SupervisorSample struct"}, {"kind": "code", "url": "src/gossip/runtime.rs:474-573", "note": "run_pubsub_worker_supervisor async task (delta computation + spawn)"}, {"kind": "code", "url": "src/gossip/runtime.rs:271-292", "note": "Worker self-exit check at top of dispatcher loop"}, {"kind": "evidence", "url": "proofs/X0X-0008-validate-2026-05-02T20-18Z/", "note": "Per-node manual-tuning evidence motivating the adaptive design"}, {"kind": "discussion", "url": "session 2026-05-03", "note": "User push-back on per-node workers tuning was the trigger"}], "handoff": {"follow_up": ["X0X-0009 soak handoff 2026-05-03T00:30:00Z:", "X0X-0009 soak on 2026-05-03 (proofs/X0X-0009-soak-2026-05-03T00-07Z/) validated the supervisor itself: scale-up to ceiling within 3 minutes, no flapping, per-peer-timeout-budget signal cleanly partitioned keeps-up vs broken nodes. Soak also showed the supervisor cannot close the gap on its own — one external peer (c1dfdbd98799fc47) was consuming 98% of dispatcher capacity via repeated 750 ms send timeouts. X0X-0010 filed for the actual fix (sender-side peer cooling). X0X-0009 closes as 'shipped as designed; necessary but not sufficient' once X0X-0010 lands and a re-soak confirms the supervisor stays near the floor (1-2 workers) under the same load."], "updated_at": "2026-05-03T00:30:00Z"}}
{"id": "X0X-0010", "identifier": "X0X-0010", "title": "Slow-peer cooling / demotion in PlumTree EAGER membership (saorsa-gossip-pubsub)", "description": "## Why\nX0X-0009's adaptive worker supervisor was deployed to the 6-node VPS bootstrap mesh on 2026-05-03T00:07Z. The supervisor performed exactly as designed — within 3 minutes it scaled 4 of 6 nodes to the worker ceiling (16) based on observed saturation signals, and held there with no flapping. But producer rate (180-215 msg/s delta over the supervisor interval) continued to exceed consumer rate (45-120 msg/s) on those 4 nodes, with linearly growing drops (170 msg/s drop rate on nuremberg). The supervisor had reached the ceiling; more workers were not the answer.\n\nTen minutes of journal evidence pinpointed the actual blocker:\n\n```\n==== top per-peer timeout sources, last 5 min ====\n----- nyc -----\n 2044 peer_id=c1dfdbd98799fc47\n----- sfo -----\n 2849 peer_id=c1dfdbd98799fc47\n----- helsinki -----\n 2467 peer_id=c1dfdbd98799fc47\n----- nuremberg -----\n 5456 peer_id=c1dfdbd98799fc47\n----- singapore -----\n 4940 peer_id=c1dfdbd98799fc47\n----- sydney -----\n 4389 peer_id=c1dfdbd98799fc47\n```\n\n**One peer (`c1dfdbd98799fc47`) was responsible for 22,145 of 22,500 timeouts (98.4%) across all 6 nodes in 5 minutes.** That peer is not one of the 6 saorsa-N bootstrap machines — it is an external user node or a stale entry in the eager set.\n\nWorker-time-load math (per-peer-timeout-rate × 750 ms / workers) from the same 5-min window:\n\n| Node | Timeout-Worker Load | Verdict |\n|---|---|---|\n| nyc | 6.8/s × 0.75 / 16 = **32%** | at budget, keeps up |\n| sfo | 9.5/s × 0.75 / 16 = **45%** | losing |\n| helsinki | 8.2/s × 0.75 / 16 = **38%** | losing |\n| nuremberg | 18.2/s × 0.75 / 16 = **85%** | broken |\n| singapore | 16.5/s × 0.75 / 16 = **77%** | broken |\n| sydney | 14.6/s × 0.75 / 16 = **68%** | broken |\n\nOn the broken nodes most worker capacity is not decoding or deduping — it is parked in `tokio::time::timeout(750ms, transport.send_to_peer(peer, ...))` for the same bad peer over and over. The supervisor's per-peer-timeout-budget signal correctly identified saturation, but no amount of additional workers helps when each new worker also burns its 750 ms slot on the same dead edge.\n\nMessage-kind diagnostics (X0X-0008) confirm 69-76% of pubsub is EAGER fanout — this is real republish work, not control storms.\n\n## What X0X-0009 proved and what it didn't\nProved:\n\n- Adaptive scaling works: nyc converged to 16 and held; sfo/sydney scaled to ceiling and held; the supervisor never flaps; the 30% per-peer-timeout-budget threshold cleanly partitions \"keeps up\" from \"broken\" in the live data.\n- The diagnostic shape (per-peer-timeout-rate exposed via `pubsub_stages.republish_per_peer_timeout`) is the right shape — operators can see at a glance which nodes are timeout-bound.\n\nDid NOT prove (because it cannot):\n\n- That more workers would close the gap. The recv_pubsub_rx Mutex is a partial bottleneck but the dominant cost is per-peer send timeouts to a single bad peer, which more workers makes WORSE not better (more workers = more parallel 750 ms timeouts to the same peer).\n\n## Root cause in saorsa-gossip-pubsub\nPlumTree EAGER membership has no slow-peer feedback loop. A peer added to the eager set (via initial join or graft) stays there forever unless explicitly pruned. Per-peer send timeouts are logged + counted but no eager-set membership change happens.\n\nCode references in `../saorsa-gossip/crates/pubsub/src/lib.rs`:\n\n- `parallel_send_to_peers` (X0X-0007) wraps each `send_to_peer` in `tokio::time::timeout(PER_PEER_REPUBLISH_TIMEOUT, ...)` and on timeout calls `stage_stats.record_per_peer_timeout()`. That's the ENTIRE response — the peer stays eligible for the next EAGER republish 16 ms later.\n- `eager_peers: HashSet<PeerId>` per topic in `TopicState` is mutated only by `graft_peer` / `prune_peer` calls driven by IHAVE / IWANT correlation, never by send-side timeouts.\n- The dispatcher loop has no awareness of per-peer health — it calls `parallel_send_to_peers(eager_peers, ...)` with whatever set is currently in the topic state.\n\n## Fix: sender-side peer cooling\nAdd a per-(peer, topic) timeout-rate tracker inside `PlumtreePubSub`. When the rolling-window timeout count for a peer exceeds a threshold:\n\n1. **Suppress sends to that peer for a cooldown period.** `parallel_send_to_peers` skips peers in the suppression set. The skip is fast (no 750 ms wait) so dispatcher capacity is freed.\n2. **Demote from eager → lazy for affected topics.** PlumTree tree-repair already promotes/demotes via IHAVE correlation; this adds a sender-side trigger. The peer can re-enter eager via the normal graft path once it recovers.\n3. **Expose the suppression set in diagnostics.** New `pubsub_stages.suppressed_peers` field listing `(peer_id, suppressed_until, recent_timeout_rate, affected_topics_count)`. Operators see at a glance which peers are being cooled, instead of grepping journalctl.\n\nSuggested initial thresholds (tunable as constants):\n\n- `PEER_TIMEOUT_WINDOW: Duration = 30s`\n- `PEER_TIMEOUT_THRESHOLD: usize = 5` (5 timeouts in 30s = suppress)\n- `PEER_SUPPRESSION_COOLDOWN: Duration = 120s` (2 min suppression, then probe)\n- `PEER_SUPPRESSION_BACKOFF_MAX: Duration = 1800s` (30 min ceiling on backoff doubling for repeat offenders)\n\nSuppression with cooldown — not permanent ban — because a peer may be transiently slow (their dispatcher saturated, NAT renegotiation in flight, etc) and recover.\n\n## Why NOT shorten PER_PEER_REPUBLISH_TIMEOUT alone\nConsidered. Shortening 750 ms → 250 ms would reduce per-burn worker time by 3×, but without peer suppression each new worker would just retry the same bad peer 3× faster. Net: same wall-clock burn, more dispatcher cycles spent on dead edges, more journald log volume. Shortening is reasonable AS WELL once cooling lands (suppressed peers are exempt from the budget anyway, so a shorter timeout only affects healthy-but-slow peers).\n\n## Acceptance bar\n1. After deploy, with no operator config changes:\n - `pubsub_stages.republish_per_peer_timeout` rate per node drops by ≥ 80% within 5 minutes vs the 2026-05-03 baseline (currently 5,400/5min on nuremberg = 18/s; target < 4/s).\n - `recv_pump.pubsub.dropped_full` per-node delta over the first hour < 1% of `produced_total` (currently 70% on broken nodes).\n - `pubsub_stages.suppressed_peers` shows the bad peer (`c1dfdbd98799fc47` in the 2026-05-03 capture) within the first supervisor interval.\n - Phase A passes 30/30 over 3 consecutive runs after 24 h uptime.\n2. Suppression set never grows unboundedly — bounded by the active eager set size × number of topics (already bounded by PlumTree). Existing entries age out via the cooldown.\n3. A peer that was suppressed and recovers (no timeouts in the next window) is restored to the eager set and starts receiving again — observable via the suppressed_peers diagnostic going to zero for that peer.\n\n## Validation plan\n1. Unit test in saorsa-gossip-pubsub: synthetic transport that times out for one peer; assert suppression triggers within the window threshold and skip-list updates correctly. Cooldown re-admit covered by a second test that lets the synthetic transport succeed after the cooldown.\n2. Unit test: per-(peer, topic) tracking — a peer slow on topic A stays in eager for topic B if topic B is fine.\n3. Re-deploy + re-soak: same X0X-0009 supervisor in place, watch the supervisor target STAY at 1-2 across the mesh because the per-peer-timeout budget signal no longer fires.\n4. VPS Phase A 30/30 × 3 runs after 24 h uptime.\n\n## Risk + rollback\nMedium. Touching PlumTree eager-set mutation is the most delicate part of saorsa-gossip-pubsub. PlumTree's tree-repair logic depends on the eager/lazy split being correct for IHAVE recovery to work. Suppression must NOT permanently remove a peer from the topic state, only from the eager-side fanout for the cooldown duration; IHAVE/IWANT correlation continues to exercise the peer.\n\nRollback: a config flag `gossip.peer_suppression_enabled = false` reverts to the pre-X0X-0010 behaviour (timeout, log, retry forever). Default true; the fix is too valuable to ship behind opt-in.\n\n## Why now\nX0X-0009 + the 2026-05-03 soak surfaced the architectural ceiling the user predicted: 'workers are being converted into blocked outbound fanout slots'. The diagnostic infrastructure (per-peer timeout counter from X0X-0007, message-kind counters from X0X-0008, supervisor target visibility from X0X-0009) all collectively make this fix testable; without them we'd have been guessing. The bad peer (`c1dfdbd98799fc47`) is currently active and consuming 98% of dispatcher capacity on every long-RTT bootstrap node. Operationally this is the highest-leverage fix on the backlog.\n\n## Links\n- X0X-0007: parallel republish + per-peer timeout (the timeout this ticket adds cooling on top of)\n- X0X-0008: per-message-kind counters (proves it's EAGER, not control)\n- X0X-0009: adaptive supervisor (correctly identified saturation, now needs this to make scale-up sufficient)\n- proofs/X0X-0009-soak-2026-05-03T00-07Z/: the 3-sample CSV + raw diagnostics that motivated this ticket", "priority": 1, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "saorsa-gossip", "structural", "plumtree"], "blocked_by": [], "created_at": "2026-05-03T00:30:00Z", "updated_at": "2026-05-03T17:25:00Z", "acceptance": ["After deploy without operator config: pubsub_stages.republish_per_peer_timeout rate per node drops ≥ 80% vs the 2026-05-03 baseline within 5 minutes", "recv_pump.pubsub.dropped_full per-node delta over the first hour < 1% of produced_total", "pubsub_stages.suppressed_peers diagnostic exposes the bad peer set with peer_id + cooldown_until + recent_rate", "Phase A 30/30 across 3 consecutive runs after 24 h mesh uptime, with workers floor still 1", "Suppression set never grows unboundedly (entries age out via cooldown)", "Recovered peer is re-admitted to eager set and starts receiving again — observable via suppressed_peers diagnostic dropping that peer"], "validation": ["Unit test in saorsa-gossip-pubsub: synthetic 1-slow-peer transport triggers suppression within threshold window", "Unit test: cooldown re-admit after slow peer recovers", "Unit test: per-(peer, topic) — slow on A stays in eager on B", "VPS deploy + 5 min collection: per-peer-timeout rate per node drops ≥ 80% (compare to /tmp/x0x-x9-soak/soak.csv)", "VPS deploy + 24 h soak: drops < 1%, supervisor targets stay near floor (1-2), Phase A 30/30 × 3"], "links": [{"kind": "ticket", "url": "X0X-0007", "note": "parallel_send_to_peers wraps send_to_peer in 750ms timeout — this ticket adds cooling on top"}, {"kind": "ticket", "url": "X0X-0008", "note": "per-message-kind counters confirmed 69-76% EAGER → republish fanout is the load"}, {"kind": "ticket", "url": "X0X-0009", "note": "Supervisor proved more workers do not help when one peer pins them all"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:parallel_send_to_peers", "note": "Per-peer timeout site — needs suppression-set check before each send"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:TopicState::eager_peers", "note": "Eager set mutation site — needs sender-side timeout-driven demotion"}, {"kind": "evidence", "url": "proofs/X0X-0009-soak-2026-05-03T00-07Z/", "note": "3-sample soak CSV showing per-peer timeout dominance"}, {"kind": "discussion", "url": "session 2026-05-03", "note": "User analysis of timeout-worker-load math identified the architectural issue"}], "handoff": {"summary": "Implemented and published saorsa-gossip v0.5.26 for sender-side slow-peer cooling across EAGER/IHAVE and single-peer recovery paths, then consumed it in x0x and deployed the updated x0xd to the six-node VPS bootstrap mesh. Live deploy verification passed Phase A 30/30 and Phase B 59/59. Two post-deploy diagnostics windows show the old degradation shape is gone: producer rate matches dequeuer rate, recv_pump.pubsub.dropped_full stays flat at 0, queue depth drains to near-zero, and dispatcher.pubsub.timed_out has no new events. Residual: per-peer timeout probes still remain above the ideal X0X-0010 target on the long-tail nodes (notably sydney and nuremberg) while suppression/backoff state fills; keep this in review until a longer soak confirms the timeout tail decays and workers back down safely.\n\n2026-05-03 update: 0.5.30 deploy verifies the open question. Cluster-wide cooling absorbs the per-peer timeout tail (≈78 cluster events over 5-min Phase B), dispatcher.pubsub.timed_out is 0 on all 6 nodes, recv_pump drops are 0, and supervisors are parked at 7-12 workers (well below the 16 ceiling).", "files_changed": ["../saorsa-gossip/crates/pubsub/src/lib.rs", "../saorsa-gossip/Cargo.toml", "../saorsa-gossip/CHANGELOG.md", "Cargo.toml", "src/gossip/runtime.rs", "src/gossip/config.rs", "src/bin/x0xd.rs", "tests/e2e_vps_mesh.py", ".deployment/config/bootstrap-config.toml", "docs/adr/0009-recv-pump-overload-policy.md", ".config/nextest.toml", ".gitignore", "tests/named_group_integration.rs"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib", "status": "passed (57/57) before v0.5.26 publish"}, {"command": "cargo clippy -p saorsa-gossip-pubsub --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed before v0.5.26 publish"}, {"command": "saorsa-gossip tag v0.5.26 release workflow 25268361203", "status": "passed; GitHub release + crates.io publish completed"}, {"command": "cargo fmt --all -- --check", "status": "passed in x0x"}, {"command": "cargo test -p x0x --lib gossip::runtime", "status": "passed (27/27)"}, {"command": "cargo test -p x0x --lib gossip::config", "status": "passed (4/4)"}, {"command": "python3 -m py_compile tests/e2e_vps_mesh.py tests/runners/x0x_test_runner.py tests/e2e_vps_groups.py", "status": "passed"}, {"command": "cargo clippy -p x0x --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed"}, {"command": "cargo zigbuild --release --target x86_64-unknown-linux-gnu --bin x0xd", "status": "passed"}, {"command": "SKIP_BUILD=1 MESH_VERIFY=1 MESH_DISCOVER_SECS=45 MESH_SETTLE_SECS=150 bash tests/e2e_deploy.sh --mesh-verify", "status": "passed: 24/24 health checks, Phase A 30/30, Phase B 59/59"}, {"command": "python3 tests/e2e_vps_mesh.py --anchor nyc --discover-secs 45 --settle-secs 90 --post-discover-settle-secs 10 --local-port 22746", "status": "passed post-monitor: Phase A 30/30 sent, 30/30 received"}, {"command": "cargo nextest run --all-features --workspace", "status": "passed after final x0x changes: 1097/1097, 142 skipped"}, {"command": "cargo nextest run --all-features --test named_group_integration -- --ignored", "status": "passed after final x0x changes: 23/23"}, {"command": "GitHub Actions on main for 9df2b2f", "status": "passed: Build, CI, Integration & Soak Tests, Security Audit"}, {"command": "SKIP_BUILD=1 MESH_VERIFY=1 MESH_DISCOVER_SECS=45 MESH_SETTLE_SECS=150 bash tests/e2e_deploy.sh --mesh-verify", "status": "passed after membership dispatcher fix: 24/24 health checks, Phase A 30/30, Phase B 59/59"}, {"command": "0.5.30 VPS matrix proof", "status": "2026-05-03 17:24Z post-deploy of saorsa-gossip 0.5.30 (x0x commit 149f069): Phase A 30/30 × 3 consecutive runs, Phase B 59/59 incl. 5-runner roster + 5 cached replies on anchor; all 6 VPS nodes show dispatcher.pubsub.timed_out=0, recv_pump.pubsub.dropped_full=0; per-peer republish timeouts capped (cluster total ≈78 events over 5-min Phase B window) — all absorbed by cooling, none escalated to dispatcher timeouts. Active suppressions 24-85 per node, peer scoring demoted 229-438 (peer,topic) pairs to LAZY, outbound budget exhaustion regulating 9.7k-23k events per node."}], "post_deploy_monitoring": ["Immediate 60s diagnostics window: all six nodes had prod/s == deq/s, dropped_full delta 0, dispatcher timeout delta 0; workers 31-32; per-peer timeout rates still high on sydney/nuremberg while suppression populated.", "Delayed steady-state 60s diagnostics window after 240s quiet period: prod/s matched deq/s on all nodes, dropped_full delta 0, dispatcher timeout delta 0, depths <= 65; per-peer timeout rates nyc=1.92/s, sfo=0.53/s, helsinki=1.67/s, nuremberg=4.75/s, singapore=2.38/s, sydney=9.95/s.", "After final 9df2b2f deploy: membership dispatcher now exposes membership_workers=4; delayed 60s window showed all membership depths at 1, membership drops 0, pubsub drops 0, pubsub queue depths at 1, and pubsub producer/dequeue rates matched on all nodes. Residual: per-peer timeout tail remains above target on helsinki/nuremberg in that window, and one sydney pubsub dispatcher timeout occurred; keep in review for longer soak before done.", "Soak probe 2 update (2026-05-03T08:36Z): Phase A remained 30/30 and recv_pump.pubsub.dropped_full stayed 0, but dispatcher.pubsub.timed_out is not flat: singapore increased 0→2 over a 3-minute window while nyc=2, sfo=1, helsinki=1, nuremberg=2, and sydney=6 were unchanged. Treat as a soft regression signal in the same residual timeout-tail class, not a delivery regression; keep X0X-0010 in review."], "follow_up": ["Run a longer soak before marking done: confirm dropped_full remains 0, dispatcher.pubsub.timed_out remains flat, and per-peer timeout rate decays below the X0X-0010 target as cooldown/backoff repeats.", "If sydney remains >4/s after the longer soak, tighten the saorsa-gossip cooling policy (lower first-window threshold or longer initial cooldown) and publish the next patch release.", "Review x0x supervisor scale-down policy separately: current 5-minute hysteresis plus one-worker decrement means a node that hit 32 workers can take hours to return to floor even after health recovers.", "If membership depth rises again under quiet load, file a separate HyParView/SWIM control-plane ticket; 9df2b2f fixes the observed single-consumer backlog but does not change SWIM timeout policy.", "Before moving X0X-0010 to done, require multiple 30-minute soak reviews with Phase A 30/30, drops=0, and dispatcher.pubsub.timed_out flat or below the ticket threshold on every node; if singapore continues to add timeouts, tune slow-peer cooling/backoff rather than closing."], "updated_at": "2026-05-03T08:36:38Z"}}
{"id": "X0X-0011", "identifier": "X0X-0011", "title": "Gossipsub-style decayed peer score for PlumTree mesh selection", "description": "## Why\nX0X-0010 added send-side cooling, but the health signal lives outside `PeerScore`. Current scoring in `../saorsa-gossip/crates/pubsub/src/lib.rs` only considers IWANT response rate and recency, so repeated outbound timeouts do not directly affect later eager/lazy selection once cooldown expires.\n\nSOTA pubsub systems such as Gossipsub v1.1 use decayed peer scores and thresholds to steer mesh membership, gossip, publishing, and opportunistic replacement. PlumTree gives us EAGER/LAZY repair, but production WAN meshes need slow-send evidence in the same selection model.\n\n## What\nExtend saorsa-gossip-pubsub peer scoring with decayed send-side health: successful outbound sends, per-peer send timeouts, cooling events, recovery probes, IWANT fulfillment, and recency. Use the score when choosing lazy peers to graft, eager peers to prune, and peers eligible for recovery after cooling. Expose score components in diagnostics at coarse resolution so operators can see why a peer is not in EAGER.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "saorsa-gossip", "sota", "peer-scoring"], "blocked_by": [], "created_at": "2026-05-03T10:26:56Z", "updated_at": "2026-05-03T17:25:00Z", "acceptance": ["Outbound send timeouts and cooling events reduce a peer's mesh-selection score without requiring operator action", "Successful post-cooldown sends recover score gradually via decay or positive samples, not immediate full trust", "EAGER promotion and demotion prefer high-score peers and avoid low-score peers when alternatives exist", "Diagnostics expose enough peer-score component data to explain why a peer is EAGER, LAZY, cooled, or excluded", "Existing X0X-0010 clean-soak behavior does not regress: Phase A 30/30, drops 0, dispatcher timeout rate flat in a 6-node soak"], "validation": ["cargo test -p saorsa-gossip-pubsub --lib peer_score", "cargo test -p saorsa-gossip-pubsub --lib cooling", "Synthetic test: repeated outbound timeouts lower score below eager eligibility, then successful probes recover it gradually", "VPS soak: suppressed peer set does not oscillate and per-peer timeout tail continues to decay"], "links": [{"kind": "source", "url": "https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md", "note": "Primary Gossipsub v1.1 spec: peer scoring, thresholds, opportunistic grafting"}, {"kind": "source", "url": "https://github.com/libp2p/go-libp2p-pubsub/blob/master/score_params.go", "note": "Primary implementation source for decayed peer-score parameters"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:549", "note": "Current PeerScore only includes IWANT response counts and recency"}, {"kind": "ticket", "url": "X0X-0010", "note": "Cooling is implemented, but not yet part of score-driven mesh selection"}], "handoff": {"summary": "Implemented decayed send-side peer scoring in saorsa-gossip 6b5252b / v0.5.29 and consumed it in x0x commit 6019948. Mesh selection now incorporates send success, send timeouts, cooling/recovery evidence, IWANT fulfillment, and recency so slow peers affect later EAGER/LAZY choices instead of only transient timeout handling.", "files_changed": ["../saorsa-gossip/CHANGELOG.md", "../saorsa-gossip/Cargo.toml", "../saorsa-gossip/crates/pubsub/src/lib.rs", "Cargo.toml"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib peer_score", "status": "passed"}, {"command": "cargo test -p saorsa-gossip-pubsub --lib cooling", "status": "passed"}, {"command": "cargo test --workspace --all-features", "status": "passed in saorsa-gossip before v0.5.29 release"}, {"command": "cargo clippy --workspace --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed in saorsa-gossip before v0.5.29 release"}, {"command": "cargo test --all-features", "status": "passed in x0x after consuming v0.5.29"}, {"command": "0.5.30 VPS matrix proof", "status": "2026-05-03 17:24Z post-deploy of saorsa-gossip 0.5.30 (x0x commit 149f069): Phase A 30/30 × 3 consecutive runs, Phase B 59/59 incl. 5-runner roster + 5 cached replies on anchor; all 6 VPS nodes show dispatcher.pubsub.timed_out=0, recv_pump.pubsub.dropped_full=0; per-peer republish timeouts capped (cluster total ≈78 events over 5-min Phase B window) — all absorbed by cooling, none escalated to dispatcher timeouts. Active suppressions 24-85 per node, peer scoring demoted 229-438 (peer,topic) pairs to LAZY, outbound budget exhaustion regulating 9.7k-23k events per node."}], "follow_up": ["Human review should confirm the live soak evidence before marking done.", "Later score tuning should use observed WAN score distributions rather than changing thresholds blindly."]}}
{"id": "X0X-0012", "identifier": "X0X-0012", "title": "Single-probe recovery and exponential PRUNE/GRAFT backoff for cooled peers", "description": "## Why\nX0X-0010 currently suppresses a peer after repeated timeouts, then allows re-admission after cooldown. If the peer is still bad, the implementation can spend another full timeout window before suppressing it again. In a high-rate WAN mesh, that means repeated 750 ms worker burns during every recovery cycle.\n\nGossipsub-style mesh maintenance uses PRUNE backoff and GRAFT flood protection so a bad or too-eager edge cannot churn the mesh or repeatedly consume capacity.\n\n## What\nAdd a recovery-probe state for cooled peers. After cooldown expires, allow a single bounded recovery send for that peer/topic. If it succeeds, clear or reduce cooling according to score. If it times out, immediately re-suppress and increase backoff without waiting for another full timeout threshold. Apply the same backoff guard to GRAFT paths so IWANT recovery cannot instantly restore a repeatedly failing eager edge.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "saorsa-gossip", "sota", "backoff"], "blocked_by": [], "created_at": "2026-05-03T10:26:56Z", "updated_at": "2026-05-03T17:25:00Z", "acceptance": ["A cooled peer gets at most one recovery probe per peer/topic cooldown interval", "A failed recovery probe immediately re-suppresses the peer and increases cooldown/backoff without needing PEER_TIMEOUT_THRESHOLD more timeouts", "GRAFT from a cooled or recently failed peer respects backoff and cannot immediately put that peer back into EAGER", "Diagnostics distinguish active cooldown from recovery-probe state and show current backoff duration", "No delivery regression: LAZY/IHAVE/IWANT repair still recovers messages from peers that become healthy"], "validation": ["cargo test -p saorsa-gossip-pubsub --lib cooling", "Unit test: failed post-cooldown probe immediately doubles backoff and does not consume five more timeout slots", "Unit test: successful post-cooldown probe permits controlled re-admission", "Unit test: IWANT-driven graft respects active backoff", "VPS soak: residual per-peer timeout tail falls below X0X-0010 target without growing drops"], "links": [{"kind": "source", "url": "https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md", "note": "Primary Gossipsub v1.1 spec: PRUNE backoff and GRAFT controls"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:831", "note": "Current send-timeout accounting and suppression trigger"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:906", "note": "Current GRAFT path skips active suppression but has no explicit post-cooldown probe/backoff state"}, {"kind": "ticket", "url": "X0X-0010", "note": "Builds on existing slow-peer cooling"}], "handoff": {"summary": "Implemented single-probe cooled-peer recovery and exponential backoff in saorsa-gossip 98df44e / v0.5.27 and consumed it in x0x commit 9c0f006. Cooled peers now re-enter through a bounded probe path; failed probes re-suppress quickly instead of burning another full timeout threshold.", "files_changed": ["../saorsa-gossip/CHANGELOG.md", "../saorsa-gossip/Cargo.toml", "../saorsa-gossip/crates/pubsub/src/lib.rs", "Cargo.toml"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib cooling", "status": "passed"}, {"command": "cargo test -p saorsa-gossip-pubsub --lib peer_score", "status": "passed"}, {"command": "cargo test --workspace --all-features", "status": "passed in saorsa-gossip before v0.5.27 release"}, {"command": "cargo clippy --workspace --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed in saorsa-gossip before v0.5.27 release"}, {"command": "cargo test --all-features", "status": "passed in x0x after consuming v0.5.27"}, {"command": "0.5.30 VPS matrix proof", "status": "2026-05-03 17:24Z post-deploy of saorsa-gossip 0.5.30 (x0x commit 149f069): Phase A 30/30 × 3 consecutive runs, Phase B 59/59 incl. 5-runner roster + 5 cached replies on anchor; all 6 VPS nodes show dispatcher.pubsub.timed_out=0, recv_pump.pubsub.dropped_full=0; per-peer republish timeouts capped (cluster total ≈78 events over 5-min Phase B window) — all absorbed by cooling, none escalated to dispatcher timeouts. Active suppressions 24-85 per node, peer scoring demoted 229-438 (peer,topic) pairs to LAZY, outbound budget exhaustion regulating 9.7k-23k events per node."}], "follow_up": ["Watch the residual per-peer timeout tail in soak output; failed probes should re-suppress without repeated long stalls.", "Human review should decide whether cooldown/backoff defaults need production tuning after more WAN samples."]}}
{"id": "X0X-0013", "identifier": "X0X-0013", "title": "Replace eager bulk refresh with scored mesh maintenance", "description": "## Why\nx0x refreshes PlumTree topic peers every second, and saorsa-gossip currently promotes connected LAZY peers back into EAGER when they are not actively suppressed. That protects against permanent PRUNE damage, but it also fights the EAGER/LAZY optimization and can undo slow-peer demotion too aggressively.\n\nSOTA pubsub meshes keep bounded degree targets and repair the mesh on heartbeat using score-aware promotion, pruning, and opportunistic grafting rather than bulk re-promoting every connected peer.\n\n## What\nChange topic-peer refresh from bulk eager promotion to scored mesh maintenance. Keep disconnected peers pruned. Add new peers conservatively. Preserve LAZY state for connected peers unless the topic is below minimum degree or the peer is selected by score/opportunistic graft. Document the mapping between PlumTree MIN/MAX_EAGER_DEGREE and Gossipsub-style D_low/D_high/D_lazy behavior.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "saorsa-gossip", "sota", "mesh-maintenance"], "blocked_by": [{"id": "X0X-0011", "identifier": "X0X-0011", "state": "review"}], "created_at": "2026-05-03T10:26:56Z", "updated_at": "2026-05-03T17:25:00Z", "acceptance": ["Periodic refresh no longer promotes every connected LAZY peer to EAGER by default", "EAGER degree remains within configured min/max targets under churn and after PRUNE events", "When below minimum degree, promotion chooses eligible high-score LAZY peers first and skips cooled/low-score peers", "Opportunistic graft periodically replaces low-score eager peers when better lazy peers are available", "ADR or design note records the PlumTree-to-Gossipsub mesh parameter mapping"], "validation": ["cargo test -p saorsa-gossip-pubsub --lib set_topic_peers", "Update existing set_topic_peers tests that currently assert bulk re-promotion", "New churn test: duplicate-driven PRUNE remains stable across repeated refresh ticks", "New slow-peer test: cooled LAZY peer is not bulk-promoted after refresh while healthy alternatives exist", "VPS soak: EAGER set remains stable, Phase A stays 30/30, drops 0"], "links": [{"kind": "source", "url": "https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.0.md", "note": "Primary Gossipsub v1.0 spec: mesh degree and heartbeat maintenance"}, {"kind": "source", "url": "https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md", "note": "Primary Gossipsub v1.1 spec: opportunistic grafting and scoring thresholds"}, {"kind": "source", "url": "https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf", "note": "Primary PlumTree paper: EAGER tree plus LAZY repair model"}, {"kind": "code", "url": "src/gossip/runtime.rs:996", "note": "x0x refreshes PlumTree topic peers every second"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:2433", "note": "Current set_topic_peers promotes connected lazy peers back to eager"}], "handoff": {"summary": "Implemented scored eager mesh maintenance in saorsa-gossip 118dd8d / v0.5.30 and consumed it in this x0x change. Refresh now admits new peers as LAZY first, maintains EAGER degree with score-aware promotion/pruning, and uses rate-limited opportunistic grafting instead of bulk re-promoting every connected peer.", "files_changed": ["../saorsa-gossip/CHANGELOG.md", "../saorsa-gossip/Cargo.toml", "../saorsa-gossip/crates/pubsub/src/lib.rs", "../saorsa-gossip/docs/adr/ADR-009-peer-scoring.md", "Cargo.toml"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib", "status": "passed 87/87 before v0.5.30 release"}, {"command": "cargo test --workspace --all-features", "status": "passed in saorsa-gossip before v0.5.30 release"}, {"command": "cargo clippy --workspace --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed in saorsa-gossip before v0.5.30 release"}, {"command": "GitHub release workflow for tag v0.5.30", "status": "passed and published to crates.io"}, {"command": "cargo fmt --all -- --check", "status": "passed in x0x after consuming v0.5.30"}, {"command": "cargo test --all-features", "status": "passed in x0x after consuming v0.5.30"}, {"command": "cargo clippy --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed in x0x after consuming v0.5.30"}, {"command": "0.5.30 VPS matrix proof", "status": "2026-05-03 17:24Z post-deploy of saorsa-gossip 0.5.30 (x0x commit 149f069): Phase A 30/30 × 3 consecutive runs, Phase B 59/59 incl. 5-runner roster + 5 cached replies on anchor; all 6 VPS nodes show dispatcher.pubsub.timed_out=0, recv_pump.pubsub.dropped_full=0; per-peer republish timeouts capped (cluster total ≈78 events over 5-min Phase B window) — all absorbed by cooling, none escalated to dispatcher timeouts. Active suppressions 24-85 per node, peer scoring demoted 229-438 (peer,topic) pairs to LAZY, outbound budget exhaustion regulating 9.7k-23k events per node."}], "follow_up": ["Live 6-node mesh soak is still required before marking done; local validation proves compatibility, not WAN equilibrium.", "Expect graft/prune/eager_fanout diagnostics to shift because EAGER membership is now intentionally bounded and lazy-first."]}}
{"id": "X0X-0014", "identifier": "X0X-0014", "title": "Per-peer outbound PubSub concurrency and queue budgets", "description": "## Why\nX0X-0007 bounds each send with a per-peer timeout and X0X-0010 suppresses repeated offenders, but a bad peer can still consume several concurrent worker slots before suppression activates or during recovery. Large-userbase readiness requires hard per-peer outbound budgets so one peer cannot convert arbitrary fanout work into blocked sends.\n\nProduction Gossipsub deployments pair mesh scoring with queue limits and bounded control/data-plane work. Ethereum consensus clients explicitly rely on queueing and validation limits around Gossipsub rather than unbounded propagation work.\n\n## What\nIntroduce per-peer outbound PubSub permits or small queues around EAGER/IHAVE/IWANT/anti-entropy sends. A peer should have a bounded number of in-flight PubSub sends, ideally one data send plus a small control budget. Excess work should be coalesced where possible, delayed, or skipped with score/counter feedback instead of spawning another task that can hit the full timeout.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "performance", "saorsa-gossip", "sota", "backpressure"], "blocked_by": [], "created_at": "2026-05-03T10:26:56Z", "updated_at": "2026-05-03T17:25:00Z", "acceptance": ["A single peer cannot occupy more than the configured outbound PubSub permit budget on a node", "EAGER fanout, IHAVE flush, IWANT recovery, and anti-entropy all use the same peer-budget accounting", "When a peer is over budget, IHAVE/control work is coalesced or skipped with diagnostics instead of unbounded task growth", "Budget exhaustion feeds peer score or cooling so repeated pressure affects future mesh selection", "Dispatcher throughput stays bounded under a synthetic one-bad-peer fanout storm"], "validation": ["cargo test -p saorsa-gossip-pubsub --lib outbound_budget", "Synthetic transport test: many messages to one blocked peer never exceed one or configured N in-flight sends", "Synthetic mixed-peer test: one blocked peer does not slow sends to healthy peers", "VPS soak: per-peer timeout tail remains flat and worker target trends down under quiet traffic"], "links": [{"kind": "source", "url": "https://raw.githubusercontent.com/ethereum/consensus-specs/dev/specs/phase0/p2p-interface.md", "note": "Primary Ethereum consensus p2p spec: production Gossipsub profile and queue/validation expectations"}, {"kind": "source", "url": "https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md", "note": "Primary Gossipsub v1.1 spec: score and mesh controls assume bounded local work"}, {"kind": "code", "url": "../saorsa-gossip/crates/pubsub/src/lib.rs:1294", "note": "parallel_send_to_peers currently spawns one task per target peer per message"}, {"kind": "ticket", "url": "X0X-0012", "note": "Complements single-probe recovery by limiting initial and burst-time outbound exposure"}], "handoff": {"summary": "Implemented per-peer outbound PubSub concurrency budgeting in saorsa-gossip 2d820b6 / v0.5.28 and consumed it in x0x commit a1982ee. Outbound send work is now constrained per peer so one slow target cannot consume arbitrary fanout capacity before cooling/scoring reacts.", "files_changed": ["../saorsa-gossip/CHANGELOG.md", "../saorsa-gossip/Cargo.toml", "../saorsa-gossip/crates/pubsub/src/lib.rs", "Cargo.toml"], "validation": [{"command": "cargo test -p saorsa-gossip-pubsub --lib outbound_budget", "status": "passed"}, {"command": "cargo test -p saorsa-gossip-pubsub --lib cooling", "status": "passed"}, {"command": "cargo test --workspace --all-features", "status": "passed in saorsa-gossip before v0.5.28 release"}, {"command": "cargo clippy --workspace --all-features -- -D warnings -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used", "status": "passed in saorsa-gossip before v0.5.28 release"}, {"command": "cargo test --all-features", "status": "passed in x0x after consuming v0.5.28"}, {"command": "0.5.30 VPS matrix proof", "status": "2026-05-03 17:24Z post-deploy of saorsa-gossip 0.5.30 (x0x commit 149f069): Phase A 30/30 × 3 consecutive runs, Phase B 59/59 incl. 5-runner roster + 5 cached replies on anchor; all 6 VPS nodes show dispatcher.pubsub.timed_out=0, recv_pump.pubsub.dropped_full=0; per-peer republish timeouts capped (cluster total ≈78 events over 5-min Phase B window) — all absorbed by cooling, none escalated to dispatcher timeouts. Active suppressions 24-85 per node, peer scoring demoted 229-438 (peer,topic) pairs to LAZY, outbound budget exhaustion regulating 9.7k-23k events per node."}], "follow_up": ["Live soak should confirm one slow peer no longer drives worker saturation or timeout cascades.", "Future tuning can split data/control budgets if diagnostics show control traffic starves behind data traffic."]}}
{"id": "X0X-0015", "identifier": "X0X-0015", "title": "Large-userbase gossip readiness harness and launch SLO gates", "description": "## Why\nThe current six-node VPS soak is the right proof for X0X-0010, but it is not proof of very-large-userbase readiness. SOTA confidence comes from testing churn, restart storms, slow/stale peers, asymmetric RTT, burst fanout, queue pressure, and partial partitions before broad launch.\n\nThe clean probe sequence means the urgent degradation loop is probably fixed. This ticket turns the remaining concern into measurable launch gates rather than more speculative hot-path changes.\n\n## What\nBuild a repeatable launch-readiness harness and SLO report for gossip/pubsub. It should run local synthetic scenarios and VPS scenarios with injected slow peers, stale peers, external non-bootstrap peers, delayed readers, high RTT, coordinated restarts, and fanout bursts. Produce a simple go/no-go report from diagnostics counters.", "priority": 2, "state": "review", "branch_name": null, "url": null, "labels": ["gossip", "testing", "launch-readiness", "sota", "vps-bootstrap"], "blocked_by": [], "created_at": "2026-05-03T10:26:56Z", "updated_at": "2026-05-03T17:42:00Z", "acceptance": ["Harness covers at least: one bad external peer, multiple bad peers, high RTT peer, stale peer, restart storm, fanout burst, and partial partition recovery", "Report includes Phase A delivery, dropped_full ratio, dispatcher.pubsub.timed_out delta, per-peer timeout rate, suppressed_peers size, worker target, queue depth, and recovery time", "Launch SLOs are explicit: sustained drops below threshold, dispatcher timeouts flat or below threshold, Phase A 30/30, no unbounded suppressed set, no operator restart required", "Harness artifacts are saved under proofs/ with raw diagnostics and summarized CSV/Markdown", "A broad-launch gate is documented separately from the limited-production gate"], "validation": ["python3 tests/e2e_vps_mesh.py scenario extensions compile and run", "Local synthetic harness can deterministically inject slow/stale peers without real VPS access", "VPS dry run completes and writes proofs/<run-id>/summary.md", "At least one 24h run passes the launch SLOs before marking this ticket review"], "links": [{"kind": "source", "url": "https://asc.di.fct.unl.pt/~jleitao/pdf/HyParView.pdf", "note": "Primary HyParView paper: membership churn and active/passive view resilience assumptions"}, {"kind": "source", "url": "https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf", "note": "Primary PlumTree paper: repair and eager/lazy dissemination assumptions"}, {"kind": "source", "url": "https://raw.githubusercontent.com/ethereum/consensus-specs/dev/specs/phase0/p2p-interface.md", "note": "Primary production Gossipsub profile used by Ethereum consensus clients"}, {"kind": "ticket", "url": "X0X-0010", "note": "Current clean soak is the limited-production proof, not the very-large-userbase proof"}], "handoff": {"summary": "Built launch-readiness harness scaffold (tests/launch_readiness.py) with two SLO gates (limited-production, broad-launch), three scenarios (baseline, fanout_burst, restart_storm — last is opt-in via --allow-restart-storm), and per-node diagnostics deltas with go/no-go report. 2026-05-03 17:40Z verified end-to-end against the live VPS mesh on x0x 149f069 / saorsa-gossip 0.5.30: both gates GO with baseline 30/30 + 100-msg fanout burst, dispatcher.timed_out=0 / dropped_full=0 cluster-wide, cluster total 29 per-peer republish timeouts (well under either gate). Documented limited-production gate (5 disp_to / 200 pp_to / 200 suppressed bounds) and broad-launch gate (0 disp_to / 50 pp_to / 100 suppressed bounds + 24h soak + partition-recovery + reviewer sign-off requirements). Deferred follow-on work: 24h soak (broad-launch evidence requirement), netem high-RTT scenario, iptables partition-recovery scenario, hostile-peer track. The 24h soak is a simple cron of the baseline scenario every 30 min for 48 windows — does not need new code.", "files_changed": ["tests/launch_readiness.py", "docs/launch-gates/limited-production.md", "docs/launch-gates/broad-launch.md", "issues/issues.jsonl"], "validation": [{"command": "python3 tests/launch_readiness.py --gate limited-production --scenarios baseline", "status": "passed (Phase A 30/30, all SLO counters at 0)"}, {"command": "python3 tests/launch_readiness.py --gate limited-production --scenarios baseline,fanout_burst --burst-messages 100", "status": "passed (verdict GO; report at proofs/launch-readiness-20260503T163432Z/summary.md)"}, {"command": "python3 tests/launch_readiness.py --gate broad-launch --scenarios baseline,fanout_burst --burst-messages 100", "status": "passed (verdict GO; report at proofs/launch-readiness-20260503T163735Z/summary.md)"}], "next_steps": ["Run baseline scenario hourly for 24h via cron + diff against this snapshot — feeds the broad-launch 24h soak evidence requirement.", "Wire restart_storm into the broad-launch run before the next bootstrap upgrade — opt-in is intentional, but needs at least one execution to populate evidence.", "Add netem-based high_rtt_peer scenario as a follow-on ticket (X0X-0016) once we agree which non-production VPS to use as the netem target.", "Consider running the harness from a different anchor (helsinki, sydney) to get cross-region viewpoint diversity."]}}