x0x 0.19.18

Agent-to-agent gossip network for AI systems — no winners, no losers, just cooperation
{"id":"X0X-0001","identifier":"X0X-0001","title":"Bootstrap non-Linear Symphony workflow for x0x","description":"Create the repo-owned WORKFLOW.md and git-committed issue database scaffold used by the first x0x-symphony runner prototype. This intentionally avoids Linear and prepares for a future x0x CRDT-backed tracker adapter.","priority":2,"state":"review","branch_name":null,"url":null,"labels":["x0x-symphony","workflow","tracker-git"],"blocked_by":[],"created_at":"2026-04-28T00:00:00Z","updated_at":"2026-04-28T00:00:00Z","links":[{"kind":"design","url":"../x0x-symphony/docs/design/symphony.md","note":"Authoritative architecture for x0x-symphony"},{"kind":"adr","url":"../x0x-symphony/docs/adr/0001-tracker-abstraction.md"},{"kind":"adr","url":"../x0x-symphony/docs/adr/0002-sharded-claim-ttl.md"},{"kind":"adr","url":"../x0x-symphony/docs/adr/0003-no-external-tracker-v1.md"},{"kind":"adr","url":"../x0x-symphony/docs/adr/0004-x0x-tasklist-as-backbone.md"}],"acceptance":["WORKFLOW.md exists at the repository root","Workflow uses tracker.kind=git_issues instead of Linear","issues/issues.jsonl exists and contains machine-readable records","issues/schema.md documents states, fields, and future x0x mapping"],"validation":["Review WORKFLOW.md front matter and prompt for consistency","Review issues/schema.md and issues/issues.jsonl for JSONL validity"],"handoff":{"summary":"Initial non-Linear Symphony workflow and git issue database scaffold created for x0x. Open architectural questions in the original handoff are now answered in the sibling x0x-symphony repo: GitHub adapter is rejected (ADR-0003), JSONL\u2192CRDT mapping is locked (ADR-0004), and tracker abstraction is fixed (ADR-0001).","files_changed":["WORKFLOW.md","issues/README.md","issues/schema.md","issues/issues.jsonl"],"validation":[{"command":"python3 - <<'PY'\nimport json, pathlib\nfor line in pathlib.Path('issues/issues.jsonl').read_text().splitlines():\n    if line.strip():\n        json.loads(line)\nPY","status":"passed"}],"follow_up":["Architecture decisions are now locked in ../x0x-symphony/docs/adr/0001..0004.","WORKFLOW.md updated to use the harness-agnostic runner: block; legacy codex: block kept for compatibility and slated for deprecation in M4.","issues/schema.md extended with shard and claim fields used by x0x-symphony's M2.","M1 implementation issues live in ../x0x-symphony/issues/issues.jsonl as XSY-0002..XSY-0008."]}}
{"id":"X0X-0002","identifier":"X0X-0002","title":"Self-DM short-circuit in send_direct_with_config","description":"## Symptom\nWhen `/direct/send` is called with `agent_id == self.agent_id`, the daemon returns `{\"error\":\"peer_disconnected\",\"detail\":\"closed: ReaderExit\"}`. Reproduced live on nyc bootstrap (saorsa-2) by issuing `POST /direct/send` with the daemon's own agent_id as recipient.\n\n## Root cause\n`Agent::send_direct_with_config` (`src/lib.rs:2828`) has no self-DM short-circuit. For self as recipient:\n- `capability_store.lookup(to)` returns `None` (a daemon does not advertise capabilities to itself), so `gossip_ok = false`.\n- `prefer_raw_quic_if_connected: false` (new default) skips the preferred-raw branch, so `preferred_raw_err = None` and `preferred_raw_receipt = None`.\n- Dispatch falls through to the final else branch which calls `send_direct_raw_quic(self, ...)`.\n- ant-quic has no self-connection \u2014 returns `peer_disconnected: ReaderExit`.\n\nPre-existing behaviour (raw-first default) hit the same dead end via a different code path. This is not a regression introduced by the second-pass patch \u2014 but it was exposed by the new Phase A harness pattern in `tests/e2e_vps_mesh.py` where the anchor is also one of the runners and result envelopes from the anchor's runner are addressed to the orchestrator (= the anchor's own agent_id).\n\n## Evidence\n- VPS deploy run 2026-05-01T20:13:55Z (commit 2a9949a + working-tree second-pass patch); rerun 2026-05-01T20:23:29Z.\n- nyc anchor `journalctl -u x0x-test-runner.service` shows repeated `WARNING runner[nyc] DM result to da2233d6ba2f9569\u2026 failed, falling back to pubsub` \u2014 one per nyc-originated send_result. Each retry path is 3\u00d7 attempts at `PUBLISH_RETRY_BACKOFF_SECS * attempt`, so this serializes nyc's results behind the fallback, increasing the chance of a settle-window miss.\n- `python3 tests/e2e_vps_mesh.py --anchor nyc` reported `Sent: 29/30, Received: 30/30, Send fails: 1` with the missed pair `nyc-singapore` \u2014 destination delivered, only the source's confirmation envelope went missing because legacy pubsub fallback is more lossy than the primary DM path.\n\n## Fix\nShort-circuit at the top of `send_direct_with_config`: if `to == self.identity.agent_id()`, deliver the payload directly to the local direct event bus (the same path `recv_direct_annotated` consumes) without going through the network stack. Construct a `DmReceipt` with `path = DmPath::Loopback` (new variant) so callers can distinguish.\n\nTouchpoints:\n- `src/dm.rs` \u2014 add `DmPath::Loopback` variant.\n- `src/lib.rs:2828` \u2014 add the short-circuit before the rtt_hint/capability lookup.\n- `src/direct.rs` \u2014 expose a fast-path enqueue API onto the direct event channel.\n- `src/dm_send.rs` \u2014 receipt helper for the loopback path.\n\n## Why now\nThe Phase A all-pairs harness will keep flaking on whichever node is the anchor until this is fixed. 
Any external client that runs both the daemon and an agent in the same process and addresses self for diagnostic / loopback messaging hits the same wall.","priority":2,"state":"review","branch_name":null,"url":null,"labels":["dm","transport","regression-mask","vps-bootstrap"],"blocked_by":[],"created_at":"2026-05-01T20:35:00Z","updated_at":"2026-05-01T21:02:36Z","acceptance":["POST /direct/send with self agent_id returns 200 ok with a receipt whose path is the new Loopback variant","Recipient's /direct/events SSE stream emits the message envelope identically to a remote DM","tests/runners/x0x_test_runner.py self-DM result envelopes succeed without falling back to legacy pubsub","New unit/integration test in src/lib.rs `tests` module verifying self-DM (analogous to `connected_peer_clears_stale_lifecycle_block_before_raw_send`)","Phase A all-pairs matrix on 6-node VPS mesh: Sent == Received == 30/30 over 3 consecutive runs"],"validation":["cargo nextest run --all-features -E 'test(self_dm) | test(direct)'","python3 tests/e2e_vps_mesh.py --anchor nyc --discover-secs 45 --settle-secs 60 (3 consecutive clean runs)","ssh root@saorsa-2 'journalctl -u x0x-test-runner.service --since=<run-start> | grep -c \"falling back to pubsub\"' returns 0"],"links":[{"kind":"evidence","url":"see ticket description","note":"VPS deploy + Phase A run on 2026-05-01"},{"kind":"code","url":"src/lib.rs:2828","note":"send_direct_with_config dispatcher"},{"kind":"code","url":"src/lib.rs:2922","note":"fallthrough else branch that hits raw self-DM"}],"handoff":{"summary":"Added a true self-DM loopback path. send_direct_with_config now short-circuits self-addressed DMs before RTT/capability/offline checks, enqueues through DirectMessaging subscriber/internal delivery, returns DmPath::Loopback, and surfaces loopback in REST/direct diagnostics.","files_changed":["src/dm.rs","src/dm_send.rs","src/direct.rs","src/lib.rs","src/bin/x0xd.rs"],"validation":[{"command":"cargo nextest run --all-features -E 'test(self_dm) | test(direct)'","status":"passed"},{"command":"cargo nextest run --all-features -E 'test(self_dm) | test(warn_forward_channel_pressure) | test(recv_pump)'","status":"passed"},{"command":"just fmt-check","status":"passed"},{"command":"just lint","status":"passed"},{"command":"just test","status":"passed"},{"command":"python3 tests/e2e_vps_mesh.py --anchor nyc --discover-secs 45 --settle-secs 60 (3 runs)","status":"not_run","note":"requires live VPS deployment/mesh window"}],"follow_up":["Run the 3 consecutive VPS Phase A mesh checks from the ticket before closing as done."]}}
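The fix described in X0X-0002 is a loopback short-circuit: detect a self-addressed DM before any rtt_hint/capability/raw-QUIC dispatch, enqueue the envelope on the local direct event channel, and return a receipt whose path is the new `DmPath::Loopback` variant. The following self-contained sketch shows that shape with the real transport stubbed out; the type definitions, the synchronous mpsc stand-in for the direct event bus, and the `send_direct` signature are illustrative assumptions, not the x0x API.

```rust
// Sketch of the self-DM short-circuit. Everything here is a stand-in:
// the real daemon enqueues onto the channel that recv_direct_annotated
// drains, and the remote branch is the existing gossip/raw-QUIC dispatch.
use std::sync::mpsc;

type AgentId = [u8; 8];

#[derive(Debug, PartialEq)]
enum DmPath {
    Loopback, // new variant: self-addressed DM delivered in-process
    RawQuic,  // existing remote path (stubbed out below)
}

#[derive(Debug)]
struct DmReceipt {
    path: DmPath,
}

struct Agent {
    agent_id: AgentId,
    // Stand-in for the direct event bus the local subscriber consumes.
    direct_tx: mpsc::Sender<(AgentId, Vec<u8>)>,
}

impl Agent {
    fn send_direct(&self, to: AgentId, payload: Vec<u8>) -> Result<DmReceipt, String> {
        // Short-circuit before any rtt_hint / capability / raw-QUIC logic:
        // a self-addressed DM never reaches ant-quic, so it cannot fail
        // with `peer_disconnected: ReaderExit`.
        if to == self.agent_id {
            self.direct_tx
                .send((to, payload))
                .map_err(|e| e.to_string())?;
            return Ok(DmReceipt { path: DmPath::Loopback });
        }
        // Remote dispatch elided; in the real code this is the existing
        // gossip / raw-QUIC decision tree.
        Ok(DmReceipt { path: DmPath::RawQuic })
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let me: AgentId = *b"selfnode";
    let agent = Agent { agent_id: me, direct_tx: tx };

    let receipt = agent.send_direct(me, b"result envelope".to_vec()).unwrap();
    assert_eq!(receipt.path, DmPath::Loopback);

    // The local subscriber sees the envelope just as it would a remote DM.
    let (from, body) = rx.recv().unwrap();
    assert_eq!(from, me);
    assert_eq!(body, b"result envelope".to_vec());
}
```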
{"id":"X0X-0003","identifier":"X0X-0003","title":"INFO trend signal in warn_forward_channel_pressure misses production saturation pattern","description":"## Symptom\nProduction saturation of `recv_pubsub_tx` on VPS bootstrap nodes consistently triggers the >80% WARN log but never triggers the >50% INFO trend signal. Across a 4-min Phase A run on the 6-node VPS mesh: 37 WARN events, 0 INFO events.\n\n## Root cause\n`warn_forward_channel_pressure` in `src/network.rs:223` gates the INFO branch on:\n\n```rust\nlet bucket = (max / 10).max(1);\nif used.is_multiple_of(bucket) {\n    info!(...)\n}\n```\n\nWith `max = 10000`, `bucket = 1000`, so INFO only fires when `used` lands exactly on 5000, 6000, 7000, 8000, or 9000 at the moment a forward call samples it. The actual production saturation pattern jumps from low-usage to `used = 9999..10000` between two consecutive forward calls (per-peer channel fills inside one send burst), so `used` never lands on the 1000-multiple boundaries during the climb. The INFO branch is dead code under real load.\n\n## Evidence\n- VPS deploy run 2026-05-01T20:13:55Z (commit 2a9949a + working-tree second-pass patch); rerun 2026-05-01T20:23:29Z.\n- Per-node WARN counts (>80%): nyc=2, sfo=4, helsinki=0, nuremberg=0, singapore=10, sydney=21. Per-node INFO counts (>50%): all 0.\n- All WARN entries report `used = 9999` or `used = 10000` (used_pct = 99 or 100). No WARN entry has `used` between 5000 and 9000.\n\n## Fix options\n1. **Time-rate-limited sampling** (recommended). Track per-channel `last_info_at: Instant` (e.g., on `NetworkNode` itself or in a `OnceLock<Mutex<HashMap<&'static str, Instant>>>` keyed on channel_name). Emit INFO when `used > max/2 && now - last_info_at > Duration::from_secs(30)`. Caps log volume to N events per channel per run.\n2. **Threshold-edge sampling**. Track per-channel `last_used_pct: AtomicUsize` and emit INFO when crossing into a higher 10% bucket (50\u219260, 60\u219270, etc.). Captures the climb shape but spammy on oscillation.\n3. **Sampled probabilistic** \u2014 emit INFO with probability `(used_pct - 50) / 50` once above 50%. Cheap, no state, but produces dust at low-pressure thresholds.\n\nOption 1 is the right shape for the operator audience: rare, deterministic, contains trend information.\n\n## Why this matters\nWithout an early signal the operator only learns about queue pressure when it is already at saturation \u2014 same blind spot the WARN was supposed to address but one threshold lower. 
The current INFO branch is dead code that gives a false sense of graduated observability.","priority":3,"state":"review","branch_name":null,"url":null,"labels":["observability","network","bug"],"blocked_by":[],"created_at":"2026-05-01T20:35:00Z","updated_at":"2026-05-01T21:02:36Z","acceptance":["Synthetic local stress test that climbs `recv_pubsub_tx` past max/2 emits at least one INFO trend event before saturation","Same VPS Phase A run that produced 37 WARNs and 0 INFOs now produces non-zero INFO events on the same nodes","INFO event volume per channel per run is bounded (no more than ~10 INFOs per channel per minute under sustained pressure)","WARN >80% behaviour is unchanged"],"validation":["cargo test -p x0x --lib warn_forward_channel_pressure","bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000 then grep INFO + WARN counts in proofs/","bash tests/e2e_deploy.sh --mesh-verify then re-harvest with /tmp/harvest-vps-pressure.sh and confirm INFO > 0"],"links":[{"kind":"code","url":"src/network.rs:223","note":"warn_forward_channel_pressure helper"}],"handoff":{"summary":"Replaced exact bucket-boundary INFO sampling with deterministic per-channel/per-stream time-rate-limited sampling. INFO now fires on the first sample above 50%, including direct jumps to >80% saturation, while the existing >80% WARN condition remains unchanged.","files_changed":["src/network.rs"],"validation":[{"command":"cargo test -p x0x --lib warn_forward_channel_pressure","status":"passed"},{"command":"cargo nextest run --all-features -E 'test(self_dm) | test(warn_forward_channel_pressure) | test(recv_pump)'","status":"passed"},{"command":"just fmt-check","status":"passed"},{"command":"just lint","status":"passed"},{"command":"just test","status":"passed"},{"command":"VPS Phase A/B pressure re-harvest","status":"not_run","note":"requires live VPS deployment/mesh window"}],"follow_up":["After VPS deploy, confirm nodes with saturation WARNs now also produce non-zero >50% INFO trend events."]}}
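Fix option 1, which the X0X-0003 handoff reports was implemented as deterministic per-channel time-rate-limited sampling, replaces the exact bucket-boundary check with a rule of the form: emit INFO on the first sample above 50%, then at most once per interval per channel. Below is a minimal, dependency-free sketch of that policy assuming a 30-second interval; the `PressureSampler` type and its API are hypothetical, not what `warn_forward_channel_pressure` in `src/network.rs` actually looks like.

```rust
// Time-rate-limited trend sampling: fires on the first sample above 50%,
// including a jump straight to saturation, then at most once per interval.
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct PressureSampler {
    min_interval: Duration,
    last_info_at: Mutex<HashMap<&'static str, Instant>>,
}

impl PressureSampler {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_info_at: Mutex::new(HashMap::new()) }
    }

    /// Returns true when an INFO trend event should be emitted for this sample.
    fn should_emit_info(&self, channel: &'static str, used: usize, max: usize) -> bool {
        if used * 2 <= max {
            return false; // below the 50% trend threshold
        }
        let now = Instant::now();
        let mut last = self.last_info_at.lock().unwrap();
        let due = match last.get(channel) {
            Some(prev) => now.duration_since(*prev) >= self.min_interval,
            None => true, // first time above threshold on this channel
        };
        if due {
            last.insert(channel, now);
        }
        due
    }
}

fn main() {
    let sampler = PressureSampler::new(Duration::from_secs(30));
    // A burst that jumps straight from low usage to saturation produces exactly
    // one INFO here, unlike the old `used.is_multiple_of(bucket)` gate.
    for used in [100usize, 9999, 10_000, 10_000] {
        if sampler.should_emit_info("recv_pubsub_tx", used, 10_000) {
            println!("INFO recv_pubsub_tx trend: used={used}/10000");
        }
    }
}
```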
{"id":"X0X-0004","identifier":"X0X-0004","title":"Structural recv_pubsub_tx saturation on VPS bootstrap nodes \u2014 10\u00d7 buffer is mitigation, not fix","description":"## Symptom\nOn the 6-node VPS bootstrap mesh, `recv_pubsub_tx` saturates to `used_pct = 100` sustained for tens of seconds at a time on far-from-anchor nodes. Across a 4-min Phase A + Phase B run: 37 saturation WARNs distributed nyc=2, sfo=4, helsinki=0, nuremberg=0, singapore=10, sydney=21. The 1024 \u2192 10000 capacity bump merged in the second-pass patch (`src/network.rs:307`) does not prevent saturation \u2014 it raises the ceiling and delays the choke, but the underlying recv-pump throughput cannot keep pace with cross-region fanout under sustained gossip load.\n\n## Why this matters\nZero drops are observed (the `mpsc::Sender::send().await` back-pressures the producer rather than dropping), so on the surface the system is correct. But back-pressure propagates upstream into ant-quic's recv reader task, stalling the entire QUIC receive pipeline for the duration of the saturation. Concrete consequences:\n- Phase A `nyc-singapore` send_result envelope went missing (1/30 fail in receive matrix context) because the singapore daemon's recv pump was stalled on its 10\u00d710000 saturated queue at the moment the fallback pubsub publish arrived.\n- Any latency-sensitive control message (lease renewal for exec sessions, SWIM ping ack, presence beacon) on the same connection blocks behind the saturated channel.\n- Memory cost is now ~10\u00d7 per peer \u00d7 per stream-type (10000 \u00d7 payload-arc-overhead). On a bootstrap node with 7 peers \u00d7 4 stream types \u00d7 10K queue depth, that is ~280K queued messages of headroom \u2014 multi-MB to multi-GB depending on actual payload retention. Headroom we cannot drain.\n\n## Evidence\n- VPS deploy run 2026-05-01T20:13:55Z (commit 2a9949a + working-tree second-pass patch); rerun 2026-05-01T20:23:29Z.\n- VPS log harvest via `/tmp/harvest-vps-pressure.sh`: every saturation event reports `available=0..1, used=9999..10000, used_pct=99..100, channel=\"recv_pubsub_tx\", stream=Some(PubSub)`.\n- Geographic correlation: saturation rate ~ RTT to anchor. sydney (250 ms RTT to nyc): 21 events. singapore (220 ms): 10. sfo (70 ms): 4. helsinki/nuremberg (~110 ms via EU peering): 0. The slow consumer side is the long-RTT receiver, not the publisher.\n- The previous v0.18.3 fix bumped `NetworkNode::recv_tx` 128 \u2192 10000 to handle a different stall (PubSubManager subscriber lock + EAGER fan-out). That fix landed at the transport layer; this one is one layer up at the per-peer recv forward channel inside x0x. Same underlying shape: single-consumer mpsc that can't drain at fanout rate.\n\n## Investigation needed\nBefore picking a fix, instrument the actual choke point. Add diagnostics for:\n- Per-peer per-stream-type producer rate (`tx.send` calls/s).\n- Per-stream-type consumer drain rate (`rx.recv` calls/s, latency to drain).\n- Median + p99 dwell time inside the channel.\n- Subscriber count per topic and which subscriber is the slowest consumer (which is the real choke: gossip-pubsub subscribers fan out one mpsc per subscription downstream of this channel).\n\nHypothesis to validate: the choke is the single shared `recv_pubsub_rx` consumer task in `saorsa_gossip_transport`'s adapter \u2014 every received pubsub frame is decoded, ML-DSA-verified, and re-fanned-out to per-subscription mpsc channels by one task. 
Under fanout load (one msg \u2192 N subscribers \u00d7 per-sub mpsc(10000) sends), that single decode/verify/fanout loop is the rate limit.\n\n## Fix options (after instrumentation)\n1. **Parallelize the recv pump per stream-type or per peer**. Multiple decode/verify workers feeding off `recv_pubsub_rx`. Requires reshaping the saorsa-gossip adapter.\n2. **Drop-oldest under sustained pressure with a counter**. Convert to `try_send` with `Full(_) \u2192 drop and bump `recv_pubsub_dropped` atomic`. Expose drops via `/diagnostics/gossip`. Operator gets a real signal; pubsub reliability degrades gracefully under overload instead of stalling the whole transport.\n3. **Bound producer side by per-peer rate quota**. Reject pubsub frames from a peer whose channel is > 80% full for more than N seconds \u2014 surfaces as a peer-level signal (IHAVE retransmit later) instead of transport-level stall.\n4. **Increase per-subscription mpsc(10000) in saorsa_gossip_pubsub** if profiling shows that is the actual choke (likely contributes \u2014 subscriber bound to PubSubManager is the ultimate consumer).\n\nRecommended order: instrument first, then prototype option 2 (drop-oldest with counter) as the smallest change with the biggest signal-to-noise ratio. Option 1 is the right long-term shape but invasive.\n\n## Acceptance bar\nSame Phase A + Phase B VPS run produces no sustained `used_pct=100` for more than 5 consecutive seconds on any node, OR produces a non-zero drop counter that the operator can act on. The current state \u2014 silent stall masquerading as zero-drop correctness \u2014 is not acceptable for production.","priority":2,"state":"review","branch_name":null,"url":null,"labels":["network","performance","vps-bootstrap","structural"],"blocked_by":[],"created_at":"2026-05-01T20:35:00Z","updated_at":"2026-05-01T21:02:36Z","acceptance":["Per-peer per-stream-type producer/consumer rate metrics exposed on /diagnostics/gossip or new /diagnostics/recv_pump endpoint","Decision recorded in an ADR (drop-oldest vs parallel pump vs producer rate-quota) with profiling data backing it","Same Phase A + B VPS run sustains no recv_pubsub_tx saturation > 5s OR exposes a drop counter the operator can act on","WARN volume per node per minute drops by at least 80% on sydney (worst-case node in 2026-05-01 baseline)"],"validation":["Repeat /tmp/harvest-vps-pressure.sh after fix lands and compare WARN counts vs the 2026-05-01 baseline (nyc=2 sfo=4 helsinki=0 nuremberg=0 singapore=10 sydney=21)","bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000 with new diagnostics enabled, capture stress-report.json deltas","Memory RSS growth on saorsa-9 (sydney) over a 30-min sustained Phase A + Phase B loop stays within 2\u00d7 steady-state baseline"],"links":[{"kind":"code","url":"src/network.rs:307","note":"data_channel_capacity(10_000) bump"},{"kind":"code","url":"src/network.rs:283-294","note":"per-stream-type recv_*_tx mpsc senders"},{"kind":"code","url":"src/network.rs:223","note":"warn_forward_channel_pressure helper"},{"kind":"memory","url":"memory/x0x_v0_18_3_fanout_stall_fixed.md","note":"previous transport-layer recv_tx 128 \u2192 10000 bump"},{"kind":"blocked-by-prerequisite","url":"X0X-0003","note":"INFO trend fix is prerequisite for clean before/after telemetry"}],"handoff":{"summary":"Added receive-pump diagnostics under /diagnostics/gossip.recv_pump and implemented the first overload mitigation: PubSub forwarding now uses try_send, increments visible full-drop counters instead of stalling ant-quic receive 
draining, while Membership/Bulk retain blocking sends. ADR 0009 records the decision and baseline evidence.","files_changed":["src/network.rs","src/lib.rs","src/bin/x0xd.rs","docs/adr/0009-recv-pump-overload-policy.md","docs/adr/README.md"],"validation":[{"command":"cargo test -p x0x --lib recv_pump","status":"passed"},{"command":"cargo nextest run --all-features -E 'test(self_dm) | test(warn_forward_channel_pressure) | test(recv_pump)'","status":"passed"},{"command":"just fmt-check","status":"passed"},{"command":"just lint","status":"passed"},{"command":"just test","status":"passed"},{"command":"bash tests/e2e_stress_gossip.sh --nodes 5 --messages 5000","status":"not_run","note":"not run in this pass; full local/VPS stress proof still required"},{"command":"bash tests/e2e_deploy.sh --mesh-verify and VPS pressure harvest","status":"not_run","note":"requires live VPS deployment/mesh window"}],"follow_up":["Run stress and VPS Phase A+B proof loops to compare recv_pump.pubsub.dropped_full and WARN counts against the 2026-05-01 baseline.","If PubSub drops are unacceptable, prototype parallel PubSub decode/verify/fanout workers as described in ADR 0009."]}}
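The mitigation recorded in the X0X-0004 handoff, fix option 2 applied to the PubSub stream only, swaps the awaited `send` for `try_send` so a full channel increments a visible drop counter (reported as `recv_pump.pubsub.dropped_full` under `/diagnostics/gossip`) instead of back-pressuring the QUIC reader. A sketch of that policy follows; it assumes a tokio mpsc channel, and the `PubSubForwarder` type and counter wiring are illustrative rather than the actual `src/network.rs` change.

```rust
// Drop-with-counter overload policy for PubSub forwarding: never block the
// receive pump; count frames dropped on a full channel. Types here are
// stand-ins, not the real x0x forwarding path.
use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

struct PubSubForwarder {
    tx: mpsc::Sender<Vec<u8>>,
    dropped_full: AtomicU64, // surfaced to the operator via diagnostics in the real daemon
}

impl PubSubForwarder {
    /// Forward a received pubsub frame without ever blocking the receive pump.
    fn forward(&self, frame: Vec<u8>) {
        match self.tx.try_send(frame) {
            Ok(()) => {}
            Err(mpsc::error::TrySendError::Full(_)) => {
                // Channel saturated: drop this frame and count it, rather than
                // stalling ant-quic's reader behind a full queue.
                self.dropped_full.fetch_add(1, Ordering::Relaxed);
            }
            Err(mpsc::error::TrySendError::Closed(_)) => {
                // Consumer gone; nothing useful to do in this sketch.
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(2);
    let fwd = PubSubForwarder { tx, dropped_full: AtomicU64::new(0) };

    // Three frames into a 2-slot channel with no consumer running yet:
    // the third is dropped and counted instead of blocking.
    for i in 0..3u8 {
        fwd.forward(vec![i]);
    }
    assert_eq!(fwd.dropped_full.load(Ordering::Relaxed), 1);

    // Drain the frames that made it through.
    for _ in 0..2 {
        let frame = rx.recv().await.expect("frame");
        println!("delivered frame {:?}", frame);
    }
}
```

Per the handoff, only the PubSub stream adopts this drop-and-count path; Membership and Bulk forwarding keep the blocking send, with the rationale captured in ADR 0009.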