# Implementation Log
## Snapshot (2026-02-25)
Repository state at start of this log:
- Git repo initialized in `/workspace/spargio`
- Initial implementation committed as:
- `59d0b34` (`Implement sharded msg-ring-style runtime with TDD tests and benchmarks`)
## Completed So Far
### Design docs
- Added runtime design options:
- `DESIGN_OPTIONS.md`
### Runtime crate
- Created crate:
- `spargio`
- Implemented a sharded runtime with:
- `RuntimeBuilder`, `Runtime`, `ShardCtx`, `RemoteShard`
- `spawn_on` and `spawn_local`
- `send_raw` and typed `send` via `RingMsg`
- `next_event` event stream (`Event::RingMsg`)
- sender completion tickets (`SendTicket`)
Current backend in this snapshot:
- In-process queue-based message transport (useful as baseline/fallback and for comparative benchmarking).
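A minimal usage sketch of this surface, assuming a two-shard build and a closure-based `spawn_on`; everything beyond the type and method names listed above (the shard-count setter, the `remote(..)` accessor, the event-loop shape, error handling) is an assumption for illustration:

```rust
// Hypothetical usage sketch; only the names above come from the crate.
let rt = RuntimeBuilder::new().shards(2).build()?;

// Shard 1: wait for one cross-shard message on the event stream.
rt.spawn_on(1, |ctx: ShardCtx| async move {
    if let Event::RingMsg { from, tag, val } = ctx.next_event().await {
        println!("shard 1 got tag={tag} val={val} from shard {from}");
    }
});

// Shard 0: raw send to shard 1; the SendTicket resolves once the
// transport has accepted/completed the send.
rt.spawn_on(0, |ctx: ShardCtx| async move {
    let ticket: SendTicket = ctx.remote(1).send_raw(7, 42);
    ticket.await;
});
```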
### TDD tests
- Added API/behavior tests in `tests/runtime_tdd.rs`:
- local spawn runs on shard
- raw send delivers to target with sender shard id
- typed send round-trips through event path
Workflow used:
- Red: tests failed on placeholder API
- Green: implemented runtime until tests passed
### Benchmarks
- Added Criterion benchmark:
- `benches/ping_pong.rs`
- Includes:
- runtime ping-pong
- simple Tokio baseline
- simple Glommio baseline (feature-gated)
Feature:
- `glommio-bench` enables Glommio benchmark code path on Linux.
## Validation Results
Executed and passing:
- `cargo test`
- `cargo bench --no-run`
- `cargo bench --no-run --features glommio-bench`
Short benchmark sample run completed:
- `spargio`: ~1.62 ms (sample config)
- `tokio_unbounded_channel`: ~1.53 ms (sample config)
- `glommio_simple`: ~3.77-4.47 ms (with `glommio-bench`)
Note:
- These are quick smoke numbers, not stable performance conclusions.
## Next Work (Requested)
- Add a Linux `io_uring` backend that uses `msg_ring` for cross-shard delivery.
- Keep current queue backend for comparative benchmarks and fallback behavior.
- Preserve existing API so both backends can be measured under similar workloads.
## Update: Linux io_uring Backend Added
Implemented after the snapshot above:
- Added runtime backend selector:
- `BackendKind::Queue`
- `BackendKind::IoUring`
- Added builder controls:
- `RuntimeBuilder::backend(BackendKind)`
- `RuntimeBuilder::ring_entries(u32)`
- Default backend remains:
- `BackendKind::Queue`
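A builder sketch for the Linux path; only `backend(..)` and `ring_entries(..)` come from the controls above, the constructor and error handling are assumed:

```rust
// Sketch: opt into the Linux io_uring backend with a larger ring.
let rt = RuntimeBuilder::new()
    .backend(BackendKind::IoUring) // default remains BackendKind::Queue
    .ring_entries(1024)            // per-shard ring size (assumed semantics)
    .build()?;
```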
### Backend behavior
- Queue backend:
- existing in-process message transport path retained.
- io_uring backend (Linux):
- each shard owns an `IoUring` instance.
- `send_raw` issued from a shard thread is routed through the source shard ring using:
- `IORING_OP_MSG_RING` (`opcode::MsgRingData` via `io-uring` crate)
- target shard receives an event via ring completion and emits:
- `Event::RingMsg { from, tag, val }`
- sender ticket completion is tied to sender-ring completion CQE.
- External/non-shard callers:
- still supported using queue injection fallback (kept intentionally for safety and portability).
### Runtime loop adjustments
- Added backend-aware loop behavior:
- queue backend keeps timeout-driven idle wait.
- io_uring backend prefers busy polling (`yield_now`) to avoid the artificial millisecond-scale latency of timeout-driven idle waits.
### Tests
- Existing tests still pass.
- Added Linux-only backend test:
- `io_uring_backend_delivers_message`
- Full test status:
- `cargo test` passes.
### Benchmarks updated
- `benches/ping_pong.rs` now benchmarks:
- `spargio_queue`
- `spargio_io_uring` (only when backend init succeeds)
- `tokio_unbounded_channel`
- `glommio_simple` (with `glommio-bench` feature)
Validation:
- `cargo bench --no-run` passes
- `cargo bench --no-run --features glommio-bench` passes
Quick benchmark sample (short run config):
- `spargio_queue`: ~1.66-1.70 ms
- `spargio_io_uring`: ~0.60-0.72 ms
- `tokio_unbounded_channel`: ~1.49-1.58 ms
- `glommio_simple`: ~4.05-4.85 ms
## Update: Stricter Benchmark Suite
Implemented to improve comparability and isolate what is being measured:
- Switched to persistent harnesses for steady-state measurements.
- Added matched two-worker topology for baselines:
- Tokio: dedicated runtime thread, two-worker message loop.
- Glommio (`glommio-bench`): two executor threads with message channels.
- Added explicit benchmark groups:
- `steady_ping_pong_rtt`
- `steady_one_way_send_drain`
- `cold_start_ping_pong`
### Metric definitions
- `steady_ping_pong_rtt`:
- per-round request/ack round-trip latency over persistent workers.
- `steady_one_way_send_drain`:
- repeated one-way sends followed by a flush barrier ack.
- for `spargio`, this now uses a bounded send-ticket window (`SEND_WINDOW=64`) to avoid fully serial per-send awaiting while preserving backpressure.
- for Tokio/Glommio channel sends, send completion is synchronous enqueue.
- `cold_start_ping_pong`:
- includes harness/runtime construction and teardown each iteration.
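The bounded send-ticket window used in the `spargio` one-way harness follows the shape below. This is a sketch: `TAG_ONE_WAY` and the exact `send_raw` return shape are assumptions; the windowing logic is the point.

```rust
use std::collections::VecDeque;

const SEND_WINDOW: usize = 64;
const TAG_ONE_WAY: u32 = 1; // hypothetical tag

// Keep at most SEND_WINDOW send tickets outstanding, awaiting the
// oldest when the window is full, then drain before the flush barrier.
async fn one_way_windowed(remote: &RemoteShard, rounds: u32) {
    let mut in_flight: VecDeque<SendTicket> = VecDeque::new();
    for i in 0..rounds {
        if in_flight.len() == SEND_WINDOW {
            in_flight.pop_front().unwrap().await; // backpressure point
        }
        in_flight.push_back(remote.send_raw(TAG_ONE_WAY, i));
    }
    for ticket in in_flight {
        ticket.await; // drain remaining tickets before the barrier ack
    }
}
```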
### Safety constraints observed
- No machine-level or persistent system tuning performed.
- No CPU governor/turbo/IRQ/process-affinity changes applied.
- Benchmarks are runnable on standard developer machines.
### Validation
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
- Sample full run completed for non-Glommio path.
- Sample targeted run completed for Glommio path.
### Notes from latest tuning pass
- Updated runtime one-way harness from strict per-send await to windowed in-flight tickets.
- Targeted one-way io_uring sample improved from ~`1.44 ms` to ~`1.17 ms` under short Criterion settings.
## Update: Send Path Optimizations (Proceed Phase)
Implemented next optimization wave:
- Added no-ticket send APIs:
- `RemoteShard::send_raw_nowait(tag, val)`
- `RemoteShard::send_nowait(msg)`
- `ShardCtx::send_raw_nowait(target, tag, val)`
- Added shard-local fast path:
- local sends now enqueue into a local per-shard queue (`LocalCommand`) and no longer bounce through the shard command channel.
- Added io_uring batching:
- deferred `ring.submit()` with batched flush (`IOURING_SUBMIT_BATCH=64`)
- flush on poll/reap and on SQ pressure.
- Added io_uring no-ticket CQE suppression:
- uses `IORING_MSG_RING_CQE_SKIP` flag value for no-ticket `msg_ring` sends to avoid sender-CQ flooding.
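A conceptual sketch of the deferred-submit policy (not the actual driver code); it assumes the `io-uring` crate's `IoUring::submit()` and tracks pending SQEs with a simple counter:

```rust
use io_uring::IoUring;

const IOURING_SUBMIT_BATCH: usize = 64;

// SQEs are pushed without an immediate submit; the ring is flushed
// once per batch, and the poll/reap path also calls flush so queued
// entries never stall.
struct SubmitBatcher {
    pending: usize,
}

impl SubmitBatcher {
    fn after_sqe_push(&mut self, ring: &IoUring) -> std::io::Result<()> {
        self.pending += 1;
        if self.pending >= IOURING_SUBMIT_BATCH {
            self.flush(ring)?;
        }
        Ok(())
    }

    fn flush(&mut self, ring: &IoUring) -> std::io::Result<()> {
        if self.pending > 0 {
            ring.submit()?; // one io_uring_enter for the whole batch
            self.pending = 0;
        }
        Ok(())
    }
}
```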
### Benchmark harness alignment updates
- Runtime one-way benchmark now uses `send_raw_nowait` for fire-and-drain semantics.
- io_uring steady one-way harness uses larger ring entries (`4096`) to avoid CQ overflow in high-burst synthetic load.
- Cold-start io_uring path kept at default ring sizing to keep init broadly reliable on dev machines.
### Additional test coverage
- Added test:
- `send_raw_nowait_delivers_event`
### Current quick sample numbers (50ms warmup/50ms measure)
- `steady_ping_pong_rtt/spargio_queue`: ~`1.47-1.51 ms`
- `steady_ping_pong_rtt/spargio_io_uring`: ~`336-348 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.21-1.34 ms`
- `steady_one_way_send_drain/spargio_queue`: ~`1.25-1.27 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`232-234 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`69-71 us`
## Update: Fast-Path Checklist Pass (Current)
Requested optimization checklist from the prior analysis and status:
- Doorbell + payload queue batching for io_uring no-ticket sends:
- Implemented.
- No-ticket sends now enqueue payloads into per `(target, source)` shared queues and only emit a `msg_ring` doorbell when transitioning empty -> non-empty.
- `send_many_nowait` API:
- Implemented.
- Added:
- `RemoteShard::send_many_raw_nowait`
- `RemoteShard::send_many_nowait`
- `ShardCtx::send_many_raw_nowait`
- `ShardCtx::send_many_nowait`
- Explicit flush API:
- Implemented.
- Added:
- `ShardCtx::flush() -> SendTicket`
- `RemoteShard::flush() -> SendTicket` (no-op success outside shard context)
- io_uring implementation flushes pending submissions and uses a `NOP` completion barrier.
- Send waiter structure (`HashMap -> slab`):
- Implemented.
- Waiters are now stored in `Slab`, with completion `user_data` carrying slab index.
- Optional io_uring setup knobs (SQPOLL path):
- Implemented on Linux builder:
- `io_uring_sqpoll(Option<u32>)`
- `io_uring_sqpoll_cpu(Option<u32>)`
- `io_uring_single_issuer(bool)`
- `io_uring_coop_taskrun(bool)`
- EventState lock removal (`Mutex -> RefCell`):
- Not applied.
- Reason: current `spawn_on` API requires `Send` futures; making event state shard-local `Rc<RefCell<...>>` makes `NextEvent` non-`Send`, which breaks valid `spawn_on` usage.
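For the first checklist item, the empty -> non-empty doorbell policy reduces to the sketch below (std types only; `ring_doorbell()` stands in for the real `msg_ring` submission and is hypothetical):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Payloads go into a shared per-(target, source) queue; a doorbell is
// emitted only when the queue transitions from empty to non-empty, so
// later enqueues piggyback on the already-pending wakeup.
struct PayloadQueue {
    inner: Mutex<VecDeque<(u32, u64)>>, // (tag, val)
}

impl PayloadQueue {
    /// Returns true when the caller must ring the doorbell.
    fn push(&self, tag: u32, val: u64) -> bool {
        let mut q = self.inner.lock().unwrap();
        let was_empty = q.is_empty();
        q.push_back((tag, val));
        was_empty
    }
}

fn send_nowait(q: &PayloadQueue, tag: u32, val: u64) {
    if q.push(tag, val) {
        ring_doorbell(); // empty -> non-empty: one msg_ring wakeup
    }
}

fn ring_doorbell() {
    // Placeholder: submit one msg_ring SQE to the target shard (elided).
}
```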
### Correctness note on CQE suppression
- The previous pass set `IORING_MSG_RING_CQE_SKIP` on the assumption that it only suppressed sender-side completions; that flag actually suppresses the CQE posted to the target ring, i.e. receiver delivery.
- This pass corrects no-ticket suppression to use the SQE `SKIP_SUCCESS` flag, which elides the source CQE while preserving receiver delivery.
### Additional tests added
- `send_many_raw_nowait_delivers_in_order`
- `flush_completes_without_messages`
- `io_uring_send_many_nowait_delivers_messages`
### Validation
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
### Latest quick benchmark sample (50ms warmup/50ms measure)
- `steady_ping_pong_rtt/spargio_queue`: ~`1.36-1.39 ms`
- `steady_ping_pong_rtt/spargio_io_uring`: ~`365-370 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.23-1.31 ms`
- `steady_one_way_send_drain/spargio_queue`: ~`1.23-1.25 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`62.8-64.5 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`69.0-72.7 us`
- `cold_start_ping_pong/spargio_queue`: ~`2.39-2.40 ms`
- `cold_start_ping_pong/spargio_io_uring`: ~`255-276 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`453-484 us`
## Update: Tokio Batched One-Way Controls
To make the one-way comparison fairer, added additional Tokio benchmarks that batch payloads before crossing threads:
- `steady_one_way_send_drain/tokio_two_worker_batched_64`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`
Implementation notes:
- Added `TokioWire::OneWayBatch(Vec<u32>)`.
- Added `TokioCmd::OneWayBatched { rounds, batch, reply }`.
- Existing `tokio_two_worker` remains unchanged as the per-message baseline.
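The batched variants reduce to accumulating payloads and crossing the thread boundary once per `Vec`, roughly as in this sketch (harness plumbing such as the `reply` channel is omitted):

```rust
use tokio::sync::mpsc::UnboundedSender;

// One channel send (and thus one cross-thread wakeup) per batch
// instead of per message; the tail batch is flushed at the end.
fn send_one_way_batched(tx: &UnboundedSender<Vec<u32>>, rounds: u32, batch: usize) {
    let mut buf = Vec::with_capacity(batch);
    for i in 0..rounds {
        buf.push(i);
        if buf.len() == batch {
            let full = std::mem::replace(&mut buf, Vec::with_capacity(batch));
            tx.send(full).expect("receiver alive");
        }
    }
    if !buf.is_empty() {
        tx.send(buf).expect("receiver alive");
    }
}
```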
Quick sample (50ms warmup/50ms measure):
- `steady_one_way_send_drain/spargio_io_uring`: ~`64.2-65.4 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`83.7-96.0 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`23.3-25.3 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`14.9-15.7 us`
Interpretation:
- The previous Tokio gap was largely due to per-send cross-thread signaling overhead, not an inherent runtime scheduler limit.
- With batching, Tokio is substantially faster on this one-way synthetic workload.
## Update: Disk IO Benchmark (4K Read RTT)
Added a dedicated disk benchmark:
- New bench target:
- `benches/disk_io.rs`
- Cargo bench config:
- `[[bench]]` entry with `name = "disk_io"` and `harness = false`
### Benchmark shape
- Persistent fixture file:
- 16 MiB (`4096 * 4 KiB`) temp file under system temp dir.
- Metric:
- `disk_read_rtt_4k` (per-iteration round-trip for `256` 4 KiB reads).
- Compared paths:
- `tokio_two_worker_pread`
- two-worker Tokio runtime
- request/ack over Tokio unbounded channels
- worker performs `pread` (`FileExt::read_at`)
- `io_uring_msg_ring_two_ring_pread` (Linux)
- two rings (`client` + `worker`)
- request/ack over `IORING_OP_MSG_RING`
- worker performs `IORING_OP_READ` and replies via `msg_ring`
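For reference, the pread leg of the Tokio path boils down to a positional read like the following sketch (block indexing and buffer sizing here are illustrative):

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

// read_at / read_exact_at do not move the file cursor, so one shared
// File handle can serve concurrent 4 KiB reads at arbitrary offsets.
fn read_4k_block(file: &File, block: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 4096];
    file.read_exact_at(&mut buf, block * 4096)?;
    Ok(buf)
}
```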
### Quick sample (50ms warmup/50ms measure)
- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.71-1.91 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.64-3.09 ms`
### Notes
- This first disk RTT harness is not yet optimized for io_uring throughput; it is currently request/ack serialized and favors simplicity/debuggability.
- VFS work is still present for both paths; `io_uring` changes submission/completion mechanics, not filesystem lookup/permission/page-cache semantics.
## Update: Tokio Interop API Slice (TDD)
Started implementation toward the ADR with a first interop slice focused on submission APIs that can be called from Tokio tasks.
### Red phase
Added failing tests in `tests/tokio_compat_tdd.rs` for:
- `Runtime::handle()` availability.
- `RuntimeHandle::spawn_pinned(shard, fut)` execution on requested shard.
- `RuntimeHandle::spawn_stealable(fut)` round-robin placement.
- `RuntimeHandle` usage from Tokio tasks, including remote send + ticket await.
- `RuntimeHandle` cloneability and `Send + Sync`.
### Green phase
Implemented in `src/lib.rs`:
- New public `RuntimeHandle` (`Clone`, `Send + Sync`).
- `Runtime::handle() -> RuntimeHandle`.
- `RuntimeHandle` APIs:
- `backend()`
- `shard_count()`
- `remote(shard)`
- `spawn_pinned(shard, fut)`
- `spawn_stealable(fut)` (round-robin via `AtomicUsize`)
- Refactored spawn logic into shared helper:
- `spawn_on_shared(...)`
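A sketch of the handle in use from a Tokio task; method names come from the list above, while argument shapes and return types are assumptions:

```rust
// Hypothetical usage sketch from inside a Tokio task.
async fn drive_from_tokio(handle: RuntimeHandle) {
    // Pin work to shard 0; the return value is ignored in this sketch.
    let _ = handle.spawn_pinned(0, async move { /* shard-local work */ });

    // Let the runtime place the task (round-robin in this slice).
    let _ = handle.spawn_stealable(async move { /* movable work */ });

    // Remote send plus ticket await from outside the shard threads.
    handle.remote(1).send_raw(9, 99).await;
}
```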
Validation:
- `cargo test` passes (including new `tokio_compat_tdd` tests).
- `cargo bench --no-run` passes.
## Update: Tokio-Compat POLL_ADD Reactor Scaffold (TDD)
Implemented the first compatibility-reactor scaffold behind feature gating.
### Red phase
Added failing tests in `tests/tokio_poll_reactor_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:
- `PollReactor::register(..., PollInterest::Readable)` receives readable event.
- `PollReactor::deregister(token)` returns `NotFound` on second deregister.
- Token uniqueness across registrations.
### Green phase
Implemented new module in `src/lib.rs`:
- `tokio_compat` (Linux + feature gated):
- `PollReactor`
- `PollInterest`
- `PollToken`
- `PollEvent`
- `PollReactorError`
- Uses `IORING_OP_POLL_ADD` for registration and `IORING_OP_POLL_REMOVE` for deregistration.
- Includes minimal completion routing and internal completion tagging for deterministic deregister behavior.
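A sketch of the scaffold in use, mirroring the red-phase cases; a blocking `wait_one`, the event's `token` field, `PollToken` being `Copy`, and the constructor argument are assumptions:

```rust
// Hypothetical usage sketch of the poll-compat scaffold.
fn poll_once(read_fd: std::os::unix::io::RawFd) -> Result<(), PollReactorError> {
    let mut reactor = PollReactor::new(64)?; // ring entries (assumed)
    let token = reactor.register(read_fd, PollInterest::Readable)?; // POLL_ADD

    let event: PollEvent = reactor.wait_one()?; // blocks for a readiness CQE
    assert_eq!(event.token, token);

    reactor.deregister(token)?; // POLL_REMOVE
    // A second deregister of the same token reports NotFound.
    assert!(matches!(reactor.deregister(token), Err(PollReactorError::NotFound)));
    Ok(())
}
```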
Cargo feature updates (`Cargo.toml`):
- Added features:
- `tokio-compat`
- `uring-native`
- Added Linux dependency:
- `libc`
Validation:
- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Current Status: Tokio-Uring Alternative Scope
Snapshot of what is implemented vs remaining for the target architecture (`msg_ring` + poll-compat + work-stealing + native fast lane):
### Implemented
- Core `msg_ring` runtime and Linux `io_uring` backend.
- Tokio interop handle APIs:
- `Runtime::handle()`
- `spawn_pinned(...)`
- `spawn_stealable(...)` (current policy: round-robin placement).
- `tokio-compat` lane scaffold:
- `PollReactor` (`IORING_OP_POLL_ADD` / `IORING_OP_POLL_REMOVE`)
- async `TokioPollReactor`
- `TokioCompatLane` via `RuntimeHandle::tokio_compat_lane(...)`
- lane readiness helpers: `wait_readable(fd)`, `wait_writable(fd)`.
- Cancellation cleanup and active-token tracking for poll registrations.
- TDD coverage for all above in:
- `tokio_compat_tdd.rs`
- `tokio_poll_reactor_tdd.rs`
- `tokio_poll_async_tdd.rs`
- `tokio_runtime_lane_tdd.rs`
- `tokio_runtime_wait_tdd.rs`
### Remaining
- True work-stealing scheduler:
- per-worker deque + global injector + steal loop (not implemented yet).
- Submission-time stealing/placement policy for native I/O work (not implemented yet).
- Poll-compat path integrated into shard driver with `msg_ring` doorbells:
- current poll path uses dedicated reactor worker thread + command channel.
- `uring-native` fast lane:
- feature flag exists, but native async API surface is not implemented yet.
- Tokio-like compatibility wrappers (`AsyncRead`/`AsyncWrite`) are not implemented yet.
- Full stress/race suite for rearm/cancel/drop edge cases under load is not complete yet.
- Compat-vs-native and mixed-load stealing benchmark suite is not complete yet.
## Proposed Sequence: Functional Slices First
Priority order to ship usable slices earlier:
1. Compat ergonomics slice:
- stabilize `tokio-compat` lane ergonomics and add simple compatibility wrappers.
2. Native fast-lane MVP:
- add first `uring-native` read/write APIs with pinned submission.
3. Mixed-mode app slice:
- make compat and native lanes easy to combine in one app.
4. Submission-time placement policies:
- add `round_robin`, `sticky`, and explicit shard placement options.
5. True work-stealing scheduler:
- introduce per-worker deque + global injector + steal loop for stealable tasks.
6. Poll path re-home to shard driver:
- move poll processing into shard driver path with `msg_ring` wakeups.
7. Hardening and benchmark gate slice:
- race stress tests + mixed-load benchmark gates.
User stories unlocked after each slice:
1. After compat ergonomics:
- migrate Tokio readiness-style code with minimal rewrites.
2. After native fast-lane MVP:
- move only hot I/O paths to native `io_uring` APIs.
3. After mixed-mode:
- run compatibility code and native ops side by side.
4. After placement policies:
- control locality/load-balance at submission time.
5. After true work-stealing:
- auto-balance CPU/control tasks while keeping I/O ring-affine.
6. After poll re-home:
- reduce poll-path overhead without API changes.
7. After hardening/bench gates:
- rely on correctness/perf regression protection in CI.
## User Stories Already Possible
With current implementation, users can already:
1. Build and run a sharded runtime with queue or Linux `io_uring` backend.
2. Send typed/raw shard-to-shard messages and await sender tickets.
3. Use no-ticket batched message sends and explicit flush barriers.
4. Spawn pinned or round-robin stealable tasks from Tokio tasks via `RuntimeHandle`.
5. Create a `tokio-compat` lane and use poll registration (`POLL_ADD`/`POLL_REMOVE`) through:
- direct poll API (`register`, `wait_one`, `deregister`)
- lane helpers (`wait_readable`, `wait_writable`).
6. Cancel readiness waits without leaking poll registrations (covered by tests).
7. Benchmark message RTT/one-way/cold-start and run a first disk I/O RTT comparison harness.
## Update: Compat Ergonomics Slice (TDD)
Implemented the next functional slice aimed at easier migration ergonomics for readiness-style code.
### Red phase
Added failing tests in `tests/tokio_compat_fd_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:
- lane-scoped compatibility FD wrapper creation.
- wrapper `writable().await` and `readable().await` behavior.
- wrapper cloneability and FD identity access.
### Green phase
Implemented in `src/lib.rs`:
- New `CompatFd` type (`Clone`) under `tokio-compat`:
- stores `TokioCompatLane` + `RawFd`.
- New lane factory:
- `TokioCompatLane::compat_fd(fd) -> CompatFd`
- Wrapper methods:
- `fd()`
- `readable().await`
- `writable().await`
This reuses the lane's cancellation-safe wait logic and poll token cleanup.
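Sketch of the wrapper ergonomics; the fd is whatever readiness-driven descriptor the app already owns, and the return types of the readiness futures are assumptions (error handling elided):

```rust
// Hypothetical usage sketch of CompatFd.
async fn wait_then_read(lane: TokioCompatLane, fd: std::os::unix::io::RawFd) {
    let compat = lane.compat_fd(fd);
    let reader = compat.clone(); // Clone: share across tasks

    reader.readable().await;     // park until the fd is readable
    let _raw = reader.fd();      // hand back to the legacy read path
}
```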
Validation:
- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Update: Async Tokio Poll Wrapper (TDD)
Added a Tokio-usable async wrapper over the `POLL_ADD` scaffold to allow direct use from Tokio tasks.
### Red phase
Added failing tests in `tests/tokio_poll_async_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:
- async `wait_one()` returning readable events.
- async `deregister()` reporting `NotFound` on second remove.
### Green phase
Implemented in `src/lib.rs` (`tokio_compat` module):
- `TokioPollReactor` (`Clone`) wrapping `PollReactor` in `Arc<Mutex<_>>`.
- Methods:
- `new(entries)`
- `register(fd, interest)`
- `wait_one().await`
- `deregister(token).await`
- Async methods use `tokio::task::spawn_blocking` to execute blocking ring wait/remove logic safely off async worker threads.
Feature/dependency update:
- `tokio-compat` now enables optional Tokio dependency (`dep:tokio`).
Validation:
- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Update: Tokio Compat Lane via RuntimeHandle (TDD)
Integrated poll-compat usage into a runtime-lane API so Tokio tasks can use a single handle for both runtime operations and readiness waiting.
### Red phase
Added failing tests in `tests/tokio_runtime_lane_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:
- `RuntimeHandle::tokio_compat_lane(entries)` creation.
- Combined lane behavior:
- `spawn_pinned`
- `remote(...).send_raw(...).await`
- event receive path
- Poll API through lane:
- `register`
- async `wait_one`
### Green phase
Implemented in `src/lib.rs`:
- `RuntimeHandle::tokio_compat_lane(entries) -> Result<TokioCompatLane, PollReactorError>`
- New `TokioCompatLane` (`Clone`) with delegated runtime APIs:
- `backend`
- `shard_count`
- `remote`
- `spawn_pinned`
- `spawn_stealable`
- Lane poll APIs:
- `register`
- async `wait_one`
- async `deregister`
Validation:
- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Update: Lane Readiness Futures + Cancellation Cleanup (TDD)
Implemented lane-scoped readiness waits and fixed cancellation behavior.
### Red phase
Added failing tests in `tests/tokio_runtime_wait_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:
- `wait_writable(fd)` and `wait_readable(fd)` APIs through `TokioCompatLane`.
- cancellation cleanup:
- aborting `wait_readable` should not leak poll registrations.
### Green phase
Implemented in `src/lib.rs`:
- `TokioCompatLane` readiness methods:
- `wait_readable(fd).await`
- `wait_writable(fd).await`
- Drop cleanup guard for wait futures:
- best-effort deregistration on cancellation.
- Debug helper for validation:
- `debug_poll_registered_count()`.
Important fix during this slice:
- Reworked `TokioPollReactor` implementation from `spawn_blocking + Mutex<PollReactor>` to a dedicated worker-thread command loop.
- Reason:
- prior design could deadlock cleanup when aborted tasks left blocking waits holding the mutex.
- New design:
- command channel (`register` / `wait_one` / `deregister`)
- non-blocking waiter pump (`try_wait_one`) to keep deregistration responsive.
Additional reactor hardening:
- Track active poll tokens in `PollReactor`.
- Ignore stale completions for inactive tokens.
- Fast `NotFound` on deregister for unknown token.
Validation:
- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Recap: Requested Slice Sequence and Status (2026-02-26)
Per the requested "functional slices first" plan, the sequence and current status are:
1. Compat ergonomics slice: `completed`.
2. Native fast-lane MVP slice: `completed` (this update).
3. Mixed-mode app slice: `partially completed` (compat + native lanes both exist; additional app-level helpers still pending).
4. Submission-time placement policies: `not started`.
5. True work-stealing scheduler: `not started`.
6. Poll path re-home to shard driver + `msg_ring` wakeups: `not started`.
7. Hardening + benchmark gate slice: `in progress` (coverage exists, full stress/benchmark gates pending).
## Update: Compat Stream Wrappers (TDD)
Extended compat ergonomics with Tokio `AsyncRead`/`AsyncWrite` wrappers for easier migration from socket-like code.
### Red phase
Added failing tests:
- `tests/tokio_compat_stream_tdd.rs`
- `compat_stream_fd_reads_and_writes`
- `compat_stream_fd_pending_read_wakes_on_write`
- `tests/tokio_compat_stream_hardening_tdd.rs`
- `compat_fd_into_stream_reads_bytes`
- `compat_stream_reads_eof_as_zero`
- `lane_compat_stream_helper_wraps_asrawfd`
### Green phase
Implemented in `src/lib.rs` (Linux + `tokio-compat`):
- `CompatStreamFd` wrapper.
- `TokioCompatLane::compat_stream_fd(fd)`.
- `TokioCompatLane::compat_stream<T: AsRawFd>(&T)`.
- `CompatFd::into_stream()`.
- `AsyncRead`/`AsyncWrite` impls for `CompatStreamFd` using:
- nonblocking `libc::read`/`libc::write`
- lane readiness waits (`wait_readable`/`wait_writable`) on `WouldBlock`.
- helper utilities:
- `set_nonblocking(fd)`
- poll-error -> `std::io::Error` mapping.
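The nonblocking helper amounts to the usual `fcntl` dance; a sketch, assuming the helper's signature matches the name above:

```rust
use std::io;
use std::os::unix::io::RawFd;

// The stream wrapper relies on read/write returning EWOULDBLOCK so it
// can park on the lane readiness waits instead of blocking a thread.
fn set_nonblocking(fd: RawFd) -> io::Result<()> {
    // SAFETY: fcntl on a caller-provided, open file descriptor.
    unsafe {
        let flags = libc::fcntl(fd, libc::F_GETFL);
        if flags < 0 {
            return Err(io::Error::last_os_error());
        }
        if libc::fcntl(fd, libc::F_SETFL, flags | libc::O_NONBLOCK) < 0 {
            return Err(io::Error::last_os_error());
        }
    }
    Ok(())
}
```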
Validation:
- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Update: `uring-native` Fast-Lane MVP (TDD)
Implemented first native lane API for direct `io_uring` read/write-at operations with pinned shard submission.
### Red phase
Added failing tests in `tests/uring_native_tdd.rs` (`cfg(all(feature = "uring-native", target_os = "linux"))`):
- `uring_native_lane_requires_io_uring_backend`
- `uring_native_lane_reads_file_at_offset`
- `uring_native_lane_writes_file_at_offset`
### Green phase
Implemented in `src/lib.rs` (Linux + `uring-native`):
- `RuntimeHandle::uring_native_lane(shard) -> Result<UringNativeLane, RuntimeError>`.
- `UringNativeLane` API:
- `read_at(fd, offset, len).await -> io::Result<Vec<u8>>`
- `write_at(fd, offset, buf).await -> io::Result<usize>`
- `shard()`.
- `TokioCompatLane::uring_native_lane(shard)` bridge (when both `tokio-compat` and `uring-native` features are enabled).
- Native op command plumbing from shard tasks to backend.
- `IoUringDriver` native op tracking/completion with `IORING_OP_READ` and `IORING_OP_WRITE`.
- Completion demuxing for native op user-data and cleanup on shutdown/error paths.
Notes:
- Native lane currently uses pinned submission through shard-local command flow.
- Queue backend intentionally returns `UnsupportedBackend` for native lane creation.
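Sketch of the lane MVP in use; method names are from the API above, while buffer/offset parameter types, lane setup, and error handling are assumptions:

```rust
// Hypothetical usage sketch of the native fast lane.
async fn copy_block(handle: RuntimeHandle, fd: std::os::unix::io::RawFd) -> std::io::Result<()> {
    let lane = handle.uring_native_lane(0).expect("io_uring backend required");

    // IORING_OP_READ at offset 0, then IORING_OP_WRITE of the same
    // bytes at offset 4096, both submitted on the pinned shard.
    let data = lane.read_at(fd, 0, 4096).await?;
    let written = lane.write_at(fd, 4096, data).await?;
    assert_eq!(written, 4096);
    Ok(())
}
```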
Validation:
- `cargo test` passes.
- `cargo test --features tokio-compat` passes.
- `cargo test --features uring-native` passes.
- `cargo test --features "tokio-compat uring-native"` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
## Revised Task List: Value Proposition Execution (2026-02-26)
Revised priority list aligned to the current project premise.
Update:
- reordered for faster proof generation.
- benchmark evidence is moved near the front so we validate value earlier.
Core premise:
- deliver a differentiated `io_uring` runtime centered on `msg_ring`-based cross-shard coordination and work-stealing.
Not the core premise:
- broad Tokio drop-in compatibility across dependency internals.
### Slice 1: Compatibility De-Scoping
Goal:
- remove or deprecate `tokio-compat` paths as active project focus.
- retain only interop boundaries needed for mixed-mode deployment.
Done criteria:
- code/docs/feature flags no longer present `tokio-compat` as strategic direction.
- README + ADRs + crate feature docs reflect runtime-first focus.
Validation gate:
- `cargo test`
- `cargo test --features uring-native`
- `cargo bench --no-run`
### Slice 2: Benchmark MVP Harness (Early Proof)
Goal:
- add the first coordination-heavy benchmark harness early:
- intra-request fan-out/fan-in
- shard-skew scenarios
- mixed control/CPU + ring-affine I/O path.
Done criteria:
- reproducible harness exists and can run quickly on dev machines.
- first p50/p95/p99 + throughput-at-SLO snapshots are recorded.
Validation gate:
- benchmark smoke run in local workflow.
- `cargo bench --no-run` remains green.
### Slice 3: Placement Policy MVP
Goal:
- implement policy-driven submission placement needed by the benchmark:
- explicit shard
- sticky-key routing
- policy round-robin.
Done criteria:
- public APIs expose placement policy selection.
- deterministic tests verify routing behavior.
Validation gate:
- placement policy tests + no regression in existing send/flush tests.
### Slice 4: True Work-Stealing MVP
Goal:
- replace spawn-time round-robin with true stealing mechanics:
- per-worker deque
- global injector
- steal loop with cooperative budgeting.
Done criteria:
- stealable tasks move under load/skew.
- pinned/ring-affine tasks remain protected.
Validation gate:
- scheduler TDD for steal/no-steal invariants and skew behavior.
### Slice 5: Ring-Affine Native I/O Enforcement
Goal:
- make ring-affinity guarantees explicit in runtime state transitions.
Done criteria:
- in-flight native I/O cannot migrate across shards.
- cancellation and completion paths preserve ownership invariants.
Validation gate:
- race/cancel/drop tests for native I/O ownership safety.
### Slice 6: `msg_ring` Transport Hardening
Goal:
- harden coordination path under load:
- batching behavior
- doorbell policy
- SQ/CQ pressure handling.
Done criteria:
- overload behavior is well-defined and tested.
- transport metrics (drops/retries/backpressure) are surfaced.
Validation gate:
- stress tests with bounded memory and deterministic failure semantics.
### Slice 7: Mixed-Runtime Boundary API Hardening
Goal:
- define robust communication contracts between `spargio` and host runtimes (Tokio or others):
- bounded request/reply channels
- backpressure semantics
- cancellation and deadline propagation.
Done criteria:
- boundary API is explicit and documented.
- tests cover cancellation, timeout, and overload behavior.
Validation gate:
- boundary TDD suite (correctness + cancellation + overload).
- existing core tests remain green.
### Slice 8: Observability and Operator Signals
Goal:
- expose metrics and debug hooks needed for production tuning.
Candidate signals:
- per-shard queue depth
- steal rate
- doorbell rate
- pending native ops
- timeout/cancel counters.
Done criteria:
- metrics API and/or tracing events documented and test-covered.
Validation gate:
- instrumentation tests + low-overhead checks in benchmark runs.
### Slice 9: CI Regression Gates
Goal:
- lock in correctness and performance trajectory.
Done criteria:
- mandatory correctness suites for scheduler/transport/native I/O invariants.
- perf guardrails for critical benchmark scenarios.
Validation gate:
- CI blocks regressions on defined thresholds.
### Slice 10: Reference Mixed-Mode Service + Benchmark Expansion
Goal:
- provide a small reference app showing Tokio + `spargio` mixed-runtime usage:
- request fan-out into `spargio`
- aggregation and response path
- explicit cancellation/backpressure boundary.
- expand benchmark suite from MVP to release-grade scenarios and reporting.
Done criteria:
- runnable example with docs and benchmark entry point.
- linked from README as adoption blueprint.
- expanded benchmark scenarios tracked in log and docs.
Validation gate:
- example integration test + benchmark smoke pass.
## Update: Benchmark Review and Suite Refocus (2026-02-26)
Reviewed benchmark outputs against current value proposition (`io_uring` + `msg_ring` coordination + work-stealing trajectory), then refocused the suite.
### Latest quick benchmark sample (Criterion 50ms warmup / 50ms measure / 20 samples)
From `ping_pong`:
- `steady_ping_pong_rtt/spargio_io_uring`: ~`340-360 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.33-1.45 ms`
- `steady_ping_pong_rtt/spargio_queue`: ~`1.38-1.52 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`63-65 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`84-97 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`23-25 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`13-15 us`
- `cold_start_ping_pong/spargio_io_uring`: ~`255-288 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`505-593 us`
From `disk_io`:
- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.81-2.01 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.54-2.83 ms`
### Interpretation
- Current value is strongest in control-path/message-path microbenchmarks for `io_uring` backend (`steady_ping_pong_rtt`, unbatched `steady_one_way_send_drain`).
- Batched Tokio one-way is still faster in that synthetic path, so batching-sensitive comparisons remain context, not headline.
- Current serialized disk RTT harness does not yet demonstrate `spargio` advantage.
### Benchmark taxonomy update
Primary KPI direction (to add/expand next):
- coordination-heavy fan-out/fan-in benchmarks with skew and tail-latency focus.
Context / microbench (kept):
- `steady_ping_pong_rtt`
- `steady_one_way_send_drain`
De-emphasized for value-prop claims:
- `cold_start_ping_pong`
- `tokio_two_worker_batched_*` (useful context, not primary proof)
- current `disk_read_rtt_4k` harness (until reworked beyond strict serialized request/ack)
### Glommio benchmark removal decision
Decision:
- remove Glommio comparison path for now.
Reason:
- not currently aligned with primary proof objective and adds maintenance noise.
- current harness shape is not the target benchmark niche for `spargio`.
Changes applied:
- removed Glommio benchmark harness/code from `benches/ping_pong.rs`.
- removed `glommio` dependency and `glommio-bench` feature from `Cargo.toml`.
- removed `glommio-bench` mention from README feature list.
Validation:
- `cargo test` passes.
- `cargo test --features uring-native` passes.
- `cargo bench --no-run` passes.
## Update: Tokio-Compat Removal + Fanout/Fan-in Benchmark MVP (2026-02-26)
Applied the scope change to fully de-emphasize drop-in Tokio emulation and move proof work to coordination-heavy fan-out/fan-in benchmarks.
### Tokio-compat removal (code + tests)
Changes:
- removed `tokio-compat` feature flag from `Cargo.toml`.
- removed optional non-dev Tokio dependency from `[dependencies]`.
- removed all `tokio-compat` lane and poll-emulation code from `src/lib.rs`:
- deleted `tokio_compat` module.
- deleted `RuntimeHandle::tokio_compat_lane(...)`.
- deleted `TokioCompatLane`, `CompatFd`, `CompatStreamFd`, and associated helpers.
- removed compat-only TDD files:
- `tests/tokio_compat_fd_tdd.rs`
- `tests/tokio_compat_stream_tdd.rs`
- `tests/tokio_compat_stream_hardening_tdd.rs`
- `tests/tokio_poll_reactor_tdd.rs`
- `tests/tokio_poll_async_tdd.rs`
- `tests/tokio_runtime_lane_tdd.rs`
- `tests/tokio_runtime_wait_tdd.rs`
- renamed remaining Tokio interoperability coverage from `tests/tokio_compat_tdd.rs` to `tests/tokio_interop_tdd.rs` for clearer intent.
### New benchmark: fan-out/fan-in with skew
Added `benches/fanout_fanin.rs` and registered it in `Cargo.toml`.
Harness design:
- Same worker width on both runtimes (`4` threads/shards).
- Same workload model on both runtimes:
- per-request spawn fan-out (`16` branches), then fan-in on join.
- deterministic synthetic compute per branch.
- Two scenarios:
- `fanout_fanin_balanced`: all branches equal work.
- `fanout_fanin_skewed`: one hot branch per request has much heavier work.
- Bench variants:
- `tokio_mt_4`
- `spargio_queue`
- `spargio_io_uring` (Linux)
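The per-request workload on the Tokio side reduces to the shape below; the branch count matches the harness, while the work constants are illustrative rather than the harness's exact values:

```rust
// Sketch of one fan-out/fan-in request: 16 spawned branches with
// deterministic synthetic compute, then fan-in on join. The skewed
// scenario makes one branch (index 0 here) much heavier.
async fn fanout_fanin_request(skewed: bool) -> u64 {
    let mut joins = Vec::with_capacity(16);
    for branch in 0..16u64 {
        let work: u64 = if skewed && branch == 0 { 200_000 } else { 10_000 };
        joins.push(tokio::spawn(async move {
            let mut acc = branch;
            for i in 0..work {
                acc = acc.wrapping_mul(31).wrapping_add(i);
            }
            acc
        }));
    }
    let mut sum = 0u64;
    for j in joins {
        sum = sum.wrapping_add(j.await.unwrap()); // fan-in
    }
    sum
}
```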
### Quick MVP benchmark sample
Command:
- `cargo bench --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
Observed ranges:
- `fanout_fanin_balanced/tokio_mt_4`: ~`1.41-1.51 ms`
- `fanout_fanin_balanced/spargio_queue`: ~`10.7-18.1 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`0.782-0.813 ms`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.34-2.40 ms`
- `fanout_fanin_skewed/spargio_queue`: ~`54.0-54.4 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`1.882-1.889 ms`
### Validation
- `cargo fmt` passes.
- `cargo test` passes.
- `cargo test --features uring-native` passes.
- `cargo bench --no-run` passes (includes `fanout_fanin`).
## Direction Note: Full io_uring Runtime Scope (2026-02-26)
Long-term direction:
- evolve `spargio` toward a fuller `io_uring` runtime surface (disk + network I/O), comparable in scope to specialized runtimes.
Near-term priority remains unchanged:
- prove differentiated value first in `msg_ring`-coordinated cross-shard scheduling, placement, and work-stealing benchmarks.
Implication for sequencing:
- full disk/network API breadth is explicitly treated as a later expansion track after current scheduler/coordination milestones are validated.
## Update: Slice Execution MVP (Placement, Stealing, Boundary, CI, Reference App) (2026-02-26)
Executed the remaining planned slices in MVP form with red/green TDD coverage.
### Red-phase tests added
New failing suites introduced first:
- `tests/slices_tdd.rs`
- placement policy routing (`Pinned`, `Sticky`)
- stealable execution on non-preferred shard under load
- runtime stats snapshot counters/shape
- `tests/boundary_tdd.rs`
- bounded overload behavior (`Overloaded`)
- blocking timeout behavior (`Timeout`)
- cancellation-safe reply path (`Canceled`)
- deadline metadata propagation
Then implementation was iterated until all tests passed.
### Slice 3: Placement policy MVP
Implemented:
- `TaskPlacement` enum:
- `Pinned(ShardId)`
- `RoundRobin`
- `Sticky(u64)`
- `Stealable`
- `StealablePreferred(ShardId)`
- `RuntimeHandle::spawn_with_placement(...)`
- `RuntimeHandle::spawn_stealable_on(preferred_shard, ...)`
Notes:
- sticky placement uses stable key hashing to shard index.
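The sticky mapping is essentially a stable hash of the key reduced to a shard index; a sketch of that mapping (the actual hash function used by the runtime is not recorded in this log):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Same key + same shard count => same shard, so Sticky(key) placement
// keeps related work co-located for the lifetime of the runtime.
fn sticky_shard(key: u64, shard_count: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % shard_count
}
```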
### Slice 4: True work-stealing MVP
Implemented:
- global stealable injector channel (`StealableTask`) shared across shard workers.
- shard workers opportunistically drain stealable tasks and execute locally.
- preferred-shard hint tracking with `stealable_stolen` counter when execution shard differs from preferred shard.
Validation:
- `stealable_preferred_tasks_can_run_on_another_shard_under_load` now passes.
### Slice 5: Ring-affine native I/O enforcement
Implemented:
- native local commands now carry `origin_shard`.
- backend validates `origin_shard == current_shard` before submitting native ops.
- affinity violations increment `native_affinity_violations` and fail the operation.
- pending native-op gauge (`pending_native_ops`) is tracked.
### Slice 6: `msg_ring` transport hardening
Implemented:
- configurable `msg_ring_queue_capacity` on `RuntimeBuilder`.
- io_uring payload queues enforce bounded capacity.
- overload now reports `SendError::Backpressure` for saturated payload queues.
- backpressure counter surfaced via `ring_msgs_backpressure`.
### Slice 7: Mixed-runtime boundary API hardening
Implemented `spargio::boundary` module:
- bounded channel construction via `boundary::channel(capacity)`.
- client API:
- `call(...)`
- `try_call(...)`
- `call_with_timeout(...)`
- server API:
- `recv()`
- `recv_timeout(...)`
- request API:
- `request()`
- `deadline()`
- `respond(...)` (cancellation-safe)
- ticket API:
- `Future` implementation
- `wait_timeout_blocking(...)`
Error model:
- `BoundaryError::{Closed, Overloaded, Timeout, Canceled}`.
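Sketch of the boundary contract from both sides; `Req`/`Resp` and the `Client`/`Server` type names are placeholders and return shapes are assumptions, while method and error names come from the list above:

```rust
use std::time::Duration;

// Host-runtime side: bounded call with a deadline.
async fn host_side(client: boundary::Client<Req, Resp>) {
    match client.call_with_timeout(Req { id: 7 }, Duration::from_millis(50)).await {
        Ok(_resp) => { /* success path */ }
        Err(BoundaryError::Overloaded) => { /* bounded channel full: shed load */ }
        Err(BoundaryError::Timeout) => { /* deadline exceeded */ }
        Err(_) => { /* Closed or Canceled */ }
    }
}

// spargio side: receive, observe the deadline, reply.
async fn spargio_side(server: boundary::Server<Req, Resp>) {
    while let Ok(req) = server.recv().await {
        let _deadline = req.deadline();  // propagated metadata
        req.respond(Resp { ok: true });  // cancellation-safe reply
    }
}
```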
### Slice 8: Observability and operator signals
Implemented snapshot API:
- `RuntimeHandle::stats_snapshot() -> RuntimeStats`
Current signals:
- per-shard command depth (`shard_command_depths`)
- submitted pinned / stealable spawn counts
- stealable executed / stolen counts
- ring message submitted / completed / failed / backpressure counts
- native affinity violation count
- pending native-op gauge
### Slice 9: CI regression gates
Added:
- `.github/workflows/ci.yml` with gates for:
- format check
- tests
- `uring-native` tests
- `cargo bench --no-run`
- fan-out benchmark smoke + guardrail scripts
Added scripts:
- `scripts/bench_fanout_smoke.sh`
- `scripts/bench_fanout_guardrail.sh`
### Slice 10: Reference mixed-mode service + benchmark expansion
Added:
- `examples/mixed_mode_service.rs`
- Tokio-hosted request fan-out to `spargio` via boundary channel
- stealable placement usage + aggregation response path
- timeout-aware boundary call path
Benchmark update:
- `benches/fanout_fanin.rs` now records throughput units per group (`Throughput::Elements`).
### Validation
- `cargo test` passes.
- `cargo test --features uring-native` passes.
- `cargo bench --no-run` remains green.
## Update: Full Benchmark Snapshot Refresh (2026-02-26)
Captured a fresh baseline across all active benchmark suites after slice MVP implementation.
### Command profile
- `cargo bench --bench ping_pong -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench disk_io -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Observed ranges
From `ping_pong`:
- `steady_ping_pong_rtt/spargio_queue`: ~`1.37-1.42 ms`
- `steady_ping_pong_rtt/spargio_io_uring`: ~`353-380 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.41-1.51 ms`
- `steady_one_way_send_drain/spargio_queue`: ~`1.31-1.35 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`66.9-69.1 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`87.2-91.1 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`22.4-23.4 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`13.7-14.7 us`
- `cold_start_ping_pong/spargio_queue`: ~`2.43-2.44 ms`
- `cold_start_ping_pong/spargio_io_uring`: ~`242-264 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`511-560 us`
From `fanout_fanin`:
- `fanout_fanin_balanced/tokio_mt_4`: ~`1.35-1.38 ms`
- `fanout_fanin_balanced/spargio_queue`: ~`3.80-4.10 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`1.61-1.65 ms`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.39-2.59 ms`
- `fanout_fanin_skewed/spargio_queue`: ~`3.44-3.73 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`1.99-2.00 ms`
From `disk_io`:
- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.80-1.95 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.61-2.78 ms`
### Readout
- `spargio_io_uring` is strongest in control-path RTT and cold-start latency.
- one-way unbatched send/drain favors `spargio_io_uring`, but batched Tokio remains significantly faster.
- skewed fan-out/fan-in currently favors `spargio_io_uring`.
- balanced fan-out/fan-in currently favors Tokio.
- current disk RTT harness remains a loss for the io_uring+msg_ring path.
## Update: msg_ring Stealable Dispatch + Benchmark Refresh (2026-02-26)
Implemented work-stealing data-path changes to align with project premise:
- replaced global stealable injector channel with per-shard stealable inboxes.
- changed stealable submit path to:
1. choose target shard by inbox depth (submission-time decision),
2. enqueue task into target inbox,
3. wake target via `msg_ring` doorbell on `IoUring` backend.
- added wake plumbing:
- `LocalCommand::SubmitStealableWake`
- `Command::StealableWake`
- backend `submit_stealable_wake(...)` path.
- kept queue-backend fallback wake semantics for non-io_uring runs.
TDD additions:
- added Linux io_uring slice test proving stealable dispatch submits ring wake traffic:
- `tests/slices_tdd.rs::io_uring_stealable_dispatch_uses_msg_ring_wake`.
Validation:
- `cargo fmt`
- `cargo test`
- `cargo test --features uring-native`
Benchmark profile:
- `cargo bench --bench ping_pong -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench disk_io -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `./scripts/bench_fanout_guardrail.sh`
Observed ranges:
From `ping_pong`:
- `steady_ping_pong_rtt/spargio_io_uring`: ~`352-370 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.30-1.42 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`66.6-68.3 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`84.0-90.6 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`24.2-26.1 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`14.4-15.7 us`
- `cold_start_ping_pong/spargio_io_uring`: ~`248-305 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`500-555 us`
From `fanout_fanin`:
- `fanout_fanin_balanced/tokio_mt_4`: ~`1.43-1.51 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`982-989 us`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.35-2.42 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`1.92-1.93 ms`
From `disk_io`:
- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.82-2.00 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.52-2.74 ms`
Interpretation:
- value proposition now shows up directly in coordination-heavy fan-out/fan-in:
- balanced and skewed scenarios both favor `spargio_io_uring`.
- compared with earlier same-day snapshot, `fanout_fanin_balanced` flipped from loss to win after the stealable dispatch changes.
- batched Tokio one-way throughput remains a known gap.
- disk RTT benchmark remains a known gap.
## Roadmap: Toward Full Runtime Scope
Objective:
- evolve `spargio` into a fuller async runtime in the class of `glommio` / `monoio` / `compio`, while preserving the current differentiator (`msg_ring`-coordinated cross-shard scheduling + stealing).
Priority roadmap:
1. Lock the differentiator with stable KPI gates.
2. Build scheduler v2 (true per-worker deque stealing + fairness controls).
3. Complete core runtime primitives (timers, cancellation, task groups, backpressure semantics).
4. Deliver native network I/O MVP (TCP/UDP) on io_uring.
5. Deliver native filesystem I/O MVP with clear FD/buffer ownership and affinity rules.
6. Harden reliability and observability (stress/soak, failure injection, per-shard metrics and tracing).
7. Keep sidecar interop first-class; treat broad Tokio-compat readiness emulation as an optional long-term lane.
Immediate milestone sequence:
1. Deque-based stealing + fairness/budgeting.
2. Timer + timeout + cancellation primitives.
3. TCP MVP + dedicated latency/throughput/tail benchmarks.
## Update: Roadmap Tasks 1-5 MVP Implementation (TDD) (2026-02-26)
Implemented the first pass for roadmap tasks 1-5 with red/green TDD, then validated with tests and benchmark guardrails.
### 1) KPI gates for value proposition
Added benchmark guardrails/scripts:
- `scripts/bench_ping_guardrail.sh`
- checks `steady_ping_pong_rtt`, unbatched `steady_one_way_send_drain`, and `cold_start_ping_pong` against Tokio ratio thresholds.
- `scripts/bench_kpi_guardrail.sh`
- runs ping + fanout guardrails together.
- existing `scripts/bench_fanout_guardrail.sh` retained.
CI update:
- `.github/workflows/ci.yml` now runs:
- fanout smoke
- ping perf guardrail
- fanout perf guardrail
### 2) Scheduler v2 (per-worker deque stealing + fairness controls)
Runtime changes:
- added `RuntimeBuilder::stealable_queue_capacity(...)`.
- added `RuntimeBuilder::steal_budget(...)`.
- changed stealable submission path:
- submit to preferred shard deque (`StealablePreferred`) with bounded capacity.
- return `RuntimeError::Overloaded` on enqueue backpressure.
- worker execution loop now:
- drains local deque first up to budget.
- attempts bounded victim steals via rotating cursor when local queue has room.
New stats signals:
- `stealable_backpressure`
- `steal_attempts`
- `steal_success`
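The worker-loop ordering described above (local drain bounded by `steal_budget`, then one bounded steal pass with a rotating victim cursor) is sketched below with plain std types; this illustrates the policy, not the runtime's code:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

type Task = Box<dyn FnOnce() + Send>;

fn run_once(
    local: &Mutex<VecDeque<Task>>,
    victims: &[Arc<Mutex<VecDeque<Task>>>],
    cursor: &mut usize,
    budget: usize,
) {
    // 1) Local work first, bounded by the fairness budget.
    for _ in 0..budget {
        let Some(task) = local.lock().unwrap().pop_front() else { break };
        task();
    }
    // 2) One bounded steal pass, rotating the starting victim.
    for i in 0..victims.len() {
        let victim = &victims[(*cursor + i) % victims.len()];
        if let Some(task) = victim.lock().unwrap().pop_back() {
            *cursor = (*cursor + i + 1) % victims.len();
            task();
            break;
        }
    }
}
```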
### 3) Core runtime primitives (timer/cancellation/task groups/backpressure semantics)
Added:
- `sleep(Duration) -> impl Future<Output = ()>`
- `timeout(Duration, fut) -> Result<T, TimeoutError>`
- `CancellationToken` with:
- `new()`
- `cancel()`
- `is_canceled()`
- `cancelled() -> Future`
- `TaskGroup` with cooperative cancellation:
- `TaskGroup::new(handle)`
- `spawn_with_placement(...) -> TaskGroupJoinHandle<T>`
- `cancel()`
- `token()`
Backpressure semantics now include stealable task-queue overload via `RuntimeError::Overloaded`.
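Sketch combining the new primitives; the argument order of `spawn_with_placement` and the return types are assumptions, names come from the list above:

```rust
use std::time::Duration;

// Hypothetical usage sketch of the timer/cancellation primitives.
async fn guarded_work(handle: RuntimeHandle) {
    sleep(Duration::from_millis(5)).await; // plain timer

    // Timeout wrapper: the inner sleep is too slow, so this errors.
    let res = timeout(Duration::from_millis(10), sleep(Duration::from_millis(50))).await;
    assert!(res.is_err());

    // Cooperative cancellation across a task group.
    let group = TaskGroup::new(handle);
    let token = group.token();
    let _join = group.spawn_with_placement(TaskPlacement::Stealable, async move {
        token.cancelled().await; // parks until group.cancel()
    });
    group.cancel();
}
```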
### 4) Native network I/O MVP (io_uring lane)
Extended `UringNativeLane` with:
- `recv(fd, len)`
- `send(fd, buf)`
Implemented via native io_uring ops:
- `IORING_OP_RECV`
- `IORING_OP_SEND`
### 5) Native filesystem I/O MVP (ownership + affinity surface)
Added:
- `UringNativeLane::fsync(fd)` (`IORING_OP_FSYNC`)
- `UringBoundFd` ownership wrapper bound to a lane/shard with methods:
- `read_at`, `write_at`, `recv`, `send`, `fsync`
- binding helpers:
- `bind_owned_fd`
- `bind_file`
- `bind_tcp_stream`
- `bind_udp_socket`
This gives an explicit ownership + shard-affinity API surface for FD-driven native ops.
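Sketch of the bound-FD surface in use; which type hosts `bind_file`, parameter types, and buffer ownership details are assumptions:

```rust
// Hypothetical usage sketch of the ownership + affinity surface.
async fn bound_file_roundtrip(lane: UringNativeLane, file: std::fs::File) -> std::io::Result<()> {
    // Binding ties the FD to this lane's shard; every op below is
    // submitted on that shard's ring and cannot migrate mid-flight.
    let bound: UringBoundFd = lane.bind_file(file);

    let n = bound.write_at(0, b"hello".to_vec()).await?;
    bound.fsync().await?;                  // IORING_OP_FSYNC
    let back = bound.read_at(0, n).await?; // read the bytes back
    assert_eq!(back, b"hello".to_vec());
    Ok(())
}
```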
### Red/green tests added
- `tests/primitives_tdd.rs`
- sleep timing
- timeout success/failure
- cancellation token notification
- task-group cancellation and completion semantics
- `tests/slices_tdd.rs` additions
- stealable queue backpressure -> `RuntimeError::Overloaded`
- steal attempts/success stats under blocked-owner load
- `tests/uring_native_tdd.rs` additions
- bound file write/read/fsync
- bound TCP send/recv
- bound UDP send/recv
### Validation
- `cargo fmt`
- `cargo test`
- `cargo test --features uring-native`
- `./scripts/bench_ping_guardrail.sh`
- `./scripts/bench_fanout_guardrail.sh`
- `cargo bench --bench disk_io -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Benchmark readout (latest local run profile)
From ping guardrail run:
- `steady_ping_pong_rtt/spargio_io_uring`: ~`363-380 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.37-1.48 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`73.0-75.3 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`104.6-115.8 us`
- `cold_start_ping_pong/spargio_io_uring`: ~`260-297 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`463-511 us`
From fanout guardrail run:
- `fanout_fanin_balanced/tokio_mt_4`: ~`1.42-1.50 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`1.33-1.35 ms`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.42-2.53 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`2.03-2.04 ms`
From disk benchmark run:
- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.79-1.93 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.65-2.80 ms`
## Benchmark suite update: FS/Net API coverage and legacy disk bench removal
Implemented benchmark suite changes to align with current runtime API surface:
- removed legacy disk RTT benchmark harness:
- deleted `benches/disk_io.rs`
- removed `[[bench]] name = "disk_io"` from `Cargo.toml`
- added filesystem API benchmark suite:
- `benches/fs_api.rs`
- `fs_read_rtt_4k`:
- `tokio_spawn_blocking_pread_qd1`
- `spargio_uring_bound_file_qd1`
- `fs_read_throughput_4k_qd32`:
- `tokio_spawn_blocking_pread_qd32`
- `spargio_uring_bound_file_qd32`
- added network API benchmark suite:
- `benches/net_api.rs`
- `net_echo_rtt_256b`:
- `tokio_tcp_echo_qd1`
- `spargio_uring_bound_tcp_qd1`
- `net_stream_throughput_4k_window32`:
- `tokio_tcp_echo_window32`
- `spargio_uring_bound_tcp_window32`
- updated `Cargo.toml` benchmark targets:
- `ping_pong`
- `fanout_fanin`
- `fs_api`
- `net_api`
Validation run:
- `cargo fmt --all`
- `cargo bench --no-run`
- `cargo bench --no-run --features uring-native`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
Latest benchmark readout (short smoke profile):
From `fs_api`:
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.59-1.68 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.98-2.11 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`7.66-7.76 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`7.51-8.23 ms`
From `net_api`:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`8.17-8.54 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`6.89-6.97 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`11.12-11.42 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`29.33-30.01 ms`
## Net benchmark tuning pass: reduce `net_stream_throughput_4k_window32` gap
Goal:
- reduce overhead in the `uring-native` TCP path and re-run `net_api` to improve `net_stream_throughput_4k_window32`.
Implemented runtime/API changes (`src/lib.rs`):
- added owned-buffer native APIs:
- `UringNativeLane::recv_owned(fd, Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
- `UringNativeLane::send_owned(fd, Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
- `UringBoundFd::recv_owned(Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
- `UringBoundFd::send_owned(Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
- kept existing convenience APIs by adapting through owned-buffer path:
- `recv(fd, len)` now uses `recv_owned` + truncate
- `send(fd, &[u8])` now uses `send_owned`
- added same-shard fast path in `recv_owned`/`send_owned`:
- if called from a matching runtime/shard context, enqueue the native op directly into the local command queue instead of spawning a new pinned task.
- wired owned-buffer request/response shapes through local command + backend + io_uring native op completion path.
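The owned-buffer loop the tuned harness relies on looks roughly like this sketch; framing and partial-send handling are simplified, and whether `recv_owned` preserves the buffer's length is an assumption:

```rust
// Hypothetical echo loop over the owned-buffer APIs listed above.
async fn echo_once(lane: &UringNativeLane, fd: std::os::unix::io::RawFd) -> std::io::Result<()> {
    let mut buf = vec![0u8; 4096];
    loop {
        // The buffer is moved to the kernel-facing op and handed back
        // on completion, so the hot loop does not allocate per op.
        let (n, mut b) = lane.recv_owned(fd, buf).await?;
        if n == 0 {
            return Ok(()); // peer closed
        }
        b.truncate(n); // echo only the received bytes
        let (_sent, mut b2) = lane.send_owned(fd, b).await?;
        b2.resize(4096, 0); // restore length, reuse the allocation
        buf = b2;
    }
}
```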
TDD coverage:
- added `uring_bound_tcp_stream_supports_owned_send_and_recv_buffers` in `tests/uring_native_tdd.rs`.
Benchmark harness tuning (`benches/net_api.rs`):
- moved Spargio net workload execution into a pinned runtime worker task (command-driven harness), instead of issuing all ops from outside the runtime.
- switched throughput receive path to stream-byte draining with a reusable scratch buffer (`64 KiB`) for both Tokio and Spargio:
- reduces per-op overhead and keeps the workload apples-to-apples as stream throughput.
- switched Spargio send path to owned-buffer reuse (`send_owned`) with fallback for partial sends.
Validation:
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
Result delta from this tuning pass:
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: improved from ~`6.89-6.97 ms` to ~`5.46-5.70 ms`.
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: improved from ~`29.33-30.01 ms` to ~`12.96-13.16 ms`.
Current comparison (same run):
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.62-8.10 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.46-5.70 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.47-11.01 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`12.96-13.16 ms`
Interpretation:
- RTT is now clearly in Spargio’s favor for this harness.
- Stream throughput gap versus Tokio is substantially reduced (from ~2.6x slower to ~1.2x slower), but still present.
## Next optimization batch (committed plan before implementation)
Based on current net throughput gap, the next batch is:
1. Introduce provided-buffer multishot receive path (`IORING_OP_RECV_MULTISHOT` + `IORING_OP_PROVIDE_BUFFERS`) for stream receive-heavy benchmarks.
2. Expand reusable-buffer APIs (`recv_into`/owned-buffer reuse) so stream loops avoid per-op allocation churn.
3. Add batch-oriented stream APIs (`send_batch`, `recv_batch`/multishot helpers) to reduce per-message control overhead.
4. Increase pipelining depth in throughput paths by issuing batched/native operations with configurable in-flight windows.
5. Add an io_uring throughput preset (`single_issuer`, `coop_taskrun`, optional `sqpoll`) and use it in benchmark harnesses with fallback when unsupported.
Execution approach remains red/green TDD: add failing tests for each new API/behavior, then implement minimal passing behavior, then re-benchmark.
## Implementation: proposal batch (multishot/batching/tuning) completed
Implemented all items from the prior optimization proposal set.
### 1) Provided-buffer multishot receive path
Runtime additions (`src/lib.rs`):
- new local command: `SubmitNativeRecvMultishot`
- new native op state: `NativeIoOp::RecvMulti` (buffer group, target bytes, collected chunks)
- new driver path:
- `submit_native_recv_multishot(...)`
- submits `IORING_OP_PROVIDE_BUFFERS` + `IORING_OP_RECV_MULTISHOT`
- collects CQEs until target bytes reached or stream ends
- issues `IORING_OP_ASYNC_CANCEL` when target reached while CQE `MORE` continues
- removes provided buffers via `IORING_OP_REMOVE_BUFFERS` on completion/failure
- completion path updated to process multishot/native housekeeping CQEs safely.
### 2) Reusable-buffer API expansion
Added:
- `UringNativeLane::recv_into(fd, Vec<u8>)`
- `UringBoundFd::recv_into(Vec<u8>)`
These preserve caller-owned buffers and avoid per-op allocation churn.
### 3) Batch-oriented stream APIs
Added:
- `UringNativeLane::send_batch(fd, Vec<Vec<u8>>, window)`
- `UringNativeLane::recv_batch_into(fd, Vec<Vec<u8>>, window)`
- `UringBoundFd::send_batch(...)`
- `UringBoundFd::recv_batch_into(...)`
- `UringNativeLane::recv_multishot(...)`
- `UringBoundFd::recv_multishot(...)`
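Sketch of the throughput loop shape these APIs enable, including the `recv_owned` fallback the harness keeps for kernels without provided-buffer multishot support; argument shapes beyond the listed names are assumptions:

```rust
// Hypothetical throughput pump over the batch/multishot APIs above.
async fn pump(bound: &UringBoundFd, frames: Vec<Vec<u8>>, want_bytes: usize) -> std::io::Result<()> {
    // One native command covers the whole batch (window of 32 in flight).
    bound.send_batch(frames, 32).await?;

    // Prefer multishot draining; fall back to per-op receives if unsupported.
    if bound.recv_multishot(want_bytes).await.is_err() {
        let mut got = 0usize;
        let mut scratch = vec![0u8; 64 * 1024];
        while got < want_bytes {
            let (n, b) = bound.recv_owned(scratch).await?;
            if n == 0 {
                break; // stream ended early
            }
            got += n;
            scratch = b;
        }
    }
    Ok(())
}
```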
### 4) Pipelining depth in throughput path
Benchmark harness updates (`benches/net_api.rs`):
- throughput send path now uses `send_batch` with reusable buffer pool.
- throughput receive path attempts `recv_multishot` first, then falls back to `recv_owned` if unsupported.
- this increases in-flight native work while keeping a fallback for older kernels.
### 5) io_uring throughput preset + harness usage
Runtime builder addition:
- `RuntimeBuilder::io_uring_throughput_mode(sqpoll_idle_ms)`
- enables `coop_taskrun`
- optional sqpoll setting through argument
Harness usage:
- `benches/fs_api.rs` and `benches/net_api.rs` now try throughput mode and fall back to plain io_uring runtime build if unavailable.
### Additional hardening done while implementing
- `flush_submissions()` now treats transient submit errors (`EAGAIN`/`EBUSY`/`Interrupted`) as retry/defer instead of immediate fatal teardown.
- this removed runtime cancellation failures seen under benchmark pressure.
### TDD additions
`tests/uring_native_tdd.rs` now includes:
- `uring_bound_tcp_stream_supports_recv_into_and_send_batch`
- `uring_bound_tcp_stream_supports_recv_multishot` (with unsupported-kernel fallback)
### Validation
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Latest benchmark readout after this implementation batch
From `fs_api`:
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.64-1.71 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.98-2.28 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`8.57-8.97 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`6.73-7.42 ms`
From `net_api`:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.92-8.35 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.57-5.88 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.93-11.85 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.92-12.28 ms`
Interpretation:
- proposal batch is functionally implemented end-to-end (APIs + runtime + tests + benches).
- stream throughput gap versus Tokio narrowed further while preserving RTT advantage.
## Next optimization batch: close net throughput gap vs Tokio
Goal:
- improve `net_stream_throughput_4k_window32` by reducing per-frame control-path overhead in Spargio’s native TCP path.
Planned items (to implement with red/green TDD):
1. True native send batching:
- add a single-command native submit path for multiple sends (`send_batch_native`) instead of `join_all(send_owned(...))` fanout.
- aggregate completions in-driver and reply once per batch.
2. Persistent multishot provided-buffer groups:
- keep a reusable provided-buffer pool per fd/lane for throughput loops.
- avoid `ProvideBuffers`/`RemoveBuffers` on every throughput batch.
3. Zero-copy-ish multishot completion path cleanup:
- remove `chunks.clone()` completion duplication.
- finish by moving accumulated chunks once.
4. Capability caching in benchmark/harness:
- probe multishot support once and stop retrying unsupported ops each batch.
5. Stronger throughput semantics:
- add `send_all_batch` behavior (or equivalent) so batch send handles partial writes without skewing throughput accounting.
## Implementation: net throughput optimization batch completed
Implemented all five planned items.
### 1) True native send batching
Runtime changes (`src/lib.rs`):
- new API:
- `UringNativeLane::send_all_batch(fd, bufs, window)`
- `UringBoundFd::send_all_batch(bufs, window)`
- `send_batch(...)` now delegates to `send_all_batch(...)`.
- new local command:
- `SubmitNativeSendBatchOwned`
- new backend + driver path:
- `ShardBackend::submit_native_send_batch(...)`
- `IoUringDriver::submit_native_send_batch(...)`
- batch state and CQE handling:
- `NativeSendBatch`
- `NativeSendBatchPart`
- `native_send_batches` + `native_send_parts`
- `complete_native_send_batch_part(...)`
- single batch reply channel per batch (not per send op).
### 2) Persistent multishot provided-buffer groups
Runtime changes (`src/lib.rs`):
- `NativeIoOp::RecvMulti` now references a pool key rather than owning temporary storage.
- new pool model:
- `NativeRecvPoolKey`
- `NativeRecvPool`
- `native_recv_pools: HashMap<...>`
- multishot flow now:
- registers provided buffers once per pool (`registered`).
- reuses pool storage/group across calls.
- reprovides consumed bids via `reprovide_multishot_buffers(...)`.
- marks pool free via `mark_recv_pool_free(...)`.
- removes all registered groups on driver shutdown.
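A rough shape of the persistent pool model described above; field names beyond those mentioned in the log (`registered`, the fd/group key, `native_recv_pools`) are assumptions for illustration.

```rust
use std::collections::HashMap;

// Pools are keyed per fd and provided-buffer group so throughput loops reuse
// the same registered storage across multishot calls.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct NativeRecvPoolKey {
    fd: i32,
    buf_group: u16,
}

struct NativeRecvPool {
    storage: Vec<u8>,   // backing memory for all buffers in the group
    buffer_len: usize,  // size of each provided buffer
    buffer_count: u16,  // number of buffers registered in the group
    registered: bool,   // ProvideBuffers submitted once for this pool
    in_use: bool,       // cleared by mark_recv_pool_free when an op finishes
}

struct DriverRecvPools {
    native_recv_pools: HashMap<NativeRecvPoolKey, NativeRecvPool>,
}
```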
### 3) Multishot completion path copy cleanup
- removed `chunks.clone()` completion duplication in `complete_native_op(...)`.
- completion now moves collected chunks with `std::mem::take(...)` when finishing multishot ops.
### 4) Capability caching in benchmark path
Benchmark changes (`benches/net_api.rs`):
- `spargio_echo_windowed(...)` now caches multishot support in-loop:
- if `recv_multishot` returns `EINVAL` / `ENOSYS` / `EOPNOTSUPP`, further multishot attempts are disabled for the rest of the run.
### 5) Stronger send semantics (`send_all_batch`)
- `send_all_batch` tracks per-buffer progress and retries partial writes until each buffer is fully sent or an error occurs.
- benchmark throughput sender now uses `send_all_batch(...)` (full-send semantics).
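The core of the full-send semantics is shown in the self-contained sketch below: track per-buffer progress and keep re-issuing sends for the unwritten tail. The real path pipelines a window of in-flight sends; this sequential version only illustrates the partial-write retry rule, with `send` standing in for one native send submission.

```rust
use std::io;

fn send_all_batch_core(
    bufs: &[Vec<u8>],
    mut send: impl FnMut(&[u8]) -> io::Result<usize>,
) -> io::Result<()> {
    for buf in bufs {
        let mut written = 0;
        while written < buf.len() {
            let n = send(&buf[written..])?;
            if n == 0 {
                return Err(io::Error::new(io::ErrorKind::WriteZero, "stream closed"));
            }
            written += n; // partial write: continue from the unwritten tail
        }
    }
    Ok(())
}
```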
### Red/Green TDD additions
Added tests first in `tests/uring_native_tdd.rs`, then implemented runtime until green:
- `uring_bound_tcp_stream_supports_send_all_batch`
- `uring_bound_tcp_stream_reuses_recv_multishot_path_across_calls`
### Validation
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Latest benchmark readout after this batch
From `net_api`:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.62-8.07 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.26-5.70 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.42-10.73 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.02-11.16 ms`
From `fs_api`:
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.60-1.75 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.85-1.92 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`7.51-7.62 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`6.40-6.96 ms`
Interpretation:
- net throughput gap vs Tokio narrowed again (roughly from ~1.1x slower to ~1.05x slower in this short-run harness).
- net RTT lead remains.
- fs throughput lead remains.
## Implementation: follow-up net throughput optimizations (session + segment path + reprovide coalescing)
Applied the next optimization set aimed at reducing remaining `net_stream_throughput_4k_window32` overhead.
### 1) Persistent session in benchmark worker
`benches/net_api.rs`:
- added `SpargioWindowedSession` that persists across `EchoWindowed` benchmark commands.
- session retains:
- reusable tx buffer pool,
- reusable recv scratch buffer,
- cached multishot capability state.
- worker now reuses this session for matching `(payload, window)` rather than rebuilding per invocation.
### 2) Segment-based multishot API (avoid `Vec<Vec<u8>>` materialization in hot path)
`src/lib.rs`:
- new public types:
- `UringRecvSegment { offset, len }`
- `UringRecvMultishotSegments { buffer, segments }`
- new APIs:
- `UringNativeLane::recv_multishot_segments(...)`
- `UringBoundFd::recv_multishot_segments(...)`
- `recv_multishot(...)` remains for compatibility and now adapts from segment output.
- `NativeIoOp::RecvMulti` now accumulates into one flat output buffer + segment metadata rather than `Vec<Vec<u8>>`.
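For reference, a local mirror of the segment types named above (fields per this log) plus a consumer that walks segments in place, which is the pattern the flat-buffer output enables instead of materializing `Vec<Vec<u8>>` frames.

```rust
struct UringRecvSegment {
    offset: usize,
    len: usize,
}

struct UringRecvMultishotSegments {
    buffer: Vec<u8>,                  // one flat output buffer
    segments: Vec<UringRecvSegment>,  // per-completion ranges into `buffer`
}

// Example consumer: total received bytes, slicing the flat buffer per segment.
fn total_received(out: &UringRecvMultishotSegments) -> usize {
    out.segments
        .iter()
        .map(|seg| out.buffer[seg.offset..seg.offset + seg.len].len())
        .sum()
}
```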
### 3) Reprovide coalescing (reduce housekeeping SQEs)
`src/lib.rs`:
- `reprovide_multishot_buffers(...)` now:
- sorts + deduplicates consumed bids,
- coalesces contiguous bids into runs,
- submits one `ProvideBuffers` SQE per contiguous run (instead of one per bid).
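A self-contained sketch of the coalescing step: sort and dedupe the consumed buffer ids, then fold contiguous ids into `(start_bid, run_len)` runs so one `ProvideBuffers` SQE can reprovision each run.

```rust
fn coalesce_bids(mut bids: Vec<u16>) -> Vec<(u16, u16)> {
    bids.sort_unstable();
    bids.dedup();
    let mut runs = Vec::new();
    let mut iter = bids.into_iter();
    if let Some(first) = iter.next() {
        let (mut start, mut len) = (first, 1u16);
        for bid in iter {
            if bid == start + len {
                len += 1; // still contiguous, extend the current run
            } else {
                runs.push((start, len));
                start = bid;
                len = 1;
            }
        }
        runs.push((start, len));
    }
    runs
}
```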
### TDD updates
- added test:
- `uring_bound_tcp_stream_supports_recv_multishot_segments`
- preserved existing multishot compatibility tests; full `--features uring-native` test suite remains green.
### Validation
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Latest net benchmark snapshot after this follow-up
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.58-7.90 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.25-5.35 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.51-10.85 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`10.84-10.95 ms`
Interpretation:
- stream-throughput gap narrowed further and is now close to parity in this short-run harness.
- RTT lead for Spargio remains.
## Implementation: fs RTT (`qd=1`) optimization batch (items 1-3)
Implemented the requested three-item set for `fs_read_rtt_4k`.
### 1) Run Spargio FS loops inside pinned runtime worker
`benches/fs_api.rs`:
- replaced external `block_on` Spargio loop with a pinned worker command loop (`SpargioFsCmd`).
- `ReadRtt` and `ReadQd` now execute on shard `1` in the runtime task itself.
- benchmark caller uses std mpsc request/reply to drive the worker, mirroring Tokio harness structure more closely.
### 2) Reusable read buffer API (`read_at_into`)
`src/lib.rs`:
- added:
- `UringNativeLane::read_at_into(fd, offset, buf)`
- `UringBoundFd::read_at_into(offset, buf)`
- `read_at(...)` now adapts through `read_at_into(...)`.
- added native read-owned command path:
- `LocalCommand::SubmitNativeReadOwned`
- backend routing `submit_native_read_owned(...)`
- driver submission `submit_native_read_owned(...)`
- native op state `NativeIoOp::ReadOwned`
- completion and failure handling updated for `ReadOwned`.
### 3) Persistent file session API (actor-style)
`src/lib.rs`:
- added `UringFileSession`:
- `read_at_into(...)`
- `read_at(...)`
- `shutdown(...)`
- `shard()`
- new constructor on bound fd:
- `UringBoundFd::start_file_session()`
- session is implemented as a pinned shard task with command channel (`UringFileSessionCmd`), keeping repeated file operations on one shard.
### Red/Green TDD
Added failing tests first, then implemented until green:
- `uring_bound_file_supports_read_at_into_reuse`
- `uring_bound_file_session_supports_repeated_reads`
### Validation
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Latest FS benchmark snapshot after this batch
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.62-1.73 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`0.99-1.01 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`7.59-7.75 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`5.74-6.27 ms`
Interpretation:
- `qd=1` RTT moved from slower-than-Tokio to faster-than-Tokio in this short-run harness.
- throughput lead at `qd=32` remains.
## Proposal: unbound submission-time steering for all native ops
Goal:
- allow stealable tasks to issue native ops without pre-pinning a lane, while selecting target shard at submission time.
Design slices:
1. Unbound native entrypoint:
- add `RuntimeHandle::uring_native_unbound() -> UringNativeAny`.
- expose all native ops (`read/write/fsync`, `send/recv`, batch, multishot) on `UringNativeAny`.
2. Lane selector:
- introduce `NativeLaneSelector` using per-shard pending native-op depth + round-robin tie-break.
- support optional locality hints (`preferred_shard`).
3. FD affinity lease table:
- add `FdAffinityTable` (`fd -> shard`) with TTL/release on idle.
- use weak leases for file ops, stronger leases for stream/socket ops, hard affinity for multishot lifetime.
4. Generic native command envelope:
- add `SubmitNativeAny { op, reply }` and route to selected shard.
- preserve local fast path when selected shard == current shard.
5. Op-family behavior:
- file single-shot ops steerable per op,
- stream single-shot ops steerable with lease-aware ordering,
- batch ops single-lane per batch,
- multishot fixed-lane for op lifetime (token/stream tied to owning lane).
6. Cancellation/timeouts:
- add global `op_id -> shard` tracking for correct cancel routing.
- keep resource cleanup on owning lane.
7. TDD rollout:
- slice A: unbound file ops + selector correctness/distribution tests.
- slice B: unbound stream single-shot + batch ordering tests.
- slice C: unbound multishot lifecycle/cancel/cleanup tests.
- slice D: benchmark variants (`*_unbound_*`) vs pinned/session APIs.
Recommendation:
- yes, this is worth doing, but as a phased effort.
- rationale:
- it preserves explicit pinned/session fast paths while adding a flexible, scheduler-friendly mode for stealable compute tasks.
- it unlocks broader ergonomics without forcing users to choose one affinity model globally.
- risk:
- correctness complexity is non-trivial (lease ownership, cancellation routing, multishot lifetime rules), so TDD slice gating is required.
## Implementation: unbound submission-time steering (slices A-D)
Implemented the full unbound slice set in this pass.
### Slice A: unbound entrypoint + selector + file ops
`src/lib.rs`:
- added `RuntimeHandle::uring_native_unbound() -> UringNativeAny`.
- added `NativeLaneSelector`:
- selection by per-shard pending native-op depth (`pending_native_ops_by_shard`) with round-robin tie-break.
- optional preferred-shard hinting.
- added `UringNativeAny` API surface for native ops:
- `read_at`, `read_at_into`, `write_at`, `fsync`
- plus stream/batch/multishot methods (below).
- added FD affinity lease table (`FdAffinityTable`):
- weak lease for file-family ops,
- strong lease for stream single-shot/batch,
- hard lease for multishot lifetime.
- added unbound op-route tracking:
- global `NativeOpId` allocation and `op_id -> shard` map.
- `active_native_op_count()` / `active_native_op_shard(...)` observability.
Stats:
- `RuntimeStats` now includes `pending_native_ops_by_shard`.
- io_uring driver now updates both global pending-native count and per-shard pending-native depth.
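The selection rule is summarized by the self-contained sketch below: honor an explicit preferred-shard hint, otherwise pick the shard with the lowest pending native-op depth, breaking ties with a rotating round-robin cursor. Internal representation in the runtime may differ.

```rust
struct NativeLaneSelector {
    rr_cursor: usize,
}

impl NativeLaneSelector {
    fn select(&mut self, pending_by_shard: &[usize], preferred: Option<usize>) -> usize {
        if let Some(shard) = preferred {
            if shard < pending_by_shard.len() {
                return shard; // optional locality hint wins when valid
            }
        }
        let shards = pending_by_shard.len();
        let min_depth = *pending_by_shard.iter().min().expect("at least one shard");
        // Scan starting at the round-robin cursor so equal-depth shards rotate fairly.
        for i in 0..shards {
            let shard = (self.rr_cursor + i) % shards;
            if pending_by_shard[shard] == min_depth {
                self.rr_cursor = (shard + 1) % shards;
                return shard;
            }
        }
        unreachable!("a shard with the minimum depth always exists")
    }
}
```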
### Slice B: stream single-shot + batch behavior
`UringNativeAny` now supports:
- `recv`, `recv_owned`, `recv_into`
- `send`, `send_owned`
- `send_batch`, `send_all_batch`
- `recv_batch_into`
Behavior:
- stream ops are lease-aware (`strong` lease), preserving lane-local ordering tendencies for repeated ops on the same FD.
- batch ops run single-lane per batch.
### Slice C: multishot lifecycle + cleanup
`UringNativeAny` now supports:
- `recv_multishot`
- `recv_multishot_segments`
Behavior:
- multishot uses `hard` FD affinity for operation lifetime.
- affinity is released when multishot completes.
- op-route map entries are added/removed around each unbound op, preserving ownership tracking.
### Slice D: benchmark variants (`*_unbound_*`)
`benches/fs_api.rs`:
- added `SpargioFsUnboundHarness`.
- added benchmark cases:
- `spargio_uring_unbound_file_qd1`
- `spargio_uring_unbound_file_qd32`
`benches/net_api.rs`:
- added `SpargioNetUnboundHarness`.
- added benchmark cases:
- `spargio_uring_unbound_tcp_qd1`
- `spargio_uring_unbound_tcp_window32`
### Red/Green TDD
Added failing tests first in `tests/uring_native_tdd.rs`, then implemented to green:
- `uring_native_unbound_requires_io_uring_backend`
- `uring_native_unbound_selector_distributes_when_depths_equal`
- `uring_native_unbound_file_ops_work`
- `uring_native_unbound_stream_ops_preserve_affinity_and_order`
- `uring_native_unbound_multishot_releases_hard_affinity_after_completion`
- `uring_native_unbound_tracks_active_op_routes_for_inflight_work`
### Validation
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Latest short-run benchmark snapshot
FS:
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.55-1.68 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.03-1.07 ms`
- `fs_read_rtt_4k/spargio_uring_unbound_file_qd1`: ~`1.01-1.03 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`8.55-8.70 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`5.93-6.68 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_unbound_file_qd32`: ~`6.57-7.38 ms`
Net:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.74-7.97 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.48-5.75 ms`
- `net_echo_rtt_256b/spargio_uring_unbound_tcp_qd1`: ~`7.64-8.04 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.69-11.17 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.09-11.33 ms`
- `net_stream_throughput_4k_window32/spargio_uring_unbound_tcp_window32`: ~`10.83-10.99 ms`
## Implementation: direct unbound command-envelope optimization (`SubmitNativeAny`)
Implemented the previously planned unbound-path optimization to remove per-op pinned-spawn overhead.
### What changed
`src/lib.rs`:
- added direct native command envelope:
- `Command::SubmitNativeAny { op: NativeAnyCommand }`
- `NativeAnyCommand` variants for read/write/fsync, send/recv, batch, multishot.
- `UringNativeAny` now dispatches native ops via:
- same-shard local fast path: enqueue `LocalCommand` directly.
- cross-shard envelope path: send `SubmitNativeAny` command to selected shard.
- preserved existing affinity/route semantics:
- `NativeLaneSelector` selection.
- FD lease table (`weak`/`strong`/`hard`).
- `op_id -> shard` tracking and cleanup.
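A shape sketch of the dispatch decision described above; command/enum names follow the log, but the payload types and send paths are simplified stand-ins rather than the runtime internals.

```rust
// Simplified stand-in for the native-op command payload.
enum NativeAnyCommand {
    ReadAt {},    // fd, offset, buffer, reply channel elided
    SendOwned {}, // fd, buffer, reply channel elided
    // ... other op families (write/fsync, recv, batch, multishot)
}

fn dispatch_native_any(
    current_shard: Option<usize>,
    target_shard: usize,
    op: NativeAnyCommand,
    enqueue_local: impl FnOnce(NativeAnyCommand),        // same-shard LocalCommand queue
    send_envelope: impl FnOnce(usize, NativeAnyCommand), // cross-shard SubmitNativeAny
) {
    match current_shard {
        // Local fast path: already on the target shard, skip the command envelope.
        Some(shard) if shard == target_shard => enqueue_local(op),
        // Cross-shard path: wrap the op in SubmitNativeAny to the selected shard.
        _ => send_envelope(target_shard, op),
    }
}
```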
### New observability
`RuntimeStats` now includes:
- `native_any_envelope_submitted`
- `native_any_local_fastpath_submitted`
### Red/Green TDD
Added failing tests first, then implemented to green:
- `uring_native_unbound_records_command_envelope_submission`
- `uring_native_unbound_records_local_fast_path_submission`
### Validation
- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
### Latest short-run snapshot after optimization
FS:
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.754-1.867 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.013-1.062 ms`
- `fs_read_rtt_4k/spargio_uring_unbound_file_qd1`: ~`1.003-1.028 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`8.732-9.015 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`5.967-6.988 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_unbound_file_qd32`: ~`6.085-6.866 ms`
Net:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.918-8.187 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`6.840-8.632 ms`
- `net_echo_rtt_256b/spargio_uring_unbound_tcp_qd1`: ~`5.539-5.812 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.544-10.656 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.073-11.449 ms`
- `net_stream_throughput_4k_window32/spargio_uring_unbound_tcp_window32`: ~`10.996-11.408 ms`
Interpretation:
- unbound `net_echo_rtt_256b` improved materially after removing per-op spawn overhead.
- unbound fs remains competitive and generally close to bound.
## Roadmap Revision: ergonomics-first sequence (requested)
No implementation in this update; this section revises priority order only.
### New priority order
1. Scope simplification first: remove bound APIs to keep the codebase manageable.
- deprecate/remove `UringNativeLane`/`UringBoundFd`-centric public paths in favor of unbound-first APIs.
- remove bound-only benchmark variants and docs references once replacement coverage exists.
2. Ergonomics project (highest priority after simplification):
- deliver a high-level API layer targeting parity with Compio-style filesystem and network ergonomics.
- target outcome: common file/network flows can be written without manual lane/FD plumbing boilerplate.
3. After ergonomics parity milestone is complete:
- add benchmark suites against Compio for filesystem and network APIs, with matched workload shapes.
- prioritize broader native I/O surface expansion.
4. Then continue with remaining milestones:
- production-grade work-stealing policy (fairness/starvation/adaptive heuristics),
- tail-latency perf program (longer windows + p95/p99 gates),
- production hardening (stress/soak/failure injection/observability),
- optional Tokio-compat readiness shim as a separate large-investment track.
### Ergonomics parity target (Compio-like)
At completion of the ergonomics project, Spargio should provide equivalent day-to-day usability for core filesystem/network tasks:
- filesystem:
- high-level async file open/create/read/write helpers,
- convenience methods equivalent to common `read_to_end_at`/buffer-reuse workflows.
- network:
- high-level async TCP/UDP connect/accept/send/recv helpers,
- convenience traits/wrappers for common read/write loops and batching patterns.
- runtime entry ergonomics:
- straightforward app entry patterns (macro or helper-based) with minimal setup boilerplate.
### Notes
- This roadmap change intentionally favors API usability and adoption surface before deeper policy/perf-hardening tracks.
- Bound APIs are treated as temporary complexity and are planned for removal ahead of the ergonomics phase.
- Post-ergonomics benchmarking will include explicit Spargio-vs-Compio fs/net comparisons.
## Update: scope simplification + ergonomics APIs + Compio benchmark lane
Completed the requested implementation batch in three slices:
### 1) Scope simplification (bound API removal)
Removed bound-centric native public APIs from `src/lib.rs`:
- removed `RuntimeHandle::uring_native_lane(...)`
- removed `UringNativeLane`
- removed `UringBoundFd`
- removed `UringFileSession`
Native public surface is now unbound-first:
- `RuntimeHandle::uring_native_unbound() -> UringNativeAny`
Also removed bound-oriented TDD/bench usage and migrated coverage to unbound equivalents.
### 2) Ergonomics project (Compio-like API shape)
Added high-level wrappers over unbound native ops in `src/lib.rs`:
- `spargio::fs`
- `OpenOptions`
- `File`
- `open`, `create`, `from_std`
- `read_at`, `read_at_into`, `read_to_end_at`
- `write_at`, `write_all_at`, `fsync`
- `spargio::net`
- `TcpStream`
- `connect`, `from_std`
- `send`, `recv`, `send_owned`, `recv_owned`
- `send_all_batch`, `recv_multishot_segments`
- `write_all`, `read_exact`
- `TcpListener`
- `bind`, `from_std`, `local_addr`, `accept`
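A hypothetical end-to-end usage of the new high-level surface; the method names follow the list above, but argument and return shapes (runtime-handle parameter, owned buffers, offsets) are assumptions and may differ from the actual signatures.

```rust
use spargio::{fs, net};

async fn demo(handle: &spargio::RuntimeHandle) -> std::io::Result<()> {
    // fs: open a file and read it fully starting at offset 0.
    let file = fs::File::open(handle, "/tmp/spargio-demo.txt").await?;
    let bytes = file.read_to_end_at(0).await?;

    // net: connect, send the bytes, read back an echo of the same length.
    let stream = net::TcpStream::connect(handle, "127.0.0.1:9000").await?;
    stream.write_all(&bytes).await?;
    let mut echo = vec![0u8; bytes.len()];
    stream.read_exact(&mut echo).await?;
    Ok(())
}
```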
Added red/green tests:
- new `tests/ergonomics_tdd.rs`
- `fs_open_read_to_end_and_write_at`
- `net_tcp_stream_connect_supports_read_write_all`
- `net_tcp_listener_bind_accepts_and_wraps_stream`
- rewrote `tests/uring_native_tdd.rs` to unbound-only coverage.
### 3) Benchmark refresh + Compio comparisons
Added Compio to Linux dev-dependencies:
- `Cargo.toml`:
- `[target.'cfg(target_os = "linux")'.dev-dependencies]`
- `compio = { version = "0.18.0", default-features = false, features = ["runtime", "io-uring", "fs", "net", "io"] }`
Rewrote benchmark harnesses:
- `benches/fs_api.rs`
- compares:
- `tokio_spawn_blocking_pread_qd1`
- `spargio_fs_read_at_qd1`
- `compio_fs_read_at_qd1`
- `tokio_spawn_blocking_pread_qd32`
- `spargio_fs_read_at_qd32`
- `compio_fs_read_at_qd32`
- `benches/net_api.rs`
- compares:
- `tokio_tcp_echo_qd1`
- `spargio_tcp_echo_qd1`
- `compio_tcp_echo_qd1`
- `tokio_tcp_echo_window32`
- `spargio_tcp_echo_window32`
- `compio_tcp_echo_window32`
### Validation
- `cargo fmt`
- `cargo test --features uring-native --tests`
- `cargo bench --features uring-native --no-run`
- `cargo bench --features uring-native --bench fs_api -- --sample-size 20`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`
### Latest benchmark snapshot (sample-size 20)
FS:
- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: `1.601-1.641 ms`
- `fs_read_rtt_4k/spargio_fs_read_at_qd1`: `1.012-1.026 ms`
- `fs_read_rtt_4k/compio_fs_read_at_qd1`: `1.388-1.421 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: `7.680-7.767 ms`
- `fs_read_throughput_4k_qd32/spargio_fs_read_at_qd32`: `5.971-6.054 ms`
- `fs_read_throughput_4k_qd32/compio_fs_read_at_qd32`: `5.983-6.119 ms`
Net:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.913-8.056 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.542-5.606 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.530-6.646 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `11.306-11.511 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `16.903-17.082 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `6.928-7.091 ms`
### Notes
- This completes the requested simplification + ergonomics + Compio benchmark scope.
- Current ergonomic `fs::OpenOptions::open`, `net::TcpListener::bind/accept`, and `net::TcpStream::connect` are async wrappers using blocking helper threads for setup operations; native io_uring open/accept/connect op coverage remains future work.
## Update: net throughput optimization pass (owned buffers + batch/multishot receive)
Focused on `net_stream_throughput_4k_window32`, where Spargio remained behind Tokio/Compio after the ergonomics migration.
### Red/Green TDD
Added failing ergonomics test first:
- `tests/ergonomics_tdd.rs`
- `net_tcp_stream_owned_buffers_support_read_write_all`
Then implemented the API and benchmark-path changes to green.
### API changes (`spargio::net::TcpStream`)
`src/lib.rs`:
- added `write_all_owned(Vec<u8>) -> io::Result<Vec<u8>>`
- added `read_exact_owned(Vec<u8>) -> io::Result<Vec<u8>>`
- optimized `read_exact(&mut [u8])` to reuse a scratch receive buffer rather than allocating a fresh buffer on every receive-loop iteration.
These allow high-frequency send/recv loops to reuse caller-owned buffers and avoid repeated allocation churn.
### Benchmark harness changes
`benches/net_api.rs`:
- `spargio_echo_rtt` now uses owned-buffer helpers:
- `write_all_owned`
- `read_exact_owned`
- `spargio_echo_windowed` now uses a throughput-oriented native path:
- prebuild frame batch from reusable tx pool
- `send_all_batch(...)`
- `recv_multishot_segments(...)` with kernel capability fallback (`EINVAL/ENOSYS/EOPNOTSUPP`)
- fallback receive path uses `read_exact_owned` with reusable buffer
### Validation
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo bench --features uring-native --bench net_api --no-run`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`
### Latest `net_api` snapshot after optimization
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.878-8.032 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.516-5.613 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.555-6.715 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `11.147-11.318 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `10.889-10.974 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `7.090-7.225 ms`
Result: Spargio throughput moved from clearly behind Tokio to slightly ahead in this harness run, while remaining behind Compio in sustained stream throughput.
## Update: local stream-session fast path + pool-backed multishot snapshot
Follow-up optimization work after the prior net-throughput pass.
### What was implemented
1) Local stream-session fast path (submission without unbound route tracking)
`src/lib.rs` (`UringNativeAny` + `spargio::net::TcpStream`):
- added direct-to-shard submit helper in `UringNativeAny`:
- bypasses `op_routes` + FD-affinity lock bookkeeping for stream-session calls.
- added stream-session methods on `UringNativeAny`:
- `select_stream_session_shard`
- `recv_owned_on_shard`
- `send_owned_on_shard`
- `send_all_batch_on_shard`
- `recv_multishot_segments_on_shard`
- `spargio::net::TcpStream` now selects a session shard at construction and routes stream ops through these methods.
2) Multishot receive copy-path change
`src/lib.rs` (`IoUringDriver::complete_native_op`):
- removed per-CQE compaction copy (`out.extend_from_slice(...)`) for multishot segments.
- now records segment offsets directly against buffer-pool layout (`bid * buffer_len`).
- returns a pool-backed snapshot buffer (`pool.storage.to_vec()`) with segment metadata.
Note: this is a safe pool-backed snapshot path (no per-segment compaction copy), not a full ownership-transfer zero-copy path. A first ownership-transfer attempt caused unsafe kernel buffer-registration interactions and was not kept.
### Red/Green TDD additions
Added failing tests first, then implemented to green:
- `tests/ergonomics_tdd.rs`
- `net_tcp_stream_session_path_does_not_track_unbound_op_routes`
- `tests/uring_native_tdd.rs`
- `uring_native_unbound_multishot_segments_expose_pool_backing_without_compaction_copy`
### Validation
- `cargo test --features uring-native --tests`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`
### Latest `net_api` snapshot after this pass
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.923-8.118 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.410-5.516 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.447-6.530 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `10.902-11.155 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `11.225-11.441 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `7.007-7.118 ms`
Interpretation:
- stream RTT improved further on Spargio.
- throughput remains near Tokio (within a few percent in this run) and behind Compio on sustained stream throughput.
## Update: imbalanced net-stream benchmark (hot/cold skew)
Added a third `net_api` benchmark to measure skewed stream load across multiple concurrent TCP connections.
### What changed
- `benches/net_api.rs`:
- refactored echo server fixture to support N accepted client connections per harness (`spawn_echo_server_with_clients`).
- extended Tokio/Spargio/Compio harness command sets with `EchoImbalanced`.
- each harness now creates `IMBALANCED_STREAMS=8` persistent streams.
- existing RTT/windowed benchmarks continue to use the primary stream.
- new benchmark group: `net_stream_imbalanced_4k_hot1_light7`.
### Imbalanced workload definition
- Streams: `8`
- Payload: `4096` bytes
- Window: `32`
- Heavy stream (`idx=0`): `2048` frames
- Light streams (`idx=1..7`): `128` frames each
- Total per iteration: `11,468,800` bytes
### Validation
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`
### Latest results (`--sample-size 20`)
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.903-8.093 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.405-5.474 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.472-6.593 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `11.157-11.203 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `11.085-11.166 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `7.136-7.277 ms`
- `net_stream_imbalanced_4k_hot1_light7/tokio_tcp_8streams_hotcold`: `13.595-13.853 ms` (`830-846 MiB/s`)
- `net_stream_imbalanced_4k_hot1_light7/spargio_tcp_8streams_hotcold`: `16.335-16.502 ms` (`697-704 MiB/s`)
- `net_stream_imbalanced_4k_hot1_light7/compio_tcp_8streams_hotcold`: `12.089-12.215 ms` (`942-951 MiB/s`)
### Notes
- The new skew benchmark is stable and repeatable.
- In the current implementation, Spargio is behind Tokio and Compio on this hot/cold multi-stream workload.
## Update: hypotheses and A/B plan for imbalanced net-stream slowdown
This captures hypotheses for why `net_stream_imbalanced_4k_hot1_light7` is currently slower on Spargio and the A/B tests we should run before changing core runtime behavior.
### Hypotheses
1. Workload shape is dominated by one serialized hot stream.
- In hot1/light7, one stream carries most bytes; single-stream TCP ordering limits parallelism and reduces benefits from work stealing.
2. Session-shard concentration reduces lane spread.
- Streams are created from one worker context; `TcpStream` picks `session_shard` at construction.
- With preferred-shard bias in selector, many streams may end up on the same shard.
3. Cross-shard submit overhead in imbalanced path.
- Imbalanced benchmark spawns stealable tasks per stream, but stream I/O still routes to stream `session_shard`.
- If task executes off-session-shard, each op pays envelope/command/oneshot overhead.
4. Multishot receive path still performs heavy copying.
- Current multishot completion returns a pool snapshot via `pool.storage.to_vec()`.
- This copies the full pool per batch and can dominate throughput in hot stream workloads.
### Quick A/B plan to prove each cause
A/B-1: workload-shape sensitivity (hot-stream serialization)
- A: current `hot1/light7` profile.
- B: balanced profile with same total bytes spread evenly across streams.
- Success signal: if Spargio narrows/erases gap on balanced profile, shape serialization is a primary contributor.
A/B-2: stream session-shard distribution
- A: current stream construction path.
- B: instrument and enforce explicit spread (round-robin stream creation context or per-stream target shard) and record distribution.
- Success signal: if better spread improves imbalanced throughput, lane concentration is a contributor.
A/B-3: task placement vs. stream session shard
- A: current `spawn_stealable` for stream workers.
- B: run stream workers pinned/preferred to each stream `session_shard`.
- Success signal: if B improves latency/throughput, cross-shard submit overhead is material.
A/B-4: multishot copy cost
- A: current `take_recv_pool_storage -> to_vec()` behavior.
- B: copy only touched segment ranges (or temporarily force non-multishot read path as control).
- Success signal: lower time and reduced CPU/memory pressure confirms copy-path dominance.
### Copy-reduction and related optimization options
1) Copy only touched bytes from multishot segments (low risk).
- Replace full-pool clone with segment-aware gather into a compact output buffer.
- Expected effect: materially lower copy volume on partial-pool consumption.
2) Segment-fold API to avoid materializing receive buffers (medium risk).
- Add API that processes multishot segments in-place and returns folded result (checksum/parser state/etc.).
- Expected effect: near-zero extra copy for many streaming workloads.
3) Pool lease API for true zero-copy receive view (higher complexity).
- Return a lease object that references registered pool storage + segment metadata.
- Reclaim buffers on lease drop, with double-buffered pool strategy to keep pipeline full.
4) Placement alignment for stream workers (complementary).
- Run per-stream tasks on their `session_shard` by default in throughput-oriented paths.
- Expected effect: remove cross-shard submit + response overhead from hot I/O loops.
### Priority suggestion
- First: A/B-4 (copy path) and A/B-3 (placement alignment).
- Then: A/B-2 (distribution), A/B-1 (shape sensitivity) for explanatory confidence and benchmark positioning.
## Update: A/B results for imbalanced net-stream hypotheses
Ran targeted A/B matrix in `benches/net_api.rs` via benchmark group:
- `net_stream_imbalanced_ab_4k`
Command used:
- `cargo bench --features uring-native --bench net_api -- net_stream_imbalanced_ab_4k --sample-size 12`
### Key results (time ranges)
- `tokio_hotcold`: `13.547-13.682 ms`
- `tokio_balanced_total_bytes`: `8.046-8.174 ms`
- `spargio_hotcold_stealable_multishot`: `16.337-16.454 ms`
- `spargio_hotcold_pinned_multishot`: `16.358-16.512 ms`
- `spargio_hotcold_stealable_readexact`: `17.902-17.970 ms`
- `spargio_hotcold_pinned_readexact`: `17.742-17.896 ms`
- `spargio_balanced_stealable_multishot` (single-context stream init): `16.861-16.986 ms`
- `spargio_hotcold_stealable_multishot_distributed_connect`: `13.534-13.684 ms`
- `spargio_hotcold_pinned_multishot_distributed_connect`: `13.300-13.360 ms`
- `spargio_balanced_stealable_multishot_distributed_connect`: `9.080-9.172 ms`
### Hypothesis outcomes
1) Workload shape (hot-stream serialization) matters: **confirmed**.
- Tokio hotcold vs balanced shows a large swing.
- Spargio shows the same swing once stream session distribution is fixed (`13.6 ms` hotcold vs `9.1 ms` balanced in distributed-connect mode).
2) Session-shard concentration / stream distribution: **strongly confirmed (primary factor)**.
- Spargio hotcold improves from ~`16.4 ms` to ~`13.6 ms` by only changing stream init to distributed-connect.
- This is the biggest single improvement in the A/B set.
3) Placement alignment (stealable vs pinned-to-session): **secondary effect**.
- In single-context mode, pinned vs stealable is effectively flat.
- In distributed-connect mode, pinned gives a modest gain (~2%).
4) Multishot copy-path concern: **not primary in this workload**.
- `read_exact` variants are slower than multishot by ~8-10%.
- Conclusion: reducing full-pool clone may still help, but it is not the top bottleneck for this benchmark shape.
### Re-evaluated optimization priorities
1. Make stream session-shard distribution explicit/default for multi-stream workloads.
- Add runtime/net API controls for connect-time lane selection (e.g., round-robin shard hinting).
2. Add stream-task placement helpers that align execution with stream session shard.
- Keep work-stealable default, but provide an easy pinned/session-aligned fast path for throughput loops.
3. Keep multishot as default receive path for throughput profiles.
- Do not switch to read_exact-only path for this workload class.
4. Move copy-reduction work to medium priority.
- Touched-range copy and lease-based zero-copy remain worthwhile, but after (1) and (2).
5. Add follow-up benchmark scenarios to validate generality.
- skewed + distributed under larger windows, mixed payload sizes, and parser-like downstream processing.
## Update: implemented optimization priorities from imbalanced A/B findings
Implemented the re-prioritized optimization set focused on multi-stream distribution, session-aligned execution ergonomics, and receive-copy reduction.
### 1) Stream distribution controls (runtime API)
`src/lib.rs` (`spargio::net`):
- added `StreamSessionPolicy`:
- `ContextPreferred`
- `RoundRobin`
- `Fixed(ShardId)`
- added session-policy connect APIs on `TcpStream`:
- `connect_with_session_policy(...)`
- `connect_round_robin(...)`
- `connect_many_with_session_policy(...)`
- `connect_many_round_robin(...)`
- added session-policy wrap API:
- `from_std_with_session_policy(...)`
- kept existing `connect(...)` / `from_std(...)` behavior via `ContextPreferred`.
- added session-policy accept APIs on `TcpListener`:
- `accept_with_session_policy(...)`
- `accept_round_robin(...)`
This makes multi-stream session placement explicit and gives a first-class round-robin path without requiring benchmark-specific task orchestration.
### 2) Session-shard-aligned execution helpers
`src/lib.rs` (`spargio::net::TcpStream`):
- added `spawn_on_session(&RuntimeHandle, fut)`
- added `spawn_stealable_on_session(&RuntimeHandle, fut)`
This removes boilerplate for session-aligned throughput loops and enables straightforward pinned-to-session execution from stream handles.
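A hypothetical multi-stream setup using the new distribution and placement helpers together; exact signatures (handle/address parameters, return collections) are assumed, and the per-stream loop body is elided.

```rust
use spargio::net::TcpStream;

async fn start_streams(handle: &spargio::RuntimeHandle, n: usize) -> std::io::Result<()> {
    // Spread session shards round-robin instead of concentrating on the caller's shard.
    let streams = TcpStream::connect_many_round_robin(handle, "127.0.0.1:9000", n).await?;
    for stream in streams {
        // Align each stream's throughput loop with its own session shard.
        stream.spawn_stealable_on_session(handle, async move {
            // ... per-stream send/recv loop (would own/use the stream in real code)
        });
    }
    Ok(())
}
```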
### 3) Keep multishot as default throughput receive path
`benches/net_api.rs`:
- throughput/imbalanced hot paths continue to default to multishot receive mode.
- read-exact is kept only as A/B comparison lane.
### 4) Copy reduction for multishot completion
`src/lib.rs` (io_uring driver):
- replaced full pool clone in multishot completion path with compact touched-range copy:
- old: full `pool.storage.to_vec()` clone
- new: copy only segment-covered ranges and rewrite segment offsets to compact buffer coordinates
This reduces receive-copy volume when only a subset of the registered pool is used per operation.
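A self-contained sketch of the compact copy: instead of cloning the whole registered pool, copy only the byte ranges covered by segments and rewrite each segment offset to point into the compact output buffer.

```rust
struct Segment {
    offset: usize,
    len: usize,
}

fn compact_touched_ranges(pool: &[u8], segments: &[Segment]) -> (Vec<u8>, Vec<Segment>) {
    let total: usize = segments.iter().map(|s| s.len).sum();
    let mut out = Vec::with_capacity(total);
    let mut rewritten = Vec::with_capacity(segments.len());
    for seg in segments {
        let new_offset = out.len();
        // Copy only the bytes this segment actually covers.
        out.extend_from_slice(&pool[seg.offset..seg.offset + seg.len]);
        // Offsets are rewritten to compact-buffer coordinates.
        rewritten.push(Segment { offset: new_offset, len: seg.len });
    }
    (out, rewritten)
}
```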
### Benchmark harness updates
`benches/net_api.rs`:
- `SpargioStreamInitMode::DistributedConnect` now uses runtime API (`connect_many_round_robin`) instead of benchmark-local pinned-connect orchestration.
- `bench_net_stream_imbalanced_4k_hot1_light7` uses distributed-connect Spargio harness (optimized multi-stream path).
- A/B matrix retained (`net_stream_imbalanced_ab_4k`) and updated to use the new helpers.
### Red/Green TDD
Added failing tests first, then implemented to green:
- `tests/ergonomics_tdd.rs`
- `net_tcp_stream_connect_round_robin_distributes_session_shards`
- `net_tcp_stream_spawn_on_session_runs_on_stream_session_shard`
- `tests/uring_native_tdd.rs`
- updated multishot-copy expectation:
- `uring_native_unbound_multishot_segments_use_compact_buffer_copy`
Validation:
- `cargo test --features uring-native --tests`
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- net_stream_imbalanced_ab_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_stream_imbalanced_4k_hot1_light7 --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_echo_rtt_256b --sample-size 12`
### Post-change benchmark snapshot (latest runs)
Imbalanced target benchmark:
- `net_stream_imbalanced_4k_hot1_light7/tokio_tcp_8streams_hotcold`: `14.058-14.331 ms`
- `net_stream_imbalanced_4k_hot1_light7/spargio_tcp_8streams_hotcold`: `13.300-13.734 ms`
- `net_stream_imbalanced_4k_hot1_light7/compio_tcp_8streams_hotcold`: `12.174-12.499 ms`
A/B confirmation:
- `spargio_hotcold_stealable_multishot_distributed_connect`: `13.410-13.639 ms`
- `spargio_hotcold_pinned_multishot_distributed_connect`: `13.050-13.144 ms`
- `spargio_balanced_stealable_multishot_distributed_connect`: `8.886-8.942 ms`
RTT sanity after harness adjustment:
- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.988-8.128 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.625-5.793 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.599-6.704 ms`
### Interpretation
- Primary bottleneck identified earlier (session concentration) is now addressed via runtime API and benchmark-path adoption.
- Session-aligned helpers are in place and show modest additional gains in distributed mode.
- Compact multishot copy reduced copy overhead and improved several A/B lanes, while multishot remains better than read-exact for these workloads.
## Update: separated net A/B scenarios into experimental benchmark target
To keep long-running benchmark reporting focused and stable, imbalanced A/B diagnostic scenarios were moved out of the main net benchmark target.
### What changed
- Added new bench target in `Cargo.toml`:
- `[[bench]] name = "net_experiments"`
- Main benchmark target `benches/net_api.rs` now includes only product-facing groups:
- `net_echo_rtt_256b`
- `net_stream_throughput_4k_window32`
- `net_stream_imbalanced_4k_hot1_light7`
- Experimental A/B matrix moved to `benches/net_experiments.rs`.
- Experimental group renamed for clarity:
- `exp_net_stream_imbalanced_ab_4k`
### Usage
- Product-facing benchmark suite:
- `cargo bench --features uring-native --bench net_api`
- Experimental diagnostic suite:
- `cargo bench --features uring-native --bench net_experiments`
### Validation
- `cargo check --features uring-native --bench net_api --bench net_experiments`
- Verified no A/B group is exposed from `net_api` target.
- Verified `net_experiments` runs `exp_net_stream_imbalanced_ab_4k` as intended.
## Update: dynamic-imbalance benchmark backlog + pipeline-hotspot implementation
Captured additional benchmark shapes (recorded as a backlog for posterity) to better probe the `msg_ring` + work-stealing value proposition under dynamic skew:
1. `net_stream_hotspot_rotation`
- rotating hot stream without explicit CPU stage.
2. `net_stream_bursty_tenants`
- many streams with bursty ON/OFF activity and skewed arrivals.
3. `net_pipeline_imbalanced_io_cpu`
- per-frame recv/CPU/send pipeline with rotating hotspot.
4. `fanout_fanin_hotkey_rotation`
- fanout/fanin with moving hot key pressure across shards.
5. `accept_connect_churn_skewed`
- skewed short-lived connection churn including setup path.
Implemented now:
- Added new benchmark group in `benches/net_api.rs`:
- `net_pipeline_hotspot_rotation_4k_window32`
- Added runtime lanes in the existing Tokio/Spargio/Compio net harness commands:
- `*_pipeline_hotspot` command + execution path per runtime.
- Workload shape:
- 8 streams, 4 KiB frames, window 32.
- hotspot rotates every 64 frames.
- per-frame CPU stage after echo receive (`heavy` for current hotspot stream, `light` for others).
- Added a shared deterministic CPU stage helper used by all three runtimes to keep the comparison shape aligned.
Validation:
- `cargo fmt`
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
Quick snapshot (`sample-size 10`):
- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.075-26.308 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `32.686-33.156 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.496-51.812 ms`
## Update: added `net_stream_hotspot_rotation_4k` (I/O-only rotating hotspot)
Implemented the follow-up benchmark shape requested to isolate dynamic skew effects without an explicit CPU stage.
What was added:
- New benchmark group in `benches/net_api.rs`:
- `net_stream_hotspot_rotation_4k`
- New runtime command lane across Tokio/Spargio/Compio harnesses:
- `EchoHotspotRotation`
- Workload definition:
- 8 streams
- 4 KiB frames
- hotspot rotates each step (`step % stream_count`)
- per-step frame budget:
- hotspot stream: `32` frames
- non-hot streams: `2` frames
- `64` steps total
- window `32`
Validation:
- `cargo fmt`
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
Quick snapshot (`sample-size 10`):
- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.7249-8.7700 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `11.499-11.600 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.637-16.766 ms`
## Roadmap update: runtime entry ergonomics moved to the front
To reduce first-use friction, runtime entry ergonomics is now the first item in the upcoming roadmap.
Updated upcoming order:
1. Runtime entry ergonomics:
- add a simple helper entrypoint (for example `spargio::run(...)`).
- add optional `#[spargio::main]` proc-macro sugar in a companion proc-macro crate.
- ensure feature-gated behavior and clear fallback/error messaging on unsupported platforms.
2. Remove blocking APIs from the public runtime surface.
- replace helper-thread `run_blocking` paths in `fs::OpenOptions::open`, `net::TcpStream::connect`, and `net::TcpListener::bind/accept`.
- require native/non-blocking paths for these setup operations.
3. Continue ergonomic parity work for fs/net API discoverability and docs.
4. Continue dynamic-imbalance benchmark expansion and optimization loops.
5. Proceed with broader native I/O surface + hardening milestones.
## Update: runtime entry ergonomics slice (helpers + `#[spargio::main]`)
Completed the next runtime-entry ergonomics slice with red/green TDD.
### Red phase
- Added new integration tests in `tests/entry_macro_tdd.rs`:
- `main_macro_executes_async_body`
- `main_macro_applies_builder_overrides`
- `main_macro_panics_on_runtime_build_failure`
- Ran:
- `cargo test --features macros --test entry_macro_tdd`
- Expected failure observed:
- package did not yet expose a `macros` feature.
### Green phase
- Added companion proc-macro crate:
- `spargio-macros/Cargo.toml`
- `spargio-macros/src/lib.rs`
- Implemented `#[spargio::main]` attribute macro:
- wraps an async, zero-argument entry function;
- supports options: `shards = ...`, `backend = "queue" | "io_uring"`;
- rejects unsupported signatures/options with a compile-time error.
- Wired feature-gated export in main crate:
- `Cargo.toml`: added optional dependency + `macros` feature.
- `src/lib.rs`: `#[cfg(feature = "macros")] pub use spargio_macros::main;`
- Existing helper entry APIs (`spargio::run`, `spargio::run_with`) remain the non-macro path.
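A hypothetical entry-point usage of the attribute macro with the options listed above; the option values and async body are illustrative.

```rust
// Requires the `macros` feature on the spargio dependency.
#[spargio::main(shards = 2, backend = "io_uring")]
async fn main() {
    println!("running inside the spargio runtime");
}
```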
### Validation
- `cargo test --features macros --test entry_macro_tdd`
- `cargo test --test runtime_tdd`
- `cargo test --features macros --tests`
- `cargo fmt`
### Status
- Runtime entry ergonomics roadmap item is now covered by:
- helper entry (`run`, `run_with`) and
- optional attribute macro entry (`#[spargio::main]`).
- Next planned item remains removing blocking setup APIs from the public fs/net surface.
## Update: removed blocking setup helpers from fs/net public APIs (Red/Green TDD)
Goal completed:
- Removed helper-thread `run_blocking` setup paths from:
- `spargio::fs::OpenOptions::open`
- `spargio::net::TcpStream::connect*`
- `spargio::net::TcpListener::bind/accept*`
### Red phase
Added/expanded failing tests in `tests/ergonomics_tdd.rs` to lock behavior before implementation:
- `net_tcp_stream_connect_supports_read_write_all` now asserts returned stream fd is nonblocking.
- `net_tcp_listener_bind_accepts_and_wraps_stream` now asserts accepted stream fd is nonblocking.
- Added fs option-compat tests:
- `fs_open_options_create_new_reports_already_exists`
- `fs_open_options_append_and_truncate_is_invalid`
Observed red failure before implementation:
- connected/accepted stream nonblocking assertions failed with existing helper-thread setup path.
### Green phase
Implemented native setup operations in the io_uring command pipeline:
- Added new native command flow variants (`NativeAnyCommand`, `LocalCommand`, backend dispatch, driver submission/completion):
- `OpenAt`
- `Connect`
- `Accept`
- Added `UringNativeAny` helpers:
- `open_at(...)`
- `connect_on_shard(...)`
- `accept_on_shard(...)`
- Added driver-side completion handling for new `NativeIoOp` variants.
Public API behavior changes:
- `fs::OpenOptions::open` now uses native `IORING_OP_OPENAT` instead of helper threads.
- `net::TcpStream::connect*` now creates nonblocking sockets and completes with native `IORING_OP_CONNECT` on the chosen shard.
- `net::TcpListener::accept*` now uses native `IORING_OP_ACCEPT` (nonblocking + cloexec accepted sockets).
- `net::TcpListener::bind` now creates/binds/listens via nonblocking socket syscalls (no helper thread).
- `TcpStream::from_std_with_session_policy` now enforces nonblocking mode.
Notes:
- Added sockaddr encode/decode helpers for IPv4/IPv6 setup/completion paths.
- `fs::OpenOptions` flag mapping now validates invalid combinations in-process and uses `openat` flags/mode directly.
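For orientation, below is a minimal sketch of the native connect submission shape (`IORING_OP_CONNECT`) via the `io-uring` crate; socket creation, nonblocking setup, and error handling are elided, the IPv4 sockaddr handling uses `libc` directly, and user-data values are illustrative.

```rust
use io_uring::{opcode, types, IoUring};
use std::mem;
use std::os::unix::io::RawFd;

// Submit a connect for an already-created nonblocking socket.
// The sockaddr must stay valid until the completion CQE is reaped.
fn submit_connect(ring: &mut IoUring, fd: RawFd, addr: &libc::sockaddr_in) -> std::io::Result<()> {
    let sqe = opcode::Connect::new(
        types::Fd(fd),
        addr as *const libc::sockaddr_in as *const libc::sockaddr,
        mem::size_of::<libc::sockaddr_in>() as libc::socklen_t,
    )
    .build()
    .user_data(0xC0);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit()?; // completion CQE carries 0 on success or -errno
    Ok(())
}
```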
### Validation
Executed:
- `cargo fmt`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --test uring_native_tdd`
- `cargo test --features uring-native`
Result:
- All tests pass.
## Update: benchmark refresh after native setup-path changes
Re-ran the monitored benchmark suites and refreshed README tables.
Command profile used for all runs:
- `--warm-up-time 0.05`
- `--measurement-time 0.05`
- `--sample-size 20`
Commands executed:
- `cargo bench --features uring-native --bench ping_pong -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --features uring-native --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --features uring-native --bench fs_api -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --features uring-native --bench net_api -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
Highlights from refreshed results:
- Coordination:
- `steady_ping_pong_rtt`: Tokio `1.4509-1.4888 ms`, Spargio `357.27-378.34 us`.
- `steady_one_way_send_drain`: Tokio `70.972-75.645 us`, Spargio `66.006-66.811 us`.
- `cold_start_ping_pong`: Tokio `535.65-601.90 us`, Spargio `262.24-291.99 us`.
- `fanout_fanin_balanced`: Tokio `1.4625-1.5346 ms`, Spargio `1.3333-1.3496 ms`.
- `fanout_fanin_skewed`: Tokio `2.4001-2.7005 ms`, Spargio `1.9590-1.9900 ms`.
- Native fs/net:
- `fs_read_rtt_4k`: Tokio `1.6476-1.7647 ms`, Spargio `0.99148-1.0145 ms`, Compio `1.3893-1.4970 ms`.
- `fs_read_throughput_4k_qd32`: Tokio `7.4895-7.6145 ms`, Spargio `5.9790-6.4699 ms`, Compio `5.4749-5.8905 ms`.
- `net_echo_rtt_256b`: Tokio `7.7059-8.0959 ms`, Spargio `5.3708-5.6477 ms`, Compio `6.4743-6.7640 ms`.
- `net_stream_throughput_4k_window32`: Tokio `11.163-11.324 ms`, Spargio `10.668-10.719 ms`, Compio `7.2779-7.4795 ms`.
- Imbalanced net:
- `net_stream_imbalanced_4k_hot1_light7`: Tokio `13.426-14.098 ms`, Spargio `13.510-13.911 ms`, Compio `12.221-12.479 ms`.
- `net_stream_hotspot_rotation_4k`: Tokio `8.6480-8.7488 ms`, Spargio `11.285-11.811 ms`, Compio `16.346-16.702 ms`.
- `net_pipeline_hotspot_rotation_4k_window32`: Tokio `26.383-26.937 ms`, Spargio `34.962-35.935 ms`, Compio `50.764-51.179 ms`.
Outcome:
- README benchmark tables and interpretation updated to match this refresh.
## Next Plan: remove remaining blocking surfaces (checklist + sequence)
Goal:
- Keep data-plane waits and setup on native nonblocking/io_uring paths.
- Move control-plane APIs to async-first shapes, then deprecate blocking variants.
Remaining blocking surfaces identified:
- Boundary blocking ticket wait:
- `BoundaryTicket::wait_timeout_blocking`.
- Boundary blocking server/client paths:
- `BoundaryServer::recv`, `BoundaryServer::recv_timeout`, and blocking `BoundaryClient::call`.
- Timer helper:
- `sleep` currently spawns a thread and uses `thread::sleep`.
- Hostname resolution path:
- `to_socket_addrs()` in `first_socket_addr` can block for DNS.
- Synchronous runtime-control entry points:
- `run_with` (`block_on`) and `shutdown` thread `join` waits.
- Queue-backend shard idle wait:
- `rx.recv_timeout(idle_wait)` (fallback/control-plane backend).
Execution sequence (prioritized):
1. io_uring timer lane (high impact, low risk)
- Add native timeout operation (`IORING_OP_TIMEOUT`) and route `sleep` through it on io_uring backend.
- Keep queue backend fallback behavior unchanged.
- Add TDD coverage for timer correctness/cancellation semantics.
2. Async-first boundary API (high impact, medium risk)
- Add async `BoundaryServer::recv_async`/stream-style polling API.
- Add async-first client call path and keep existing blocking APIs as compatibility wrappers.
- Mark blocking variants as compatibility APIs in docs (and later deprecate).
3. Address-resolution split (medium impact, low risk)
- Add `connect_socket_addr`-first API guidance and docs.
- Keep hostname API but route through explicit resolver boundary so blocking DNS is isolated and optional.
- Add tests that `SocketAddr` path stays fully nonblocking.
4. Runtime-control async variants (medium impact, medium risk)
- Add `run_async` and `shutdown_async` (non-blocking caller thread semantics).
- Keep existing sync entry points for ergonomics/back-compat.
5. Queue backend scope decision (medium impact, design choice)
- Either:
- keep queue backend as debug/fallback and accept blocking `recv_timeout`, or
- reduce queue backend role and push io_uring-only profiles as default perf lane.
- Record decision in ADR/log before implementation changes.
Acceptance checklist:
- [ ] No data-plane helper-thread blocking waits in io_uring mode.
- [ ] `sleep` uses native timeout path when io_uring backend is active.
- [ ] Boundary APIs have async-first equivalents covering current usage.
- [ ] Hostname resolution path is explicitly isolated from native data plane.
- [ ] README/implementation log reflect which blocking APIs are compatibility-only vs removed.
## Update: queue backend removed from public runtime configuration
Decision implemented from the blocking-surface plan:
- Queue backend is no longer selectable via `BackendKind`.
- `BackendKind` now exposes only `IoUring`.
- `RuntimeBuilder::default()` now defaults to `BackendKind::IoUring`.
Code and harness updates:
- Removed `BackendKind::Queue` usage from tests and benches.
- Updated runtime tests that previously forced queue mode to use io_uring (with existing graceful skip behavior when io_uring init is unavailable).
- Updated `ping_pong` and `fanout_fanin` benches to stop running `spargio_queue` variants.
- Updated README status text to describe io_uring-only backend.
Validation:
- `cargo fmt`
- `cargo test --features uring-native`
- `cargo bench --features uring-native --no-run`
Notes:
- Internal queue-oriented backend code paths remain in `ShardBackend` as dead code at this stage and are no longer instantiated through public builder/backend selection.
- Follow-up cleanup can remove those branches entirely if we want to reduce maintenance surface further.
## Update: internal queue backend branches removed
Follow-up cleanup completed after public queue-backend removal.
Changes:
- Removed internal `ShardBackend::Queue` handling branches from runtime dispatch.
- `ShardBackend` now routes only through io_uring paths in the Linux build.
- Removed queue-branch fallback logic in native submit handlers (`submit_native_*`).
- Removed shard-loop blocking idle wait path (`rx.recv_timeout(...)`), leaving nonblocking poll + cooperative yield behavior.
- Removed `RuntimeBuilder::idle_wait` field/method since it only supported the removed queue idle path.
Related API/harness alignment:
- `#[spargio::main(...)]` macro backend option now accepts only `"io_uring"`.
- Macro tests and examples updated accordingly.
- `ping_pong` and `fanout_fanin` benches no longer include `spargio_queue` variants.
Validation:
- `cargo fmt`
- `cargo test --features "uring-native macros"`
- `cargo bench --features uring-native --no-run`
Result:
- All checks pass.
## Update: blocking-surface plan slice implemented (Red/Green TDD)
Scope completed from the blocking-removal checklist:
- io_uring timer lane:
- Added native timeout command path (`IORING_OP_TIMEOUT`) to the io_uring driver.
- Added `UringNativeAny::sleep(Duration)`.
- Routed top-level `spargio::sleep(...)` to the shard-local native timeout path when running inside a Spargio shard; the existing fallback behavior is kept when called outside shard context.
- Async-first boundary APIs:
- Added async-first boundary surfaces:
- `BoundaryClient::call_async(...)`
- `BoundaryClient::call_async_with_timeout(...)`
- `BoundaryServer::recv_async(...)`
- `BoundaryServer::recv_timeout_async(...)`
- `BoundaryTicket::wait_timeout(...)`
- Kept blocking methods (`call`, `recv`, `recv_timeout`, `wait_timeout_blocking`) as compatibility wrappers.
- Address-resolution split:
- Added explicit non-DNS socket-address APIs (usage sketch after this list):
- `net::TcpStream::connect_socket_addr(...)`
- `net::TcpStream::connect_socket_addr_round_robin(...)`
- `net::TcpStream::connect_many_socket_addr_round_robin(...)`
- `net::TcpStream::connect_many_socket_addr_with_session_policy(...)`
- `net::TcpStream::connect_socket_addr_with_session_policy(...)`
- `net::TcpListener::bind_socket_addr(...)`
- Kept hostname-based APIs as compatibility wrappers around a clearly named resolver path (`resolve_first_socket_addr_blocking`).
- Runtime-control async variants:
- Added async runtime-entry/control APIs:
- `run_async(...)`
- `run_with_async(...)`
- `Runtime::shutdown_async(...)`
- Kept sync entry/control APIs (`run`, `run_with`, `shutdown`) for compatibility/ergonomics.
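Hedged usage sketch of the address-resolution split: DNS stays an explicit, clearly blocking step outside the data plane, and only a resolved `SocketAddr` enters the native connect path. The exact `connect_socket_addr` signature (async, returning `std::io::Result<TcpStream>`) is an assumption.

```rust
use std::net::{SocketAddr, ToSocketAddrs};
use spargio::net::TcpStream;

async fn connect_split(host: &str, port: u16) -> std::io::Result<TcpStream> {
    // Resolve up front with std (blocking, intentionally outside the io_uring data plane).
    let addr: SocketAddr = (host, port)
        .to_socket_addrs()?
        .next()
        .ok_or_else(|| std::io::Error::new(std::io::ErrorKind::NotFound, "no address"))?;
    // Non-DNS, nonblocking connect path; signature assumed as described above.
    TcpStream::connect_socket_addr(addr).await
}
```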
Red tests added:
- `tests/boundary_tdd.rs`
- `boundary_async_call_and_recv_round_trip`
- `boundary_async_recv_timeout_reports_timeout`
- `boundary_ticket_wait_timeout_async_reports_timeout`
- `tests/runtime_tdd.rs`
- `run_async_helper_executes_top_level_future`
- `run_with_async_applies_custom_builder`
- `runtime_shutdown_async_is_idempotent`
- `tests/ergonomics_tdd.rs`
- `net_tcp_stream_connect_socket_addr_supports_read_write_all`
- `net_tcp_listener_bind_socket_addr_accepts_and_wraps_stream`
- `tests/uring_native_tdd.rs`
- `uring_native_unbound_sleep_uses_timeout_path`
Green + validation:
- `cargo fmt`
- `cargo test --features "uring-native macros" --test boundary_tdd --test runtime_tdd --test ergonomics_tdd --test uring_native_tdd`
- `cargo test --features "uring-native macros"`
Acceptance checklist status:
- [x] No data-plane helper-thread blocking waits in io_uring mode.
- [x] `sleep` uses native timeout path when io_uring backend is active on shard context.
- [x] Boundary APIs have async-first equivalents covering current usage.
- [x] Hostname resolution path is explicitly isolated from native data plane.
- [x] README/implementation log reflect which blocking APIs are compatibility-only vs removed.
## Update: removed public sync compatibility wrappers; async APIs are canonical (Red/Green TDD)
Rationale:
- Crate is not yet published; this is the lowest-risk point to make the API async-first and remove blocking wrapper surfaces.
What changed:
- Runtime entry/control API cleanup:
- `run` is now async (`run(...).await`).
- `run_with` is now async (`run_with(builder, ...).await`).
- Removed public `run_async` and `run_with_async` aliases.
- `Runtime::shutdown` is now async.
- Removed public sync `Runtime::shutdown`; retained internal blocking shutdown path only for `Drop`.
- Boundary API cleanup:
- `BoundaryClient::call` and `call_with_timeout` are async-first.
- `BoundaryServer::recv` and `recv_timeout` are async-first.
- `BoundaryTicket::wait_timeout` remains async.
- Removed sync compatibility wrappers:
- `BoundaryTicket::wait_timeout_blocking`
- sync `BoundaryServer::recv`/`recv_timeout` wrappers
- sync `BoundaryClient::call`/`call_with_timeout` wrappers
- Macro compatibility after async rename:
- `#[spargio::main]` now uses a hidden `spargio::__private::block_on(...)` helper to invoke async `run_with(...)` from generated sync `main` (rough expansion sketch after this list).
- Examples/tests updated to new async API names:
- boundary TDD switched to async call/recv/timeout paths.
- runtime TDD switched to async `run`/`run_with`/`shutdown` usage.
- `examples/network_work_stealing.rs` updated to async `run_with(...).await`.
- `examples/mixed_mode_service.rs` updated for async boundary call path.
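Rough shape of the generated entry point after the rename (a sketch only; the exact `__private` helper and `run_with` argument form are assumptions):

```rust
// #[spargio::main]
// async fn main() { /* body */ }
//
// expands to approximately:
fn main() {
    spargio::__private::block_on(spargio::run_with(
        spargio::RuntimeBuilder::default(),
        async move {
            /* user body */
        },
    ));
}
```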
Validation:
- `cargo test --features "uring-native macros"`
- `cargo bench --features uring-native --no-run`
Result:
- Full test suite and benchmark target compilation pass after the async-first API break.
## Update: rotating-hotspot slowdown investigation plan (Tokio vs Spargio)
Question captured:
- Why are `net_stream_hotspot_rotation_4k` and `net_pipeline_hotspot_rotation_4k_window32` still faster on Tokio?
Current code-path findings:
- Both hotspot groups already use distributed stream setup in Spargio (`SpargioNetHarness::new_distributed()`), so this is not the earlier single-context concentration issue.
- Spargio hotspot stream path uses `send_all_batch + recv_multishot_segments (+ fallback read_exact_owned)`; Tokio uses simpler `write_all + read_exact` loops.
- Spargio pipeline hotspot path currently uses `write_all_owned/read_exact_owned` per frame and spawns per-stream jobs with generic `spawn_stealable`, not session-aligned placement.
- Native op submission still pays envelope/oneshot/tracking overhead per op when execution is off the stream session shard.
Working hypotheses for the current gap:
1. Placement mismatch in rotating-hotspot loops:
- per-stream tasks can execute off-session-shard (`spawn_stealable`), adding submit/reply overhead without enough skew persistence to amortize stealing wins.
2. Pipeline I/O method overhead:
- `write_all_owned/read_exact_owned` path has extra owned-buffer/method overhead in tight per-frame loops.
3. Multishot path may be suboptimal for this specific rotating shape:
- for short rotating bursts, multishot setup/segment handling may underperform simple exact-read loops.
4. Benchmark harness overhead differences:
- Tokio path uses a very lean inner loop and may currently benefit from less per-op user-space bookkeeping in this shape.
### Planned A/B matrix
A/B-1: task placement (both hotspot benchmarks)
- A: current `spawn_stealable`.
- B: `stream.spawn_stealable_on_session(...)`.
- C: `stream.spawn_on_session(...)`.
A/B-2: pipeline I/O method
- A: current `write_all_owned/read_exact_owned`.
- B: borrowed `write_all/read_exact` with reusable buffers.
A/B-3: stream-hotspot receive mode
- A: current multishot-first path.
- B: force read-exact path.
Execution plan:
1. Add experimental A/B benchmark lanes (net experiments target), no product-table changes yet.
2. Run targeted A/B for both hotspot benchmarks.
3. Implement only the winning changes into the main benchmark/runtime paths.
4. Keep TDD discipline: add failing tests for any API/runtime behavior changes, then implement to green.
## Update: rotating-hotspot A/B results + adopted optimizations
Executed the planned A/B matrix in `benches/net_experiments.rs`:
- `exp_net_stream_hotspot_rotation_ab_4k`
- `exp_net_pipeline_hotspot_rotation_ab_4k_window32`
Command set:
- `cargo bench --features uring-native --bench net_experiments -- exp_net_stream_hotspot_rotation_ab_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_experiments -- exp_net_pipeline_hotspot_rotation_ab_4k_window32 --sample-size 12`
### A/B findings
`exp_net_stream_hotspot_rotation_ab_4k`:
- `tokio_hotspot_rotation`: `8.7424-8.8669 ms`
- `spargio_hotspot_stealable_multishot`: `11.667-11.801 ms`
- `spargio_hotspot_stealable_session_multishot`: `11.705-11.967 ms`
- `spargio_hotspot_pinned_multishot`: `9.8044-9.9619 ms`
- `spargio_hotspot_pinned_readexact`: `9.5227-9.5928 ms`
Interpretation:
- Session-pinned placement is the main gain for this shape.
- For rotating hotspot stream-only traffic, read-exact outperforms multishot.
- Stealable-session-preferred did not beat pinned here.
`exp_net_pipeline_hotspot_rotation_ab_4k_window32`:
- `tokio_pipeline_hotspot`: `26.473-26.678 ms`
- `spargio_pipeline_stealable_owned`: `32.167-32.563 ms`
- `spargio_pipeline_stealable_session_owned`: `32.356-32.844 ms`
- `spargio_pipeline_pinned_owned`: `29.618-30.016 ms`
- `spargio_pipeline_pinned_borrowed`: `30.080-30.247 ms`
Interpretation:
- Session-pinned placement is again the primary improvement.
- Owned I/O loop stays slightly better than borrowed mode in this pipeline shape.
### Optimizations implemented from A/B
Applied to product benchmark path (`benches/net_api.rs`):
1. `net_stream_hotspot_rotation_4k`:
- per-stream work now runs with `stream.spawn_on_session(...)` (session-pinned placement).
- receive mode switched to read-exact for this rotating stream-hotspot workload.
2. `net_pipeline_hotspot_rotation_4k_window32`:
- per-stream work now runs with `stream.spawn_on_session(...)` (session-pinned placement).
- kept owned I/O loop (`write_all_owned/read_exact_owned`) as the better A/B mode.
3. Kept existing defaults unchanged where A/B did not indicate improvement:
- throughput/imbalanced hot path remains multishot-first.
- generic stealable placement remains for non-hotspot benchmark paths.
### Post-optimization benchmark snapshots (`net_api`)
Commands:
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`
Results:
- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.6989-8.7937 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.5875-9.8201 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.782-17.053 ms`
- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.328-26.504 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `29.411-29.919 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.787-51.425 ms`
Net effect vs prior `net_api` snapshots:
- Stream rotating-hotspot: Spargio improved materially (about 14-16% faster) and moved closer to Tokio.
- Pipeline rotating-hotspot: Spargio improved materially (about 8-11% faster) and moved closer to Tokio.
- Both workloads still trail Tokio, but the remaining gap is substantially smaller than before.
## Update: implemented next hotspot optimizations (Red/Green TDD)
Follow-up optimizations implemented from the latest hotspot analysis:
1. Remove extra owned-buffer read/write overhead in stream loops.
2. Add a tighter same-shard native-op fast path for session-stream ops.
### Red phase
Added failing test in `tests/ergonomics_tdd.rs`:
- `net_tcp_stream_spawn_on_session_uses_local_direct_native_fastpath`
Initial failure:
- compile-time red because `RuntimeStats` had no `native_any_local_direct_submitted` field.
### Green phase
Implemented:
- New runtime stat:
- `RuntimeStats::native_any_local_direct_submitted`
- tracked in `RuntimeStatsInner` and surfaced via `stats_snapshot()`.
- Session-stream local direct path:
- in `UringNativeAny::{recv_owned_at_on_shard, send_owned_at_on_shard}`, when running on the same runtime+shard context:
- enqueue `LocalCommand::SubmitNative{Recv,Send}Owned` directly
- increment `native_any_local_direct_submitted`
- avoid `NativeAnyCommand -> LocalCommand` conversion path
- Offset-based native send/recv plumbing:
- added `offset` to `NativeAnyCommand::{RecvOwned, SendOwned}`
- added `offset` to `LocalCommand::{SubmitNativeRecvOwned, SubmitNativeSendOwned}`
- io_uring driver now submits `Recv/Send` against `buf[offset..]` without cloning/splitting buffers.
- Stream owned I/O loop rewrites:
- `TcpStream::write_all_owned` now advances using `send_owned_from(buf, offset)` (no fallback `send(&buf[sent..])` cloning path).
- `TcpStream::read_exact_owned` now advances using `recv_owned_from(dst, offset)` (no `read_exact` scratch/copy path).
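Shape of the rewritten owned write loop (a sketch; the exact `send_owned_from` return type is an assumption, modeled here as bytes written plus the returned buffer so ownership round-trips through the op):

```rust
use spargio::net::TcpStream;

// Sketch only: advances an offset into the same owned buffer instead of
// cloning `buf[sent..]` for each partial write.
async fn write_all_owned_sketch(stream: &TcpStream, mut buf: Vec<u8>) -> std::io::Result<Vec<u8>> {
    let mut sent = 0usize;
    while sent < buf.len() {
        // assumed shape: submits `buf[sent..]`, returns (bytes_written, buf)
        let (n, returned) = stream.send_owned_from(buf, sent).await?;
        buf = returned;
        if n == 0 {
            return Err(std::io::Error::new(std::io::ErrorKind::WriteZero, "write zero"));
        }
        sent += n;
    }
    Ok(buf)
}
```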
Validation:
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --tests`
### Post-change benchmark snapshot
Commands:
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`
Results:
- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.7900-8.8664 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.3389-9.4787 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.661-16.845 ms`
- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.322-26.549 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `28.933-29.121 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `51.323-52.073 ms`
Effect:
- Additional improvement in both rotating-hotspot benchmarks.
- Remaining gap to Tokio narrowed again (now roughly 5-10%, depending on which interval bounds are compared).
## Update: local direct native replies now avoid oneshot allocation (Red/Green TDD)
Completed the in-progress local fast-path refactor so same-runtime same-shard
`recv_owned/send_owned` submissions do not allocate/use a oneshot channel.
### Green implementation details
- Added `NativeBufReply::{Oneshot, Local}` and `NativeBufReply::complete(...)`.
- Added local waiter pair:
- `NativeBufReply::local_pair()`
- `NativeLocalBufReplySlot` + `NativeLocalBufReplyFuture`
- Wired local-direct branch in:
- `UringNativeAny::recv_owned_at_on_shard`
- `UringNativeAny::send_owned_at_on_shard`
to use the local waiter/future instead of oneshot.
- Updated io_uring native recv/send submit/completion paths to use
`NativeBufReply` uniformly.
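For reference, the local waiter pair boils down to a same-thread slot plus a future that polls it, which is what replaces the oneshot allocation. A simplified standalone sketch (the runtime's actual types carry buffers and richer results, not a bare `i32`):

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, Waker};

// Same-thread reply slot: the completion path fills it, the future polls it.
// No channel allocation and no cross-thread synchronization.
#[derive(Default)]
struct LocalReplySlot {
    result: Option<i32>,
    waker: Option<Waker>,
}

struct LocalReplyFuture(Rc<RefCell<LocalReplySlot>>);

impl Future for LocalReplyFuture {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        let mut slot = self.0.borrow_mut();
        match slot.result.take() {
            Some(res) => Poll::Ready(res),
            None => {
                slot.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

fn local_pair() -> (Rc<RefCell<LocalReplySlot>>, LocalReplyFuture) {
    let slot = Rc::new(RefCell::new(LocalReplySlot::default()));
    (slot.clone(), LocalReplyFuture(slot))
}

// Completion side (runs on the same shard thread as the waiter).
fn complete(slot: &RefCell<LocalReplySlot>, res: i32) {
    let mut slot = slot.borrow_mut();
    slot.result = Some(res);
    if let Some(w) = slot.waker.take() {
        w.wake();
    }
}
```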
Validation:
- `cargo check --features uring-native`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --tests`
### Post-change hotspot benchmark snapshot
Commands:
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`
Results:
- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.6940-8.8212 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.3020-9.4073 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.681-16.812 ms`
- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.286-26.560 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `29.025-29.574 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.614-50.986 ms`
Effect:
- Refactor is functionally complete and fully green.
- This specific change is mostly neutral on these two benchmark shapes
(small movement within run-to-run noise).
## Update: keyed-hotspot benchmark follow-up (event-queue/msg path optimization backlog)
Context:
- Added `net_keyed_hotspot_rotation_4k` in `benches/net_api.rs` to stress
rotating hotspot network I/O plus keyed cross-shard dispatch.
- Current snapshot (`--sample-size 12`):
- `tokio_tcp_keyed_router_hotspot`: `9.2375-9.3226 ms`
- `spargio_tcp_keyed_router_hotspot`: `10.061-10.254 ms`
- Interpretation: Tokio is still faster on this shape; the remaining overhead likely comes from
per-message payload queueing, doorbell signaling, and event queue handling in
Spargio’s ring-msg path.
Planned optimization ideas (highest ROI first):
1. Batch payload enqueue under one lock (high ROI, low risk)
- Problem: `SubmitRingMsgBatch` currently loops through per-message submit calls.
- Cost: lock/unlock and per-item queue overhead in `enqueue_payload` for each msg.
- Plan:
- add a true backend/io_uring batch enqueue path:
- one queue lock
- append all payloads
- one doorbell when queue transitions empty -> non-empty.
- Expected impact: reduce keyed-hotspot dispatch overhead materially.
2. Batch `EventState` delivery (high ROI, low-medium risk)
- Problem: `drain_payload_queue` pushes one event at a time, each with lock+wake.
- Plan:
- add `EventState::push_many(...)`
- queue drained ring-msg events in one critical section
- wake waiters once per drained batch.
- Expected impact: lower owner-side event ingestion overhead.
3. Lower synchronization cost in `EventState` (medium ROI, medium risk)
- Problem: current queue uses mutex-protected `VecDeque` and per-push wake path.
- Plan options:
- switch to lighter mutex implementation (e.g. `parking_lot`)
- split producer-consumer queue/waker paths to reduce contention.
- Expected impact: lower overhead for high ring-msg event rates.
4. Fast path for hot internal ring-msg tags (medium ROI, medium-high risk)
- Problem: hot dispatch tags share same generic `EventState` path as all events.
- Plan:
- route selected internal tags to dedicated per-shard mailboxes
- keep `next_event()` for general API compatibility
- use msg_ring as wake/doorbell only for these hot lanes.
- Expected impact: better keyed-router style throughput under hotspot churn.
5. Direct msg payload mode for tiny control messages (exploratory, medium-high risk)
- Problem: payload-queue + doorbell indirection adds overhead for tiny values.
- Plan:
- where semantics allow, encode tiny payloads directly in `MSG_RING` CQEs
(skip intermediate payload queue).
- Expected impact: reduced dispatch overhead for control-heavy micro-messages.
Validation plan for each change:
- Re-run:
- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`
- Track regression guardrails on:
- `net_stream_throughput_4k_window32`
- `net_stream_imbalanced_4k_hot1_light7`
## Update: keyed-hotspot optimization pass (batching complete, lock-free payload A/B reverted)
Implemented in this pass:
1. `SubmitRingMsgBatch` now uses a true backend batch path
- `ShardBackend::submit_ring_msg_batch(...)` submits one batch call.
- `IoUringDriver::submit_ring_msg_batch(...)` enqueues in one queue lock section,
sends at most one doorbell for empty->non-empty transitions, and accounts
partial acceptance/backpressure once per batch.
2. Event ingress now batches queue+wake
- Added `EventState::push_many(...)` and used it from:
- io_uring CQE ring-msg reap path
- payload-queue drain path
- `ring_msgs_completed` accounting now aggregates by batch where applicable.
3. Lowered `EventState` synchronization overhead
- Replaced mutex-protected event queue with `crossbeam_queue::SegQueue<Event>`.
- Kept waiter registration under a small mutex (`Vec<Waker>`).
- `push/push_many` now perform lock-free queue push and only lock to drain waiters.
4. Ran a lock-free payload-queue A/B and reverted it
- Experiment: replaced per-target/per-source payload queues with bounded
`ArrayQueue`.
- Outcome:
- no keyed-hotspot improvement
- rotating-stream hotspot regressed
- Decision: reverted payload-queue `ArrayQueue` experiment; retained
event-queue synchronization changes above.
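Sketch of the resulting event-ingress shape (a simplified standalone version; `Event` is a stand-in for the runtime's event type):

```rust
use crossbeam_queue::SegQueue;
use std::sync::Mutex;
use std::task::Waker;

struct Event; // stand-in for the runtime's event type

// Lock-free queue pushes, with the small waiter mutex touched once per batch.
struct EventState {
    queue: SegQueue<Event>,
    waiters: Mutex<Vec<Waker>>,
}

impl EventState {
    fn push_many<I: IntoIterator<Item = Event>>(&self, events: I) {
        let mut pushed = 0usize;
        for ev in events {
            self.queue.push(ev); // lock-free enqueue
            pushed += 1;
        }
        if pushed > 0 {
            // one lock + one wake pass per ingested batch, not per event
            for waker in self.waiters.lock().unwrap().drain(..) {
                waker.wake();
            }
        }
    }
}
```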
Validation:
- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`
Benchmarks (post-revert baseline, `--sample-size 12`):
- `net_keyed_hotspot_rotation_4k/tokio_tcp_keyed_router_hotspot`: `9.3457-9.3879 ms`
- `net_keyed_hotspot_rotation_4k/spargio_tcp_keyed_router_hotspot`: `10.008-10.062 ms`
- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.8285-8.9134 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.3247-9.5191 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.668-16.808 ms`
- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.305-26.569 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `29.010-29.400 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.682-51.536 ms`
Interpretation:
- Batching and event-ingress improvements are in place and stable.
- The main remaining keyed-hotspot gap does not come from payload-queue lock granularity.
- Highest-ROI remaining ideas are:
- hot-tag/internal mailbox fast path
- direct tiny-control-message `MSG_RING` payload mode (selective bypass of doorbell queue)
## Update: direct `MSG_RING` control API (opt-in) + validation
Implemented:
- Added opt-in direct message APIs that bypass the payload queue/doorbell path:
- `RemoteShard::send_raw_direct_nowait(...)`
- `RemoteShard::send_many_raw_direct_nowait(...)`
- `ShardCtx::send_raw_direct_nowait(...)`
- `ShardCtx::send_many_raw_direct_nowait(...)`
- Runtime wiring:
- new local command `SubmitRingMsgDirectBatch`
- backend handler `submit_ring_msg_direct_batch(...)`
- io_uring submit path `submit_ring_msg_direct_nowait(...)` (one `MSG_RING` SQE per message)
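For reference, the per-message SQE built by the direct path is essentially a `MsgRingData` entry targeting the destination shard's ring fd. A minimal standalone sketch, assuming the `io-uring` crate's `MsgRingData` constructor shape (the runtime submits this on the source shard ring and interprets the CQE on the target side):

```rust
use io_uring::{opcode, types, IoUring};

// Post one value directly into another ring's completion queue via MSG_RING.
// `target_ring_fd` is the raw fd of the destination shard's io_uring instance.
fn send_direct(src: &mut IoUring, target_ring_fd: i32, value: i32, user_data: u64)
    -> std::io::Result<()>
{
    let sqe = opcode::MsgRingData::new(
        types::Fd(target_ring_fd),
        value,     // shows up as `result` in the target ring's CQE
        user_data, // shows up as `user_data` in the target ring's CQE
        None,      // optional CQE flags for the target
    )
    .build();
    unsafe { src.submission().push(&sqe).expect("submission queue full") };
    src.submit()?;
    Ok(())
}
```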
Red/Green tests added:
- `send_raw_direct_nowait_delivers_event`
- `send_many_raw_direct_nowait_delivers_in_order`
Validation:
- `cargo check --features uring-native`
- `cargo test --features uring-native --test runtime_tdd`
- `cargo test --features uring-native --tests`
Notes:
- This direct path is intentionally opt-in and currently best suited for low-volume,
tiny control messages.
- Attempting to swap keyed-hotspot benchmark traffic to direct mode increased runtime
significantly (high per-message SQE overhead under that specific load), so benchmark
default was reverted to the stable batched payload-queue path.
Post-change benchmark sanity snapshot:
- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
- `tokio_tcp_keyed_router_hotspot`: `9.2793-9.3288 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.9952-10.249 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
- `tokio_tcp_8streams_rotating_hotspot`: `8.7510-8.8628 ms`
- `spargio_tcp_8streams_rotating_hotspot`: `9.3289-9.6232 ms`
- `compio_tcp_8streams_rotating_hotspot`: `16.771-16.908 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
- `tokio_tcp_pipeline_hotspot`: `26.193-26.447 ms`
- `spargio_tcp_pipeline_hotspot`: `28.856-28.982 ms`
- `compio_tcp_pipeline_hotspot`: `50.464-51.058 ms`
## Update: hot-tag mailbox lane (msg routing fast path) for keyed dispatch
Implemented:
- Runtime builder hot-tag routing configuration:
- `RuntimeBuilder::hot_msg_tag(tag)`
- `RuntimeBuilder::hot_msg_tags(iter)`
- Added dedicated shard-local hot event lane:
- `ShardCtx::next_hot_event()`
- internal `hot_event_state` alongside regular `event_state`
- Routed incoming ring messages by tag at ingestion time:
- io_uring CQE ring-msg path
- payload-queue drain path
- external `InjectRawMessage` path
- Keyed benchmark wiring:
- benchmark runtime now enables hot tags for `KEYED_DISPATCH_TAG`/`KEYED_STOP_TAG`
- keyed owner tasks consume via `next_hot_event()`
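Hedged usage sketch of the hot-tag wiring (tag values are hypothetical; builder/consume method shapes beyond the names above are assumptions):

```rust
// Hypothetical tag values for illustration.
const KEYED_DISPATCH_TAG: u16 = 7;
const KEYED_STOP_TAG: u16 = 8;

fn build_keyed_runtime() -> spargio::RuntimeBuilder {
    spargio::RuntimeBuilder::default()
        // route these tags to the dedicated per-shard hot lane at ingest time
        .hot_msg_tags([KEYED_DISPATCH_TAG, KEYED_STOP_TAG])
}

// On the owning shard task:
//   let ev = ctx.next_hot_event().await;  // hot-tagged Event::RingMsg only
//   // non-hot tags keep arriving via ctx.next_event()
```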
Red/Green TDD:
- Added tests:
- `hot_msg_tag_routes_to_hot_event_lane`
- `non_hot_msg_tag_remains_on_regular_event_lane`
- Existing direct-message tests retained and passing.
Validation:
- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`
Benchmark snapshot after this change:
- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
- `tokio_tcp_keyed_router_hotspot`: `9.4113-9.5537 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.9657-10.005 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
- `tokio_tcp_8streams_rotating_hotspot`: `8.6508-8.7692 ms`
- `spargio_tcp_8streams_rotating_hotspot`: `9.4165-9.5420 ms`
- `compio_tcp_8streams_rotating_hotspot`: `16.692-16.835 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
- `tokio_tcp_pipeline_hotspot`: `26.336-26.504 ms`
- `spargio_tcp_pipeline_hotspot`: `29.244-29.392 ms`
- `compio_tcp_pipeline_hotspot`: `50.869-51.357 ms`
Interpretation:
- Hot-tag lane is now functional and benchmarked.
- Keyed hotspot remains close to prior best range but still behind Tokio.
- Next likely high-ROI step remains value-coalescing for hot dispatch tags
(aggregate frequent tiny hot-tag increments before queueing/wake).
## Update: coalesced-hot-tag ingestion (batch value aggregation)
Implemented:
- Added explicit coalesced-hot-tag config:
- `RuntimeBuilder::coalesced_hot_msg_tag(tag)`
- `RuntimeBuilder::coalesced_hot_msg_tags(iter)`
- Coalesced tags are automatically treated as hot tags.
- Extended ring-msg ingest path to coalesce same `(from, tag)` values within each
ingest batch before queueing hot events:
- io_uring CQE ring-msg batch
- payload-queue drain batch
- coalescing emits one or more `Event::RingMsg` with summed `val`
(chunked safely if sum exceeds `u32::MAX`).
- Keyed benchmark harness now enables:
- hot tags: `KEYED_DISPATCH_TAG`, `KEYED_STOP_TAG`
- coalesced hot tag: `KEYED_DISPATCH_TAG`
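Sketch of the per-batch coalescing plus overflow chunking described above (standalone; the `(from, tag, val)` tuples are stand-ins for the runtime's ring-msg shape, and ordering across distinct `(from, tag)` pairs is not preserved in this simplified form):

```rust
use std::collections::HashMap;

// Coalesce same (from, tag) values within one ingest batch; emit one or more
// u32 values whose sum equals the aggregate, chunking if it exceeds u32::MAX.
fn coalesce_batch(batch: &[(usize, u16, u32)]) -> Vec<(usize, u16, u32)> {
    // sum per (from, tag) in u64 so the aggregate cannot overflow
    let mut sums: HashMap<(usize, u16), u64> = HashMap::new();
    for &(from, tag, val) in batch {
        *sums.entry((from, tag)).or_insert(0) += u64::from(val);
    }
    let mut out = Vec::new();
    for ((from, tag), mut total) in sums {
        while total > u64::from(u32::MAX) {
            out.push((from, tag, u32::MAX));
            total -= u64::from(u32::MAX);
        }
        out.push((from, tag, total as u32));
    }
    out
}
```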
Red/Green TDD:
- Added tests:
- `coalesced_hot_msg_tag_aggregates_batch_values`
- `non_coalesced_hot_msg_tag_preserves_batch_events`
- Existing hot-lane tests retained and passing.
Validation:
- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`
Benchmark snapshot after coalescing:
- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
- `tokio_tcp_keyed_router_hotspot`: `9.3593-9.4503 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.8008-10.002 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
- `tokio_tcp_8streams_rotating_hotspot`: `8.7586-8.8332 ms`
- `spargio_tcp_8streams_rotating_hotspot`: `9.4692-9.6138 ms`
- `compio_tcp_8streams_rotating_hotspot`: `16.851-17.197 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
- `tokio_tcp_pipeline_hotspot`: `26.303-26.520 ms`
- `spargio_tcp_pipeline_hotspot`: `29.011-29.267 ms`
- `compio_tcp_pipeline_hotspot`: `50.880-51.315 ms`
Interpretation:
- Coalescing improved the keyed-hotspot path modestly and safely, with no material
regression on the stream/pipeline guardrails.
- Remaining keyed-hotspot gap appears to come from broader per-event control-path
overhead, not just duplicate dispatch-value churn.
## Update: enqueue-time coalescing for coalesced-hot tags (queue-pressure reduction)
Implemented:
- `IoUringDriver` now carries coalesced-hot-tag lookup and applies it while
writing payload queues (not only at ingest time).
- For coalesced-hot tags, enqueue path now merges with the queue tail when
`(tail.tag == tag)`, including safe overflow chunking.
- This allows tight-capacity queues to absorb bursty tiny dispatch increments
without immediate backpressure.
Red/Green TDD:
- Added `coalesced_hot_tag_absorbs_batch_under_tight_queue_capacity`:
- runtime with `msg_ring_queue_capacity(1)`
- coalesced hot tag burst `(59,1),(59,2),(59,3)`
- verifies success and single hot event with `val=6`
- Full suite remains green.
Validation:
- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`
Benchmark snapshot after enqueue-time coalescing:
- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
- `tokio_tcp_keyed_router_hotspot`: `9.3417-9.4771 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.5432-9.6410 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
- `tokio_tcp_8streams_rotating_hotspot`: `8.7407-8.8063 ms`
- `spargio_tcp_8streams_rotating_hotspot`: `9.3352-9.4076 ms`
- `compio_tcp_8streams_rotating_hotspot`: `16.536-16.814 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
- `tokio_tcp_pipeline_hotspot`: `26.361-26.744 ms`
- `spargio_tcp_pipeline_hotspot`: `29.060-29.326 ms`
- `compio_tcp_pipeline_hotspot`: `50.503-51.418 ms`
Interpretation:
- Keyed-hotspot improved materially again; this slice appears higher ROI than
ingest-only coalescing.
- Stream/pipeline guardrails remained stable.
## Update: completed remaining keyed-hotspot optimization slices (counter lane + adaptive wake policy)
Completed slices:
1. Cross-batch hot-counter accumulation
- Coalesced hot tags are now aggregated into shard-local counters (a `u16` tag -> `u64` count map)
instead of being emitted as per-message hot events.
- Aggregation persists across ingest batches and drains, not only within a single
batch callback.
2. Hot-counter consume fast path
- Added consume API:
- `ShardCtx::next_hot_count(tag) -> Future<Output = u64>`
- `ShardCtx::try_take_hot_count(tag) -> Option<u64>`
- Keyed benchmark owner path now consumes dispatch volume via `next_hot_count`
and only uses `next_hot_event` for stop/control tags.
- This removes event-object overhead for coalesced dispatch traffic.
3. Adaptive dispatch/wake policy + hardening
- Added tuning knob:
- `RuntimeBuilder::hot_counter_wake_threshold(u64)`
- Wake policy for waiting hot-counter consumers:
- wake on 0->nonzero transition
- or on crossing threshold from below.
- Added hardening tests:
- `coalesced_hot_count_accumulates_across_batches`
- `hot_counter_threshold_does_not_starve_first_update`
- existing coalescing/hot-lane tests retained.
- Kept benchmark gate reruns on:
- keyed hotspot (target KPI)
- stream hotspot (guardrail)
- pipeline hotspot (guardrail)
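The wake policy from slice 3 reduces to a small predicate over the previous and new counter values; a sketch (names illustrative):

```rust
// Wake a waiting hot-counter consumer when the counter first becomes nonzero,
// or when the aggregate crosses the configured threshold from below.
fn should_wake_hot_counter(prev: u64, added: u64, threshold: u64) -> bool {
    let new_total = prev + added;
    (prev == 0 && new_total > 0) || (prev < threshold && new_total >= threshold)
}
```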
Validation:
- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`
Benchmark gate snapshot (post-slices):
- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
- `tokio_tcp_keyed_router_hotspot`: `9.3712-9.4256 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.5867-9.7558 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
- `tokio_tcp_8streams_rotating_hotspot`: `8.7801-8.8376 ms`
- `spargio_tcp_8streams_rotating_hotspot`: `9.3909-9.4505 ms`
- `compio_tcp_8streams_rotating_hotspot`: `16.640-17.098 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
- `tokio_tcp_pipeline_hotspot`: `26.380-26.482 ms`
- `spargio_tcp_pipeline_hotspot`: `28.856-29.242 ms`
- `compio_tcp_pipeline_hotspot`: `50.770-51.273 ms`
Outcome:
- Remaining planned slices for this keyed-hotspot track are now implemented.
- Spargio is now very close to Tokio on keyed-hotspot in this harness, with stable
guardrails on other hotspot shapes.
## Update: keyed hotspot benchmark now includes Compio
Added `compio` variant to `net_keyed_hotspot_rotation_4k`:
- new bench case: `compio_tcp_keyed_router_hotspot`
- wired through `CompioNetCmd::EchoKeyedHotspot`, harness command handling, and
`compio_echo_keyed_hotspot_rotation(...)`.
Sanity run (`--sample-size 10`):
- `tokio_tcp_keyed_router_hotspot`: `9.2799-9.3554 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.5718-9.7460 ms`
- `compio_tcp_keyed_router_hotspot`: `16.652-16.712 ms`
## Update: full benchmark refresh + README sync (2026-02-27)
Ran the full benchmark suite with current `uring-native` implementation and
updated README benchmark tables/interpretation to match.
Commands:
- `cargo bench --features uring-native --bench ping_pong -- --sample-size 12`
- `cargo bench --features uring-native --bench fanout_fanin -- --sample-size 12`
- `cargo bench --features uring-native --bench fs_api -- --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- --sample-size 12`
Snapshot:
- Coordination (Tokio vs Spargio):
- `steady_ping_pong_rtt`: Tokio `1.4911-1.5024 ms`, Spargio `394.83-396.21 us`
- `steady_one_way_send_drain`: Tokio `68.607-70.859 us`, Spargio `49.232-50.110 us`
- `cold_start_ping_pong`: Tokio `553.31-561.83 us`, Spargio `284.23-287.50 us`
- `fanout_fanin_balanced`: Tokio `1.4534-1.4631 ms`, Spargio `1.3426-1.3480 ms`
- `fanout_fanin_skewed`: Tokio `2.4026-2.4220 ms`, Spargio `1.9979-2.0032 ms`
- Native API (Tokio vs Spargio vs Compio):
- `fs_read_rtt_4k`: Tokio `1.6174-1.6565 ms`, Spargio `1.0008-1.0188 ms`, Compio `1.4782-1.4978 ms`
- `fs_read_throughput_4k_qd32`: Tokio `7.8804-8.1672 ms`, Spargio `6.1570-6.2793 ms`, Compio `4.0877-5.0803 ms`
- `net_echo_rtt_256b`: Tokio `7.7462-7.9687 ms`, Spargio `5.4356-5.5084 ms`, Compio `6.4541-6.5632 ms`
- `net_stream_throughput_4k_window32`: Tokio `11.142-11.247 ms`, Spargio `10.745-10.813 ms`, Compio `7.0631-7.1570 ms`
- Imbalanced native API:
- `net_stream_imbalanced_4k_hot1_light7`: Tokio `13.584-13.799 ms`, Spargio `13.191-13.375 ms`, Compio `12.283-12.414 ms`
- `net_stream_hotspot_rotation_4k`: Tokio `8.7891-8.8560 ms`, Spargio `9.3683-9.4526 ms`, Compio `16.870-16.982 ms`
- `net_pipeline_hotspot_rotation_4k_window32`: Tokio `26.415-26.654 ms`, Spargio `29.113-29.517 ms`, Compio `50.648-51.210 ms`
- `net_keyed_hotspot_rotation_4k`: Tokio `9.3152-9.4912 ms`, Spargio `9.5691-9.7957 ms`, Compio `16.781-16.994 ms`
Interpretation updates reflected in README:
- Spargio retains clear lead on coordination-heavy and low-depth latency cases.
- Compio retains lead on sustained balanced stream throughput and static-hotspot imbalance.
- Tokio remains ahead in rotating-hotspot stream/pipeline; keyed routing is near parity.
## Note: do the network optimizations fit Spargio's value proposition?
Question:
- Do the network optimizations we added to close the Tokio gap actually make sense
for Spargio, and are they realistic for users to adopt?
Answer:
- Yes, primarily when they reduce cross-shard coordination cost (coalesced hot
tags, hot-counter fast path, adaptive wake policy, keyed ownership routing).
These directly support Spargio's core value proposition: efficient
`io_uring` + `msg_ring` work-stealing/steering under coordination-heavy load.
- These optimizations are most relevant for keyable/skewed multi-stream
workloads (tenant/session/partition keyed routing), where steering and
aggregation reduce dispatch overhead.
- They should remain opt-in tuning for advanced users. Default paths should
stay simple and semantically conservative when applications need per-message
event fidelity and straightforward observability.
Follow-up planned:
- Add user-facing documentation for these knobs (what each knob does, semantic
trade-offs, recommended workload shapes, and safe defaults), plus a short
tuning guide in README/docs.
## Update: flaky `uring-native` CI test fixed (2026-02-28)
Observed:
- CI run `22511780569` failed at `Cargo test (uring-native)` with exit code 101.
- Failure was intermittent and initially non-reproducible on a single local run.
Root cause:
- `coalesced_hot_count_accumulates_across_batches` in `tests/runtime_tdd.rs` had
a race in test logic.
- The receiver polled `try_take_hot_count(61)` in a loop and could consume the
first coalesced update (`3`) before the second batch (`+3`) arrived, causing
an occasional `left: 3, right: 6` assertion failure.
Fix:
- Made the test deterministic by introducing a non-coalesced barrier tag and
waiting for a barrier event before reading the hot counter.
- Updated the test to assert total hot count only after both sends are known to
have been delivered to the target shard.
Validation:
- `cargo test --features uring-native --test runtime_tdd coalesced_hot_count_accumulates_across_batches`
- 50x stress loop of that single test: all pass.
- `cargo test --features uring-native`: pass.
Outcome:
- Removed known flake in `uring-native` test suite.
- No runtime behavior change; this was a test synchronization fix.
## Update: Compio parity audit snapshot (2026-02-28)
Captured a focused feature-parity snapshot against current Compio docs and
our current public `spargio` surface, with emphasis on practical user-facing
gaps.
### I/O API breadth: present vs missing
Current Spargio public I/O surface:
- `fs`: `OpenOptions` + `File` with `open/create/from_std`, positional
`read_at`/`read_at_into`/`write_at`/`write_all_at`, `read_to_end_at`, `fsync`.
- `net`: TCP-only (`TcpStream`, `TcpListener`) including session-policy connect/accept,
owned buffer APIs, and multishot segment receive helpers.
- runtime-native unbound lane methods routed through `io_uring`.
Compared with Compio's documented surface, notable missing breadth in Spargio:
1. Filesystem path-level helpers and metadata APIs
- examples: `create_dir`, `create_dir_all`, `hard_link`, `metadata`,
`remove_dir`, `remove_file`, `rename`, `set_permissions`, `symlink`,
`symlink_metadata`, convenience `read`/`write`.
2. Broader network protocol/socket families
- UDP and Unix domain socket APIs (`UdpSocket`, `UnixListener`,
`UnixStream`, `UnixDatagram`) are not currently in Spargio public API.
3. Generic async I/O trait/adaptor layer
- no public Spargio equivalent to Compio `io` traits and adapters
(`AsyncRead`/`AsyncWrite` families, buffered wrappers, compat/framed utilities).
4. Higher-level transport/runtime-integrated modules
- no Spargio public modules corresponding to Compio optional
`process`/`signal`/`tls`/`ws`/`quic` ecosystem crates.
This aligns with existing README scope note:
- "Broader filesystem and network native-op surface ... not done yet."
### Core runtime parity: what is still missing in Spargio
Core runtime is functional and differentiated (shards, placement APIs,
work-stealing MVP, timers, cancellation/task group, boundary APIs), but gaps
remain versus broader runtime ecosystems:
1. Backend/platform breadth
- `BackendKind` is currently `IoUring` only.
2. Top-level `!Send` ergonomics
- public runtime handle spawn paths require `Send`; `!Send` execution is
currently available only via shard-local `ShardCtx::spawn_local(...)`.
3. Time/runtime utility breadth
- currently minimal top-level primitives (`sleep`, `timeout`) rather than a
fuller interval/deadline utility set.
4. Production hardening/tuning depth
- advanced stealing policy tuning and long-window hardening/observability are
still listed as pending in project docs.
Conclusion:
- Spargio currently has partial feature overlap with Compio for core
fs/tcp runtime workflows, but does not yet have Compio-level I/O breadth.
- Current project direction remains valid: keep differentiating on
cross-shard coordination + placement/stealing, while closing practical
fs/net/runtime-surface gaps incrementally.
## Update: `!Send` ergonomics slice (`run_local_on` + `spawn_local_on`) (2026-02-28)
Captured and implemented the proposal discussed in review:
- add a first-class local-entry helper that can run `!Send` futures on a chosen shard.
- add a handle-level construct-on-shard API so callers can build `!Send` futures
on target shard context without requiring a prior `ShardCtx` hop.
### Red phase
Added failing tests in `tests/runtime_tdd.rs`:
- `run_local_on_accepts_non_send_future`
- `runtime_handle_spawn_local_on_accepts_non_send_future`
Red failure signals:
- unresolved import: `spargio::run_local_on`
- missing method: `RuntimeHandle::spawn_local_on`
### Green phase
Implemented public APIs in `src/lib.rs`:
1. New top-level entry helper
- `run_local_on(builder, shard, entry)`
- signature accepts `entry: FnOnce(ShardCtx) -> Fut + Send`, with `Fut: Future + 'static`
(no `Send` bound on `Fut`), and `T: Send`.
2. New runtime-handle API
- `RuntimeHandle::spawn_local_on(shard, init)`
- same construct-on-shard shape and `!Send` future support.
3. Internal spawn path
- added `spawn_local_on_shared(...)`.
- implementation routes through existing shard command channel (`Command::Spawn`)
and, on the target shard, constructs the future using live `ShardCtx`,
then executes it via `ctx.spawn_local(...)`.
Design notes:
- No new scheduler lane or command type was required.
- `!Send` is enabled by constructing the future on the shard and running it via
local spawner; cross-thread transfer only carries the `Send` initializer closure.
- Return type remains `JoinHandle<T>` with `T: Send` for cross-thread join safety.
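Standalone sketch of the construct-on-shard pattern from the design notes: only the `Send` initializer closure crosses threads, and the `!Send` future is built and driven on the target thread (a plain thread plus `futures::executor::block_on` stands in for the shard and its local executor):

```rust
use std::rc::Rc;

fn spawn_local_on_sketch<F, Fut, T>(init: F) -> std::thread::JoinHandle<T>
where
    F: FnOnce() -> Fut + Send + 'static,
    Fut: std::future::Future<Output = T> + 'static, // no Send bound on the future
    T: Send + 'static,
{
    std::thread::spawn(move || {
        let fut = init(); // constructed on the target thread; may capture Rc etc.
        futures::executor::block_on(fut)
    })
}

fn main() {
    let handle = spawn_local_on_sketch(|| async {
        let local = Rc::new(41u32); // !Send state never leaves this thread
        *local + 1
    });
    assert_eq!(handle.join().unwrap(), 42);
}
```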
### Validation
Commands run:
- `cargo test --features uring-native --test runtime_tdd run_local_on_accepts_non_send_future`
- `cargo test --features uring-native --test runtime_tdd runtime_handle_spawn_local_on_accepts_non_send_future`
- `cargo test --features uring-native --test runtime_tdd`
Result:
- both new tests pass.
- full `runtime_tdd` suite passes (`24 passed`).
### Outcome
- Spargio now supports a direct top-level local entry and handle-level local
spawn path for `!Send` futures, reducing friction for shard-local state
patterns (`Rc`, `RefCell`, etc.) while preserving existing shard-safety model.
## Update: low-level unsafe native extension API slice (2026-02-28)
Recorded proposal and implemented it in this slice:
- add a low-level unsafe extension lane so external crates can submit custom
SQE/CQE workflows without editing Spargio core for each new operation.
- keep high-level fs/net APIs safe and unchanged; isolate risk in explicit
unsafe extension entry points.
### Red phase
Added new tests in `tests/uring_native_tdd.rs` for extension use-cases:
- `uring_native_unbound_unsafe_extension_supports_custom_nop`
- `uring_native_unbound_unsafe_extension_supports_custom_read_entry`
These encode the intended external-writer workflow:
- provide extension-owned state
- build a custom SQE from that state
- decode CQE into a typed result
### Green phase
Implemented low-level unsafe API on `UringNativeAny`:
- `unsafe submit_unsafe(...)`
- `unsafe submit_unsafe_on_shard(...)`
Added new public completion type:
- `UringCqe { result, flags }`
Internal runtime wiring added:
- new internal native command variant carrying extension op envelopes
- extension op envelope retained in runtime until completion
- SQE built on target shard, user data overridden by runtime tracking key
- completion/failure paths return typed result through oneshot
- dispatch integrated with existing fast path / envelope path and affinity
violation guardrails
### Validation
Commands run:
- `cargo test --features uring-native --test uring_native_tdd uring_native_unbound_unsafe_extension_supports_custom_nop`
- `cargo test --features uring-native --test uring_native_tdd uring_native_unbound_unsafe_extension_supports_custom_read_entry`
- `cargo test --features uring-native --test runtime_tdd --test uring_native_tdd`
Result:
- new unsafe-extension tests pass.
- full `runtime_tdd` and `uring_native_tdd` suites pass.
### Docs sync
README updated to reflect completed status:
- added done bullets for:
- `!Send` ergonomics (`run_local_on`, `RuntimeHandle::spawn_local_on`)
- low-level unsafe extension API (`UringNativeAny::{submit_unsafe, submit_unsafe_on_shard}`)
- reviewed done/not-done sections and adjusted wording:
- "broader built-in fs/net surface" remains not done
- added safe-wrapper/cookbook work for unsafe extension API to not-done backlog
## Update: time/runtime utility parity comparison (Compio + monoio, io_uring fit adjusted) (2026-02-28)
Revised the time/runtime parity recommendations to account for whether each gap
is:
- `Direct io_uring`: maps directly to io_uring operations.
- `Hybrid`: io_uring covers the wait/I/O path, while policy/scheduling/control
remains user-space runtime logic.
- `Not io_uring-native`: mostly scheduler/context/ergonomics API surface above
kernel I/O.
Context:
- This section is scoped to time/runtime utility APIs (not broader fs/net API
breadth).
- Spargio today already has: `sleep`, `timeout`, `run`, `run_with`,
`run_local_on`, `spawn_local_on`, cancellation token, and task group support.
### Compio parity gaps (time/runtime utility scope), io_uring fit, and recommendation
1. Absolute-deadline and interval timer APIs
- Missing in Spargio:
- `sleep_until`
- `timeout_at`
- `interval` / `interval_at`
- `Interval::tick`
- io_uring fit:
- `Direct io_uring`:
- `sleep_until` via timeout op on the native lane.
- `Hybrid`:
- `timeout_at` as composition over deadline timer + future race.
- interval/tick as runtime policy on top of timer primitives.
- Recommendation:
- Add.
- Priority:
- High.
- Rationale:
- Strong functional value and clear alignment with io_uring timer path.
2. Rich timer object controls
- Missing in Spargio:
- resettable/introspectable timer object shape (`deadline`/`reset`/
elapsed-style helpers).
- io_uring fit:
- `Hybrid` / mostly `Not io_uring-native` (API ergonomics and runtime timer
bookkeeping over timer ops).
- Recommendation:
- Add a minimal version later.
- Priority:
- Medium.
- Rationale:
- Useful, but secondary to shipping base deadline/interval primitives.
3. `spawn_blocking` bridge
- Missing in Spargio:
- explicit runtime blocking bridge API.
- io_uring fit:
- `Not io_uring-native` (thread-pool/runtime policy feature).
- Recommendation:
- Add with strict bounds and opt-in behavior.
- Priority:
- Medium-high.
- Rationale:
- Operationally important escape hatch, but not part of io_uring data path.
4. Runtime control surface (`run`/`poll`/`poll_with`/`current_timeout`)
- Missing in Spargio:
- explicit low-level runtime control API set comparable to Compio.
- io_uring fit:
- `Hybrid`:
- polling/timeout plumbing can map to io_uring waits, but API shape is
mostly scheduler-control surface.
- Recommendation:
- Do not add full stable parity surface now; keep internal or debugging use.
- Priority:
- Low.
- Rationale:
- Limited end-user value and higher misuse/maintenance risk.
5. Runtime context API (`enter`/current-runtime access)
- Missing in Spargio:
- explicit public context-enter/current-runtime model.
- io_uring fit:
- `Not io_uring-native` (TLS/context ergonomics).
- Recommendation:
- Defer.
- Priority:
- Low-medium.
- Rationale:
- Useful only for narrower extension patterns; easy to misuse if overexposed.
6. `attach(fd)`-style extension-author hook
- Missing in Spargio:
- public attach hook for custom high-level wrappers.
- io_uring fit:
- `Hybrid`:
- could map to registration/fixed-file strategy, but behavior and benefit
are workload-dependent.
- Recommendation:
- Defer for now.
- Priority:
- Low.
- Rationale:
- unsafe extension path already exists; add attach semantics only if measured
wrapper use-cases require it.
7. Builder knobs (`thread_affinity`, scheduler `event_interval`)
- Missing in Spargio:
- explicit builder options matching Compio naming/shape.
- io_uring fit:
- `Not io_uring-native` (scheduler/thread policy).
- Recommendation:
- Partial add, benchmark-gated.
- Priority:
- Medium.
- Rationale:
- Can help production tuning, but belongs to controlled runtime policy work.
### monoio parity gaps (time/runtime utility scope), io_uring fit, and recommendation
1. Absolute-deadline and interval timer APIs
- Missing in Spargio:
- `sleep_until`
- `timeout_at`
- `interval` / `interval_at`
- `Interval::tick`
- io_uring fit:
- same split as Compio analysis: direct timer op base + hybrid interval
policy layer.
- Recommendation:
- Add.
- Priority:
- High.
- Rationale:
- Core utility breadth with direct io_uring timer alignment.
2. Interval policy controls (`MissedTickBehavior`, interval metadata)
- Missing in Spargio:
- missed-tick policy controls and period inspection API.
- io_uring fit:
- `Not io_uring-native` (runtime policy semantics).
- Recommendation:
- Add later (after base interval API).
- Priority:
- Medium.
- Rationale:
- Valuable for precision semantics, but not required for first parity slice.
3. Resettable/introspectable `Sleep` object
- Missing in Spargio:
- `Sleep`-style object with `deadline` / `is_elapsed` / `reset`.
- io_uring fit:
- `Hybrid`:
- backed by timeout ops, but object semantics are runtime/user-space layer.
- Recommendation:
- Add later (minimal form).
- Priority:
- Medium.
- Rationale:
- Power-user utility; should follow stable base timer/deadline APIs.
4. `spawn_blocking` + blocking runtime configuration
- Missing in Spargio:
- blocking bridge and policy knobs.
- io_uring fit:
- `Not io_uring-native`.
- Recommendation:
- Add with constrained configuration.
- Priority:
- Medium-high.
- Rationale:
- Important operational bridge, but separate from io_uring core mechanics.
### Net decision summary (io_uring-aware)
Add now (direct io_uring base + essential hybrid policy):
- `sleep_until`
- `timeout_at`
- `interval` / `interval_at` / `tick` (minimal first version)
Add next (important, mostly non-kernel policy/runtime features):
- `spawn_blocking` with bounded/opt-in policy
- limited affinity tuning in builder
Add later (power-user timer ergonomics):
- interval missed-tick behavior controls
- resettable/introspectable timer object (`Sleep`-style surface)
Defer/avoid for now:
- broad public low-level runtime polling/control API parity
- explicit runtime context enter/current-runtime API
- `attach(fd)` hook unless concrete, benchmark-backed wrapper demand emerges
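As a concrete note on the `Hybrid` classification of `timeout_at` above, a minimal composition sketch over the proposed `sleep_until` primitive (names and shapes are assumptions until the API lands; the race itself is plain user-space composition):

```rust
use futures::future::{select, Either};
use std::future::Future;
use std::time::Instant;

pub struct Elapsed;

// `timeout_at` as a race between the wrapped future and an absolute-deadline
// timer backed by the io_uring timeout path (`sleep_until` assumed).
pub async fn timeout_at<F: Future>(deadline: Instant, fut: F) -> Result<F::Output, Elapsed> {
    futures::pin_mut!(fut);
    let sleep = spargio::sleep_until(deadline);
    futures::pin_mut!(sleep);
    match select(fut, sleep).await {
        Either::Left((out, _)) => Ok(out),
        Either::Right(_) => Err(Elapsed),
    }
}
```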
## Update: I/O surface parity comparison (Compio + monoio, io_uring fit adjusted) (2026-02-28)
Revised the I/O parity recommendations to explicitly account for whether each
gap is:
- `Direct io_uring`: has a direct opcode path in current `io-uring` crate.
- `Hybrid`: hot path can use io_uring, but setup/orchestration still uses
regular syscalls or user-space composition.
- `Not io_uring-native`: mostly trait/adaptor/protocol surface above kernel I/O.
Context:
- This section is scoped to I/O API surface (fs/net/io traits/utilities), not
timer/runtime utilities.
- Spargio today has:
- `fs::File` + `OpenOptions` and positional file ops (`read_at`, `write_at`,
`read_to_end_at`, `fsync`).
- `net::TcpStream` and `net::TcpListener` (session-policy aware APIs).
- unbound unsafe extension lane for custom raw io_uring operations.
### Compio parity gaps (I/O surface scope), io_uring fit, and recommendation
1. Filesystem path-level helpers and metadata/perms utility breadth
- Missing in Spargio:
- path-level helpers like `create_dir`, `create_dir_all`, `remove_file`,
`remove_dir`, `rename`, convenience `read`/`write`, and broader metadata/
permissions/symlink/hard-link helpers.
- io_uring fit:
- `Direct io_uring` candidates:
- `create_dir` (`MkDirAt`)
- `remove_file` / `remove_dir` (`UnlinkAt`)
- `rename` (`RenameAt`)
- metadata (`Statx`)
- symlink/hard-link (`SymlinkAt` / `LinkAt`)
- convenience `read`/`write` composed from `OpenAt/OpenAt2 + Read/Write + Close`
- `Hybrid` candidates:
- `create_dir_all` (userspace recursion + repeated mkdir op)
- richer convenience wrappers (`read_to_string`, recursive utilities)
- some permissions/canonicalization helpers that may require syscall or
userspace fallback paths depending on kernel support
- Recommendation:
- Add now for direct-op helpers.
- Add later for hybrid helpers.
- Priority:
- High for direct helpers; Medium for hybrid helpers.
- Rationale:
- This adds high-utility API breadth while staying aligned with Spargio's
io_uring-first performance model.
2. Network protocol/socket family breadth
- Missing in Spargio:
- `UdpSocket`
- Unix domain sockets (`UnixStream`, `UnixListener`, `UnixDatagram`)
- io_uring fit:
- `Direct io_uring` hot path:
- `Socket`, `Accept`, `Connect`, `Send`, `Recv`, `SendMsg`, `RecvMsg`,
`Shutdown`
- `Hybrid` setup/control path:
- socket options, bind/listen, DNS/address resolution, feature probing
- Recommendation:
- Add.
- Priority:
- High for UDP; Medium-high for Unix sockets.
- Rationale:
- Strong fit for io_uring data path and large practical adoption win beyond
TCP-only coverage.
3. Generic async I/O trait + adapter layer
- Missing in Spargio:
- Compio-style traits/extensions (`AsyncRead*` / `AsyncWrite*`) and common
adapters/utilities (`split`, buffered wrappers, framing/compat layers).
- io_uring fit:
- `Not io_uring-native` (user-space abstraction layer).
- Recommendation:
- Add, but as companion crate(s), not in core runtime crate.
- Priority:
- Medium.
- Rationale:
- Important ergonomics/interoperability value, but no kernel-path
differentiation and substantial maintenance surface.
4. Optional higher-level transport/integration modules
- Missing in Spargio:
- Compio optional module breadth (`process`, `signal`, `tls`, `ws`, `quic`).
- io_uring fit:
- Mostly `Not io_uring-native` as runtime-level feature sets; some pieces
may use io_uring underneath but are not core io_uring API-surface gaps.
- Recommendation:
- Defer in core; pursue as ecosystem crates after core fs/net/io parity
baseline is complete.
- Priority:
- Low.
- Rationale:
- Broad scope with weaker direct alignment to immediate io_uring runtime
differentiation.
### monoio parity gaps (I/O surface scope), io_uring fit, and recommendation
1. Filesystem path-level helper breadth
- Missing in Spargio:
- monoio-style helpers (`read`, `write`, `create_dir`, `create_dir_all`,
`remove_file`, `remove_dir`, `rename`) and metadata conveniences.
- io_uring fit:
- same split as above: direct-op coverage for core helpers, hybrid for
recursive/convenience wrappers.
- Recommendation:
- Add direct-op helpers now; phase in hybrid helpers later.
- Priority:
- High for direct-op helpers; Medium for hybrid helpers.
- Rationale:
- Baseline parity and migration ergonomics with strong io_uring alignment.
2. Network breadth beyond TCP
- Missing in Spargio:
- `UdpSocket`
- Unix domain socket APIs.
- io_uring fit:
- direct-op hot path with hybrid setup path, same as Compio analysis.
- Recommendation:
- Add.
- Priority:
- High for UDP; Medium-high for Unix sockets.
- Rationale:
- Real-world protocol coverage with clear io_uring throughput/latency fit.
3. I/O utility stack (traits + utility wrappers)
- Missing in Spargio:
- monoio-style utility stack (`copy`, split halves, buffered wrappers,
stream/sink adapters, cancelable helpers, zero-copy utility wrappers).
- io_uring fit:
- mostly `Not io_uring-native` (API composition layer).
- Recommendation:
- Add a practical subset after core direct-op I/O breadth lands; keep larger
utility surface outside core crate.
- Priority:
- Medium.
- Rationale:
- Good ergonomics payoff, but should follow direct io_uring-aligned API
expansion.
### Net decision summary (io_uring-aware)
Add now (direct io_uring or low-risk hybrid):
- path-level fs helpers that map cleanly to io_uring opcodes
(`create_dir`, `remove_file`, `remove_dir`, `rename`, metadata, basic `read`/`write`)
- UDP socket API
Add next (hybrid or non-kernel surface with strong usability gain):
- Unix domain socket API
- foundational I/O trait/extensions and core helpers (`split`, `copy`) in
companion crate(s)
Add later (mostly composition layers):
- recursive/richer fs convenience helpers (`create_dir_all`, broader wrappers)
- richer buffered/framed/compat layers
Defer/avoid in core for now:
- large optional integration surfaces (`process`, `signal`, `tls`, `ws`, `quic`)
until core io_uring-aligned fs/net parity goals are met
## Update: parity execution sweep (time/runtime + I/O breadth) with red/green TDD (2026-02-28)
Executed the requested implementation sweep for all previously marked
`add now`, `add next`, and `add later` items in the time/runtime and I/O parity
sections, then validated with full `uring-native` test pass.
### Red phase
Added failing tests first:
1. Time/runtime primitives (`tests/primitives_tdd.rs`)
- `sleep_until_waits_for_deadline`
- `timeout_at_returns_err_when_deadline_expires`
- `interval_ticks_with_configurable_missed_tick_behavior`
- `interval_at_uses_requested_start_deadline`
- `sleep_object_supports_deadline_reset_and_elapsed_state`
- `runtime_handle_spawn_blocking_executes_closure`
2. Runtime builder tuning (`tests/runtime_tdd.rs`)
- `runtime_builder_thread_affinity_option_builds_runtime`
3. I/O breadth (`tests/ergonomics_tdd.rs`)
- `fs_path_helpers_cover_common_workflows`
- `fs_link_helpers_support_symlink_and_hard_link`
- `net_udp_socket_supports_send_recv_and_send_to_recv_from`
- `net_unix_stream_listener_and_datagram_cover_core_paths`
- `io_helpers_split_copy_and_framed_work`
Red failures were expected:
- unresolved time/runtime symbols (`sleep_until`, `timeout_at`, `interval*`,
`Sleep`, `MissedTickBehavior`, `spawn_blocking`, `thread_affinity`).
- unresolved I/O symbols (`fs` path helpers, `UdpSocket`, `Unix*`, `io` module).
### Green phase
Implemented in `src/lib.rs`:
1. Time/runtime utility breadth
- Added:
- `sleep_until(Instant)`
- `timeout_at(Instant, fut)`
- `Sleep` (`new`, `until`, `deadline`, `is_elapsed`, `reset`, `Future`)
- `interval(period)`, `interval_at(start, period)`
- `Interval::tick`, `Interval::period`,
`Interval::{missed_tick_behavior,set_missed_tick_behavior}`
- `MissedTickBehavior::{Burst, Delay, Skip}`
2. Runtime utilities/tuning
- Added `RuntimeHandle::spawn_blocking(...) -> Result<JoinHandle<_>, RuntimeError>`.
- Added `RuntimeBuilder::thread_affinity(...)`.
- Wired per-shard thread affinity application during shard thread startup
(best-effort, Linux `sched_setaffinity`).
3. Filesystem API breadth
- Added path-level async helpers in `spargio::fs`:
- `create_dir`, `create_dir_all`, `remove_file`, `remove_dir`, `rename`
- `hard_link`, `symlink`
- `metadata`, `symlink_metadata`, `set_permissions`, `canonicalize`
- convenience `read`, `read_to_string`, `write`
- Added internal blocking bridge helper in fs module using
`RuntimeHandle::spawn_blocking`.
4. Network API breadth
- Added `spargio::net::UdpSocket`:
- `bind`, `from_std`, `local_addr`, `connect`
- `send`, `recv`, `send_to`, `recv_from`
- Added `spargio::net::UnixStream`:
- `connect`, `connect_with_session_policy`, `from_std`
- `send`/`recv`, owned buffer variants, `write_all`/`read_exact`
- Added `spargio::net::UnixListener`:
- `bind`, `from_std`, `local_addr`, `accept`
- Added `spargio::net::UnixDatagram`:
- `bind`, `from_std`, `local_addr`, `connect`
- `send`, `recv`, `send_to`, `recv_from`
5. Foundational I/O utility layer
- Added `spargio::io` module:
- traits: `AsyncRead`, `AsyncWrite` + extension traits
- `split(...)` with `ReadHalf` / `WriteHalf`
- `copy_to_vec(...)`
- lightweight wrappers: `BufReader`, `BufWriter`
- framed helper: `io::framed::LengthDelimited::{new, write_frame, read_frame}`
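The framed helper above uses a simple length-delimited convention: a fixed-size length prefix followed by that many payload bytes. A minimal blocking-I/O sketch of that wire format follows; the real spargio helper is async and its type names differ, so treat the names here as illustrative only.

```rust
use std::io::{self, Read, Write};

// Minimal sketch of a length-delimited frame codec: each frame is a 4-byte
// big-endian length prefix followed by that many payload bytes. Illustrative
// only; the actual spargio helper is async and shaped differently.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    let len = u32::try_from(payload.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "frame too large"))?;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)
}

fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```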
### Validation
Executed and passing:
- `cargo test --features uring-native --test primitives_tdd`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --test runtime_tdd --test uring_native_tdd`
- `cargo test --features uring-native`
Result:
- full `uring-native` test suite passes after the parity sweep.
## Proposal: syscall migration to io_uring for fs path helpers (2026-02-28)
Goal:
- Remove remaining helper-thread `spawn_blocking(std::fs::...)` usage from the
high-value `spargio::fs` path APIs where direct io_uring opcodes exist.
- Keep low-value/hard cases as compatibility paths for now.
Proposed execution model:
1. Add direct unbound native commands + opcodes for path operations:
- `MkDirAt` (`create_dir`)
- `UnlinkAt` (`remove_file`, `remove_dir` via `AT_REMOVEDIR`)
- `RenameAt` (`rename`)
- `LinkAt` (`hard_link`)
- `SymlinkAt` (`symlink`)
2. Migrate corresponding `spargio::fs` helpers to native io_uring submission.
3. Keep these deferred as compatibility wrappers:
- `create_dir_all`:
- recursive user-space semantics and error behavior matching require extra
traversal/orchestration logic; not a single direct opcode operation.
- `canonicalize`:
- path-resolution semantics are better handled by libc/kernel resolver
paths; no direct single-op parity target in current surface.
- `metadata`, `symlink_metadata`, `set_permissions`:
- current public return/argument types are std wrappers
(`std::fs::Metadata` / `Permissions`) not directly constructible from
raw `statx` payloads without additional compatibility syscall layers.
4. Keep red/green TDD workflow:
- add failing native fs-op tests first,
- implement op plumbing + fs helper migration,
- run targeted tests then full `cargo test --features uring-native`.
Acceptance criteria:
- No helper-thread path for: `create_dir`, `remove_file`, `remove_dir`,
`rename`, `hard_link`, `symlink`.
- Deferred items remain clearly documented as compatibility paths.
- Full `uring-native` test suite remains green.
## Update: syscall migration to io_uring (fs path helpers) implemented (Red/Green TDD) (2026-02-28)
Implemented the proposal slice for direct-op fs path helpers, with explicit
kernel-support fallback behavior for unsupported opcode errors.
### Red phase
Added failing tests first in `tests/uring_native_tdd.rs`:
- `uring_native_unbound_fs_path_ops_cover_mkdir_rename_link_symlink_and_unlink`
Observed expected red failure:
- compile errors for missing `UringNativeAny` methods:
- `mkdir_at`
- `unlink_at`
- `rename_at`
- `link_at`
- `symlink_at`
### Green phase
Implemented native io_uring path-op helpers on `UringNativeAny` (in `src/lib.rs`)
using the existing unsafe extension submission lane internally:
- `mkdir_at(path, mode)` -> `opcode::MkDirAt`
- `unlink_at(path, is_dir)` -> `opcode::UnlinkAt` (+ `AT_REMOVEDIR` for dirs)
- `rename_at(from, to)` -> `opcode::RenameAt`
- `link_at(original, link)` -> `opcode::LinkAt`
- `symlink_at(target, linkpath)` -> `opcode::SymlinkAt`
Then migrated high-level `spargio::fs` helpers to these native operations:
- `create_dir`
- `remove_file`
- `remove_dir`
- `rename`
- `hard_link`
- `symlink`
Compatibility behavior kept intentionally:
- For unsupported opcode errors (`EINVAL`, `ENOSYS`, `EOPNOTSUPP`), the above
high-level helpers transparently fall back to prior blocking helper-thread
implementations to preserve functionality on older kernels.
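A minimal sketch of that fallback check, shown synchronously for brevity (the real helpers are async) and assuming the native submission surfaces the kernel errno through `io::Error::raw_os_error()`; the helper names are illustrative:

```rust
use std::io;
use std::path::Path;

// Minimal sketch of the unsupported-opcode fallback: if the native io_uring
// submission reports EINVAL/ENOSYS/EOPNOTSUPP (e.g. an older kernel without
// the opcode), fall back to the blocking std implementation.
fn is_opcode_unsupported(err: &io::Error) -> bool {
    matches!(
        err.raw_os_error(),
        Some(libc::EINVAL) | Some(libc::ENOSYS) | Some(libc::EOPNOTSUPP)
    )
}

fn create_dir_with_fallback(path: &Path) -> io::Result<()> {
    match create_dir_native(path) {
        Err(e) if is_opcode_unsupported(&e) => std::fs::create_dir(path),
        other => other,
    }
}

fn create_dir_native(_path: &Path) -> io::Result<()> {
    // Placeholder for the native MkDirAt submission described above.
    Err(io::Error::from_raw_os_error(libc::ENOSYS))
}
```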
Deferred (unchanged, by proposal):
- `create_dir_all`
- `canonicalize`
- `metadata`
- `symlink_metadata`
- `set_permissions`
### Validation
Executed and passing:
- `cargo test --features uring-native --test uring_native_tdd`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native`
## Update: higher-level ecosystem parity check (Compio vs monoio) (2026-02-28)
Context:
- Follow-up comparison for higher-level "not done yet" surfaces:
`process`, `signal`, `tls`, `ws`, `quic`.
### Feature presence snapshot
1. Compio
- `process`:
- available through first-party Compio family surface.
- `signal`:
- available through first-party Compio family surface.
- `tls`:
- available through first-party Compio family surface (feature-gated).
- `ws`:
- available in Compio ecosystem integrations (not a minimal runtime-core primitive).
- `quic`:
- available in Compio ecosystem integrations (not a minimal runtime-core primitive).
- Assessment:
- broad coverage with strong first-party/feature-gated story.
2. monoio
- `process`:
- not a primary monoio-core built-in surface; typically external integration.
- `signal`:
- available in monoio core behind feature gating.
- `tls`:
- primarily ecosystem crates/integrations.
- `ws`:
- primarily ecosystem crate coverage.
- `quic`:
- primarily ecosystem crate coverage.
- Assessment:
- slim core with higher-level surfaces mostly delegated to ecosystem crates.
### Implication for Spargio
- Current Spargio direction (core runtime + io_uring-aligned fs/net/io breadth,
higher-level protocol/process surfaces deferred) is closer to monoio layering
than to Compio's broader first-party family.
- Recommendation remains:
- keep `process/signal/tls/ws/quic` out of `spargio` core for now;
- deliver these as extension/companion crates after core fs/net/io parity
and stability milestones;
- if one is pulled forward, `signal` is the lowest-risk first candidate.
### Recommendation tags
- `process`: add later via companion crate, not core now.
- `signal`: consider next in companion form; optional core later if justified.
- `tls`: companion crate target, not core.
- `ws`: companion crate target, not core.
- `quic`: companion crate target, not core.
## Update: prioritized roadmap as concrete milestones (2026-02-28)
Converted the current roadmap direction into execution milestones with explicit
acceptance criteria.
### Milestone M1: production hardening + observability (highest priority)
Scope:
- Add stress/soak/failure-injection coverage for scheduler, boundary, and native
fs/net paths.
- Expand runtime observability for queue depth, steal rates, in-flight native op
counts, and timeout/cancellation outcomes.
- Add long-window p95/p99 benchmark tracking and guardrails.
Acceptance criteria:
- New stress/failure suites pass under `uring-native` in CI/nightly runs.
- Benchmark guardrail workflow reports p50/p95/p99 for key suites and enforces
no-regression thresholds.
- At least one documented regression triage loop exists (capture -> compare ->
bisect -> fix).
### Milestone M2: safe extension API wrappers + cookbook
Scope:
- Define and publish safe wrapper patterns for common unsafe native extension
use-cases (ownership, lifetime, affinity, cancellation, fallback strategy).
- Add cookbook-quality examples for custom opcode submission and validation.
Acceptance criteria:
- Cookbook/examples compile and test in CI.
- At least one end-to-end extension example avoids direct user-facing unsafe
blocks outside the wrapper boundary.
- Invariants/checklist for extension authors are documented and versioned.
### Milestone M3: docs and API-selection guidance
Scope:
- Expand docs for API selection (`fs/net/io` helpers vs native ops), placement
policy choice, and benchmark interpretation.
- Add migration notes for users coming from Tokio/Compio/monoio surfaces.
- Stand up an in-repo `mdBook` as the long-form documentation home.
- Keep root `README.md` content/length stable for now, and add book links once
the initial book structure is published.
Acceptance criteria:
- mdBook skeleton and initial chapters are in-repo and build in CI.
- `README.md` remains concise/current and links to the book after publish.
- README + guide pages clearly map common tasks to preferred APIs.
- Placement and latency/throughput tradeoffs are documented with concrete
examples.
- Benchmark methodology is reproducible from documented commands.
### Milestone M4: measured core refinements (only clear-ROI changes)
Scope:
- Evaluate deferred fs helper migration items (`create_dir_all`,
`canonicalize`, `metadata`, `symlink_metadata`, `set_permissions`) only when
there is measured benefit.
- Tune work-stealing heuristics based on M1 telemetry, not ad-hoc changes.
Acceptance criteria:
- Each migration/tuning change ships with before/after benchmark data and no
correctness regressions.
- No low-value complexity is added for cases with no measurable user impact.
- `cargo test`, `cargo test --features uring-native`, and benchmark guardrails
remain green.
### Milestone M5: higher-level ecosystem parity via companion crates
Scope:
- Deliver higher-level surfaces outside core in this order:
1) `signal` companion crate
2) `tls` / `ws` / `quic` integrations
3) `process` companion crate
- Implement companion crates as workspace subcrates in this repository (shared
CI, tests, versioning, and release flow).
- Keep core focused on runtime + io_uring-aligned fs/net/io fundamentals.
Acceptance criteria:
- Companion crates have docs, tests, and minimal examples.
- Companion crates are wired as workspace members and participate in standard
workspace CI checks.
- Core crate API remains stable/lean and does not absorb large optional stacks.
- Integration ergonomics are comparable to current core APIs for common use.
### Milestone M6: optional readiness-emulation track (deprioritized backlog)
Scope:
- Explicitly deprioritized for now.
- Reconsider optional Tokio-compat readiness shim (`IORING_OP_POLL_ADD`) only
after M1-M5 are stable and after concrete demand is demonstrated.
Acceptance criteria (this track is not planned in the current execution window;
the criteria below apply only if it is picked up later):
- Implemented behind explicit opt-in feature gate.
- Benchmark data shows practical value for targeted readiness-centric workloads.
- Does not regress default core paths or increase default runtime complexity.
## Update: Milestone M1 implemented (hardening + observability) with Red/Green TDD (2026-02-28)
Executed Milestone M1 scope with explicit red tests first, then implementation
and validation.
### Red phase
Added failing tests:
1. Boundary outcome observability (`tests/boundary_tdd.rs`)
- `boundary_stats_capture_timeout_cancel_and_overload`
- expected red failure:
- missing `boundary::stats_snapshot`
- missing `boundary::reset_stats_for_tests`.
2. Runtime stats helper observability (`tests/slices_tdd.rs`)
- extended `stats_snapshot_tracks_messages_and_spawns` to require:
- `RuntimeStats::total_command_depth`
- `RuntimeStats::max_command_depth`
- `RuntimeStats::max_pending_native_ops_by_shard`
- `RuntimeStats::steal_success_rate`
- expected red failure:
- unresolved methods on `RuntimeStats`.
3. Percentile guardrail tooling (`tests/bench_tail_guardrail_tdd.rs`)
- `percentile_guardrail_passes_for_fixture_profile`
- `percentile_guardrail_fails_when_threshold_is_too_strict`
- expected red failure:
- missing `scripts/bench_tail_guardrail.sh`.
### Green phase
Implemented:
1. Boundary outcome stats API in `spargio::boundary`
- new `BoundaryStats` snapshot struct:
- `overloaded`, `timed_out`, `canceled`, `closed`
- new APIs:
- `boundary::stats_snapshot()`
- `boundary::reset_stats_for_tests()`
- instrumented boundary paths to record outcomes:
- enqueue/try-enqueue overload and closed cases
- ticket wait timeout and recv timeout
- cancel paths (`respond` with dropped receiver, ticket poll canceled).
2. Runtime observability helper methods
- added on `RuntimeStats`:
- `total_command_depth()`
- `max_command_depth()`
- `max_pending_native_ops_by_shard()`
- `steal_success_rate()`.
3. p50/p95/p99 guardrail script
- added `scripts/bench_tail_guardrail.sh`
- consumes Criterion `sample.json`
- computes per-iteration p50/p95/p99
- enforces `MAX_P50_RATIO`, `MAX_P95_RATIO`, `MAX_P99_RATIO` (see the
percentile sketch after this list).
- integrated into `scripts/bench_kpi_guardrail.sh`.
- added fixture-backed tests under `tests/bench_tail_guardrail_tdd.rs`
and `tests/fixtures/criterion/...`.
4. Hardening/soak coverage and nightly execution
- added soak tests in `tests/stress_tdd.rs` (ignored by default):
- `soak_stealable_burst_completes_without_dropping_tasks`
- `soak_boundary_timeout_cancel_overload_paths_accumulate_stats`
- CI workflow updates in `.github/workflows/ci.yml`:
- added percentile guardrail steps
- added nightly scheduled trigger
- added `nightly-soak` job running ignored soak tests.
5. Regression triage loop documentation
- added `docs/perf_regression_triage.md` with capture/compare/bisect/fix loop.
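The tail guardrail in item 3 above reduces each run to p50/p95/p99 and compares candidate-versus-baseline ratios against the `MAX_P*_RATIO` thresholds. A minimal sketch of that computation (nearest-rank percentiles over per-iteration times, then a ratio check), assuming the per-iteration times have already been extracted from the Criterion output; the shell script does the same in bash:

```rust
// Minimal sketch of the tail guardrail math: nearest-rank percentile over a
// set of per-iteration times, then a candidate/baseline ratio check against a
// configured ceiling.
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn within_ratio(candidate: f64, baseline: f64, max_ratio: f64) -> bool {
    candidate <= baseline * max_ratio
}

fn main() {
    let mut baseline = vec![1.00, 1.02, 1.05, 1.10, 1.40];
    let mut candidate = vec![1.01, 1.03, 1.08, 1.15, 1.55];
    let (b99, c99) = (percentile(&mut baseline, 99.0), percentile(&mut candidate, 99.0));
    // Passes when the candidate p99 is within, e.g., 15% of the baseline p99.
    println!("p99 gate ok: {}", within_ratio(c99, b99, 1.15));
}
```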
### Validation
Executed and passing:
- `cargo test --test boundary_tdd --test slices_tdd --test bench_tail_guardrail_tdd`
- `cargo test --test stress_tdd -- --ignored`
## Update: Milestone M2 implemented (safe extension wrappers + cookbook) with Red/Green TDD (2026-02-28)
Executed Milestone M2 scope with red-first tests and wrapper/docs delivery.
### Red phase
Added failing test in `tests/uring_native_tdd.rs`:
- `uring_native_safe_extension_statx_wraps_unsafe_submission`
Expected red failure:
- unresolved safe wrapper API:
- `spargio::extension::fs::statx_on_shard`
- `spargio::extension::fs::StatxOptions`
- `spargio::extension::fs::statx_or_metadata`.
### Green phase
Implemented a first safe-wrapper extension surface in `src/lib.rs`:
- new module: `spargio::extension::fs`
- new safe APIs:
- `statx(native, path)`
- `statx_on_shard(native, shard, path, options)`
- `statx_or_metadata(handle, path)` (kernel-support fallback)
- new typed outputs/options:
- `StatxMetadata`
- `StatxOptions`
Implementation details:
- wrappers encapsulate all `unsafe` usage internally and keep extension state
owned until CQE completion (`CString` + output buffer in owned state struct).
- unsupported native-op errors (`EINVAL`/`ENOSYS`/`EOPNOTSUPP`) fall back to
blocking `std::fs::metadata` through `RuntimeHandle::spawn_blocking`.
- explicit-shard and selector-driven variants both provided.
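A minimal sketch of the ownership shape behind that rule: everything the kernel may read or write for the in-flight op lives in one owned state value whose buffers stay at stable addresses until the CQE is reaped. This is Linux-only (it uses `libc::statx`) and the names and fields are illustrative, not the actual spargio types.

```rust
use std::ffi::CString;

// Minimal sketch of the owned-state pattern used by the safe statx wrapper:
// the C path string and the statx output buffer are owned (and heap-pinned via
// Box) by one state value kept alive until the CQE for this submission has
// been reaped, so the raw pointers handed to the SQE stay valid.
struct StatxOpState {
    path: CString,
    out: Box<libc::statx>,
}

impl StatxOpState {
    fn new(path: &str) -> std::io::Result<Self> {
        let path = CString::new(path)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidInput, e))?;
        // Safety: statx is a plain-old-data struct; an all-zero value is a
        // valid "empty" output buffer for the kernel to fill.
        let out = Box::new(unsafe { std::mem::zeroed::<libc::statx>() });
        Ok(Self { path, out })
    }

    // Raw pointers for the SQE; valid for as long as `self` is kept alive.
    fn path_ptr(&self) -> *const libc::c_char {
        self.path.as_ptr()
    }

    fn out_ptr(&mut self) -> *mut libc::statx {
        &mut *self.out
    }
}
```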
Cookbook/docs/examples:
- added `docs/native_extension_cookbook.md`:
- ownership/lifetime pattern
- affinity pattern
- fallback pattern
- explicit safety checklist for extension authors.
- added example `examples/native_extension_statx.rs` showing end-to-end usage
with no user-facing `unsafe`.
### Validation
Executed and passing:
- `cargo test --features uring-native --test uring_native_tdd uring_native_safe_extension_statx_wraps_unsafe_submission`
- `cargo test --features uring-native --test uring_native_tdd`
## Update: Milestone M3 implemented (mdBook docs track) with Red/Green TDD (2026-02-28)
Executed Milestone M3 scope with red-first docs tests, then book scaffold and
CI integration. Root `README.md` content/length was intentionally left unchanged
per current decision.
### Red phase
Added failing tests in `tests/docs_tdd.rs`:
- `mdbook_scaffold_exists_with_summary`
- `mdbook_summary_links_resolve_to_existing_files`
Expected red failure:
- missing `book/book.toml`
- missing `book/src/SUMMARY.md` and chapter files.
### Green phase
Added in-repo mdBook scaffold:
- `book/book.toml`
- `book/src/SUMMARY.md`
- initial chapters:
- `introduction.md`
- `runtime_entry.md`
- `placement.md`
- `io_surface.md`
- `native_extensions.md`
- `benchmarking.md`
- `migration.md`
CI integration:
- `.github/workflows/ci.yml` now installs `mdbook` and runs:
- `mdbook build book`
### Validation
Executed and passing:
- `cargo test --test docs_tdd`
- `mdbook build book`
## Update: Milestone M4 implemented (measured core refinements) with Red/Green TDD (2026-02-28)
Executed a low-risk, measured refinement for one deferred fs area without
forcing full std-type migration complexity.
### Red phase
Extended `tests/ergonomics_tdd.rs` in:
- `fs_path_helpers_cover_common_workflows`
The new assertion required an API that did not yet exist:
- `spargio::fs::metadata_lite(...)`
Expected red failure:
- missing `metadata_lite` helper in `spargio::fs`.
### Green phase
Implemented in core:
1. New measured helper
- `spargio::fs::metadata_lite(handle, path)` in `src/lib.rs`.
- Returns `spargio::extension::fs::StatxMetadata`.
- Uses native-first safe wrapper (`statx_or_metadata`) with unsupported-op
fallback to blocking metadata path.
2. Benchmark instrumentation for ROI tracking
- added `fs_metadata_rtt` group in `benches/fs_api.rs`:
- `tokio_spawn_blocking_metadata`
- `spargio_metadata_lite`
- extended fs harnesses with metadata command path to keep benchmark setup
comparable with existing harness style.
### Measurement snapshot
Executed:
- `cargo bench --features uring-native --bench fs_api fs_metadata_rtt -- --warm-up-time 0.10 --measurement-time 0.10 --sample-size 20`
Observed:
- `fs_metadata_rtt/tokio_spawn_blocking_metadata`: `6.8858-7.2658 ms`
(`140.93-148.71 Kelem/s`)
- `fs_metadata_rtt/spargio_metadata_lite`: `4.6598-4.8596 ms`
(`210.72-219.75 Kelem/s`)
Interpretation:
- native-first `metadata_lite` shows a clear throughput and latency win in this
short-run metadata workload while preserving compatibility fallback.
- the prior decision to defer full std-wrapper migration for
`metadata`/`symlink_metadata`/`set_permissions` stands until a broader,
benchmark-backed conversion is justified.
### Validation
Executed and passing:
- `cargo test --features uring-native --test ergonomics_tdd fs_path_helpers_cover_common_workflows`
- `cargo bench --features uring-native --bench fs_api --no-run`
## Update: Milestone M5 implemented (companion crates as workspace subcrates) with Red/Green TDD (2026-02-28)
Executed Milestone M5 by wiring companion crates into the workspace and adding
initial tested APIs for `signal`, protocol integrations (`tls/ws/quic`
blocking bridges), and `process`.
### Red phase
Added failing workspace test in `tests/workspace_companions_tdd.rs`:
- `workspace_lists_companion_subcrates`
Expected red failure:
- root `Cargo.toml` had no `[workspace]`.
- companion crate paths were not present.
### Green phase
Workspace wiring:
- root `Cargo.toml` now defines workspace members:
- `.`
- `spargio-macros`
- `crates/spargio-signal`
- `crates/spargio-protocols`
- `crates/spargio-process`
Companion subcrates added:
1. `spargio-signal`
- API:
- `signal(...) -> SignalStream`
- `ctrl_c() -> SignalStream`
- `SignalStream::recv().await`
- implementation:
- `signal-hook` listener thread + async-facing receive loop (see the bridge
sketch after this list).
- tests:
- construction test
- raised-signal receive test.
2. `spargio-protocols`
- API:
- `tls_blocking(...)`
- `ws_blocking(...)`
- `quic_blocking(...)`
- implementation:
- explicit `RuntimeHandle::spawn_blocking` bridges for protocol ecosystem
integration points.
- tests:
- closure execution test across all three helpers.
3. `spargio-process`
- API:
- `status(handle, Command)`
- `output(handle, Command)`
- `CommandBuilder::{new,arg,args,status,output}`
- implementation:
- async process execution via runtime blocking bridge.
- tests:
- builder status path
- function status path.
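The `spargio-signal` bridge above (item 1) pairs a `signal-hook` iterator thread with a channel that the async-facing `SignalStream` drains. A minimal sketch of that thread-to-channel bridge, using a std channel for self-containment; the crate's actual channel type and receive surface differ:

```rust
use std::sync::mpsc;
use std::thread;

use signal_hook::consts::SIGUSR1;
use signal_hook::iterator::Signals;

// Minimal sketch of the signal listener bridge: a dedicated thread blocks on
// signal-hook's iterator and forwards raised signal numbers into a channel;
// the async side drains the receiving end. Illustrative only.
fn spawn_signal_listener(signals: &[i32]) -> std::io::Result<mpsc::Receiver<i32>> {
    let mut stream = Signals::new(signals)?;
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for sig in stream.forever() {
            if tx.send(sig).is_err() {
                break; // All receivers dropped; stop the listener thread.
            }
        }
    });
    Ok(rx)
}

fn main() -> std::io::Result<()> {
    let rx = spawn_signal_listener(&[SIGUSR1])?;
    // In the real crate the receive side is awaited; a non-blocking poll is
    // shown here just to exercise the channel.
    println!("listening; received so far: {:?}", rx.try_recv().ok());
    Ok(())
}
```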
Companion docs/examples:
- added crate-level docs and minimal example files under each companion crate.
Repository status docs:
- updated `README.md` done/not-done section to reflect:
- companion crates now present
- safe extension wrapper slice done
- `mdBook` scaffold done
- higher-level parity still maturing.
### Validation
Executed and passing:
- `cargo test --test workspace_companions_tdd`
- `cargo test --workspace`
- `cargo test --workspace --features uring-native`
## Update: execution breakdown to reach deep protocol adapters + polished APIs (2026-02-28)
Captured a concrete implementation plan for completing the remaining higher-level
ecosystem maturity work.
Bridge-first principle for these phases:
- prefer proven upstream protocol/runtime crates and build thin `spargio-*`
adapters around them.
- keep runtime integration value in `spargio` (timeouts/cancellation,
instrumentation, placement ergonomics), while avoiding protocol reimplementation
in core.
- keep protocol engines swappable behind stable companion-crate APIs.
### Phase 1: foundation layer (2-3 weeks)
Scope:
- freeze public API contracts for companion crates (`signal`, `process`, `tls`,
`ws`, `quic`).
- define shared error taxonomy and conversions.
- add/finish `spargio::io` compatibility adapters needed by protocol crates.
- standardize timeout/cancellation/close semantics across companion crates.
- decide and document upstream bridge backends:
- TLS: `rustls` + `futures-rustls`.
- WS: `async-tungstenite` as default path; optional high-performance path via
`fastwebsockets` where fit is proven.
- QUIC: `quinn` first.
- process: `async-process` bridge.
- signal: `signal-hook`/`async-signal` style bridge model.
Done criteria:
- RFC-style contract docs checked in.
- compile-tested API skeletons.
- conformance tests for shared semantics.
- backend-selection rationale and compatibility policy documented.
### Phase 2: TLS deep adapter (3-5 weeks)
Scope:
- add `spargio-tls` companion crate (thin wrapper over
`rustls`/`futures-rustls`, not a new TLS engine).
- implement connector/acceptor/stream APIs.
- implement handshake timeout/cancel semantics and ALPN/SNI config surface.
- add client/server interop tests.
Done criteria:
- stable TLS API for common client/server flows.
- interop + stress tests passing.
- cookbook/example coverage for common TLS service patterns.
### Phase 3: WebSocket deep adapter (2-4 weeks)
Scope:
- add `spargio-ws` companion crate.
- bridge to `async-tungstenite` first for broad interop; evaluate
`fastwebsockets` as an optional backend for high-throughput paths.
- implement handshake APIs for client/server.
- implement frame/message API (`text`, `binary`, `ping/pong`, close).
- add fragmentation/backpressure/size-limit controls.
Done criteria:
- interoperable ws client/server examples.
- conformance tests for close/ping/pong and framing paths.
- documented limits and backpressure behavior.
### Phase 4: QUIC deep adapter (6-12 weeks)
Scope:
- add `spargio-quic` companion crate (quinn-first path).
- keep transport/protocol core in `quinn`; `spargio-quic` provides runtime
integration and ergonomic API shaping.
- implement endpoint/connect/accept lifecycle APIs.
- implement uni/bi streams + datagram APIs.
- add config builder surface (timeouts/flow-control/congestion knobs).
Done criteria:
- stable endpoint/connection/stream/dgram APIs.
- interop/load tests passing.
- operational docs for tuning and shutdown semantics.
### Phase 5: process/signal maturity pass (2-3 weeks)
Scope:
- evolve current `spargio-process` and `spargio-signal` from minimal bridges to
richer production APIs.
- process path: solidify `async-process`-style bridge ergonomics with cancellation
and stdio behavior consistency.
- signal path: `signal-hook`/`async-signal` bridge with robust subscription and
shutdown semantics.
- process: lifecycle + stdio handling polish.
- signal: richer subscription ergonomics and graceful-shutdown recipes.
Done criteria:
- expanded APIs with tests for lifecycle/race cases.
- cookbook examples for service shutdown and child-process orchestration.
### Phase 6: hardening + operations (3-5 weeks, overlaps phases 2-5)
Scope:
- failure-injection, stress/soak suites for companion protocol paths.
- p50/p95/p99 guardrail expansion for protocol benchmarks.
- observability hooks and regression triage workflow maturity.
- upstream compatibility matrix in CI (selected backend versions) to catch
bridge drift early.
Done criteria:
- nightly/CI hardening lanes in place and stable.
- measurable long-window tail-latency tracking with gates.
### Phase 7: docs + polish (2-3 weeks, overlaps late phases)
Scope:
- expand mdBook protocol coverage and API selection guidance.
- migration docs and production checklists.
- semver/deprecation policy for companion crates.
- explicit "use direct upstream crate vs use `spargio-*` adapter" guidance for
each protocol domain.
Done criteria:
- publish-grade docs for all companion crates.
- clear migration paths and stability guarantees documented.
### Recommended sequencing
1. Foundation
2. TLS + WS in parallel
3. QUIC
4. process/signal maturity
5. hardening/docs finalization across all crates
### Effort estimate
- single engineer: ~4-6 months
- 2-3 engineers in parallel: ~8-12 weeks for a strong first production-grade cut
## Update: Phase 1 implemented (foundation contracts + semantics + io compatibility) with Red/Green TDD (2026-02-28)
Executed a concrete Phase 1 slice with red-first tests, then implementation and
green verification.
### Red tests added first
- `crates/spargio-protocols/tests/foundation_tdd.rs`
- `blocking_options_enforce_timeout`
- `futures_io_adapter_roundtrip_over_tcp_stream` (Linux + `uring-native`)
- `crates/spargio-process/tests/foundation_tdd.rs`
- `status_with_options_enforces_timeout`
Observed expected red state before implementation:
- unresolved imports/APIs:
- `BlockingOptions`, `tls_blocking_with_options`
- `CommandOptions`, `status_with_options`
- `io_compat::FuturesTcpStream`
### Implementation delivered
`spargio-protocols`:
- added `BlockingOptions` with optional timeout policy.
- added optioned API variants:
- `tls_blocking_with_options`
- `ws_blocking_with_options`
- `quic_blocking_with_options`
- kept existing `*_blocking` helpers as defaults over optioned APIs.
- standardized timeout semantics with `spargio::timeout(...)` -> `io::ErrorKind::TimedOut`.
- added Linux `uring-native` `futures::io` adapter:
- `io_compat::FuturesTcpStream` implements
`futures::io::{AsyncRead, AsyncWrite}` over `spargio::net::TcpStream`.
- added crate feature forwarding:
- `uring-native = ["spargio/uring-native"]`.
`spargio-process`:
- added `CommandOptions` with optional timeout policy.
- added optioned API variants:
- `status_with_options`
- `output_with_options`
- `CommandBuilder::{status_with_options, output_with_options}`
- kept existing `status`/`output` APIs delegating to default options.
- standardized timeout semantics with `spargio::timeout(...)` -> `io::ErrorKind::TimedOut`.
Contracts/docs:
- added `docs/companion_contracts.md` to capture baseline shared semantics for
companion crates (error mapping, cancellation, timeout, io compatibility).
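The standardized timeout semantics above reduce to one shape: if an optional deadline is configured, the operation future is raced against it and an elapsed deadline is surfaced as `io::ErrorKind::TimedOut`. A minimal sketch of that shape, shown with Tokio's `timeout` combinator so it stands alone; the companion crates use `spargio::timeout` in the same position:

```rust
use std::future::Future;
use std::io;
use std::time::Duration;

// Minimal sketch of the shared companion-crate timeout convention: race the
// operation against an optional deadline and map an elapsed deadline to
// io::ErrorKind::TimedOut. Illustrated with tokio::time::timeout; the real
// code uses spargio's timeout combinator in the same role.
async fn with_optional_timeout<T, F>(limit: Option<Duration>, op: F) -> io::Result<T>
where
    F: Future<Output = io::Result<T>>,
{
    match limit {
        None => op.await,
        Some(dur) => match tokio::time::timeout(dur, op).await {
            Ok(result) => result,
            Err(_elapsed) => Err(io::Error::new(io::ErrorKind::TimedOut, "operation timed out")),
        },
    }
}
```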
### Green validation
Executed and passing:
- `cargo test -p spargio-process --test foundation_tdd`
- `cargo test -p spargio-protocols --test foundation_tdd`
- `cargo test -p spargio-protocols --features uring-native --test foundation_tdd`
## Update: Phase 2 implemented (TLS deep adapter bridge) with Red/Green TDD (2026-02-28)
Executed a red-first TLS companion crate implementation over
`rustls` + `futures-rustls`.
### Red tests added first
- created new workspace crate: `crates/spargio-tls`
- added `crates/spargio-tls/tests/tls_tdd.rs` with:
- `tls_connector_connect_socket_addr_timeout_is_enforced`
- `tls_connector_and_acceptor_interop_roundtrip`
Observed expected red state:
- unresolved API imports in `spargio_tls`:
- `HandshakeOptions`
- `TlsConnector`
### Implementation delivered
Workspace wiring:
- added `crates/spargio-tls` to root workspace members.
- crate deps include:
- `futures-rustls`
- `rustls`
- `spargio` (`uring-native`)
- `spargio-protocols` (`uring-native` io adapter bridge)
Public API:
- handshake options:
- `HandshakeOptions` (optional timeout)
- connector/acceptor wrappers:
- `TlsConnector` with:
- `connect(...)`
- `connect_socket_addr(...)`
- `TlsAcceptor` with:
- `accept(...)`
- free functions:
- `connect`, `connect_with_options`
- `connect_socket_addr`, `connect_socket_addr_with_options`
- `accept`, `accept_with_options`
- stream aliases:
- `ClientTlsStream`
- `ServerTlsStream`
Semantics:
- TLS handshakes are timeout-governed via `spargio::timeout(...)`.
- timeout maps to `io::ErrorKind::TimedOut`.
- transport layer is a thin bridge over
`spargio-protocols::io_compat::FuturesTcpStream`.
Related compatibility improvement:
- implemented `Debug` for `spargio-protocols::io_compat::FuturesTcpStream`
to satisfy downstream stream debug bounds.
### Green validation
Executed and passing:
- `cargo test -p spargio-tls --test tls_tdd`
## Update: Phase 3 implemented (WebSocket deep adapter bridge) with Red/Green TDD (2026-02-28)
Executed a red-first WebSocket companion crate implementation over
`async-tungstenite`.
### Red tests added first
- created new workspace crate: `crates/spargio-ws`
- added `crates/spargio-ws/tests/ws_tdd.rs` with:
- `ws_client_connect_timeout_is_enforced`
- `ws_client_server_roundtrip_text_message`
Observed expected red state:
- unresolved API imports in `spargio_ws`:
- `WsOptions`
- `accept_with_options`
- `connect_socket_addr_with_options`
### Implementation delivered
Workspace wiring:
- added `crates/spargio-ws` to root workspace members.
- crate deps include:
- `async-tungstenite`
- `spargio` (`uring-native`)
- `spargio-protocols` (`uring-native` io adapter bridge)
Public API:
- options and wrappers:
- `WsOptions` (timeout + frame/message limit knobs)
- `WsConnector`
- `WsAcceptor`
- stream aliases:
- `WsStream`
- `WsResponse`
- functions:
- `connect`, `connect_with_options`
- `connect_socket_addr`, `connect_socket_addr_with_options`
- `accept`, `accept_with_options`
Semantics:
- handshake timeout enforced with `spargio::timeout(...)`.
- timeout maps to `io::ErrorKind::TimedOut`.
- tungstenite protocol errors map to `io::Error` for uniform bridge behavior.
- uses `spargio-protocols::io_compat::FuturesTcpStream` transport adapter.
### Green validation
Executed and passing:
- `cargo test -p spargio-ws --test ws_tdd`
## Update: Phase 4 implemented (QUIC companion bridge, quinn-first) with Red/Green TDD (2026-02-28)
Executed a red-first `quinn` companion bridge implementation focused on
runtime integration and execution semantics.
### Red tests added first
- created new workspace crate: `crates/spargio-quic`
- added `crates/spargio-quic/tests/quic_tdd.rs` with:
- `quic_bridge_runs_async_work`
- `quic_bridge_timeout_is_enforced`
Observed expected red state:
- unresolved API imports in `spargio_quic`:
- `QuicBridge`
- `QuicOptions`
### Implementation delivered
Workspace wiring:
- added `crates/spargio-quic` to root workspace members.
- crate deps include:
- `quinn`
- `tokio` (current-thread runtime execution lane)
- `spargio`
Public API:
- options and wrappers:
- `QuicOptions` (optional timeout)
- `QuicBridge`
- execution entrypoints:
- `run(...)`
- `run_with_options(...)`
- `QuicBridge::run(...)`
- `QuicBridge::with_endpoint(...)` (quinn endpoint lifecycle bridge helper)
- explicit re-export:
- `pub use quinn;`
Semantics:
- bridge executes async quinn workflows on a Tokio current-thread runtime built
inside `RuntimeHandle::spawn_blocking(...)`.
- timeout enforced via `spargio::timeout(...)` -> `io::ErrorKind::TimedOut`.
- runtime rejection/cancel mapped to `io::Error` consistently with companion
bridge behavior.
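A minimal sketch of that execution lane: the async quinn workflow is driven to completion on a current-thread Tokio runtime, and a body of this shape is what runs inside the blocking bridge (`RuntimeHandle::spawn_blocking`) in `spargio-quic`, with the timeout mapping applied around it:

```rust
use std::future::Future;
use std::io;

// Minimal sketch of the bridge execution lane: build a current-thread Tokio
// runtime inside the blocking task and drive the quinn workflow on it.
fn run_bridge_workflow<T, F>(workflow: F) -> io::Result<T>
where
    F: Future<Output = io::Result<T>>,
{
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()?;
    rt.block_on(workflow)
}
```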
### Green validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
## Update: Phase 5 implemented (process/signal maturity pass) with Red/Green TDD (2026-02-28)
Executed a red-first process/signal maturity pass expanding lifecycle and
subscription ergonomics.
### Red tests added first
Process:
- `crates/spargio-process/tests/maturity_tdd.rs`
- `command_builder_spawn_and_wait_lifecycle`
- `spawned_child_wait_timeout_is_enforced`
Signal:
- `crates/spargio-signal/tests/maturity_tdd.rs`
- `signal_hub_broadcasts_to_multiple_subscribers`
- `signal_stream_recv_timeout_returns_none`
- `ctrl_c_stream_still_constructs`
Observed expected red state:
- missing process APIs:
- `CommandBuilder::spawn`
- spawned child wait timeout APIs
- missing signal APIs:
- `SignalHub`
- `SignalStream::recv_timeout`
### Implementation delivered
`spargio-process` maturity additions:
- added spawn APIs:
- `spawn(...)`
- `spawn_with_options(...)`
- `CommandBuilder::{spawn, spawn_with_options}`
- added `ChildHandle` with lifecycle methods:
- `id()`
- `wait()`
- `wait_with_options(...)`
- `try_wait()`
- `kill()`
- `output()`
- `output_with_options(...)`
- all blocking process operations routed through shared timeout/cancel-aware
`run_blocking(...)` semantics.
`spargio-signal` maturity additions:
- introduced `SignalHub`:
- `SignalHub::new(...)`
- `SignalHub::subscribe()`
- `SignalStream` now supports:
- `recv()`
- `recv_timeout(...)`
- `recv_matching(...)`
- `try_recv()`
- `signal(...)` now composes via `SignalHub` + `subscribe`, preserving prior
API behavior while enabling broadcast-style subscriptions.
### Green validation
Executed and passing:
- `cargo test -p spargio-process --test maturity_tdd`
- `cargo test -p spargio-signal --test maturity_tdd`
- `cargo test -p spargio-process`
- `cargo test -p spargio-signal`
## Update: Phase 6 implemented (hardening + operations lanes) with Red/Green TDD (2026-02-28)
Executed an operations-focused hardening slice for companion crates and CI
coverage.
### Red tests added first
- added root test file: `tests/companion_ops_tdd.rs`
- `companion_ci_smoke_script_exists_and_targets_companion_crates`
- `ci_workflow_has_companion_matrix_lane`
Observed expected red state:
- missing `scripts/companion_ci_smoke.sh`
- missing `companion-matrix` CI job wiring in `.github/workflows/ci.yml`
### Implementation delivered
Companion smoke script:
- added `scripts/companion_ci_smoke.sh`:
- `cargo test -p spargio-protocols --features uring-native`
- `cargo test -p spargio-tls --test tls_tdd`
- `cargo test -p spargio-ws --test ws_tdd`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-process`
- `cargo test -p spargio-signal`
CI workflow hardening:
- added `companion-matrix` job in `.github/workflows/ci.yml` that runs:
- `./scripts/companion_ci_smoke.sh`
Operational intent:
- catch protocol bridge drift and companion crate regressions in a dedicated CI
lane, independent of core runtime test jobs.
### Green validation
Executed and passing:
- `cargo test --test companion_ops_tdd`
- `./scripts/companion_ci_smoke.sh`
## Update: Phase 7 implemented (docs + polish) with validation (2026-02-28)
Executed documentation/polish updates to reflect delivered companion-crate
work and provide explicit API-selection guidance.
### README updates
Updated done/not-done sections to match current implementation state:
- done:
- companion crate suite now explicitly includes:
- `spargio-process`
- `spargio-signal`
- `spargio-protocols` (legacy blocking bridge)
- `spargio-tls`
- `spargio-ws`
- `spargio-quic`
- companion hardening lane (`scripts/companion_ci_smoke.sh` + CI job).
- docs scaffold note updated to include protocol/API-selection coverage.
- not done:
- clarified remaining maturity gaps as advanced tuning/surface depth and
long-window operational hardening, not absence of companion crates.
Companion dependency polish:
- pinned `spargio-tls` rustls/futures-rustls features to a single crypto
provider (`ring`) to avoid workspace-wide TLS/QUIC provider ambiguity during
unified test runs.
### mdBook coverage expansion
Added new chapters:
- `book/src/companion_protocols.md`
- direct upstream vs `spargio-*` adapter selection guidance.
- explicit scope boundary (thin adapters, no protocol engine rewrite).
- `book/src/companion_stability.md`
- semver/deprecation baseline policy for companion crates.
- CI/operations expectations for compatibility maintenance.
Updated:
- `book/src/SUMMARY.md` links to new chapters.
- `book/src/migration.md` includes protocol companion migration guidance.
### Validation
Executed and passing:
- `cargo test --test docs_tdd`
- `mdbook build book`
## Update: QUIC final-form target and acceptance checklist (2026-03-01)
Decision recorded: favor the long-term QUIC integration shape based on a
native `quinn-proto` driver owned by Spargio, instead of a permanent Tokio
bridge path.
### Target architecture (long-term form)
- endpoint ownership is shard-affine and explicit (one owning execution
context per UDP socket/endpoint lifecycle).
- packet I/O is driven by Spargio runtime tasks with `io_uring` as preferred
backend where it is a clear win; retain fallback paths where kernel/platform
constraints require.
- timers, pacing, loss-recovery wakeups, and cancellation are mapped to
Spargio primitives (no embedded Tokio runtime per operation).
- high-level API is provided by `spargio-quic`; protocol core comes from
`quinn-proto` with Spargio-managed driver loops.
### Acceptance checklist
1. Runtime/driver correctness
- no `spawn_blocking + tokio::runtime::Builder` path in steady-state QUIC I/O.
- endpoint driver loop integrates send/recv/timer progression without busy spin.
- cancellation and drop semantics are deterministic for endpoint, connections,
streams, and datagrams.
2. API completeness
- endpoint lifecycle: bind, client connect, server accept, graceful close, and
draining shutdown.
- stream surface: open/accept uni + bi streams, ordered reads/writes, finish,
reset/stop semantics.
- datagram surface: send/recv with documented size/error behavior.
- configuration surface: TLS config, ALPN, transport tuning pass-through,
version-negotiation visibility.
3. Ergonomics and placement
- session-local (`!Send`-friendly) handles for fast same-thread workflows.
- explicit cross-thread handoff wrapper for `Send`-required hops.
- clear docs for shard/session ownership and expected placement behavior.
4. Performance and resource behavior
- benchmark lane compares current bridge baseline vs native driver for
throughput, tail latency, and CPU under representative profiles.
- no material regressions in memory growth under long-lived high-concurrency
workloads.
- backpressure behavior validated (bounded queues and predictable overload
failure modes).
5. Interop and reliability
- interoperability matrix includes at least quinn and one non-quinn QUIC peer.
- fault-injection coverage for loss/reorder/duplication/timeout and migration
edge cases where supported.
- soak tests validate stability across long-duration connection churn.
6. Observability and operations
- counters/histograms for handshake outcomes, retransmits, PTO events, stream
errors, and datagram drops.
- structured events for connection lifecycle and terminal error reasons.
- CI lane covers native-QUIC smoke + targeted regression suite.
7. Migration and compatibility
- bridge mode retained only as transitional compatibility path until native
coverage reaches checklist thresholds.
- migration docs describe API parity status and behavior deltas between bridge
and native modes.
### Add-now vs later guidance
- add now: runtime/driver skeleton, endpoint lifecycle parity, stream basics,
cancellation guarantees, smoke+interop tests, and baseline metrics.
- add next: datagram depth, transport tuning breadth, richer observability,
backpressure tuning, and broader fault injection.
- add later: advanced features requiring significant protocol-policy surface
(only when demand and maintenance budget justify).
## Update: QUIC add-now/add-next implementation slice delivered with Red/Green TDD (2026-03-01)
Implemented the currently sensible "add now" and selected "add next" items in
`spargio-quic`, while preserving backward-compatible bridge entrypoints and
keeping the long-term `quinn-proto` native-driver direction as the target.
### Red phase
Expanded `crates/spargio-quic/tests/quic_tdd.rs` with failing tests for:
- endpoint lifecycle + stream exchange:
- `quic_endpoint_connects_and_exchanges_uni_stream_data`
- datagram surface + metrics:
- `quic_endpoint_datagram_roundtrip_updates_metrics`
- bounded in-flight backpressure:
- `quic_endpoint_accept_backpressure_is_enforced`
- `!Send` local ergonomics + explicit send handoff:
- `quic_connection_local_to_send_handoff_preserves_identity`
- metrics snapshot baseline:
- `quic_endpoint_metrics_snapshot_has_expected_counters`
Observed expected red state:
- missing `QuicEndpoint`, `QuicEndpointOptions`, and `QuicMetricsSnapshot`
- missing local/send connection wrappers and endpoint/connection APIs
- missing QUIC test cert dependencies
### Green phase
Implemented new `spargio-quic` API surface in `crates/spargio-quic/src/lib.rs`:
Runtime/driver and cancellation behavior:
- replaced per-operation `spawn_blocking + tokio runtime build` with a shared
persistent bridge executor (`OnceLock`-backed Tokio multithread runtime).
- bridge task timeouts now abort in-flight join handles on timeout.
- retained existing `QuicBridge::{run, with_endpoint}` and free `run*` APIs.
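A minimal sketch of the shared persistent bridge executor described above: one lazily built multithread Tokio runtime reused by every bridge operation instead of constructing a runtime per call. Names here are illustrative, not the `spargio-quic` internals.

```rust
use std::sync::OnceLock;
use tokio::runtime::Runtime;

// Minimal sketch of the shared persistent bridge executor: a process-wide,
// lazily initialized multithread Tokio runtime that all bridge tasks reuse.
static BRIDGE_RUNTIME: OnceLock<Runtime> = OnceLock::new();

fn bridge_runtime() -> &'static Runtime {
    BRIDGE_RUNTIME.get_or_init(|| {
        tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .expect("failed to build shared bridge runtime")
    })
}
```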
Endpoint and connection API completeness:
- added `QuicEndpoint`:
- constructors:
- `server(...)`
- `server_with_options(...)`
- `client(...)`
- `client_with_options(...)`
- `from_endpoint*`
- lifecycle/config:
- `local_addr()`
- `set_default_client_config(...)`
- `set_server_config(...)`
- `close(...)`
- `wait_idle().await`
- connection paths:
- `connect(...)`
- `connect_with(...)`
- `accept().await`
- added `QuicConnection`:
- stream ops:
- `open_uni/open_bi`
- `accept_uni/accept_bi`
- datagram ops:
- `send_datagram(...)`
- `read_datagram().await`
- lifecycle/introspection:
- `close(...)`
- `closed().await`
- `stable_id()`
- `stats()`
- `max_datagram_size()`
- `datagram_send_buffer_space()`
Ergonomics and placement:
- added explicit send-handoff wrapper: `QuicSendConnection`.
- added local `!Send` wrapper: `LocalQuicConnection` (`Rc`-backed).
- added conversion helpers:
- `QuicConnection::to_local()`
- `QuicConnection::to_send_handle()`
- `LocalQuicConnection::to_send_handle()`
Backpressure and observability:
- added `QuicEndpointOptions` with:
- `connect_timeout`
- `accept_timeout`
- `operation_timeout`
- `max_inflight_ops`
- added bounded in-flight guardrails (`WouldBlock` on limit saturation).
- added per-endpoint metrics with snapshots:
- `QuicMetrics`
- `QuicMetricsSnapshot`
- counters include connect/accept starts/success/fail/timeouts, stream and
datagram activity, close events, operation timeouts, and backpressure hits.
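A minimal sketch of the bounded in-flight guardrail described above: a counter-based permit that refuses new operations with `WouldBlock` once the configured ceiling is reached and releases its slot on drop. Illustrative only; the real implementation also feeds the backpressure counters.

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal sketch of the bounded in-flight guardrail: counter-based permits
// with a WouldBlock rejection once max_inflight_ops is saturated.
struct InflightPermit<'a> {
    counter: &'a AtomicUsize,
}

impl Drop for InflightPermit<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::AcqRel);
    }
}

fn try_acquire(counter: &AtomicUsize, max_inflight: usize) -> io::Result<InflightPermit<'_>> {
    if counter.fetch_add(1, Ordering::AcqRel) >= max_inflight {
        counter.fetch_sub(1, Ordering::AcqRel);
        return Err(io::Error::new(
            io::ErrorKind::WouldBlock,
            "in-flight operation limit reached",
        ));
    }
    Ok(InflightPermit { counter })
}
```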
Cargo updates:
- `crates/spargio-quic/Cargo.toml`:
- Tokio features now include `rt-multi-thread` (shared executor runtime).
- test deps added: `rcgen`, `rustls`.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: R1 native cutover continuation (connection-op dispatch) with Red/Green TDD (2026-03-01)
Implemented the next R1 slice by routing native-backend connection async
operations through a persistent connection dispatcher task, and by making
connection-level backend dispatch visible in metrics.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `quic_connection_native_backend_dispatches_connection_ops`
- `quic_connection_bridge_backend_dispatches_connection_ops`
Expected red failures:
- connection operations (`open_*`, `accept_*`, etc.) were not incrementing
backend dispatch counters, so before/after metric deltas stayed flat.
### Green phase
Implemented in `crates/spargio-quic/src/lib.rs`:
- Added `NativeConnectionDispatch` actor:
- persistent Tokio task per accepted/connected native connection
- command loop for async connection operations:
- `closed`
- `open_uni` / `open_bi`
- `accept_uni` / `accept_bi`
- `read_datagram`
- bounded command/reply semantics via unbounded mpsc + oneshot replies
- deterministic `BrokenPipe` error when the dispatcher is closed (see the
dispatch sketch after this list).
- Updated `QuicEndpoint::wrap_connection(...)`:
- now initializes native connection dispatch for `QuicBackend::Native`
- now returns `io::Result<QuicConnection>` to surface dispatcher init errors.
- Updated connect/accept call sites to handle fallible wrapping and keep metrics
(`connects_failed` / `accepts_failed`) consistent on wrap failures.
- Updated `QuicConnection` operation dispatch:
- native backend async ops route through `NativeConnectionDispatch`
- bridge backend keeps direct path
- both backends now increment backend dispatch counters for connection ops
- timeout accounting (`operation_timeouts`) preserved.
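A minimal sketch of the connection-op dispatch pattern used by `NativeConnectionDispatch`: callers enqueue a command carrying a oneshot reply slot, a persistent dispatcher task owns the connection and answers each command in turn, and a closed dispatcher surfaces as `BrokenPipe`. The command set and names here are illustrative, not the `spargio-quic` internals.

```rust
use std::io;
use tokio::sync::{mpsc, oneshot};

// Minimal sketch of the command/reply dispatch actor: an unbounded command
// channel plus per-command oneshot replies.
enum ConnCommand {
    OpenUni { reply: oneshot::Sender<io::Result<u64>> },
}

fn dispatcher_closed() -> io::Error {
    io::Error::new(io::ErrorKind::BrokenPipe, "connection dispatcher closed")
}

async fn open_uni(tx: &mpsc::UnboundedSender<ConnCommand>) -> io::Result<u64> {
    let (reply, rx) = oneshot::channel();
    tx.send(ConnCommand::OpenUni { reply })
        .map_err(|_| dispatcher_closed())?;
    rx.await.map_err(|_| dispatcher_closed())?
}

fn spawn_dispatcher() -> mpsc::UnboundedSender<ConnCommand> {
    let (tx, mut rx) = mpsc::unbounded_channel();
    tokio::spawn(async move {
        let mut next_stream_id = 0u64;
        while let Some(cmd) = rx.recv().await {
            match cmd {
                ConnCommand::OpenUni { reply } => {
                    // The real dispatcher drives the quinn connection here; a
                    // counter stands in for the opened stream id.
                    next_stream_id += 1;
                    let _ = reply.send(Ok(next_stream_id));
                }
            }
        }
        // Channel closed: the dispatcher exits and later commands see BrokenPipe.
    });
    tx
}
```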
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd quic_connection_` (red then green)
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic --test soak_tdd`
## Update: R1 cutover guardrails expanded (native path bridge-spawn exclusion) with Red/Green TDD (2026-03-01)
Added explicit cutover tests that assert native backend data-path operations do
not go through bridge task spawning, while bridge backend still does.
### Red phase
Added failing tests in new file `crates/spargio-quic/tests/native_cutover_tdd.rs`:
- `native_backend_data_path_avoids_bridge_task_spawn`
- `bridge_backend_data_path_uses_bridge_task_spawn`
Initial failures:
- the native test failed due to premature close ordering, which aborted the stream read.
- lock poisoning then cascaded into the second test.
### Green phase
Adjusted test choreography and synchronization:
- serialized counter-sensitive tests with a process-local lock
(`BRIDGE_COUNT_TEST_LOCK`).
- moved connection close calls to the post-exchange phase, before `wait_idle`.
- recovered the lock from poisoning so reruns stay deterministic.
Cutover assertions now enforced:
- native backend (`QuicBackend::Native`) exchange + `wait_idle` path leaves
`bridge_runtime_spawn_count() == 0`.
- bridge backend (`QuicBackend::Bridge`) exchange + `wait_idle` path yields
`bridge_runtime_spawn_count() >= 1`.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic --test soak_tdd`
## Update: R1 cutover continuation (native endpoint lifecycle without bridge runtime context entry) with Red/Green TDD (2026-03-01)
Implemented the next R1 slice by removing native endpoint constructor/drop
dependence on `with_bridge_runtime_context(...)`, while keeping bridge backend
compatibility behavior unchanged.
### Red phase
Extended `crates/spargio-quic/tests/native_cutover_tdd.rs` with failing tests:
- `native_backend_endpoint_lifecycle_avoids_bridge_runtime_context_entry`
- `bridge_backend_endpoint_lifecycle_uses_bridge_runtime_context_entry`
Expected red failure before implementation:
- native endpoint lifecycle (`server/client` + drop) still went through
`with_bridge_runtime_context(...)`.
### Green phase
Implemented in `crates/spargio-quic/src/lib.rs`:
- Added bridge-runtime context-entry counters:
- `bridge_runtime_context_enter_count()`
- `reset_bridge_runtime_context_enter_count()`
- internal counter increment in `with_bridge_runtime_context(...)`.
- Added native endpoint runtime adapter:
- `BridgeTokioRuntime` implementing `quinn::Runtime` with explicit
`tokio::runtime::Handle` (spawn/timer/socket wrapping without relying on
thread-local runtime context entry).
- `BridgeUdpSocket` and `BridgeUdpPoller` implementing
`quinn::AsyncUdpSocket` / `quinn::UdpPoller`.
- Added native constructor helpers:
- `native_server_endpoint(...)`
- `native_client_endpoint(...)`
- Updated endpoint constructors:
- `QuicBackend::Native` now uses native constructor helpers.
- `QuicBackend::Bridge` retains `with_bridge_runtime_context(...)` path.
- Updated `Drop for QuicEndpoint`:
- bridge backend keeps runtime-context drop guard.
- native backend drops endpoint directly (no bridge context entry).
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic --test soak_tdd`
## Update: R8 deferred-fs progression (`create_dir_all` native-first) with Red/Green TDD (2026-03-01)
Implemented a concrete deferred-fs migration slice by removing direct
`spawn_blocking(std::fs::create_dir_all)` usage from the common path.
### Red phase
Extended `tests/deferred_items_tdd.rs` with a new assertion in:
- `deferred_fs_helpers_execute_and_metadata_lite_is_available`
New contract:
- simple nested `create_dir_all` paths should not use direct blocking fallback
in `spargio::fs::create_dir_all`.
### Green phase
Implemented in `src/lib.rs` (`spargio::fs` module):
- `create_dir_all(...)` now uses native-first iterative creation via
`create_dir(...)` for straightforward path forms.
- preserved compatibility fallback to `std::fs::create_dir_all` for complex
relative path forms (`.` / `..` / platform prefix components).
- added test instrumentation helpers:
- `create_dir_all_blocking_fallback_count_for_test()`
- `reset_create_dir_all_blocking_fallback_count_for_test()`
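A minimal sketch of the native-first `create_dir_all` strategy described above: straightforward paths are created one component at a time through the single-directory helper, while paths containing `.` / `..` / prefix components keep the blocking compatibility fallback. `create_dir_native` stands in for the real native helper and is illustrative; the fallback is shown inline where spargio routes it through the blocking bridge.

```rust
use std::io;
use std::path::{Component, Path, PathBuf};

// Minimal sketch of native-first create_dir_all: walk the path components and
// create each directory through the native helper, tolerating AlreadyExists;
// complex path forms fall back to the blocking std implementation.
async fn create_dir_all_sketch(path: &Path) -> io::Result<()> {
    let complex = path.components().any(|c| {
        matches!(c, Component::CurDir | Component::ParentDir | Component::Prefix(_))
    });
    if complex {
        // In spargio this routes through the blocking bridge; shown inline here.
        return std::fs::create_dir_all(path);
    }
    let mut current = PathBuf::new();
    for component in path.components() {
        current.push(component);
        match create_dir_native(&current).await {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

async fn create_dir_native(_path: &Path) -> io::Result<()> {
    // Placeholder for the native io_uring-backed create_dir described above.
    Ok(())
}
```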
Docs/status sync:
- updated README deferred-fs wording to reflect:
- `create_dir_all` now native-first for straightforward paths
- still-deferred helpers remain `canonicalize`, `metadata`,
`symlink_metadata`, `set_permissions`.
### Validation
Executed and passing:
- `cargo test --features uring-native --test deferred_items_tdd`
- `cargo test --test deferred_items_tdd`
## Update: R2 native-proto progression (`connect_for_test` + protocol transmit pump) with Red/Green TDD (2026-03-01)
Implemented an additional R2 slice to move native driver behavior beyond
placeholder queue semantics by wiring real `quinn-proto` connection bootstrap
and transmit progression in the owner loop.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_connect_for_test_generates_initial_transmit`
Red expectation:
- driver lacked a real client connection bootstrap path that produced protocol
transmits from `quinn-proto::Connection::poll_transmit(...)`.
### Green phase
Implemented in `crates/spargio-quic/src/lib.rs`:
- Added new command/API:
- `NativeProtoDriver::{connect_for_test(...)}`
- parity on local/send wrappers.
- Added owner-loop command:
- `NativeProtoCommand::ConnectForTest`
- Added protocol connection state in owner loop:
- `HashMap<ConnectionHandle, quinn_proto::Connection>`
- per-handle queued connection-event mailbox
- synthetic-id -> proto-handle mapping for close-path cleanup.
- Added protocol progression helper:
- `drive_native_proto_connections(...)`
- processes queued `ConnectionEvent`s
- forwards endpoint events via `Endpoint::handle_event(...)`
- drains `Connection::poll_transmit(...)` into native transmit queue with
existing backpressure/fault accounting.
- Integrated progression helper into:
- `SubmitDatagram` (connection-event driven progression)
- `AdvanceClockForTest` (timeout-driven progression)
- `ConnectForTest` bootstrap path.
- Added deterministic synthetic-time conversion helper:
- `native_proto_now(epoch, now_duration)`.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_connect_for_test_generates_initial_transmit`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test --features uring-native --test deferred_items_tdd`
## Update: R1-R9 implementation sweep completed with Red/Green TDD (2026-03-01)
Completed the roadmap milestones `R1` through `R9` with concrete tests,
implementation slices, and CI/docs wiring.
### R1: QUIC backend cutover controls (`Native` default, `Bridge` explicit fallback)
Red phase:
- Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `quic_endpoint_options_default_to_native_backend`
- `quic_endpoint_default_backend_dispatches_native_ops`
- `quic_endpoint_bridge_backend_dispatches_bridge_ops`
Green phase:
- Added `QuicBackend` (`Native`, `Bridge`) and plumbed it through
`QuicEndpointOptions`.
- Added dispatch metrics:
- `QuicMetricsSnapshot::{native_ops_dispatched, bridge_ops_dispatched}`
- Routed endpoint operation dispatch by backend mode; `Native` is default.
- Added controlled endpoint drop path to preserve quinn runtime-context safety.
### R2: Native driver progression beyond bare skeleton semantics
Red phase:
- Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_closed_connection_rejects_stream_ops`
- `native_proto_driver_connection_datagram_roundtrip_tracks_state`
Green phase:
- Added `NativeProtoConnectionState`.
- Added connection lifecycle/datagram APIs:
- `close_connection_for_test`
- `connection_state`
- `send_datagram_on_connection_for_test`
- `recv_datagram_on_connection_for_test`
- Extended owner-loop connection pump with close-state guards and per-connection
datagram queues/counters.
- Mirrored these APIs on local/send driver wrappers.
### R3: Native QUIC interop matrix
Red phase:
- Added failing interop suite `crates/spargio-quic/tests/interop_tdd.rs`.
Green phase:
- Added interop tests:
- `interop_spargio_client_to_raw_quinn_server_bi_stream`
- `interop_raw_quinn_client_to_spargio_server_bi_stream`
- Added `scripts/quic_interop_matrix.sh`.
- Added CI wiring in `.github/workflows/ci.yml` (`companion-matrix` job).
### R4: Long-window soak + fault qualification
Red phase:
- Added failing soak/fault qualification suite
`crates/spargio-quic/tests/soak_tdd.rs`.
Green phase:
- Added ignored soak tests:
- `soak_connection_churn_roundtrip_stays_stable`
- `soak_native_fault_injection_keeps_egress_queue_bounded`
- Added `scripts/quic_soak_fault.sh`.
- Wired nightly CI soak invocation in `.github/workflows/ci.yml`.
### R5: Performance gate integration for rollout
Red phase:
- Added failing QUIC perf-gate harness tests `tests/quic_perf_guardrail_tdd.rs`.
Green phase:
- Added `scripts/quic_perf_gate.sh` (p95/p99 regression + throughput floor).
- Added fixture profile:
- `tests/fixtures/quic_perf/native_vs_bridge.json`
- Added CI wiring for fixture-based perf gate in `.github/workflows/ci.yml`.
- Added CI/script guard test `tests/quic_ops_tdd.rs`.
### R6: README/status sync
Red phase:
- Added failing docs guards in `tests/docs_tdd.rs` for QUIC status wording and
helper script references.
Green phase:
- Updated `README.md` done/not-done sections:
- backend selector/rollout status
- explicit note that the full tokio-free `quinn-proto` cutover is still pending
- QUIC interop/perf/soak helper scripts listed
- Added docs assertions:
- `readme_tracks_quic_rollout_done_and_not_done_status`
- `implementation_log_contains_r1_to_r9_breakdown_sections`
### R7: Companion hardening beyond smoke
Red phase:
- Added failing broader maturity tests across companion crates.
Green phase:
- Added companion hardening tests:
- `crates/spargio-process/tests/maturity_tdd.rs`
- `crates/spargio-signal/tests/maturity_tdd.rs`
- `crates/spargio-protocols/tests/foundation_tdd.rs`
- `crates/spargio-tls/tests/tls_tdd.rs`
- `crates/spargio-ws/tests/ws_tdd.rs`
- Added `scripts/companion_ci_hardening.sh`.
- Wired CI (`companion-matrix`) to run hardening lane.
- Extended `tests/companion_ops_tdd.rs` to assert hardening script + CI wiring.
### R8: DNS and deferred fs items encoded as explicit contracts
Red phase:
- Added failing contract/behavior tests in `tests/deferred_items_tdd.rs`.
Green phase:
- Added README contract assertions for:
- DNS `ToSocketAddrs` caveat and `SocketAddr` alternatives
- deferred fs helper list + `metadata_lite`
- Added feature-gated behavior tests (`uring-native` Linux lane) for:
- hostname connect and socket-addr connect behavior
- deferred fs helper execution (`create_dir_all`, `canonicalize`, `metadata`,
`symlink_metadata`, `set_permissions`, `metadata_lite`)
### R9: Scheduler/docs maturity
Red phase:
- Added failing runtime test for scheduler tuning knob visibility.
Green phase:
- Added scheduler knob:
- `RuntimeBuilder::steal_victim_stride(...)`
- Plumbed victim stride through work-stealing loop and stats snapshot:
- `RuntimeStats::steal_victim_stride`
- Added runtime test:
- `runtime_builder_steal_victim_stride_is_reported_and_clamped`
- Added mdBook chapter:
- `book/src/scheduler_tuning.md`
- Updated book summary:
- `book/src/SUMMARY.md`
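A hedged configuration sketch for the new knob; only `steal_victim_stride` and the stats field come from this log, while the builder entry points and stats accessor are assumptions:

```rust
// Hedged sketch: RuntimeBuilder::new()/build() and runtime.stats() are assumed names.
fn steal_stride_sketch() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = RuntimeBuilder::new()
        .steal_victim_stride(3) // clamped by the runtime if outside the valid range
        .build()?;
    let stats = runtime.stats();
    assert_eq!(stats.steal_victim_stride, 3);
    Ok(())
}
```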
### Validation
Executed and passing during this sweep:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic --test soak_tdd`
- `cargo test --test quic_perf_guardrail_tdd`
- `cargo test --test quic_ops_tdd`
- `cargo test --test docs_tdd`
- `cargo test --test runtime_tdd`
- `cargo test -p spargio-process --test maturity_tdd`
- `cargo test -p spargio-signal --test maturity_tdd`
- `cargo test -p spargio-protocols --test foundation_tdd --features uring-native`
- `cargo test -p spargio-tls --test tls_tdd`
- `cargo test -p spargio-ws --test ws_tdd`
- `cargo test --test deferred_items_tdd`
- `cargo test --features uring-native --test deferred_items_tdd`
- `cargo test --test companion_ops_tdd`
## Update: Remaining work breakdown after N1-N8 (2026-03-01)
This captures the concrete work still required for current "not done yet"
items after completing native QUIC skeleton milestones N1-N8.
### R1: QUIC native data-path cutover (bridge replacement)
Scope:
- route `QuicEndpoint::{connect, connect_with, accept, wait_idle}` and
`QuicConnection` operations through `NativeProtoDriver` instead of
`spawn_on_bridge_runtime`.
- keep bridge path only as explicit compatibility fallback.
Red tests first:
- assert public `QuicEndpoint` operations do not require Tokio bridge runtime.
- assert API behavior parity versus current bridge-path semantics.
Green acceptance:
- default QUIC path is native-driver-backed.
- bridge path is opt-in and clearly documented.
### R2: Real protocol progression over native loop (beyond skeleton semantics)
Scope:
- replace placeholder stream/datagram progression logic with true
`quinn-proto` connection/event handling and transmit scheduling.
- map connection lifecycle and stream transitions to protocol-driven state.
Red tests first:
- protocol-level stream open/accept/finish/reset behavior fails under skeleton.
- datagram and close semantics fail under protocol-correct expectations.
Green acceptance:
- protocol-driven tests pass with deterministic behavior under concurrency.
### R3: Native QUIC interop matrix
Scope:
- add interop suite: native Spargio QUIC endpoint vs quinn peer.
- add at least one non-quinn peer lane where practical.
Red tests first:
- handshake/data exchange against peer(s) fails before interop wiring.
Green acceptance:
- CI interop lane passes for all selected peers and profiles.
### R4: Long-window soak + fault qualification
Scope:
- extend current fault hooks into soak lanes (loss/reorder/drop over duration).
- add connection churn and memory-growth assertions.
Red tests first:
- soak/fault lanes expose regressions in retries or queue growth.
Green acceptance:
- no unbounded queue/memory growth in long-window runs.
- fault scenarios meet defined success/error-rate thresholds.
### R5: Performance gate integration for rollout
Scope:
- integrate `NativeProtoPerfGate` into repeatable benchmark guardrail workflow.
- produce native-vs-bridge verdicts for p95/p99 and throughput.
Red tests first:
- guardrail fails when synthetic/fixture regressions exceed thresholds.
Green acceptance:
- documented threshold policy and passing perf-gate lane in CI tooling.
### R6: README/status sync for new QUIC reality
Scope:
- update `README.md` done/not-done to reflect native-driver milestones N1-N8.
- explicitly separate "native skeleton done" vs "full default cutover pending".
Red tests first:
- docs/status tests fail when `README.md` is stale relative to the implementation log.
Green acceptance:
- README done/not-done sections accurately mirror implementation state.
### R7: Companion hardening beyond smoke lanes (repo-wide)
Scope:
- deepen failure-injection/soak coverage across companion protocol crates.
- add broader p95/p99 operational gates where meaningful.
Red tests first:
- dedicated hardening tests expose missing coverage and drift.
Green acceptance:
- companion CI includes deeper operational coverage, not smoke only.
### R8: DNS and fs deferred items (repo-wide)
Scope:
- evaluate nonblocking DNS strategies for `ToSocketAddrs` paths or keep explicit
`SocketAddr` requirement with stronger docs/contracts.
- decide and implement remaining deferred fs helper migration cases where
value/complexity tradeoff is justified.
Red tests first:
- DNS-path behavior and deferred fs helper behavior encoded in explicit tests.
Green acceptance:
- each deferred item either implemented with tests or explicitly documented as
intentionally deferred with rationale.
### R9: Scheduler/docs maturity (repo-wide)
Scope:
- advance work-stealing policy tuning beyond MVP heuristics.
- expand mdBook operations/placement/API-selection guidance to current depth.
Red tests first:
- scheduler tuning guardrails and docs-link/coverage tests for new chapters.
Green acceptance:
- measurable scheduler improvements in targeted workloads.
- book coverage aligned with current feature set and operational guidance.
### Suggested execution order from here
1. R1 native cutover.
2. R2 protocol-correct progression.
3. R3 interop matrix.
4. R4 soak/fault qualification.
5. R5 perf-gate integration.
6. R6 README/status sync.
7. R7 companion hardening.
8. R8 DNS/fs deferred decisions.
9. R9 scheduler/docs maturity.
## Update: Phase N8 implemented (fault injection + rollout/perf gates) with Red/Green TDD (2026-03-01)
Implemented N8 qualification primitives: deterministic fault injection controls,
fault stats, and explicit rollout/performance gate APIs.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_fault_injection_drops_ingress_and_tracks_stats`
- `native_proto_driver_reorders_egress_when_fault_enabled`
- `native_proto_perf_gate_marks_material_regression_as_fail`
- `native_proto_rollout_stage_is_experimental_for_now`
Expected red failures:
- missing fault spec/stats APIs and behaviors
- missing rollout/performance gate types
### Green phase
Added fault-injection types and APIs:
- `NativeProtoFaultSpec`
- `NativeProtoFaultStats`
- `NativeProtoDriver::{set_fault_spec, fault_stats}`
Owner-loop fault behaviors:
- optional inbound drop mode (`drop_inbound`)
- optional egress drop mode (`drop_egress`)
- optional egress reorder mode (`reorder_egress` on drain)
- tracked fault counters:
- inbound drops
- egress drops
- egress reorder operations
Added rollout/performance gate types:
- `NativeProtoRolloutStage` with current stage:
- `NativeProtoDriver::rollout_stage() == Experimental`
- `NativeProtoPerfGate`
- `NativeProtoPerfVerdict`
- regression evaluation helper:
- `NativeProtoPerfGate::evaluate(...)`
Wrapper parity:
- local/send native wrappers delegate fault spec/stats APIs.
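A hedged sketch of how the qualification primitives fit together; the fault-spec field layout, Debug derives, and Result returns are assumptions on top of the type and method names above:

```rust
// Hedged sketch: NativeProtoFaultSpec construction and return shapes are assumed.
async fn fault_injection_sketch(driver: &NativeProtoDriver) -> std::io::Result<()> {
    let spec = NativeProtoFaultSpec {
        drop_inbound: true,   // drop ingress datagrams before protocol processing
        drop_egress: false,
        reorder_egress: true, // reorder queued transmits when drained
    };
    driver.set_fault_spec(spec).await?;
    let counters: NativeProtoFaultStats = driver.fault_stats().await?;
    println!("fault counters so far: {counters:?}");
    // rollout gate: the native driver currently self-reports as experimental
    assert_eq!(driver.rollout_stage(), NativeProtoRolloutStage::Experimental);
    Ok(())
}
```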
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: Phase N7 implemented (native observability surface) with Red/Green TDD (2026-03-01)
Implemented native-driver stats snapshots and structured event logging with
bounded event retention.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_stats_track_key_operations`
- `native_proto_driver_event_log_captures_timeout_and_backpressure`
Expected red failures:
- missing `stats()` and `drain_events()` APIs
- missing `NativeProtoEvent` and operation counters
### Green phase
Added observability types:
- `NativeProtoStats`
- `NativeProtoEvent`
Added native driver APIs:
- `stats().await`
- `drain_events(max).await`
Owner-loop observability behavior:
- tracks operation totals and key domain counters:
- connection registrations
- stream opens (uni/bi)
- datagram ingest/oversize rejections
- backpressure hits
- timer fires
- emits structured events for:
- connection registration
- timeout firing
- oversized datagram rejection
- backpressure events
- retains events in bounded FIFO buffer (`NATIVE_EVENT_CAPACITY`).
Wrapper parity:
- local/send native wrappers delegate `stats()` and `drain_events(...)`.
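A hedged observability sketch; whether `stats()` / `drain_events(...)` return plain values or `Result`, and the Debug derives, are assumptions:

```rust
// Hedged sketch: plain-value returns and Debug derives are assumed.
async fn observe(driver: &NativeProtoDriver) {
    let snapshot: NativeProtoStats = driver.stats().await;
    println!("native driver counters: {snapshot:?}");
    // events sit in a bounded FIFO, so periodic draining is enough to observe
    // timeouts, backpressure hits, and oversized-datagram rejections
    for event in driver.drain_events(64).await {
        println!("native event: {event:?}");
    }
}
```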
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: Phase N5 implemented (datagram limits + transport tuning surface) with Red/Green TDD (2026-03-01)
Implemented datagram-size enforcement and transport tuning roundtrip APIs on
the native driver surface.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_transport_tuning_roundtrip`
- `native_proto_driver_rejects_oversized_datagram_per_tuning`
Expected red failures:
- missing transport tuning type and setter/getter APIs
- no max-datagram-size enforcement in datagram ingest path
### Green phase
Added tuning type:
- `NativeProtoTransportTuning`
- `max_datagram_size`
- `send_window`
- `receive_window`
- `keep_alive_interval`
- `mtu_discovery_enabled`
- builder-style `with_*` methods
Added native driver methods:
- `set_transport_tuning(...).await`
- `transport_tuning().await`
Owner-loop behavior:
- tracks active tuning config.
- validates `max_datagram_size > 0` on update.
- `submit_datagram` rejects oversized payloads with `InvalidInput`.
Wrapper parity:
- local/send native wrappers delegate tuning setter/getter as well.
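A hedged tuning-roundtrip sketch; `Default` construction, the exact `with_*` name, and return shapes are assumptions beyond the field and method names above:

```rust
// Hedged sketch: Default construction and return shapes are assumed.
async fn tuning_roundtrip(driver: &NativeProtoDriver) -> std::io::Result<()> {
    let tuning = NativeProtoTransportTuning::default().with_max_datagram_size(1200);
    driver.set_transport_tuning(tuning).await?;
    let active = driver.transport_tuning().await;
    assert_eq!(active.max_datagram_size, 1200);
    // payloads above the limit are rejected with ErrorKind::InvalidInput
    Ok(())
}
```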
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: Phase N6 implemented (native local/send ergonomics mapping) with Red/Green TDD (2026-03-01)
Implemented `!Send` local and explicit send-handoff wrappers for the native
driver command surface.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_local_send_handoff_preserves_identity`
- `native_proto_driver_send_handle_respects_shutdown`
Expected red failures:
- missing `to_local()` / `to_send_handle()` on `NativeProtoDriver`
- missing local/send wrapper types
### Green phase
Added wrapper types:
- `NativeProtoDriverLocal` (`Rc`-backed local handle)
- `NativeProtoDriverSend` (`Send` handoff handle)
Added conversions:
- `NativeProtoDriver::to_local()`
- `NativeProtoDriver::to_send_handle()`
- `NativeProtoDriverLocal::to_send_handle()`
Delegated native-driver operations through wrappers (probe/shutdown/connection
and stream APIs) while preserving endpoint identity and closed-state behavior.
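A hedged handoff sketch; only the wrapper type names and conversion methods come from this log, and delegation of `endpoint_id()` through both wrappers is assumed:

```rust
// Hedged sketch: endpoint_id() delegation through the wrappers is assumed.
fn handoff_sketch(driver: &NativeProtoDriver) {
    let local: NativeProtoDriverLocal = driver.to_local();       // !Send, Rc-backed
    let handoff: NativeProtoDriverSend = local.to_send_handle(); // Send handoff handle
    assert_eq!(handoff.endpoint_id(), driver.endpoint_id());
    // after driver shutdown, commands via either wrapper are rejected with the
    // same deterministic closed-state error as on the driver itself
}
```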
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: Phase N4 implemented (connection/stream pump skeleton) with Red/Green TDD (2026-03-01)
Implemented a deterministic native connection/stream event-pump skeleton in the
owner task to model connection registration and stream lifecycle transitions.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_open_uni_roundtrips_to_accept_uni`
- `native_proto_driver_open_bi_roundtrips_to_accept_bi`
- `native_proto_driver_finish_and_reset_stream_are_observable`
Expected red failures:
- missing connection registration and stream open/accept APIs
- missing stream finish/reset state tracking
### Green phase
Added native connection/stream pump surface:
- new stream-state type:
- `NativeProtoStreamState { finished, reset }`
- new driver methods:
- `register_connection_for_test().await`
- `open_uni_on_connection(...).await`
- `accept_uni_on_connection(...).await`
- `open_bi_on_connection(...).await`
- `accept_bi_on_connection(...).await`
- `finish_stream(...).await`
- `reset_stream(...).await`
- `stream_state(...).await`
Owner-loop internals:
- per-connection registry (`HashMap`) with:
- pending uni accept queue
- pending bi accept queue
- per-stream terminal state
- deterministic error behavior:
- unknown connection/stream => `NotFound`
- accept with no pending stream => `WouldBlock`
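A hedged sketch of the skeleton semantics above; ID types, argument order, and Result returns are assumptions beyond the listed method names:

```rust
// Hedged sketch: ID types and exact signatures are assumed.
async fn stream_pump_sketch(driver: &NativeProtoDriver) -> std::io::Result<()> {
    let conn = driver.register_connection_for_test().await?;
    // accepting before anything was opened reports WouldBlock, not an empty value
    let err = driver.accept_bi_on_connection(conn).await.unwrap_err();
    assert_eq!(err.kind(), std::io::ErrorKind::WouldBlock);
    let stream = driver.open_bi_on_connection(conn).await?;
    let _accepted = driver.accept_bi_on_connection(conn).await?;
    driver.finish_stream(conn, stream).await?;
    let state = driver.stream_state(conn, stream).await?;
    assert!(state.finished && !state.reset);
    Ok(())
}
```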
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: Phase N3 implemented (timer/wake progression skeleton) with Red/Green TDD (2026-03-01)
Implemented deterministic timer progression primitives in the native driver
loop to support deadline scheduling and stale-deadline supersession semantics.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_timers_fire_when_deadline_passes`
- `native_proto_driver_newer_deadline_supersedes_older`
Expected red failures:
- missing timeout scheduling/clock-advance APIs
- missing timeout fire accounting and generation tracking
### Green phase
Extended native driver with timer-state APIs:
- new type:
- `NativeProtoTimerState`
- new methods:
- `schedule_timeout(after).await -> generation`
- `advance_clock_for_test(by).await -> NativeProtoTimerState`
- `timer_state().await -> NativeProtoTimerState`
Owner-loop behavior:
- maintains synthetic monotonic `now`.
- tracks single active deadline with generation ID.
- newer deadline supersedes older deadline.
- timeout fires increment counter and record last fired generation.
- deadline state is queryable after each progression step.
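A hedged sketch of the timer-progression contract; numeric generations, Result returns, and a Debug derive on `NativeProtoTimerState` are assumptions (its field names are not spelled out in this log, so the snapshot is only printed):

```rust
// Hedged sketch: generation/return shapes are assumed.
async fn timer_sketch(driver: &NativeProtoDriver) -> std::io::Result<()> {
    let older = driver.schedule_timeout(std::time::Duration::from_millis(5)).await?;
    let newer = driver.schedule_timeout(std::time::Duration::from_millis(1)).await?;
    assert!(newer > older); // generations are monotonic; the newer deadline supersedes
    let state = driver.advance_clock_for_test(std::time::Duration::from_millis(2)).await?;
    println!("timer state after synthetic advance: {state:?}");
    Ok(())
}
```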
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: Phase N2 implemented (native UDP ingress/egress skeleton) with Red/Green TDD (2026-03-01)
Implemented bounded UDP ingress/egress command plumbing in the native driver
loop so the owner task can ingest datagrams and emit queued transmits.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_ingests_datagrams_and_supports_bounded_drain`
- `native_proto_driver_egress_queue_applies_backpressure`
- `native_proto_driver_drain_is_fifo_and_batch_limited`
Expected red failures:
- missing `submit_datagram`, `drain_transmits`, and queue backpressure methods
- missing `NativeProtoTransmit` and N2-specific options
### Green phase
Extended native driver API in `crates/spargio-quic/src/lib.rs`:
- new options:
- `NativeProtoDriverOptions::with_max_pending_transmits(...)`
- new types:
- `NativeProtoTransmit`
- `NativeProtoIngressReport`
- new driver methods:
- `submit_datagram(remote, payload).await`
- `drain_transmits(max).await`
- `enqueue_transmit_for_test(...).await` (deterministic queue-path test hook)
Owner-loop integration details:
- owner loop now maintains:
- `quinn_proto::Endpoint`
- bounded `VecDeque<NativeProtoTransmit>` egress queue
- `submit_datagram` path feeds payload into `Endpoint::handle(...)`.
- response/new-connection outputs are converted into queued transmits.
- queue saturation returns deterministic `WouldBlock`.
- drain path is FIFO and batch-limited.
Cargo updates:
- `crates/spargio-quic/Cargo.toml` adds:
- `bytes = "1"` (for `BytesMut` ingress feed)
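A hedged ingress/egress sketch; `Default` options construction, Result returns, and the payload parameter type are assumptions beyond the names introduced in this slice:

```rust
// Hedged sketch: option construction and return shapes are assumed.
async fn ingress_egress_sketch(
    handle: &RuntimeHandle,
    initial_packet: Vec<u8>,
) -> std::io::Result<()> {
    let options = NativeProtoDriverOptions::default().with_max_pending_transmits(4);
    let driver = NativeProtoDriver::start(handle, options)?;
    let peer: std::net::SocketAddr = "127.0.0.1:4433".parse().unwrap();
    // the owner loop feeds raw UDP payloads into quinn_proto::Endpoint::handle(...)
    driver.submit_datagram(peer, initial_packet).await?;
    // drains are FIFO and batch-limited; queue saturation surfaces as WouldBlock
    let batch: Vec<NativeProtoTransmit> = driver.drain_transmits(2).await?;
    assert!(batch.len() <= 2);
    Ok(())
}
```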
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
### Notes on long-term direction
- This slice removes the highest-friction bridge behavior (per-call runtime
creation) and adds the targeted API/ergonomics/metrics groundwork.
- The deeper final-form goal remains: replace bridge-centric data-path handling
with a native shard-owned `quinn-proto` endpoint driver over Spargio
primitives.
## Update: Native `quinn-proto` next-step breakdown (2026-03-01)
Added concrete execution plan for the next major step: moving QUIC data-plane
ownership from bridge mode to a Spargio-native `quinn-proto` endpoint driver.
### Phase N1: Driver skeleton and ownership model
Scope:
- add a shard-affine endpoint task that owns `quinn_proto::Endpoint`.
- define command mailbox + response channels for app API calls.
- define stable internal IDs for endpoint/connection/stream handles.
Red tests first:
- endpoint task boots and accepts command loop.
- commands are rejected after endpoint shutdown with deterministic errors.
- connection/stream IDs remain stable across handle clones.
Green acceptance:
- no Tokio runtime creation per endpoint operation.
- one owner task per endpoint socket lifecycle.
### Phase N2: UDP ingress/egress integration
Scope:
- wire native UDP recv/send loops to feed `Endpoint::handle(...)`.
- emit and send all required transmits from endpoint/connection progression.
- support bounded batching and clear overload behavior.
Red tests first:
- received UDP datagram drives handshake progress.
- generated transmits are flushed and peer receives expected payload.
- bounded queue overflow yields deterministic backpressure errors.
Green acceptance:
- no busy-spin loops.
- sustained traffic does not leak buffers/queues.
### Phase N3: Timer and wake progression
Scope:
- map `poll_timeout`/`handle_timeout` onto Spargio timers.
- implement endpoint wake scheduling for retransmit/PTO/deadline updates.
Red tests first:
- timeout-driven retransmit path is exercised under packet loss.
- stale timer update does not regress newer deadline scheduling.
Green acceptance:
- driver sleeps until next meaningful deadline.
- timer races do not produce duplicated work loops.
### Phase N4: Connection and stream event pump
Scope:
- map `quinn-proto` connection events to public `QuicConnection` operations.
- implement uni/bi stream open/accept/read/write/finish/reset plumbing.
- preserve current `QuicConnection` API behavior and error shape.
Red tests first:
- bi/uni stream open+echo paths pass under concurrent connections.
- finish/reset/stop semantics match expected transport behavior.
Green acceptance:
- no API regression relative to current `spargio-quic` tests.
- deterministic cancellation/drop semantics.
### Phase N5: Datagram and transport tuning depth
Scope:
- complete datagram send/recv behavior with size-limit enforcement.
- expose practical tuning pass-throughs (transport windows, keepalive, MTU).
Red tests first:
- oversized datagrams fail predictably.
- tuning knobs are plumbed and affect runtime-observable behavior.
Green acceptance:
- datagram paths are parity-complete for common workloads.
### Phase N6: Local `!Send` and cross-thread handoff mapping
Scope:
- keep `LocalQuicConnection` and `QuicSendConnection` on native backend.
- enforce ownership/thread invariants with explicit handoff boundaries.
Red tests first:
- local-to-send handoff preserves stable identity and operation correctness.
- invalid post-shutdown/local misuse yields deterministic errors.
Green acceptance:
- current ergonomics tests remain green without bridge fallback.
### Phase N7: Observability and operations gates
Scope:
- emit native-path counters and structured lifecycle/error events.
- add p50/p95/p99 and retransmit/PTO visibility hooks for CI and soak lanes.
Red tests first:
- counters advance for connects/accepts/streams/datagrams/timeouts.
- error events include terminal reason classes.
Green acceptance:
- companion CI lane includes native-QUIC smoke targets.
- soak lane validates no unbounded growth.
### Phase N8: Interop/fault/perf qualification and rollout
Scope:
- interop against at least quinn peer + one non-quinn peer where practical.
- fault-injection matrix: loss/reorder/duplication/timeout.
- benchmark A/B against current bridge path.
Red tests first:
- forced-loss and reorder scenarios fail without reliability fixes.
- A/B harness asserts no material regressions versus baseline thresholds.
Green acceptance:
- native backend meets or exceeds checklist thresholds for default use.
- bridge backend retained as compatibility fallback until native lane is
sufficiently hardened in CI/soak.
### Immediate execution order
1. N1 driver skeleton and ownership model.
2. N2 UDP integration.
3. N3 timers/wakes.
4. N4 connection/stream pump.
5. N6 ergonomics mapping.
6. N5 datagram/tuning depth.
7. N7 observability/ops.
8. N8 interop/fault/perf rollout gates.
## Update: Phase N1 implemented (native driver skeleton + ownership model) with Red/Green TDD (2026-03-01)
Implemented the first native `quinn-proto` milestone as a dedicated driver
skeleton API while preserving existing `QuicEndpoint` behavior.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_runs_on_owner_shard`
- `native_proto_driver_stable_ids_are_monotonic`
- `native_proto_driver_rejects_commands_after_shutdown`
Expected red failures:
- missing `NativeProtoDriver` and `NativeProtoDriverOptions`
- no owner-shard task mailbox or stable-id allocation surface
### Green phase
Added native-driver skeleton in `crates/spargio-quic/src/lib.rs`:
- new options:
- `NativeProtoDriverOptions` (`owner_shard`)
- new probe snapshot:
- `NativeProtoDriverProbe`
- new driver handle:
- `NativeProtoDriver::start(&RuntimeHandle, options)`
- `probe()`
- `allocate_connection_id()`
- `allocate_stream_id()`
- `shutdown()`
- `is_closed()`
- `endpoint_id()`
- `owner_shard()`
Ownership and mailbox semantics:
- driver loop is spawned via `RuntimeHandle::spawn_local_on(owner_shard, ...)`.
- loop owns a `quinn_proto::Endpoint` instance and processes command mailbox
messages serially.
- stable endpoint IDs are generated globally (`NEXT_NATIVE_ENDPOINT_ID`).
- connection/stream IDs are generated monotonically within the owner task.
- post-shutdown commands are rejected with `BrokenPipe`.
Cargo updates:
- `crates/spargio-quic/Cargo.toml` adds direct dependency:
- `quinn-proto = "0.11"`
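A hedged ownership-model sketch; `Default` options construction and Result-vs-plain return shapes are assumptions beyond the methods listed above:

```rust
// Hedged sketch: option construction and return shapes are assumed.
async fn ownership_sketch(handle: &RuntimeHandle) -> std::io::Result<()> {
    let driver = NativeProtoDriver::start(handle, NativeProtoDriverOptions::default())?;
    let _probe = driver.probe().await?; // NativeProtoDriverProbe snapshot
    let a = driver.allocate_connection_id().await?;
    let b = driver.allocate_connection_id().await?;
    assert!(b > a); // IDs are handed out monotonically inside the owner task
    driver.shutdown().await?;
    assert!(driver.is_closed().await);
    // any further command is rejected with ErrorKind::BrokenPipe
    Ok(())
}
```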
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic`
## Update: R2 continuation (proto-backed command semantics for connected handles) with Red/Green TDD (2026-03-01)
Implemented a follow-up R2 slice that routes `connect_for_test`-backed
stream/datagram commands through real `quinn-proto::Connection` APIs instead
of only synthetic queue behavior.
### Red phase
Added failing tests in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_connect_for_test_open_uni_respects_proto_stream_credit`
- `native_proto_driver_connect_for_test_open_bi_respects_proto_stream_credit`
Red expectation:
- with synthetic fallback still active for connected handles, `open_uni`/`open_bi`
incorrectly succeeded even when protocol stream credit had not been granted.
### Green phase
Updated owner-loop command handlers in `crates/spargio-quic/src/lib.rs`:
- proto-connected path (`connection_id -> ConnectionHandle`) now uses
`quinn_proto::Connection` operations for:
- `open_uni_on_connection` / `open_bi_on_connection`
- `accept_uni_on_connection` / `accept_bi_on_connection`
- `send_datagram_on_connection_for_test` / `recv_datagram_on_connection_for_test`
- `finish_stream` / `reset_stream`
- added conversion/error helpers:
- `proto_stream_id_from_u64(...)`
- `proto_send_datagram_error_to_io(...)`
- `proto_finish_error_to_io(...)`
- after mutating proto-backed stream/datagram state, the loop now drives
`drive_native_proto_connections(...)` to flush resulting endpoint/transmit work.
- synthetic fallback behavior remains for explicitly synthetic test connections
created by `register_connection_for_test`.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_connect_for_test_open_`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic`
## Update: R2 continuation (proto close-path emit on connected handles) with Red/Green TDD (2026-03-01)
Implemented close-path progression for `connect_for_test` protocol-backed
connections so `close_connection_for_test(...)` produces close transmits before
connection teardown.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_close_connection_for_test_emits_close_transmit_for_proto_connection`
Red expectation:
- close command removed proto connection state immediately, so draining transmits
after close yielded no close packet output.
### Green phase
Updated `NativeProtoCommand::CloseConnectionForTest` handling in
`crates/spargio-quic/src/lib.rs`:
- for protocol-backed connection IDs:
- call `quinn_proto::Connection::close(...)` with an app close code/reason.
- run `drive_native_proto_connections(...)` to flush close-path transmits.
- then remove handle mappings and stored protocol state.
- retained synthetic-connection cleanup behavior for non-proto test handles.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_close_connection_for_test_emits_close_transmit_for_proto_connection`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
## Update: R2 continuation (payload-carrying transmits + server-accept path) with Red/Green TDD (2026-03-01)
Implemented the next native-proto progression slice so driver transmits include
actual datagram payload bytes, and so a driver configured with server config can
accept client `connect_for_test` traffic over the same command surface.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_server_config_accepts_client_transmits`
Initial red failure surfaced a protocol gap:
- `submit_datagram(...)` incorrectly enforced app-datagram tuning limits on raw
protocol ingress datagrams, rejecting valid Initial packets.
### Green phase
Updated `crates/spargio-quic/src/lib.rs`:
- `NativeProtoDriverOptions` now supports optional server mode:
- added `server_config: Option<quinn::ServerConfig>`
- added `with_server_config(...)`
- owner loop now initializes endpoint with optional server config and allows
incoming accepts when configured.
- `NativeProtoTransmit` now carries `payload: Vec<u8>` in addition to metadata.
- `push_native_transmit(...)` and all transmit producers now preserve payload
bytes from scratch buffers (`transmit_payload(...)` helper).
- `SubmitDatagram` `NewConnection` handling:
- when server-configured, uses `Endpoint::accept(...)` and registers the new
protocol connection handle + synthetic connection ID mapping.
- otherwise preserves explicit `refuse(...)` behavior.
- corrected datagram-size semantics:
- removed tuning max-size enforcement from raw `submit_datagram(...)` ingress.
- kept/enforced tuning max-size on app datagram API
`send_datagram_on_connection_for_test(...)`, with stats/event accounting.
Test updates:
- updated `NativeProtoTransmit` test fixtures to include payload bytes.
- adjusted oversized-datagram test to validate app-datagram path:
- `native_proto_driver_rejects_oversized_datagram_per_tuning` now uses
`send_datagram_on_connection_for_test(...)`.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_server_config_accepts_client_transmits`
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_rejects_oversized_datagram_per_tuning`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic`
- `cargo test --test docs_tdd`
## Update: R2 continuation (post-handshake stream open/accept contract) with Red/Green TDD (2026-03-01)
Added executable contract coverage for a protocol-correct post-handshake stream
path across two native drivers.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_post_handshake_bi_stream_open_is_accepted_by_server`
Initial red behavior:
- server-side accept immediately after client `open_bi` failed with
`WouldBlock` because no peer-visible stream signal had been transmitted yet.
### Green phase
Adjusted the test flow to align with protocol semantics:
- after client `open_bi`, call `finish_stream` to emit stream signaling.
- exchange transmit payloads between client/server drivers.
- assert server `accept_bi_on_connection(...)` observes the opened stream.
Also factored reusable driver-exchange helper in test module:
- `exchange_driver_transmits(...)`
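A hedged sketch of what such an exchange helper can look like in a two-driver test topology; the real `exchange_driver_transmits(...)` signature is not spelled out here, and treating every drained transmit as addressed to the single peer (plus the `submit_datagram` payload type) is an assumption:

```rust
// Hedged sketch: single-peer routing and payload parameter type are assumed.
async fn exchange_once(
    client: &NativeProtoDriver,
    server: &NativeProtoDriver,
    client_addr: std::net::SocketAddr,
    server_addr: std::net::SocketAddr,
) -> std::io::Result<()> {
    for t in client.drain_transmits(32).await? {
        // the server sees the datagram as arriving from the client address
        server.submit_datagram(client_addr, t.payload).await?;
    }
    for t in server.drain_transmits(32).await? {
        client.submit_datagram(server_addr, t.payload).await?;
    }
    Ok(())
}
```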
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_post_handshake_bi_stream_open_is_accepted_by_server`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic`
- `cargo test --test docs_tdd`
## Update: R2 continuation (remote close propagation into native connection state) with Red/Green TDD (2026-03-02)
Implemented another R2 protocol-progression slice so a peer-initiated close is
observable through `NativeProtoConnectionState.closed` on the remote side.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_remote_close_marks_peer_connection_closed`
Initial red behavior:
- server `close_connection_for_test(...)` produced close traffic, but the client
driver never reflected `ConnectionLost` into its tracked synthetic connection
state, so `connection_state(...).closed` remained `false`.
### Green phase
Updated `crates/spargio-quic/src/lib.rs`:
- wired reverse mapping `connection_id_by_handle` for all protocol-backed
connection registrations (`connect_for_test` and server accept path).
- extended `drive_native_proto_connections(...)` to poll application events via
`quinn_proto::Connection::poll()` and handle `Event::ConnectionLost`.
- on connection-lost:
- mark corresponding `NativeProtoConnectionState.closed = true`.
- clear pending synthetic queues for that connection.
- remove handle mappings and protocol connection state while preserving
synthetic connection ID visibility for state queries.
- ensured explicit close-path teardown also removes reverse handle mappings.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_remote_close_marks_peer_connection_closed -- --exact`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic --test soak_tdd`
- `cargo test -p spargio-quic`
## Update: R2 continuation (connection-closed lifecycle event) with Red/Green TDD (2026-03-02)
Implemented another native-proto lifecycle slice so remote close transitions are
observable through the event stream, not only via polled connection state.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_remote_close_emits_connection_closed_event`
Initial red behavior:
- peer close emitted no lifecycle event; the client-side event drain contained
  only prior events such as registration, with no explicit close-transition signal.
### Green phase
Updated `crates/spargio-quic/src/lib.rs`:
- extended `NativeProtoEvent` with:
- `ConnectionClosed { connection_id: u64 }`
- explicit close command path now emits `ConnectionClosed` exactly once when a
tracked connection transitions from open to closed.
- protocol-driven close path in `drive_native_proto_connections(...)` now emits
`ConnectionClosed` on `quinn_proto::Event::ConnectionLost` before handle
retirement and mapping cleanup.
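A hedged consumption sketch for the new variant; the iterable return shape of `drain_events(...)` is assumed:

```rust
// Hedged sketch: iterable return shape of drain_events(...) is assumed.
async fn react_to_closes(driver: &NativeProtoDriver) {
    for event in driver.drain_events(64).await {
        if let NativeProtoEvent::ConnectionClosed { connection_id } = event {
            // drop per-connection bookkeeping held outside the driver
            println!("connection {connection_id} reported closed");
        }
    }
}
```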
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_remote_close_emits_connection_closed_event -- --exact`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic`
## Update: R2 continuation (closed-connection stats accounting) with Red/Green TDD (2026-03-02)
Implemented another native-proto observability slice so connection-close
transitions are tracked in stats alongside lifecycle events.
### Red phase
Added failing test in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_close_transitions_increment_closed_stats`
Initial red behavior:
- `NativeProtoStats` exposed no close counter, so close transitions were not
measurable through stats snapshots.
### Green phase
Updated `crates/spargio-quic/src/lib.rs`:
- extended `NativeProtoStats` with:
- `connections_closed: u64`
- incremented `connections_closed` on first transition to closed in both paths:
- explicit command close (`CloseConnectionForTest`)
- protocol-driven peer close (`Event::ConnectionLost`)
- preserved saturation semantics (`saturating_add`) and no double-counting on
repeated close attempts.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_close_transitions_increment_closed_stats -- --exact`
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --test native_cutover_tdd`
- `cargo test -p spargio-quic --test interop_tdd`
- `cargo test -p spargio-quic`
## Update: R2 continuation (post-handshake datagram roundtrip contract coverage) (2026-03-02)
Added explicit regression coverage for protocol-backed app datagram traffic
across two native drivers after handshake.
### Contract test added
In `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_post_handshake_datagram_roundtrip_tracks_state`
Coverage validates:
- client->server and server->client app datagram exchange over protocol-backed
connection IDs (`connect_for_test` + server accept path).
- payload integrity on both directions.
- per-connection datagram state accounting (`datagrams_sent` /
`datagrams_received`) on both peers.
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd native_proto_driver_post_handshake_datagram_roundtrip_tracks_state -- --exact`
- `cargo test -p spargio-quic --test quic_tdd`
## Update: Full QUIC native cutover execution plan (multi-agent parallel breakdown) (2026-03-02)
Captured an explicit agent-by-agent plan for finishing full native QUIC
integration (`QuicEndpoint`/`QuicConnection` on `NativeProtoDriver`, bridge as
explicit fallback only).
### Agent A (critical path): public native-path cutover spine
Scope:
- replace `QuicEndpoint` native backend internals so
`connect`/`connect_with`/`accept`/`wait_idle` route through
`NativeProtoDriver` instead of `NativeEndpointDispatch`.
- rewire `QuicConnection` native backend internals so
`closed`/`open_uni`/`open_bi`/`accept_uni`/`accept_bi`/datagram ops route
through `NativeProtoDriver` commands.
- preserve current timeout, backpressure, and error-shape contracts.
Red/green slices:
1. endpoint op dispatch red tests against bridge-runtime counters.
2. connection op dispatch red tests against bridge-runtime counters.
3. green implementation for endpoint + connection command routing.
Primary files:
- `crates/spargio-quic/src/lib.rs`
- `crates/spargio-quic/tests/native_cutover_tdd.rs`
- `crates/spargio-quic/tests/quic_tdd.rs`
Dependencies:
- none (first mover).
### Agent B: stream abstraction and API-coupling migration
Scope:
- remove hard native-path reliance on concrete `quinn` stream types for public
operations while preserving ergonomic API shape.
- implement a stable stream wrapper strategy compatible with driver-owned stream
IDs and command-based progression.
- keep `LocalQuicConnection`/`QuicSendConnection` behavior equivalent.
Red/green slices:
1. red tests for stream open/accept/read/write/finish/reset parity under native
backend without direct Tokio actor usage.
2. green wrapper + plumbing implementation.
Primary files:
- `crates/spargio-quic/src/lib.rs`
- `crates/spargio-quic/tests/quic_tdd.rs`
Dependencies:
- starts after Agent A selects/lands endpoint/connection command contracts.
### Agent C: lifecycle, metrics, and close-state parity hardening
Scope:
- ensure native cutover preserves lifecycle semantics (`close`, `closed`,
`wait_idle`, remote-loss propagation).
- ensure metric counters and event emission remain parity-correct
(connect/accept/streams/datagrams/timeouts/close transitions).
Red/green slices:
1. red parity tests for close/idle and metric snapshots.
2. green implementation updates for metric increments and event mapping.
Primary files:
- `crates/spargio-quic/src/lib.rs`
- `crates/spargio-quic/tests/quic_tdd.rs`
Dependencies:
- can start test authoring in parallel; final green depends on Agent A changes.
### Agent D: interop + soak + perf gate re-qualification
Scope:
- re-run and adjust interop matrix against native-default public path.
- expand soak/fault assertions for native cutover regressions.
- validate and refresh perf-gate fixtures/threshold notes only where
materially justified.
Red/green slices:
1. red on interop/soak/perf scripts when cutover shifts behavior.
2. green script/test/fixture updates with documented rationale.
Primary files:
- `crates/spargio-quic/tests/interop_tdd.rs`
- `crates/spargio-quic/tests/soak_tdd.rs`
- `scripts/quic_interop_matrix.sh`
- `scripts/quic_soak_fault.sh`
- `scripts/quic_perf_gate.sh`
- `tests/quic_perf_guardrail_tdd.rs`
Dependencies:
- runs after Agent A/B/C stabilization.
### Agent E: docs and rollout-status sync
Scope:
- update README done/not-done QUIC language to reflect full native cutover.
- sync implementation log summary and operations notes.
- ensure docs tests for status consistency remain green.
Red/green slices:
1. docs status tests red for stale statements.
2. green README/book/log updates.
Primary files:
- `README.md`
- `IMPLEMENTATION_LOG.md`
- `tests/docs_tdd.rs`
- `book/src/*` (if needed)
Dependencies:
- runs after Agent A-D conclusions.
### Parallel execution graph
- lane 1 (critical): Agent A.
- lane 2 (prep parallel): Agent C test authoring.
- lane 3 (prep parallel): Agent B design + test scaffolding.
- lane 4 (post-cutover): Agent B implementation + Agent C green fixes.
- lane 5 (qualification): Agent D.
- lane 6 (final sync): Agent E.
### Merge order recommendation
1. Agent A foundational cutover PR.
2. Agent B stream/wrapper parity PR.
3. Agent C lifecycle/metrics parity PR.
4. Agent D qualification/perf PR.
5. Agent E docs/status PR.
### Exit criteria for "full QUIC native integration"
- native backend public API path no longer depends on Tokio bridge runtime
constructs for endpoint/connection operations.
- bridge backend remains explicit compatibility fallback only.
- interop, soak, and perf gates pass with updated baselines and rationale.
- README and docs no longer list QUIC native cutover as in-progress.
## Update: Agent A+B milestone (public native-path cutover to NativeProtoDriver + stream wrappers) with Red/Green TDD (2026-03-02)
Implemented the critical cutover slice so default native `QuicEndpoint`/`QuicConnection`
operations route through `NativeProtoDriver` (with UDP ingress/egress/timer pump), while
bridge backend remains explicit fallback. Added stream wrapper types so the public API keeps
`write_all` / `read_to_end` / `finish` ergonomics without exposing Tokio-bound internals.
### Red phase
Added failing coverage in `crates/spargio-quic/tests/quic_tdd.rs`:
- `native_proto_driver_connected_event_marks_connection_established`
- `native_proto_driver_stream_write_read_roundtrip_over_proto_connection`
Initial red surfaced missing native-proto capabilities:
- no connection-established signal/state for handshake completion gating
- no stream payload write/read command surface on driver-backed connections
### Green phase
Updated `crates/spargio-quic/src/lib.rs`:
- `NativeProtoConnectionState` now tracks `established`.
- `NativeProtoEvent` now includes `ConnectionEstablished`.
- `drive_native_proto_connections(...)` now maps `quinn_proto::Event::Connected` into
state/event transitions.
- added native stream payload command surface:
- `WriteStreamOnConnection`
- `ReadStreamOnConnection`
- public driver helpers:
- `write_stream_on_connection(...)`
- `read_stream_on_connection(...)`
- added no-wait driver helpers used by sync API points (`finish/reset/close/send_datagram`) to
avoid nested executor re-entry.
- introduced `NativeProtoEndpointBackend`:
- owns per-endpoint spargio runtime + native driver + UDP socket
- runs ingress/egress/timer pump tasks
- tracks accept queue / known connection IDs
- provides handshake/idle/closed wait helpers with timeout semantics
- native `QuicEndpoint::server/client` constructors now initialize `NativeProtoEndpointBackend`
and route native operations through driver-backed flow.
- native `QuicConnection` now supports driver-backed mode via `NativeProtoConnectionHandle`.
- introduced stream wrappers:
- `QuicSendStream`
- `QuicRecvStream`
- updated connection APIs to return wrappers (bridge and native) while preserving ergonomic calls
used by tests (`write_all`, `read_to_end`, `finish`).
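A hedged sketch of the preserved public ergonomics; connect parameters, the `read_to_end` limit, awaiting `finish`, and the `io::Result` error shape follow common quinn-style conventions and are assumptions beyond the method names confirmed above:

```rust
// Hedged sketch: parameter shapes follow quinn-style conventions and are assumed.
async fn bi_stream_sketch(
    endpoint: &QuicEndpoint,
    server_addr: std::net::SocketAddr,
) -> std::io::Result<()> {
    let connection = endpoint.connect(server_addr, "localhost").await?;
    let (mut send, mut recv) = connection.open_bi().await?; // QuicSendStream / QuicRecvStream
    send.write_all(b"ping").await?;
    send.finish().await?;
    let reply = recv.read_to_end(64 * 1024).await?; // bounded read
    assert_eq!(reply, b"pong".to_vec());
    Ok(())
}
```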
### Validation
Executed and passing:
- `cargo test -p spargio-quic --test quic_tdd`
- `cargo test -p spargio-quic --tests`
- `cargo test --test docs_tdd`
Interop + cutover tests now pass with native default path using driver-backed backend:
- `interop_tdd` (raw quinn <-> spargio)
- `native_cutover_tdd`
## Update: Agent C+D+E follow-through (lifecycle parity, qualification re-check, docs sync) (2026-03-02)
Completed the remaining parallel-plan slices after native cutover.
### Lifecycle/metrics parity checks (Agent C)
Validated native cutover behavior against existing parity tests:
- native/bridge dispatch counters and lifecycle assertions in `native_cutover_tdd`
- connection op dispatch metrics in `quic_tdd`
- close/closed/wait-idle behavior under native default path
No additional metric-shape changes were required beyond the native-driver routing.
### Qualification re-check (Agent D)
Re-ran QUIC qualification-oriented suites on the cutover implementation:
- `interop_tdd` (raw quinn interop both directions)
- `native_cutover_tdd`
- `quic_tdd`
- `soak_tdd` lane remains intentionally ignored in regular runs (nightly lane)
All executed suites passed.
### Docs/status sync (Agent E)
Updated project status to reflect completed native cutover and revised not-done scope:
- `README.md`
- added explicit done statement for driver-backed native QUIC path
- replaced the old "full native cutover not finished yet" note with remaining rollout/hardening work
- `tests/docs_tdd.rs`
- updated docs assertion to track new README status wording
Validation:
- `cargo test --test docs_tdd` passes with updated status expectations.
## Update: Work-stealing scheduler optimization roadmap (2026-03-03)
This roadmap is a dedicated track for scheduler policy and cache-behavior
improvements. It is intentionally separate from earlier project-wide milestones.
### Milestone WS0: baseline + red tests (entry gate)
Scope:
- Add red tests for skew/hotspot/fairness behavior and starvation bounds.
- Lock benchmark baselines for scheduler-heavy workloads (`fanout_fanin`,
`net_api` skewed/hotspot lanes).
- Add required scheduler counters for tuning (`failed_steal_streak`,
local-hit ratio, stolen-per-scan).
- Capture initial profiler baselines (`callgrind`/`cachegrind`) for the same
workloads.
Acceptance criteria:
- Red tests fail for missing behavior before implementation changes.
- Baseline benchmark and profiler artifacts are checked in or documented in the
log with reproducible commands.
### Milestone WS1: low-risk cache-line hygiene
Scope:
- Add cache-line padding for hot shared scheduler state with high false-sharing
risk (per-shard counters/metadata touched concurrently).
- Keep runtime API unchanged.
Acceptance criteria:
- All correctness tests stay green.
- Benchmark + profiler comparison shows no regression, and ideally reduced
cache-pressure signals.
### Milestone WS2: adaptive steal gating
Scope:
- Introduce adaptive steal gating/backoff using recent local-work and
steal-success history.
- Keep conservative defaults so behavior remains stable for existing users.
Acceptance criteria:
- Reduced low-value steal scans in low-contention paths.
- Throughput/latency stays neutral-or-better on baseline workloads.
### Milestone WS3: victim selection upgrade
Scope:
- Improve victim selection beyond static stride (cursor + spread/randomization
or lightweight pressure hints).
- Preserve deterministic fallback mode for reproducible tests.
Acceptance criteria:
- Better steal-success ratio under skew/hotspot loads.
- No starvation regressions in fairness tests.
### Milestone WS4: batch stealing + wake policy refinement
Scope:
- Tune batch size policy (latency-friendly small bursts vs throughput-friendly
bigger drains).
- Refine wake behavior to avoid unnecessary cross-shard wake traffic.
Acceptance criteria:
- p95/p99 latency does not regress materially in latency-sensitive lanes.
- Throughput improves or remains neutral in throughput-heavy lanes.
### Milestone WS5: optional queue backend experiment (ROI-gated)
Scope:
- Prototype lower-contention queue backend only if WS0-WS4 evidence indicates
mutex queue contention remains a dominant bottleneck.
Acceptance criteria:
- Ship only on clear benchmark + profiler win with manageable complexity.
- If no clear win, document decision and keep current queue path.
### Milestone WS6: rollout, docs, and CI guardrails
Scope:
- Publish scheduler tuning guidance in README/book.
- Add benchmark + profiler guardrail workflow for scheduler changes.
- Define release-note format for scheduler policy changes and tradeoffs.
Acceptance criteria:
- CI/docs guardrails are green.
- Scheduler changes require paired correctness + benchmark + profiler evidence.
### Parallelizable execution plan
- Lane A (runtime): implement scheduler/padding changes behind red/green tests.
- Lane B (profiling): run `callgrind`/`cachegrind` before/after each milestone
candidate and capture deltas.
- Lane C (bench validation): run criterion guardrails (`throughput`, `p95/p99`)
and validate profiler deltas map to user-visible impact.
- Lane D (docs/ops): update tuning docs and milestone logs in parallel after
each green slice.
### Milestone status update (as of 2026-03-03)
- `WS0` planned (not started).
- `WS1` planned (blocked on WS0 baselines).
- `WS2` planned (blocked on WS0/WS1 evidence).
- `WS3` planned (blocked on WS2 telemetry/profiler evidence).
- `WS4` planned (blocked on WS2/WS3 outcomes).
- `WS5` backlog/ROI-gated (only if contention remains dominant).
- `WS6` planned (runs continuously as milestones land).
## Update: WS0-WS6 implemented with red/green slices (2026-03-03)
Completed the dedicated work-stealing roadmap end-to-end.
### WS0 (baseline diagnostics + red tests) - implemented
Delivered scheduler diagnostics and tests:
- runtime stats now expose:
- `steal_scans`
- `steal_failed_streak_max`
- `stealable_local_hits`
- `RuntimeStats::local_hit_ratio()`
- `RuntimeStats::stolen_per_scan()`
- new scheduler diagnostics coverage:
- `steal_stats_expose_scan_and_locality_diagnostics`
- builder/reporting coverage for new knobs.
### WS1 (cache-line hygiene) - implemented
Applied cache-line padding to hot shared scheduler structures:
- added `CachePadded<T>` (`#[repr(align(64))]`).
- padded per-shard command-depth and native-op-depth arrays.
- padded wake flags and queue internals where relevant.
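A minimal illustration of the padding idea (not the in-tree definition):

```rust
use std::sync::atomic::AtomicU64;

// Illustrative only: align each hot per-shard counter to its own 64-byte cache
// line so concurrent updates from different shards do not false-share.
#[repr(align(64))]
struct CachePadded<T>(T);

// Example layout: one padded depth counter per shard instead of a tightly
// packed array of atomics.
struct PerShardDepths {
    command_depth: Box<[CachePadded<AtomicU64>]>,
}
```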
### WS2 (adaptive steal gating) - implemented
Implemented adaptive gating/backoff:
- added policy knobs:
- `steal_locality_margin`
- `steal_fail_cost`
- `steal_backoff_min`
- `steal_backoff_max`
- steal loop now applies local-vs-migration gate and adaptive cooldown after
repeated low-value scans.
### WS3 (victim selection upgrade) - implemented
Implemented probe-based victim selection:
- added `steal_victim_probe_count`.
- each steal scan samples multiple candidates and targets the largest estimated
backlog victim (deterministic cursor/stride progression).
### WS4 (batch stealing + wake refinement) - implemented
Implemented dynamic batch steals and wake coalescing:
- added `steal_batch_size`.
- steal loop steals batches under high backlog.
- wake policy now coalesces redundant wakeups via per-shard atomic wake flags.
- added wake diagnostics:
- `stealable_wake_sent`
- `stealable_wake_coalesced`
- added coverage:
- `stealable_wake_coalescing_tracks_bursty_submissions`.
### WS5 (optional backend experiment) - implemented
Added optional lower-contention queue backend (default unchanged):
- new public enum: `StealableQueueBackend::{Mutex, SegQueueExperimental}`.
- new builder API: `RuntimeBuilder::stealable_queue_backend(...)`.
- default remains `Mutex` for compatibility.
- added coverage:
- `runtime_builder_supports_experimental_stealable_queue_backend`.
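A hedged opt-in sketch; only `stealable_queue_backend(...)` and the enum come from this log, while `RuntimeBuilder::new()` and the per-knob builder methods are assumed to mirror the knob names from WS2-WS4:

```rust
// Hedged sketch: builder entry point and per-knob method names are assumed.
fn experimental_backend_opt_in() -> RuntimeBuilder {
    RuntimeBuilder::new()
        .stealable_queue_backend(StealableQueueBackend::SegQueueExperimental)
        .steal_victim_probe_count(3)
        .steal_batch_size(6)
    // default stays StealableQueueBackend::Mutex unless opted in explicitly
}
```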
### WS6 (rollout/docs/CI guardrails) - implemented
Added profiler lane tooling, CI wiring, and docs updates:
- scripts:
- `scripts/bench_scheduler_profile.sh` (callgrind + cachegrind capture).
- `scripts/scheduler_profile_guardrail.sh` (ratio checks against baseline).
- CI:
- nightly profile lane wired in `.github/workflows/ci.yml`.
- docs:
- updated `README.md` done/not-done scheduler statements.
- expanded `book/src/scheduler_tuning.md` with new knobs/metrics.
- ops TDD:
- `tests/scheduler_profile_ops_tdd.rs` verifies script presence and CI wiring.
### Validation executed
- `cargo test --test runtime_tdd --test slices_tdd --test scheduler_profile_ops_tdd --test docs_tdd`
- `SUMMARY_JSON=target/scheduler_profiles/dev_summary.json WARMUP=0.005 MEASURE=0.01 SAMPLES=10 ./scripts/bench_scheduler_profile.sh fanout_fanin_skewed spargio_io_uring`
- `MAX_CALLGRIND_IR_RATIO=2.5 MAX_CACHEGRIND_D1MR_RATIO=2.5 MAX_CACHEGRIND_D1MW_RATIO=2.5 ./scripts/scheduler_profile_guardrail.sh tests/fixtures/scheduler_profile/fanout_fanin_skewed_spargio_io_uring.json target/scheduler_profiles/dev_summary.json`
### Updated milestone status
- `WS0` completed.
- `WS1` completed.
- `WS2` completed.
- `WS3` completed.
- `WS4` completed.
- `WS5` completed (experimental backend added; default unchanged).
- `WS6` completed.
## Follow-up: calibration + rollout quality backlog (2026-03-03)
Post-implementation quality work items for scheduler policy stabilization:
- Run broader A/B matrix beyond `fanout_fanin`:
- `net_api` hotspot/rotation/pipeline shapes.
- repeated runs and fixed CPU affinity to reduce noise.
- Tune defaults for adaptive knobs from measured data:
- `steal_locality_margin`
- `steal_fail_cost`
- `steal_backoff_min` / `steal_backoff_max`
- `steal_victim_probe_count`
- `steal_batch_size`
- Decide status of `StealableQueueBackend::SegQueueExperimental`:
- keep experimental vs promote as default/primary option.
- Harden profiler guardrails:
- keep/update scheduler baseline fixture(s).
- tighten ratio thresholds once variance is well-characterized.
- Validate on longer soak runs for sustained-skew tail-latency behavior.
## Update: calibration + rollout quality execution (2026-03-03)
Executed the post-WS calibration backlog end-to-end.
### 1) Broader A/B matrix with fixed affinity and repeats
Added `scripts/bench_scheduler_calibration.sh` and ran fixed-affinity (`taskset
0-3`) repeated A/B (`REPEATS=3`) on the requested `net_api` shapes against
pre-WS baseline (`43a0462`) vs current WS implementation:
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`
- `net_keyed_hotspot_rotation_4k/spargio_tcp_keyed_router_hotspot`
Calibration summary (`target/scheduler_profiles/net_api_calibration_ws.json`):
- stream hotspot rotation: `+0.095%` (flat)
- pipeline hotspot rotation: `+0.043%` (flat)
- keyed hotspot rotation: `-0.053%` (flat)
Interpretation: scheduler changes are neutral on these skewed `net_api` shapes
under the selected harness settings.
### 2) Default tuning sweep and decision
Tested an aggressive default profile candidate:
- `steal_victim_probe_count=3`
- `steal_batch_size=6`
- `steal_locality_margin=0`
- `steal_backoff_max=16`
Against current defaults, this profile remained mixed/flat
(`target/scheduler_profiles/net_api_tuning_profile_a.json`):
- stream hotspot rotation: `-0.221%` (flat)
- pipeline hotspot rotation: `+0.480%` (flat, slight regression)
- keyed hotspot rotation: `-0.306%` (flat)
Decision: keep runtime defaults unchanged (no clear all-shapes win).
### 3) `SegQueueExperimental` promotion decision
Used benchmark env override hooks (`SPARGIO_BENCH_*`) added to
`benches/net_api.rs` and `benches/fanout_fanin.rs` to compare queue backend
profiles without changing runtime defaults.
`net_api` calibration with `SPARGIO_BENCH_STEALABLE_QUEUE_BACKEND=segqueue`
(`target/scheduler_profiles/net_api_tuning_segqueue.json`) was mixed/flat:
- stream hotspot rotation: `-0.266%`
- pipeline hotspot rotation: `-0.686%`
- keyed hotspot rotation: `+0.224%`
Sequential `fanout_fanin_balanced/spargio_io_uring` sanity check (fixed
affinity) showed slight regression for segqueue lane:
- default (mutex): ~`1.2306 ms`
- segqueue experimental: ~`1.2560 ms` (~`+2.1%` slower)
Decision: keep `StealableQueueBackend::SegQueueExperimental` as experimental;
do not promote to default.
### 4) Guardrail hardening
Refreshed scheduler profiler fixtures and tightened nightly guardrails:
- fixtures:
- `tests/fixtures/scheduler_profile/fanout_fanin_skewed_spargio_io_uring.json`
- `tests/fixtures/scheduler_profile/fanout_fanin_balanced_spargio_io_uring.json`
- nightly CI scheduler profiling now covers both skewed + balanced fanout
shapes.
- thresholds tightened from permissive values to:
- `MAX_CALLGRIND_IR_RATIO=1.35`
- `MAX_CACHEGRIND_D1MR_RATIO=1.35`
- `MAX_CACHEGRIND_D1MW_RATIO=1.35`
### 5) Soak validation
Executed sustained-skew soak lane:
- `cargo test --features uring-native --test stress_tdd -- --ignored`
Result: both ignored soak tests passed.
### 6) Rollout summary
- Broader matrix executed with fixed affinity and repeat controls.
- Runtime defaults intentionally kept stable based on measured neutrality.
- Experimental queue backend remains non-default based on the measured outcomes.
- Profiling guardrails hardened and expanded in nightly CI.
- Soak lane validated and passing.
## Roadmap: full `du` metadata parity (2026-03-03)
Objective: close the `README` "not done" gap for native directory traversal and
metadata completeness needed for a production-grade `du`-style implementation.
### Target parity outcomes
- Native async directory traversal API (no blocking traversal in hot path).
- Metadata surface sufficient for `du` semantics:
- allocated-size accounting (`stx_blocks`-based).
- hardlink dedupe keys (`dev` + `ino`).
- mode/file-type and symlink policy decisions.
- Stable policy surface for:
- apparent size vs allocated size.
- follow vs no-follow symlinks.
- one-filesystem boundary behavior.
- error-policy behavior (`skip` / `fail-fast` style).
- Correctness coverage for sparse files, hardlinks, symlink cycles, mount
boundaries, and permission-denied paths.
### Milestones
#### DU0: contract freeze + API sketch
- Define public API contracts for traversal and metadata fields:
- low-level native wrappers.
- high-level `fs` traversal helpers.
- optional `du` helper API.
- Lock behavior for edge policies (links, mounts, errors).
- Add red tests that assert planned API symbols/docs references.
#### DU1: metadata parity extension (`statx` field completion)
- Extend `StatxMetadata` beyond current lite subset to include fields required
for `du` correctness:
- inode, device ids, allocated blocks, block size, file type bits, and
relevant attribute masks/flags.
- Add explicit mask/options controls and typed fallbacks.
- Red/green tests:
- field population on supported kernels.
- deterministic fallback behavior when native support is unavailable.
#### DU2: native directory enumeration wrapper (`getdents64`)
- Add low-level unsafe-op wrapper and safe boundary for directory entry fetch.
- Return typed entries with name + inode + file type (+ cookie/offset where
useful).
- Red/green tests for:
- normal traversal batches.
- end-of-directory semantics.
- invalid/unsupported kernel behavior.
#### DU3: high-level async `read_dir` surface
- Build ergonomic `spargio::fs` traversal API on top of DU2.
- Add iterator/stream-style consumption suitable for recursive walkers.
- Red/green tests for:
- complete enumeration.
- stable error propagation behavior.
- symlink handling mode toggles.
#### DU4: `du` accounting engine core
- Implement recursive walker that consumes DU3 + DU1 metadata.
- Add accounting modes:
- `allocated` (default, `blocks * 512`-style semantics).
- `apparent` (`size`-style semantics).
- Add hardlink dedupe set keyed by `(dev, ino)`.
- Red/green tests for sparse files and hardlink counting correctness.
#### DU5: filesystem-boundary + symlink policy completion
- Add root-device capture and one-filesystem boundary filtering.
- Add explicit symlink-follow mode with cycle protection.
- Red/green tests for:
- cross-device skip behavior.
- symlink loops and bounded traversal.
- mixed trees (file/dir/link/device-boundary).
#### DU6: fallback and capability model hardening
- Define capability gates for kernels lacking full native opcode support.
- Ensure graceful degraded path behavior remains correct (even if slower).
- Add red/green tests validating identical semantics across native and fallback
paths for representative fixtures.
#### DU7: correctness corpus + differential checks
- Build reusable filesystem fixture corpus:
- sparse, hardlink fanout, symlink chains/loops, deep trees, permission
barriers.
- Add differential checks versus a reference implementation (`du`-style expected
outputs) for deterministic fixture trees.
- Add long-running traversal stability tests.
#### DU8: performance, profiling, and guardrails
- Add targeted traversal/metadata benchmarks.
- Add profiler lanes (`callgrind`/`cachegrind`) and guardrail thresholds for new
traversal paths.
- Track hotspot regressions before enabling "default recommended" guidance.
#### DU9: docs + rollout
- Update README/book with:
- API usage.
- semantics matrix (`allocated` vs `apparent`, links, mounts, errors).
- kernel capability notes.
- Add migration guidance for existing users currently doing blocking traversal.
- Final "done/not done" sync and acceptance checklist closeout.
### Parallel execution plan (multi-agent)
- Lane A (Metadata): DU1 + DU6 metadata capability pieces.
- Lane B (Traversal primitives): DU2.
- Lane C (High-level API): DU3 (starts once DU2 API shape is stable).
- Lane D (Accounting semantics): DU4 + DU5 (starts once DU1+DU3 land).
- Lane E (Quality): DU7 fixture corpus and differential tests (can start early,
final assertions after DU4/DU5).
- Lane F (Perf/docs): DU8 + DU9 (starts once DU4 baseline is functional).
### Dependency graph (for scheduling)
- DU0 first.
- DU1 and DU2 can run in parallel after DU0.
- DU3 depends on DU2.
- DU4 depends on DU1 + DU3.
- DU5 depends on DU4.
- DU6 depends on DU1 + DU2 (and can continue while DU4/DU5 progress).
- DU7 can start fixture scaffolding early; full differential checks depend on
DU4 + DU5 + DU6.
- DU8 depends on DU4 minimum functionality.
- DU9 finalizes after DU7 + DU8.
## Update: DU roadmap execution (2026-03-03)
Implemented DU0–DU9 execution slices with parallel lane scheduling (metadata,
dirent primitives, high-level API, accounting semantics, and quality/docs).
### DU0: contract freeze + red tests
- Added red contract coverage in:
- `tests/du_parity_tdd.rs`
- Initial failures validated missing APIs/fields before implementation.
### DU1: metadata parity extension
- Expanded `StatxMetadata` in `src/lib.rs` with du-relevant fields:
- `ino`, `blocks`, `blksize`, `dev`, `rdev`, `attributes`,
`attributes_mask`.
- Added file-type helpers:
- `StatxMetadata::{is_dir,is_file,is_symlink}`.
- `metadata_lite` parity assertions now verify inode/block population.
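For orientation, a minimal sketch of the du-relevant shape this implies. The field
list follows the bullets above; the `size`/`mode` fields and the helper bodies are
assumptions based on standard `statx` mode-bit semantics, not the crate's exact code.
```rust
// Illustrative only: field names follow this log; helper bodies assume standard
// statx/mode-bit semantics (S_IFMT masking) and may differ from the real crate.
#[derive(Debug, Clone, Copy, Default)]
pub struct StatxMetadata {
    pub size: u64,            // apparent size (assumed pre-existing lite field)
    pub mode: u16,            // file type + permission bits (assumed field name)
    pub ino: u64,
    pub blocks: u64,          // allocated 512-byte blocks, for du-style accounting
    pub blksize: u32,
    pub dev: u64,
    pub rdev: u64,
    pub attributes: u64,
    pub attributes_mask: u64,
}

impl StatxMetadata {
    pub fn is_dir(&self) -> bool {
        (u32::from(self.mode) & libc::S_IFMT) == libc::S_IFDIR
    }
    pub fn is_file(&self) -> bool {
        (u32::from(self.mode) & libc::S_IFMT) == libc::S_IFREG
    }
    pub fn is_symlink(&self) -> bool {
        (u32::from(self.mode) & libc::S_IFMT) == libc::S_IFLNK
    }
}
```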
### DU2: low-level directory enumeration wrapper
- Added low-level extension surface:
- `spargio::extension::fs::{DirEntryType, DirEntry, read_dir_entries(...)}`.
- Implementation uses `getdents64` parsing (`SYS_getdents64`) with compatibility
fallback to `std::fs::read_dir` when unsupported.
- Added dedicated coverage:
- `extension_read_dir_entries_exposes_low_level_dirent_surface`.
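To make the DU2 surface concrete, a self-contained sketch of `getdents64`-style
enumeration with raw-offset parsing of `linux_dirent64` (Linux, `libc` crate
assumed). The crate's actual wrapper additionally maps entries into typed
`DirEntry` values and wires the `std::fs::read_dir` fallback; this shows only the
syscall loop.
```rust
// Illustrative getdents64 enumeration; not the crate's implementation.
use std::ffi::{CStr, OsString};
use std::fs::File;
use std::io;
use std::os::unix::ffi::OsStringExt;
use std::os::unix::io::AsRawFd;
use std::path::Path;

/// Returns (inode, d_type, name) tuples for every entry except "." and "..".
pub fn list_dir_raw(path: &Path) -> io::Result<Vec<(u64, u8, OsString)>> {
    let dir = File::open(path)?;
    let mut out = Vec::new();
    let mut buf = vec![0u8; 64 * 1024];
    loop {
        let n = unsafe {
            libc::syscall(
                libc::SYS_getdents64,
                dir.as_raw_fd(),
                buf.as_mut_ptr(),
                buf.len(),
            )
        };
        if n < 0 {
            return Err(io::Error::last_os_error()); // caller may fall back here
        }
        if n == 0 {
            break; // end-of-directory
        }
        let mut off = 0usize;
        while off < n as usize {
            // linux_dirent64 layout: d_ino(8) d_off(8) d_reclen(2) d_type(1) d_name(NUL-terminated)
            let rec = &buf[off..];
            let ino = u64::from_ne_bytes(rec[0..8].try_into().unwrap());
            let reclen = u16::from_ne_bytes(rec[16..18].try_into().unwrap()) as usize;
            let d_type = rec[18];
            let name_c = unsafe { CStr::from_ptr(rec[19..].as_ptr().cast()) };
            if name_c.to_bytes() != b"." && name_c.to_bytes() != b".." {
                out.push((ino, d_type, OsString::from_vec(name_c.to_bytes().to_vec())));
            }
            off += reclen;
        }
    }
    Ok(out)
}
```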
### DU3: high-level async `read_dir`
- Added high-level API:
- `spargio::fs::{DirEntryType, DirEntry, read_dir(...)}`.
- Wires to extension lane and returns typed entry data (name/path/inode/type).
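A hypothetical usage sketch of this surface; the `await` shape, entry field names,
and error type are assumptions rather than the crate's documented API, only the
`spargio::fs::read_dir` path and the typed name/path/inode/type data come from the
bullets above.
```rust
// Hypothetical usage only; exact signatures and field names are assumptions.
async fn count_subdirs(root: &std::path::Path) -> std::io::Result<usize> {
    let entries = spargio::fs::read_dir(root).await?;
    Ok(entries
        .iter()
        .filter(|entry| entry.kind == spargio::fs::DirEntryType::Dir)
        .count())
}
```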
### DU4: `du` accounting core
- Added API and policies:
- `spargio::fs::{du(...), DuOptions, DuSummary, DuSizeMode}`.
- Implemented accounting modes:
- `Allocated` (`blocks * 512`) and `Apparent` (`size`).
- Implemented hardlink dedupe keyed by `(dev, ino)` (configurable via
`hardlink_dedupe(bool)`).
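A compact, self-contained sketch of the accounting semantics described above. Names
echo this log, but the real engine also threads traversal, policies, and summary
reporting; this shows only the size-mode and dedupe logic.
```rust
// Illustrative accounting core: allocated vs apparent sizing plus (dev, ino)
// hardlink dedupe. Not the crate's implementation; semantics only.
use std::collections::HashSet;

pub enum DuSizeMode {
    Allocated, // blocks * 512
    Apparent,  // file size
}

pub struct DuAccumulator {
    mode: DuSizeMode,
    dedupe_hardlinks: bool,
    seen: HashSet<(u64, u64)>, // (dev, ino) keys already counted
    pub total_bytes: u64,
}

impl DuAccumulator {
    pub fn new(mode: DuSizeMode, dedupe_hardlinks: bool) -> Self {
        Self { mode, dedupe_hardlinks, seen: HashSet::new(), total_bytes: 0 }
    }

    pub fn add_file(&mut self, dev: u64, ino: u64, nlink: u64, size: u64, blocks: u64) {
        // A multiply-linked inode is counted at most once per (dev, ino) key.
        if self.dedupe_hardlinks && nlink > 1 && !self.seen.insert((dev, ino)) {
            return;
        }
        self.total_bytes += match self.mode {
            DuSizeMode::Allocated => blocks * 512,
            DuSizeMode::Apparent => size,
        };
    }
}
```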
### DU5: symlink + filesystem-boundary policy
- Added `DuSymlinkMode::{NoFollow, Follow}` with loop-safe traversal behavior.
- Added `one_file_system(bool)` policy and cross-device skip tracking in
`DuSummary::skipped_cross_device`.
- Added tests for looped symlink traversal and one-filesystem behavior.
### DU6: fallback and capability hardening
- Directory enumeration path now degrades deterministically:
- `getdents64` -> `std::fs::read_dir` fallback on unsupported kernels.
- Added `DuErrorMode::{FailFast, Skip}` and skip counters
(`DuSummary::skipped_errors`).
- Added tests covering fail-fast vs skip behavior on broken symlink targets.
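Taken together, a hypothetical end-to-end call shape for the DU4–DU6 policy
surface. Type and policy names follow this log; the builder methods (`size_mode`,
`symlink_mode`, `error_mode`), the `Default` impl, and the `await` shape are
assumptions.
```rust
// Hypothetical usage only; anything not named in the bullets above is assumed.
async fn disk_usage(root: &std::path::Path) -> std::io::Result<()> {
    let opts = spargio::fs::DuOptions::default()
        .size_mode(spargio::fs::DuSizeMode::Allocated)
        .hardlink_dedupe(true)
        .one_file_system(true)
        .symlink_mode(spargio::fs::DuSymlinkMode::NoFollow)
        .error_mode(spargio::fs::DuErrorMode::Skip);
    let summary = spargio::fs::du(root, opts).await?;
    println!(
        "skipped_errors={} skipped_cross_device={}",
        summary.skipped_errors, summary.skipped_cross_device
    );
    Ok(())
}
```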
### DU7: fixture/correctness corpus expansion
- Expanded DU correctness tests to include:
- sparse files
- hardlink dedupe
- symlink loops
- broken symlink error-policy behavior
- cross-device skip behavior
- Current corpus lives in `tests/du_parity_tdd.rs` and executes in CI test lane.
### DU8: traversal benchmark lane
- Added benchmark target:
- `benches/du_api.rs`
- Added Cargo bench registration:
- `Cargo.toml` -> `[[bench]] name = "du_api"`.
- Bench covers:
- `fs_du_allocated`
- `fs_du_apparent`
- `fs_read_dir_root`
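For reference, a minimal Criterion skeleton of the kind of registration this lane
uses; the actual bodies in `benches/du_api.rs` drive the runtime and fixture trees,
which are omitted here.
```rust
// Minimal Criterion skeleton; real bench bodies (runtime setup, fixture trees,
// du/read_dir calls) are not shown.
use criterion::{criterion_group, criterion_main, Criterion};

fn fs_du_allocated(c: &mut Criterion) {
    c.bench_function("fs_du_allocated", |b| {
        b.iter(|| {
            // run an allocated-mode du traversal over a prepared fixture tree here
        });
    });
}

fn fs_read_dir_root(c: &mut Criterion) {
    c.bench_function("fs_read_dir_root", |b| {
        b.iter(|| {
            // enumerate a fixture directory via the high-level read_dir surface here
        });
    });
}

criterion_group!(benches, fs_du_allocated, fs_read_dir_root);
criterion_main!(benches);
```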
### DU9: docs/rollout sync
- Updated README done/not-done sections:
- done: built-in `read_dir`/`du` APIs and low-level extension dirent surface.
- not-done: clarified that the remaining gap is the full in-ring traversal submission path.
### Validation run set
- `cargo test --features uring-native --test du_parity_tdd`
- `cargo test --features uring-native`
- `cargo bench --features uring-native --bench du_api --no-run`
## Update: exploratory benchmark expansion (2026-03-03)
Expanded and documented exploratory `net_api` workloads to cover
queue-depth-insensitive coordination shapes and mixed fs/net deadline-churn shapes
where dispatch/runtime behavior is often the bottleneck.
### Benchmarks added (documented + implemented)
Previously added in this benchmark lane and now documented in one place:
- `net_keyed_hotspot_rotation_4k_window64_cpu`
- `ingress_dispatch_to_workers_rr_256b_ack`
- `fs_net_microservice_4k_read_then_256b_reply_qd1`
- `fanout_fanin_rotating_hot_partition_4k_window32`
- `session_owner_with_spillover_4k`
- `net_burst_flip_imbalance_4k`
- `fanin_barrier_micro_batches_1k`
- `serial_dep_chain_rpc_256b`
- `keyed_hotspot_flip_p99_4k`
- `fanin_barrier_rounds_1k`
- `wakeup_sparse_event_rtt_64b`
- `timer_cancel_reschedule_storm`
- `mixed_control_data_plane_4k_plus_64b`
- `bounded_pipeline_backpressure_4k_window2`
- `post_io_cpu_locality_4k_window1`
- `fs_net_microservice_deadline_dispatch_4k_read_256b_reply`
Newly implemented variant set from the follow-up request:
- `net_echo_rtt_deadline_routing_256b`
- `net_stream_multitenant_4k_window8`
- `net_stream_hotflip_4k`
- `net_pipeline_barrier_4k_window4`
- `keyed_router_with_session_owner_spillover_4k`
- `fs_metadata_then_reply_qd1`
### Harness updates
`benches/net_api.rs`:
- Added benchmark constants and groups for the 6 new variants above.
- Added `FsBenchFixture::metadata_qd1(...)` for metadata-heavy request-path
shapes.
- Added (and retained existing) deadline-churn mixed loops using the existing
  timer-storm command path across the Tokio/Spargio/Compio harnesses.
- Registered all new groups in `criterion_group!(benches, ...)`.
`Cargo.toml`:
- Enabled Compio `time` feature for timer-storm workloads:
- `compio` features now include `"time"`.
### Run commands
- Build verification:
- `cargo fmt --all`
- `cargo bench --bench net_api --features uring-native --no-run`
- Exploratory benchmark runs:
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 fs_net_microservice_deadline_dispatch_4k_read_256b_reply`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 net_echo_rtt_deadline_routing_256b`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 net_stream_multitenant_4k_window8`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 net_stream_hotflip_4k`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 net_pipeline_barrier_4k_window4`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 keyed_router_with_session_owner_spillover_4k`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 fs_metadata_then_reply_qd1`
### Notable outcomes (p99 speedup: baseline/spargio)
- Strong Spargio wins on deadline-churn microservice variants:
- `fs_net_microservice_deadline_dispatch_4k_read_256b_reply`:
- vs Tokio: `10.9x`
- vs Compio: `1.6x`
- `net_echo_rtt_deadline_routing_256b`:
- vs Tokio: `8.4x`
- vs Compio: `1.5x`
- `fs_metadata_then_reply_qd1`:
- vs Tokio: `11.6x`
- vs Compio: `1.2x`
- Moderate/near-parity outcomes on several other variants:
- `net_stream_multitenant_4k_window8`: ~parity vs Tokio, better than Compio.
- `net_pipeline_barrier_4k_window4`: slight win vs Tokio, clear win vs
Compio.
- Some hotspot-flip shapes still favor Compio:
- `net_stream_hotflip_4k`.
README was updated to consolidate exploratory workload results into one table
covering all entries above.
## Update: high-depth exploratory suite + p99-only format shift (2026-03-03)
Implemented the requested high-depth workload set and refreshed the consolidated
exploratory benchmark table format.
### New high-depth workloads
Added to `benches/net_api.rs`:
- `high_depth_fanout_first_k_cancel_256b_window64`
- `high_depth_multitenant_keyed_router_4k_window64`
- `high_depth_barriered_pipeline_4k_window64`
- `high_depth_deadline_gateway_256b_window64`
- `high_depth_fs_net_admission_control_4k_read_256b_reply_window64`
Supporting harness updates:
- Added high-depth constants for fanout, keyed routing, barrier pipeline,
deadline gateway, and fs+net admission-control scenarios.
- Generalized `run_fs_net_deadline_loop(...)` with a `reads_per_epoch` parameter
  so the same helper can serve multiple fs+net workload shapes.
- Registered all new groups in `criterion_group!(benches, ...)`.
### Validation and run set
- `cargo fmt --all`
- `cargo bench --bench net_api --features uring-native --no-run`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 high_depth_fanout_first_k_cancel_256b_window64`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 high_depth_multitenant_keyed_router_4k_window64`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 high_depth_barriered_pipeline_4k_window64`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 high_depth_deadline_gateway_256b_window64`
- `cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 high_depth_fs_net_admission_control_4k_read_256b_reply_window64`
### README format change
Benchmark result tables in `README.md` now report:
- runtime latencies as `p99`.
- speedups as `baseline_p99 / spargio_p99`.
The exploratory benchmark section now sits under the benchmark-interpretation
section of the README, titled:
- `Exploratory Benchmarks (Subject to Change, May Be Removed)`.
### Notable high-depth outcomes (p99)
- `high_depth_fanout_first_k_cancel_256b_window64`:
- vs Tokio: `1.7x`
- vs Compio: `1.7x`
- `high_depth_deadline_gateway_256b_window64`:
- vs Tokio: `3.6x`
- vs Compio: `1.1x`
- `high_depth_fs_net_admission_control_4k_read_256b_reply_window64`:
- vs Tokio: `4.0x`
- vs Compio: `1.7x`
### Consolidated exploratory run command (moved from README)
```bash
for bench in \
net_keyed_hotspot_rotation_4k_window64_cpu \
ingress_dispatch_to_workers_rr_256b_ack \
fs_net_microservice_4k_read_then_256b_reply_qd1 \
fanout_fanin_rotating_hot_partition_4k_window32 \
session_owner_with_spillover_4k \
net_burst_flip_imbalance_4k \
fanin_barrier_micro_batches_1k \
serial_dep_chain_rpc_256b \
keyed_hotspot_flip_p99_4k \
fanin_barrier_rounds_1k \
wakeup_sparse_event_rtt_64b \
timer_cancel_reschedule_storm \
mixed_control_data_plane_4k_plus_64b \
bounded_pipeline_backpressure_4k_window2 \
post_io_cpu_locality_4k_window1 \
fs_net_microservice_deadline_dispatch_4k_read_256b_reply \
net_echo_rtt_deadline_routing_256b \
net_stream_multitenant_4k_window8 \
net_stream_hotflip_4k \
net_pipeline_barrier_4k_window4 \
keyed_router_with_session_owner_spillover_4k \
fs_metadata_then_reply_qd1 \
high_depth_fanout_first_k_cancel_256b_window64 \
high_depth_multitenant_keyed_router_4k_window64 \
high_depth_barriered_pipeline_4k_window64 \
high_depth_deadline_gateway_256b_window64 \
high_depth_fs_net_admission_control_4k_read_256b_reply_window64; do
cargo bench --bench net_api --features uring-native -- --noplot --sample-size 20 "$bench"
done
```
## Update: README benchmark reporting switched to mean iteration latency (2026-03-04)
Rationale:
- The previous README table format used p99 over Criterion sample iterations.
- Those p99 values are not request-level tails; they are distribution tails of
per-iteration benchmark samples (`sample.json`), which can be misleading for
readers expecting request-level percentile semantics.
What changed:
- Benchmark tables in `README.md` now report Criterion `mean` wall-clock
iteration latency (`estimates.json` point estimates).
- Speedup columns now use `baseline_mean / spargio_mean`.
- Existing benchmark table values were refreshed from local Criterion artifacts
under `target/criterion/*/new/estimates.json`.
Notes:
- This keeps comparisons stable and easier to interpret unless and until we add
  explicit request-level latency histograms inside the benchmark harnesses.
## Update: docs.rs coverage hardening for user-facing core API (2026-03-04)
Implemented a focused documentation pass for the public runtime/boundary core
API and verified docs coverage at 100% for the default docs.rs feature set.
### What was added
- User-focused rustdoc for:
- `ShardId`
- `boundary` module (`BoundaryClient`, `BoundaryServer`, tickets, errors,
stats, request envelope helpers)
- Time/cancellation primitives (`sleep`, `sleep_until`, `Sleep`, `timeout*`,
`Interval`, `CancellationToken`, `TaskGroup`)
- Core placement/message/runtime surface (`Event`, `RingMsg`,
`TaskPlacement`, `RuntimeBuilder`, `Runtime`, `RuntimeHandle`,
`RemoteShard`, `ShardCtx`, errors, join/ticket futures)
- `RuntimeStats` fields and helper ratios.
### Guardrails
- Added lint enforcement for the default (non-`uring-native`) API surface:
- `#![cfg_attr(not(feature = "uring-native"), deny(missing_docs))]`
- This keeps docs.rs-default coverage strict without breaking current
`uring-native` CI/test lanes that still have broader undocumented surfaces.
### Verification
- `RUSTDOCFLAGS='-Dmissing-docs' cargo +nightly doc --no-deps`
- `cargo +nightly rustdoc --lib -- -Zunstable-options --show-coverage`
- Result: `src/lib.rs` documented `218/218` (`100.0%`)
- `cargo test`
- `cargo test --features uring-native`
## Update: Planned QUIC stream-continuity + copy-reduction wave from sparsync findings (2026-03-04)
Context from sparsync profiling:
- First-sync overhead remains dominated by encrypted transport and stream/control churn.
- `sparsync` currently benefits from control-frame batching but still needs lower-overhead long-lived framed streams to reduce stream setup and buffering overhead further.
Planned scope in `spargio-quic` (this wave):
1. Add incremental receive APIs to wrapper streams.
- Introduce `QuicRecvStream::read_chunk(max_bytes)` returning incremental bytes (EOF-aware) instead of forcing `read_to_end` framing.
- Keep `read_to_end` as a compatibility helper built on incremental reads.
2. Add owned-bytes stream I/O methods in native driver to reduce copy churn.
- Add `NativeProtoDriver::{write_stream_bytes_on_connection, read_stream_bytes_on_connection}`.
- Keep existing `Vec<u8>` APIs as compatibility wrappers.
3. Reduce native stream write copy amplification in `QuicSendStream::write_all`.
- Use a single owned buffer and sliced `Bytes` views across partial writes, rather than allocating a new `Vec<u8>` on each retry/write attempt.
4. Add integration tests for incremental stream reads.
- Validate incremental chunk behavior in endpoint client/server bi-stream exchange.
Out-of-scope for this wave (tracked next):
- Full long-lived framed control/data protocol in sparsync (multi-frame per stream loop).
- Deeper transport internals (pacing/ACK/scheduler tuning) beyond stream wrapper + driver payload-path changes.
- Non-crypto transport mode.
Execution note:
- Implement these upstream APIs first; sparsync can then adopt a long-lived stream protocol without requiring `read_to_end`-bounded request framing.
## Update: Completed QUIC stream-continuity + copy-reduction implementation (2026-03-04)
Implemented against `crates/spargio-quic` following the plan above.
### 1) Incremental stream receive API added
- Added:
- `QuicRecvStream::read_chunk(max_bytes) -> io::Result<Option<bytes::Bytes>>`
- Updated:
- `QuicRecvStream::read_to_end(...)` now composes on top of `read_chunk(...)`.
- Result:
- callers no longer need `read_to_end`-bounded framing to consume stream payloads.
- enables long-lived framed protocols (e.g. sparsync control stream loops) with incremental decode.
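As an illustration of the long-lived framed pattern this unlocks, a consumer loop
that decodes length-prefixed frames from incremental chunks. The 4-byte big-endian
prefix, the 64 KiB chunk size, and the `await` shape are assumptions; only
`read_chunk` returning `Ok(None)` at end of stream comes from the API above.
```rust
// Illustrative consumer for a long-lived, length-prefixed frame stream.
use bytes::{Buf, BytesMut};

async fn frame_loop(stream: &mut QuicRecvStream) -> std::io::Result<()> {
    let mut buf = BytesMut::new();
    loop {
        match stream.read_chunk(64 * 1024).await? {
            Some(chunk) => buf.extend_from_slice(&chunk),
            None => return Ok(()), // peer finished the stream
        }
        // Decode every complete frame currently buffered; partial frames wait for
        // the next chunk instead of forcing read_to_end-style framing.
        while buf.len() >= 4 {
            let len = u32::from_be_bytes(buf[0..4].try_into().unwrap()) as usize;
            if buf.len() < 4 + len {
                break;
            }
            buf.advance(4);
            let _frame = buf.split_to(len); // hand the frame to protocol handling here
        }
    }
}
```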
### 2) Owned-bytes native stream I/O in driver
Added new compatibility-preserving APIs:
- `NativeProtoDriver::write_stream_bytes_on_connection(...)`
- `NativeProtoDriver::read_stream_bytes_on_connection(...)`
- mirrored on `NativeProtoDriverSend` and `NativeProtoDriverLocal`.
Existing `Vec<u8>` methods remain and now delegate to the bytes-based methods.
Internal driver changes:
- `NativeProtoCommand::WriteStreamOnConnection` now carries `bytes::Bytes`.
- `NativeProtoCommand::ReadStreamOnConnection` now replies with `Option<bytes::Bytes>`.
- native fallback stream queues now store `bytes::Bytes` instead of `Vec<u8>`.
### 3) Native write path copy amplification reduced
- `QuicSendStream::write_all(...)` native branch now:
- allocates one owned `Bytes` buffer from input,
- retries using zero-copy `Bytes` slicing across partial writes,
- avoids repeated `data.to_vec()` allocation/copy per retry loop.
- Added:
- `QuicSendStream::write_bytes(bytes::Bytes)` for owned-chunk writes.
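The copy-reduction pattern in this item, sketched stand-alone; the `try_write`
closure stands in for the native driver call and is purely illustrative.
```rust
// Stand-alone sketch of the single-allocation retry pattern: one owned Bytes
// buffer up front, then zero-copy `slice(..)` views across partial writes.
use bytes::Bytes;

fn write_all_with_retry(
    data: &[u8],
    mut try_write: impl FnMut(Bytes) -> std::io::Result<usize>,
) -> std::io::Result<()> {
    let mut remaining = Bytes::copy_from_slice(data); // the only copy on this path
    while !remaining.is_empty() {
        // `clone` is a cheap refcount bump; a real loop would also back off when
        // the driver accepts zero bytes.
        let written = try_write(remaining.clone())?;
        remaining = remaining.slice(written..); // advance without copying
    }
    Ok(())
}
```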
### 4) Tests for incremental stream reads
`crates/spargio-quic/tests/quic_tdd.rs`:
- `quic_recv_stream_read_chunk_supports_incremental_reads_native`
- `quic_recv_stream_read_chunk_supports_incremental_reads_bridge`
Both validate incremental chunk consumption over bi-stream exchange.
### Validation
Executed successfully:
- `cargo fmt --all`
- `cargo test -p spargio-quic`
Notes:
- Existing non-fatal warnings about unused internal bridge helper types/functions remain unchanged from prior baseline.
### Follow-on integration target
- sparsync can now adopt a long-lived framed stream protocol using `read_chunk(...)` instead of per-request `read_to_end(...)`/new stream pairs, which is the next step to directly reduce first-sync stream/control churn.
## Update: Additional transport hot-path pass after sparsync long-lived stream adoption (2026-03-04)
Context:
- After switching sparsync to long-lived framed streams, first-sync in daemon mode remained slower than `rsync://`, and profiling continued to point at transport/runtime overhead and memory movement.
- This pass targeted low-risk `spargio-quic` internals that reduce copies and command-loop overhead without protocol changes.
### Plan for this pass
1. Reduce ingress datagram copy count in native backend command path.
2. Trim per-op overhead in stream read/write loops.
3. Remove avoidable allocation in connection drive loop iteration.
4. Re-validate with `spargio-quic` tests and sparsync benchmark harness.
### Implemented
1. Ingress datagram command path now accepts `BytesMut` payloads directly
- Added `NativeProtoDriver::submit_datagram_bytes(remote, payload: bytes::BytesMut)`.
- Kept the existing `submit_datagram(remote, Vec<u8>)` API as a compatibility wrapper (see the sketch after this list).
- Updated `NativeProtoCommand::SubmitDatagram` payload type to `bytes::BytesMut`.
- Native endpoint ingress pump now forwards `BytesMut` payloads directly into driver.
- Driver loop now passes payload directly to `endpoint.handle(...)` instead of reconstructing a new `BytesMut` from a slice.
2. Stream I/O retry loops now avoid repeated driver-handle reconstruction
- In `QuicSendStream::write_all`, `QuicSendStream::write_bytes`, and `QuicRecvStream::read_chunk`, native branch now clones driver once per operation and reuses it across retry loops.
3. Minor queue + loop overhead reductions
- `WriteStreamOnConnection` fallback path now returns `payload.len()` directly instead of map re-lookup after enqueue.
- `drive_native_proto_connections` now iterates `proto_connections.iter_mut()` directly instead of collecting a temporary handles `Vec` each pass.
4. Safety/stability note
- Trialed sub-millisecond stream retry sleep; reverted to `1ms` after instability under benchmark load.
- Current stream retry interval remains `1ms`.
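Item 1's compatibility-wrapper relationship, sketched with a placeholder type; the
real driver routes through its command loop, and the parameter shapes here are
illustrative only.
```rust
// Placeholder driver; the point is the wrapper relationship: the legacy Vec<u8>
// entry pays exactly one copy at the boundary, the BytesMut path pays none.
use bytes::BytesMut;
use std::net::SocketAddr;

struct DriverSketch;

impl DriverSketch {
    fn submit_datagram_bytes(&self, _remote: SocketAddr, payload: BytesMut) {
        // hand the owned buffer straight to endpoint handling; no reconstruction
        let _ = payload;
    }

    fn submit_datagram(&self, remote: SocketAddr, payload: Vec<u8>) {
        let mut owned = BytesMut::with_capacity(payload.len());
        owned.extend_from_slice(&payload);
        self.submit_datagram_bytes(remote, owned);
    }
}
```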
### Validation
- `cargo fmt --all`
- `cargo test -p spargio-quic`
- all tests passed
### Downstream benchmark check (sparsync harness, patched to this workspace)
- `RUNS=5 TRANSPORTS=daemon ./scripts/bench_remote_rsync_vs_sparsync_median.sh`
- `sparsync_first_ms_median=405`
- `sparsync_second_ms_median=28`
- `sparsync_changed_ms_median=55`
- `rsync_remote_first_ms_median=228`
- `RUNS=5 TRANSPORTS=ssh ./scripts/bench_remote_rsync_vs_sparsync_median.sh`
- `sparsync_first_ms_median=408`
- `sparsync_second_ms_median=32`
- `sparsync_changed_ms_median=59`
- `rsync_ssh_first_ms_median=548`
Interpretation:
- These internal optimizations are stable and keep strong warm/churn performance.
- They do not materially close the daemon first-sync gap by themselves.
- Next high-impact lever remains deeper encrypted transport/runtime tuning (buffer reuse/zero-copy direction, pacing/ACK behavior, scheduler handoff overhead).
## Update: Closed review findings on QUIC docs + API test coverage (2026-03-04)
Addressed two medium-severity review findings for unpushed `spargio-quic` changes.
### 1) User-facing docs for new QUIC stream APIs
- Updated user docs:
- `book/src/09_protocol_crates.md`
- added incremental/owned-bytes stream usage section with practical code for:
- `QuicRecvStream::read_chunk(...)`
- `QuicSendStream::write_bytes(...)`
- clarified why this pattern fits long-lived framed protocols better than `read_to_end`.
- `README.md`
- added done-item callout for incremental reads + owned-byte writes in QUIC stream APIs.
### 2) Explicit tests for new hot-path bytes APIs
`crates/spargio-quic/tests/quic_tdd.rs`:
- Added stream API tests:
- `quic_send_stream_write_bytes_roundtrips_native`
- `quic_send_stream_write_bytes_roundtrips_bridge`
- validates forward progress semantics and full payload roundtrip via `write_bytes`.
- Added driver ingress API test:
- `native_proto_driver_ingests_datagram_bytes_and_supports_bounded_drain`
- explicitly exercises `submit_datagram_bytes(...)` and bounded drain behavior.
### Validation
- `cargo test -p spargio-quic`
- `cargo test --workspace`
## Update: Review follow-up for adaptive QUIC retry backoff commit (2026-03-04)
Addressed the remaining review gaps for unpushed commit
`31a86b5` (`perf(quic): reduce native polling latency with adaptive retry backoff`).
### 1) Added targeted tests for retry policy behavior
`crates/spargio-quic/src/lib.rs` now includes dedicated unit tests for
`native_retry_delay(...)`:
- `native_retry_delay_uses_expected_bands`
- verifies threshold mapping:
- retries `< 4` -> `100us`
- retries `< 16` -> `250us`
- retries `>= 16` -> `1ms` (`NATIVE_PROTO_POLL_INTERVAL`)
- `native_retry_delay_is_monotonic_and_capped`
- verifies delay does not decrease as retries grow
- verifies delay never exceeds the `1ms` cap
This gives explicit coverage for the adaptive backoff policy that was previously
untested.
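The banded policy those tests pin down, written out with the thresholds from the
bullets above; the actual function body in `crates/spargio-quic/src/lib.rs` may
differ in detail.
```rust
// Mirrors only the documented delay bands; not a copy of the crate's function.
use std::time::Duration;

const NATIVE_PROTO_POLL_INTERVAL: Duration = Duration::from_millis(1);

fn native_retry_delay(retries: u32) -> Duration {
    if retries < 4 {
        Duration::from_micros(100) // short stalls: poll quickly
    } else if retries < 16 {
        Duration::from_micros(250) // medium stalls: back off a little
    } else {
        NATIVE_PROTO_POLL_INTERVAL // long stalls: cap at the 1ms poll interval
    }
}
```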
### 2) Added user-facing documentation for the behavior
- `README.md`:
- added done-item bullet documenting adaptive QUIC retry backoff and intent.
- `book/src/09_protocol_crates.md`:
- expanded QUIC backend mode section with concrete delay bands and practical
interpretation (lower latency on short stalls, bounded spin on long stalls).
### Validation
- `cargo test -p spargio-quic`
- `cargo test --workspace`
## Update: Native-path peer cert chain support for `QuicConnection::peer_cert_chain_der` (2026-03-06)
Implemented native-default parity for `QuicConnection::peer_cert_chain_der` while keeping the API synchronous.
### Red
- Added native + bridge tests that assert client-side cert chain availability:
- `quic_connection_peer_cert_chain_der_available_native`
- `quic_connection_peer_cert_chain_der_available_bridge`
- Initial native test failed with:
- `NotConnected: "quic connection handle is not quinn-backed"`
### Green
1. Added native driver query for peer cert chain
- New command: `NativeProtoCommand::ConnectionPeerCertChainDer`.
- New driver APIs:
- `NativeProtoDriver::connection_peer_cert_chain_der(connection_id)`
- forwarded in `NativeProtoDriverSend` and `NativeProtoDriverLocal`.
- Driver loop now resolves connection handle and reads `crypto_session().peer_identity()` from `quinn-proto`.
2. Cached native peer cert chain at handshake completion
- In native `connect`, `connect_with`, and `accept`, after `wait_for_established(...)`, the endpoint now fetches peer cert chain from the native driver.
- Capture is best-effort (`Ok(None)` on extraction/query failure) so handshake success is never downgraded into connect/accept failure.
- `NativeProtoConnectionHandle` now stores `peer_cert_chain_der: Option<Vec<Vec<u8>>>`.
- `wrap_native_connection(...)` now takes cached cert chain and attaches it to `QuicConnection`.
3. Updated `QuicConnection::peer_cert_chain_der`
- Native-proto path now returns the cached chain.
- Keeps existing rustls peer-identity decode path for quinn-backed connections.
- Returns `NotConnected` when peer identity/cert chain is unavailable (for example, server side without client auth), consistent with existing behavior.
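A hypothetical client-side check of the resulting behavior; connection setup is
elided, and only the method name and the `NotConnected` semantics come from the
bullets above.
```rust
// Hypothetical usage; return shapes beyond "DER chain or NotConnected error" are
// assumptions.
fn log_peer_chain(conn: &QuicConnection) {
    match conn.peer_cert_chain_der() {
        Ok(chain) => println!("peer presented {} DER certificate(s)", chain.len()),
        Err(err) => println!("peer cert chain unavailable: {err}"),
    }
}
```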
### Additional tests
- Added missing failure-path coverage for both backends:
- `quic_connection_peer_cert_chain_der_missing_without_client_auth_native`
- `quic_connection_peer_cert_chain_der_missing_without_client_auth_bridge`
- Added direct native-driver coverage for the new command path:
- `native_proto_driver_connection_peer_cert_chain_der_matches_handshake_role`
### Validation
- `cargo test -p spargio-quic --test quic_tdd peer_cert_chain_der`
- `cargo fmt --all`
- `cargo test -p spargio-quic`