spargio 0.3.0 - Docs.rs

# Implementation Log

## Snapshot (2026-02-25)

Repository state at start of this log:

- Git repo initialized in `/workspace/spargio`
- Initial implementation committed as:
  - `59d0b34` (`Implement sharded msg-ring-style runtime with TDD tests and benchmarks`)

## Completed So Far

### Design docs

- Added runtime design options:
  - `DESIGN_OPTIONS.md`

### Runtime crate

- Created crate:
  - `spargio`
- Implemented a sharded runtime with:
  - `RuntimeBuilder`, `Runtime`, `ShardCtx`, `RemoteShard`
  - `spawn_on` and `spawn_local`
  - `send_raw` and typed `send` via `RingMsg`
  - `next_event` event stream (`Event::RingMsg`)
  - sender completion tickets (`SendTicket`)

Current backend in this snapshot:

- In-process queue-based message transport (useful as baseline/fallback and for comparative benchmarking).
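
The queue backend described above can be pictured as one in-process queue per shard. The following is a minimal sketch only, assuming `std::sync::mpsc` channels as the queue; the `Mesh` type is hypothetical, and the widths of `tag` and `val` are assumptions because the log does not state them.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Event delivered to a shard; mirrors the `Event::RingMsg` shape named in this log.
/// The field widths are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum Event {
    RingMsg { from: usize, tag: u16, val: u32 },
}

/// Hypothetical sending half: one queue sender per target shard.
#[derive(Clone)]
pub struct Mesh {
    inboxes: Vec<Sender<Event>>,
}

impl Mesh {
    /// Builds `shards` queues and returns the mesh together with each shard's inbox.
    pub fn new(shards: usize) -> (Mesh, Vec<Receiver<Event>>) {
        let (txs, rxs): (Vec<_>, Vec<_>) = (0..shards).map(|_| channel()).unzip();
        (Mesh { inboxes: txs }, rxs)
    }

    /// Queue-based raw send. Returns `false` if the target is unknown or its
    /// inbox has been dropped.
    pub fn send_raw(&self, from: usize, target: usize, tag: u16, val: u32) -> bool {
        match self.inboxes.get(target) {
            Some(tx) => tx.send(Event::RingMsg { from, tag, val }).is_ok(),
            None => false,
        }
    }
}
```

Each shard thread would own one `Receiver<Event>` and drain it in its run loop; the sender shard id travels inside the event, which is what the raw-send test in this snapshot checks.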

### TDD tests

- Added API/behavior tests in `tests/runtime_tdd.rs`:
  - local spawn runs on shard
  - raw send delivers to target with sender shard id
  - typed send round-trips through event path

Workflow used:

- Red: tests failed on placeholder API
- Green: implemented runtime until tests passed

### Benchmarks

- Added Criterion benchmark:
  - `benches/ping_pong.rs`
- Includes:
  - runtime ping-pong
  - simple Tokio baseline
  - simple Glommio baseline (feature-gated)

Feature:

- `glommio-bench` enables Glommio benchmark code path on Linux.

## Validation Results

Executed and passing:

- `cargo test`
- `cargo bench --no-run`
- `cargo bench --no-run --features glommio-bench`

Short benchmark sample run completed:

- `spargio`: ~1.62 ms (sample config)
- `tokio_unbounded_channel`: ~1.53 ms (sample config)
- `glommio_simple`: ~3.77-4.47 ms (with `glommio-bench`)

Note:

- These are quick smoke numbers, not stable performance conclusions.

## Next Work (Requested)

- Add a Linux `io_uring` backend that uses `msg_ring` for cross-shard delivery.
- Keep current queue backend for comparative benchmarks and fallback behavior.
- Preserve existing API so both backends can be measured under similar workloads.

## Update: Linux io_uring Backend Added

Implemented after the snapshot above:

- Added runtime backend selector:
  - `BackendKind::Queue`
  - `BackendKind::IoUring`
- Added builder controls:
  - `RuntimeBuilder::backend(BackendKind)`
  - `RuntimeBuilder::ring_entries(u32)`
- Default backend remains:
  - `BackendKind::Queue`

### Backend behavior

- Queue backend:
  - existing in-process message transport path retained.
- io_uring backend (Linux):
  - each shard owns an `IoUring` instance.
  - `send_raw` issued from a shard thread is routed through the source shard ring using:
    - `IORING_OP_MSG_RING` (`opcode::MsgRingData` via `io-uring` crate)
  - target shard receives an event via ring completion and emits:
    - `Event::RingMsg { from, tag, val }`
  - sender ticket completion is tied to sender-ring completion CQE.
- External/non-shard callers:
  - still supported using queue injection fallback (kept intentionally for safety and portability).

### Runtime loop adjustments

- Added backend-aware loop behavior:
  - queue backend keeps timeout-driven idle wait.
  - io_uring backend prefers busy polling (`yield_now`) to avoid artificial millisecond latency.
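
The two idle policies above differ only in what the loop does when no work is available. A rough sketch, assuming a plain `std::sync::mpsc` command queue; `IdleStrategy` and `poll_once` are illustrative names, not the crate's API.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError, TryRecvError};
use std::thread;
use std::time::Duration;

/// Hypothetical idle policy selected per backend.
#[derive(Clone, Copy, Debug)]
pub enum IdleStrategy {
    /// Block on the queue for at most this long (queue backend style).
    Timeout(Duration),
    /// Never sleep; yield the thread and poll again (io_uring backend style).
    BusyPoll,
}

/// Waits for one command according to `idle`. Returns `None` when nothing
/// arrived during this pass or the sending side has been dropped.
pub fn poll_once<T>(rx: &Receiver<T>, idle: IdleStrategy) -> Option<T> {
    match idle {
        IdleStrategy::Timeout(limit) => match rx.recv_timeout(limit) {
            Ok(cmd) => Some(cmd),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => None,
        },
        IdleStrategy::BusyPoll => match rx.try_recv() {
            Ok(cmd) => Some(cmd),
            Err(TryRecvError::Empty) => {
                // Give other threads a chance without adding a timed sleep.
                thread::yield_now();
                None
            }
            Err(TryRecvError::Disconnected) => None,
        },
    }
}
```

The timeout variant bounds CPU use at the cost of wake-up latency up to the timeout; the busy-poll variant trades CPU for latency.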

### Tests

- Existing tests still pass.
- Added Linux-only backend test:
  - `io_uring_backend_delivers_message`
- Full test status:
  - `cargo test` passes.

### Benchmarks updated

- `benches/ping_pong.rs` now benchmarks:
  - `spargio_queue`
  - `spargio_io_uring` (only when backend init succeeds)
  - `tokio_unbounded_channel`
  - `glommio_simple` (with `glommio-bench` feature)

Validation:

- `cargo bench --no-run` passes
- `cargo bench --no-run --features glommio-bench` passes

Quick benchmark sample (short run config):

- `spargio_queue`: ~1.66-1.70 ms
- `spargio_io_uring`: ~0.60-0.72 ms
- `tokio_unbounded_channel`: ~1.49-1.58 ms
- `glommio_simple`: ~4.05-4.85 ms

## Update: Stricter Benchmark Suite

Implemented to improve comparability and isolate what is being measured:

- Switched to persistent harnesses for steady-state measurements.
- Added matched two-worker topology for baselines:
  - Tokio: dedicated runtime thread, two-worker message loop.
  - Glommio (`glommio-bench`): two executor threads with message channels.
- Added explicit benchmark groups:
  - `steady_ping_pong_rtt`
  - `steady_one_way_send_drain`
  - `cold_start_ping_pong`

### Metric definitions

- `steady_ping_pong_rtt`:
  - per-round request/ack round-trip latency over persistent workers.
- `steady_one_way_send_drain`:
  - repeated one-way sends followed by a flush barrier ack.
  - for `spargio`, this now uses a bounded send-ticket window (`SEND_WINDOW=64`) to avoid fully serial per-send awaiting while preserving backpressure.
  - for Tokio/Glommio channel sends, send completion is synchronous enqueue.
- `cold_start_ping_pong`:
  - includes harness/runtime construction and teardown each iteration.
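
The bounded send-ticket window used by the one-way metric can be sketched generically. This is an illustration under assumptions: `send_windowed` is a hypothetical helper, the `send` closure stands in for issuing one send and returning its ticket, and `wait` stands in for awaiting that ticket.

```rust
use std::collections::VecDeque;

pub const SEND_WINDOW: usize = 64;

/// Issues `total` sends while keeping at most `window` tickets in flight.
/// Returns the highest number of tickets that were outstanding at once.
pub fn send_windowed<T>(
    total: usize,
    window: usize,
    mut send: impl FnMut(usize) -> T,
    mut wait: impl FnMut(T),
) -> usize {
    let window = window.max(1);
    let mut in_flight: VecDeque<T> = VecDeque::with_capacity(window);
    let mut peak = 0;
    for i in 0..total {
        // Backpressure: retire the oldest ticket before exceeding the window.
        if in_flight.len() == window {
            if let Some(oldest) = in_flight.pop_front() {
                wait(oldest);
            }
        }
        in_flight.push_back(send(i));
        peak = peak.max(in_flight.len());
    }
    // Drain barrier: every remaining ticket must complete before returning.
    for ticket in in_flight {
        wait(ticket);
    }
    peak
}
```

With a window of 1 this degenerates to strict per-send awaiting; larger windows overlap sends while still capping outstanding work.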

### Safety constraints observed

- No machine-level or persistent system tuning performed.
- No CPU governor/turbo/IRQ/process-affinity changes applied.
- Benchmarks are runnable on standard developer machines.

### Validation

- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.
- Sample full run completed for non-Glommio path.
- Sample targeted run completed for Glommio path.

### Notes from latest tuning pass

- Updated runtime one-way harness from strict per-send await to windowed in-flight tickets.
- Targeted one-way io_uring sample improved from roughly `1.44 ms` to `1.17 ms` under short Criterion settings.

## Update: Send Path Optimizations (Proceed Phase)

Implemented next optimization wave:

- Added no-ticket send APIs:
  - `RemoteShard::send_raw_nowait(tag, val)`
  - `RemoteShard::send_nowait(msg)`
  - `ShardCtx::send_raw_nowait(target, tag, val)`
- Added shard-local fast path:
  - local sends now enqueue into a local per-shard queue (`LocalCommand`) and no longer bounce through the shard command channel.
- Added io_uring batching:
  - deferred `ring.submit()` with batched flush (`IOURING_SUBMIT_BATCH=64`)
  - flush on poll/reap and on SQ pressure.
- Added io_uring no-ticket CQE suppression:
  - uses `IORING_MSG_RING_CQE_SKIP` flag value for no-ticket `msg_ring` sends to avoid sender-CQ flooding.
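
The deferred-submit batching above amounts to a small piece of bookkeeping. The sketch below models only the counting, not a real ring; `SubmitBatcher` is a hypothetical type, and `submit_calls` stands in for the number of `ring.submit()` calls a driver would make.

```rust
pub const IOURING_SUBMIT_BATCH: usize = 64;

/// Hypothetical bookkeeping for deferred submission.
#[derive(Debug, Default)]
pub struct SubmitBatcher {
    pending: usize,
    submit_calls: usize,
}

impl SubmitBatcher {
    /// Records one queued SQE and flushes once the batch threshold is reached.
    pub fn push(&mut self) {
        self.pending += 1;
        if self.pending >= IOURING_SUBMIT_BATCH {
            self.flush();
        }
    }

    /// Flushes whatever is pending. A driver would also call this on
    /// poll/reap and when the submission queue is under pressure.
    pub fn flush(&mut self) {
        if self.pending > 0 {
            self.submit_calls += 1;
            self.pending = 0;
        }
    }

    pub fn pending(&self) -> usize {
        self.pending
    }

    pub fn submit_calls(&self) -> usize {
        self.submit_calls
    }
}
```

In this model, 130 queued sends cost two threshold flushes plus one explicit flush instead of 130 submit calls.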

### Benchmark harness alignment updates

- Runtime one-way benchmark now uses `send_raw_nowait` for fire-and-drain semantics.
- io_uring steady one-way harness uses larger ring entries (`4096`) to avoid CQ overflow in high-burst synthetic load.
- Cold-start io_uring path kept at default ring sizing to keep init broadly reliable on dev machines.

### Additional test coverage

- Added test:
  - `send_raw_nowait_delivers_event`

### Current quick sample numbers (50ms warmup/50ms measure)

- `steady_ping_pong_rtt/spargio_queue`: ~`1.47-1.51 ms`
- `steady_ping_pong_rtt/spargio_io_uring`: ~`336-348 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.21-1.34 ms`
- `steady_one_way_send_drain/spargio_queue`: ~`1.25-1.27 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`232-234 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`69-71 us`

## Update: Fast-Path Checklist Pass (Current)

Requested optimization checklist from the prior analysis and status:

- Doorbell + payload queue batching for io_uring no-ticket sends:
  - Implemented.
  - No-ticket sends now enqueue payloads into per `(target, source)` shared queues and only emit a `msg_ring` doorbell when transitioning empty -> non-empty.
- `send_many_nowait` API:
  - Implemented.
  - Added:
    - `RemoteShard::send_many_raw_nowait`
    - `RemoteShard::send_many_nowait`
    - `ShardCtx::send_many_raw_nowait`
    - `ShardCtx::send_many_nowait`
- Explicit flush API:
  - Implemented.
  - Added:
    - `ShardCtx::flush() -> SendTicket`
    - `RemoteShard::flush() -> SendTicket` (no-op success outside shard context)
  - io_uring implementation flushes pending submissions and uses a `NOP` completion barrier.
- Send waiter structure (`HashMap -> slab`):
  - Implemented.
  - Waiters are now stored in `Slab`, with completion `user_data` carrying slab index.
- Optional io_uring setup knobs (SQPOLL path):
  - Implemented on Linux builder:
    - `io_uring_sqpoll(Option<u32>)`
    - `io_uring_sqpoll_cpu(Option<u32>)`
    - `io_uring_single_issuer(bool)`
    - `io_uring_coop_taskrun(bool)`
- EventState lock removal (`Mutex -> RefCell`):
  - Not applied.
  - Reason: current `spawn_on` API requires `Send` futures; making event state shard-local `Rc<RefCell<...>>` makes `NextEvent` non-`Send`, which breaks valid `spawn_on` usage.
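
The doorbell rule from the first checklist item (signal only on the empty -> non-empty transition) can be sketched as follows. This is a simplified model: `PayloadQueue` is a hypothetical name, a `Mutex<VecDeque<_>>` stands in for whatever shared queue the runtime uses, and the payload field widths are assumptions.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical payload queue shared by one `(target, source)` shard pair.
#[derive(Default)]
pub struct PayloadQueue {
    items: Mutex<VecDeque<(u16, u32)>>,
}

impl PayloadQueue {
    /// Enqueues a payload. Returns `true` when the queue went from empty to
    /// non-empty, the only case in which the sender needs to ring the doorbell.
    pub fn push(&self, tag: u16, val: u32) -> bool {
        let mut items = self.items.lock().unwrap();
        let was_empty = items.is_empty();
        items.push_back((tag, val));
        was_empty
    }

    /// Receiver side: takes everything queued so far in one pass.
    pub fn drain(&self) -> Vec<(u16, u32)> {
        let mut items = self.items.lock().unwrap();
        let drained: Vec<(u16, u32)> = items.drain(..).collect();
        drained
    }
}
```

A sender would issue one `msg_ring` doorbell only when `push` returns `true`; the receiver drains the whole queue per doorbell, so a burst of sends shares a single cross-ring signal.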

### Correctness note on CQE suppression

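- The previous pass used `IORING_MSG_RING_CQE_SKIP` on the assumption that it removed only sender-side completions; that flag instead suppresses the CQE posted to the target ring.
- This pass corrects no-ticket suppression to use the SQE flag `IOSQE_CQE_SKIP_SUCCESS` (`SKIP_SUCCESS`), which suppresses the source CQE on success while preserving receiver delivery.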

### Additional tests added

- `send_many_raw_nowait_delivers_in_order`
- `flush_completes_without_messages`
- `io_uring_send_many_nowait_delivers_messages`

### Validation

- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

### Latest quick benchmark sample (50ms warmup/50ms measure)

- `steady_ping_pong_rtt/spargio_queue`: ~`1.36-1.39 ms`
- `steady_ping_pong_rtt/spargio_io_uring`: ~`365-370 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.23-1.31 ms`
- `steady_one_way_send_drain/spargio_queue`: ~`1.23-1.25 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`62.8-64.5 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`69.0-72.7 us`
- `cold_start_ping_pong/spargio_queue`: ~`2.39-2.40 ms`
- `cold_start_ping_pong/spargio_io_uring`: ~`255-276 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`453-484 us`

## Update: Tokio Batched One-Way Controls

To make the one-way comparison fairer, added additional Tokio benchmarks that batch payloads before crossing threads:

- `steady_one_way_send_drain/tokio_two_worker_batched_64`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`

Implementation notes:

- Added `TokioWire::OneWayBatch(Vec<u32>)`.
- Added `TokioCmd::OneWayBatched { rounds, batch, reply }`.
- Existing `tokio_two_worker` remains unchanged as the per-message baseline.
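
To illustrate why batching changes the cost profile, here is a sketch that counts channel sends for a batched one-way pass. It uses `std::sync::mpsc` as a stand-in for Tokio's unbounded channel; `Wire::OneWay` is an assumed name for the per-message variant, and `send_batched`/`drain_count` are hypothetical helpers.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Wire message; `OneWayBatch` mirrors the variant named in this log.
#[derive(Debug)]
pub enum Wire {
    OneWay(u32),
    OneWayBatch(Vec<u32>),
}

/// Sends `rounds` payloads in chunks of `batch` and returns the number of
/// channel sends performed (each one is a cross-thread signal).
pub fn send_batched(tx: &Sender<Wire>, rounds: u32, batch: usize) -> usize {
    let payloads: Vec<u32> = (0..rounds).collect();
    let mut sends = 0;
    for chunk in payloads.chunks(batch.max(1)) {
        if tx.send(Wire::OneWayBatch(chunk.to_vec())).is_ok() {
            sends += 1;
        }
    }
    sends
}

/// Drains everything currently queued and returns how many payloads arrived.
pub fn drain_count(rx: &Receiver<Wire>) -> usize {
    let mut seen = 0;
    while let Ok(msg) = rx.try_recv() {
        seen += match msg {
            Wire::OneWay(_) => 1,
            Wire::OneWayBatch(vals) => vals.len(),
        };
    }
    seen
}
```

For 256 payloads, a batch size of 64 needs four channel sends and a batch size of 256 needs one, compared with 256 sends on the per-message path.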

Quick sample (50ms warmup/50ms measure):

- `steady_one_way_send_drain/spargio_io_uring`: ~`64.2-65.4 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`83.7-96.0 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`23.3-25.3 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`14.9-15.7 us`

Interpretation:

- The previous Tokio gap was largely due to per-send cross-thread signaling overhead, not an inherent runtime scheduler limit.
- With batching, Tokio is substantially faster on this one-way synthetic workload.

## Update: Disk IO Benchmark (4K Read RTT)

Added a dedicated disk benchmark:

- New bench target:
  - `benches/disk_io.rs`
- Cargo bench config:
  - `[[bench]]` entry with `name = "disk_io"` and `harness = false`

### Benchmark shape

- Persistent fixture file:
  - 16 MiB (`4096 * 4 KiB`) temp file under system temp dir.
- Metric:
  - `disk_read_rtt_4k` (per-iteration round-trip for `256` 4 KiB reads).
- Compared paths:
  - `tokio_two_worker_pread`
    - two-worker Tokio runtime
    - request/ack over Tokio unbounded channels
    - worker performs `pread` (`FileExt::read_at`)
  - `io_uring_msg_ring_two_ring_pread` (Linux)
    - two rings (`client` + `worker`)
    - request/ack over `IORING_OP_MSG_RING`
    - worker performs `IORING_OP_READ` and replies via `msg_ring`
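
The `pread` side of this benchmark shape can be sketched with the standard library alone. The helper names and the fill pattern are assumptions for illustration, and the sketch uses `FileExt::read_exact_at` rather than `read_at` so that short reads do not need separate handling. It is Unix-only.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::FileExt;
use std::path::Path;

pub const BLOCK: usize = 4096;

/// Creates a fixture of `blocks` 4 KiB blocks; block `i` is filled with `i as u8`.
pub fn create_fixture(path: &Path, blocks: usize) -> io::Result<File> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    for i in 0..blocks {
        file.write_all(&[i as u8; BLOCK])?;
    }
    file.sync_all()?;
    Ok(file)
}

/// Performs `reads` positional 4 KiB reads, cycling over the fixture, and
/// returns the total number of bytes read.
pub fn read_blocks(file: &File, blocks: usize, reads: usize) -> io::Result<usize> {
    let mut buf = [0u8; BLOCK];
    let mut total = 0;
    for i in 0..reads {
        let block = i % blocks.max(1);
        file.read_exact_at(&mut buf, (block * BLOCK) as u64)?;
        // Sanity check against the fill pattern written by `create_fixture`.
        debug_assert_eq!(buf[0], block as u8);
        total += buf.len();
    }
    Ok(total)
}
```

Calling `create_fixture(path, 4096)` followed by `read_blocks(&file, 4096, 256)` would match the 16 MiB fixture and 256 reads per iteration described above, though without the request/ack hop between workers.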

### Quick sample (50ms warmup/50ms measure)

- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.71-1.91 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.64-3.09 ms`

### Notes

- This first disk RTT harness is not yet optimized for io_uring throughput; it serializes each request/ack pair and favors simplicity and debuggability.
- VFS work is still present for both paths; `io_uring` changes submission/completion mechanics, not filesystem lookup/permission/page-cache semantics.

## Update: Tokio Interop API Slice (TDD)

Started implementation toward the ADR with a first interop slice focused on submission APIs that can be called from Tokio tasks.

### Red phase

Added failing tests in `tests/tokio_compat_tdd.rs` for:

- `Runtime::handle()` availability.
- `RuntimeHandle::spawn_pinned(shard, fut)` execution on requested shard.
- `RuntimeHandle::spawn_stealable(fut)` round-robin placement.
- `RuntimeHandle` usage from Tokio tasks, including remote send + ticket await.
- `RuntimeHandle` cloneability and `Send + Sync`.

### Green phase

Implemented in `src/lib.rs`:

- New public `RuntimeHandle` (`Clone`, `Send + Sync`).
- `Runtime::handle() -> RuntimeHandle`.
- `RuntimeHandle` APIs:
  - `backend()`
  - `shard_count()`
  - `remote(shard)`
  - `spawn_pinned(shard, fut)`
  - `spawn_stealable(fut)` (round-robin via `AtomicUsize`)
- Refactored spawn logic into shared helper:
  - `spawn_on_shared(...)`

Validation:

- `cargo test` passes (including new `tokio_compat_tdd` tests).
- `cargo bench --no-run` passes.

## Update: Tokio-Compat POLL_ADD Reactor Scaffold (TDD)

Implemented the first compatibility-reactor scaffold behind feature gating.

### Red phase

Added failing tests in `tests/tokio_poll_reactor_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:

- `PollReactor::register(..., PollInterest::Readable)` receives readable event.
- `PollReactor::deregister(token)` returns `NotFound` on second deregister.
- Token uniqueness across registrations.

### Green phase

Implemented new module in `src/lib.rs`:

- `tokio_compat` (Linux + feature gated):
  - `PollReactor`
  - `PollInterest`
  - `PollToken`
  - `PollEvent`
  - `PollReactorError`
- Uses `IORING_OP_POLL_ADD` for registration and `IORING_OP_POLL_REMOVE` for deregistration.
- Includes minimal completion routing and internal completion tagging for deterministic deregister behavior.

Cargo feature updates (`Cargo.toml`):

- Added features:
  - `tokio-compat`
  - `uring-native`
- Added Linux dependency:
  - `libc`

Validation:

- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Current Status: Tokio-Uring Alternative Scope

Snapshot of what is implemented vs remaining for the target architecture (`msg_ring` + poll-compat + work-stealing + native fast lane):

### Implemented

- Core `msg_ring` runtime and Linux `io_uring` backend.
- Tokio interop handle APIs:
  - `Runtime::handle()`
  - `spawn_pinned(...)`
  - `spawn_stealable(...)` (current policy: round-robin placement).
- `tokio-compat` lane scaffold:
  - `PollReactor` (`IORING_OP_POLL_ADD` / `IORING_OP_POLL_REMOVE`)
  - async `TokioPollReactor`
  - `TokioCompatLane` via `RuntimeHandle::tokio_compat_lane(...)`
  - lane readiness helpers: `wait_readable(fd)`, `wait_writable(fd)`.
- Cancellation cleanup and active-token tracking for poll registrations.
- TDD coverage for all of the above in:
  - `tokio_compat_tdd.rs`
  - `tokio_poll_reactor_tdd.rs`
  - `tokio_poll_async_tdd.rs`
  - `tokio_runtime_lane_tdd.rs`
  - `tokio_runtime_wait_tdd.rs`

### Remaining

- True work-stealing scheduler:
  - per-worker deque + global injector + steal loop (not implemented yet).
- Submission-time stealing/placement policy for native I/O work (not implemented yet).
- Poll-compat path integrated into shard driver with `msg_ring` doorbells:
  - current poll path uses dedicated reactor worker thread + command channel.
- `uring-native` fast lane:
  - feature flag exists, but native async API surface is not implemented yet.
- Tokio-like compatibility wrappers (`AsyncRead`/`AsyncWrite`) are not implemented yet.
- Full stress/race suite for rearm/cancel/drop edge cases under load is not complete yet.
- Compat-vs-native and mixed-load stealing benchmark suite is not complete yet.

## Proposed Sequence: Functional Slices First

Priority order to ship usable slices earlier:

1. Compat ergonomics slice:
   - stabilize `tokio-compat` lane ergonomics and add simple compatibility wrappers.
2. Native fast-lane MVP:
   - add first `uring-native` read/write APIs with pinned submission.
3. Mixed-mode app slice:
   - make compat and native lanes easy to combine in one app.
4. Submission-time placement policies:
   - add `round_robin`, `sticky`, and explicit shard placement options.
5. True work-stealing scheduler:
   - introduce per-worker deque + global injector + steal loop for stealable tasks.
6. Poll path re-home to shard driver:
   - move poll processing into shard driver path with `msg_ring` wakeups.
7. Hardening and benchmark gate slice:
   - race stress tests + mixed-load benchmark gates.

User stories unlocked after each slice:

1. After compat ergonomics:
   - migrate Tokio readiness-style code with minimal rewrites.
2. After native fast-lane MVP:
   - move only hot I/O paths to native `io_uring` APIs.
3. After mixed-mode:
   - run compatibility code and native ops side by side.
4. After placement policies:
   - control locality/load-balance at submission time.
5. After true work-stealing:
   - auto-balance CPU/control tasks while keeping I/O ring-affine.
6. After poll re-home:
   - reduce poll-path overhead without API changes.
7. After hardening/bench gates:
   - rely on correctness/perf regression protection in CI.

## User Stories Already Possible

With current implementation, users can already:

1. Build and run a sharded runtime with queue or Linux `io_uring` backend.
2. Send typed/raw shard-to-shard messages and await sender tickets.
3. Use no-ticket batched message sends and explicit flush barriers.
4. Spawn pinned or round-robin stealable tasks from Tokio tasks via `RuntimeHandle`.
5. Create a `tokio-compat` lane and use poll registration (`POLL_ADD`/`POLL_REMOVE`) through:
   - direct poll API (`register`, `wait_one`, `deregister`)
   - lane helpers (`wait_readable`, `wait_writable`).
6. Cancel readiness waits without leaking poll registrations (covered by tests).
7. Benchmark message RTT/one-way/cold-start and run a first disk I/O RTT comparison harness.

## Update: Compat Ergonomics Slice (TDD)

Implemented the next functional slice aimed at easier migration ergonomics for readiness-style code.

### Red phase

Added failing tests in `tests/tokio_compat_fd_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:

- lane-scoped compatibility FD wrapper creation.
- wrapper `writable().await` and `readable().await` behavior.
- wrapper cloneability and FD identity access.

### Green phase

Implemented in `src/lib.rs`:

- New `CompatFd` type (`Clone`) under `tokio-compat`:
  - stores `TokioCompatLane` + `RawFd`.
- New lane factory:
  - `TokioCompatLane::compat_fd(fd) -> CompatFd`
- Wrapper methods:
  - `fd()`
  - `readable().await`
  - `writable().await`

This reuses the lane's cancellation-safe wait logic and poll token cleanup.

Validation:

- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Update: Async Tokio Poll Wrapper (TDD)

Added a Tokio-usable async wrapper over the `POLL_ADD` scaffold to allow direct use from Tokio tasks.

### Red phase

Added failing tests in `tests/tokio_poll_async_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:

- async `wait_one()` returning readable events.
- async `deregister()` reporting `NotFound` on second remove.

### Green phase

Implemented in `src/lib.rs` (`tokio_compat` module):

- `TokioPollReactor` (`Clone`) wrapping `PollReactor` in `Arc<Mutex<_>>`.
- Methods:
  - `new(entries)`
  - `register(fd, interest)`
  - `wait_one().await`
  - `deregister(token).await`
- Async methods use `tokio::task::spawn_blocking` to execute blocking ring wait/remove logic safely off async worker threads.

Feature/dependency update:

- `tokio-compat` now enables optional Tokio dependency (`dep:tokio`).

Validation:

- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Update: Tokio Compat Lane via RuntimeHandle (TDD)

Integrated poll-compat usage into a runtime-lane API so Tokio tasks can use a single handle for both runtime operations and readiness waiting.

### Red phase

Added failing tests in `tests/tokio_runtime_lane_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:

- `RuntimeHandle::tokio_compat_lane(entries)` creation.
- Combined lane behavior:
  - `spawn_pinned`
  - `remote(...).send_raw(...).await`
  - event receive path
- Poll API through lane:
  - `register`
  - async `wait_one`

### Green phase

Implemented in `src/lib.rs`:

- `RuntimeHandle::tokio_compat_lane(entries) -> Result<TokioCompatLane, PollReactorError>`
- New `TokioCompatLane` (`Clone`) with delegated runtime APIs:
  - `backend`
  - `shard_count`
  - `remote`
  - `spawn_pinned`
  - `spawn_stealable`
- Lane poll APIs:
  - `register`
  - async `wait_one`
  - async `deregister`

Validation:

- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Update: Lane Readiness Futures + Cancellation Cleanup (TDD)

Implemented lane-scoped readiness waits and fixed cancellation behavior.

### Red phase

Added failing tests in `tests/tokio_runtime_wait_tdd.rs` (`cfg(all(feature = "tokio-compat", target_os = "linux"))`) for:

- `wait_writable(fd)` and `wait_readable(fd)` APIs through `TokioCompatLane`.
- cancellation cleanup:
  - aborting `wait_readable` should not leak poll registrations.

### Green phase

Implemented in `src/lib.rs`:

- `TokioCompatLane` readiness methods:
  - `wait_readable(fd).await`
  - `wait_writable(fd).await`
- Drop cleanup guard for wait futures:
  - best-effort deregistration on cancellation.
- Debug helper for validation:
  - `debug_poll_registered_count()`.
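
The drop cleanup guard can be illustrated with a small model. This is a sketch under assumptions: `Registry` and `WaitGuard` are hypothetical names, and a shared `HashSet` of tokens stands in for the reactor's registration state.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical registry of active poll tokens.
#[derive(Clone, Default)]
pub struct Registry {
    active: Arc<Mutex<HashSet<u64>>>,
}

impl Registry {
    /// Registers `token` and returns a guard that cleans it up on drop.
    pub fn register(&self, token: u64) -> WaitGuard {
        self.active.lock().unwrap().insert(token);
        WaitGuard { registry: self.clone(), token, armed: true }
    }

    /// Counterpart of a debug helper that reports live registrations.
    pub fn registered_count(&self) -> usize {
        self.active.lock().unwrap().len()
    }
}

/// Deregisters its token on drop unless the wait finished normally.
pub struct WaitGuard {
    registry: Registry,
    token: u64,
    armed: bool,
}

impl WaitGuard {
    /// Success path: the event was delivered, so remove the registration
    /// and disarm the drop handler.
    pub fn complete(mut self) {
        self.registry.active.lock().unwrap().remove(&self.token);
        self.armed = false;
    }
}

impl Drop for WaitGuard {
    fn drop(&mut self) {
        if self.armed {
            // Best effort: skip cleanup on a poisoned lock rather than panic in drop.
            if let Ok(mut active) = self.registry.active.lock() {
                active.remove(&self.token);
            }
        }
    }
}
```

A wait future would hold the guard across its await point, so aborting the task drops the guard and removes the registration without any explicit cancel call.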

Important fix during this slice:

- Reworked `TokioPollReactor` implementation from `spawn_blocking + Mutex<PollReactor>` to a dedicated worker-thread command loop.
- Reason:
  - prior design could deadlock cleanup when aborted tasks left blocking waits holding the mutex.
- New design:
  - command channel (`register` / `wait_one` / `deregister`)
  - non-blocking waiter pump (`try_wait_one`) to keep deregistration responsive.

Additional reactor hardening:

- Track active poll tokens in `PollReactor`.
- Ignore stale completions for inactive tokens.
- Fast `NotFound` on deregister for unknown token.
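
The worker-thread command loop and the active-token tracking can be sketched together. The model below leaves out the actual ring and the `wait_one` path; `ReactorHandle`, `Cmd`, and `ReactorError` are hypothetical names, and the worker only allocates and retires tokens.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Sender};
use std::thread;

#[derive(Debug, PartialEq, Eq)]
pub enum ReactorError {
    NotFound,
    Closed,
}

/// Commands accepted by the worker; each carries its own reply channel.
enum Cmd {
    Register { reply: Sender<u64> },
    Deregister { token: u64, reply: Sender<Result<(), ReactorError>> },
}

/// Hypothetical handle to a reactor that lives on a dedicated thread.
#[derive(Clone)]
pub struct ReactorHandle {
    tx: Sender<Cmd>,
}

impl ReactorHandle {
    pub fn spawn() -> ReactorHandle {
        let (tx, rx) = channel::<Cmd>();
        thread::spawn(move || {
            let mut active: HashSet<u64> = HashSet::new();
            let mut next_token = 0u64;
            // The loop ends once every handle has been dropped.
            for cmd in rx {
                match cmd {
                    Cmd::Register { reply } => {
                        next_token += 1;
                        active.insert(next_token);
                        let _ = reply.send(next_token);
                    }
                    Cmd::Deregister { token, reply } => {
                        // Unknown or already removed tokens fail fast.
                        let result = if active.remove(&token) {
                            Ok(())
                        } else {
                            Err(ReactorError::NotFound)
                        };
                        let _ = reply.send(result);
                    }
                }
            }
        });
        ReactorHandle { tx }
    }

    pub fn register(&self) -> Result<u64, ReactorError> {
        let (reply, rx) = channel();
        self.tx.send(Cmd::Register { reply }).map_err(|_| ReactorError::Closed)?;
        rx.recv().map_err(|_| ReactorError::Closed)
    }

    pub fn deregister(&self, token: u64) -> Result<(), ReactorError> {
        let (reply, rx) = channel();
        self.tx.send(Cmd::Deregister { token, reply }).map_err(|_| ReactorError::Closed)?;
        rx.recv().map_err(|_| ReactorError::Closed)?
    }
}
```

Because only the worker thread touches the token set, no caller can hold a lock across a blocking wait, which is the failure mode the mutex-based design ran into.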

Validation:

- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Recap: Requested Slice Sequence and Status (2026-02-26)

Per the requested "functional slices first" plan, the sequence and current status are:

1. Compat ergonomics slice: `completed`.
2. Native fast-lane MVP slice: `completed` (this update).
3. Mixed-mode app slice: `partially completed` (compat + native lanes both exist; additional app-level helpers still pending).
4. Submission-time placement policies: `not started`.
5. True work-stealing scheduler: `not started`.
6. Poll path re-home to shard driver + `msg_ring` wakeups: `not started`.
7. Hardening + benchmark gate slice: `in progress` (coverage exists, full stress/benchmark gates pending).

## Update: Compat Stream Wrappers (TDD)

Extended compat ergonomics with Tokio `AsyncRead`/`AsyncWrite` wrappers for easier migration from socket-like code.

### Red phase

Added failing tests:

- `tests/tokio_compat_stream_tdd.rs`
  - `compat_stream_fd_reads_and_writes`
  - `compat_stream_fd_pending_read_wakes_on_write`
- `tests/tokio_compat_stream_hardening_tdd.rs`
  - `compat_fd_into_stream_reads_bytes`
  - `compat_stream_reads_eof_as_zero`
  - `lane_compat_stream_helper_wraps_asrawfd`

### Green phase

Implemented in `src/lib.rs` (Linux + `tokio-compat`):

- `CompatStreamFd` wrapper.
- `TokioCompatLane::compat_stream_fd(fd)`.
- `TokioCompatLane::compat_stream<T: AsRawFd>(&T)`.
- `CompatFd::into_stream()`.
- `AsyncRead`/`AsyncWrite` impls for `CompatStreamFd` using:
  - nonblocking `libc::read`/`libc::write`
  - lane readiness waits (`wait_readable`/`wait_writable`) on `WouldBlock`.
- helper utilities:
  - `set_nonblocking(fd)`
  - poll-error -> `std::io::Error` mapping.

Validation:

- `cargo test --features tokio-compat` passes.
- `cargo test` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Update: `uring-native` Fast-Lane MVP (TDD)

Implemented first native lane API for direct `io_uring` read/write-at operations with pinned shard submission.

### Red phase

Added failing tests in `tests/uring_native_tdd.rs` (`cfg(all(feature = "uring-native", target_os = "linux"))`):

- `uring_native_lane_requires_io_uring_backend`
- `uring_native_lane_reads_file_at_offset`
- `uring_native_lane_writes_file_at_offset`

### Green phase

Implemented in `src/lib.rs` (Linux + `uring-native`):

- `RuntimeHandle::uring_native_lane(shard) -> Result<UringNativeLane, RuntimeError>`.
- `UringNativeLane` API:
  - `read_at(fd, offset, len).await -> io::Result<Vec<u8>>`
  - `write_at(fd, offset, buf).await -> io::Result<usize>`
  - `shard()`.
- `TokioCompatLane::uring_native_lane(shard)` bridge (when both `tokio-compat` and `uring-native` features are enabled).
- Native op command plumbing from shard tasks to backend.
- `IoUringDriver` native op tracking/completion with `IORING_OP_READ` and `IORING_OP_WRITE`.
- Completion demuxing for native op user-data and cleanup on shutdown/error paths.

Notes:

- Native lane currently uses pinned submission through shard-local command flow.
- Queue backend intentionally returns `UnsupportedBackend` for native lane creation.

Validation:

- `cargo test` passes.
- `cargo test --features tokio-compat` passes.
- `cargo test --features uring-native` passes.
- `cargo test --features "tokio-compat uring-native"` passes.
- `cargo bench --no-run` passes.
- `cargo bench --no-run --features glommio-bench` passes.

## Revised Task List: Value Proposition Execution (2026-02-26)

Revised priority list aligned to the current project premise.

Update:

- reordered for faster proof generation.
- benchmark evidence is moved near the front so we validate value earlier.

- Core premise:
  - deliver a differentiated `io_uring` runtime centered on `msg_ring`-based cross-shard coordination and work-stealing.
- Not the core premise:
  - broad Tokio drop-in compatibility across dependency internals.

### Slice 1: Compatibility De-Scoping

Goal:

- remove or deprecate `tokio-compat` paths as active project focus.
- retain only interop boundaries needed for mixed-mode deployment.

Done criteria:

- code/docs/feature flags no longer present `tokio-compat` as strategic direction.
- README + ADRs + crate feature docs reflect runtime-first focus.

Validation gate:

- `cargo test`
- `cargo test --features uring-native`
- `cargo bench --no-run`

### Slice 2: Benchmark MVP Harness (Early Proof)

Goal:

- add the first coordination-heavy benchmark harness early:
  - intra-request fan-out/fan-in
  - shard-skew scenarios
  - mixed control/CPU + ring-affine I/O path.

Done criteria:

- reproducible harness exists and can run quickly on dev machines.
- first p50/p95/p99 + throughput-at-SLO snapshots are recorded.
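
For the p50/p95/p99 snapshots, one possible computation is the nearest-rank percentile. This is a sketch of that method only; `percentile` and `tail_summary` are hypothetical helpers, and the harness may well use a different estimator.

```rust
/// Nearest-rank percentile over latency samples (any unit).
/// `pct` is an integer percentage in 1..=100. Returns `None` for empty
/// input or an out-of-range percentage.
pub fn percentile(samples: &[u64], pct: usize) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: ceil(pct / 100 * n), computed in integers.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Returns (p50, p95, p99) for one benchmark snapshot.
pub fn tail_summary(samples: &[u64]) -> Option<(u64, u64, u64)> {
    Some((
        percentile(samples, 50)?,
        percentile(samples, 95)?,
        percentile(samples, 99)?,
    ))
}
```

For 100 samples holding the values 1 through 100, `tail_summary` returns `(50, 95, 99)`.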

Validation gate:

- benchmark smoke run in local workflow.
- `cargo bench --no-run` remains green.

### Slice 3: Placement Policy MVP

Goal:

- implement policy-driven submission placement needed by the benchmark:
  - explicit shard
  - sticky-key routing
  - policy round-robin.
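
One way the three placement policies could be expressed is sketched below. Nothing here is the planned API: `Placement` and `Placer` are hypothetical names, and the sticky-key routing assumes hashing the key with `DefaultHasher` and reducing it modulo the shard count.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical placement options.
pub enum Placement<'a> {
    Explicit(usize),
    StickyKey(&'a str),
    RoundRobin,
}

pub struct Placer {
    shards: usize,
    next: AtomicUsize,
}

impl Placer {
    pub fn new(shards: usize) -> Self {
        assert!(shards > 0, "at least one shard is required");
        Placer { shards, next: AtomicUsize::new(0) }
    }

    /// Resolves a policy to a shard index, or `None` for an out-of-range
    /// explicit shard.
    pub fn place(&self, policy: Placement<'_>) -> Option<usize> {
        match policy {
            Placement::Explicit(shard) if shard < self.shards => Some(shard),
            Placement::Explicit(_) => None,
            Placement::StickyKey(key) => {
                // The same key always hashes to the same shard.
                let mut hasher = DefaultHasher::new();
                key.hash(&mut hasher);
                Some((hasher.finish() % self.shards as u64) as usize)
            }
            Placement::RoundRobin => {
                // Relaxed suffices: only the counter value matters.
                Some(self.next.fetch_add(1, Ordering::Relaxed) % self.shards)
            }
        }
    }
}
```

Deterministic routing tests would then assert that a repeated key maps to the same shard and that round-robin cycles through shards in order.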

Done criteria:

- public APIs expose placement policy selection.
- deterministic tests verify routing behavior.

Validation gate:

- placement policy tests + no regression in existing send/flush tests.

### Slice 4: True Work-Stealing MVP

Goal:

- replace spawn-time round-robin with true stealing mechanics:
  - per-worker deque
  - global injector
  - steal loop with cooperative budgeting.
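
The per-worker deque, global injector, and steal loop can be modelled in a few lines. This sketch is deliberately simple and makes several assumptions: `Scheduler` is a hypothetical type, mutex-protected `VecDeque`s replace the lock-free deques a real scheduler would likely use, and cooperative budgeting is omitted.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical scheduler state: one deque per worker plus a global injector.
pub struct Scheduler<T> {
    locals: Vec<Mutex<VecDeque<T>>>,
    injector: Mutex<VecDeque<T>>,
}

impl<T> Scheduler<T> {
    pub fn new(workers: usize) -> Self {
        Scheduler {
            locals: (0..workers).map(|_| Mutex::new(VecDeque::new())).collect(),
            injector: Mutex::new(VecDeque::new()),
        }
    }

    /// Queues a task on a specific worker's deque.
    pub fn push_local(&self, worker: usize, task: T) {
        self.locals[worker].lock().unwrap().push_back(task);
    }

    /// Queues a task that any worker may pick up.
    pub fn inject(&self, task: T) {
        self.injector.lock().unwrap().push_back(task);
    }

    /// Finds the next task for `worker`: its own deque first (newest task),
    /// then the injector, then the oldest task of another worker.
    pub fn next_task(&self, worker: usize) -> Option<T> {
        if let Some(task) = self.locals[worker].lock().unwrap().pop_back() {
            return Some(task);
        }
        if let Some(task) = self.injector.lock().unwrap().pop_front() {
            return Some(task);
        }
        let n = self.locals.len();
        for offset in 1..n {
            let victim = (worker + offset) % n;
            if let Some(task) = self.locals[victim].lock().unwrap().pop_front() {
                return Some(task);
            }
        }
        None
    }
}
```

Pinned or ring-affine tasks would simply never be placed in a structure that other workers can steal from, which is how the "remain protected" criterion could be met in this model.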

Done criteria:

- stealable tasks move under load/skew.
- pinned/ring-affine tasks remain protected.

Validation gate:

- scheduler TDD for steal/no-steal invariants and skew behavior.

### Slice 5: Ring-Affine Native I/O Enforcement

Goal:

- make ring-affinity guarantees explicit in runtime state transitions.

Done criteria:

- in-flight native I/O cannot migrate across shards.
- cancellation and completion paths preserve ownership invariants.

Validation gate:

- race/cancel/drop tests for native I/O ownership safety.

### Slice 6: `msg_ring` Transport Hardening

Goal:

- harden coordination path under load:
  - batching behavior
  - doorbell policy
  - SQ/CQ pressure handling.

Done criteria:

- overload behavior is well-defined and tested.
- transport metrics (drops/retries/backpressure) are surfaced.

Validation gate:

- stress tests with bounded memory and deterministic failure semantics.

### Slice 7: Mixed-Runtime Boundary API Hardening

Goal:

- define robust communication contracts between `spargio` and host runtimes (Tokio or others):
  - bounded request/reply channels
  - backpressure semantics
  - cancellation and deadline propagation.
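
A bounded request/reply boundary with explicit backpressure could look roughly like the sketch below. It is not the planned API: `Request`, `SubmitError`, `boundary`, and `submit` are hypothetical, `std::sync::mpsc::sync_channel` stands in for whichever bounded channel the host runtime provides, and cancellation and deadline propagation are left out.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Request carrying its own reply channel.
pub struct Request {
    pub payload: u32,
    pub reply: SyncSender<u32>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubmitError {
    /// The bounded queue is full; the caller should shed load or retry later.
    Overloaded,
    /// The runtime side has shut down.
    Closed,
}

/// Creates a bounded boundary queue with room for `capacity` requests.
pub fn boundary(capacity: usize) -> (SyncSender<Request>, Receiver<Request>) {
    sync_channel(capacity)
}

/// Non-blocking submit: surfaces backpressure instead of queueing without bound.
/// On success, returns the receiver on which the reply will arrive.
pub fn submit(tx: &SyncSender<Request>, payload: u32) -> Result<Receiver<u32>, SubmitError> {
    let (reply, rx) = sync_channel(1);
    match tx.try_send(Request { payload, reply }) {
        Ok(()) => Ok(rx),
        Err(TrySendError::Full(_)) => Err(SubmitError::Overloaded),
        Err(TrySendError::Disconnected(_)) => Err(SubmitError::Closed),
    }
}
```

Returning `Overloaded` immediately keeps memory bounded and lets the host runtime decide whether to retry, shed, or time out.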

Done criteria:

- boundary API is explicit and documented.
- tests cover cancellation, timeout, and overload behavior.

Validation gate:

- boundary TDD suite (correctness + cancellation + overload).
- existing core tests remain green.

### Slice 8: Observability and Operator Signals

Goal:

- expose metrics and debug hooks needed for production tuning.

Candidate signals:

- per-shard queue depth
- steal rate
- doorbell rate
- pending native ops
- timeout/cancel counters.

Done criteria:

- metrics API and/or tracing events documented and test-covered.

Validation gate:

- instrumentation tests + low-overhead checks in benchmark runs.

### Slice 9: CI Regression Gates

Goal:

- lock in correctness and performance trajectory.

Done criteria:

- mandatory correctness suites for scheduler/transport/native I/O invariants.
- perf guardrails for critical benchmark scenarios.

Validation gate:

- CI blocks regressions on defined thresholds.

### Slice 10: Reference Mixed-Mode Service + Benchmark Expansion

Goal:

- provide a small reference app showing Tokio + `spargio` mixed-runtime usage:
  - request fan-out into `spargio`
  - aggregation and response path
  - explicit cancellation/backpressure boundary.
- expand benchmark suite from MVP to release-grade scenarios and reporting.

Done criteria:

- runnable example with docs and benchmark entry point.
- linked from README as adoption blueprint.
- expanded benchmark scenarios tracked in log and docs.

Validation gate:

- example integration test + benchmark smoke pass.

## Update: Benchmark Review and Suite Refocus (2026-02-26)

Reviewed benchmark outputs against current value proposition (`io_uring` + `msg_ring` coordination + work-stealing trajectory), then refocused the suite.

### Latest quick benchmark sample (Criterion 50ms warmup / 50ms measure / 20 samples)

From `ping_pong`:

- `steady_ping_pong_rtt/spargio_io_uring`: ~`340-360 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.33-1.45 ms`
- `steady_ping_pong_rtt/spargio_queue`: ~`1.38-1.52 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`63-65 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`84-97 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`23-25 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`13-15 us`
- `cold_start_ping_pong/spargio_io_uring`: ~`255-288 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`505-593 us`

From `disk_io`:

- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.81-2.01 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.54-2.83 ms`

### Interpretation

- Current value is strongest in control-path/message-path microbenchmarks for `io_uring` backend (`steady_ping_pong_rtt`, unbatched `steady_one_way_send_drain`).
- Batched Tokio one-way is still faster in that synthetic path, so batching-sensitive comparisons remain context, not headline.
- The current serialized disk RTT harness does not yet demonstrate a `spargio` advantage.

### Benchmark taxonomy update

Primary KPI direction (to add/expand next):

- coordination-heavy fan-out/fan-in benchmarks with skew and tail-latency focus.

Context / microbench (kept):

- `steady_ping_pong_rtt`
- `steady_one_way_send_drain`

De-emphasized for value-prop claims:

- `cold_start_ping_pong`
- `tokio_two_worker_batched_*` (useful context, not primary proof)
- current `disk_read_rtt_4k` harness (until reworked beyond strict serialized request/ack)

### Glommio benchmark removal decision

Decision:

- remove Glommio comparison path for now.

Reason:

- not currently aligned with primary proof objective and adds maintenance noise.
- current harness shape is not the target benchmark niche for `spargio`.

Changes applied:

- removed Glommio benchmark harness/code from `benches/ping_pong.rs`.
- removed `glommio` dependency and `glommio-bench` feature from `Cargo.toml`.
- removed `glommio-bench` mention from README feature list.

Validation:

- `cargo test` passes.
- `cargo test --features uring-native` passes.
- `cargo bench --no-run` passes.

## Update: Tokio-Compat Removal + Fanout/Fan-in Benchmark MVP (2026-02-26)

Applied the scope change to fully de-emphasize drop-in Tokio emulation and move proof work to coordination-heavy fan-out/fan-in benchmarks.

### Tokio-compat removal (code + tests)

Changes:

- removed `tokio-compat` feature flag from `Cargo.toml`.
- removed optional non-dev Tokio dependency from `[dependencies]`.
- removed all `tokio-compat` lane and poll-emulation code from `src/lib.rs`:
  - deleted `tokio_compat` module.
  - deleted `RuntimeHandle::tokio_compat_lane(...)`.
  - deleted `TokioCompatLane`, `CompatFd`, `CompatStreamFd`, and associated helpers.
- removed compat-only TDD files:
  - `tests/tokio_compat_fd_tdd.rs`
  - `tests/tokio_compat_stream_tdd.rs`
  - `tests/tokio_compat_stream_hardening_tdd.rs`
  - `tests/tokio_poll_reactor_tdd.rs`
  - `tests/tokio_poll_async_tdd.rs`
  - `tests/tokio_runtime_lane_tdd.rs`
  - `tests/tokio_runtime_wait_tdd.rs`
- renamed remaining Tokio interoperability coverage from `tests/tokio_compat_tdd.rs` to `tests/tokio_interop_tdd.rs` for clearer intent.

### New benchmark: fan-out/fan-in with skew

Added `benches/fanout_fanin.rs` and registered it in `Cargo.toml`.

Harness design:

- Same worker width on both runtimes (`4` threads/shards).
- Same workload model on both runtimes:
  - per-request spawn fan-out (`16` branches), then fan-in on join.
  - deterministic synthetic compute per branch.
- Two scenarios:
  - `fanout_fanin_balanced`: all branches equal work.
  - `fanout_fanin_skewed`: one hot branch per request has much heavier work.
- Bench variants:
  - `tokio_mt_4`
  - `spargio_queue`
  - `spargio_io_uring` (Linux)

### Quick MVP benchmark sample

Command:

- `cargo bench --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

Observed ranges:

- `fanout_fanin_balanced/tokio_mt_4`: ~`1.41-1.51 ms`
- `fanout_fanin_balanced/spargio_queue`: ~`10.7-18.1 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`0.782-0.813 ms`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.34-2.40 ms`
- `fanout_fanin_skewed/spargio_queue`: ~`54.0-54.4 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`1.882-1.889 ms`

### Validation

- `cargo fmt` passes.
- `cargo test` passes.
- `cargo test --features uring-native` passes.
- `cargo bench --no-run` passes (includes `fanout_fanin`).

## Direction Note: Full io_uring Runtime Scope (2026-02-26)

Long-term direction:

- evolve `spargio` toward a fuller `io_uring` runtime surface (disk + network I/O), comparable in scope to specialized runtimes.

Near-term priority remains unchanged:

- prove differentiated value first in `msg_ring`-coordinated cross-shard scheduling, placement, and work-stealing benchmarks.

Implication for sequencing:

- full disk/network API breadth is explicitly treated as a later expansion track after current scheduler/coordination milestones are validated.

## Update: Slice Execution MVP (Placement, Stealing, Boundary, CI, Reference App) (2026-02-26)

Executed the remaining planned slices in MVP form with red/green TDD coverage.

### Red-phase tests added

New failing suites introduced first:

- `tests/slices_tdd.rs`
  - placement policy routing (`Pinned`, `Sticky`)
  - stealable execution on non-preferred shard under load
  - runtime stats snapshot counters/shape
- `tests/boundary_tdd.rs`
  - bounded overload behavior (`Overloaded`)
  - blocking timeout behavior (`Timeout`)
  - cancellation-safe reply path (`Canceled`)
  - deadline metadata propagation

Then implementation was iterated until all tests passed.

### Slice 3: Placement policy MVP

Implemented:

- `TaskPlacement` enum:
  - `Pinned(ShardId)`
  - `RoundRobin`
  - `Sticky(u64)`
  - `Stealable`
  - `StealablePreferred(ShardId)`
- `RuntimeHandle::spawn_with_placement(...)`
- `RuntimeHandle::spawn_stealable_on(preferred_shard, ...)`

Notes:

- sticky placement uses stable key hashing to shard index.
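
A minimal sketch of how a placement could resolve to an initial shard index. The enum variants mirror the list above; the `resolve` helper, the round-robin cursor, and the splitmix64-style mixer are assumptions made for illustration (the log does not say which hash the runtime uses).

```rust
type ShardId = usize;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskPlacement {
    Pinned(ShardId),
    RoundRobin,
    Sticky(u64),
    Stealable,
    StealablePreferred(ShardId),
}

/// splitmix64 finalizer: a fixed, seed-free mixer, so a key always maps to
/// the same value across runs (an assumed stand-in for the real hash).
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Hypothetical resolver: picks the initial shard for a placement.
/// `shards` must be non-zero.
fn resolve(placement: TaskPlacement, shards: usize, rr_cursor: &mut usize) -> ShardId {
    match placement {
        // Explicit shard ids are used as given (wrapped into range).
        TaskPlacement::Pinned(id) | TaskPlacement::StealablePreferred(id) => id % shards,
        // Sticky keys hash to a stable shard index.
        TaskPlacement::Sticky(key) => (mix64(key) % shards as u64) as usize,
        // Placements without a hint rotate through the shards.
        TaskPlacement::RoundRobin | TaskPlacement::Stealable => {
            let id = *rr_cursor % shards;
            *rr_cursor = rr_cursor.wrapping_add(1);
            id
        }
    }
}

fn main() {
    let mut cursor = 0;
    for p in [TaskPlacement::Sticky(42), TaskPlacement::Sticky(42), TaskPlacement::RoundRobin] {
        println!("{:?} -> shard {}", p, resolve(p, 4, &mut cursor));
    }
}
```

The two `Sticky(42)` lines print the same shard, which is the property sticky placement relies on.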

### Slice 4: True work-stealing MVP

Implemented:

- global stealable injector channel (`StealableTask`) shared across shard workers.
- shard workers opportunistically drain stealable tasks and execute locally.
- preferred-shard hint tracking: the `stealable_stolen` counter increments when the execution shard differs from the preferred shard.

Validation:

- `stealable_preferred_tasks_can_run_on_another_shard_under_load` now passes.

### Slice 5: Ring-affine native I/O enforcement

Implemented:

- native local commands now carry `origin_shard`.
- backend validates `origin_shard == current_shard` before submitting native ops.
- affinity violations increment `native_affinity_violations` and fail the operation.
- pending native-op gauge (`pending_native_ops`) is tracked.
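
The affinity check described above can be sketched as follows. The counter names come from the list; `NativeStats`, `NativeError`, `submit_native`, and `complete_native` are hypothetical names for this illustration, not the crate's types.

```rust
#[derive(Debug, PartialEq, Eq)]
enum NativeError {
    AffinityViolation { origin: usize, current: usize },
}

#[derive(Debug, Default)]
struct NativeStats {
    native_affinity_violations: u64,
    pending_native_ops: u64,
}

/// Rejects a native op whose origin shard is not the shard doing the submit.
fn submit_native(
    stats: &mut NativeStats,
    origin_shard: usize,
    current_shard: usize,
) -> Result<(), NativeError> {
    if origin_shard != current_shard {
        stats.native_affinity_violations += 1;
        return Err(NativeError::AffinityViolation {
            origin: origin_shard,
            current: current_shard,
        });
    }
    stats.pending_native_ops += 1; // gauge goes up on submit
    Ok(())
}

/// Called when the op's completion arrives; the gauge goes back down.
fn complete_native(stats: &mut NativeStats) {
    stats.pending_native_ops = stats.pending_native_ops.saturating_sub(1);
}

fn main() {
    let mut stats = NativeStats::default();
    println!("{:?}", submit_native(&mut stats, 0, 0));
    println!("{:?}", submit_native(&mut stats, 1, 0));
    complete_native(&mut stats);
    println!("{:?}", stats);
}
```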

### Slice 6: `msg_ring` transport hardening

Implemented:

- configurable `msg_ring_queue_capacity` on `RuntimeBuilder`.
- io_uring payload queues enforce bounded capacity.
- overload now reports `SendError::Backpressure` for saturated payload queues.
- backpressure counter surfaced via `ring_msgs_backpressure`.
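
As an illustration of the bounded-capacity behavior, here is a small sketch of a payload queue that rejects pushes once full and counts each rejection. `PayloadQueue` and its fields are hypothetical; only `SendError::Backpressure` and the idea of a backpressure counter come from the list above.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
enum SendError {
    Backpressure,
}

struct PayloadQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
    backpressure_count: u64,
}

impl<T> PayloadQueue<T> {
    fn new(capacity: usize) -> Self {
        Self { items: VecDeque::with_capacity(capacity), capacity, backpressure_count: 0 }
    }

    /// Enqueues unless the queue is saturated; a rejected push is counted.
    fn try_push(&mut self, item: T) -> Result<(), SendError> {
        if self.items.len() >= self.capacity {
            self.backpressure_count += 1;
            return Err(SendError::Backpressure);
        }
        self.items.push_back(item);
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }
}

fn main() {
    let mut q = PayloadQueue::new(2);
    for msg in 0..3u32 {
        println!("push {} -> {:?}", msg, q.try_push(msg));
    }
    println!("drained {:?}, rejected {}", q.pop(), q.backpressure_count);
}
```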

### Slice 7: Mixed-runtime boundary API hardening

Implemented `spargio::boundary` module:

- bounded channel construction via `boundary::channel(capacity)`.
- client API:
  - `call(...)`
  - `try_call(...)`
  - `call_with_timeout(...)`
- server API:
  - `recv()`
  - `recv_timeout(...)`
- request API:
  - `request()`
  - `deadline()`
  - `respond(...)` (cancellation-safe)
- ticket API:
  - `Future` implementation
  - `wait_timeout_blocking(...)`

Error model:

- `BoundaryError::{Closed, Overloaded, Timeout, Canceled}`.
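
To make the error model concrete, the sketch below rebuilds the same four outcomes on top of `std::sync::mpsc` only. It is not the `spargio::boundary` implementation: `Request`, `try_call`, and `wait_timeout` here are simplified stand-ins, and mapping a dropped reply sender to `Canceled` is an assumption.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, SyncSender, TrySendError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum BoundaryError {
    Closed,
    Overloaded,
    Timeout,
    Canceled,
}

struct Request {
    payload: u32,
    reply: mpsc::Sender<u32>,
}

/// Non-blocking call: a full bounded queue reports `Overloaded`.
fn try_call(tx: &SyncSender<Request>, payload: u32) -> Result<Receiver<u32>, BoundaryError> {
    let (reply, ticket) = mpsc::channel();
    match tx.try_send(Request { payload, reply }) {
        Ok(()) => Ok(ticket),
        Err(TrySendError::Full(_)) => Err(BoundaryError::Overloaded),
        Err(TrySendError::Disconnected(_)) => Err(BoundaryError::Closed),
    }
}

/// Blocking wait on a ticket; a dropped request is reported as `Canceled`.
fn wait_timeout(ticket: &Receiver<u32>, limit: Duration) -> Result<u32, BoundaryError> {
    match ticket.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => Err(BoundaryError::Timeout),
        Err(RecvTimeoutError::Disconnected) => Err(BoundaryError::Canceled),
    }
}

fn main() {
    let (tx, rx) = mpsc::sync_channel::<Request>(1);
    let server = thread::spawn(move || {
        for req in rx {
            // Ignoring the send error keeps the reply path safe if the caller left.
            let _ = req.reply.send(req.payload * 2);
        }
    });
    let ticket = try_call(&tx, 21).expect("queue has room");
    println!("{:?}", wait_timeout(&ticket, Duration::from_secs(1)));
    drop(tx);
    server.join().unwrap();
}
```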

### Slice 8: Observability and operator signals

Implemented snapshot API:

- `RuntimeHandle::stats_snapshot() -> RuntimeStats`

Current signals:

- per-shard command depth (`shard_command_depths`)
- submitted pinned / stealable spawn counts
- stealable executed / stolen counts
- ring message submitted / completed / failed / backpressure counts
- native affinity violation count
- pending native-op gauge

### Slice 9: CI regression gates

Added:

- `.github/workflows/ci.yml` with gates for:
  - format check
  - tests
  - `uring-native` tests
  - `cargo bench --no-run`
  - fan-out benchmark smoke + guardrail scripts

Added scripts:

- `scripts/bench_fanout_smoke.sh`
- `scripts/bench_fanout_guardrail.sh`

### Slice 10: Reference mixed-mode service + benchmark expansion

Added:

- `examples/mixed_mode_service.rs`
  - Tokio-hosted request fan-out to `spargio` via boundary channel
  - stealable placement usage + aggregation response path
  - timeout-aware boundary call path

Benchmark update:

- `benches/fanout_fanin.rs` now records throughput units per group (`Throughput::Elements`).

### Validation

- `cargo test` passes.
- `cargo test --features uring-native` passes.
- `cargo bench --no-run` remains green.

## Update: Full Benchmark Snapshot Refresh (2026-02-26)

Captured a fresh baseline across all active benchmark suites after slice MVP implementation.

### Command profile

- `cargo bench --bench ping_pong -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench disk_io -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Observed ranges

From `ping_pong`:

- `steady_ping_pong_rtt/spargio_queue`: ~`1.37-1.42 ms`
- `steady_ping_pong_rtt/spargio_io_uring`: ~`353-380 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.41-1.51 ms`
- `steady_one_way_send_drain/spargio_queue`: ~`1.31-1.35 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`66.9-69.1 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`87.2-91.1 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`22.4-23.4 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`13.7-14.7 us`
- `cold_start_ping_pong/spargio_queue`: ~`2.43-2.44 ms`
- `cold_start_ping_pong/spargio_io_uring`: ~`242-264 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`511-560 us`

From `fanout_fanin`:

- `fanout_fanin_balanced/tokio_mt_4`: ~`1.35-1.38 ms`
- `fanout_fanin_balanced/spargio_queue`: ~`3.80-4.10 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`1.61-1.65 ms`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.39-2.59 ms`
- `fanout_fanin_skewed/spargio_queue`: ~`3.44-3.73 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`1.99-2.00 ms`

From `disk_io`:

- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.80-1.95 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.61-2.78 ms`

### Readout

- `spargio_io_uring` is strongest in control-path RTT and cold-start latency.
- one-way unbatched send/drain favors `spargio_io_uring`, but batched Tokio remains significantly faster.
- skewed fan-out/fan-in currently favors `spargio_io_uring`.
- balanced fan-out/fan-in currently favors Tokio.
- current disk RTT harness remains a loss for the io_uring+msg_ring path.

## Update: msg_ring Stealable Dispatch + Benchmark Refresh (2026-02-26)

Implemented work-stealing data-path changes to align with the project premise:

- replaced global stealable injector channel with per-shard stealable inboxes.
- changed stealable submit path to:
  1. choose target shard by inbox depth (submission-time decision),
  2. enqueue task into target inbox,
  3. wake target via `msg_ring` doorbell on `IoUring` backend.
- added wake plumbing:
  - `LocalCommand::SubmitStealableWake`
  - `Command::StealableWake`
  - backend `submit_stealable_wake(...)` path.
- kept queue-backend fallback wake semantics for non-io_uring runs.
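
The three-step submit path above can be sketched with plain data structures. `Shard`, `pick_target`, and `submit_stealable` are hypothetical names; the `doorbells` counter stands in for the `msg_ring` wake, and breaking ties in favor of the preferred shard is an assumption.

```rust
use std::collections::VecDeque;

struct Shard {
    inbox: VecDeque<u32>, // queued task ids
    doorbells: u64,       // stand-in for msg_ring wakes sent to this shard
}

/// Step 1: pick the shallowest inbox at submission time.
/// Ties keep the preferred shard (an assumed tie-break).
fn pick_target(shards: &[Shard], preferred: usize) -> usize {
    let mut best = preferred;
    for (i, shard) in shards.iter().enumerate() {
        if shard.inbox.len() < shards[best].inbox.len() {
            best = i;
        }
    }
    best
}

/// Steps 2 and 3: enqueue into the target inbox, then ring its doorbell.
fn submit_stealable(shards: &mut [Shard], preferred: usize, task: u32) -> usize {
    let target = pick_target(shards, preferred);
    shards[target].inbox.push_back(task);
    shards[target].doorbells += 1;
    target
}

fn make_shards(depths: &[usize]) -> Vec<Shard> {
    depths
        .iter()
        .map(|&d| Shard { inbox: (0..d as u32).collect(), doorbells: 0 })
        .collect()
}

fn main() {
    let mut shards = make_shards(&[2, 0, 1]);
    let target = submit_stealable(&mut shards, 0, 99);
    println!("task 99 went to shard {}", target);
}
```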

TDD additions:

- added Linux io_uring slice test proving stealable dispatch submits ring wake traffic:
  - `tests/slices_tdd.rs::io_uring_stealable_dispatch_uses_msg_ring_wake`.

Validation:

- `cargo fmt`
- `cargo test`
- `cargo test --features uring-native`

Benchmark profile:

- `cargo bench --bench ping_pong -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench disk_io -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `./scripts/bench_fanout_guardrail.sh`

Observed ranges:

From `ping_pong`:

- `steady_ping_pong_rtt/spargio_io_uring`: ~`352-370 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.30-1.42 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`66.6-68.3 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`84.0-90.6 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_64`: ~`24.2-26.1 us`
- `steady_one_way_send_drain/tokio_two_worker_batched_all`: ~`14.4-15.7 us`
- `cold_start_ping_pong/spargio_io_uring`: ~`248-305 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`500-555 us`

From `fanout_fanin`:

- `fanout_fanin_balanced/tokio_mt_4`: ~`1.43-1.51 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`982-989 us`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.35-2.42 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`1.92-1.93 ms`

From `disk_io`:

- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.82-2.00 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.52-2.74 ms`

Interpretation:

- value proposition now shows up directly in coordination-heavy fan-out/fan-in:
  - balanced and skewed scenarios both favor `spargio_io_uring`.
- compared with earlier same-day snapshot, `fanout_fanin_balanced` flipped from loss to win after the stealable dispatch changes.
- batched Tokio one-way throughput remains a known gap.
- disk RTT benchmark remains a known gap.

## Roadmap: Toward Full Runtime Scope

Objective:

- evolve `spargio` into a fuller async runtime in the class of `glommio` / `monoio` / `compio`, while preserving the current differentiator (`msg_ring`-coordinated cross-shard scheduling + stealing).

Priority roadmap:

1. Lock the differentiator with stable KPI gates.
2. Build scheduler v2 (true per-worker deque stealing + fairness controls).
3. Complete core runtime primitives (timers, cancellation, task groups, backpressure semantics).
4. Deliver native network I/O MVP (TCP/UDP) on io_uring.
5. Deliver native filesystem I/O MVP with clear FD/buffer ownership and affinity rules.
6. Harden reliability and observability (stress/soak, failure injection, per-shard metrics and tracing).
7. Keep sidecar interop first-class; treat broad Tokio-compat readiness emulation as an optional long-term lane.

Immediate milestone sequence:

1. Deque-based stealing + fairness/budgeting.
2. Timer + timeout + cancellation primitives.
3. TCP MVP + dedicated latency/throughput/tail benchmarks.

## Update: Roadmap Tasks 1-5 MVP Implementation (TDD) (2026-02-26)

Implemented the first pass for roadmap tasks 1-5 with red/green TDD, then validated with tests and benchmark guardrails.

### 1) KPI gates for value proposition

Added benchmark guardrails/scripts:

- `scripts/bench_ping_guardrail.sh`
  - checks `steady_ping_pong_rtt`, unbatched `steady_one_way_send_drain`, and `cold_start_ping_pong` against Tokio ratio thresholds.
- `scripts/bench_kpi_guardrail.sh`
  - runs ping + fanout guardrails together.
- existing `scripts/bench_fanout_guardrail.sh` retained.
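
For readers unfamiliar with ratio gates, the check reduces to a single comparison. The sketch below is Rust rather than the shell used by the scripts, and the `0.5` threshold is made up for illustration; the log does not state the real threshold values.

```rust
/// Midpoint of a reported Criterion range.
fn midpoint(lo: f64, hi: f64) -> f64 {
    (lo + hi) / 2.0
}

/// Passes when the candidate time is at most `max_ratio` times the baseline.
fn within_ratio(candidate_us: f64, baseline_us: f64, max_ratio: f64) -> bool {
    baseline_us > 0.0 && candidate_us / baseline_us <= max_ratio
}

fn main() {
    // Ranges from the `steady_ping_pong_rtt` readout earlier in this log, in microseconds.
    let spargio = midpoint(352.0, 370.0);
    let tokio = midpoint(1300.0, 1420.0);
    // 0.5 is a hypothetical threshold, not the value used by the scripts.
    println!(
        "ratio = {:.3}, pass = {}",
        spargio / tokio,
        within_ratio(spargio, tokio, 0.5)
    );
}
```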

CI update:

- `.github/workflows/ci.yml` now runs:
  - fanout smoke
  - ping perf guardrail
  - fanout perf guardrail

### 2) Scheduler v2 (per-worker deque stealing + fairness controls)

Runtime changes:

- added `RuntimeBuilder::stealable_queue_capacity(...)`.
- added `RuntimeBuilder::steal_budget(...)`.
- changed stealable submission path:
  - submit to preferred shard deque (`StealablePreferred`) with bounded capacity.
  - return `RuntimeError::Overloaded` on enqueue backpressure.
- worker execution loop now:
  - drains local deque first up to budget.
  - attempts bounded victim steals via rotating cursor when local queue has room.
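
A rough sketch of one pass of such a loop, using plain `VecDeque`s. `schedule_pass` is a hypothetical helper, and the steal condition here (leftover budget after the local drain, one probe per victim) is a simplifying assumption rather than the runtime's exact rule.

```rust
use std::collections::VecDeque;

/// One scheduling pass for worker `me` (hypothetical helper).
/// Returns the task ids picked to run in this pass.
fn schedule_pass(
    queues: &mut [VecDeque<u32>],
    me: usize,
    cursor: &mut usize,
    budget: usize,
) -> Vec<u32> {
    let mut picked = Vec::new();
    // 1. Local work first, capped by the budget.
    while picked.len() < budget {
        match queues[me].pop_front() {
            Some(task) => picked.push(task),
            None => break,
        }
    }
    // 2. Spend leftover budget on steals; probe each victim at most once.
    let n = queues.len();
    let mut probes = 0;
    while picked.len() < budget && probes < n {
        let victim = *cursor % n;
        *cursor = (*cursor + 1) % n; // rotate so victims are probed fairly
        probes += 1;
        if victim == me {
            continue;
        }
        // Steal from the opposite end to the owner's pops.
        if let Some(task) = queues[victim].pop_back() {
            picked.push(task);
        }
    }
    picked
}

fn main() {
    let mut queues = vec![
        VecDeque::from(vec![1, 2, 3]),
        VecDeque::from(vec![10, 11]),
        VecDeque::new(),
    ];
    let mut cursor = 0;
    println!("idle worker 2 runs {:?}", schedule_pass(&mut queues, 2, &mut cursor, 2));
    println!("busy worker 0 runs {:?}", schedule_pass(&mut queues, 0, &mut cursor, 2));
}
```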

New stats signals:

- `stealable_backpressure`
- `steal_attempts`
- `steal_success`

### 3) Core runtime primitives (timer/cancellation/task groups/backpressure semantics)

Added:

- `sleep(Duration) -> impl Future<Output = ()>`
- `timeout(Duration, fut) -> Result<T, TimeoutError>`
- `CancellationToken` with:
  - `new()`
  - `cancel()`
  - `is_canceled()`
  - `cancelled() -> Future`
- `TaskGroup` with cooperative cancellation:
  - `TaskGroup::new(handle)`
  - `spawn_with_placement(...) -> TaskGroupJoinHandle<T>`
  - `cancel()`
  - `token()`

Backpressure semantics now include stealable task-queue overload via `RuntimeError::Overloaded`.
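
Cooperative cancellation means the task itself checks the token between units of work. The sketch below shows that pattern with a shared atomic flag; it covers only `new`, `cancel`, and `is_canceled` and leaves out the future-returning `cancelled()`. The internals and the `sum_until_canceled` worker are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Simplified token: clones share one flag.
#[derive(Clone, Default)]
struct CancellationToken {
    flag: Arc<AtomicBool>,
}

impl CancellationToken {
    fn new() -> Self {
        Self::default()
    }

    fn cancel(&self) {
        self.flag.store(true, Ordering::Release);
    }

    fn is_canceled(&self) -> bool {
        self.flag.load(Ordering::Acquire)
    }
}

/// Hypothetical cooperative worker: checks the token before each item
/// and gives up (returns `None`) once cancellation is observed.
fn sum_until_canceled(token: &CancellationToken, items: &[u64]) -> Option<u64> {
    let mut total = 0;
    for item in items {
        if token.is_canceled() {
            return None;
        }
        total += item;
    }
    Some(total)
}

fn main() {
    let token = CancellationToken::new();
    println!("{:?}", sum_until_canceled(&token, &[1, 2, 3]));
    token.clone().cancel(); // any clone cancels the whole group
    println!("{:?}", sum_until_canceled(&token, &[1, 2, 3]));
}
```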

### 4) Native network I/O MVP (io_uring lane)

Extended `UringNativeLane` with:

- `recv(fd, len)`
- `send(fd, buf)`

Implemented via native io_uring ops:

- `IORING_OP_RECV`
- `IORING_OP_SEND`

### 5) Native filesystem I/O MVP (ownership + affinity surface)

Added:

- `UringNativeLane::fsync(fd)` (`IORING_OP_FSYNC`)
- `UringBoundFd` ownership wrapper bound to a lane/shard with methods:
  - `read_at`, `write_at`, `recv`, `send`, `fsync`
- binding helpers:
  - `bind_owned_fd`
  - `bind_file`
  - `bind_tcp_stream`
  - `bind_udp_socket`

This gives an explicit ownership + shard-affinity API surface for FD-driven native ops.

### Red/green tests added

- `tests/primitives_tdd.rs`
  - sleep timing
  - timeout success/failure
  - cancellation token notification
  - task-group cancellation and completion semantics
- `tests/slices_tdd.rs` additions
  - stealable queue backpressure -> `RuntimeError::Overloaded`
  - steal attempts/success stats under blocked-owner load
- `tests/uring_native_tdd.rs` additions
  - bound file write/read/fsync
  - bound TCP send/recv
  - bound UDP send/recv

### Validation

- `cargo fmt`
- `cargo test`
- `cargo test --features uring-native`
- `./scripts/bench_ping_guardrail.sh`
- `./scripts/bench_fanout_guardrail.sh`
- `cargo bench --bench disk_io -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Benchmark readout (latest local run profile)

From ping guardrail run:

- `steady_ping_pong_rtt/spargio_io_uring`: ~`363-380 us`
- `steady_ping_pong_rtt/tokio_two_worker`: ~`1.37-1.48 ms`
- `steady_one_way_send_drain/spargio_io_uring`: ~`73.0-75.3 us`
- `steady_one_way_send_drain/tokio_two_worker`: ~`104.6-115.8 us`
- `cold_start_ping_pong/spargio_io_uring`: ~`260-297 us`
- `cold_start_ping_pong/tokio_two_worker`: ~`463-511 us`

From fanout guardrail run:

- `fanout_fanin_balanced/tokio_mt_4`: ~`1.42-1.50 ms`
- `fanout_fanin_balanced/spargio_io_uring`: ~`1.33-1.35 ms`
- `fanout_fanin_skewed/tokio_mt_4`: ~`2.42-2.53 ms`
- `fanout_fanin_skewed/spargio_io_uring`: ~`2.03-2.04 ms`

From disk benchmark run:

- `disk_read_rtt_4k/tokio_two_worker_pread`: ~`1.79-1.93 ms`
- `disk_read_rtt_4k/io_uring_msg_ring_two_ring_pread`: ~`2.65-2.80 ms`

## Benchmark suite update: FS/Net API coverage and legacy disk bench removal

Implemented benchmark suite changes to align with current runtime API surface:

- removed legacy disk RTT benchmark harness:
  - deleted `benches/disk_io.rs`
  - removed `[[bench]] name = "disk_io"` from `Cargo.toml`
- added filesystem API benchmark suite:
  - `benches/fs_api.rs`
  - `fs_read_rtt_4k`:
    - `tokio_spawn_blocking_pread_qd1`
    - `spargio_uring_bound_file_qd1`
  - `fs_read_throughput_4k_qd32`:
    - `tokio_spawn_blocking_pread_qd32`
    - `spargio_uring_bound_file_qd32`
- added network API benchmark suite:
  - `benches/net_api.rs`
  - `net_echo_rtt_256b`:
    - `tokio_tcp_echo_qd1`
    - `spargio_uring_bound_tcp_qd1`
  - `net_stream_throughput_4k_window32`:
    - `tokio_tcp_echo_window32`
    - `spargio_uring_bound_tcp_window32`
- updated `Cargo.toml` benchmark targets:
  - `ping_pong`
  - `fanout_fanin`
  - `fs_api`
  - `net_api`

Validation run:

- `cargo fmt --all`
- `cargo bench --no-run`
- `cargo bench --no-run --features uring-native`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

Latest benchmark readout (short smoke profile):

From `fs_api`:

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.59-1.68 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.98-2.11 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`7.66-7.76 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`7.51-8.23 ms`

From `net_api`:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`8.17-8.54 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`6.89-6.97 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`11.12-11.42 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`29.33-30.01 ms`

## Net benchmark tuning pass: reduce `net_stream_throughput_4k_window32` gap

Goal:

- reduce overhead in the `uring-native` TCP path and re-run `net_api` to improve `net_stream_throughput_4k_window32`.

Implemented runtime/API changes (`src/lib.rs`):

- added owned-buffer native APIs:
  - `UringNativeLane::recv_owned(fd, Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
  - `UringNativeLane::send_owned(fd, Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
  - `UringBoundFd::recv_owned(Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
  - `UringBoundFd::send_owned(Vec<u8>) -> io::Result<(usize, Vec<u8>)>`
- kept existing convenience APIs by adapting through owned-buffer path:
  - `recv(fd, len)` now uses `recv_owned` + truncate
  - `send(fd, &[u8])` now uses `send_owned`
- added same-shard fast path in `recv_owned`/`send_owned`:
  - if called from matching runtime/shard context, enqueue native op directly to local command queue instead of spawning a new pinned task.
- wired owned-buffer request/response shapes through local command + backend + io_uring native op completion path.

TDD coverage:

- added `uring_bound_tcp_stream_supports_owned_send_and_recv_buffers` in `tests/uring_native_tdd.rs`.

Benchmark harness tuning (`benches/net_api.rs`):

- moved Spargio net workload execution into a pinned runtime worker task (command-driven harness), instead of issuing all ops from outside the runtime.
- switched throughput receive path to stream-byte draining with a reusable scratch buffer (`64 KiB`) for both Tokio and Spargio:
  - reduces per-op overhead and keeps the workload apples-to-apples as stream throughput.
- switched Spargio send path to owned-buffer reuse (`send_owned`) with fallback for partial sends.

Validation:

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

Result delta from this tuning pass:

- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: improved from ~`6.89-6.97 ms` to ~`5.46-5.70 ms`.
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: improved from ~`29.33-30.01 ms` to ~`12.96-13.16 ms`.

Current comparison (same run):

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.62-8.10 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.46-5.70 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.47-11.01 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`12.96-13.16 ms`

Interpretation:

- RTT is now clearly in Spargio’s favor for this harness.
- Stream throughput gap versus Tokio is substantially reduced (from ~2.6x slower to ~1.2x slower), but still present.

## Next optimization batch (committed plan before implementation)

Based on the current net throughput gap, the next batch is:

1. Introduce provided-buffer multishot receive path (`IORING_OP_RECV` with the `IORING_RECV_MULTISHOT` flag, plus `IORING_OP_PROVIDE_BUFFERS`) for stream receive-heavy benchmarks.
2. Expand reusable-buffer APIs (`recv_into`/owned-buffer reuse) so stream loops avoid per-op allocation churn.
3. Add batch-oriented stream APIs (`send_batch`, `recv_batch`/multishot helpers) to reduce per-message control overhead.
4. Increase pipelining depth in throughput paths by issuing batched/native operations with configurable in-flight windows.
5. Add an io_uring throughput preset (`single_issuer`, `coop_taskrun`, optional `sqpoll`) and use it in benchmark harnesses with fallback when unsupported.

Execution approach remains red/green TDD: add failing tests for each new API/behavior, then implement minimal passing behavior, then re-benchmark.

## Implementation: proposal batch (multishot/batching/tuning) completed

Implemented all items from the prior optimization proposal set.

### 1) Provided-buffer multishot receive path

Runtime additions (`src/lib.rs`):

- new local command: `SubmitNativeRecvMultishot`
- new native op state: `NativeIoOp::RecvMulti` (buffer group, target bytes, collected chunks)
- new driver path:
  - `submit_native_recv_multishot(...)`
  - submits `IORING_OP_PROVIDE_BUFFERS`, then a multishot `IORING_OP_RECV` (`IORING_RECV_MULTISHOT` flag)
  - collects CQEs until target bytes reached or stream ends
  - issues `IORING_OP_ASYNC_CANCEL` when the target is reached while CQEs still carry `IORING_CQE_F_MORE`
  - removes provided buffers via `IORING_OP_REMOVE_BUFFERS` on completion/failure
- completion path updated to process multishot/native housekeeping CQEs safely.

### 2) Reusable-buffer API expansion

Added:

- `UringNativeLane::recv_into(fd, Vec<u8>)`
- `UringBoundFd::recv_into(Vec<u8>)`

These preserve caller-owned buffers and avoid per-op allocation churn.

### 3) Batch-oriented stream APIs

Added:

- `UringNativeLane::send_batch(fd, Vec<Vec<u8>>, window)`
- `UringNativeLane::recv_batch_into(fd, Vec<Vec<u8>>, window)`
- `UringBoundFd::send_batch(...)`
- `UringBoundFd::recv_batch_into(...)`
- `UringNativeLane::recv_multishot(...)`
- `UringBoundFd::recv_multishot(...)`

### 4) Pipelining depth in throughput path

Benchmark harness updates (`benches/net_api.rs`):

- throughput send path now uses `send_batch` with reusable buffer pool.
- throughput receive path attempts `recv_multishot` first, then falls back to `recv_owned` if unsupported.
- this increases in-flight native work while keeping a fallback for older kernels.

### 5) io_uring throughput preset + harness usage

Runtime builder addition:

- `RuntimeBuilder::io_uring_throughput_mode(sqpoll_idle_ms)`
  - enables `coop_taskrun`
  - optional sqpoll setting through argument

Harness usage:

- `benches/fs_api.rs` and `benches/net_api.rs` now try throughput mode and fall back to plain io_uring runtime build if unavailable.

### Additional hardening done while implementing

- `flush_submissions()` now treats transient submit errors (`EAGAIN`/`EBUSY`/`EINTR`, the last surfacing in Rust as `ErrorKind::Interrupted`) as retry/defer instead of immediate fatal teardown.
- this removed runtime cancellation failures seen under benchmark pressure.
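
The retry/defer split can be illustrated with a small classifier over `io::Result`. `SubmitOutcome` and `classify_submit` are hypothetical names, and the errno constants are hard-coded Linux values so the sketch needs no external crate.

```rust
use std::io;

// Linux errno values, hard-coded so the sketch needs no external crate.
const EAGAIN: i32 = 11;
const EBUSY: i32 = 16;

#[derive(Debug, PartialEq, Eq)]
enum SubmitOutcome {
    Submitted(usize),
    Deferred, // keep SQEs queued and try again on the next loop turn
    Fatal,
}

/// Sorts a submit result into "done", "retry later", or "tear down".
fn classify_submit(result: io::Result<usize>) -> SubmitOutcome {
    match result {
        Ok(n) => SubmitOutcome::Submitted(n),
        Err(e) if e.kind() == io::ErrorKind::Interrupted => SubmitOutcome::Deferred,
        Err(e) if matches!(e.raw_os_error(), Some(EAGAIN) | Some(EBUSY)) => {
            SubmitOutcome::Deferred
        }
        Err(_) => SubmitOutcome::Fatal,
    }
}

fn main() {
    let busy = Err(io::Error::from_raw_os_error(EBUSY));
    println!("{:?}", classify_submit(busy));
    println!("{:?}", classify_submit(Ok(8)));
}
```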

### TDD additions

`tests/uring_native_tdd.rs` now includes:

- `uring_bound_tcp_stream_supports_recv_into_and_send_batch`
- `uring_bound_tcp_stream_supports_recv_multishot` (with unsupported-kernel fallback)

### Validation

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Latest benchmark readout after this implementation batch

From `fs_api`:

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.64-1.71 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.98-2.28 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`8.57-8.97 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`6.73-7.42 ms`

From `net_api`:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.92-8.35 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.57-5.88 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.93-11.85 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.92-12.28 ms`

Interpretation:

- proposal batch is functionally implemented end-to-end (APIs + runtime + tests + benches).
- stream throughput gap versus Tokio narrowed further while preserving RTT advantage.

## Next optimization batch: close net throughput gap vs Tokio

Goal:

- improve `net_stream_throughput_4k_window32` by reducing per-frame control-path overhead in Spargio’s native TCP path.

Planned items (to implement with red/green TDD):

1. True native send batching:
   - add a single-command native submit path for multiple sends (`send_batch_native`) instead of `join_all(send_owned(...))` fanout.
   - aggregate completions in-driver and reply once per batch.
2. Persistent multishot provided-buffer groups:
   - keep a reusable provided-buffer pool per fd/lane for throughput loops.
   - avoid `ProvideBuffers`/`RemoveBuffers` on every throughput batch.
3. Zero-copy-ish multishot completion path cleanup:
   - remove `chunks.clone()` completion duplication.
   - finish by moving accumulated chunks once.
4. Capability caching in benchmark/harness:
   - probe multishot support once and stop retrying unsupported ops each batch.
5. Stronger throughput semantics:
   - add `send_all_batch` behavior (or equivalent) so batch send handles partial writes without leaking throughput accounting.

## Implementation: net throughput optimization batch completed

Implemented all five planned items.

### 1) True native send batching

Runtime changes (`src/lib.rs`):

- new API:
  - `UringNativeLane::send_all_batch(fd, bufs, window)`
  - `UringBoundFd::send_all_batch(bufs, window)`
- `send_batch(...)` now delegates to `send_all_batch(...)`.
- new local command:
  - `SubmitNativeSendBatchOwned`
- new backend + driver path:
  - `ShardBackend::submit_native_send_batch(...)`
  - `IoUringDriver::submit_native_send_batch(...)`
- batch state and CQE handling:
  - `NativeSendBatch`
  - `NativeSendBatchPart`
  - `native_send_batches` + `native_send_parts`
  - `complete_native_send_batch_part(...)`
  - single batch reply channel per batch (not per send op).

### 2) Persistent multishot provided-buffer groups

Runtime changes (`src/lib.rs`):

- `NativeIoOp::RecvMulti` now references a pool key rather than owning temporary storage.
- new pool model:
  - `NativeRecvPoolKey`
  - `NativeRecvPool`
  - `native_recv_pools: HashMap<...>`
- multishot flow now:
  - registers provided buffers once per pool (`registered`).
  - reuses pool storage/group across calls.
  - reprovides consumed bids via `reprovide_multishot_buffers(...)`.
  - marks pool free via `mark_recv_pool_free(...)`.
  - removes all registered groups on driver shutdown.

### 3) Multishot completion path copy cleanup

- removed `chunks.clone()` completion duplication in `complete_native_op(...)`.
- completion now moves collected chunks with `std::mem::take(...)` when finishing multishot ops.

### 4) Capability caching in benchmark path

Benchmark changes (`benches/net_api.rs`):

- `spargio_echo_windowed(...)` now caches multishot support in-loop:
  - if `recv_multishot` returns `EINVAL` / `ENOSYS` / `EOPNOTSUPP`, disable further multishot attempts for the rest of the run.
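
A sketch of the caching idea: try the multishot path until it reports an "unsupported" errno once, then go straight to the fallback. `MultishotCache` and the closure-based `recv` are hypothetical; the real harness calls `recv_multishot` and `recv_owned` directly. The errno constants are hard-coded Linux values.

```rust
use std::io;

// Linux errno values, hard-coded so the sketch needs no external crate.
const EINVAL: i32 = 22;
const ENOSYS: i32 = 38;
const EOPNOTSUPP: i32 = 95;

fn is_unsupported(e: &io::Error) -> bool {
    matches!(e.raw_os_error(), Some(EINVAL) | Some(ENOSYS) | Some(EOPNOTSUPP))
}

struct MultishotCache {
    enabled: bool,
}

impl MultishotCache {
    fn new() -> Self {
        Self { enabled: true }
    }

    /// Runs `multishot` while it is believed to work; after the first
    /// "unsupported" error it is never attempted again.
    fn recv(
        &mut self,
        mut multishot: impl FnMut() -> io::Result<usize>,
        mut fallback: impl FnMut() -> io::Result<usize>,
    ) -> io::Result<usize> {
        if self.enabled {
            match multishot() {
                Err(e) if is_unsupported(&e) => self.enabled = false,
                other => return other,
            }
        }
        fallback()
    }
}

fn main() {
    let mut cache = MultishotCache::new();
    let mut attempts = 0;
    for _ in 0..3 {
        let n = cache
            .recv(
                || {
                    attempts += 1;
                    Err(io::Error::from_raw_os_error(ENOSYS))
                },
                || Ok(4096),
            )
            .unwrap();
        println!("received {} bytes", n);
    }
    println!("multishot attempts: {}", attempts);
}
```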

### 5) Stronger send semantics (`send_all_batch`)

- `send_all_batch` tracks per-buffer progress and retries partial writes until each buffer is fully sent or an error occurs.
- benchmark throughput sender now uses `send_all_batch(...)` (full-send semantics).
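
Full-send semantics boil down to a per-buffer progress loop. The sketch below shows that loop over any `std::io::Write`; it is sequential and ignores the `window` parameter, so it illustrates only the partial-write handling, not the native batched submit. `ChunkyWriter` is a hypothetical test double that accepts a few bytes per call.

```rust
use std::io::{self, Write};

/// Hypothetical writer that accepts at most `max_per_call` bytes per write.
struct ChunkyWriter {
    max_per_call: usize,
    written: Vec<u8>,
    calls: usize,
}

impl Write for ChunkyWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = buf.len().min(self.max_per_call);
        self.written.extend_from_slice(&buf[..n]);
        self.calls += 1;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Sends every buffer completely, resuming after each partial write.
/// Returns the total number of bytes sent.
fn send_all_batch<W: Write>(w: &mut W, bufs: &[Vec<u8>]) -> io::Result<usize> {
    let mut total = 0;
    for buf in bufs {
        let mut sent = 0; // progress within this buffer
        while sent < buf.len() {
            match w.write(&buf[sent..])? {
                0 => return Err(io::ErrorKind::WriteZero.into()),
                n => sent += n,
            }
        }
        total += sent;
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let mut w = ChunkyWriter { max_per_call: 3, written: Vec::new(), calls: 0 };
    let total = send_all_batch(&mut w, &[b"hello".to_vec(), b"world!".to_vec()])?;
    println!("sent {} bytes in {} writes", total, w.calls);
    Ok(())
}
```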

### Red/Green TDD additions

Added tests first in `tests/uring_native_tdd.rs`, then implemented runtime until green:

- `uring_bound_tcp_stream_supports_send_all_batch`
- `uring_bound_tcp_stream_reuses_recv_multishot_path_across_calls`

### Validation

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Latest benchmark readout after this batch

From `net_api`:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.62-8.07 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.26-5.70 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.42-10.73 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.02-11.16 ms`

From `fs_api`:

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.60-1.75 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.85-1.92 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`7.51-7.62 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`6.40-6.96 ms`

Interpretation:

- net throughput gap vs Tokio narrowed again (from ~1.1x slower to ~1.05x slower in this short-run harness).
- net RTT lead remains.
- fs throughput lead remains.

## Implementation: follow-up net throughput optimizations (session + segment path + reprovide coalescing)

Applied the next optimization set aimed at reducing remaining `net_stream_throughput_4k_window32` overhead.

### 1) Persistent session in benchmark worker

`benches/net_api.rs`:

- added `SpargioWindowedSession` that persists across `EchoWindowed` benchmark commands.
- session retains:
  - reusable tx buffer pool,
  - reusable recv scratch buffer,
  - cached multishot capability state.
- worker now reuses this session for matching `(payload, window)` rather than rebuilding per invocation.

### 2) Segment-based multishot API (avoid `Vec<Vec<u8>>` materialization in hot path)

`src/lib.rs`:

- new public types:
  - `UringRecvSegment { offset, len }`
  - `UringRecvMultishotSegments { buffer, segments }`
- new APIs:
  - `UringNativeLane::recv_multishot_segments(...)`
  - `UringBoundFd::recv_multishot_segments(...)`
- `recv_multishot(...)` remains for compatibility and now adapts from segment output.
- `NativeIoOp::RecvMulti` now accumulates into one flat output buffer + segment metadata rather than `Vec<Vec<u8>>`.
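
The flat-buffer-plus-segments layout can be sketched as follows. `RecvSegment` and `RecvSegments` are hypothetical stand-ins that mirror the `{ offset, len }` and `{ buffer, segments }` shapes named above; the field types are assumptions:

```rust
/// One received chunk, described as a range into the flat buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RecvSegment {
    offset: usize,
    len: usize,
}

/// One contiguous buffer plus per-chunk metadata (instead of Vec<Vec<u8>>).
#[derive(Debug, Default)]
struct RecvSegments {
    buffer: Vec<u8>,
    segments: Vec<RecvSegment>,
}

impl RecvSegments {
    /// Appends one received chunk and records where it landed.
    fn push_chunk(&mut self, chunk: &[u8]) {
        let offset = self.buffer.len();
        self.buffer.extend_from_slice(chunk);
        self.segments.push(RecvSegment { offset, len: chunk.len() });
    }

    /// Borrowing view over each chunk; no per-chunk allocation.
    fn iter<'a>(&'a self) -> impl Iterator<Item = &'a [u8]> + 'a {
        self.segments
            .iter()
            .map(move |s| &self.buffer[s.offset..s.offset + s.len])
    }

    /// Compatibility adapter: materializes the older Vec<Vec<u8>> shape.
    fn to_vecs(&self) -> Vec<Vec<u8>> {
        self.iter().map(|s| s.to_vec()).collect()
    }
}
```

The hot path only ever touches `buffer` and `segments`; the nested-vector form is built on demand by the adapter.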

### 3) Reprovide coalescing (reduce housekeeping SQEs)

`src/lib.rs`:

- `reprovide_multishot_buffers(...)` now:
  - sorts + deduplicates consumed bids,
  - coalesces contiguous bids into runs,
  - submits one `ProvideBuffers` SQE per contiguous run (instead of one per bid).
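
The sort, dedup, and run-coalescing steps can be sketched in isolation. `coalesce_bids` is a hypothetical helper, and the `(first_bid, count)` run shape is an assumption about how a run would be handed to a single `ProvideBuffers` submission:

```rust
/// Sorts and deduplicates consumed buffer IDs, then folds contiguous IDs
/// into `(first_bid, count)` runs, one run per would-be SQE.
fn coalesce_bids(mut bids: Vec<u16>) -> Vec<(u16, u32)> {
    bids.sort_unstable();
    bids.dedup();
    let mut runs: Vec<(u16, u32)> = Vec::new();
    for bid in bids {
        if let Some(last) = runs.last_mut() {
            // Extend the current run when this bid directly follows it.
            if u32::from(last.0) + last.1 == u32::from(bid) {
                last.1 += 1;
                continue;
            }
        }
        runs.push((bid, 1));
    }
    runs
}
```

For consumed bids `[5, 1, 2, 3, 9, 2, 10]` this yields three runs, `(1, 3)`, `(5, 1)`, `(9, 2)`, so three submissions replace what would otherwise be six.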

### TDD updates

- added test:
  - `uring_bound_tcp_stream_supports_recv_multishot_segments`
- preserved existing multishot compatibility tests; full `--features uring-native` test suite remains green.

### Validation

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Latest net benchmark snapshot after this follow-up

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.58-7.90 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.25-5.35 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.51-10.85 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`10.84-10.95 ms`

Interpretation:

- stream-throughput gap narrowed further and is now close to parity in this short-run harness.
- RTT lead for Spargio remains.

## Implementation: fs RTT (`qd=1`) optimization batch (items 1-3)

Implemented the requested three-item set for `fs_read_rtt_4k`.

### 1) Run Spargio FS loops inside pinned runtime worker

`benches/fs_api.rs`:

- replaced external `block_on` Spargio loop with a pinned worker command loop (`SpargioFsCmd`).
- `ReadRtt` and `ReadQd` now execute on shard `1` in the runtime task itself.
- benchmark caller uses std mpsc request/reply to drive the worker, mirroring Tokio harness structure more closely.

### 2) Reusable read buffer API (`read_at_into`)

`src/lib.rs`:

- added:
  - `UringNativeLane::read_at_into(fd, offset, buf)`
  - `UringBoundFd::read_at_into(offset, buf)`
- `read_at(...)` now adapts through `read_at_into(...)`.
- added native read-owned command path:
  - `LocalCommand::SubmitNativeReadOwned`
  - backend routing `submit_native_read_owned(...)`
  - driver submission `submit_native_read_owned(...)`
  - native op state `NativeIoOp::ReadOwned`
- completion and failure handling updated for `ReadOwned`.

### 3) Persistent file session API (actor-style)

`src/lib.rs`:

- added `UringFileSession`:
  - `read_at_into(...)`
  - `read_at(...)`
  - `shutdown(...)`
  - `shard()`
- new constructor on bound fd:
  - `UringBoundFd::start_file_session()`
- session is implemented as a pinned shard task with a command channel (`UringFileSessionCmd`), keeping repeated file operations on one shard.
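
The actor shape (one owner task, a command channel, caller-owned buffers handed back on reply) can be sketched with a plain thread and `std::sync::mpsc`. Everything here is a stand-in: `Cmd`, `start_session`, and the in-memory `data` are hypothetical, and the real session runs as a shard task issuing io_uring reads:

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Commands accepted by the session task (hypothetical shape).
enum Cmd {
    ReadAtInto {
        offset: usize,
        buf: Vec<u8>,
        reply: Sender<(usize, Vec<u8>)>,
    },
    Shutdown,
}

/// Spawns the owner task; `data` stands in for the file contents.
fn start_session(data: Vec<u8>) -> (Sender<Cmd>, thread::JoinHandle<()>) {
    let (tx, rx) = channel::<Cmd>();
    let handle = thread::spawn(move || {
        for cmd in rx {
            match cmd {
                Cmd::ReadAtInto { offset, mut buf, reply } => {
                    let start = offset.min(data.len());
                    let n = buf.len().min(data.len() - start);
                    buf[..n].copy_from_slice(&data[start..start + n]);
                    // Hand the same allocation back so the caller can reuse it.
                    let _ = reply.send((n, buf));
                }
                Cmd::Shutdown => break,
            }
        }
    });
    (tx, handle)
}

/// Request/reply helper: returns bytes read plus the reusable buffer.
fn read_at_into(session: &Sender<Cmd>, offset: usize, buf: Vec<u8>) -> Option<(usize, Vec<u8>)> {
    let (reply, rx) = channel();
    session.send(Cmd::ReadAtInto { offset, buf, reply }).ok()?;
    rx.recv().ok()
}
```

A caller threads one buffer through repeated `read_at_into` calls, which is the reuse pattern the `qd=1` path relies on.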

### Red/Green TDD

Added failing tests first, then implemented until green:

- `uring_bound_file_supports_read_at_into_reuse`
- `uring_bound_file_session_supports_repeated_reads`

### Validation

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Latest FS benchmark snapshot after this batch

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.62-1.73 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`0.99-1.01 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`7.59-7.75 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`5.74-6.27 ms`

Interpretation:

- `qd=1` RTT moved from slower-than-Tokio to faster-than-Tokio in this short-run harness.
- throughput lead at `qd=32` remains.

## Proposal: unbound submission-time steering for all native ops

Goal:

- allow stealable tasks to issue native ops without pre-pinning a lane, while selecting target shard at submission time.

Design slices:

1. Unbound native entrypoint:
   - add `RuntimeHandle::uring_native_unbound() -> UringNativeAny`.
   - expose all native ops (`read/write/fsync`, `send/recv`, batch, multishot) on `UringNativeAny`.
2. Lane selector:
   - introduce `NativeLaneSelector` using per-shard pending native-op depth + round-robin tie-break.
   - support optional locality hints (`preferred_shard`).
3. FD affinity lease table:
   - add `FdAffinityTable` (`fd -> shard`) with TTL/release on idle.
   - use weak leases for file ops, stronger leases for stream/socket ops, hard affinity for multishot lifetime.
4. Generic native command envelope:
   - add `SubmitNativeAny { op, reply }` and route to selected shard.
   - preserve local fast path when selected shard == current shard.
5. Op-family behavior:
   - file single-shot ops steerable per op,
   - stream single-shot ops steerable with lease-aware ordering,
   - batch ops single-lane per batch,
   - multishot fixed-lane for op lifetime (token/stream tied to owning lane).
6. Cancellation/timeouts:
   - add global `op_id -> shard` tracking for correct cancel routing.
   - keep resource cleanup on owning lane.
7. TDD rollout:
   - slice A: unbound file ops + selector correctness/distribution tests.
   - slice B: unbound stream single-shot + batch ordering tests.
   - slice C: unbound multishot lifecycle/cancel/cleanup tests.
   - slice D: benchmark variants (`*_unbound_*`) vs pinned/session APIs.

Recommendation:

- yes, this is worth doing, but as a phased effort.
- rationale:
  - it preserves explicit pinned/session fast paths while adding a flexible, scheduler-friendly mode for stealable compute tasks.
  - it unlocks broader ergonomics without forcing users to choose one affinity model globally.
- risk:
  - correctness complexity is non-trivial (lease ownership, cancellation routing, multishot lifetime rules), so TDD slice gating is required.

## Implementation: unbound submission-time steering (slices A-D)

Implemented the full unbound slice set in this pass.

### Slice A: unbound entrypoint + selector + file ops

`src/lib.rs`:

- added `RuntimeHandle::uring_native_unbound() -> UringNativeAny`.
- added `NativeLaneSelector`:
  - selection by per-shard pending native-op depth (`pending_native_ops_by_shard`) with round-robin tie-break.
  - optional preferred-shard hinting.
- added `UringNativeAny` API surface for native ops:
  - `read_at`, `read_at_into`, `write_at`, `fsync`
  - plus stream/batch/multishot methods (below).
- added FD affinity lease table (`FdAffinityTable`):
  - weak lease for file-family ops,
  - strong lease for stream single-shot/batch,
  - hard lease for multishot lifetime.
- added unbound op-route tracking:
  - global `NativeOpId` allocation and `op_id -> shard` map.
  - `active_native_op_count()` / `active_native_op_shard(...)` observability.
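
A minimal sketch of depth-based selection with a round-robin tie-break. `LaneSelectorSketch` is hypothetical, and the rule that a preferred shard wins only when it is already among the least-loaded shards is an assumption, not something the log specifies:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical selector; `depths[i]` is shard i's pending native-op depth.
struct LaneSelectorSketch {
    next: AtomicUsize,
}

impl LaneSelectorSketch {
    fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Picks a least-loaded shard. Ties go to the preferred shard if it is
    /// among the least loaded (assumed rule), otherwise to a rotating cursor.
    fn select(&self, depths: &[usize], preferred: Option<usize>) -> Option<usize> {
        let min = *depths.iter().min()?;
        if let Some(p) = preferred {
            if depths.get(p) == Some(&min) {
                return Some(p);
            }
        }
        let n = depths.len();
        let start = self.next.fetch_add(1, Ordering::Relaxed) % n;
        // Scan from the cursor so equal-depth shards are picked in rotation.
        (0..n).map(|i| (start + i) % n).find(|&s| depths[s] == min)
    }
}
```

With equal depths, successive calls rotate across shards, which is the property `uring_native_unbound_selector_distributes_when_depths_equal` checks for the real selector.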

Stats:

- `RuntimeStats` now includes `pending_native_ops_by_shard`.
- io_uring driver now updates both global pending-native count and per-shard pending-native depth.

### Slice B: stream single-shot + batch behavior

`UringNativeAny` now supports:

- `recv`, `recv_owned`, `recv_into`
- `send`, `send_owned`
- `send_batch`, `send_all_batch`
- `recv_batch_into`

Behavior:

- stream ops are lease-aware (`strong` lease), preserving lane-local ordering tendencies for repeated ops on the same FD.
- batch ops run single-lane per batch.

### Slice C: multishot lifecycle + cleanup

`UringNativeAny` now supports:

- `recv_multishot`
- `recv_multishot_segments`

Behavior:

- multishot uses `hard` FD affinity for operation lifetime.
- affinity is released when multishot completes.
- op-route map entries are added/removed around each unbound op, preserving ownership tracking.

### Slice D: benchmark variants (`*_unbound_*`)

`benches/fs_api.rs`:

- added `SpargioFsUnboundHarness`.
- added benchmark cases:
  - `spargio_uring_unbound_file_qd1`
  - `spargio_uring_unbound_file_qd32`

`benches/net_api.rs`:

- added `SpargioNetUnboundHarness`.
- added benchmark cases:
  - `spargio_uring_unbound_tcp_qd1`
  - `spargio_uring_unbound_tcp_window32`

### Red/Green TDD

Added failing tests first in `tests/uring_native_tdd.rs`, then implemented to green:

- `uring_native_unbound_requires_io_uring_backend`
- `uring_native_unbound_selector_distributes_when_depths_equal`
- `uring_native_unbound_file_ops_work`
- `uring_native_unbound_stream_ops_preserve_affinity_and_order`
- `uring_native_unbound_multishot_releases_hard_affinity_after_completion`
- `uring_native_unbound_tracks_active_op_routes_for_inflight_work`

### Validation

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Latest short-run benchmark snapshot

FS:

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.55-1.68 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.03-1.07 ms`
- `fs_read_rtt_4k/spargio_uring_unbound_file_qd1`: ~`1.01-1.03 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`8.55-8.70 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`5.93-6.68 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_unbound_file_qd32`: ~`6.57-7.38 ms`

Net:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.74-7.97 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`5.48-5.75 ms`
- `net_echo_rtt_256b/spargio_uring_unbound_tcp_qd1`: ~`7.64-8.04 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.69-11.17 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.09-11.33 ms`
- `net_stream_throughput_4k_window32/spargio_uring_unbound_tcp_window32`: ~`10.83-10.99 ms`

## Implementation: direct unbound command-envelope optimization (`SubmitNativeAny`)

Implemented the previously planned unbound-path optimization to remove per-op pinned-spawn overhead.

### What changed

`src/lib.rs`:

- added direct native command envelope:
  - `Command::SubmitNativeAny { op: NativeAnyCommand }`
  - `NativeAnyCommand` variants for read/write/fsync, send/recv, batch, multishot.
- `UringNativeAny` now dispatches native ops via:
  - same-shard local fast path: enqueue `LocalCommand` directly.
  - cross-shard envelope path: send `SubmitNativeAny` command to selected shard.
- preserved existing affinity/route semantics:
  - `NativeLaneSelector` selection.
  - FD lease table (`weak`/`strong`/`hard`).
  - `op_id -> shard` tracking and cleanup.

### New observability

`RuntimeStats` now includes:

- `native_any_envelope_submitted`
- `native_any_local_fastpath_submitted`
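
The two dispatch paths and their counters can be sketched with channels standing in for cross-shard command delivery. `Op`, `Dispatcher`, and `two_shard_dispatcher` are hypothetical; only the two counter names come from `RuntimeStats` as listed above:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Stand-in payload; the real `NativeAnyCommand` has many variants.
#[derive(Debug, PartialEq)]
enum Op {
    Fsync { fd: i32 },
}

#[derive(Default)]
struct Stats {
    native_any_envelope_submitted: u64,
    native_any_local_fastpath_submitted: u64,
}

struct Dispatcher {
    current_shard: usize,
    local_queue: Vec<Op>,
    remote: Vec<Sender<Op>>,
    stats: Stats,
}

impl Dispatcher {
    /// Same-shard ops skip the envelope; others are sent to the target shard.
    /// On failure the op is handed back to the caller.
    fn submit(&mut self, target: usize, op: Op) -> Result<(), Op> {
        if target == self.current_shard {
            self.local_queue.push(op);
            self.stats.native_any_local_fastpath_submitted += 1;
            return Ok(());
        }
        match self.remote.get(target) {
            Some(tx) => {
                tx.send(op).map_err(|e| e.0)?;
                self.stats.native_any_envelope_submitted += 1;
                Ok(())
            }
            None => Err(op),
        }
    }
}

/// Builds a dispatcher that lives on shard 0 of a two-shard setup.
fn two_shard_dispatcher() -> (Dispatcher, Vec<Receiver<Op>>) {
    let (txs, rxs): (Vec<_>, Vec<_>) = (0..2).map(|_| channel()).unzip();
    let d = Dispatcher {
        current_shard: 0,
        local_queue: Vec::new(),
        remote: txs,
        stats: Stats::default(),
    };
    (d, rxs)
}
```

Submitting to shard 0 bumps only the fast-path counter; submitting to shard 1 bumps only the envelope counter.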

### Red/Green TDD

Added failing tests first, then implemented to green:

- `uring_native_unbound_records_command_envelope_submission`
- `uring_native_unbound_records_local_fast_path_submission`

### Validation

- `cargo fmt --all`
- `cargo test -q`
- `cargo test -q --features uring-native`
- `cargo bench --no-run --features uring-native`
- `cargo bench --bench fs_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --bench net_api --features uring-native -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

### Latest short-run snapshot after optimization

FS:

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: ~`1.754-1.867 ms`
- `fs_read_rtt_4k/spargio_uring_bound_file_qd1`: ~`1.013-1.062 ms`
- `fs_read_rtt_4k/spargio_uring_unbound_file_qd1`: ~`1.003-1.028 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: ~`8.732-9.015 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_bound_file_qd32`: ~`5.967-6.988 ms`
- `fs_read_throughput_4k_qd32/spargio_uring_unbound_file_qd32`: ~`6.085-6.866 ms`

Net:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: ~`7.918-8.187 ms`
- `net_echo_rtt_256b/spargio_uring_bound_tcp_qd1`: ~`6.840-8.632 ms`
- `net_echo_rtt_256b/spargio_uring_unbound_tcp_qd1`: ~`5.539-5.812 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: ~`10.544-10.656 ms`
- `net_stream_throughput_4k_window32/spargio_uring_bound_tcp_window32`: ~`11.073-11.449 ms`
- `net_stream_throughput_4k_window32/spargio_uring_unbound_tcp_window32`: ~`10.996-11.408 ms`

Interpretation:

- unbound `net_echo_rtt_256b` improved materially after removing per-op spawn overhead.
- unbound fs remains competitive and generally close to bound.

## Roadmap Revision: ergonomics-first sequence (requested)

No implementation in this update; this section revises priority order only.

### New priority order

1. Scope simplification first: remove bound APIs to keep the codebase manageable.
   - deprecate/remove `UringNativeLane`/`UringBoundFd`-centric public paths in favor of unbound-first APIs.
   - remove bound-only benchmark variants and docs references once replacement coverage exists.
2. Ergonomics project (highest priority after simplification):
   - deliver a high-level API layer targeting parity with Compio-style filesystem and network ergonomics.
   - target outcome: common file/network flows can be written without manual lane/FD plumbing boilerplate.
3. After ergonomics parity milestone is complete:
   - add benchmark suites against Compio for filesystem and network APIs, with matched workload shapes.
   - prioritize broader native I/O surface expansion.
4. Then continue with remaining milestones:
   - production-grade work-stealing policy (fairness/starvation/adaptive heuristics),
   - tail-latency perf program (longer windows + p95/p99 gates),
   - production hardening (stress/soak/failure injection/observability),
   - optional Tokio-compat readiness shim as a separate large-investment track.

### Ergonomics parity target (Compio-like)

At completion of the ergonomics project, Spargio should provide equivalent day-to-day usability for core filesystem/network tasks:

- filesystem:
  - high-level async file open/create/read/write helpers,
  - convenience methods equivalent to common `read_to_end_at`/buffer-reuse workflows.
- network:
  - high-level async TCP/UDP connect/accept/send/recv helpers,
  - convenience traits/wrappers for common read/write loops and batching patterns.
- runtime entry ergonomics:
  - straightforward app entry patterns (macro or helper-based) with minimal setup boilerplate.

### Notes

- This roadmap change intentionally favors API usability and adoption surface before deeper policy/perf-hardening tracks.
- Bound APIs are treated as temporary complexity and are planned for removal ahead of the ergonomics phase.
- Post-ergonomics benchmarking will include explicit Spargio-vs-Compio fs/net comparisons.

## Update: scope simplification + ergonomics APIs + Compio benchmark lane

Completed the requested implementation batch in three slices:

### 1) Scope simplification (bound API removal)

Removed bound-centric native public APIs from `src/lib.rs`:

- removed `RuntimeHandle::uring_native_lane(...)`
- removed `UringNativeLane`
- removed `UringBoundFd`
- removed `UringFileSession`

Native public surface is now unbound-first:

- `RuntimeHandle::uring_native_unbound() -> UringNativeAny`

Also removed bound-oriented TDD/bench usage and migrated coverage to unbound equivalents.

### 2) Ergonomics project (Compio-like API shape)

Added high-level wrappers over unbound native ops in `src/lib.rs`:

- `spargio::fs`
  - `OpenOptions`
  - `File`
    - `open`, `create`, `from_std`
    - `read_at`, `read_at_into`, `read_to_end_at`
    - `write_at`, `write_all_at`, `fsync`
- `spargio::net`
  - `TcpStream`
    - `connect`, `from_std`
    - `send`, `recv`, `send_owned`, `recv_owned`
    - `send_all_batch`, `recv_multishot_segments`
    - `write_all`, `read_exact`
  - `TcpListener`
    - `bind`, `from_std`, `local_addr`, `accept`

Added red/green tests:

- new `tests/ergonomics_tdd.rs`
  - `fs_open_read_to_end_and_write_at`
  - `net_tcp_stream_connect_supports_read_write_all`
  - `net_tcp_listener_bind_accepts_and_wraps_stream`
- rewrote `tests/uring_native_tdd.rs` to unbound-only coverage.

### 3) Benchmark refresh + Compio comparisons

Added Compio to Linux dev-dependencies:

- `Cargo.toml`:
  - `[target.'cfg(target_os = "linux")'.dev-dependencies]`
  - `compio = { version = "0.18.0", default-features = false, features = ["runtime", "io-uring", "fs", "net", "io"] }`

Rewrote benchmark harnesses:

- `benches/fs_api.rs`
  - compares:
    - `tokio_spawn_blocking_pread_qd1`
    - `spargio_fs_read_at_qd1`
    - `compio_fs_read_at_qd1`
    - `tokio_spawn_blocking_pread_qd32`
    - `spargio_fs_read_at_qd32`
    - `compio_fs_read_at_qd32`
- `benches/net_api.rs`
  - compares:
    - `tokio_tcp_echo_qd1`
    - `spargio_tcp_echo_qd1`
    - `compio_tcp_echo_qd1`
    - `tokio_tcp_echo_window32`
    - `spargio_tcp_echo_window32`
    - `compio_tcp_echo_window32`

### Validation

- `cargo fmt`
- `cargo test --features uring-native --tests`
- `cargo bench --features uring-native --no-run`
- `cargo bench --features uring-native --bench fs_api -- --sample-size 20`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`

### Latest benchmark snapshot (sample-size 20)

FS:

- `fs_read_rtt_4k/tokio_spawn_blocking_pread_qd1`: `1.601-1.641 ms`
- `fs_read_rtt_4k/spargio_fs_read_at_qd1`: `1.012-1.026 ms`
- `fs_read_rtt_4k/compio_fs_read_at_qd1`: `1.388-1.421 ms`
- `fs_read_throughput_4k_qd32/tokio_spawn_blocking_pread_qd32`: `7.680-7.767 ms`
- `fs_read_throughput_4k_qd32/spargio_fs_read_at_qd32`: `5.971-6.054 ms`
- `fs_read_throughput_4k_qd32/compio_fs_read_at_qd32`: `5.983-6.119 ms`

Net:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.913-8.056 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.542-5.606 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.530-6.646 ms`
- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `11.306-11.511 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `16.903-17.082 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `6.928-7.091 ms`

### Notes

- This completes the requested simplification + ergonomics + Compio benchmark scope.
- Current ergonomic `fs::OpenOptions::open`, `net::TcpListener::bind/accept`, and `net::TcpStream::connect` are async wrappers using blocking helper threads for setup operations; native io_uring open/accept/connect op coverage remains future work.

## Update: net throughput optimization pass (owned buffers + batch/multishot receive)

Focused on `net_stream_throughput_4k_window32`, where Spargio remained behind Tokio/Compio after the ergonomics migration.

### Red/Green TDD

Added failing ergonomics test first:

- `tests/ergonomics_tdd.rs`
  - `net_tcp_stream_owned_buffers_support_read_write_all`

Then implemented the API and benchmark-path changes to green.

### API changes (`spargio::net::TcpStream`)

`src/lib.rs`:

- added `write_all_owned(Vec<u8>) -> io::Result<Vec<u8>>`
- added `read_exact_owned(Vec<u8>) -> io::Result<Vec<u8>>`
- optimized `read_exact(&mut [u8])` to reuse a scratch receive buffer rather than allocating per recv loop.

These allow high-frequency send/recv loops to reuse caller-owned buffers and avoid repeated allocation churn.

### Benchmark harness changes

`benches/net_api.rs`:

- `spargio_echo_rtt` now uses owned-buffer helpers:
  - `write_all_owned`
  - `read_exact_owned`
- `spargio_echo_windowed` now uses a throughput-oriented native path:
  - prebuild frame batch from reusable tx pool
  - `send_all_batch(...)`
  - `recv_multishot_segments(...)` with kernel capability fallback (`EINVAL/ENOSYS/EOPNOTSUPP`)
  - fallback receive path uses `read_exact_owned` with reusable buffer

### Validation

- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo bench --features uring-native --bench net_api --no-run`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`

### Latest `net_api` snapshot after optimization

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.878-8.032 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.516-5.613 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.555-6.715 ms`

- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `11.147-11.318 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `10.889-10.974 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `7.090-7.225 ms`

Result: Spargio throughput moved from clearly behind Tokio to slightly ahead in this harness run, while remaining behind Compio in sustained stream throughput.

## Update: local stream-session fast path + pool-backed multishot snapshot

Follow-up optimization work after the prior net-throughput pass.

### What was implemented

1) Local stream-session fast path (submission without unbound route tracking)

`src/lib.rs` (`UringNativeAny` + `spargio::net::TcpStream`):

- added direct-to-shard submit helper in `UringNativeAny`:
  - bypasses `op_routes` + FD-affinity lock bookkeeping for stream-session calls.
- added stream-session methods on `UringNativeAny`:
  - `select_stream_session_shard`
  - `recv_owned_on_shard`
  - `send_owned_on_shard`
  - `send_all_batch_on_shard`
  - `recv_multishot_segments_on_shard`
- `spargio::net::TcpStream` now selects a session shard at construction and routes stream ops through these methods.

2) Multishot receive copy-path change

`src/lib.rs` (`IoUringDriver::complete_native_op`):

- removed per-CQE compaction copy (`out.extend_from_slice(...)`) for multishot segments.
- now records segment offsets directly against buffer-pool layout (`bid * buffer_len`).
- returns a pool-backed snapshot buffer (`pool.storage.to_vec()`) with segment metadata.

Note: this is a safe pool-backed snapshot path (no per-segment compaction copy), not a full ownership-transfer zero-copy path. A first ownership-transfer attempt caused unsafe kernel buffer-registration interactions and was not kept.

### Red/Green TDD additions

Added failing tests first, then implemented to green:

- `tests/ergonomics_tdd.rs`
  - `net_tcp_stream_session_path_does_not_track_unbound_op_routes`
- `tests/uring_native_tdd.rs`
  - `uring_native_unbound_multishot_segments_expose_pool_backing_without_compaction_copy`

### Validation

- `cargo test --features uring-native --tests`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`

### Latest `net_api` snapshot after this pass

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.923-8.118 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.410-5.516 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.447-6.530 ms`

- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `10.902-11.155 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `11.225-11.441 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `7.007-7.118 ms`

Interpretation:

- stream RTT improved further on Spargio.
- throughput remains near Tokio (within a few percent in this run) and behind Compio on sustained stream throughput.

## Update: imbalanced net-stream benchmark (hot/cold skew)

Added a third `net_api` benchmark to measure skewed stream load across multiple concurrent TCP connections.

### What changed

- `benches/net_api.rs`:
  - refactored echo server fixture to support N accepted client connections per harness (`spawn_echo_server_with_clients`).
  - extended Tokio/Spargio/Compio harness command sets with `EchoImbalanced`.
  - each harness now creates `IMBALANCED_STREAMS=8` persistent streams.
  - existing RTT/windowed benchmarks continue to use the primary stream.
  - new benchmark group: `net_stream_imbalanced_4k_hot1_light7`.

### Imbalanced workload definition

- Streams: `8`
- Payload: `4096` bytes
- Window: `32`
- Heavy stream (`idx=0`): `2048` frames
- Light streams (`idx=1..7`): `128` frames each
- Total per iteration: `12,058,624` bytes (`2048 + 7 × 128 = 2944` frames of `4096` bytes)
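
As a worked check, the per-iteration byte count follows directly from the frame counts in this definition: `(2048 + 7 × 128) × 4096 = 12,058,624` bytes, or `11.5` MiB. The constant and function names below are hypothetical:

```rust
const PAYLOAD_BYTES: u64 = 4096;
const HEAVY_FRAMES: u64 = 2048;
const LIGHT_FRAMES: u64 = 128;
const LIGHT_STREAMS: u64 = 7;

/// Bytes sent per benchmark iteration across all 8 streams.
fn total_bytes_per_iteration() -> u64 {
    (HEAVY_FRAMES + LIGHT_STREAMS * LIGHT_FRAMES) * PAYLOAD_BYTES
}

/// Converts a byte count and an iteration time into MiB/s.
fn mib_per_sec(bytes: u64, millis: f64) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / (millis / 1000.0)
}
```

At `13.595 ms` per iteration this works out to roughly `846 MiB/s`.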

### Validation

- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- --sample-size 20`

### Latest results (`--sample-size 20`)

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.903-8.093 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.405-5.474 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.472-6.593 ms`

- `net_stream_throughput_4k_window32/tokio_tcp_echo_window32`: `11.157-11.203 ms`
- `net_stream_throughput_4k_window32/spargio_tcp_echo_window32`: `11.085-11.166 ms`
- `net_stream_throughput_4k_window32/compio_tcp_echo_window32`: `7.136-7.277 ms`

- `net_stream_imbalanced_4k_hot1_light7/tokio_tcp_8streams_hotcold`: `13.595-13.853 ms` (`830-846 MiB/s`)
- `net_stream_imbalanced_4k_hot1_light7/spargio_tcp_8streams_hotcold`: `16.335-16.502 ms` (`697-704 MiB/s`)
- `net_stream_imbalanced_4k_hot1_light7/compio_tcp_8streams_hotcold`: `12.089-12.215 ms` (`942-951 MiB/s`)

### Notes

- The new skew benchmark is stable and repeatable: each reported range above spans less than 2%.
- In the current implementation, Spargio is behind Tokio and Compio on this hot/cold multi-stream workload.

## Update: hypotheses and A/B plan for imbalanced net-stream slowdown

This captures why `net_stream_imbalanced_4k_hot1_light7` is currently slower on Spargio and what we should test next before changing core runtime behavior.

### Hypotheses

1. Workload shape is dominated by one serialized hot stream.
- In hot1/light7, one stream carries most bytes; single-stream TCP ordering limits parallelism and reduces benefits from work stealing.

2. Session-shard concentration reduces lane spread.
- Streams are created from one worker context; `TcpStream` picks `session_shard` at construction.
- With the preferred-shard bias in the selector, many streams may end up on the same shard.

3. Cross-shard submit overhead in imbalanced path.
- The imbalanced benchmark spawns stealable tasks per stream, but stream I/O still routes to the stream's `session_shard`.
- If a task executes off its stream's session shard, each op pays envelope/command/oneshot overhead.

4. Multishot receive path still performs heavy copying.
- Current multishot completion returns a pool snapshot via `pool.storage.to_vec()`.
- This copies the full pool per batch and can dominate throughput in hot stream workloads.

### Quick A/B plan to prove each cause

A/B-1: workload-shape sensitivity (hot-stream serialization)
- A: current `hot1/light7` profile.
- B: balanced profile with same total bytes spread evenly across streams.
- Success signal: if Spargio narrows/erases gap on balanced profile, shape serialization is a primary contributor.

A/B-2: stream session-shard distribution
- A: current stream construction path.
- B: instrument and enforce explicit spread (round-robin stream creation context or per-stream target shard) and record distribution.
- Success signal: if better spread improves imbalanced throughput, lane concentration is a contributor.

A/B-3: task placement vs. stream session shard
- A: current `spawn_stealable` for stream workers.
- B: run stream workers pinned/preferred to each stream `session_shard`.
- Success signal: if B improves latency/throughput, cross-shard submit overhead is material.

A/B-4: multishot copy cost
- A: current `take_recv_pool_storage -> to_vec()` behavior.
- B: copy only touched segment ranges (or temporarily force non-multishot read path as control).
- Success signal: lower time and reduced CPU/memory pressure confirms copy-path dominance.

### Copy-reduction and related optimization options

1) Copy only touched bytes from multishot segments (low risk).
- Replace full-pool clone with segment-aware gather into a compact output buffer.
- Expected effect: materially lower copy volume on partial-pool consumption.

2) Segment-fold API to avoid materializing receive buffers (medium risk).
- Add API that processes multishot segments in-place and returns folded result (checksum/parser state/etc.).
- Expected effect: near-zero extra copy for many streaming workloads.

3) Pool lease API for true zero-copy receive view (higher complexity).
- Return a lease object that references registered pool storage + segment metadata.
- Reclaim buffers on lease drop, with double-buffered pool strategy to keep pipeline full.

4) Placement alignment for stream workers (complementary).
- Run per-stream tasks on their `session_shard` by default in throughput-oriented paths.
- Expected effect: remove cross-shard submit + response overhead from hot I/O loops.
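
Option 1 (copy only touched bytes) can be sketched as a gather over the pool. This assumes the `bid * buffer_len` layout described earlier for the pool-backed snapshot; `Touched` and `gather_touched` are hypothetical names:

```rust
/// One completed receive: which pool buffer it used and how many bytes are valid.
struct Touched {
    bid: usize,
    len: usize,
}

/// Copies only the valid bytes of each touched buffer into a compact output,
/// returning the output plus `(offset, len)` segments into it.
fn gather_touched(
    pool: &[u8],
    buffer_len: usize,
    touched: &[Touched],
) -> (Vec<u8>, Vec<(usize, usize)>) {
    let total: usize = touched.iter().map(|t| t.len).sum();
    let mut out = Vec::with_capacity(total);
    let mut segments = Vec::with_capacity(touched.len());
    for t in touched {
        assert!(t.len <= buffer_len, "segment longer than one pool buffer");
        let start = t.bid * buffer_len; // assumed pool layout
        segments.push((out.len(), t.len));
        out.extend_from_slice(&pool[start..start + t.len]);
    }
    (out, segments)
}
```

Copy volume then scales with bytes actually received rather than with pool size, which is the effect option 1 is after.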

### Priority suggestion

- First: A/B-4 (copy path) and A/B-3 (placement alignment).
- Then: A/B-2 (distribution), A/B-1 (shape sensitivity) for explanatory confidence and benchmark positioning.

## Update: A/B results for imbalanced net-stream hypotheses

Ran targeted A/B matrix in `benches/net_api.rs` via benchmark group:
- `net_stream_imbalanced_ab_4k`

Command used:
- `cargo bench --features uring-native --bench net_api -- net_stream_imbalanced_ab_4k --sample-size 12`

### Key results (time ranges)

- `tokio_hotcold`: `13.547-13.682 ms`
- `tokio_balanced_total_bytes`: `8.046-8.174 ms`

- `spargio_hotcold_stealable_multishot`: `16.337-16.454 ms`
- `spargio_hotcold_pinned_multishot`: `16.358-16.512 ms`
- `spargio_hotcold_stealable_readexact`: `17.902-17.970 ms`
- `spargio_hotcold_pinned_readexact`: `17.742-17.896 ms`

- `spargio_balanced_stealable_multishot` (single-context stream init): `16.861-16.986 ms`

- `spargio_hotcold_stealable_multishot_distributed_connect`: `13.534-13.684 ms`
- `spargio_hotcold_pinned_multishot_distributed_connect`: `13.300-13.360 ms`
- `spargio_balanced_stealable_multishot_distributed_connect`: `9.080-9.172 ms`

### Hypothesis outcomes

1) Workload shape (hot-stream serialization) matters: **confirmed**.
- Tokio hotcold vs balanced shows a large swing.
- Spargio shows the same swing once stream session distribution is fixed (`13.6 ms` hotcold vs `9.1 ms` balanced in distributed-connect mode).

2) Session-shard concentration / stream distribution: **strongly confirmed (primary factor)**.
- Spargio hotcold improves from ~`16.4 ms` to ~`13.6 ms` by only changing stream init to distributed-connect.
- This is the biggest single improvement in the A/B set.

3) Placement alignment (stealable vs pinned-to-session): **secondary effect**.
- In single-context mode, pinned vs stealable is effectively flat.
- In distributed-connect mode, pinned gives a modest gain (~2%).

4) Multishot copy-path concern: **not primary in this workload**.
- `read_exact` variants are slower than multishot by ~8-10%.
- Conclusion: reducing full-pool clone may still help, but it is not the top bottleneck for this benchmark shape.

### Re-evaluated optimization priorities

1. Make stream session-shard distribution explicit/default for multi-stream workloads.
- Add runtime/net API controls for connect-time lane selection (e.g., round-robin shard hinting).

2. Add stream-task placement helpers that align execution with stream session shard.
- Keep work-stealable default, but provide an easy pinned/session-aligned fast path for throughput loops.

3. Keep multishot as default receive path for throughput profiles.
- Do not switch to a `read_exact`-only path for this workload class.

4. Move copy-reduction work to medium priority.
- Touched-range copy and lease-based zero-copy remain worthwhile, but after (1) and (2).

5. Add follow-up benchmark scenarios to validate generality.
- skewed + distributed under larger windows, mixed payload sizes, and parser-like downstream processing.

## Update: implemented optimization priorities from imbalanced A/B findings

Implemented the re-prioritized optimization set focused on multi-stream distribution, session-aligned execution ergonomics, and receive-copy reduction.

### 1) Stream distribution controls (runtime API)

`src/lib.rs` (`spargio::net`):

- added `StreamSessionPolicy`:
  - `ContextPreferred`
  - `RoundRobin`
  - `Fixed(ShardId)`
- added session-policy connect APIs on `TcpStream`:
  - `connect_with_session_policy(...)`
  - `connect_round_robin(...)`
  - `connect_many_with_session_policy(...)`
  - `connect_many_round_robin(...)`
- added session-policy wrap API:
  - `from_std_with_session_policy(...)`
- kept existing `connect(...)` / `from_std(...)` behavior via `ContextPreferred`.
- added session-policy accept APIs on `TcpListener`:
  - `accept_with_session_policy(...)`
  - `accept_round_robin(...)`

This makes multi-stream session placement explicit and gives a first-class round-robin path without requiring benchmark-specific task orchestration.
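
Illustrative sketch of the policy selection described above. This is a standalone std-only model, not the code in `src/lib.rs`: `SessionPolicy`, `SessionPicker`, and the `usize` shard id are hypothetical names, and reading `ContextPreferred` as "use the caller's current shard" is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a shard identifier.
pub type ShardId = usize;

/// Standalone model of the three policies named in this section.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SessionPolicy {
    /// Assumed meaning: keep the session on the caller's current shard.
    ContextPreferred,
    /// Rotate over all shards, one connection at a time.
    RoundRobin,
    /// Always use the given shard.
    Fixed(ShardId),
}

/// Picks session shards; `next` is the shared round-robin cursor.
pub struct SessionPicker {
    shard_count: usize,
    next: AtomicUsize,
}

impl SessionPicker {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        SessionPicker { shard_count, next: AtomicUsize::new(0) }
    }

    /// Returns the session shard for one new stream.
    pub fn pick(&self, policy: SessionPolicy, current: ShardId) -> ShardId {
        match policy {
            SessionPolicy::ContextPreferred => current % self.shard_count,
            SessionPolicy::RoundRobin => {
                self.next.fetch_add(1, Ordering::Relaxed) % self.shard_count
            }
            SessionPolicy::Fixed(shard) => shard % self.shard_count,
        }
    }

    /// Models a "connect many" call: one pick per stream.
    pub fn pick_many(&self, policy: SessionPolicy, current: ShardId, n: usize) -> Vec<ShardId> {
        (0..n).map(|_| self.pick(policy, current)).collect()
    }
}
```

With 4 shards and 8 streams, round-robin places two streams on each shard, while the context-preferred policy keeps all of them on the caller's shard. That is the concentration effect the A/B runs measured.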

### 2) Session-shard-aligned execution helpers

`src/lib.rs` (`spargio::net::TcpStream`):

- added `spawn_on_session(&RuntimeHandle, fut)`
- added `spawn_stealable_on_session(&RuntimeHandle, fut)`

This removes boilerplate for session-aligned throughput loops and enables straightforward pinned-to-session execution from stream handles.

### 3) Keep multishot as default throughput receive path

`benches/net_api.rs`:

- throughput/imbalanced hot paths continue to default to multishot receive mode.
- read-exact is kept only as an A/B comparison lane.

### 4) Copy reduction for multishot completion

`src/lib.rs` (io_uring driver):

- replaced full pool clone in multishot completion path with compact touched-range copy:
  - old: full `pool.storage.to_vec()` clone
  - new: copy only segment-covered ranges and rewrite segment offsets to compact buffer coordinates

This reduces receive-copy volume when only a subset of the registered pool is used per operation.
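
Minimal sketch of the touched-range copy idea, assuming a segment is just an `(offset, len)` pair into the pool. `Segment` and `compact_copy` are hypothetical names for illustration; the driver's real types are not shown in this log.

```rust
/// Hypothetical segment descriptor: a byte range inside a buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Segment {
    pub offset: usize,
    pub len: usize,
}

/// Copies only the bytes covered by `segments` out of `pool`.
///
/// Returns the compact buffer plus the segments rewritten so that their
/// offsets index into the compact buffer instead of the pool.
/// Panics if a segment reaches past the end of `pool`.
pub fn compact_copy(pool: &[u8], segments: &[Segment]) -> (Vec<u8>, Vec<Segment>) {
    let total: usize = segments.iter().map(|s| s.len).sum();
    let mut compact = Vec::with_capacity(total);
    let mut rewritten = Vec::with_capacity(segments.len());
    for seg in segments {
        // The next segment starts where the compact buffer currently ends.
        let start = compact.len();
        compact.extend_from_slice(&pool[seg.offset..seg.offset + seg.len]);
        rewritten.push(Segment { offset: start, len: seg.len });
    }
    (compact, rewritten)
}
```

For a 32-byte pool with segments at `8..12` and `20..22`, this copies 6 bytes rather than 32 and rewrites the segments to `0..4` and `4..6`.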

### Benchmark harness updates

`benches/net_api.rs`:

- `SpargioStreamInitMode::DistributedConnect` now uses runtime API (`connect_many_round_robin`) instead of benchmark-local pinned-connect orchestration.
- `bench_net_stream_imbalanced_4k_hot1_light7` uses distributed-connect Spargio harness (optimized multi-stream path).
- A/B matrix retained (`net_stream_imbalanced_ab_4k`) and updated to use the new helpers.

### Red/Green TDD

Added failing tests first, then implemented to green:

- `tests/ergonomics_tdd.rs`
  - `net_tcp_stream_connect_round_robin_distributes_session_shards`
  - `net_tcp_stream_spawn_on_session_runs_on_stream_session_shard`
- `tests/uring_native_tdd.rs`
  - updated multishot-copy expectation:
    - `uring_native_unbound_multishot_segments_use_compact_buffer_copy`

Validation:

- `cargo test --features uring-native --tests`
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- net_stream_imbalanced_ab_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_stream_imbalanced_4k_hot1_light7 --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_echo_rtt_256b --sample-size 12`

### Post-change benchmark snapshot (latest runs)

Imbalanced target benchmark:

- `net_stream_imbalanced_4k_hot1_light7/tokio_tcp_8streams_hotcold`: `14.058-14.331 ms`
- `net_stream_imbalanced_4k_hot1_light7/spargio_tcp_8streams_hotcold`: `13.300-13.734 ms`
- `net_stream_imbalanced_4k_hot1_light7/compio_tcp_8streams_hotcold`: `12.174-12.499 ms`

A/B confirmation:

- `spargio_hotcold_stealable_multishot_distributed_connect`: `13.410-13.639 ms`
- `spargio_hotcold_pinned_multishot_distributed_connect`: `13.050-13.144 ms`
- `spargio_balanced_stealable_multishot_distributed_connect`: `8.886-8.942 ms`

RTT sanity after harness adjustment:

- `net_echo_rtt_256b/tokio_tcp_echo_qd1`: `7.988-8.128 ms`
- `net_echo_rtt_256b/spargio_tcp_echo_qd1`: `5.625-5.793 ms`
- `net_echo_rtt_256b/compio_tcp_echo_qd1`: `6.599-6.704 ms`

### Interpretation

- Primary bottleneck identified earlier (session concentration) is now addressed via runtime API and benchmark-path adoption.
- Session-aligned helpers are in place and show modest additional gains in distributed mode.
- Compact multishot copy reduced copy overhead and improved several A/B lanes, while multishot remains better than read-exact for these workloads.

## Update: separated net A/B scenarios into experimental benchmark target

To keep long-running benchmark reporting focused and stable, imbalanced A/B diagnostic scenarios were moved out of the main net benchmark target.

### What changed

- Added new bench target in `Cargo.toml`:
  - `[[bench]] name = "net_experiments"`
- Main benchmark target `benches/net_api.rs` now includes only product-facing groups:
  - `net_echo_rtt_256b`
  - `net_stream_throughput_4k_window32`
  - `net_stream_imbalanced_4k_hot1_light7`
- Experimental A/B matrix moved to `benches/net_experiments.rs`.
- Experimental group renamed for clarity:
  - `exp_net_stream_imbalanced_ab_4k`

### Usage

- Product-facing benchmark suite:
  - `cargo bench --features uring-native --bench net_api`
- Experimental diagnostic suite:
  - `cargo bench --features uring-native --bench net_experiments`

### Validation

- `cargo check --features uring-native --bench net_api --bench net_experiments`
- Verified no A/B group is exposed from `net_api` target.
- Verified `net_experiments` runs `exp_net_stream_imbalanced_ab_4k` as intended.

## Update: dynamic-imbalance benchmark backlog + pipeline-hotspot implementation

Captured additional benchmark shapes in the backlog (recorded here for posterity) to better probe the `msg_ring` + work-stealing value proposition under dynamic skew:

1. `net_stream_hotspot_rotation`
- rotating hot stream without explicit CPU stage.
2. `net_stream_bursty_tenants`
- many streams with bursty ON/OFF activity and skewed arrivals.
3. `net_pipeline_imbalanced_io_cpu`
- per-frame recv/CPU/send pipeline with rotating hotspot.
4. `fanout_fanin_hotkey_rotation`
- fanout/fanin with moving hot key pressure across shards.
5. `accept_connect_churn_skewed`
- skewed short-lived connection churn including setup path.

Implemented now:

- Added new benchmark group in `benches/net_api.rs`:
  - `net_pipeline_hotspot_rotation_4k_window32`
- Added runtime lanes in the existing Tokio/Spargio/Compio net harness commands:
  - `*_pipeline_hotspot` command + execution path per runtime.
- Workload shape:
  - 8 streams, 4 KiB frames, window 32.
  - hotspot rotates every 64 frames.
  - per-frame CPU stage after echo receive (`heavy` for current hotspot stream, `light` for others).
- Added a shared deterministic CPU stage helper used by all three runtimes to keep the comparison shape aligned.
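
Sketch of the workload shape above, under stated assumptions: the hotspot index is derived from a global frame counter, and the CPU stage is an FNV-1a-style hash repeated a fixed number of rounds. The round counts (`1` and `16`) and all names are hypothetical; the helper in `benches/net_api.rs` may differ.

```rust
pub const STREAMS: usize = 8;
pub const ROTATE_EVERY: usize = 64;

/// Index of the hot stream for a given frame (rotates every 64 frames).
pub fn hot_stream(frame_idx: usize) -> usize {
    (frame_idx / ROTATE_EVERY) % STREAMS
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CpuLoad {
    Light,
    Heavy,
}

/// `Heavy` for the current hotspot stream, `Light` for all others.
pub fn load_for(stream: usize, frame_idx: usize) -> CpuLoad {
    if stream == hot_stream(frame_idx) {
        CpuLoad::Heavy
    } else {
        CpuLoad::Light
    }
}

/// Deterministic busy work: an FNV-1a-style hash over the frame,
/// repeated for an assumed number of rounds per load level.
pub fn cpu_stage(frame: &[u8], load: CpuLoad) -> u64 {
    let rounds = match load {
        CpuLoad::Light => 1,  // assumed value
        CpuLoad::Heavy => 16, // assumed value
    };
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for _ in 0..rounds {
        for &byte in frame {
            hash ^= byte as u64;
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    hash
}
```

Because the stage depends only on the frame bytes and the load level, every runtime performs identical CPU work per frame, which keeps the comparison about scheduling rather than about the workload.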

Validation:

- `cargo fmt`
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`

Quick snapshot (`sample-size 10`):

- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.075-26.308 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `32.686-33.156 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.496-51.812 ms`

## Update: added `net_stream_hotspot_rotation_4k` (I/O-only rotating hotspot)

Implemented the follow-up benchmark shape requested to isolate dynamic skew effects without an explicit CPU stage.

What was added:

- New benchmark group in `benches/net_api.rs`:
  - `net_stream_hotspot_rotation_4k`
- New runtime command lane across Tokio/Spargio/Compio harnesses:
  - `EchoHotspotRotation`
- Workload definition:
  - 8 streams
  - 4 KiB frames
  - hotspot rotates each step (`step % stream_count`)
  - per-step frame budget:
    - hotspot stream: `32` frames
    - non-hot streams: `2` frames
  - `64` steps total
  - window `32`
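
Worked arithmetic for the workload definition above, as a small sketch. The constants are taken from the list; the function names are hypothetical.

```rust
pub const STREAMS: usize = 8;
pub const STEPS: usize = 64;
pub const HOT_FRAMES: usize = 32;
pub const COLD_FRAMES: usize = 2;
pub const FRAME_BYTES: usize = 4096;

/// Frame budget for one stream in one step (`step % stream_count` is hot).
pub fn frames_for(step: usize, stream: usize) -> usize {
    if step % STREAMS == stream {
        HOT_FRAMES
    } else {
        COLD_FRAMES
    }
}

/// Total frames each stream sends across all steps.
pub fn total_frames_per_stream() -> [usize; STREAMS] {
    let mut totals = [0usize; STREAMS];
    for step in 0..STEPS {
        for (stream, total) in totals.iter_mut().enumerate() {
            *total += frames_for(step, stream);
        }
    }
    totals
}

/// Total payload bytes in one direction for the whole run.
pub fn total_bytes() -> usize {
    total_frames_per_stream().iter().sum::<usize>() * FRAME_BYTES
}
```

Each stream is hot in 8 of the 64 steps, so it sends `8 * 32 + 56 * 2 = 368` frames. The run totals `2944` frames, or 11.5 MiB of payload per direction. Load is balanced over the whole run and skewed only within each step.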

Validation:

- `cargo fmt`
- `cargo check --features uring-native --bench net_api`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`

Quick snapshot (`sample-size 10`):

- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.7249-8.7700 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `11.499-11.600 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.637-16.766 ms`

## Roadmap update: runtime entry ergonomics moved to the front

To reduce first-use friction, runtime entry ergonomics is now the first item in the upcoming roadmap.

Updated upcoming order:

1. Runtime entry ergonomics:
   - add a simple helper entrypoint (for example `spargio::run(...)`).
   - add optional `#[spargio::main]` proc-macro sugar in a companion proc-macro crate.
   - ensure feature-gated behavior and clear fallback/error messaging on unsupported platforms.
2. Remove blocking APIs from the public runtime surface.
   - replace helper-thread `run_blocking` paths in `fs::OpenOptions::open`, `net::TcpStream::connect`, and `net::TcpListener::bind/accept`.
   - require native/non-blocking paths for these setup operations.
3. Continue ergonomic parity work for fs/net API discoverability and docs.
4. Continue dynamic-imbalance benchmark expansion and optimization loops.
5. Proceed with broader native I/O surface + hardening milestones.

## Update: runtime entry ergonomics slice (helpers + `#[spargio::main]`)

Completed the next runtime-entry ergonomics slice with red/green TDD.

### Red phase

- Added new integration tests in `tests/entry_macro_tdd.rs`:
  - `main_macro_executes_async_body`
  - `main_macro_applies_builder_overrides`
  - `main_macro_panics_on_runtime_build_failure`
- Ran:
  - `cargo test --features macros --test entry_macro_tdd`
- Expected failure observed:
  - package did not yet expose a `macros` feature.

### Green phase

- Added companion proc-macro crate:
  - `spargio-macros/Cargo.toml`
  - `spargio-macros/src/lib.rs`
- Implemented `#[spargio::main]` attribute macro:
  - supports async no-arg function entry wrappers;
  - supports options: `shards = ...`, `backend = "queue" | "io_uring"`;
  - validates unsupported signatures/options at compile time.
- Wired feature-gated export in main crate:
  - `Cargo.toml`: added optional dependency + `macros` feature.
  - `src/lib.rs`: `#[cfg(feature = "macros")] pub use spargio_macros::main;`
- Existing helper entry APIs (`spargio::run`, `spargio::run_with`) remain the non-macro path.

### Validation

- `cargo test --features macros --test entry_macro_tdd`
- `cargo test --test runtime_tdd`
- `cargo test --features macros --tests`
- `cargo fmt`

### Status

- Runtime entry ergonomics roadmap item is now covered by:
  - helper entry (`run`, `run_with`) and
  - optional attribute macro entry (`#[spargio::main]`).
- Next planned item remains removing blocking setup APIs from the public fs/net surface.

## Update: removed blocking setup helpers from fs/net public APIs (Red/Green TDD)

Goal completed:

- Removed helper-thread `run_blocking` setup paths from:
  - `spargio::fs::OpenOptions::open`
  - `spargio::net::TcpStream::connect*`
  - `spargio::net::TcpListener::bind/accept*`

### Red phase

Added/expanded failing tests in `tests/ergonomics_tdd.rs` to lock behavior before implementation:

- `net_tcp_stream_connect_supports_read_write_all` now asserts returned stream fd is nonblocking.
- `net_tcp_listener_bind_accepts_and_wraps_stream` now asserts accepted stream fd is nonblocking.
- Added fs option-compat tests:
  - `fs_open_options_create_new_reports_already_exists`
  - `fs_open_options_append_and_truncate_is_invalid`

Observed red failure before implementation:

- connected/accepted stream nonblocking assertions failed with the existing helper-thread setup path.

### Green phase

Implemented native setup operations in the io_uring command pipeline:

- Added new native command flow variants (`NativeAnyCommand`, `LocalCommand`, backend dispatch, driver submission/completion):
  - `OpenAt`
  - `Connect`
  - `Accept`
- Added `UringNativeAny` helpers:
  - `open_at(...)`
  - `connect_on_shard(...)`
  - `accept_on_shard(...)`
- Added driver-side completion handling for new `NativeIoOp` variants.

Public API behavior changes:

- `fs::OpenOptions::open` now uses native `IORING_OP_OPENAT` instead of helper threads.
- `net::TcpStream::connect*` now creates nonblocking sockets and completes with native `IORING_OP_CONNECT` on the chosen shard.
- `net::TcpListener::accept*` now uses native `IORING_OP_ACCEPT` (nonblocking + cloexec accepted sockets).
- `net::TcpListener::bind` now creates/binds/listens via nonblocking socket syscalls (no helper thread).
- `TcpStream::from_std_with_session_policy` now enforces nonblocking mode.

Notes:

- Added sockaddr encode/decode helpers for IPv4/IPv6 setup/completion paths.
- `fs::OpenOptions` flag mapping now validates invalid combinations in-process and uses `openat` flags/mode directly.
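
Sketch of in-process flag validation before an `openat` submission. It assumes the same combination rules as `std::fs::OpenOptions` on Unix; the spargio mapping itself is not shown in this log. `OpenFlagsSketch` and `to_openat_flags` are hypothetical names, and the `O_*` values are the x86_64 Linux constants written out by hand so the example stays std-only.

```rust
use std::io;

// x86_64 Linux open(2) flag values (assumption: other targets may differ).
pub const O_RDONLY: i32 = 0o0;
pub const O_WRONLY: i32 = 0o1;
pub const O_RDWR: i32 = 0o2;
pub const O_CREAT: i32 = 0o100;
pub const O_EXCL: i32 = 0o200;
pub const O_TRUNC: i32 = 0o1000;
pub const O_APPEND: i32 = 0o2000;
pub const O_CLOEXEC: i32 = 0o2000000;

#[derive(Clone, Copy, Debug, Default)]
pub struct OpenFlagsSketch {
    pub read: bool,
    pub write: bool,
    pub append: bool,
    pub truncate: bool,
    pub create: bool,
    pub create_new: bool,
}

fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidInput, msg)
}

/// Validates the option combination and returns `openat` flags.
pub fn to_openat_flags(o: OpenFlagsSketch) -> io::Result<i32> {
    let access = match (o.read, o.write, o.append) {
        (true, false, false) => O_RDONLY,
        (false, true, false) => O_WRONLY,
        (true, true, false) => O_RDWR,
        (false, _, true) => O_WRONLY | O_APPEND,
        (true, _, true) => O_RDWR | O_APPEND,
        (false, false, false) => return Err(invalid("no access mode requested")),
    };
    // Rejected before any submission, so no syscall is spent on it.
    if o.append && o.truncate && !o.create_new {
        return Err(invalid("append and truncate cannot be combined"));
    }
    if !o.write && !o.append && (o.truncate || o.create || o.create_new) {
        return Err(invalid("create/truncate require write or append access"));
    }
    let creation = match (o.create, o.truncate, o.create_new) {
        (_, _, true) => O_CREAT | O_EXCL,
        (true, true, false) => O_CREAT | O_TRUNC,
        (true, false, false) => O_CREAT,
        (false, true, false) => O_TRUNC,
        (false, false, false) => 0,
    };
    Ok(access | creation | O_CLOEXEC)
}
```

With `O_CREAT | O_EXCL`, the kernel itself reports `EEXIST` for an existing path, so the `create_new` case needs no in-process existence check.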

### Validation

Executed:

- `cargo fmt`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --test uring_native_tdd`
- `cargo test --features uring-native`

Result:

- All tests pass.

## Update: benchmark refresh after native setup-path changes

Re-ran the monitored benchmark suites and refreshed README tables.

Command profile used for all runs:

- `--warm-up-time 0.05`
- `--measurement-time 0.05`
- `--sample-size 20`

Commands executed:

- `cargo bench --features uring-native --bench ping_pong -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --features uring-native --bench fanout_fanin -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --features uring-native --bench fs_api -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`
- `cargo bench --features uring-native --bench net_api -- --warm-up-time 0.05 --measurement-time 0.05 --sample-size 20`

Highlights from refreshed results:

- Coordination:
  - `steady_ping_pong_rtt`: Tokio `1.4509-1.4888 ms`, Spargio `357.27-378.34 us`.
  - `steady_one_way_send_drain`: Tokio `70.972-75.645 us`, Spargio `66.006-66.811 us`.
  - `cold_start_ping_pong`: Tokio `535.65-601.90 us`, Spargio `262.24-291.99 us`.
  - `fanout_fanin_balanced`: Tokio `1.4625-1.5346 ms`, Spargio `1.3333-1.3496 ms`.
  - `fanout_fanin_skewed`: Tokio `2.4001-2.7005 ms`, Spargio `1.9590-1.9900 ms`.

- Native fs/net:
  - `fs_read_rtt_4k`: Tokio `1.6476-1.7647 ms`, Spargio `0.99148-1.0145 ms`, Compio `1.3893-1.4970 ms`.
  - `fs_read_throughput_4k_qd32`: Tokio `7.4895-7.6145 ms`, Spargio `5.9790-6.4699 ms`, Compio `5.4749-5.8905 ms`.
  - `net_echo_rtt_256b`: Tokio `7.7059-8.0959 ms`, Spargio `5.3708-5.6477 ms`, Compio `6.4743-6.7640 ms`.
  - `net_stream_throughput_4k_window32`: Tokio `11.163-11.324 ms`, Spargio `10.668-10.719 ms`, Compio `7.2779-7.4795 ms`.

- Imbalanced net:
  - `net_stream_imbalanced_4k_hot1_light7`: Tokio `13.426-14.098 ms`, Spargio `13.510-13.911 ms`, Compio `12.221-12.479 ms`.
  - `net_stream_hotspot_rotation_4k`: Tokio `8.6480-8.7488 ms`, Spargio `11.285-11.811 ms`, Compio `16.346-16.702 ms`.
  - `net_pipeline_hotspot_rotation_4k_window32`: Tokio `26.383-26.937 ms`, Spargio `34.962-35.935 ms`, Compio `50.764-51.179 ms`.

Outcome:

- README benchmark tables and interpretation updated to match this refresh.

## Next Plan: remove remaining blocking surfaces (checklist + sequence)

Goal:

- Keep data-plane waits and setup on native nonblocking/io_uring paths.
- Move control-plane APIs to async-first shapes, then deprecate blocking variants.

Remaining blocking surfaces identified:

- Boundary blocking ticket wait:
  - `BoundaryTicket::wait_timeout_blocking`.
- Boundary blocking server/client paths:
  - `BoundaryServer::recv`, `BoundaryServer::recv_timeout`, and blocking `BoundaryClient::call`.
- Timer helper:
  - `sleep` currently spawns a thread and uses `thread::sleep`.
- Hostname resolution path:
  - `to_socket_addrs()` in `first_socket_addr` can block for DNS.
- Synchronous runtime-control entry points:
  - `run_with` (`block_on`) and `shutdown` thread `join` waits.
- Queue-backend shard idle wait:
  - `rx.recv_timeout(idle_wait)` (fallback/control-plane backend).

Execution sequence (prioritized):

1. io_uring timer lane (high impact, low risk)
   - Add native timeout operation (`IORING_OP_TIMEOUT`) and route `sleep` through it on io_uring backend.
   - Keep queue backend fallback behavior unchanged.
   - Add TDD coverage for timer correctness/cancellation semantics.

2. Async-first boundary API (high impact, medium risk)
   - Add async `BoundaryServer::recv_async`/stream-style polling API.
   - Add async-first client call path and keep existing blocking APIs as compatibility wrappers.
   - Mark blocking variants as compatibility APIs in docs (and later deprecate).

3. Address-resolution split (medium impact, low risk)
   - Add `connect_socket_addr`-first API guidance and docs.
   - Keep hostname API but route through explicit resolver boundary so blocking DNS is isolated and optional.
   - Add tests that `SocketAddr` path stays fully nonblocking.

4. Runtime-control async variants (medium impact, medium risk)
   - Add `run_async` and `shutdown_async` (non-blocking caller thread semantics).
   - Keep existing sync entry points for ergonomics/back-compat.

5. Queue backend scope decision (medium impact, design choice)
   - Either:
     - keep queue backend as debug/fallback and accept blocking `recv_timeout`, or
     - reduce queue backend role and push io_uring-only profiles as default perf lane.
   - Record decision in ADR/log before implementation changes.

Acceptance checklist:

- [ ] No data-plane helper-thread blocking waits in io_uring mode.
- [ ] `sleep` uses native timeout path when io_uring backend is active.
- [ ] Boundary APIs have async-first equivalents covering current usage.
- [ ] Hostname resolution path is explicitly isolated from native data plane.
- [ ] README/implementation log reflect which blocking APIs are compatibility-only vs removed.

## Update: queue backend removed from public runtime configuration

Decision implemented from the blocking-surface plan:

- Queue backend is no longer selectable via `BackendKind`.
- `BackendKind` now exposes only `IoUring`.
- `RuntimeBuilder::default()` now defaults to `BackendKind::IoUring`.

Code and harness updates:

- Removed `BackendKind::Queue` usage from tests and benches.
- Updated runtime tests that previously forced queue mode to use io_uring (with existing graceful skip behavior when io_uring init is unavailable).
- Updated `ping_pong` and `fanout_fanin` benches to stop running `spargio_queue` variants.
- Updated README status text to describe io_uring-only backend.

Validation:

- `cargo fmt`
- `cargo test --features uring-native`
- `cargo bench --features uring-native --no-run`

Notes:

- Internal queue-oriented backend code paths remain in `ShardBackend` as dead code at this stage and are no longer instantiated through public builder/backend selection.
- Follow-up cleanup can remove those branches entirely if we want to reduce maintenance surface further.

## Update: internal queue backend branches removed

Follow-up cleanup completed after public queue-backend removal.

Changes:

- Removed internal `ShardBackend::Queue` handling branches from runtime dispatch.
- `ShardBackend` now routes only through io_uring paths in the Linux build.
- Removed queue-branch fallback logic in native submit handlers (`submit_native_*`).
- Removed shard-loop blocking idle wait path (`rx.recv_timeout(...)`), leaving nonblocking poll + cooperative yield behavior.
- Removed `RuntimeBuilder::idle_wait` field/method since it only supported the removed queue idle path.
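
Sketch of the difference between the removed blocking idle wait and a nonblocking poll with cooperative yield, modelled on a plain `std::sync::mpsc` channel. The real shard loop is not shown in this log; `run_shard_loop` is a hypothetical name.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::thread;

/// Drains `rx` without ever blocking on it.
///
/// An empty queue yields the thread instead of parking in
/// `recv_timeout(...)`. Returns how many times the loop yielded.
pub fn run_shard_loop<T>(rx: &Receiver<T>, mut handle: impl FnMut(T)) -> usize {
    let mut idle_yields = 0;
    loop {
        match rx.try_recv() {
            Ok(msg) => handle(msg),
            Err(TryRecvError::Empty) => {
                idle_yields += 1;
                // Cooperative yield: let other runnable work make progress.
                thread::yield_now();
            }
            // All senders are gone and the queue is drained.
            Err(TryRecvError::Disconnected) => return idle_yields,
        }
    }
}
```

The trade-off is that an idle loop of this shape keeps waking up rather than sleeping, so it only makes sense when the same loop also has other work to poll.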

Related API/harness alignment:

- `#[spargio::main(...)]` macro backend option now accepts only `"io_uring"`.
- Macro tests and examples updated accordingly.
- `ping_pong` and `fanout_fanin` benches no longer include `spargio_queue` variants.

Validation:

- `cargo fmt`
- `cargo test --features "uring-native macros"`
- `cargo bench --features uring-native --no-run`

Result:

- All checks pass.

## Update: blocking-surface plan slice implemented (Red/Green TDD)

Scope completed from the blocking-removal checklist:

- io_uring timer lane:
  - Added native timeout command path (`IORING_OP_TIMEOUT`) to the io_uring driver.
  - Added `UringNativeAny::sleep(Duration)`.
  - Routed top-level `spargio::sleep(...)` to shard-local native timeout path when running inside a Spargio shard; keeps fallback behavior outside shard context.

- Async-first boundary APIs:
  - Added async-first boundary surfaces:
    - `BoundaryClient::call_async(...)`
    - `BoundaryClient::call_async_with_timeout(...)`
    - `BoundaryServer::recv_async(...)`
    - `BoundaryServer::recv_timeout_async(...)`
    - `BoundaryTicket::wait_timeout(...)`
  - Kept blocking methods (`call`, `recv`, `recv_timeout`, `wait_timeout_blocking`) as compatibility wrappers.

- Address-resolution split:
  - Added explicit non-DNS socket-address APIs:
    - `net::TcpStream::connect_socket_addr(...)`
    - `net::TcpStream::connect_socket_addr_round_robin(...)`
    - `net::TcpStream::connect_many_socket_addr_round_robin(...)`
    - `net::TcpStream::connect_many_socket_addr_with_session_policy(...)`
    - `net::TcpStream::connect_socket_addr_with_session_policy(...)`
    - `net::TcpListener::bind_socket_addr(...)`
  - Kept hostname-based APIs as compatibility wrappers around a clearly named resolver path (`resolve_first_socket_addr_blocking`).

- Runtime-control async variants:
  - Added async runtime-entry/control APIs:
    - `run_async(...)`
    - `run_with_async(...)`
    - `Runtime::shutdown_async(...)`
  - Kept sync entry/control APIs (`run`, `run_with`, `shutdown`) for compatibility/ergonomics.
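
Sketch of the address-resolution split from the scope list above, using only `std::net`. A literal `ip:port` string parses without any lookup, while anything else goes through a function whose name states that it may block. The names here are hypothetical and are not the spargio functions.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Which path produced the address.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AddrSource {
    /// Parsed directly from an `ip:port` literal; no DNS involved.
    Literal,
    /// Went through the resolver, which may block on DNS.
    BlockingLookup,
}

/// Non-DNS path: succeeds only for literal socket addresses.
pub fn parse_literal(input: &str) -> Option<SocketAddr> {
    input.parse().ok()
}

/// Explicitly named blocking path: may perform a DNS lookup.
pub fn resolve_first_blocking_sketch(input: &str) -> io::Result<SocketAddr> {
    input
        .to_socket_addrs()?
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no address resolved"))
}

/// Tries the literal path first and falls back to the blocking resolver.
pub fn resolve_sketch(input: &str) -> io::Result<(SocketAddr, AddrSource)> {
    if let Some(addr) = parse_literal(input) {
        return Ok((addr, AddrSource::Literal));
    }
    let addr = resolve_first_blocking_sketch(input)?;
    Ok((addr, AddrSource::BlockingLookup))
}
```

Callers that already hold a `SocketAddr` can skip both functions entirely, which is the point of the `*_socket_addr` API variants.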

Red tests added:

- `tests/boundary_tdd.rs`
  - `boundary_async_call_and_recv_round_trip`
  - `boundary_async_recv_timeout_reports_timeout`
  - `boundary_ticket_wait_timeout_async_reports_timeout`
- `tests/runtime_tdd.rs`
  - `run_async_helper_executes_top_level_future`
  - `run_with_async_applies_custom_builder`
  - `runtime_shutdown_async_is_idempotent`
- `tests/ergonomics_tdd.rs`
  - `net_tcp_stream_connect_socket_addr_supports_read_write_all`
  - `net_tcp_listener_bind_socket_addr_accepts_and_wraps_stream`
- `tests/uring_native_tdd.rs`
  - `uring_native_unbound_sleep_uses_timeout_path`

Green + validation:

- `cargo fmt`
- `cargo test --features "uring-native macros" --test boundary_tdd --test runtime_tdd --test ergonomics_tdd --test uring_native_tdd`
- `cargo test --features "uring-native macros"`

Acceptance checklist status:

- [x] No data-plane helper-thread blocking waits in io_uring mode.
- [x] `sleep` uses native timeout path when io_uring backend is active on shard context.
- [x] Boundary APIs have async-first equivalents covering current usage.
- [x] Hostname resolution path is explicitly isolated from native data plane.
- [x] README/implementation log reflect which blocking APIs are compatibility-only vs removed.

## Update: removed public sync compatibility wrappers; async APIs are canonical (Red/Green TDD)

Rationale:

- The crate is not yet published, so this is the lowest-risk point to make the API async-first and remove blocking wrapper surfaces.

What changed:

- Runtime entry/control API cleanup:
  - `run` is now async (`run(...).await`).
  - `run_with` is now async (`run_with(builder, ...).await`).
  - Removed public `run_async` and `run_with_async` aliases.
  - `Runtime::shutdown` is now async.
  - Removed public sync `Runtime::shutdown`; retained internal blocking shutdown path only for `Drop`.

- Boundary API cleanup:
  - `BoundaryClient::call` and `call_with_timeout` are async-first.
  - `BoundaryServer::recv` and `recv_timeout` are async-first.
  - `BoundaryTicket::wait_timeout` remains async.
  - Removed sync compatibility wrappers:
    - `BoundaryTicket::wait_timeout_blocking`
    - sync `BoundaryServer::recv`/`recv_timeout` wrappers
    - sync `BoundaryClient::call`/`call_with_timeout` wrappers

- Macro compatibility after async rename:
  - `#[spargio::main]` now uses a hidden `spargio::__private::block_on(...)` helper to invoke async `run_with(...)` from generated sync `main`.

- Examples/tests updated to new async API names:
  - boundary TDD switched to async call/recv/timeout paths.
  - runtime TDD switched to async `run`/`run_with`/`shutdown` usage.
  - `examples/network_work_stealing.rs` updated to async `run_with(...).await`.
  - `examples/mixed_mode_service.rs` updated for async boundary call path.
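
For the macro item above: a generated sync `main` needs some way to drive an async entry point to completion. A minimal std-only sketch of that pattern follows. It is not the `spargio::__private::block_on` implementation, which this log does not show; `block_on_sketch` and `YieldOnce` are hypothetical names.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread that is driving the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Polls `fut` on the current thread until it completes.
pub fn block_on_sketch<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Test future: returns `Pending` once (after waking itself), then `Ready`.
pub struct YieldOnce {
    yielded: bool,
}

impl YieldOnce {
    pub fn new() -> Self {
        YieldOnce { yielded: false }
    }
}

impl Future for YieldOnce {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.yielded {
            Poll::Ready("done")
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```

A macro-generated `fn main()` could then wrap the user's async body in a single call of this shape, keeping the public `run`/`run_with` functions async while binaries still get a sync entry point.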

Validation:

- `cargo test --features "uring-native macros"`
- `cargo bench --features uring-native --no-run`

Result:

- Full test suite and benchmark target compilation pass after the async-first API break.

## Update: rotating-hotspot slowdown investigation plan (Tokio vs Spargio)

Question captured:

- Why are `net_stream_hotspot_rotation_4k` and `net_pipeline_hotspot_rotation_4k_window32` still faster on Tokio?

Current code-path findings:

- Both hotspot groups already use distributed stream setup in Spargio (`SpargioNetHarness::new_distributed()`), so this is not the earlier single-context concentration issue.
- Spargio hotspot stream path uses `send_all_batch + recv_multishot_segments (+ fallback read_exact_owned)`; Tokio uses simpler `write_all + read_exact` loops.
- Spargio pipeline hotspot path currently uses `write_all_owned/read_exact_owned` per frame and spawns per-stream jobs with generic `spawn_stealable`, not session-aligned placement.
- Native op submission still pays envelope/oneshot/tracking overhead per op when execution is off the stream session shard.

Working hypotheses for the current gap:

1. Placement mismatch in rotating-hotspot loops:
- per-stream tasks can execute off-session-shard (`spawn_stealable`), adding submit/reply overhead without enough skew persistence to amortize stealing wins.

2. Pipeline I/O method overhead:
- `write_all_owned/read_exact_owned` path has extra owned-buffer/method overhead in tight per-frame loops.

3. Multishot path may be suboptimal for this specific rotating shape:
- for short rotating bursts, multishot setup/segment handling may underperform simple exact-read loops.

4. Benchmark harness overhead differences:
- Tokio path uses a very lean inner loop and may currently benefit from less per-op user-space bookkeeping in this shape.

### Planned A/B matrix

A/B-1: task placement (both hotspot benchmarks)
- A: current `spawn_stealable`.
- B: `stream.spawn_stealable_on_session(...)`.
- C: `stream.spawn_on_session(...)`.

A/B-2: pipeline I/O method
- A: current `write_all_owned/read_exact_owned`.
- B: borrowed `write_all/read_exact` with reusable buffers.

A/B-3: stream-hotspot receive mode
- A: current multishot-first path.
- B: force read-exact path.

Execution plan:

1. Add experimental A/B benchmark lanes (net experiments target), no product-table changes yet.
2. Run targeted A/B for both hotspot benchmarks.
3. Implement only the winning changes into the main benchmark/runtime paths.
4. Keep TDD discipline: add failing tests for any API/runtime behavior changes, then implement to green.

## Update: rotating-hotspot A/B results + adopted optimizations

Executed the planned A/B matrix in `benches/net_experiments.rs`:

- `exp_net_stream_hotspot_rotation_ab_4k`
- `exp_net_pipeline_hotspot_rotation_ab_4k_window32`

Command set:

- `cargo bench --features uring-native --bench net_experiments -- exp_net_stream_hotspot_rotation_ab_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_experiments -- exp_net_pipeline_hotspot_rotation_ab_4k_window32 --sample-size 12`

### A/B findings

`exp_net_stream_hotspot_rotation_ab_4k`:

- `tokio_hotspot_rotation`: `8.7424-8.8669 ms`
- `spargio_hotspot_stealable_multishot`: `11.667-11.801 ms`
- `spargio_hotspot_stealable_session_multishot`: `11.705-11.967 ms`
- `spargio_hotspot_pinned_multishot`: `9.8044-9.9619 ms`
- `spargio_hotspot_pinned_readexact`: `9.5227-9.5928 ms`

Interpretation:

- Session-pinned placement is the main gain for this shape.
- For rotating hotspot stream-only traffic, read-exact outperforms multishot.
- Stealable-session-preferred did not beat pinned here.

`exp_net_pipeline_hotspot_rotation_ab_4k_window32`:

- `tokio_pipeline_hotspot`: `26.473-26.678 ms`
- `spargio_pipeline_stealable_owned`: `32.167-32.563 ms`
- `spargio_pipeline_stealable_session_owned`: `32.356-32.844 ms`
- `spargio_pipeline_pinned_owned`: `29.618-30.016 ms`
- `spargio_pipeline_pinned_borrowed`: `30.080-30.247 ms`

Interpretation:

- Session-pinned placement is again the primary improvement.
- Owned I/O loop stays slightly better than borrowed mode in this pipeline shape.

### Optimizations implemented from A/B

Applied to product benchmark path (`benches/net_api.rs`):

1. `net_stream_hotspot_rotation_4k`:
- per-stream work now runs with `stream.spawn_on_session(...)` (session-pinned placement).
- receive mode switched to read-exact for this rotating stream-hotspot workload.

2. `net_pipeline_hotspot_rotation_4k_window32`:
- per-stream work now runs with `stream.spawn_on_session(...)` (session-pinned placement).
- kept owned I/O loop (`write_all_owned/read_exact_owned`) as the better A/B mode.

3. Kept existing defaults unchanged where A/B did not indicate improvement:
- throughput/imbalanced hot path remains multishot-first.
- generic stealable placement remains for non-hotspot benchmark paths.

### Post-optimization benchmark snapshots (`net_api`)

Commands:

- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`

Results:

- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.6989-8.7937 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.5875-9.8201 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.782-17.053 ms`

- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.328-26.504 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `29.411-29.919 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.787-51.425 ms`

Net effect vs prior `net_api` snapshots:

- Stream rotating-hotspot: Spargio improved materially (about 14-16% faster) and moved closer to Tokio.
- Pipeline rotating-hotspot: Spargio improved materially (about 8-11% faster) and moved closer to Tokio.
- Both workloads still trail Tokio, but the remaining gap is substantially smaller than before.
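
Worked check of the percentages above, as a small sketch. It assumes the baseline is the earlier quick `net_api` snapshot (`sample-size 10`) and approximates each Criterion interval by its midpoint; `midpoint` and `relative_gain` are hypothetical helper names.

```rust
/// Midpoint of a reported `lo-hi` interval, in milliseconds.
pub fn midpoint(range: (f64, f64)) -> f64 {
    (range.0 + range.1) / 2.0
}

/// Fraction of time saved going from `before` to `after`.
pub fn relative_gain(before: (f64, f64), after: (f64, f64)) -> f64 {
    1.0 - midpoint(after) / midpoint(before)
}

/// Spargio stream rotating-hotspot: 11.499-11.600 ms -> 9.5875-9.8201 ms.
pub fn stream_gain() -> f64 {
    relative_gain((11.499, 11.600), (9.5875, 9.8201))
}

/// Spargio pipeline rotating-hotspot: 32.686-33.156 ms -> 29.411-29.919 ms.
pub fn pipeline_gain() -> f64 {
    relative_gain((32.686, 33.156), (29.411, 29.919))
}
```

On midpoints this gives roughly 16% for the stream benchmark and roughly 10% for the pipeline benchmark, which falls inside the ranges quoted above. Measured against the later refresh run instead, the pipeline gain would come out larger.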

## Update: implemented next hotspot optimizations (Red/Green TDD)

Follow-up optimizations implemented from the latest hotspot analysis:

1. Remove extra owned-buffer read/write overhead in stream loops.
2. Add a tighter same-shard native-op fast path for session-stream ops.

### Red phase

Added failing test in `tests/ergonomics_tdd.rs`:

- `net_tcp_stream_spawn_on_session_uses_local_direct_native_fastpath`

Initial failure:

- compile-time red because `RuntimeStats` had no `native_any_local_direct_submitted` field.

### Green phase

Implemented:

- New runtime stat:
  - `RuntimeStats::native_any_local_direct_submitted`
  - tracked in `RuntimeStatsInner` and surfaced via `stats_snapshot()`.

- Session-stream local direct path:
  - in `UringNativeAny::{recv_owned_at_on_shard, send_owned_at_on_shard}`, when running on the same runtime+shard context:
    - enqueue `LocalCommand::SubmitNative{Recv,Send}Owned` directly
    - increment `native_any_local_direct_submitted`
    - avoid `NativeAnyCommand -> LocalCommand` conversion path

- Offset-based native send/recv plumbing:
  - added `offset` to `NativeAnyCommand::{RecvOwned, SendOwned}`
  - added `offset` to `LocalCommand::{SubmitNativeRecvOwned, SubmitNativeSendOwned}`
  - io_uring driver now submits `Recv/Send` against `buf[offset..]` without cloning/splitting buffers.

- Stream owned I/O loop rewrites:
  - `TcpStream::write_all_owned` now advances using `send_owned_from(buf, offset)` (no fallback `send(&buf[sent..])` cloning path).
  - `TcpStream::read_exact_owned` now advances using `recv_owned_from(dst, offset)` (no `read_exact` scratch/copy path).

Validation:

- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --tests`

### Post-change benchmark snapshot

Commands:

- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`

Results:

- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.7900-8.8664 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.3389-9.4787 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.661-16.845 ms`

- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.322-26.549 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `28.933-29.121 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `51.323-52.073 ms`

Effect:

- Additional improvement in both rotating-hotspot benchmarks.
- Remaining gap to Tokio narrowed again (now roughly 5-10% depending on exact bound pair).
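
The "bound pair" arithmetic can be reproduced from the intervals above. The helper names below are hypothetical. Pairing the fastest Spargio bound with the slowest Tokio bound gives the smallest gap, and the reverse pairing gives the largest. With the numbers in this snapshot that works out to about 5.3% to 7.8% for the stream benchmark and about 9.0% to 10.6% for the pipeline benchmark.

```rust
/// Relative slowdown of `candidate_ms` versus `baseline_ms`, in percent.
fn gap_pct(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (candidate_ms / baseline_ms - 1.0) * 100.0
}

/// Smallest and largest gap over every pairing of the interval bounds.
/// Each argument is a `(low, high)` interval in milliseconds.
fn gap_range(baseline: (f64, f64), candidate: (f64, f64)) -> (f64, f64) {
    // Best case: fastest candidate vs slowest baseline; worst case is the reverse.
    (
        gap_pct(baseline.1, candidate.0),
        gap_pct(baseline.0, candidate.1),
    )
}
```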

## Update: local direct native replies now avoid oneshot allocation (Red/Green TDD)

Completed the in-progress local fast-path refactor so same-runtime same-shard
`recv_owned/send_owned` submissions do not allocate/use a oneshot channel.

### Green implementation details

- Added `NativeBufReply::{Oneshot, Local}` and `NativeBufReply::complete(...)`.
- Added local waiter pair:
  - `NativeBufReply::local_pair()`
  - `NativeLocalBufReplySlot` + `NativeLocalBufReplyFuture`
- Wired local-direct branch in:
  - `UringNativeAny::recv_owned_at_on_shard`
  - `UringNativeAny::send_owned_at_on_shard`
  to use the local waiter/future instead of oneshot.
- Updated io_uring native recv/send submit/completion paths to use
  `NativeBufReply` uniformly.
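
The idea of a reply target with a channel variant and a same-thread variant can be sketched with standard-library types. This is an illustration under stated assumptions, not the actual `NativeBufReply`: `BufReply` is a hypothetical name, `std::sync::mpsc` stands in for the oneshot channel, and a plain `Rc<RefCell<Option<T>>>` slot stands in for the local waiter and future pair.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::mpsc;

/// Hypothetical stand-in for a reply target with two delivery paths.
enum BufReply<T> {
    /// Cross-shard case: the result travels through a channel.
    Channel(mpsc::Sender<T>),
    /// Same-shard case: the result is written into a shared slot,
    /// so no channel is allocated.
    Local(Rc<RefCell<Option<T>>>),
}

impl<T> BufReply<T> {
    /// Creates the local variant plus the slot handle that the waiter keeps.
    fn local_pair() -> (Self, Rc<RefCell<Option<T>>>) {
        let slot = Rc::new(RefCell::new(None));
        (BufReply::Local(Rc::clone(&slot)), slot)
    }

    /// Completion code calls this uniformly, whichever variant it holds.
    fn complete(self, value: T) {
        match self {
            BufReply::Channel(tx) => {
                // The receiver may already be gone; this sketch ignores that case.
                let _ = tx.send(value);
            }
            BufReply::Local(slot) => *slot.borrow_mut() = Some(value),
        }
    }
}
```

Keeping a single `complete` entry point is what lets the submit and completion paths stay variant-agnostic.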

Validation:

- `cargo check --features uring-native`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --tests`

### Post-change hotspot benchmark snapshot

Commands:

- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`

Results:

- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.6940-8.8212 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.3020-9.4073 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.681-16.812 ms`

- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.286-26.560 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `29.025-29.574 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.614-50.986 ms`

Effect:

- Refactor is functionally complete and fully green.
- This specific change is mostly neutral on these two benchmark shapes
  (small movement within run-to-run noise).

## Update: keyed-hotspot benchmark follow-up (event-queue/msg path optimization backlog)

Context:

- Added `net_keyed_hotspot_rotation_4k` in `benches/net_api.rs` to stress
  rotating hotspot network I/O plus keyed cross-shard dispatch.
- Current snapshot (`--sample-size 12`):
  - `tokio_tcp_keyed_router_hotspot`: `9.2375-9.3226 ms`
  - `spargio_tcp_keyed_router_hotspot`: `10.061-10.254 ms`
- Interpretation: Tokio is still faster on this shape; likely overhead comes from
  per-message payload queueing, doorbell signaling, and event queue handling in
  Spargio’s ring-msg path.

Planned optimization ideas (highest ROI first):

1. Batch payload enqueue under one lock (high ROI, low risk)
- Problem: `SubmitRingMsgBatch` currently loops through per-message submit calls.
- Cost: lock/unlock and per-item queue overhead in `enqueue_payload` for each msg.
- Plan:
  - add a true backend/io_uring batch enqueue path:
    - one queue lock
    - append all payloads
    - one doorbell when queue transitions empty -> non-empty.
- Expected impact: reduce keyed-hotspot dispatch overhead materially.

2. Batch `EventState` delivery (high ROI, low-medium risk)
- Problem: `drain_payload_queue` pushes one event at a time, each with lock+wake.
- Plan:
  - add `EventState::push_many(...)`
  - queue drained ring-msg events in one critical section
  - wake waiters once per drained batch.
- Expected impact: lower owner-side event ingestion overhead.

3. Lower synchronization cost in `EventState` (medium ROI, medium risk)
- Problem: current queue uses mutex-protected `VecDeque` and per-push wake path.
- Plan options:
  - switch to lighter mutex implementation (e.g. `parking_lot`)
  - split producer-consumer queue/waker paths to reduce contention.
- Expected impact: lower overhead for high ring-msg event rates.

4. Fast path for hot internal ring-msg tags (medium ROI, medium-high risk)
- Problem: hot dispatch tags share same generic `EventState` path as all events.
- Plan:
  - route selected internal tags to dedicated per-shard mailboxes
  - keep `next_event()` for general API compatibility
  - use msg_ring as wake/doorbell only for these hot lanes.
- Expected impact: better keyed-router style throughput under hotspot churn.

5. Direct msg payload mode for tiny control messages (exploratory, medium-high risk)
- Problem: payload-queue + doorbell indirection adds overhead for tiny values.
- Plan:
  - where semantics allow, encode tiny payloads directly in `MSG_RING` CQEs
    (skip intermediate payload queue).
- Expected impact: reduced dispatch overhead for control-heavy micro-messages.
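
Idea 1 (one lock, append all payloads, one doorbell on the empty -> non-empty transition) can be sketched as follows. `PayloadQueue`, `BatchOutcome`, and the `(tag, val)` tuple layout are hypothetical names chosen for illustration; the real queue and doorbell mechanics live in the io_uring backend.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical bounded payload queue; entries are `(tag, val)` pairs.
struct PayloadQueue {
    inner: Mutex<VecDeque<(u16, u32)>>,
    capacity: usize,
}

/// Outcome of one batch submit.
#[derive(Debug, PartialEq)]
struct BatchOutcome {
    /// How many leading items of the batch were queued.
    accepted: usize,
    /// Whether the caller should signal the consumer.
    ring_doorbell: bool,
}

impl PayloadQueue {
    fn new(capacity: usize) -> Self {
        Self { inner: Mutex::new(VecDeque::new()), capacity }
    }

    /// Appends as many items as fit under a single lock acquisition.
    fn enqueue_batch(&self, batch: &[(u16, u32)]) -> BatchOutcome {
        let mut q = self.inner.lock().unwrap();
        let was_empty = q.is_empty();
        let room = self.capacity.saturating_sub(q.len());
        let accepted = batch.len().min(room);
        q.extend(batch[..accepted].iter().copied());
        // Only the empty -> non-empty transition needs a doorbell; a consumer
        // that was already signalled will drain the newly appended items too.
        BatchOutcome { accepted, ring_doorbell: was_empty && accepted > 0 }
    }
}
```

Reporting `accepted` once per batch is one way to account for partial acceptance without per-item bookkeeping.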

Validation plan for each change:

- Re-run:
  - `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
  - `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 12`
  - `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 12`
- Track regression guardrails on:
  - `net_stream_throughput_4k_window32`
  - `net_stream_imbalanced_4k_hot1_light7`

## Update: keyed-hotspot optimization pass (batching complete, lock-free payload A/B reverted)

Implemented in this pass:

1. `SubmitRingMsgBatch` now uses a true backend batch path
- `ShardBackend::submit_ring_msg_batch(...)` submits one batch call.
- `IoUringDriver::submit_ring_msg_batch(...)` enqueues in one queue lock section,
  sends at most one doorbell for empty->non-empty transitions, and accounts
  partial acceptance/backpressure once per batch.

2. Event ingress now batches queue+wake
- Added `EventState::push_many(...)` and used it from:
  - io_uring CQE ring-msg reap path
  - payload-queue drain path
- `ring_msgs_completed` accounting now aggregates by batch where applicable.

3. Lowered `EventState` synchronization overhead
- Replaced mutex-protected event queue with `crossbeam_queue::SegQueue<Event>`.
- Kept waiter registration under a small mutex (`Vec<Waker>`).
- `push/push_many` now perform lock-free queue push and only lock to drain waiters.

4. Ran a lock-free payload-queue A/B and reverted it
- Experiment: replaced per-target/per-source payload queues with bounded
  `ArrayQueue`.
- Outcome:
  - no keyed-hotspot improvement
  - rotating-stream hotspot regressed
- Decision: reverted payload-queue `ArrayQueue` experiment; retained
  event-queue synchronization changes above.
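
The batched queue+wake behaviour from item 2 can be sketched with standard-library types only. Note the swap: this sketch uses a mutex-protected `VecDeque` where the log describes `crossbeam_queue::SegQueue`, purely to stay dependency-free. The struct layout and the `CountingWaker` helper are assumptions for illustration.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Std-only stand-in: a mutex-protected queue replaces the lock-free queue.
struct EventState<E> {
    queue: Mutex<VecDeque<E>>,
    waiters: Mutex<Vec<Waker>>,
}

impl<E> EventState<E> {
    fn new() -> Self {
        Self {
            queue: Mutex::new(VecDeque::new()),
            waiters: Mutex::new(Vec::new()),
        }
    }

    /// Registers a waker to be notified on the next push.
    fn register(&self, waker: Waker) {
        self.waiters.lock().unwrap().push(waker);
    }

    /// Queues the whole batch, then wakes each registered waiter once.
    fn push_many(&self, events: impl IntoIterator<Item = E>) {
        self.queue.lock().unwrap().extend(events);
        // Take the waiters out first so wake() runs without holding the lock.
        let waiters = std::mem::take(&mut *self.waiters.lock().unwrap());
        for w in waiters {
            w.wake();
        }
    }
}

/// Counts wake-ups so the effect of batching is observable.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}
```

Pushing three events through `push_many` triggers one wake per waiter, where three single pushes could trigger up to three if the waiter re-registered in between.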

Validation:

- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`

Benchmarks (post-revert baseline, `--sample-size 12`):

- `net_keyed_hotspot_rotation_4k/tokio_tcp_keyed_router_hotspot`: `9.3457-9.3879 ms`
- `net_keyed_hotspot_rotation_4k/spargio_tcp_keyed_router_hotspot`: `10.008-10.062 ms`

- `net_stream_hotspot_rotation_4k/tokio_tcp_8streams_rotating_hotspot`: `8.8285-8.9134 ms`
- `net_stream_hotspot_rotation_4k/spargio_tcp_8streams_rotating_hotspot`: `9.3247-9.5191 ms`
- `net_stream_hotspot_rotation_4k/compio_tcp_8streams_rotating_hotspot`: `16.668-16.808 ms`

- `net_pipeline_hotspot_rotation_4k_window32/tokio_tcp_pipeline_hotspot`: `26.305-26.569 ms`
- `net_pipeline_hotspot_rotation_4k_window32/spargio_tcp_pipeline_hotspot`: `29.010-29.400 ms`
- `net_pipeline_hotspot_rotation_4k_window32/compio_tcp_pipeline_hotspot`: `50.682-51.536 ms`

Interpretation:

- Batching and event-ingress improvements are in place and stable.
- Main remaining gap on keyed-hotspot is not from payload queue lock granularity.
- Highest-ROI remaining ideas are:
  - hot-tag/internal mailbox fast path
  - direct tiny-control-message `MSG_RING` payload mode (selective bypass of doorbell queue)

## Update: direct `MSG_RING` control API (opt-in) + validation

Implemented:

- Added opt-in direct message APIs that bypass the payload queue/doorbell path:
  - `RemoteShard::send_raw_direct_nowait(...)`
  - `RemoteShard::send_many_raw_direct_nowait(...)`
  - `ShardCtx::send_raw_direct_nowait(...)`
  - `ShardCtx::send_many_raw_direct_nowait(...)`
- Runtime wiring:
  - new local command `SubmitRingMsgDirectBatch`
  - backend handler `submit_ring_msg_direct_batch(...)`
  - io_uring submit path `submit_ring_msg_direct_nowait(...)` (one `MSG_RING` SQE per message)

Red/Green tests added:

- `send_raw_direct_nowait_delivers_event`
- `send_many_raw_direct_nowait_delivers_in_order`

Validation:

- `cargo check --features uring-native`
- `cargo test --features uring-native --test runtime_tdd`
- `cargo test --features uring-native --tests`

Notes:

- This direct path is intentionally opt-in and currently best suited for low-volume,
  tiny control messages.
- Attempting to swap keyed-hotspot benchmark traffic to direct mode increased runtime
  significantly (high per-message SQE overhead under that specific load), so benchmark
  default was reverted to the stable batched payload-queue path.

Post-change benchmark sanity snapshot:

- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
  - `tokio_tcp_keyed_router_hotspot`: `9.2793-9.3288 ms`
  - `spargio_tcp_keyed_router_hotspot`: `9.9952-10.249 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
  - `tokio_tcp_8streams_rotating_hotspot`: `8.7510-8.8628 ms`
  - `spargio_tcp_8streams_rotating_hotspot`: `9.3289-9.6232 ms`
  - `compio_tcp_8streams_rotating_hotspot`: `16.771-16.908 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
  - `tokio_tcp_pipeline_hotspot`: `26.193-26.447 ms`
  - `spargio_tcp_pipeline_hotspot`: `28.856-28.982 ms`
  - `compio_tcp_pipeline_hotspot`: `50.464-51.058 ms`

## Update: hot-tag mailbox lane (msg routing fast path) for keyed dispatch

Implemented:

- Runtime builder hot-tag routing configuration:
  - `RuntimeBuilder::hot_msg_tag(tag)`
  - `RuntimeBuilder::hot_msg_tags(iter)`
- Added dedicated shard-local hot event lane:
  - `ShardCtx::next_hot_event()`
  - internal `hot_event_state` alongside regular `event_state`
- Routed incoming ring messages by tag at ingestion time:
  - io_uring CQE ring-msg path
  - payload-queue drain path
  - external `InjectRawMessage` path
- Keyed benchmark wiring:
  - benchmark runtime now enables hot tags for `KEYED_DISPATCH_TAG`/`KEYED_STOP_TAG`
  - keyed owner tasks consume via `next_hot_event()`
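
Routing by tag at ingestion time amounts to a set lookup followed by a push into one of two queues. The sketch below assumes hypothetical names (`Inbox`, `RingMsg` as a plain struct) and assumed field types; it does not reproduce the runtime's `Event` type or its async `next_hot_event()` future.

```rust
use std::collections::{HashSet, VecDeque};

/// Simplified message shape; field types are assumptions for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RingMsg {
    from: u16,
    tag: u16,
    val: u32,
}

/// Hypothetical two-lane inbox: configured hot tags bypass the regular queue.
struct Inbox {
    hot_tags: HashSet<u16>,
    hot: VecDeque<RingMsg>,
    regular: VecDeque<RingMsg>,
}

impl Inbox {
    fn new(hot_tags: impl IntoIterator<Item = u16>) -> Self {
        Self {
            hot_tags: hot_tags.into_iter().collect(),
            hot: VecDeque::new(),
            regular: VecDeque::new(),
        }
    }

    /// The routing decision is made once, when the message is ingested.
    fn ingest(&mut self, msg: RingMsg) {
        if self.hot_tags.contains(&msg.tag) {
            self.hot.push_back(msg);
        } else {
            self.regular.push_back(msg);
        }
    }

    /// Next message from the hot lane, if any.
    fn next_hot(&mut self) -> Option<RingMsg> {
        self.hot.pop_front()
    }

    /// Next message from the regular lane, if any.
    fn next_regular(&mut self) -> Option<RingMsg> {
        self.regular.pop_front()
    }
}
```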

Red/Green TDD:

- Added tests:
  - `hot_msg_tag_routes_to_hot_event_lane`
  - `non_hot_msg_tag_remains_on_regular_event_lane`
- Existing direct-message tests retained and passing.

Validation:

- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`

Benchmark snapshot after this change:

- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
  - `tokio_tcp_keyed_router_hotspot`: `9.4113-9.5537 ms`
  - `spargio_tcp_keyed_router_hotspot`: `9.9657-10.005 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
  - `tokio_tcp_8streams_rotating_hotspot`: `8.6508-8.7692 ms`
  - `spargio_tcp_8streams_rotating_hotspot`: `9.4165-9.5420 ms`
  - `compio_tcp_8streams_rotating_hotspot`: `16.692-16.835 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
  - `tokio_tcp_pipeline_hotspot`: `26.336-26.504 ms`
  - `spargio_tcp_pipeline_hotspot`: `29.244-29.392 ms`
  - `compio_tcp_pipeline_hotspot`: `50.869-51.357 ms`

Interpretation:

- Hot-tag lane is now functional and benchmarked.
- Keyed hotspot remains close to prior best range but still behind Tokio.
- Next likely high-ROI step remains value-coalescing for hot dispatch tags
  (aggregate frequent tiny hot-tag increments before queueing/wake).

## Update: coalesced-hot-tag ingestion (batch value aggregation)

Implemented:

- Added explicit coalesced-hot-tag config:
  - `RuntimeBuilder::coalesced_hot_msg_tag(tag)`
  - `RuntimeBuilder::coalesced_hot_msg_tags(iter)`
- Coalesced tags are automatically treated as hot tags.
- Extended ring-msg ingest path to coalesce same `(from, tag)` values within each
  ingest batch before queueing hot events:
  - io_uring CQE ring-msg batch
  - payload-queue drain batch
  - coalescing emits one or more `Event::RingMsg` with summed `val`
    (chunked safely if sum exceeds `u32::MAX`).
- Keyed benchmark harness now enables:
  - hot tags: `KEYED_DISPATCH_TAG`, `KEYED_STOP_TAG`
  - coalesced hot tag: `KEYED_DISPATCH_TAG`
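
Per-batch coalescing with overflow chunking can be sketched as a pure function. The `(from, tag, val)` tuple layout, the `u16` shard id type, and the first-seen output ordering are assumptions made for this illustration; only the summing per `(from, tag)` and the `u32::MAX` chunking come from the description above.

```rust
use std::collections::HashMap;

/// Sums `val` per `(from, tag)` within one batch, keeping first-seen key order,
/// and splits any sum above `u32::MAX` into several events.
fn coalesce_batch(batch: &[(u16, u16, u32)]) -> Vec<(u16, u16, u32)> {
    let mut order: Vec<(u16, u16)> = Vec::new();
    let mut sums: HashMap<(u16, u16), u64> = HashMap::new();
    for &(from, tag, val) in batch {
        let slot = sums.entry((from, tag)).or_insert_with(|| {
            // Remember the order in which keys first appear.
            order.push((from, tag));
            0
        });
        // Accumulate in u64 so the running sum cannot wrap a u32.
        *slot += u64::from(val);
    }

    let mut out = Vec::new();
    for key in order {
        let mut remaining = sums[&key];
        // Emit at least one event per key, then extra chunks on overflow.
        loop {
            let chunk = remaining.min(u64::from(u32::MAX)) as u32;
            out.push((key.0, key.1, chunk));
            remaining -= u64::from(chunk);
            if remaining == 0 {
                break;
            }
        }
    }
    out
}
```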

Red/Green TDD:

- Added tests:
  - `coalesced_hot_msg_tag_aggregates_batch_values`
  - `non_coalesced_hot_msg_tag_preserves_batch_events`
- Existing hot-lane tests retained and passing.

Validation:

- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`

Benchmark snapshot after coalescing:

- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
  - `tokio_tcp_keyed_router_hotspot`: `9.3593-9.4503 ms`
  - `spargio_tcp_keyed_router_hotspot`: `9.8008-10.002 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
  - `tokio_tcp_8streams_rotating_hotspot`: `8.7586-8.8332 ms`
  - `spargio_tcp_8streams_rotating_hotspot`: `9.4692-9.6138 ms`
  - `compio_tcp_8streams_rotating_hotspot`: `16.851-17.197 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
  - `tokio_tcp_pipeline_hotspot`: `26.303-26.520 ms`
  - `spargio_tcp_pipeline_hotspot`: `29.011-29.267 ms`
  - `compio_tcp_pipeline_hotspot`: `50.880-51.315 ms`

Interpretation:

- Coalescing improved keyed-hotspot path modestly and safely, with no material
  regression on stream/pipeline guardrails.
- Remaining keyed-hotspot gap appears to come from broader per-event control-path
  overhead, not just duplicate dispatch-value churn.

## Update: enqueue-time coalescing for coalesced-hot tags (queue-pressure reduction)

Implemented:

- `IoUringDriver` now carries coalesced-hot-tag lookup and applies it while
  writing payload queues (not only at ingest time).
- For coalesced-hot tags, enqueue path now merges with the queue tail when
  `(tail.tag == tag)`, including safe overflow chunking.
- This allows tight-capacity queues to absorb bursty tiny dispatch increments
  without immediate backpressure.
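
Tail merging at enqueue time can be sketched as below. `CoalescingQueue` and its `(tag, val)` entries are hypothetical, and the choice to reject an overflowing merge outright when no slot is free is an assumption of this sketch rather than documented behaviour.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded queue that merges a coalescable message into the tail
/// entry when the tags match, instead of taking a new slot.
struct CoalescingQueue {
    items: VecDeque<(u16, u32)>, // (tag, val)
    capacity: usize,
}

impl CoalescingQueue {
    fn new(capacity: usize) -> Self {
        Self { items: VecDeque::new(), capacity }
    }

    /// Returns `false` (backpressure) when the message cannot be queued.
    fn enqueue_coalesced(&mut self, tag: u16, val: u32) -> bool {
        let has_room = self.items.len() < self.capacity;
        if let Some(tail) = self.items.back_mut() {
            if tail.0 == tag {
                let headroom = u32::MAX - tail.1;
                if val <= headroom {
                    tail.1 += val; // fully absorbed; no new slot needed
                    return true;
                }
                if !has_room {
                    return false; // the remainder would need a slot we lack
                }
                // Overflow chunking: fill the tail, queue the remainder separately.
                tail.1 = u32::MAX;
                self.items.push_back((tag, val - headroom));
                return true;
            }
        }
        if !has_room {
            return false;
        }
        self.items.push_back((tag, val));
        true
    }
}
```

With capacity 1, a burst of values 1, 2, 3 on the same tag collapses into a single entry with value 6, which is the shape the tight-capacity test below exercises against the real runtime.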

Red/Green TDD:

- Added `coalesced_hot_tag_absorbs_batch_under_tight_queue_capacity`:
  - runtime with `msg_ring_queue_capacity(1)`
  - coalesced hot tag burst `(59,1),(59,2),(59,3)`
  - verifies success and single hot event with `val=6`
- Full suite remains green.

Validation:

- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`

Benchmark snapshot after enqueue-time coalescing:

- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
  - `tokio_tcp_keyed_router_hotspot`: `9.3417-9.4771 ms`
  - `spargio_tcp_keyed_router_hotspot`: `9.5432-9.6410 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
  - `tokio_tcp_8streams_rotating_hotspot`: `8.7407-8.8063 ms`
  - `spargio_tcp_8streams_rotating_hotspot`: `9.3352-9.4076 ms`
  - `compio_tcp_8streams_rotating_hotspot`: `16.536-16.814 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
  - `tokio_tcp_pipeline_hotspot`: `26.361-26.744 ms`
  - `spargio_tcp_pipeline_hotspot`: `29.060-29.326 ms`
  - `compio_tcp_pipeline_hotspot`: `50.503-51.418 ms`

Interpretation:

- Keyed-hotspot improved again: Spargio moved from `9.8008-10.002 ms` to
  `9.5432-9.6410 ms` (about 3% faster), so this slice appears higher ROI than
  ingest-only coalescing.
- Stream/pipeline guardrails remained stable.

## Update: completed remaining keyed-hotspot optimization slices (counter lane + adaptive wake policy)

Completed slices:

1. Cross-batch hot-counter accumulation
- Coalesced hot tags are now aggregated into shard-local counters (a `u16` tag ->
  `u64` count map) instead of being emitted as per-message hot events.
- Aggregation persists across ingest batches and drains, not only within a single
  batch callback.

2. Hot-counter consume fast path
- Added consume API:
  - `ShardCtx::next_hot_count(tag) -> Future<Output = u64>`
  - `ShardCtx::try_take_hot_count(tag) -> Option<u64>`
- Keyed benchmark owner path now consumes dispatch volume via `next_hot_count`
  and only uses `next_hot_event` for stop/control tags.
- This removes event-object overhead for coalesced dispatch traffic.

3. Adaptive dispatch/wake policy + hardening
- Added tuning knob:
  - `RuntimeBuilder::hot_counter_wake_threshold(u64)`
- Wake policy for waiting hot-counter consumers:
  - wake on 0->nonzero transition
  - or on crossing threshold from below.
- Added hardening tests:
  - `coalesced_hot_count_accumulates_across_batches`
  - `hot_counter_threshold_does_not_starve_first_update`
  - existing coalescing/hot-lane tests retained.
- Kept benchmark gate reruns on:
  - keyed hotspot (target KPI)
  - stream hotspot (guardrail)
  - pipeline hotspot (guardrail)
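
The counter lane and its wake policy can be sketched together. `HotCounters` and `should_wake` are hypothetical names, and reading "crossing threshold from below" as `prev < threshold && new >= threshold` is an interpretation, not something the log states precisely.

```rust
use std::collections::HashMap;

/// Wake decision for a waiting consumer, given the counter before and after an add.
fn should_wake(prev: u64, new: u64, threshold: u64) -> bool {
    // The first update always wakes, so a high threshold cannot starve it.
    let first_update = prev == 0 && new > 0;
    let crossed = prev < threshold && new >= threshold;
    first_update || crossed
}

/// Hypothetical shard-local counter lane keyed by tag.
#[derive(Default)]
struct HotCounters {
    counts: HashMap<u16, u64>,
}

impl HotCounters {
    /// Accumulates across batches; returns whether a waiter should be woken.
    fn add(&mut self, tag: u16, val: u32, threshold: u64) -> bool {
        let slot = self.counts.entry(tag).or_insert(0);
        let prev = *slot;
        *slot = prev.saturating_add(u64::from(val));
        should_wake(prev, *slot, threshold)
    }

    /// Takes and resets the accumulated count; `None` when nothing is pending.
    fn try_take(&mut self, tag: u16) -> Option<u64> {
        match self.counts.remove(&tag) {
            Some(0) | None => None,
            some => some,
        }
    }
}
```

Between the first update and the threshold crossing, adds accumulate silently, which is where the reduction in wake traffic would come from.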

Validation:

- `cargo fmt`
- `cargo check --features uring-native`
- `cargo test --features uring-native --tests`

Benchmark gate snapshot (post-slices):

- `cargo bench --features uring-native --bench net_api -- net_keyed_hotspot_rotation_4k --sample-size 12`
  - `tokio_tcp_keyed_router_hotspot`: `9.3712-9.4256 ms`
  - `spargio_tcp_keyed_router_hotspot`: `9.5867-9.7558 ms`
- `cargo bench --features uring-native --bench net_api -- net_stream_hotspot_rotation_4k --sample-size 10`
  - `tokio_tcp_8streams_rotating_hotspot`: `8.7801-8.8376 ms`
  - `spargio_tcp_8streams_rotating_hotspot`: `9.3909-9.4505 ms`
  - `compio_tcp_8streams_rotating_hotspot`: `16.640-17.098 ms`
- `cargo bench --features uring-native --bench net_api -- net_pipeline_hotspot_rotation_4k_window32 --sample-size 10`
  - `tokio_tcp_pipeline_hotspot`: `26.380-26.482 ms`
  - `spargio_tcp_pipeline_hotspot`: `28.856-29.242 ms`
  - `compio_tcp_pipeline_hotspot`: `50.770-51.273 ms`

Outcome:

- Remaining planned slices for this keyed-hotspot track are now implemented.
- Spargio is now within roughly 2-4% of Tokio on keyed-hotspot in this harness
  (`9.5867-9.7558 ms` vs `9.3712-9.4256 ms`), with stable guardrails on other
  hotspot shapes.

## Update: keyed hotspot benchmark now includes Compio

Added `compio` variant to `net_keyed_hotspot_rotation_4k`:

- new bench case: `compio_tcp_keyed_router_hotspot`
- wired through `CompioNetCmd::EchoKeyedHotspot`, harness command handling, and
  `compio_echo_keyed_hotspot_rotation(...)`.

Sanity run (`--sample-size 10`):

- `tokio_tcp_keyed_router_hotspot`: `9.2799-9.3554 ms`
- `spargio_tcp_keyed_router_hotspot`: `9.5718-9.7460 ms`
- `compio_tcp_keyed_router_hotspot`: `16.652-16.712 ms`

## Update: full benchmark refresh + README sync (2026-02-27)

Ran the full benchmark suite with current `uring-native` implementation and
updated README benchmark tables/interpretation to match.

Commands:

- `cargo bench --features uring-native --bench ping_pong -- --sample-size 12`
- `cargo bench --features uring-native --bench fanout_fanin -- --sample-size 12`
- `cargo bench --features uring-native --bench fs_api -- --sample-size 12`
- `cargo bench --features uring-native --bench net_api -- --sample-size 12`

Snapshot:

- Coordination (Tokio vs Spargio):
  - `steady_ping_pong_rtt`: Tokio `1.4911-1.5024 ms`, Spargio `394.83-396.21 us`
  - `steady_one_way_send_drain`: Tokio `68.607-70.859 us`, Spargio `49.232-50.110 us`
  - `cold_start_ping_pong`: Tokio `553.31-561.83 us`, Spargio `284.23-287.50 us`
  - `fanout_fanin_balanced`: Tokio `1.4534-1.4631 ms`, Spargio `1.3426-1.3480 ms`
  - `fanout_fanin_skewed`: Tokio `2.4026-2.4220 ms`, Spargio `1.9979-2.0032 ms`

- Native API (Tokio vs Spargio vs Compio):
  - `fs_read_rtt_4k`: Tokio `1.6174-1.6565 ms`, Spargio `1.0008-1.0188 ms`, Compio `1.4782-1.4978 ms`
  - `fs_read_throughput_4k_qd32`: Tokio `7.8804-8.1672 ms`, Spargio `6.1570-6.2793 ms`, Compio `4.0877-5.0803 ms`
  - `net_echo_rtt_256b`: Tokio `7.7462-7.9687 ms`, Spargio `5.4356-5.5084 ms`, Compio `6.4541-6.5632 ms`
  - `net_stream_throughput_4k_window32`: Tokio `11.142-11.247 ms`, Spargio `10.745-10.813 ms`, Compio `7.0631-7.1570 ms`

- Imbalanced native API:
  - `net_stream_imbalanced_4k_hot1_light7`: Tokio `13.584-13.799 ms`, Spargio `13.191-13.375 ms`, Compio `12.283-12.414 ms`
  - `net_stream_hotspot_rotation_4k`: Tokio `8.7891-8.8560 ms`, Spargio `9.3683-9.4526 ms`, Compio `16.870-16.982 ms`
  - `net_pipeline_hotspot_rotation_4k_window32`: Tokio `26.415-26.654 ms`, Spargio `29.113-29.517 ms`, Compio `50.648-51.210 ms`
  - `net_keyed_hotspot_rotation_4k`: Tokio `9.3152-9.4912 ms`, Spargio `9.5691-9.7957 ms`, Compio `16.781-16.994 ms`

Interpretation updates reflected in README:

- Spargio retains clear lead on coordination-heavy and low-depth latency cases.
- Compio retains lead on sustained balanced stream throughput and static-hotspot imbalance.
- Tokio remains ahead in rotating-hotspot stream/pipeline; keyed routing is near parity.

## Note: do the network optimizations fit Spargio's value proposition?

Question:

- Do the network optimizations we added to close the Tokio gap actually make sense
  for Spargio, and are they realistic for users to adopt?

Answer:

- Yes, primarily when they reduce cross-shard coordination cost (coalesced hot
  tags, hot-counter fast path, adaptive wake policy, keyed ownership routing).
  These directly support Spargio's core value proposition: efficient
  `io_uring` + `msg_ring` work-stealing/steering under coordination-heavy load.
- These optimizations are most relevant for keyable/skewed multi-stream
  workloads (tenant/session/partition keyed routing), where steering and
  aggregation reduce dispatch overhead.
- They should remain opt-in tuning for advanced users. Default paths should
  stay simple and semantically conservative for applications that need
  per-message event fidelity and straightforward observability.

Follow-up planned:

- Add user-facing documentation for these knobs (what each knob does, semantic
  trade-offs, recommended workload shapes, and safe defaults), plus a short
  tuning guide in README/docs.

## Update: flaky `uring-native` CI test fixed (2026-02-28)

Observed:

- CI run `22511780569` failed at `Cargo test (uring-native)` with exit code 101.
- Failure was intermittent and initially non-reproducible on a single local run.

Root cause:

- `coalesced_hot_count_accumulates_across_batches` in `tests/runtime_tdd.rs` had
  a race in test logic.
- The receiver polled `try_take_hot_count(61)` in a loop and could consume the
  first coalesced update (`3`) before the second batch (`+3`) arrived, causing
  occasional `left: 3, right: 6`.

Fix:

- Made the test deterministic by introducing a non-coalesced barrier tag and
  waiting for a barrier event before reading the hot counter.
- Updated the test to assert total hot count only after both sends are known to
  have been delivered to the target shard.
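
The barrier idea can be shown with plain threads as an analogy. Nothing below uses the Spargio runtime: an atomic counter stands in for the hot counter, and a channel message stands in for the non-coalesced barrier event. The function name is hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Applies two counter updates followed by a barrier message, and reads the
/// counter only after the barrier has arrived.
fn total_after_barrier() -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let (barrier_tx, barrier_rx) = mpsc::channel::<()>();

    let producer_counter = Arc::clone(&counter);
    let producer = thread::spawn(move || {
        producer_counter.fetch_add(3, Ordering::SeqCst); // first batch
        producer_counter.fetch_add(3, Ordering::SeqCst); // second batch
        barrier_tx.send(()).unwrap(); // barrier: both updates are done
    });

    // Polling the counter in a loop here could observe 3 and take it early.
    // Waiting for the barrier first rules that interleaving out.
    barrier_rx.recv().unwrap();
    let total = counter.swap(0, Ordering::SeqCst);
    producer.join().unwrap();
    total
}
```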

Validation:

- `cargo test --features uring-native --test runtime_tdd coalesced_hot_count_accumulates_across_batches`
- 50x stress loop of that single test: all pass.
- `cargo test --features uring-native`: pass.

Outcome:

- Removed known flake in `uring-native` test suite.
- No runtime behavior change; this was a test synchronization fix.

## Update: Compio parity audit snapshot (2026-02-28)

Captured a focused feature-parity snapshot against current Compio docs and
our current public `spargio` surface, with emphasis on practical user-facing
gaps.

### I/O API breadth: present vs missing

Current Spargio public I/O surface:

- `fs`: `OpenOptions` + `File` with `open/create/from_std`, positional
  `read_at`/`read_at_into`/`write_at`/`write_all_at`, `read_to_end_at`, `fsync`.
- `net`: TCP-only (`TcpStream`, `TcpListener`) including session-policy connect/accept,
  owned buffer APIs, and multishot segment receive helpers.
- runtime-native unbound lane methods routed through `io_uring`.

Compared with Compio's documented surface, notable missing breadth in Spargio:

1. Filesystem path-level helpers and metadata APIs
   - examples: `create_dir`, `create_dir_all`, `hard_link`, `metadata`,
     `remove_dir`, `remove_file`, `rename`, `set_permissions`, `symlink`,
     `symlink_metadata`, convenience `read`/`write`.
2. Broader network protocol/socket families
   - UDP and Unix domain socket APIs (`UdpSocket`, `UnixListener`,
     `UnixStream`, `UnixDatagram`) are not currently in Spargio public API.
3. Generic async I/O trait/adaptor layer
   - no public Spargio equivalent to Compio `io` traits and adapters
     (`AsyncRead`/`AsyncWrite` families, buffered wrappers, compat/framed utilities).
4. Higher-level transport/runtime-integrated modules
   - no Spargio public modules corresponding to Compio optional
     `process`/`signal`/`tls`/`ws`/`quic` ecosystem crates.

This aligns with existing README scope note:

- "Broader filesystem and network native-op surface ... not done yet."

### Core runtime parity: what is still missing in Spargio

Core runtime is functional and differentiated (shards, placement APIs,
work-stealing MVP, timers, cancellation/task group, boundary APIs), but gaps
remain versus broader runtime ecosystems:

1. Backend/platform breadth
   - `BackendKind` is currently `IoUring` only.
2. Top-level `!Send` ergonomics
   - public runtime handle spawn paths require `Send`; `!Send` execution is
     currently available only via shard-local `ShardCtx::spawn_local(...)`.
3. Time/runtime utility breadth
   - currently minimal top-level primitives (`sleep`, `timeout`) rather than a
     fuller interval/deadline utility set.
4. Production hardening/tuning depth
   - advanced stealing policy tuning and long-window hardening/observability are
     still listed as pending in project docs.

Conclusion:

- Spargio currently has partial feature overlap with Compio for core
  fs/tcp runtime workflows, but does not yet have Compio-level I/O breadth.
- Current project direction remains valid: keep differentiating on
  cross-shard coordination + placement/stealing, while closing practical
  fs/net/runtime-surface gaps incrementally.

## Update: `!Send` ergonomics slice (`run_local_on` + `spawn_local_on`) (2026-02-28)

Captured and implemented the proposal discussed in review:

- add a first-class local-entry helper that can run `!Send` futures on a chosen shard.
- add a handle-level construct-on-shard API so callers can build `!Send` futures
  on target shard context without requiring a prior `ShardCtx` hop.

### Red phase

Added failing tests in `tests/runtime_tdd.rs`:

- `run_local_on_accepts_non_send_future`
- `runtime_handle_spawn_local_on_accepts_non_send_future`

Red failure signals:

- unresolved import: `spargio::run_local_on`
- missing method: `RuntimeHandle::spawn_local_on`

### Green phase

Implemented public APIs in `src/lib.rs`:

1. New top-level entry helper
   - `run_local_on(builder, shard, entry)`
   - signature accepts `entry: FnOnce(ShardCtx) -> Fut + Send`, with `Fut: Future + 'static`
     (no `Send` bound on `Fut`), and `T: Send`, where `T` is the output type of `Fut`.
2. New runtime-handle API
   - `RuntimeHandle::spawn_local_on(shard, init)`
   - same construct-on-shard shape and `!Send` future support.
3. Internal spawn path
   - added `spawn_local_on_shared(...)`.
   - implementation routes through existing shard command channel (`Command::Spawn`)
     and, on the target shard, constructs the future using live `ShardCtx`,
     then executes it via `ctx.spawn_local(...)`.

Design notes:

- No new scheduler lane or command type was required.
- `!Send` is enabled by constructing the future on the shard and running it via
  local spawner; cross-thread transfer only carries the `Send` initializer closure.
- Return type remains `JoinHandle<T>` with `T: Send` for cross-thread join safety.
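
The construct-on-target pattern can be sketched with a plain thread standing in for a shard. All names here (`spawn_local_on_thread`, `block_on`, `ThreadWaker`, `sum_with_rc_state`) are hypothetical, and the tiny executor exists only to make the sketch self-contained. The point is the bounds: only the initializer and the output carry `Send`, while the future does not.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, JoinHandle, Thread};

/// Waker that unparks the thread running `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future executor, used only by this sketch.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Only `init` and the output cross threads; `Fut` itself has no `Send` bound.
fn spawn_local_on_thread<I, Fut, T>(init: I) -> JoinHandle<T>
where
    I: FnOnce() -> Fut + Send + 'static,
    Fut: Future<Output = T> + 'static,
    T: Send + 'static,
{
    // The future is constructed and driven on the target thread.
    thread::spawn(move || block_on(init()))
}

/// Builds a future that holds `Rc<RefCell<_>>`, which is not `Send`.
fn sum_with_rc_state() -> u32 {
    let handle = spawn_local_on_thread(|| {
        let state = Rc::new(RefCell::new(Vec::new()));
        async move {
            state.borrow_mut().extend(vec![1u32, 2, 3]);
            let total: u32 = state.borrow().iter().sum();
            total
        }
    });
    handle.join().unwrap()
}
```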

### Validation

Commands run:

- `cargo test --features uring-native --test runtime_tdd run_local_on_accepts_non_send_future`
- `cargo test --features uring-native --test runtime_tdd runtime_handle_spawn_local_on_accepts_non_send_future`
- `cargo test --features uring-native --test runtime_tdd`

Result:

- both new tests pass.
- full `runtime_tdd` suite passes (`24 passed`).

### Outcome

- Spargio now supports a direct top-level local entry and handle-level local
  spawn path for `!Send` futures, reducing friction for shard-local state
  patterns (`Rc`, `RefCell`, etc.) while preserving existing shard-safety model.

## Update: low-level unsafe native extension API slice (2026-02-28)

Recorded proposal and implemented it in this slice:

- add a low-level unsafe extension lane so external crates can submit custom
  SQE/CQE workflows without editing Spargio core for each new operation.
- keep high-level fs/net APIs safe and unchanged; isolate risk in explicit
  unsafe extension entry points.

### Red phase

Added new tests in `tests/uring_native_tdd.rs` for extension use-cases:

- `uring_native_unbound_unsafe_extension_supports_custom_nop`
- `uring_native_unbound_unsafe_extension_supports_custom_read_entry`

These encode the intended external-writer workflow:

- provide extension-owned state
- build a custom SQE from that state
- decode CQE into a typed result
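
That three-step shape can be sketched without touching io_uring. The `ExtensionOp` trait, `ReadOp`, and the `describe` stand-in for SQE construction are hypothetical and do not reflect the actual `submit_unsafe` signature. `Cqe` mirrors the `result`/`flags` shape with assumed field types. The one grounded detail is that io_uring reports failures as a negated errno in the CQE result.

```rust
use std::io;

/// Completion record with a result code and flags; field types are assumed.
#[derive(Debug, Clone, Copy)]
struct Cqe {
    result: i32,
    flags: u32,
}

/// Hypothetical extension contract: own state, describe an op, decode the completion.
trait ExtensionOp {
    type Output;

    /// Stand-in for "build a custom SQE": returns an opcode label only.
    fn describe(&self) -> &'static str;

    /// Consumes the op state and turns the raw completion into a typed result.
    fn decode(self, cqe: Cqe) -> io::Result<Self::Output>;
}

/// Read-like op that owns its buffer until completion.
struct ReadOp {
    buf: Vec<u8>,
}

impl ExtensionOp for ReadOp {
    type Output = Vec<u8>;

    fn describe(&self) -> &'static str {
        "read"
    }

    fn decode(mut self, cqe: Cqe) -> io::Result<Vec<u8>> {
        // A negative result is a negated errno value.
        if cqe.result < 0 {
            return Err(io::Error::from_raw_os_error(-cqe.result));
        }
        // A non-negative result is the number of bytes read.
        self.buf.truncate(cqe.result as usize);
        Ok(self.buf)
    }
}
```

Keeping the buffer inside the op state until `decode` runs matches the requirement that extension-owned state stays alive until the completion arrives.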

### Green phase

Implemented low-level unsafe API on `UringNativeAny`:

- `unsafe submit_unsafe(...)`
- `unsafe submit_unsafe_on_shard(...)`

Added new public completion type:

- `UringCqe { result, flags }`

Internal runtime wiring added:

- new internal native command variant carrying extension op envelopes
- extension op envelope retained in runtime until completion
- SQE built on target shard, user data overridden by runtime tracking key
- completion/failure paths return typed result through oneshot
- dispatch integrated with existing fast path / envelope path and affinity
  violation guardrails

### Validation

Commands run:

- `cargo test --features uring-native --test uring_native_tdd uring_native_unbound_unsafe_extension_supports_custom_nop`
- `cargo test --features uring-native --test uring_native_tdd uring_native_unbound_unsafe_extension_supports_custom_read_entry`
- `cargo test --features uring-native --test runtime_tdd --test uring_native_tdd`

Result:

- new unsafe-extension tests pass.
- full `runtime_tdd` and `uring_native_tdd` suites pass.

### Docs sync

README updated to reflect completed status:

- added done bullets for:
  - `!Send` ergonomics (`run_local_on`, `RuntimeHandle::spawn_local_on`)
  - low-level unsafe extension API (`UringNativeAny::{submit_unsafe, submit_unsafe_on_shard}`)
- reviewed done/not-done sections and adjusted wording:
  - "broader built-in fs/net surface" remains not done
  - added safe-wrapper/cookbook work for unsafe extension API to not-done backlog

## Update: time/runtime utility parity comparison (Compio + monoio, io_uring fit adjusted) (2026-02-28)

Revised the time/runtime parity recommendations to account for whether each gap
is:

- `Direct io_uring`: maps directly to io_uring operations.
- `Hybrid`: io_uring covers the wait/I/O path, while policy/scheduling/control
  remains user-space runtime logic.
- `Not io_uring-native`: mostly scheduler/context/ergonomics API surface above
  kernel I/O.

Context:

- This section is scoped to time/runtime utility APIs (not broader fs/net API
  breadth).
- Spargio today already has: `sleep`, `timeout`, `run`, `run_with`,
  `run_local_on`, `spawn_local_on`, cancellation token, and task group support.

### Compio parity gaps (time/runtime utility scope), io_uring fit, and recommendation

1. Absolute-deadline and interval timer APIs
   - Missing in Spargio:
     - `sleep_until`
     - `timeout_at`
     - `interval` / `interval_at`
     - `Interval::tick`
   - io_uring fit:
     - `Direct io_uring`:
       - `sleep_until` via timeout op on the native lane.
     - `Hybrid`:
       - `timeout_at` as composition over deadline timer + future race.
       - interval/tick as runtime policy on top of timer primitives.
   - Recommendation:
     - Add.
   - Priority:
     - High.
   - Rationale:
     - Strong functional value and clear alignment with io_uring timer path.

2. Rich timer object controls
   - Missing in Spargio:
     - resettable/introspectable timer object shape (`deadline`/`reset`/
       elapsed-style helpers).
   - io_uring fit:
     - `Hybrid` / mostly `Not io_uring-native` (API ergonomics and runtime timer
       bookkeeping over timer ops).
   - Recommendation:
     - Add a minimal version later.
   - Priority:
     - Medium.
   - Rationale:
     - Useful, but secondary to shipping base deadline/interval primitives.

3. `spawn_blocking` bridge
   - Missing in Spargio:
     - explicit runtime blocking bridge API.
   - io_uring fit:
     - `Not io_uring-native` (thread-pool/runtime policy feature).
   - Recommendation:
     - Add with strict bounds and opt-in behavior.
   - Priority:
     - Medium-high.
   - Rationale:
     - Operationally important escape hatch, but not part of io_uring data path.

4. Runtime control surface (`run`/`poll`/`poll_with`/`current_timeout`)
   - Missing in Spargio:
     - explicit low-level runtime control API set comparable to Compio.
   - io_uring fit:
     - `Hybrid`:
       - polling/timeout plumbing can map to io_uring waits, but API shape is
         mostly scheduler-control surface.
   - Recommendation:
     - Do not add a full stable parity surface now; keep it internal or for debugging use.
   - Priority:
     - Low.
   - Rationale:
     - Limited end-user value and higher misuse/maintenance risk.

5. Runtime context API (`enter`/current-runtime access)
   - Missing in Spargio:
     - explicit public context-enter/current-runtime model.
   - io_uring fit:
     - `Not io_uring-native` (TLS/context ergonomics).
   - Recommendation:
     - Defer.
   - Priority:
     - Low-medium.
   - Rationale:
     - Useful only for narrower extension patterns; easy to misuse if overexposed.

6. `attach(fd)`-style extension-author hook
   - Missing in Spargio:
     - public attach hook for custom high-level wrappers.
   - io_uring fit:
     - `Hybrid`:
       - could map to registration/fixed-file strategy, but behavior and benefit
         are workload-dependent.
   - Recommendation:
     - Defer for now.
   - Priority:
     - Low.
   - Rationale:
     - The unsafe extension path already exists; add attach semantics only if measured
       wrapper use-cases require it.

7. Builder knobs (`thread_affinity`, scheduler `event_interval`)
   - Missing in Spargio:
     - explicit builder options matching Compio naming/shape.
   - io_uring fit:
     - `Not io_uring-native` (scheduler/thread policy).
   - Recommendation:
     - Partial add, benchmark-gated.
   - Priority:
     - Medium.
   - Rationale:
     - Can help production tuning, but belongs to controlled runtime policy work.
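Item 1 describes `timeout_at` as a composition of a deadline timer and a future race. A minimal std-only sketch of that idea follows. It is not Spargio's implementation: `TimeoutAt`, `Elapsed`, and the busy-polling `block_on` are hypothetical, and the deadline is checked with `Instant::now()` where a real runtime would arm an io_uring timeout op and be woken by it.

```rust
use std::future::{pending, ready, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub struct Elapsed;

pub struct TimeoutAt<F> {
    deadline: Instant,
    fut: Pin<Box<F>>,
}

pub fn timeout_at<F: Future>(deadline: Instant, fut: F) -> TimeoutAt<F> {
    TimeoutAt { deadline, fut: Box::pin(fut) }
}

impl<F: Future> Future for TimeoutAt<F> {
    type Output = Result<F::Output, Elapsed>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Race: the inner future is polled first, so completion wins ties.
        if let Poll::Ready(value) = self.fut.as_mut().poll(cx) {
            return Poll::Ready(Ok(value));
        }
        if Instant::now() >= self.deadline {
            return Poll::Ready(Err(Elapsed));
        }
        // Stand-in for the timer wakeup a real runtime would deliver.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny polling executor, only here to make the sketch runnable.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        sleep(Duration::from_millis(1));
    }
}

/// Runs both outcomes: a future that is ready at once and one that never is.
pub fn demo() -> (Result<u32, Elapsed>, Result<(), Elapsed>) {
    let far = Instant::now() + Duration::from_secs(5);
    let fast = block_on(timeout_at(far, ready(7u32)));
    let near = Instant::now() + Duration::from_millis(20);
    let slow = block_on(timeout_at(near, pending::<()>()));
    (fast, slow)
}
```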

### monoio parity gaps (time/runtime utility scope), io_uring fit, and recommendation

1. Absolute-deadline and interval timer APIs
   - Missing in Spargio:
     - `sleep_until`
     - `timeout_at`
     - `interval` / `interval_at`
     - `Interval::tick`
   - io_uring fit:
     - same split as Compio analysis: direct timer op base + hybrid interval
       policy layer.
   - Recommendation:
     - Add.
   - Priority:
     - High.
   - Rationale:
     - Core utility breadth with direct io_uring timer alignment.

2. Interval policy controls (`MissedTickBehavior`, interval metadata)
   - Missing in Spargio:
     - missed-tick policy controls and period inspection API.
   - io_uring fit:
     - `Not io_uring-native` (runtime policy semantics).
   - Recommendation:
     - Add later (after base interval API).
   - Priority:
     - Medium.
   - Rationale:
     - Valuable for precision semantics, but not required for first parity slice.

3. Resettable/introspectable `Sleep` object
   - Missing in Spargio:
     - `Sleep`-style object with `deadline` / `is_elapsed` / `reset`.
   - io_uring fit:
     - `Hybrid`:
       - backed by timeout ops, but object semantics are runtime/user-space layer.
   - Recommendation:
     - Add later (minimal form).
   - Priority:
     - Medium.
   - Rationale:
     - Power-user utility; should follow stable base timer/deadline APIs.

4. `spawn_blocking` + blocking runtime configuration
   - Missing in Spargio:
     - blocking bridge and policy knobs.
   - io_uring fit:
     - `Not io_uring-native`.
   - Recommendation:
     - Add with constrained configuration.
   - Priority:
     - Medium-high.
   - Rationale:
     - Important operational bridge, but separate from io_uring core mechanics.
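The missed-tick policy in item 2 is pure user-space arithmetic, which is why it is classified as not io_uring-native. The sketch below models the three variants on the semantics Tokio documents for its type of the same name; whether Spargio matches them exactly is an assumption. `next_deadline` is a hypothetical helper, and times are plain milliseconds to keep the example deterministic.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MissedTickBehavior {
    /// Keep the original schedule; late ticks fire back to back until caught up.
    Burst,
    /// Restart the period from the moment the late tick was observed.
    Delay,
    /// Drop missed ticks and realign to the next multiple of the period.
    Skip,
}

/// Computes the deadline that follows a tick scheduled at `scheduled` and
/// observed at `now` (both in milliseconds since an arbitrary start).
pub fn next_deadline(
    behavior: MissedTickBehavior,
    scheduled: u64,
    now: u64,
    period: u64,
) -> u64 {
    assert!(period > 0, "period must be non-zero");
    // A tick cannot be observed before it was scheduled.
    let observed = now.max(scheduled);
    let late_by = observed - scheduled;
    match behavior {
        MissedTickBehavior::Burst => scheduled + period,
        MissedTickBehavior::Delay => observed + period,
        MissedTickBehavior::Skip => observed + period - (late_by % period),
    }
}
```

For a tick scheduled at 100 ms, observed at 250 ms, with a 100 ms period, the three policies give 200, 350, and 300 ms respectively. When the tick is on time, all three agree on `scheduled + period`.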

### Net decision summary (io_uring-aware)

Add now (direct io_uring base + essential hybrid policy):

- `sleep_until`
- `timeout_at`
- `interval` / `interval_at` / `tick` (minimal first version)

Add next (important, mostly non-kernel policy/runtime features):

- `spawn_blocking` with bounded/opt-in policy
- limited affinity tuning in builder

Add later (power-user timer ergonomics):

- interval missed-tick behavior controls
- resettable/introspectable timer object (`Sleep`-style surface)

Defer/avoid for now:

- broad public low-level runtime polling/control API parity
- explicit runtime context enter/current-runtime API
- `attach(fd)` hook unless concrete, benchmark-backed wrapper demand emerges
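As an illustration of what a bounded, opt-in `spawn_blocking` policy could look like, here is a small std-only sketch. `BlockingBridge`, `BlockingError`, and the reject-when-full policy are assumptions for illustration; the log does not specify the bound or the return type at this point, and a `Receiver` stands in for a join handle.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::Arc;
use std::thread;

#[derive(Debug, PartialEq, Eq)]
pub enum BlockingError {
    /// The configured number of concurrent blocking tasks is already running.
    Saturated,
}

/// Releases one slot when dropped, including when the closure panics.
struct Slot(Arc<AtomicUsize>);

impl Drop for Slot {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::AcqRel);
    }
}

pub struct BlockingBridge {
    active: Arc<AtomicUsize>,
    max: usize,
}

impl BlockingBridge {
    pub fn new(max: usize) -> Self {
        BlockingBridge { active: Arc::new(AtomicUsize::new(0)), max }
    }

    /// Runs `f` on a helper thread if a slot is free; otherwise rejects it.
    pub fn spawn_blocking<T, F>(&self, f: F) -> Result<Receiver<T>, BlockingError>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        // Reserve a slot before spawning so the bound is never exceeded.
        let mut current = self.active.load(Ordering::Acquire);
        loop {
            if current >= self.max {
                return Err(BlockingError::Saturated);
            }
            match self.active.compare_exchange(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => break,
                Err(actual) => current = actual,
            }
        }
        let slot = Slot(Arc::clone(&self.active));
        let (tx, rx) = sync_channel(1);
        thread::spawn(move || {
            let value = f();
            drop(slot); // free the slot before publishing the result
            let _ = tx.send(value);
        });
        Ok(rx)
    }
}
```

Rejecting instead of queueing keeps the bound strict and makes saturation visible to callers; a queueing variant is an equally plausible policy.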

## Update: I/O surface parity comparison (Compio + monoio, io_uring fit adjusted) (2026-02-28)

Revised the I/O parity recommendations to explicitly account for whether each
gap is:

- `Direct io_uring`: has a direct opcode path in the current `io-uring` crate.
- `Hybrid`: hot path can use io_uring, but setup/orchestration still uses
  regular syscalls or user-space composition.
- `Not io_uring-native`: mostly trait/adaptor/protocol surface above kernel I/O.

Context:

- This section is scoped to I/O API surface (fs/net/io traits/utilities), not
  timer/runtime utilities.
- Spargio today has:
  - `fs::File` + `OpenOptions` and positional file ops (`read_at`, `write_at`,
    `read_to_end_at`, `fsync`).
  - `net::TcpStream` and `net::TcpListener` (session-policy aware APIs).
  - unbound unsafe extension lane for custom raw io_uring operations.

### Compio parity gaps (I/O surface scope), io_uring fit, and recommendation

1. Filesystem path-level helpers and metadata/perms utility breadth
   - Missing in Spargio:
     - path-level helpers like `create_dir`, `create_dir_all`, `remove_file`,
       `remove_dir`, `rename`, convenience `read`/`write`, and broader metadata/
       permissions/symlink/hard-link helpers.
   - io_uring fit:
     - `Direct io_uring` candidates:
       - `create_dir` (`MkDirAt`)
       - `remove_file` / `remove_dir` (`UnlinkAt`)
       - `rename` (`RenameAt`)
       - metadata (`Statx`)
       - symlink/hard-link (`SymlinkAt` / `LinkAt`)
       - convenience `read`/`write` composed from `OpenAt/OpenAt2 + Read/Write + Close`
     - `Hybrid` candidates:
       - `create_dir_all` (userspace recursion + repeated mkdir op)
       - richer convenience wrappers (`read_to_string`, recursive utilities)
       - some permissions/canonicalization helpers that may require syscall or
         userspace fallback paths depending on kernel support
   - Recommendation:
     - Add now for direct-op helpers.
     - Add later for hybrid helpers.
   - Priority:
     - High for direct helpers; Medium for hybrid helpers.
   - Rationale:
     - This adds high-utility API breadth while staying aligned with Spargio's
       io_uring-first performance model.

2. Network protocol/socket family breadth
   - Missing in Spargio:
     - `UdpSocket`
     - Unix domain sockets (`UnixStream`, `UnixListener`, `UnixDatagram`)
   - io_uring fit:
     - `Direct io_uring` hot path:
       - `Socket`, `Accept`, `Connect`, `Send`, `Recv`, `SendMsg`, `RecvMsg`,
         `Shutdown`
     - `Hybrid` setup/control path:
       - socket options, bind/listen, DNS/address resolution, feature probing
   - Recommendation:
     - Add.
   - Priority:
     - High for UDP; Medium-high for Unix sockets.
   - Rationale:
     - Strong fit for io_uring data path and large practical adoption win beyond
       TCP-only coverage.

3. Generic async I/O trait + adapter layer
   - Missing in Spargio:
     - Compio-style traits/extensions (`AsyncRead*` / `AsyncWrite*`) and common
       adapters/utilities (`split`, buffered wrappers, framing/compat layers).
   - io_uring fit:
     - `Not io_uring-native` (user-space abstraction layer).
   - Recommendation:
     - Add, but as companion crate(s), not in core runtime crate.
   - Priority:
     - Medium.
   - Rationale:
     - Important ergonomics/interoperability value, but no kernel-path
       differentiation and substantial maintenance surface.

4. Optional higher-level transport/integration modules
   - Missing in Spargio:
     - Compio optional module breadth (`process`, `signal`, `tls`, `ws`, `quic`).
   - io_uring fit:
     - Mostly `Not io_uring-native` as runtime-level feature sets; some pieces
       may use io_uring underneath but are not core io_uring API-surface gaps.
   - Recommendation:
     - Defer in core; pursue as ecosystem crates after core fs/net/io parity
       baseline is complete.
   - Priority:
     - Low.
   - Rationale:
     - Broad scope with weaker direct alignment to immediate io_uring runtime
       differentiation.

### monoio parity gaps (I/O surface scope), io_uring fit, and recommendation

1. Filesystem path-level helper breadth
   - Missing in Spargio:
     - monoio-style helpers (`read`, `write`, `create_dir`, `create_dir_all`,
       `remove_file`, `remove_dir`, `rename`) and metadata conveniences.
   - io_uring fit:
     - same split as above: direct-op coverage for core helpers, hybrid for
       recursive/convenience wrappers.
   - Recommendation:
     - Add direct-op helpers now; phase in hybrid helpers later.
   - Priority:
     - High for direct-op helpers; Medium for hybrid helpers.
   - Rationale:
     - Baseline parity and migration ergonomics with strong io_uring alignment.

2. Network breadth beyond TCP
   - Missing in Spargio:
     - `UdpSocket`
     - Unix domain socket APIs.
   - io_uring fit:
     - direct-op hot path with hybrid setup path, same as Compio analysis.
   - Recommendation:
     - Add.
   - Priority:
     - High for UDP; Medium-high for Unix sockets.
   - Rationale:
     - Real-world protocol coverage with clear io_uring throughput/latency fit.

3. I/O utility stack (traits + utility wrappers)
   - Missing in Spargio:
     - monoio-style utility stack (`copy`, split halves, buffered wrappers,
       stream/sink adapters, cancelable helpers, zero-copy utility wrappers).
   - io_uring fit:
     - mostly `Not io_uring-native` (API composition layer).
   - Recommendation:
     - Add a practical subset after core direct-op I/O breadth lands; keep larger
       utility surface outside core crate.
   - Priority:
     - Medium.
   - Rationale:
     - Good ergonomics payoff, but should follow direct io_uring-aligned API
       expansion.

### Net decision summary (io_uring-aware)

Add now (direct io_uring or low-risk hybrid):

- path-level fs helpers that map cleanly to io_uring opcodes
  (`create_dir`, `remove_file`, `remove_dir`, `rename`, metadata, basic `read`/`write`)
- UDP socket API

Add next (hybrid or non-kernel surface with strong usability gain):

- Unix domain socket API
- foundational I/O trait/extensions and core helpers (`split`, `copy`) in
  companion crate(s)

Add later (mostly composition layers):

- recursive/richer fs convenience helpers (`create_dir_all`, broader wrappers)
- richer buffered/framed/compat layers

Defer/avoid in core for now:

- large optional integration surfaces (`process`, `signal`, `tls`, `ws`, `quic`)
  until core io_uring-aligned fs/net parity goals are met

## Update: parity execution sweep (time/runtime + I/O breadth) with red/green TDD (2026-02-28)

Executed the requested implementation sweep for all previously marked
`add now`, `add next`, and `add later` items in the time/runtime and I/O parity
sections, then validated with a full `uring-native` test pass.

### Red phase

Added failing tests first:

1. Time/runtime primitives (`tests/primitives_tdd.rs`)
   - `sleep_until_waits_for_deadline`
   - `timeout_at_returns_err_when_deadline_expires`
   - `interval_ticks_with_configurable_missed_tick_behavior`
   - `interval_at_uses_requested_start_deadline`
   - `sleep_object_supports_deadline_reset_and_elapsed_state`
   - `runtime_handle_spawn_blocking_executes_closure`

2. Runtime builder tuning (`tests/runtime_tdd.rs`)
   - `runtime_builder_thread_affinity_option_builds_runtime`

3. I/O breadth (`tests/ergonomics_tdd.rs`)
   - `fs_path_helpers_cover_common_workflows`
   - `fs_link_helpers_support_symlink_and_hard_link`
   - `net_udp_socket_supports_send_recv_and_send_to_recv_from`
   - `net_unix_stream_listener_and_datagram_cover_core_paths`
   - `io_helpers_split_copy_and_framed_work`

Red failures were expected:

- unresolved time/runtime symbols (`sleep_until`, `timeout_at`, `interval*`,
  `Sleep`, `MissedTickBehavior`, `spawn_blocking`, `thread_affinity`).
- unresolved I/O symbols (`fs` path helpers, `UdpSocket`, `Unix*`, `io` module).

### Green phase

Implemented in `src/lib.rs`:

1. Time/runtime utility breadth
   - Added:
     - `sleep_until(Instant)`
     - `timeout_at(Instant, fut)`
     - `Sleep` (`new`, `until`, `deadline`, `is_elapsed`, `reset`, `Future`)
     - `interval(period)`, `interval_at(start, period)`
     - `Interval::tick`, `Interval::period`,
       `Interval::{missed_tick_behavior,set_missed_tick_behavior}`
     - `MissedTickBehavior::{Burst, Delay, Skip}`

2. Runtime utilities/tuning
   - Added `RuntimeHandle::spawn_blocking(...) -> Result<JoinHandle<_>, RuntimeError>`.
   - Added `RuntimeBuilder::thread_affinity(...)`.
   - Wired per-shard thread affinity application during shard thread startup
     (best-effort, Linux `sched_setaffinity`).

3. Filesystem API breadth
   - Added path-level async helpers in `spargio::fs`:
     - `create_dir`, `create_dir_all`, `remove_file`, `remove_dir`, `rename`
     - `hard_link`, `symlink`
     - `metadata`, `symlink_metadata`, `set_permissions`, `canonicalize`
     - convenience `read`, `read_to_string`, `write`
   - Added internal blocking bridge helper in fs module using
     `RuntimeHandle::spawn_blocking`.

4. Network API breadth
   - Added `spargio::net::UdpSocket`:
     - `bind`, `from_std`, `local_addr`, `connect`
     - `send`, `recv`, `send_to`, `recv_from`
   - Added `spargio::net::UnixStream`:
     - `connect`, `connect_with_session_policy`, `from_std`
     - `send`/`recv`, owned buffer variants, `write_all`/`read_exact`
   - Added `spargio::net::UnixListener`:
     - `bind`, `from_std`, `local_addr`, `accept`
   - Added `spargio::net::UnixDatagram`:
     - `bind`, `from_std`, `local_addr`, `connect`
     - `send`, `recv`, `send_to`, `recv_from`

5. Foundational I/O utility layer
   - Added `spargio::io` module:
     - traits: `AsyncRead`, `AsyncWrite` + extension traits
     - `split(...)` with `ReadHalf` / `WriteHalf`
     - `copy_to_vec(...)`
     - lightweight wrappers: `BufReader`, `BufWriter`
     - framed helper: `io::framed::LengthDelimited::{new, write_frame, read_frame}`
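To illustrate the framed helper from item 5, here is a synchronous sketch of length-delimited framing over `std::io`. The wire format (4-byte big-endian length prefix), the 1 MiB frame cap, and `into_inner` are assumptions; the log names only `new`, `write_frame`, and `read_frame`, and the real type is async.

```rust
use std::io::{self, Read, Write};

/// Hypothetical synchronous analogue of `io::framed::LengthDelimited`.
pub struct LengthDelimited<T> {
    inner: T,
    max_frame: usize,
}

impl<T> LengthDelimited<T> {
    pub fn new(inner: T) -> Self {
        LengthDelimited { inner, max_frame: 1 << 20 }
    }

    pub fn into_inner(self) -> T {
        self.inner
    }
}

impl<T: Write> LengthDelimited<T> {
    /// Writes the payload length as a big-endian u32, then the payload.
    pub fn write_frame(&mut self, payload: &[u8]) -> io::Result<()> {
        if payload.len() > self.max_frame {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
        }
        let len = payload.len() as u32; // fits: max_frame is far below u32::MAX
        self.inner.write_all(&len.to_be_bytes())?;
        self.inner.write_all(payload)
    }
}

impl<T: Read> LengthDelimited<T> {
    /// Reads one frame; fails with `UnexpectedEof` if the stream ends early.
    pub fn read_frame(&mut self) -> io::Result<Vec<u8>> {
        let mut header = [0u8; 4];
        self.inner.read_exact(&mut header)?;
        let len = u32::from_be_bytes(header) as usize;
        // Reject oversized lengths before allocating the payload buffer.
        if len > self.max_frame {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
        }
        let mut payload = vec![0u8; len];
        self.inner.read_exact(&mut payload)?;
        Ok(payload)
    }
}
```

Checking the length against a cap before allocating is what keeps a corrupt or hostile header from triggering a multi-gigabyte allocation.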

### Validation

Executed and passing:

- `cargo test --features uring-native --test primitives_tdd`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native --test runtime_tdd --test uring_native_tdd`
- `cargo test --features uring-native`

Result:

- full `uring-native` test suite passes after the parity sweep.

## Proposal: syscall migration to io_uring for fs path helpers (2026-02-28)

Goal:

- Remove remaining helper-thread `spawn_blocking(std::fs::...)` usage from the
  high-value `spargio::fs` path APIs where direct io_uring opcodes exist.
- Keep low-value/hard cases as compatibility paths for now.

Proposed execution model:

1. Add direct unbound native commands + opcodes for path operations:
   - `MkDirAt` (`create_dir`)
   - `UnlinkAt` (`remove_file`, `remove_dir` via `AT_REMOVEDIR`)
   - `RenameAt` (`rename`)
   - `LinkAt` (`hard_link`)
   - `SymlinkAt` (`symlink`)
2. Migrate corresponding `spargio::fs` helpers to native io_uring submission.
3. Keep these deferred as compatibility wrappers:
   - `create_dir_all`:
     - recursive user-space semantics and matching error behavior require extra
       traversal/orchestration logic; this is not a single direct-opcode operation.
   - `canonicalize`:
     - path-resolution semantics are better handled by libc/kernel resolver
       paths; no direct single-op parity target in current surface.
   - `metadata`, `symlink_metadata`, `set_permissions`:
     - current public return/argument types are std wrappers
       (`std::fs::Metadata` / `Permissions`), which are not directly constructible
       from raw `statx` payloads without additional compatibility syscall layers.
4. Keep red/green TDD workflow:
   - add failing native fs-op tests first,
   - implement op plumbing + fs helper migration,
   - run targeted tests then full `cargo test --features uring-native`.

Acceptance criteria:

- No helper-thread path for: `create_dir`, `remove_file`, `remove_dir`,
  `rename`, `hard_link`, `symlink`.
- Deferred items remain clearly documented as compatibility paths.
- Full `uring-native` test suite remains green.

## Update: syscall migration to io_uring (fs path helpers) implemented (Red/Green TDD) (2026-02-28)

Implemented the proposal slice for direct-op fs path helpers, with explicit
kernel-support fallback behavior for unsupported opcode errors.

### Red phase

Added failing tests first in `tests/uring_native_tdd.rs`:

- `uring_native_unbound_fs_path_ops_cover_mkdir_rename_link_symlink_and_unlink`

Observed expected red failure:

- compile errors for missing `UringNativeAny` methods:
  - `mkdir_at`
  - `unlink_at`
  - `rename_at`
  - `link_at`
  - `symlink_at`

### Green phase

Implemented native io_uring path-op helpers on `UringNativeAny` (in `src/lib.rs`)
using the existing unsafe extension submission lane internally:

- `mkdir_at(path, mode)` -> `opcode::MkDirAt`
- `unlink_at(path, is_dir)` -> `opcode::UnlinkAt` (+ `AT_REMOVEDIR` for dirs)
- `rename_at(from, to)` -> `opcode::RenameAt`
- `link_at(original, link)` -> `opcode::LinkAt`
- `symlink_at(target, linkpath)` -> `opcode::SymlinkAt`
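The argument preparation behind `unlink_at(path, is_dir)` can be sketched as follows. `UnlinkRequest` and `unlink_request` are hypothetical names, the use of `AT_FDCWD` as the directory fd is an assumption, and the constants are the Linux values. The example is Unix-only because it reads the path's raw bytes.

```rust
use std::ffi::CString;
use std::io;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Linux value: resolve relative paths against the current working directory.
pub const AT_FDCWD: i32 = -100;
/// Linux value: make unlinkat behave like rmdir.
pub const AT_REMOVEDIR: i32 = 0x200;

/// Hypothetical owned argument bundle. The `CString` must outlive the
/// submission because the SQE only stores a pointer to it.
#[derive(Debug)]
pub struct UnlinkRequest {
    pub dirfd: i32,
    pub path: CString,
    pub flags: i32,
}

pub fn unlink_request(path: &Path, is_dir: bool) -> io::Result<UnlinkRequest> {
    // Kernel paths are NUL-terminated, so interior NUL bytes are rejected here.
    let path = CString::new(path.as_os_str().as_bytes()).map_err(|_| {
        io::Error::new(io::ErrorKind::InvalidInput, "path contains an interior NUL byte")
    })?;
    let flags = if is_dir { AT_REMOVEDIR } else { 0 };
    Ok(UnlinkRequest { dirfd: AT_FDCWD, path, flags })
}
```

This also shows why one opcode serves both `remove_file` and `remove_dir`: the only difference is the flag.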

Then migrated high-level `spargio::fs` helpers to these native operations:

- `create_dir`
- `remove_file`
- `remove_dir`
- `rename`
- `hard_link`
- `symlink`

Compatibility behavior kept intentionally:

- For unsupported opcode errors (`EINVAL`, `ENOSYS`, `EOPNOTSUPP`), the above
  high-level helpers transparently fall back to prior blocking helper-thread
  implementations to preserve functionality on older kernels.
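The fallback rule above amounts to a small error classifier plus a retry. A minimal sketch, assuming hypothetical helper names `is_unsupported_opcode` and `with_fallback` and the Linux errno values for the three codes listed:

```rust
use std::io;

// Linux errno values for the codes named above.
const EINVAL: i32 = 22;
const ENOSYS: i32 = 38;
const EOPNOTSUPP: i32 = 95;

/// True when the error looks like "this kernel does not support the opcode".
pub fn is_unsupported_opcode(err: &io::Error) -> bool {
    matches!(
        err.raw_os_error(),
        Some(EINVAL) | Some(ENOSYS) | Some(EOPNOTSUPP)
    )
}

/// Tries the native path first and runs the blocking path only when the
/// native attempt failed with an unsupported-opcode error.
pub fn with_fallback<T>(
    native: impl FnOnce() -> io::Result<T>,
    blocking: impl FnOnce() -> io::Result<T>,
) -> io::Result<T> {
    match native() {
        Err(err) => {
            if is_unsupported_opcode(&err) {
                blocking()
            } else {
                Err(err)
            }
        }
        ok => ok,
    }
}
```

One consequence worth keeping in mind: `EINVAL` can also signal genuinely invalid arguments, in which case the blocking path runs once more and reports its own error.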

Deferred (unchanged, by proposal):

- `create_dir_all`
- `canonicalize`
- `metadata`
- `symlink_metadata`
- `set_permissions`

### Validation

Executed and passing:

- `cargo test --features uring-native --test uring_native_tdd`
- `cargo test --features uring-native --test ergonomics_tdd`
- `cargo test --features uring-native`