spargio 0.5.13

Work-stealing async runtime for Rust built on io_uring and msg_ring

spargio is a work-stealing io_uring-based async runtime for Rust, using msg_ring for cross-thread coordination.

Instead of a strict thread-per-core/share-nothing execution model like other io_uring runtimes (glommio/monoio/compio and tokio_uring), spargio uses hybrid scheduling: submission-time placement of stealable tasks, plus execution-time work stealing for imbalance recovery.

In our benchmarks (detailed below), spargio outperforms compio (and likely other share-nothing runtimes) on imbalanced workloads by up to 70%, and outperforms tokio on coordination-heavy or disk-I/O cases by up to 320%. compio leads on sustained, balanced workloads by up to 70%.

Out-of-the-box, we support async disk I/O, network I/O (including TLS/WebSockets/QUIC), process execution, and signal handling, and provide an extension API for additional io_uring operations. We support both tokio-style stealable tasks and compio-style pinned (thread-affine) tasks.

Disclaimer

spargio is still experimental and subject to change. Use for evaluation only.

Quick start

Prerequisites: Linux 6.0+ recommended (5.18+ for the core io_uring + msg_ring paths)

Add spargio as a dependency:

cargo add spargio --features macros,uring-native

Then use it for native I/O operations and stealable task spawning:

use spargio::{fs::File, net::TcpListener, RuntimeHandle};

#[spargio::main]
async fn main(handle: RuntimeHandle) -> std::io::Result<()> {
    std::fs::create_dir_all("ingest-out")?;
    let listener = TcpListener::bind(handle.clone(), "127.0.0.1:7001").await?;
    let mut id = 0u64;

    loop {
        // Distribute accepted connections across shards round-robin.
        let (stream, _) = listener.accept_round_robin().await?;
        let (h, s, path) = (handle.clone(), stream.clone(), format!("ingest-out/{id}.bin"));
        id += 1;

        // Spawn a stealable task, initially placed on the stream's session shard.
        stream.spawn_stealable_on_session(&handle, async move {
            let file = File::create(h, path).await.unwrap();
            // Owned-buffer recv: the buffer is handed to the ring and returned with the byte count.
            let (n, buf) = s.recv_owned(vec![0; 64 * 1024]).await.unwrap();
            file.write_all_at(0, &buf[..n]).await.unwrap();
            file.fsync().await.unwrap();
        }).expect("spawn");
    }
}

Tokio Integration

Recommended model today:

  • Run Tokio and Spargio side-by-side.
  • Exchange work/results through explicit boundaries (spargio::boundary, channels, adapters).
  • Move selected hot paths into Spargio without forcing full dependency migration.

Note: uniquely among these runtimes, a Tokio-compat readiness shim based on IORING_OP_POLL_ADD could be built on top of Spargio without sacrificing work stealing; however, building and maintaining a dependency-transparent drop-in lane would be a large investment.

Inspirations and Further Reading

Using msg_ring for coordination is heavily inspired by ourio. We extend that idea to work-stealing.

The question of whether to build a work-stealing pool on io_uring at all was inspired by several excellent blog posts.

Terminology: Shards

In Spargio, a shard is one worker thread + its io_uring ring (SQ + CQ) + a local run/command queue. Internally, Spargio passes work from one shard to another by enqueueing work and injecting CQEs across shards, waking the recipient worker thread to drain pending work from its queue.

How Does Spargio Compare to Tokio / Ringolo / Compio / Monoio / Glommio / Tokio-uring?

| Runtime | Work stealing | Post-start task migration | Per-thread io_uring ownership | Spawned tasks always pinned | Explicit task placement (Pinned/Sticky/Stealable/RoundRobin) |
|---|---|---|---|---|---|
| Tokio | Yes | Yes | No* | No | No** |
| Ringolo | Yes | Yes | Yes | No | No |
| Spargio | Yes | No | Yes | No | Yes |
| Compio / Monoio / Glommio / Tokio-uring | No | No | Yes | Yes | No |

Spectrum, from most migratable to most thread-local: Tokio -> Ringolo -> Spargio -> Compio / Monoio / Glommio / Tokio-uring. Tokio freely migrates Send tasks; Ringolo allows post-start migration while respecting thread-local io_uring constraints (at yield points); Spargio only steals queued work before execution starts (tasks that yield are resumed on the same thread); the share-nothing runtimes keep tasks and io_uring ownership on one thread.

Benchmark results (sans Ringolo, an interesting but newer, alpha-stage runtime) are below to highlight the types of load each point on the spectrum handles well.

* Tokio has active work toward io_uring-backed support in the main crate, including an incremental tokio::fs effort and broader io_uring design notes. See Tokio issues #7266 and #2411.

** Tokio exposes locality tools such as LocalSet, but unlike Spargio it cannot route spawned tasks to a chosen worker thread. Crates such as tokio-sticky-channel can help preserve same-worker routing at the application layer, but this is not runtime-level sticky spawn/submission.

Benchmark Results

All benchmark tables below report Criterion mean wall-clock iteration latency (point estimates from estimates.json). Speedup columns use baseline_mean / spargio_mean (higher is better for Spargio).

Coordination-focused workloads (Tokio vs Spargio)

| Benchmark | Description | Tokio | Spargio | Speedup |
|---|---|---|---|---|
| steady_ping_pong_rtt | Two-worker request/ack round-trip loop | 1.409 ms | 367.09 us | 3.8x |
| steady_one_way_send_drain | One-way sends, then explicit drain barrier | 64.97 us | 47.27 us | 1.4x |
| cold_start_ping_pong | Includes runtime/harness startup and teardown | 467.35 us | 238.07 us | 2.0x |
| fanout_fanin_balanced | Even fanout/fanin across shards | 1.421 ms | 1.185 ms | 1.2x |
| fanout_fanin_skewed | Skewed fanout/fanin with hotspot pressure | 2.354 ms | 1.938 ms | 1.2x |

Compio is not listed in this coordination-only table because it is share-nothing (thread-per-core), while these cases focus on cross-shard coordination behavior.

Native API workloads (Tokio vs Spargio vs Compio)

| Benchmark | Description | Tokio | Spargio | Compio | Spargio vs Tokio | Spargio vs Compio |
|---|---|---|---|---|---|---|
| fs_read_rtt_4k (qd=1) | 4 KiB file read latency, depth 1 | 1.582 ms | 1.218 ms | 1.543 ms | 1.3x | 1.3x |
| fs_read_throughput_4k_qd32 | 4 KiB file reads, queue depth 32 | 14.751 ms | 6.689 ms | 5.281 ms | 2.2x | 0.8x |
| net_echo_rtt_256b (qd=1) | 256-byte TCP echo latency, depth 1 | 7.365 ms | 5.967 ms | 6.578 ms | 1.2x | 1.1x |
| net_stream_throughput_4k_window32 | 4 KiB stream throughput, window 32 | 13.429 ms | 12.110 ms | 6.991 ms | 1.1x | 0.6x |

Imbalanced Native API workloads (Tokio vs Spargio vs Compio)

| Benchmark | Description | Tokio | Spargio | Compio | Spargio vs Tokio | Spargio vs Compio |
|---|---|---|---|---|---|---|
| net_stream_imbalanced_4k_hot1_light7 | 8 streams, 1 static hot + 7 light, 4 KiB frames | 15.558 ms | 14.192 ms | 13.772 ms | 1.1x | 1.0x |
| net_stream_hotspot_rotation_4k | 8 streams, rotating hotspot each step, I/O-only | 10.093 ms | 11.004 ms | 18.784 ms | 0.9x | 1.7x |
| net_pipeline_hotspot_rotation_4k_window32 | 8 streams, rotating hotspot with recv/CPU/send pipeline | 30.103 ms | 33.698 ms | 57.827 ms | 0.9x | 1.7x |
| net_keyed_hotspot_rotation_4k | 8 streams, rotating hotspot with keyed ownership routing | 10.598 ms | 11.148 ms | 18.499 ms | 1.0x | 1.7x |

Benchmark Interpretation

TL;DR: As expected, Spargio is strongest on coordination-heavy and low-depth latency workloads; Compio is strongest on sustained balanced stream throughput. Tokio is near parity with Spargio on rotating-hotspot network shapes.

  • Spargio leads in coordination-heavy cross-shard cases versus Tokio (steady_ping_pong_rtt, steady_one_way_send_drain, cold_start_ping_pong, fanout_fanin_*).
  • Spargio leads in low-depth fs/net latency (fs_read_rtt_4k, net_echo_rtt_256b) versus both Tokio and Compio.
  • Compio leads in sustained balanced stream throughput and static-hotspot imbalance (net_stream_throughput_4k_window32, net_stream_imbalanced_4k_hot1_light7), while Spargio is currently ahead of Tokio in both of those cases.
  • Tokio and Spargio are near parity in rotating-hotspot stream/pipeline cases and keyed routing (net_stream_hotspot_rotation_4k, net_pipeline_hotspot_rotation_4k_window32, net_keyed_hotspot_rotation_4k).

The practical takeaway: different workload shapes favor different runtimes.

Exploratory Benchmarks (Subject to Change, May Be Removed)

These workloads focus on mixed coordination/dispatch shapes, queue-depth-insensitive patterns, and fs+net deadline-churn microservice paths.

| Benchmark | Description | Tokio | Spargio | Compio | Spargio vs Tokio | Spargio vs Compio |
|---|---|---|---|---|---|---|
| net_keyed_hotspot_rotation_4k_window64_cpu | Keyed rotating hotspot with larger window + CPU tail | 12.958 ms | 14.280 ms | 24.636 ms | 0.9x | 1.7x |
| ingress_dispatch_to_workers_rr_256b_ack | Ingress dispatch loop with 256B ACK-shaped payloads | 48.631 ms | 44.360 ms | 76.128 ms | 1.1x | 1.7x |
| fs_net_microservice_4k_read_then_256b_reply_qd1 | 4KiB read then 256B reply per request (qd=1) | 15.906 ms | 11.275 ms | 13.342 ms | 1.4x | 1.2x |
| fanout_fanin_rotating_hot_partition_4k_window32 | Rotating hot partition with recv/CPU/send pipeline | 29.807 ms | 31.997 ms | 57.288 ms | 0.9x | 1.8x |
| session_owner_with_spillover_4k | Session-owned streams under spillover pressure | 39.773 ms | 41.490 ms | 75.135 ms | 1.0x | 1.8x |
| net_burst_flip_imbalance_4k | Burst-heavy hotspot that flips owner periodically | 92.295 ms | 91.988 ms | 72.799 ms | 1.0x | 0.8x |
| fanin_barrier_micro_batches_1k | Fan-in barrier with micro-batches | 53.850 ms | 51.174 ms | 87.086 ms | 1.1x | 1.7x |
| serial_dep_chain_rpc_256b | Serial dependency chain of RPCs (qd=1) | 29.779 ms | 20.838 ms | 25.075 ms | 1.4x | 1.2x |
| keyed_hotspot_flip_p99_4k | Keyed hotspot ownership flips each phase | 73.335 ms | 74.969 ms | 56.894 ms | 1.0x | 0.8x |
| fanin_barrier_rounds_1k | Cross-shard fan-in barrier rounds | 54.253 ms | 48.981 ms | 77.533 ms | 1.1x | 1.6x |
| wakeup_sparse_event_rtt_64b | Sparse wakeups with idle gaps between tiny events | 16.836 ms | 15.289 ms | 16.244 ms | 1.1x | 1.1x |
| timer_cancel_reschedule_storm | Timer cancel/reschedule churn | 1.097 s | 14.953 ms | 14.458 ms | 73.4x | 1.0x |
| mixed_control_data_plane_4k_plus_64b | Mixed 4KiB data-plane with 64B control RPCs | 26.742 ms | 24.522 ms | 18.396 ms | 1.1x | 0.8x |
| bounded_pipeline_backpressure_4k_window2 | Bounded pipeline under backpressure (window=2) | 21.725 ms | 19.725 ms | 33.063 ms | 1.1x | 1.7x |
| post_io_cpu_locality_4k_window1 | Post-I/O CPU locality-sensitive pipeline (window=1) | 17.837 ms | 16.611 ms | 29.331 ms | 1.1x | 1.8x |
| fs_net_microservice_deadline_dispatch_4k_read_256b_reply | 4KiB read + 256B reply with deadline churn and ingress dispatch | 604.230 ms | 53.017 ms | 83.323 ms | 11.4x | 1.6x |
| net_echo_rtt_deadline_routing_256b | RPC gateway shape: RTT + routing + deadline churn | 481.742 ms | 56.117 ms | 80.623 ms | 8.6x | 1.4x |
| net_stream_multitenant_4k_window8 | Multi-tenant keyed stream with smaller in-flight window | 27.898 ms | 26.968 ms | 45.805 ms | 1.0x | 1.7x |
| net_stream_hotflip_4k | Hot owner flips quickly across streams | 99.604 ms | 98.086 ms | 76.627 ms | 1.0x | 0.8x |
| net_pipeline_barrier_4k_window4 | Pipeline with explicit barrier rounds (window=4) | 35.477 ms | 33.237 ms | 53.151 ms | 1.1x | 1.6x |
| keyed_router_with_session_owner_spillover_4k | Keyed owner routing plus spillover pressure | 49.198 ms | 48.075 ms | 86.503 ms | 1.0x | 1.8x |
| fs_metadata_then_reply_qd1 | fstat+read+small reply with deadline churn (qd=1) | 238.254 ms | 20.541 ms | 24.149 ms | 11.6x | 1.2x |
| high_depth_fanout_first_k_cancel_256b_window64 | High-depth fanout with first-K style cancellation pressure | 202.849 ms | 119.585 ms | 204.941 ms | 1.7x | 1.7x |
| high_depth_multitenant_keyed_router_4k_window64 | High-depth multi-tenant keyed router with rotating hotspots | 91.931 ms | 95.681 ms | 173.001 ms | 1.0x | 1.8x |
| high_depth_barriered_pipeline_4k_window64 | High-depth pipeline with barrier synchronization rounds | 63.167 ms | 68.530 ms | 117.868 ms | 0.9x | 1.7x |
| high_depth_deadline_gateway_256b_window64 | High-depth gateway with routing, RPC, and deadline churn | 249.920 ms | 68.963 ms | 72.443 ms | 3.6x | 1.1x |
| high_depth_fs_net_admission_control_4k_read_256b_reply_window64 | High-depth fs+net admission-control path with timer churn | 131.842 ms | 32.336 ms | 54.807 ms | 4.1x | 1.7x |

What's Done

  • Sharded runtime with Linux IoUring backend.
  • Cross-shard typed/raw messaging, nowait sends, batching, and flush tickets.
  • Placement APIs: Pinned, RoundRobin, Sticky, Stealable, StealablePreferred.
  • Work-stealing scheduler with adaptive steal gating/backoff, victim probing, batch steals, wake coalescing, backpressure, and runtime stats.
  • Runtime primitives: sleep, sleep_until, timeout, timeout_at, Interval/interval_at, Sleep (resettable deadline timer), CancellationToken, and TaskGroup cooperative cancellation.
  • Runtime entry ergonomics: async-first spargio::run(...), spargio::run_with(builder, ...), and optional #[spargio::main(...)] via macros.
  • Runtime utility bridge knobs: RuntimeHandle::spawn_blocking(...) and RuntimeBuilder::thread_affinity(...).
  • Local !Send ergonomics: run_local_on(...) and RuntimeHandle::spawn_local_on(...) for shard-pinned local futures.
  • Unbound native API: RuntimeHandle::uring_native_unbound() -> UringNativeAny with file ops (read_at, read_at_into, write_at, fsync) and stream/socket ops (recv, send, send_owned, recv_owned, send_all_batch, recv_multishot_segments), plus submission-time shard selector, FD affinity leases, and active op route tracking.
  • Low-level unsafe native extension API: UringNativeAny::{submit_unsafe, submit_unsafe_on_shard} for custom SQE/CQE workflows in external extensions.
  • Safe native extension wrapper slice + cookbook: spargio::extension::fs::{statx, statx_on_shard, statx_or_metadata} plus docs/native_extension_cookbook.md.
  • Ergonomic fs/net APIs on top of native I/O: spargio::fs::{OpenOptions, File} plus path helpers (create_dir*, rename, remove_*, metadata/link helpers, read/write), and spargio::net::{TcpListener, TcpStream, UdpSocket, UnixListener, UnixStream, UnixDatagram}.
  • Directory traversal + du parity helpers: low-level spargio::extension::fs::read_dir_entries(...) and high-level spargio::fs::{read_dir(...), du(...), DuOptions, DuSummary} with sparse/hardlink/symlink and one-filesystem policy support.
  • Measured metadata fast path helper: spargio::fs::metadata_lite(...) (statx-backed with fallback).
  • Native-first fs path-op lane on Linux io_uring for high-value helpers (create_dir, remove_file, remove_dir, rename, hard_link, symlink), with compatibility fallback on unsupported opcode kernels.
  • Foundational I/O utility layer: spargio::io::{AsyncRead, AsyncWrite, split, copy_to_vec, BufReader, BufWriter} and io::framed::LengthDelimited.
  • Native setup path on Linux io_uring lane: open/connect/accept are nonblocking and routed through native setup ops (no helper-thread run_blocking wrappers in public fs/net setup APIs).
  • Native timeout path on io_uring lane: UringNativeAny::sleep(...) and shard-context spargio::sleep(...) route through IORING_OP_TIMEOUT.
  • Async-first boundary APIs: call, call_with_timeout, recv, recv_timeout, and BoundaryTicket::wait_timeout.
  • Explicit socket-address APIs that bypass DNS resolution: connect_socket_addr* and bind_socket_addr.
  • Benchmark suites: benches/ping_pong.rs, benches/fanout_fanin.rs, benches/fs_api.rs (Tokio/Spargio/Compio), and benches/net_api.rs (Tokio/Spargio/Compio).
  • Scheduler profiling lane with callgrind/cachegrind: scripts/bench_scheduler_profile.sh and ratio guardrail helper scripts/scheduler_profile_guardrail.sh.
  • Mixed-runtime boundary API: spargio::boundary.
  • Companion crate suite: spargio-process, spargio-signal, spargio-protocols (legacy blocking bridge helpers), spargio-tls (rustls/futures-rustls adapter), spargio-ws (async-tungstenite adapter), and spargio-quic with selectable backend mode (QuicBackend::Native default dispatch and explicit QuicBackend::Bridge compatibility fallback).
  • Native-vs-bridge QUIC cutover guardrails: native data path is validated to avoid bridge task spawning, while bridge mode remains explicit compatibility fallback.
  • QUIC native default backend now runs on quinn-proto driver path (NativeProtoDriver + native UDP pump/timers) with stream/datagram operations routed through the driver; bridge mode remains explicit compatibility fallback.
  • QUIC stream APIs support incremental reads (QuicRecvStream::read_chunk) and owned-byte writes (QuicSendStream::write_bytes) for long-lived framed protocols.
  • QUIC native wait loops use adaptive retry backoff (100us -> 250us -> 1ms) to reduce low-latency polling overhead while avoiding sustained busy spinning under longer stalls.
  • Companion hardening lane: scripts/companion_ci_smoke.sh plus CI companion-matrix job.
  • QUIC qualification lanes: interop matrix (scripts/quic_interop_matrix.sh), soak/fault lane (scripts/quic_soak_fault.sh, nightly), and native-vs-bridge perf gate (scripts/quic_perf_gate.sh).
  • In-repo user-facing book/ (mdBook) covering quick start, task placement (!Send + stealable locality-first defaults), I/O API selection, protocol crates, native extensions, performance tuning, operations, migration, and status.
  • Reference mixed-mode service example.

What's Not Done Yet

  • Hostname-based ToSocketAddrs connect/bind paths can still block for DNS resolution; use explicit SocketAddr APIs (connect_socket_addr*, bind_socket_addr) for strictly non-DNS data-plane paths.
  • Remaining fs helper migration to native io_uring where it is not a clear win is deferred: canonicalize, metadata, symlink_metadata, and set_permissions currently use compatibility blocking paths (create_dir_all is native-first for straightforward paths; metadata_lite exists as native-first metadata alternative).
  • Work-stealing tuning guidance still needs deeper production case studies and calibration examples on top of the current knob and profiling documentation.
  • Work-stealing queue structure is still generic compared with Tokio's specialized scheduler queues; evaluate a better-optimized queue design (owner-fast-path + injection/steal structure) later.
  • Continue readability/editorial cleanup across README + book/: tighten wording, keep examples minimal but practical, and reduce ambiguous terminology.
  • Broaden documentation coverage while refactoring core modules for maintainability: keep API docs/book content aligned as runtime/fs/net surfaces continue to be split into smaller focused units.

Longer-term Improvement Ideas

  • Optional Tokio-compat readiness emulation shim (IORING_OP_POLL_ADD) is explicitly deprioritized for now (backlog-only, not planned right now).
  • Full production-grade higher-level ecosystem parity is still in progress; companion crates now provide practical bridges and qualification lanes, but deeper protocol-specific maturity remains (broader TLS/WS tuning surfaces, richer process stdio orchestration, and deeper long-window failure coverage).
  • QUIC backend hardening is still in progress: native default path is driver-backed now, but long-window soak/fault/perf requalification depth and rollout maturity (rollout_stage) still need production validation.
  • Production hardening beyond smoke lanes: deeper failure-injection/soak coverage, broader observability for companion protocol paths, and long-window p95/p99 gates.
  • Further workload-specific work-stealing model calibration is still iterative (the adaptive policy is implemented, but thresholds/weights are expected to continue evolving with production traces).
  • Multi-endpoint QUIC sharding/fan-out orchestration is not built in yet: a single QuicEndpoint still owns one native transport backend, so multi-core listener scaling is currently a manual multi-endpoint deployment pattern.
  • Fully io_uring-submitted directory traversal is still in progress: read_dir/du APIs are built-in, but (as of 2026-03-03) upstream io_uring userspace/kernel ABIs do not expose a stable getdents opcode surface (IORING_OP_GETDENTS), so traversal currently uses a blocking-helper lane (getdents64 with compatibility fallback) instead of pure in-ring submission.

Contributor Quick Start

cargo test
cargo test --features uring-native
cargo bench --features uring-native --no-run
cargo test --features macros --test entry_macro_tdd

Benchmark helpers:

./scripts/bench_fanout_smoke.sh
./scripts/bench_ping_guardrail.sh
./scripts/bench_fanout_guardrail.sh
./scripts/bench_kpi_guardrail.sh
./scripts/bench_scheduler_profile.sh
./scripts/bench_scheduler_calibration.sh
./scripts/scheduler_profile_guardrail.sh
./scripts/companion_ci_smoke.sh
./scripts/quic_interop_matrix.sh
./scripts/quic_perf_gate.sh
./scripts/quic_soak_fault.sh

Reference app:

cargo run --example mixed_mode_service

Runtime Entry

Helper-based entry:

#[tokio::main]
async fn main() -> Result<(), spargio::RuntimeError> {
    spargio::run(|handle| async move {
        let job = handle.spawn_stealable(async { 42usize }).expect("spawn");
        assert_eq!(job.await.expect("join"), 42);
    })
    .await
}

Attribute-macro entry (enable with --features macros):

#[spargio::main(shards = 4, backend = "io_uring")]
async fn main() {
    // async body runs on Spargio runtime
}

The macro takes two optional arguments. Without them, #[spargio::main] uses sensible defaults: the io_uring backend and a shard count derived from available CPU parallelism. Use macro arguments only when you need explicit overrides.

Use run_with(...) for builder-only controls such as worker idle strategy:

use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), spargio::RuntimeError> {
    let builder = spargio::Runtime::builder()
        .shards(2)
        .idle_strategy(spargio::IdleStrategy::Sleep(Duration::from_millis(1)));

    spargio::run_with(builder, |handle| async move {
        let job = handle.spawn_stealable(async { 7usize }).expect("spawn");
        assert_eq!(job.await.expect("join"), 7);
    })
    .await
}

Repository Map

  • src/lib.rs: runtime implementation.
  • tests/: TDD coverage.
  • benches/: Criterion benchmarks.
  • examples/: mixed-mode reference app.
  • scripts/: benchmark smoke/guard helpers.
  • .github/workflows/: CI gates.
  • IMPLEMENTATION_LOG.md: implementation and benchmark log.
  • architecture_decision_records/: ADRs.

Connection Placement Best Practices

  • Use spargio::net::TcpStream::connect(...) for simple or latency-first paths (few streams, short-lived connections).
  • Use spargio::net::TcpStream::connect_many_round_robin(...) (or connect_with_session_policy(..., RoundRobin)) for sustained multi-stream throughput workloads.
  • For per-stream hot I/O loops, pair round-robin stream setup with stream.spawn_on_session(...) to keep execution aligned with the stream session shard.
  • Use stealable task placement when post-I/O CPU work is dominant and can benefit from migration.
  • As a practical starting heuristic: if active stream count is at least 2x shard count and streams are long-lived, prefer round-robin/distributed mode.

Engineering Method

Development style is red/green TDD:

  1. Add failing tests.
  2. Implement minimal passing behavior.
  3. Validate with full test and benchmark checks.

License

This project is licensed under the MIT License. See LICENSE.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in spargio by you shall be licensed as MIT, without any additional terms or conditions.