# nexar

Distributed runtime for Rust. QUIC transport, stream-multiplexed messaging, built-in collectives. No C dependencies.
nexar replaces MPI for inter-node communication. It handles the network layer — point-to-point transfers, allreduce, broadcast, barrier — so your distributed application doesn't have to shell out to mpirun or link against libfabric.
## Why not MPI?
MPI works. It's also a C library with decades of accumulated complexity, a rigid process launcher, TCP-based transports that suffer from head-of-line blocking, and an implicit assumption that you'll manage serialization yourself.
nexar takes a different approach:
- QUIC transport (via quinn). Multiplexed streams mean a stalled tensor transfer doesn't block your barrier. TLS is built into the protocol.
- No process launcher. A lightweight seed node handles discovery. Workers connect, get a rank, and form a direct peer-to-peer mesh. Nodes can join and leave.
- No C dependencies. Pure Rust, compiles with `cargo build`. No `libmpi`, no `libfabric`, no `libucp`.
- Async-native. Built on tokio. Send and receive overlap naturally.
## What it provides
Point-to-point:
- `send`/`recv` — tagged messages between any two ranks
Collectives:
- `ring_allreduce` — scatter-reduce + allgather over a ring
- `tree_broadcast` — fan-out from root
- `ring_allgather` — ring-based gather
- `ring_reduce_scatter` — ring-based reduce-scatter
- `two_phase_barrier` — distributed synchronization with timeout
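The scatter-reduce + allgather structure lends itself to a plain in-memory illustration. The sketch below simulates the two phases of a ring allreduce across `n` ranks with no networking at all; it shows the data movement of the algorithm, not nexar's API:

```rust
// In-memory simulation of ring allreduce: each rank holds one buffer,
// and chunks travel one hop per step. Real nexar moves these chunks
// over QUIC streams between machines.
fn ring_allreduce(mut bufs: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    let n = bufs.len();
    let len = bufs[0].len();
    assert!(len % n == 0, "buffer must split evenly into n chunks");
    let chunk = len / n;

    // Phase 1: scatter-reduce. After n-1 steps, rank r holds the fully
    // reduced copy of chunk (r + 1) mod n.
    for step in 0..n - 1 {
        // Snapshot what each rank sends this step (rank r sends chunk
        // (r - step) mod n to its ring neighbor (r + 1) mod n).
        let sends: Vec<Vec<f32>> = (0..n)
            .map(|r| {
                let c = (r + n - step) % n;
                bufs[r][c * chunk..(c + 1) * chunk].to_vec()
            })
            .collect();
        for r in 0..n {
            let dst = (r + 1) % n;
            let c = (r + n - step) % n;
            for i in 0..chunk {
                bufs[dst][c * chunk + i] += sends[r][i];
            }
        }
    }

    // Phase 2: allgather. Each rank forwards its completed chunk around
    // the ring; after n-1 steps every rank holds every reduced chunk.
    for step in 0..n - 1 {
        let sends: Vec<Vec<f32>> = (0..n)
            .map(|r| {
                let c = (r + 1 + n - step) % n;
                bufs[r][c * chunk..(c + 1) * chunk].to_vec()
            })
            .collect();
        for r in 0..n {
            let dst = (r + 1) % n;
            let c = (r + 1 + n - step) % n;
            bufs[dst][c * chunk..(c + 1) * chunk].copy_from_slice(&sends[r]);
        }
    }
    bufs
}

fn main() {
    // 4 ranks, each holding [rank; 4]; the elementwise sum is 0+1+2+3 = 6.
    let bufs: Vec<Vec<f32>> = (0..4).map(|r| vec![r as f32; 4]).collect();
    let out = ring_allreduce(bufs);
    for b in &out {
        assert_eq!(b, &vec![6.0; 4]);
    }
    println!("allreduce ok: {:?}", out[0]);
}
```

Each step moves only `1/n` of the buffer per link, which is why the ring variants keep per-link bandwidth flat as the cluster grows.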
RPC:
- Register handlers by function ID, call them by rank. Responses are matched per-request, so concurrent RPCs don't interfere.
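The per-request matching can be modeled as a table of one-shot channels keyed by request ID. This is a simplified synchronous sketch with illustrative names (nexar's actual dispatcher routes replies from QUIC streams inside an async task):

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Illustrative sketch, not nexar's types: a table mapping each request
// ID to the channel its caller is waiting on.
struct Pending {
    next_id: u64,
    waiting: HashMap<u64, mpsc::Sender<Vec<u8>>>,
}

impl Pending {
    fn new() -> Self {
        Pending { next_id: 0, waiting: HashMap::new() }
    }

    // Register an outbound RPC: allocate an ID and a receiver for its reply.
    fn register(&mut self) -> (u64, mpsc::Receiver<Vec<u8>>) {
        let id = self.next_id;
        self.next_id += 1;
        let (tx, rx) = mpsc::channel();
        self.waiting.insert(id, tx);
        (id, rx)
    }

    // The router hands each incoming reply to whoever is waiting on its ID.
    fn deliver(&mut self, id: u64, payload: Vec<u8>) {
        if let Some(tx) = self.waiting.remove(&id) {
            let _ = tx.send(payload);
        }
    }
}

fn main() {
    let mut pending = Pending::new();
    let (id_a, rx_a) = pending.register();
    let (id_b, rx_b) = pending.register();
    // Replies arrive out of order; the IDs still route each to its caller.
    pending.deliver(id_b, b"reply B".to_vec());
    pending.deliver(id_a, b"reply A".to_vec());
    assert_eq!(rx_a.recv().unwrap(), b"reply A".to_vec());
    assert_eq!(rx_b.recv().unwrap(), b"reply B".to_vec());
    println!("rpc matching ok");
}
```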
Device abstraction:
- `DeviceAdapter` trait lets GPU backends stage memory for network I/O without nexar knowing anything about CUDA or ROCm. A `CpuAdapter` is included.
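A rough sketch of the shape of such a trait, with a CPU implementation that just copies bytes. The method names here are assumptions for illustration, not nexar's actual signatures; the point is only that the runtime ever sees host byte buffers, never CUDA or ROCm types:

```rust
// Hypothetical trait shape, not nexar's real API: the runtime stages
// device memory through host buffers it can hand to the network.
trait DeviceAdapter {
    // Copy device memory into a host buffer the network can read.
    fn stage_for_send(&self, src: &[f32]) -> Vec<u8>;
    // Copy received host bytes back into device memory.
    fn unstage_recv(&self, bytes: &[u8]) -> Vec<f32>;
}

// For CPU tensors, "staging" is just a byte-level copy.
struct CpuAdapter;

impl DeviceAdapter for CpuAdapter {
    fn stage_for_send(&self, src: &[f32]) -> Vec<u8> {
        src.iter().flat_map(|v| v.to_le_bytes()).collect()
    }
    fn unstage_recv(&self, bytes: &[u8]) -> Vec<f32> {
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect()
    }
}

fn main() {
    let adapter = CpuAdapter;
    let staged = adapter.stage_for_send(&[1.5, -2.0]);
    assert_eq!(adapter.unstage_recv(&staged), vec![1.5, -2.0]);
    println!("round-trip ok");
}
```

A GPU backend would implement the same trait with device-to-host copies, keeping CUDA/ROCm entirely behind the trait boundary.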
## Quick start

Add to `Cargo.toml`:

```toml
[dependencies]
nexar = { path = "../nexar" }
tokio = { version = "1", features = ["full"] }
```
Bootstrap a local cluster and run allreduce. The listing below is only a sketch: the entry-point names are assumptions, not the crate's verified API.

```rust
use std::sync::Arc;

#[tokio::main]
async fn main() {
    // Hypothetical flow (check the crate docs for the real entry points):
    // join the mesh via the seed node, receive a rank, then call
    // `ring_allreduce` on a local buffer to sum it across all ranks.
}
```
## Architecture

```
seed node (discovery only, no data routing)
 │
 ├── worker 0 ──── worker 1
 │       \           /
 │        \         /
 │      worker 2 ── worker 3
 │           ...
 └── direct peer-to-peer mesh
```
Workers connect to the seed to get a rank and peer list, then establish direct QUIC connections to every other worker. The seed is not on the data path.
Each peer connection runs a router — a background task that accepts incoming QUIC streams and dispatches them to typed channels:
| Lane | Traffic | Consumer |
|---|---|---|
| `rpc_requests` | Incoming RPC calls | Dispatcher serve loop |
| `rpc_responses` | RPC replies (matched by request ID) | `rpc()` caller via oneshot |
| `control` | Barrier, heartbeat, join/leave | Barrier logic, health monitor |
| `data` | Point-to-point `send`/`recv` | Application code |
| `raw` | Bulk byte streams | Tensor transfers |
Lanes are independent. A full data channel doesn't block control messages.
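The dispatch itself can be pictured as a match from lane to typed channel, where a backed-up consumer on one channel has no effect on the others. A simplified synchronous model (the real router is an async tokio task; the names here are illustrative):

```rust
use std::sync::mpsc;

// The five lanes from the table above.
#[allow(dead_code)]
enum Lane { RpcRequest, RpcResponse, Control, Data, Raw }

// The router owns one sender per lane; each consumer holds its receiver.
// Two lanes shown; the other three follow the same pattern.
struct Router {
    control: mpsc::Sender<Vec<u8>>,
    data: mpsc::Sender<Vec<u8>>,
}

impl Router {
    // Dispatch an incoming frame to its lane's channel. A slow `data`
    // receiver leaves `control` delivery unaffected.
    fn dispatch(&self, lane: Lane, payload: Vec<u8>) {
        match lane {
            Lane::Control => self.control.send(payload).unwrap(),
            Lane::Data => self.data.send(payload).unwrap(),
            _ => { /* rpc_requests, rpc_responses, raw: same pattern */ }
        }
    }
}

fn main() {
    let (ctl_tx, ctl_rx) = mpsc::channel();
    let (data_tx, data_rx) = mpsc::channel();
    let router = Router { control: ctl_tx, data: data_tx };
    router.dispatch(Lane::Data, vec![1, 2, 3]);
    router.dispatch(Lane::Control, b"barrier".to_vec());
    // Each consumer sees only its own lane's traffic.
    assert_eq!(data_rx.recv().unwrap(), vec![1, 2, 3]);
    assert_eq!(ctl_rx.recv().unwrap(), b"barrier".to_vec());
    println!("lane dispatch ok");
}
```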
## Stream protocol
Every QUIC unidirectional stream starts with a 1-byte tag:
- `0x01` — framed message (8-byte LE length prefix + serialized `NexarMessage`)
- `0x02` — raw bytes (8-byte LE length prefix + payload)
Messages are serialized with rkyv (zero-copy deserialization). Maximum message size is 4 GiB.
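The wire layout above is easy to reproduce. A minimal encoder/decoder for the tag + length-prefix framing (illustrative only; nexar's real framed messages carry rkyv-serialized payloads):

```rust
// Tag bytes from the stream protocol above.
const TAG_FRAMED: u8 = 0x01;
const TAG_RAW: u8 = 0x02;

// Encode: 1-byte tag, 8-byte little-endian length, then the payload.
fn encode(tag: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(9 + payload.len());
    out.push(tag);
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

// Decode one frame; returns (tag, payload), or None if the buffer is
// shorter than the header or the declared length.
fn decode(buf: &[u8]) -> Option<(u8, &[u8])> {
    if buf.len() < 9 {
        return None;
    }
    let tag = buf[0];
    let len = u64::from_le_bytes(buf[1..9].try_into().ok()?) as usize;
    buf.get(9..9 + len).map(|payload| (tag, payload))
}

fn main() {
    let framed = encode(TAG_FRAMED, b"serialized message bytes");
    let raw = encode(TAG_RAW, b"tensor bytes");
    assert_eq!(decode(&framed).unwrap().0, TAG_FRAMED);
    let (tag, payload) = decode(&raw).unwrap();
    assert_eq!(tag, TAG_RAW);
    assert_eq!(payload, b"tensor bytes");
    println!("frame ok: {} payload bytes", payload.len());
}
```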
## When to use nexar
Use nexar when:
- You need collectives (allreduce, broadcast) across machines
- You want async, non-blocking communication in Rust
- You don't want to deal with MPI installation, `mpirun`, or C FFI
- You're building distributed ML training or inference
Don't use nexar when:
- Your GPUs are on the same machine — use NCCL directly (NVLink is 10-100x faster than any network)
- You need RDMA / GPUDirect — nexar uses standard UDP/QUIC
- You're already happy with MPI
## Building
Requires Rust 1.85+.
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.