Safe Rust wrappers for NVIDIA NCCL (multi-GPU collective communication).
Layered on top of baracuda-nccl-sys.
Use this crate directly for typed, RAII-managed communicators +
collectives; reach for -sys only when adding a function the safe
layer doesn’t expose yet.
§Scope
- Communicator lifecycle: single-process multi-GPU via ncclCommInitAll, multi-process via ncclCommInitRank + UniqueId exchange, communicator destruction, error querying.
- All collectives: all_reduce, all_gather, broadcast, reduce, reduce_scatter, send/recv (point-to-point used by baracuda-kernels’s Ring Attention K/V chunk rotation).
- Group operations: group_start/group_end for batched collective launches (essential for Megatron-LM TP all-reduce patterns).
- Communicator features: stream binding, error checking, abort + finalize for graceful shutdown.
- Datatype helpers: f32/f64/f16/bf16/u8/i32/i64/u64 reduction support via DataType enum.
- Reduction ops: sum, prod, min, max, avg, pre_mul_sum.
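To make the all-reduce and reduction-op semantics listed above concrete, here is a minimal host-only model. It is a sketch, not this crate's API: the `Op` enum and `all_reduce_model` function are hypothetical names introduced for illustration, each rank's buffer is assumed to be a plain `Vec<f32>`, and `PreMulSum` is modeled as "scale every input by a scalar, then sum", which is how NCCL describes its pre-multiplied sum.

```rust
/// Hypothetical stand-in for the reduction ops listed above (host-side model only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    Sum,
    Prod,
    Min,
    Max,
    Avg,
    /// Multiply each rank's input by the scalar, then sum.
    PreMulSum(f32),
}

/// Model of all-reduce: every rank ends up holding the same per-element
/// reduction taken across all ranks' input buffers.
pub fn all_reduce_model(ranks: &[Vec<f32>], op: Op) -> Vec<Vec<f32>> {
    assert!(!ranks.is_empty(), "need at least one rank");
    let len = ranks[0].len();
    assert!(
        ranks.iter().all(|r| r.len() == len),
        "all ranks must contribute equal-length buffers"
    );
    let n = ranks.len() as f32;

    let reduced: Vec<f32> = (0..len)
        .map(|i| {
            // The i-th element from every rank, in rank order.
            let column = ranks.iter().map(|r| r[i]);
            match op {
                Op::Sum => column.sum::<f32>(),
                Op::Prod => column.product::<f32>(),
                Op::Min => column.fold(f32::INFINITY, f32::min),
                Op::Max => column.fold(f32::NEG_INFINITY, f32::max),
                Op::Avg => column.sum::<f32>() / n,
                Op::PreMulSum(s) => column.map(|x| s * x).sum::<f32>(),
            }
        })
        .collect();

    // Every rank receives an identical copy of the reduced buffer.
    vec![reduced; ranks.len()]
}
```

With two ranks holding `[1.0, 2.0]` and `[3.0, 4.0]`, `Op::Sum` leaves `[4.0, 6.0]` on both ranks. The real collectives run on device buffers and streams; this model only captures the arithmetic.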
§Platform support
NCCL is primarily a Linux library. Windows has experimental
support in newer NCCL versions but is uncommon. The crate compiles
on Windows but Communicator::init_all returns
LoaderError::LibraryNotFound at runtime on hosts without NCCL —
single-device callers can detect this and fall back gracefully.
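The fallback described above might be structured roughly as follows. This is a sketch under stated assumptions: `LoaderError` here is a local stand-in that mirrors only the variant named in the text, `Backend` and `select_backend` are hypothetical names, and the closure stands in for a call such as Communicator::init_all, whose real signature and error type are not shown on this page.

```rust
/// Local stand-in for the loader error named above; NOT the crate's own type.
#[derive(Debug, PartialEq)]
pub enum LoaderError {
    LibraryNotFound,
}

/// Hypothetical description of the execution path the caller ends up on.
#[derive(Debug, PartialEq)]
pub enum Backend {
    MultiGpu { devices: usize },
    SingleDevice,
}

/// Try the multi-GPU initialiser; if the NCCL library is missing,
/// fall back to a single-device path instead of failing.
pub fn select_backend<F>(init_all: F) -> Backend
where
    F: FnOnce() -> Result<usize, LoaderError>,
{
    match init_all() {
        Ok(devices) => Backend::MultiGpu { devices },
        Err(LoaderError::LibraryNotFound) => Backend::SingleDevice,
    }
}
```

In real code the error may well be wrapped in a broader crate-level error type, so the match arm would need adjusting accordingly.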
§Usage with baracuda-kernels
baracuda-kernels’s Ring Attention plan (Phase 56,
ring_attention feature) and Megatron-LM TP primitives (Phase 57,
megatron_tp feature) both consume this crate. Direct callers
commonly use the communicator for synchronous data-parallel
training all-reduce.
Re-exports§
pub use RedOp as NcclReduceOp;
pub use UniqueId as NcclUniqueId;
pub use NcclScalar as NcclDataType;
Structs§
- Communicator
- A NCCL communicator — one rank’s view of a distributed group.
- NcclMem
- NCCL-managed device allocation. Drop calls ncclMemFree.
- UniqueId
- A 128-byte opaque identifier for establishing a multi-process NCCL communicator. One process calls UniqueId::new and distributes the bytes to all other processes via a user-provided channel (TCP, MPI, …); every process then calls Communicator::init_rank with the same id.
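The "user-provided channel" for the id bytes is left to the caller. One possible shape, using only the standard library over TCP, is sketched below. The function names (`serve_id`, `fetch_id`, `exchange_demo`) are hypothetical, the id is modeled as a raw `[u8; 128]` rather than the crate's UniqueId type, and the demo uses threads on loopback in place of separate processes. How the bytes are extracted from or converted back into a UniqueId is not shown here because this page does not document those methods.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Size of the opaque id in bytes, per the description above.
pub const ID_BYTES: usize = 128;

/// Root side: hand the id bytes to `peers` incoming connections, one at a time.
pub fn serve_id(listener: TcpListener, id: [u8; ID_BYTES], peers: usize) -> std::io::Result<()> {
    for _ in 0..peers {
        let (mut stream, _addr) = listener.accept()?;
        stream.write_all(&id)?;
    }
    Ok(())
}

/// Non-root side: connect to the root and read exactly `ID_BYTES` bytes.
pub fn fetch_id(addr: SocketAddr) -> std::io::Result<[u8; ID_BYTES]> {
    let mut stream = TcpStream::connect(addr)?;
    let mut id = [0u8; ID_BYTES];
    stream.read_exact(&mut id)?;
    Ok(id)
}

/// Single-process demo: a thread plays the root, the caller plays each peer.
/// The id is filled with placeholder bytes (0, 1, 2, ...) for illustration.
pub fn exchange_demo(peers: usize) -> std::io::Result<Vec<[u8; ID_BYTES]>> {
    // Port 0 asks the OS for any free port on loopback.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let mut id = [0u8; ID_BYTES];
    for (i, byte) in id.iter_mut().enumerate() {
        *byte = i as u8;
    }

    let server = thread::spawn(move || serve_id(listener, id, peers));
    let mut received = Vec::with_capacity(peers);
    for _ in 0..peers {
        received.push(fetch_id(addr)?);
    }
    server.join().expect("server thread panicked")?;
    Ok(received)
}
```

In a real deployment each peer would run `fetch_id` in its own process and then pass the reconstructed id to Communicator::init_rank; an MPI broadcast or a shared file would serve equally well as the channel.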
Enums§
- RawNcclDataType
- Re-export the raw ncclDataType_t enum so callers can pattern-match or pass it to lower-level helpers if needed. NCCL element data type.
- RedOp
- Reduction operation for all_reduce / reduce.
- ScalarResidence
- Where the scalar passed to Communicator::create_pre_mul_sum lives.
Traits§
- NcclScalar
- Element type for NCCL buffers. Implemented by baracuda-types primitives via a sealed trait.
Functions§
- all_reduce
- All-reduce: each rank contributes its send buffer and receives the per-element reduction (across every rank) into its recv buffer. In-place use (send == recv) is legal.
- broadcast
- Broadcast the data in root’s send buffer to every rank’s recv buffer.
- error_string
- Human-readable name for a status code.
- group_end
- End the current collective group.
- group_start
- Begin a group of collectives that must be submitted atomically (e.g. in single-process multi-GPU all-reduce).
- version
- NCCL library version as a packed integer (e.g. 22100 for NCCL 2.21.0).
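The packed integer can be split back into its components. The helper below is a small sketch, not part of this crate: `decode_nccl_version` is a hypothetical name, the `u32` input type is an assumption about what version returns, and the layout major * 10000 + minor * 100 + patch is the one NCCL uses from 2.9 onward (older releases packed the number differently).

```rust
/// Split a packed NCCL version code into (major, minor, patch).
/// Assumes the NCCL >= 2.9 layout: major * 10000 + minor * 100 + patch.
pub fn decode_nccl_version(code: u32) -> (u32, u32, u32) {
    let major = code / 10_000;
    let minor = (code / 100) % 100;
    let patch = code % 100;
    (major, minor, patch)
}
```

For the example above, `decode_nccl_version(22100)` yields `(2, 21, 0)`.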