
Crate baracuda_nccl


Safe Rust wrappers for NVIDIA NCCL (multi-GPU collective communication).

Layered on top of baracuda-nccl-sys. Use this crate directly for typed, RAII-managed communicators and collectives; reach for baracuda-nccl-sys only when you need a function the safe layer doesn’t expose yet.

§Scope

  • Communicator lifecycle: single-process multi-GPU via ncclCommInitAll, multi-process via ncclCommInitRank + UniqueId exchange, communicator destruction, error querying.
  • All collectives: all_reduce, all_gather, broadcast, reduce, reduce_scatter, plus point-to-point send / recv (used by baracuda-kernels’s Ring Attention K/V chunk rotation).
  • Group operations: group_start / group_end for batched collective launches (essential for Megatron-LM TP all-reduce patterns).
  • Communicator features: stream binding, error checking, abort + finalize for graceful shutdown.
  • Datatype helpers: f32/f64/f16/bf16/u8/i32/i64/u64 reduction support via DataType enum.
  • Reduction ops: sum, prod, min, max, avg, pre_mul_sum.
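As a reference for what the reduction ops in this list compute, the sketch below models an all-reduce on the CPU. It does not call this crate: RefOp and reference_all_reduce are illustrative names, and the pre_mul_sum arm assumes NCCL's documented behaviour for that operator (each rank scales its own input by the scalar, then the results are summed).

```rust
/// Reference-only model of the reduction operators listed above.
/// This enum is a local stand-in, not the crate's `RedOp`.
#[derive(Clone, Copy, Debug)]
enum RefOp {
    Sum,
    Prod,
    Min,
    Max,
    Avg,
    /// Each rank's input is multiplied by the scalar before summation.
    PreMulSum(f32),
}

/// Computes what every rank should hold after an all-reduce, where
/// `inputs[r]` is rank r's send buffer. Returns `None` when there are
/// no ranks or the buffers differ in length.
fn reference_all_reduce(inputs: &[Vec<f32>], op: RefOp) -> Option<Vec<f32>> {
    let first = inputs.first()?;
    if inputs.iter().any(|buf| buf.len() != first.len()) {
        return None;
    }
    let ranks = inputs.len() as f32;
    let out: Vec<f32> = (0..first.len())
        .map(|i| {
            // Element i of every rank's buffer, in rank order.
            let column = inputs.iter().map(|buf| buf[i]);
            match op {
                RefOp::Sum => column.sum::<f32>(),
                RefOp::Prod => column.product::<f32>(),
                RefOp::Min => column.fold(f32::INFINITY, f32::min),
                RefOp::Max => column.fold(f32::NEG_INFINITY, f32::max),
                RefOp::Avg => column.sum::<f32>() / ranks,
                RefOp::PreMulSum(s) => column.map(|x| s * x).sum::<f32>(),
            }
        })
        .collect();
    Some(out)
}
```

A model like this can serve as a host-side oracle in tests: run the real collective on small buffers, copy the result back, and compare it against the reference within a floating-point tolerance.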

§Platform support

NCCL is primarily a Linux library. Windows has experimental support in newer NCCL versions but is uncommon. The crate compiles on Windows, but on hosts without NCCL, Communicator::init_all returns LoaderError::LibraryNotFound at runtime, so single-device callers can detect this and fall back gracefully.
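The fallback described above might be structured along these lines. This page does not show the exact signature of Communicator::init_all or the shape of the crate's error type, so the sketch uses local stand-ins throughout: InitFailure, Backend, and select_backend are hypothetical names, and the closure stands for whatever wrapper a caller writes around the real init call.

```rust
/// Local stand-in for the loader failure described above; the real error
/// type and its variants belong to the crate and may differ in shape.
#[derive(Debug, PartialEq)]
enum InitFailure {
    LibraryNotFound,
    Other(String),
}

/// Which execution path the caller ends up on.
#[derive(Debug, PartialEq)]
enum Backend {
    MultiGpu { devices: usize },
    SingleDevice,
}

/// Chooses a backend from the outcome of a communicator-init attempt.
/// `init` is a hypothetical closure wrapping the real init call and
/// returning the number of devices it bound.
fn select_backend<F>(init: F) -> Result<Backend, InitFailure>
where
    F: FnOnce() -> Result<usize, InitFailure>,
{
    match init() {
        Ok(devices) => Ok(Backend::MultiGpu { devices }),
        // NCCL is absent on this host: degrade instead of failing.
        Err(InitFailure::LibraryNotFound) => Ok(Backend::SingleDevice),
        // Anything else is a genuine error and is propagated.
        Err(other) => Err(other),
    }
}
```

Only the library-not-found case is treated as recoverable here; other failures (for example a misconfigured device) are surfaced so they are not silently masked by the fallback.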

§Usage with baracuda-kernels

baracuda-kernels’s Ring Attention plan (Phase 56, ring_attention feature) and Megatron-LM TP primitives (Phase 57, megatron_tp feature) both consume this crate. Direct callers commonly use the communicator for synchronous data-parallel training all-reduce.

Re-exports§

pub use RedOp as NcclReduceOp;
pub use UniqueId as NcclUniqueId;
pub use NcclScalar as NcclDataType;

Structs§

Communicator
An NCCL communicator: one rank’s view of a distributed group.
NcclMem
NCCL-managed device allocation. Drop calls ncclMemFree.
UniqueId
A 128-byte opaque identifier for establishing a multi-process NCCL communicator. One process calls UniqueId::new and distributes the bytes to all other processes via a user-provided channel (TCP, MPI, …); every process then calls Communicator::init_rank with the same id.
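The "user-provided channel" for the id can be as simple as a TCP socket. The sketch below uses only the standard library and moves a raw 128-byte array; how UniqueId exposes or accepts its bytes is not shown on this page, so that conversion is left out. serve_id, fetch_id, and ID_BYTES are illustrative names.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Size of the opaque id in bytes, as stated above.
const ID_BYTES: usize = 128;

/// Rank 0 side: accept `peers` connections and send each one the id bytes.
fn serve_id(listener: &TcpListener, id: &[u8; ID_BYTES], peers: usize) -> io::Result<()> {
    for _ in 0..peers {
        let (mut stream, _addr) = listener.accept()?;
        stream.write_all(id)?;
    }
    Ok(())
}

/// Non-zero rank side: connect to rank 0 and read exactly ID_BYTES bytes.
fn fetch_id(addr: SocketAddr) -> io::Result<[u8; ID_BYTES]> {
    let mut stream = TcpStream::connect(addr)?;
    let mut id = [0u8; ID_BYTES];
    // read_exact fails if the peer closes before all bytes arrive.
    stream.read_exact(&mut id)?;
    Ok(id)
}
```

In a real job, rank 0 would create the id, serve its bytes to the other ranks, and every rank would then initialise its communicator with the same id. MPI broadcast or a shared file are equally valid channels.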

Enums§

RawNcclDataType
NCCL element data type. Re-export of the raw ncclDataType_t enum so callers can pattern-match on it or pass it to lower-level helpers if needed.
RedOp
Reduction operation for all_reduce / reduce.
ScalarResidence
Where the scalar passed to Communicator::create_pre_mul_sum lives.

Traits§

NcclScalar
Element type for NCCL buffers. Implemented by baracuda-types primitives via a sealed trait.

Functions§

all_reduce
All-reduce: each rank contributes its send buffer and receives the per-element reduction (across every rank) into recv. In-place use (send == recv) is legal.
broadcast
Broadcast the data at root’s send buffer to every rank’s recv buffer.
error_string
Human-readable name for a status code.
group_end
End the current collective group.
group_start
Begin a group of collectives that must be submitted atomically (e.g. in single-process multi-GPU all-reduce).
version
NCCL library version as a packed integer (e.g. 22100 for NCCL 2.21.0).
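The packed integer returned by version can be split back into its components. The helper below is a sketch, not part of this crate: decode_nccl_version is an illustrative name, the u32 input type is an assumption about the return type, and the encoding is assumed to follow NCCL's NCCL_VERSION macro (major*10000 + minor*100 + patch from 2.9 onward, major*1000 + minor*100 + patch for earlier releases).

```rust
/// Splits a packed NCCL version code into (major, minor, patch).
/// Assumes the NCCL_VERSION encoding: five-digit codes from 2.9 onward,
/// four-digit codes for earlier releases.
fn decode_nccl_version(code: u32) -> (u32, u32, u32) {
    if code >= 10_000 {
        (code / 10_000, (code % 10_000) / 100, code % 100)
    } else {
        (code / 1_000, (code % 1_000) / 100, code % 100)
    }
}
```

With this encoding, 22100 decodes to (2, 21, 0), matching the example above.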

Type Aliases§

Error
Error type for NCCL operations.
Result
Result alias.