Skip to main content

Crate ferrotorch_distributed

Crate ferrotorch_distributed 

Source
Expand description

Distributed training for ferrotorch.

This crate provides the building blocks for multi-rank training:

  • Backends (backend) — Transport-agnostic communication. TcpBackend for real multi-process training, SimulatedBackend for in-process testing.

  • Collectives (collective) — allreduce, all_gather, reduce_scatter, broadcast, and barrier.

  • Async collectives (async_collective) — async_all_gather and async_reduce_scatter return a PendingCollective handle that can be wait()ed on after local compute, enabling FSDP backward prefetch.

  • DDP (ddp) — DDP wraps a Module and synchronizes gradients across ranks after each backward pass.

  • FSDP (fsdp) — FSDP wraps a Module and shards parameters across ranks, all-gathering during forward and reduce-scattering gradients during backward.

  • RPC (rpc) — Remote Procedure Call framework with RpcContext for invoking functions on remote ranks, and RRef for holding references to remote data.

  • Pipeline parallelism (pipeline) — Pipeline splits a model into sequential stages and processes microbatches through them. Supports GPipe and Interleaved1F1B schedules.

  • GPU collectives ([gpu_collective], requires gpu feature) — gpu_allreduce and gpu_broadcast route through NCCL when the nccl feature is enabled, or through an opt-in host round-trip when FERROTORCH_ENABLE_GPU_FALLBACK=1 is set. Without either, they return Err (PyTorch parity). See [gpu_collective] for details.

  • Native-Rust backends (gloo_backend / mpi_backend, require gloo-backend / mpi-native feature respectively, both default off) — pure-Rust TCP transport with textbook ring allreduce / tree broadcast / ring barrier collectives. No C/C++ FFI. #1132 landed the gloo backend; #1133 landed the MPI-subset backend delegating to the same gloo_native primitives. With the feature off, construction returns DistributedError::BackendUnavailable for source-compat with the original #459 skeleton contract. The legacy mpi-backend feature name aliases to mpi-native.

  • Skeleton backends (ucc_backend, requires ucc-backend feature, default off) — API contract only. Construction returns DistributedError::BackendUnavailable when the feature is off. The feature would unlock a real binding (UCC C library) — tracked in #1134 (replacing closed #459). Use is_gloo_available, is_mpi_available, and is_ucc_available to discriminate at runtime.

§Quick start

use ferrotorch_distributed::backend::SimulatedBackend;
use ferrotorch_distributed::collective::{allreduce, ReduceOp};
use ferrotorch_distributed::ddp::DDP;
use ferrotorch_distributed::fsdp::FSDP;
use ferrotorch_distributed::rpc::{RpcContext, SimulatedRpcBackend};
use ferrotorch_distributed::pipeline::{Pipeline, PipelineStage, PipelineSchedule};

§REQ status (per .design/ferrotorch-distributed/lib.md)

REQStatusEvidence
REQ-1 (lint baseline)SHIPPED#![warn(clippy::all, clippy::pedantic)] / #![deny(rust_2018_idioms)] at top of lib.rs; consumers: every other .rs file in the crate inherits the policy and cargo clippy -- -D warnings is green.
REQ-2 (module surface)SHIPPEDpub mod block in lib.rs declares 16 unconditional + 5 feature-gated modules; consumers: every sibling crate::<mod>:: path.
REQ-3 (re-exports)SHIPPEDpub use block in lib.rs re-exports 35 symbols from 10 submodules; consumer ferrotorch/src/lib.rs (pub use ferrotorch_distributed::*;).
REQ-4 (crate-level docs)SHIPPED//! block in lib.rs enumerates every subsystem with rustdoc links; consumer cargo doc -p ferrotorch-distributed landing page.

Re-exports§

pub use async_collective::PendingCollective;
pub use async_collective::async_all_gather;
pub use async_collective::async_reduce_scatter;
pub use backend::Backend;
pub use backend::SimulatedBackend;
pub use backend::SubBackend;
pub use backend::TcpBackend;
pub use checkpoint::AsyncCheckpointer;
pub use checkpoint::CheckpointFuture;
pub use checkpoint::DistCheckpointError;
pub use checkpoint::DistributedCheckpoint;
pub use checkpoint::ShardMetadata;
pub use checkpoint::TensorShardSpec;
pub use checkpoint::flat_shard_metadata;
pub use checkpoint::load_distributed;
pub use checkpoint::reshard;
pub use checkpoint::save_distributed;
pub use collective::DEFAULT_COLLECTIVE_TIMEOUT;
pub use collective::ReduceOp;
pub use collective::all_gather;
pub use collective::all_gather_with_timeout;
pub use collective::all_to_all;
pub use collective::all_to_all_single_uneven;
pub use collective::all_to_all_with_timeout;
pub use collective::allreduce;
pub use collective::allreduce_with_timeout;
pub use collective::barrier;
pub use collective::broadcast;
pub use collective::reduce_scatter;
pub use collective::reduce_scatter_tensor;
pub use collective::reduce_scatter_with_timeout;
pub use ddp::DDP;
pub use device_mesh::DeviceMesh;
pub use dtensor::DTensor;
pub use dtensor::Placement;
pub use error::DistributedError;
pub use fsdp::FSDP;
pub use gloo_backend::GlooBackend;
pub use gloo_backend::is_gloo_available;
pub use mpi_backend::MpiBackend;
pub use mpi_backend::is_mpi_available;
pub use p2p::recv;
pub use p2p::recv_into;
pub use p2p::recv_into_with_timeout;
pub use p2p::recv_with_timeout;
pub use p2p::send;
pub use p2p::sendrecv;
pub use pipeline::Pipeline;
pub use pipeline::PipelineSchedule;
pub use rpc::RpcAgent;
pub use rpc::RpcError;
pub use rpc::TcpRpcBackend;
pub use sync_batch_norm::SyncBatchNorm2d;
pub use ucc_backend::UccBackend;
pub use ucc_backend::is_ucc_available;

Modules§

async_collective
Asynchronous collective handles for overlapping collectives with compute.
backend
Communication backends for distributed training.
checkpoint
Distributed checkpointing with per-rank shard saving, loading, and resharding.
collective
Collective communication operations.
ddp
Distributed Data Parallel (DDP) wrapper.
device_mesh
DeviceMesh — multi-dimensional rank layout. (#591)
dtensor
Distributed tensor (DTensor) over a DeviceMesh (#611).
error
Error types for distributed operations.
fsdp
Fully Sharded Data Parallel (FSDP) wrapper.
gloo_backend
Gloo backend public surface (issue #1132).
mpi_backend
Native-Rust MPI-subset backend (#1133, replaces closed #459).
p2p
Tensor-level point-to-point primitives. (#591)
pipeline
Pipeline parallelism for distributed training.
rpc
Remote Procedure Call (RPC) framework for distributed training.
sync_batch_norm
Synchronized Batch Normalization (SyncBatchNorm).
ucc_backend
Native-Rust UCC backend (#1134, replaces closed #459).