Expand description
Distributed training for ferrotorch.
This crate provides the building blocks for multi-rank training:
-
Backends (
backend) — Transport-agnostic communication.TcpBackendfor real multi-process training,SimulatedBackendfor in-process testing. -
Collectives (
collective) —allreduce,all_gather,reduce_scatter,broadcast, andbarrier. -
Async collectives (
async_collective) —async_all_gatherandasync_reduce_scatterreturn aPendingCollectivehandle that can bewait()ed on after local compute, enabling FSDP backward prefetch. -
DDP (
ddp) —DDPwraps aModuleand synchronizes gradients across ranks after each backward pass. -
FSDP (
fsdp) —FSDPwraps aModuleand shards parameters across ranks, all-gathering during forward and reduce-scattering gradients during backward. -
RPC (
rpc) — Remote Procedure Call framework withRpcContextfor invoking functions on remote ranks, andRReffor holding references to remote data. -
Pipeline parallelism (
pipeline) —Pipelinesplits a model into sequential stages and processes microbatches through them. SupportsGPipeandInterleaved1F1Bschedules. -
GPU collectives ([
gpu_collective], requiresgpufeature) —gpu_allreduceandgpu_broadcastroute through NCCL when thencclfeature is enabled, or through an opt-in host round-trip whenFERROTORCH_ENABLE_GPU_FALLBACK=1is set. Without either, they returnErr(PyTorch parity). See [gpu_collective] for details. -
Native-Rust backends (
gloo_backend/mpi_backend, requiregloo-backend/mpi-nativefeature respectively, both default off) — pure-Rust TCP transport with textbook ring allreduce / tree broadcast / ring barrier collectives. No C/C++ FFI. #1132 landed the gloo backend; #1133 landed the MPI-subset backend delegating to the same gloo_native primitives. With the feature off, construction returnsDistributedError::BackendUnavailablefor source-compat with the original #459 skeleton contract. The legacympi-backendfeature name aliases tompi-native. -
Skeleton backends (
ucc_backend, requiresucc-backendfeature, default off) — API contract only. Construction returnsDistributedError::BackendUnavailablewhen the feature is off. The feature would unlock a real binding (UCC C library) — tracked in #1134 (replacing closed #459). Useis_gloo_available,is_mpi_available, andis_ucc_availableto discriminate at runtime.
§Quick start
use ferrotorch_distributed::backend::SimulatedBackend;
use ferrotorch_distributed::collective::{allreduce, ReduceOp};
use ferrotorch_distributed::ddp::DDP;
use ferrotorch_distributed::fsdp::FSDP;
use ferrotorch_distributed::rpc::{RpcContext, SimulatedRpcBackend};
use ferrotorch_distributed::pipeline::{Pipeline, PipelineStage, PipelineSchedule};§REQ status (per .design/ferrotorch-distributed/lib.md)
| REQ | Status | Evidence |
|---|---|---|
| REQ-1 (lint baseline) | SHIPPED | #![warn(clippy::all, clippy::pedantic)] / #![deny(rust_2018_idioms)] at top of lib.rs; consumers: every other .rs file in the crate inherits the policy and cargo clippy -- -D warnings is green. |
| REQ-2 (module surface) | SHIPPED | pub mod block in lib.rs declares 16 unconditional + 5 feature-gated modules; consumers: every sibling crate::<mod>:: path. |
| REQ-3 (re-exports) | SHIPPED | pub use block in lib.rs re-exports 35 symbols from 10 submodules; consumer ferrotorch/src/lib.rs (pub use ferrotorch_distributed::*;). |
| REQ-4 (crate-level docs) | SHIPPED | //! block in lib.rs enumerates every subsystem with rustdoc links; consumer cargo doc -p ferrotorch-distributed landing page. |
Re-exports§
pub use async_collective::PendingCollective;pub use async_collective::async_all_gather;pub use async_collective::async_reduce_scatter;pub use backend::Backend;pub use backend::SimulatedBackend;pub use backend::SubBackend;pub use backend::TcpBackend;pub use checkpoint::AsyncCheckpointer;pub use checkpoint::CheckpointFuture;pub use checkpoint::DistCheckpointError;pub use checkpoint::DistributedCheckpoint;pub use checkpoint::ShardMetadata;pub use checkpoint::TensorShardSpec;pub use checkpoint::flat_shard_metadata;pub use checkpoint::load_distributed;pub use checkpoint::reshard;pub use checkpoint::save_distributed;pub use collective::DEFAULT_COLLECTIVE_TIMEOUT;pub use collective::ReduceOp;pub use collective::all_gather;pub use collective::all_gather_with_timeout;pub use collective::all_to_all;pub use collective::all_to_all_single_uneven;pub use collective::all_to_all_with_timeout;pub use collective::allreduce;pub use collective::allreduce_with_timeout;pub use collective::barrier;pub use collective::broadcast;pub use collective::reduce_scatter;pub use collective::reduce_scatter_tensor;pub use collective::reduce_scatter_with_timeout;pub use ddp::DDP;pub use device_mesh::DeviceMesh;pub use dtensor::DTensor;pub use dtensor::Placement;pub use error::DistributedError;pub use fsdp::FSDP;pub use gloo_backend::GlooBackend;pub use gloo_backend::is_gloo_available;pub use mpi_backend::MpiBackend;pub use mpi_backend::is_mpi_available;pub use p2p::recv;pub use p2p::recv_into;pub use p2p::recv_into_with_timeout;pub use p2p::recv_with_timeout;pub use p2p::send;pub use p2p::sendrecv;pub use pipeline::Pipeline;pub use pipeline::PipelineSchedule;pub use rpc::RpcAgent;pub use rpc::RpcError;pub use rpc::TcpRpcBackend;pub use sync_batch_norm::SyncBatchNorm2d;pub use ucc_backend::UccBackend;pub use ucc_backend::is_ucc_available;
Modules§
- async_
collective - Asynchronous collective handles for overlapping collectives with compute.
- backend
- Communication backends for distributed training.
- checkpoint
- Distributed checkpointing with per-rank shard saving, loading, and resharding.
- collective
- Collective communication operations.
- ddp
- Distributed Data Parallel (DDP) wrapper.
- device_
mesh DeviceMesh— multi-dimensional rank layout. (#591)- dtensor
- Distributed tensor (DTensor) over a
DeviceMesh(#611). - error
- Error types for distributed operations.
- fsdp
- Fully Sharded Data Parallel (FSDP) wrapper.
- gloo_
backend - Gloo backend public surface (issue #1132).
- mpi_
backend - Native-Rust MPI-subset backend (#1133, replaces closed #459).
- p2p
- Tensor-level point-to-point primitives. (#591)
- pipeline
- Pipeline parallelism for distributed training.
- rpc
- Remote Procedure Call (RPC) framework for distributed training.
- sync_
batch_ norm - Synchronized Batch Normalization (SyncBatchNorm).
- ucc_
backend - Native-Rust UCC backend (#1134, replaces closed #459).