Distributed training for ferrotorch.
This crate provides the building blocks for multi-rank training:
- Backends (backend): Transport-agnostic communication. TcpBackend for real multi-process training, SimulatedBackend for in-process testing.
- Collectives (collective): allreduce, all_gather, reduce_scatter, broadcast, and barrier.
- DDP (ddp): DDP wraps a Module and synchronizes gradients across ranks after each backward pass.
- FSDP (fsdp): FSDP wraps a Module and shards parameters across ranks, all-gathering during forward and reduce-scattering gradients during backward.
- RPC (rpc): Remote Procedure Call framework with RpcContext for invoking functions on remote ranks, and RRef for holding references to remote data.
- Pipeline parallelism (pipeline): Pipeline splits a model into sequential stages and processes microbatches through them. Supports GPipe and Interleaved1F1B schedules.
- GPU collectives (gpu_collective, requires the gpu feature): gpu_allreduce and gpu_broadcast transfer GPU tensors to CPU, run the standard TCP collective, and copy back. Portable alternative to NCCL.
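The collectives named above have simple element-wise semantics that are easy to model in a single process. The sketch below uses only the standard library and is not the ferrotorch API: the function names, the `&[Vec<f32>]` representation (one buffer per rank), and the even-divisibility assumption in `reduce_scatter_sum` are all illustrative. It shows the sum-allreduce that DDP-style gradient averaging builds on, and the reduce-scatter followed by all-gather pairing that FSDP-style sharding relies on.

```rust
// In-process model of collective semantics. NOT the ferrotorch API:
// `ranks[r]` stands in for the local buffer held by rank r.

/// Allreduce (sum): every rank ends up with the element-wise sum of all buffers.
fn allreduce_sum(ranks: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut total = vec![0.0f32; ranks[0].len()];
    for buf in ranks {
        for (t, x) in total.iter_mut().zip(buf) {
            *t += *x;
        }
    }
    vec![total; ranks.len()]
}

/// All-gather: every rank ends up with all buffers concatenated in rank order.
fn all_gather(ranks: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let gathered: Vec<f32> = ranks.iter().flatten().copied().collect();
    vec![gathered; ranks.len()]
}

/// Reduce-scatter (sum): rank r ends up with the r-th chunk of the summed buffer.
/// Assumption: the buffer length is a non-zero multiple of the number of ranks.
fn reduce_scatter_sum(ranks: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let chunk = ranks[0].len() / ranks.len();
    let total = allreduce_sum(ranks).remove(0);
    total.chunks(chunk).map(|c| c.to_vec()).collect()
}

fn main() {
    // Two ranks, each holding a local gradient of four elements.
    let grads = vec![vec![1.0, 2.0, 3.0, 4.0], vec![10.0, 20.0, 30.0, 40.0]];

    // DDP-style synchronization: sum across ranks, then divide by world size.
    let summed = allreduce_sum(&grads);
    let averaged: Vec<f32> = summed[0].iter().map(|g| g / grads.len() as f32).collect();
    println!("averaged gradient on every rank: {:?}", averaged);

    // FSDP-style pairing: reduce-scatter leaves each rank with one shard of the
    // sum; all-gathering those shards reproduces the full allreduce result.
    let shards = reduce_scatter_sum(&grads);
    assert_eq!(all_gather(&shards), summed);
    println!("per-rank shards after reduce-scatter: {:?}", shards);
}
```

A real backend moves these buffers between processes rather than indexing into one vector, but the values each rank should hold afterwards are the same, which makes a model like this handy as a test oracle.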
Quick start
use ferrotorch_distributed::backend::SimulatedBackend;
use ferrotorch_distributed::collective::{allreduce, ReduceOp};
use ferrotorch_distributed::ddp::DDP;
use ferrotorch_distributed::fsdp::FSDP;
use ferrotorch_distributed::rpc::{RpcContext, SimulatedRpcBackend};
use ferrotorch_distributed::pipeline::{Pipeline, PipelineStage, PipelineSchedule};
Re-exports
pub use backend::Backend;
pub use backend::SimulatedBackend;
pub use backend::TcpBackend;
pub use checkpoint::AsyncCheckpointer;
pub use checkpoint::CheckpointFuture;
pub use checkpoint::DistCheckpointError;
pub use checkpoint::DistributedCheckpoint;
pub use checkpoint::ShardMetadata;
pub use checkpoint::TensorShardSpec;
pub use checkpoint::flat_shard_metadata;
pub use checkpoint::load_distributed;
pub use checkpoint::reshard;
pub use checkpoint::save_distributed;
pub use collective::DEFAULT_COLLECTIVE_TIMEOUT;
pub use collective::ReduceOp;
pub use collective::all_gather;
pub use collective::all_gather_with_timeout;
pub use collective::allreduce;
pub use collective::allreduce_with_timeout;
pub use collective::barrier;
pub use collective::broadcast;
pub use collective::reduce_scatter;
pub use collective::reduce_scatter_with_timeout;
pub use ddp::DDP;
pub use error::DistributedError;
pub use fsdp::FSDP;
pub use pipeline::Pipeline;
pub use pipeline::PipelineSchedule;
pub use rpc::RpcAgent;
pub use rpc::RpcError;
pub use rpc::TcpRpcBackend;
Modules
- backend: Communication backends for distributed training.
- checkpoint: Distributed checkpointing with per-rank shard saving, loading, and resharding.
- collective: Collective communication operations.
- ddp: Distributed Data Parallel (DDP) wrapper.
- error: Error types for distributed operations.
- fsdp: Fully Sharded Data Parallel (FSDP) wrapper.
- pipeline: Pipeline parallelism for distributed training.
- rpc: Remote Procedure Call (RPC) framework for distributed training.
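The checkpoint module is described as saving per-rank shards and resharding them. As a rough illustration of what resharding a flat tensor can involve, the sketch below splits a flat buffer into contiguous, near-equal ranges and re-splits it for a different world size. It is standard-library Rust only. `shard_range` and `reshard_flat` are hypothetical helpers, and the layout (contiguous shards, with earlier ranks taking the remainder) is an assumption rather than the documented behaviour of the crate's `reshard` or `flat_shard_metadata`.

```rust
use std::ops::Range;

/// Hypothetical helper: the half-open element range owned by `rank` when
/// `numel` elements are split into contiguous, near-equal shards.
/// The first `numel % world_size` ranks each take one extra element.
fn shard_range(numel: usize, world_size: usize, rank: usize) -> Range<usize> {
    let base = numel / world_size;
    let extra = numel % world_size;
    let start = rank * base + rank.min(extra);
    let len = base + usize::from(rank < extra);
    start..start + len
}

/// Hypothetical helper: concatenate the old shards in rank order, then
/// re-split the flat buffer for `new_world_size` ranks.
fn reshard_flat(old_shards: &[Vec<f32>], new_world_size: usize) -> Vec<Vec<f32>> {
    let flat: Vec<f32> = old_shards.iter().flatten().copied().collect();
    (0..new_world_size)
        .map(|rank| flat[shard_range(flat.len(), new_world_size, rank)].to_vec())
        .collect()
}

fn main() {
    // A 10-element tensor saved from a 2-rank run, 5 elements per rank.
    let old = vec![
        vec![0.0, 1.0, 2.0, 3.0, 4.0],
        vec![5.0, 6.0, 7.0, 8.0, 9.0],
    ];

    // Reload it into a 3-rank run.
    let new = reshard_flat(&old, 3);
    for (rank, shard) in new.iter().enumerate() {
        println!("rank {rank}: {shard:?}");
    }

    // Resharding must neither drop nor reorder elements.
    let before: Vec<f32> = old.iter().flatten().copied().collect();
    let after: Vec<f32> = new.iter().flatten().copied().collect();
    assert_eq!(before, after);
}
```

A real checkpoint would read only the byte ranges of the old shard files that overlap each new rank's range instead of materializing the full tensor. The range arithmetic is the part this sketch is meant to show.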