Crate ferrotorch_distributed

Distributed training for ferrotorch.

This crate provides the building blocks for multi-rank training:

Backends (backend): Transport-agnostic communication. TcpBackend is for real multi-process training; SimulatedBackend is for in-process testing.
Collectives (collective): allreduce, all_gather, reduce_scatter, broadcast, and barrier.
DDP (ddp): DDP wraps a Module and synchronizes gradients across ranks after each backward pass.
FSDP (fsdp): FSDP wraps a Module and shards its parameters across ranks, all-gathering them during the forward pass and reduce-scattering gradients during the backward pass.
RPC (rpc): Remote Procedure Call framework. RpcContext invokes functions on remote ranks, and RRef holds references to remote data.
Pipeline parallelism (pipeline): Pipeline splits a model into sequential stages and processes microbatches through them. Supports the GPipe and Interleaved1F1B schedules.
GPU collectives (gpu_collective, requires the gpu feature): gpu_allreduce and gpu_broadcast copy GPU tensors to the CPU, run the standard TCP collective, and copy the results back. A portable alternative to NCCL.
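
To make the gradient synchronization described above concrete, here is a minimal sketch of what an allreduce computes, written against plain in-memory vectors rather than this crate's API. The names SketchReduceOp and simulated_allreduce are hypothetical stand-ins (they are not the crate's ReduceOp or allreduce), and averaging gradients across ranks is assumed here as the usual data-parallel convention, not something this page confirms.

```rust
/// Local stand-in for a reduction operator. This is NOT the crate's `ReduceOp`;
/// it exists only so the sketch compiles on its own.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SketchReduceOp {
    Sum,
    Mean,
}

/// Simulates an allreduce over in-memory "ranks" (hypothetical helper).
///
/// `per_rank[r]` is the buffer held by rank `r`. After the call, every rank
/// holds the same element-wise reduction of all input buffers.
pub fn simulated_allreduce(per_rank: &mut [Vec<f32>], op: SketchReduceOp) {
    let world_size = per_rank.len();
    if world_size == 0 {
        return;
    }
    let len = per_rank[0].len();
    assert!(
        per_rank.iter().all(|buf| buf.len() == len),
        "all ranks must contribute buffers of the same length"
    );

    // Element-wise sum across ranks.
    let mut reduced = vec![0.0f32; len];
    for buf in per_rank.iter() {
        for (acc, x) in reduced.iter_mut().zip(buf) {
            *acc += *x;
        }
    }

    // For a mean, divide the sum by the number of ranks.
    if op == SketchReduceOp::Mean {
        let n = world_size as f32;
        for acc in reduced.iter_mut() {
            *acc /= n;
        }
    }

    // Every rank ends up with the identical reduced buffer.
    for buf in per_rank.iter_mut() {
        buf.copy_from_slice(&reduced);
    }
}
```

With two ranks holding gradients [1.0, 2.0] and [3.0, 6.0], a Mean reduction leaves both ranks holding [2.0, 4.0]. A real backend reaches the same result by exchanging messages between processes instead of sharing memory.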

§Quick start

use ferrotorch_distributed::backend::SimulatedBackend;
use ferrotorch_distributed::collective::{allreduce, ReduceOp};
use ferrotorch_distributed::ddp::DDP;
use ferrotorch_distributed::fsdp::FSDP;
use ferrotorch_distributed::rpc::{RpcContext, SimulatedRpcBackend};
use ferrotorch_distributed::pipeline::{Pipeline, PipelineStage, PipelineSchedule};

Re-exports§

pub use backend::Backend;
pub use backend::SimulatedBackend;
pub use backend::TcpBackend;
pub use checkpoint::AsyncCheckpointer;
pub use checkpoint::CheckpointFuture;
pub use checkpoint::DistCheckpointError;
pub use checkpoint::DistributedCheckpoint;
pub use checkpoint::ShardMetadata;
pub use checkpoint::TensorShardSpec;
pub use checkpoint::flat_shard_metadata;
pub use checkpoint::load_distributed;
pub use checkpoint::reshard;
pub use checkpoint::save_distributed;
pub use collective::DEFAULT_COLLECTIVE_TIMEOUT;
pub use collective::ReduceOp;
pub use collective::all_gather;
pub use collective::all_gather_with_timeout;
pub use collective::allreduce;
pub use collective::allreduce_with_timeout;
pub use collective::barrier;
pub use collective::broadcast;
pub use collective::reduce_scatter;
pub use collective::reduce_scatter_with_timeout;
pub use ddp::DDP;
pub use error::DistributedError;
pub use fsdp::FSDP;
pub use pipeline::Pipeline;
pub use pipeline::PipelineSchedule;
pub use rpc::RpcAgent;
pub use rpc::RpcError;
pub use rpc::TcpRpcBackend;

Modules§

backend: Communication backends for distributed training.
checkpoint: Distributed checkpointing with per-rank shard saving, loading, and resharding.
collective: Collective communication operations.
ddp: Distributed Data Parallel (DDP) wrapper.
error: Error types for distributed operations.
fsdp: Fully Sharded Data Parallel (FSDP) wrapper.
pipeline: Pipeline parallelism for distributed training.
rpc: Remote Procedure Call (RPC) framework for distributed training.
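
The resharding mentioned for the checkpoint module can be pictured with a small standalone sketch: per-rank flat shards are concatenated and then re-split for a different world size. The functions flat_shard_ranges and reshard_flat below are hypothetical illustrations, not the crate's flat_shard_metadata or reshard, and the even split with leftover elements going to the lowest ranks is an assumed partitioning scheme.

```rust
/// Splits `total_len` elements into `world_size` contiguous half-open ranges.
/// The first `total_len % world_size` ranks receive one extra element.
/// (Assumed scheme for illustration; the crate may partition differently.)
pub fn flat_shard_ranges(total_len: usize, world_size: usize) -> Vec<(usize, usize)> {
    assert!(world_size > 0, "world_size must be at least 1");
    let base = total_len / world_size;
    let extra = total_len % world_size;

    let mut ranges = Vec::with_capacity(world_size);
    let mut start = 0;
    for rank in 0..world_size {
        let len = base + usize::from(rank < extra);
        ranges.push((start, start + len));
        start += len;
    }
    ranges
}

/// Re-splits flat per-rank shards for a new world size (hypothetical helper).
/// `old_shards[r]` is the slice of the flattened tensor saved by old rank `r`.
pub fn reshard_flat(old_shards: &[Vec<f32>], new_world_size: usize) -> Vec<Vec<f32>> {
    // Rebuild the full flat tensor by concatenating shards in rank order.
    let full: Vec<f32> = old_shards.iter().flatten().copied().collect();

    // Cut the full tensor into the ranges owned by each new rank.
    flat_shard_ranges(full.len(), new_world_size)
        .into_iter()
        .map(|(start, end)| full[start..end].to_vec())
        .collect()
}
```

For example, a five-element tensor saved by two ranks as [1, 2, 3] and [4, 5] would be re-split for three ranks as [1, 2], [3, 4], and [5] under this scheme.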
