Crate ferrotorch_distributed


Distributed training for ferrotorch.

This crate provides the building blocks for multi-rank training:

  • Backends (backend) — Transport-agnostic communication. TcpBackend for real multi-process training, SimulatedBackend for in-process testing.

  • Collectives (collective) — allreduce, all_gather, reduce_scatter, broadcast, and barrier.

  • DDP (ddp) — DDP wraps a Module and synchronizes gradients across ranks after each backward pass.

  • FSDP (fsdp) — FSDP wraps a Module and shards parameters across ranks, all-gathering during forward and reduce-scattering gradients during backward.

  • RPC (rpc) — Remote Procedure Call framework with RpcContext for invoking functions on remote ranks, and RRef for holding references to remote data.

  • Pipeline parallelism (pipeline) — Pipeline splits a model into sequential stages and processes microbatches through them. Supports GPipe and Interleaved1F1B schedules.

  • GPU collectives (gpu_collective, requires the gpu feature) — gpu_allreduce and gpu_broadcast copy GPU tensors to the CPU, run the standard TCP collective, and copy the result back to the GPU. A portable alternative to NCCL.
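
To make the collective semantics above concrete, here is a minimal, self-contained sketch that simulates ranks as plain vectors in one process. It does not use this crate's API: the function names, the `RankBuffers` alias, and the signatures are hypothetical and chosen for illustration only. It also assumes that DDP-style synchronization means averaging gradients (sum, then divide by the world size), which the text above does not state explicitly.

```rust
/// One buffer per simulated rank. All buffers are assumed to have equal length.
/// (Hypothetical alias for this sketch; not a type from ferrotorch_distributed.)
type RankBuffers = Vec<Vec<f32>>;

/// Allreduce (sum): every rank ends up with the element-wise sum of all inputs.
fn allreduce_sum(inputs: &[Vec<f32>]) -> RankBuffers {
    let len = inputs[0].len();
    let mut sum = vec![0.0f32; len];
    for buf in inputs {
        for (acc, x) in sum.iter_mut().zip(buf) {
            *acc += *x;
        }
    }
    // Each rank receives its own copy of the reduced buffer.
    vec![sum; inputs.len()]
}

/// All-gather: every rank ends up with all ranks' inputs concatenated in rank order.
fn all_gather(inputs: &[Vec<f32>]) -> RankBuffers {
    let gathered: Vec<f32> = inputs.iter().flatten().copied().collect();
    vec![gathered; inputs.len()]
}

/// Reduce-scatter (sum): reduce element-wise, then rank `r` keeps only chunk `r`.
/// Assumes the buffer length is divisible by the number of ranks.
fn reduce_scatter_sum(inputs: &[Vec<f32>]) -> RankBuffers {
    let world = inputs.len();
    let chunk = inputs[0].len() / world;
    let summed = allreduce_sum(inputs);
    (0..world)
        .map(|r| summed[r][r * chunk..(r + 1) * chunk].to_vec())
        .collect()
}

/// DDP-style gradient sync under the averaging assumption:
/// allreduce-sum, then divide by the world size.
fn average_gradients(grads: &[Vec<f32>]) -> RankBuffers {
    let world = grads.len() as f32;
    allreduce_sum(grads)
        .into_iter()
        .map(|g| g.into_iter().map(|x| x / world).collect())
        .collect()
}
```

With two ranks holding gradients `[1, 2, 3, 4]` and `[3, 4, 5, 6]`, `average_gradients` leaves `[2, 3, 4, 5]` on both ranks. `reduce_scatter_sum` leaves `[4, 6]` on rank 0 and `[8, 10]` on rank 1, and calling `all_gather` on those shards reproduces the full allreduce result. This is the pattern the FSDP description relies on: all-gather rebuilds full tensors from shards, and reduce-scatter returns each rank only its shard of the reduced gradients.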

Quick start

use ferrotorch_distributed::backend::SimulatedBackend;
use ferrotorch_distributed::collective::{allreduce, ReduceOp};
use ferrotorch_distributed::ddp::DDP;
use ferrotorch_distributed::fsdp::FSDP;
use ferrotorch_distributed::rpc::{RpcContext, SimulatedRpcBackend};
use ferrotorch_distributed::pipeline::{Pipeline, PipelineStage, PipelineSchedule};

Re-exports

pub use backend::Backend;
pub use backend::SimulatedBackend;
pub use backend::TcpBackend;
pub use checkpoint::AsyncCheckpointer;
pub use checkpoint::CheckpointFuture;
pub use checkpoint::DistCheckpointError;
pub use checkpoint::DistributedCheckpoint;
pub use checkpoint::ShardMetadata;
pub use checkpoint::TensorShardSpec;
pub use checkpoint::flat_shard_metadata;
pub use checkpoint::load_distributed;
pub use checkpoint::reshard;
pub use checkpoint::save_distributed;
pub use collective::DEFAULT_COLLECTIVE_TIMEOUT;
pub use collective::ReduceOp;
pub use collective::all_gather;
pub use collective::all_gather_with_timeout;
pub use collective::allreduce;
pub use collective::allreduce_with_timeout;
pub use collective::barrier;
pub use collective::broadcast;
pub use collective::reduce_scatter;
pub use collective::reduce_scatter_with_timeout;
pub use ddp::DDP;
pub use error::DistributedError;
pub use fsdp::FSDP;
pub use pipeline::Pipeline;
pub use pipeline::PipelineSchedule;
pub use rpc::RpcAgent;
pub use rpc::RpcError;
pub use rpc::TcpRpcBackend;

Modules

backend
    Communication backends for distributed training.
checkpoint
    Distributed checkpointing with per-rank shard saving, loading, and resharding.
collective
    Collective communication operations.
ddp
    Distributed Data Parallel (DDP) wrapper.
error
    Error types for distributed operations.
fsdp
    Fully Sharded Data Parallel (FSDP) wrapper.
pipeline
    Pipeline parallelism for distributed training.
rpc
    Remote Procedure Call (RPC) framework for distributed training.