pub struct Ddp { /* private fields */ }
Manual DDP coordinator for multi-GPU gradient synchronization.

For complex training patterns (GANs, RL, progressive training) where transparent Graph-level DDP doesn't fit. Provides explicit control over parameter broadcast and gradient averaging.

For standard training, use crate::graph::Graph::distribute instead.
Implementations

impl Ddp
pub fn wrap(models: &[&dyn Module], devices: &[Device]) -> Result<Self>
Wrap pre-created model replicas for manual DDP control.
Models must have identical architectures (same parameter counts and shapes). Each model should already reside on its target device.
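A minimal sketch of wrapping pre-created replicas; the model type `MyNet` and the `Device::cuda(i)` constructor are hypothetical stand-ins for whatever your crate provides:

```rust
// Sketch only: `MyNet` and `Device::cuda(i)` are hypothetical stand-ins.
let devices = vec![Device::cuda(0), Device::cuda(1)];
let models: Vec<MyNet> = devices.iter().map(|d| MyNet::new(d)).collect();
let refs: Vec<&dyn Module> = models.iter().map(|m| m as &dyn Module).collect();

let ddp = Ddp::wrap(&refs, &devices)?;
ddp.sync_params()?; // start all replicas from rank 0's weights
```

Calling sync_params immediately after wrap ensures every replica begins training from identical parameters, even if the per-device constructors used different random seeds.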
pub fn sync_params(&self) -> Result<()>
Broadcast all parameters and buffers from rank 0 to all replicas.
pub fn all_reduce_gradients(&self) -> Result<()>
AllReduce-average gradients across all replicas. Call after backward(), before optimizer.step().
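A sketch of one manual synchronization step, assuming the caller has built per-replica `models`, `optimizers`, and data `shards`; the `forward`/`backward` method names here are hypothetical:

```rust
// One training step with explicit gradient averaging (sketch).
for (rank, model) in models.iter().enumerate() {
    let loss = model.forward(&shards[rank])?; // per-replica forward
    loss.backward()?;                         // per-replica backward
}
ddp.all_reduce_gradients()?; // average gradients across all replicas
for opt in optimizers.iter_mut() {
    opt.step()?; // identical update on every replica
}
```

Because every replica applies the same averaged gradient, parameters stay in lockstep without re-broadcasting them each step.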
pub fn sync_buffers(&self) -> Result<()>
Broadcast buffers from rank 0 (e.g., BatchNorm running statistics).
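Buffers are updated locally on each replica and are not touched by gradient averaging, so they can drift; one option is to re-broadcast them periodically (sketch; `train_one_epoch` and `num_epochs` are hypothetical):

```rust
// Sketch: realign non-gradient state (e.g., BatchNorm running
// mean/variance) with rank 0 once per epoch.
for epoch in 0..num_epochs {
    train_one_epoch(&ddp, epoch)?; // hypothetical helper
    ddp.sync_buffers()?;
}
```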
pub fn weighted_all_reduce_gradients(
    &self,
    batch_counts: &[usize],
) -> Result<()>
AllReduce gradients weighted by per-device batch contribution.

For heterogeneous DDP where devices process different numbers of batches per sync step. Each replica's gradient is scaled by (batch_counts[rank] / total) before the AllReduce sum, producing the correct mean gradient.
Use with ElChe::batch_counts for automatic weighting (see ElChe for the full heterogeneous DDP strategy):

ddp.weighted_all_reduce_gradients(cadence.batch_counts())?;

pub fn world_size(&self) -> usize
Number of devices.
pub fn setup<F, M, G, O>(model: &Graph, builder: F, optimizer: G) -> Result<()>
One-call setup: auto-detect GPUs, distribute the model, set the optimizer, and enable training mode.
- Multi-GPU (2+ usable CUDA devices): replicates via Graph::distribute, creates per-replica optimizers, and enables training.
- Single-GPU / CPU: sets the optimizer and training mode only (no DDP overhead).
Always prints a diagnostic summary to stderr showing detected hardware.
Ddp::setup(&model, |dev| build_model(dev), |p| Adam::new(p, 0.001))?;
// Training loop is identical for 1 or N GPUs:
for batch in model.epoch(epoch).activate() {
    let loss = model.forward_batch(&batch?)?;
    loss.backward()?;
    model.step()?;
}

pub fn setup_with<F, M, G, O>(
    model: &Graph,
    builder: F,
    optimizer: G,
    config: DdpConfig,
) -> Result<()>

Like Ddp::setup, but takes an explicit DdpConfig.
pub fn auto<F, M, G, O>(model: &Graph, builder: F, optimizer: G) -> Result<()>

👎 Deprecated since 0.3.0: renamed to Ddp::setup().
pub fn auto_with<F, M, G, O>(
    model: &Graph,
    builder: F,
    optimizer: G,
    config: DdpConfig,
) -> Result<()>

👎 Deprecated since 0.3.0: renamed to Ddp::setup_with().
pub fn builder<F, M, G, O, T>(
    model_factory: F,
    optim_factory: G,
    train_fn: T,
) -> DdpBuilder<F, M, G, O, T>
Create a builder for framework-managed multi-GPU training.
The framework owns the training loop, data pipeline, and epoch management.
Each GPU gets its own model replica and optimizer. A coordinator triggers
periodic parameter averaging based on the configured ApplyPolicy and
AverageBackend.
Returns a DdpBuilder for fluent configuration. Call .run() to
spawn training threads, then .join() on the returned DdpHandle
to block until completion.
Example

use flodl::*;

let handle = Ddp::builder(
    |dev| model_factory(dev),
    |params| Adam::new(params, 0.001),
    |model, batch| { /* forward + loss */ },
)
.dataset(dataset)
.batch_size(32)
.num_epochs(10)
.policy(ApplyPolicy::Cadence)
.backend(AverageBackend::Nccl)
.run()?;

let state = handle.join()?;

With fewer than 2 CUDA devices, training runs on the main thread with no coordination. The API is identical in both cases.