
Struct Ddp 

pub struct Ddp { /* private fields */ }

Manual DDP coordinator for multi-GPU gradient sync.

For complex training patterns (GANs, RL, progressive training) where the transparent Graph-level DDP does not fit. Provides explicit control over parameter broadcast and gradient averaging.

For standard training, use crate::graph::Graph::distribute instead.
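A minimal sketch of the manual flow, for orientation (only the `Ddp` calls are from this API; `Device::cuda`, `build_model`, and the per-replica forward/backward steps are illustrative placeholders):

```rust
// Illustrative sketch: one replica per device, each already on its target device.
let devices = vec![Device::cuda(0)?, Device::cuda(1)?]; // hypothetical constructor
let model_a = build_model(devices[0])?;                 // build_model is hypothetical
let model_b = build_model(devices[1])?;

let ddp = Ddp::wrap(&[&model_a, &model_b], &devices)?;
ddp.sync_params()?;                  // replicas start from rank 0's weights

for _step in 0..num_steps {
    // ... per-replica forward + backward on each device ...
    ddp.all_reduce_gradients()?;     // average gradients across replicas
    // ... then optimizer.step() on each replica ...
}
```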

Implementations§

impl Ddp

pub fn wrap(models: &[&dyn Module], devices: &[Device]) -> Result<Self>

Wrap pre-created model replicas for manual DDP control.

Models must have identical architecture (same parameter count/shapes). Each model should already reside on its target device.

pub fn sync_params(&self) -> Result<()>

Broadcast all parameters and buffers from rank 0 to all replicas.

pub fn all_reduce_gradients(&self) -> Result<()>

AllReduce-average gradients across all replicas. Call after backward(), before optimizer.step().
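The reduction itself is an elementwise mean across replicas. A self-contained sketch of just that arithmetic, with plain `Vec<f32>` standing in for gradient tensors (`all_reduce_mean` is a local helper, not part of this API):

```rust
// After the reduction, every replica holds the same averaged gradient,
// so each local optimizer step applies an identical update.
fn all_reduce_mean(grads: &mut [Vec<f32>]) {
    let n = grads.len() as f32;
    let dim = grads[0].len();
    for i in 0..dim {
        let mean: f32 = grads.iter().map(|g| g[i]).sum::<f32>() / n;
        for g in grads.iter_mut() {
            g[i] = mean; // broadcast the mean back into each replica's gradient
        }
    }
}

fn main() {
    let mut grads = vec![vec![2.0, 4.0], vec![4.0, 8.0]];
    all_reduce_mean(&mut grads);
    assert_eq!(grads, vec![vec![3.0, 6.0], vec![3.0, 6.0]]);
}
```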

pub fn sync_buffers(&self) -> Result<()>

Broadcast buffers from rank 0 to all replicas (e.g. BatchNorm running statistics).

pub fn weighted_all_reduce_gradients(&self, batch_counts: &[usize]) -> Result<()>

AllReduce gradients weighted by per-device batch contribution.

For heterogeneous DDP where devices process different numbers of batches per sync step. Each replica’s gradient is scaled by (batch_counts[rank] / total) before AllReduce Sum, producing the correct mean gradient.
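A worked example of the weighting arithmetic, with plain `Vec<f32>` standing in for gradient tensors (`weighted_all_reduce` is a local helper, not part of this API): with batch counts [3, 1], a plain mean would over-weight the slower device, while scaling each gradient by its batch share before the sum yields the mean over the combined batch.

```rust
// Scale each replica's gradient by batch_counts[rank] / total, then sum.
// The scale-then-sum is exactly what an AllReduce-Sum over pre-scaled
// gradients computes.
fn weighted_all_reduce(grads: &[Vec<f32>], batch_counts: &[usize]) -> Vec<f32> {
    let total: usize = batch_counts.iter().sum();
    let mut out = vec![0.0; grads[0].len()];
    for (g, &n) in grads.iter().zip(batch_counts) {
        let w = n as f32 / total as f32; // this replica's batch share
        for (o, &v) in out.iter_mut().zip(g) {
            *o += w * v;
        }
    }
    out
}

fn main() {
    // Device 0 processed 3 batches, device 1 processed 1 batch.
    let grads = vec![vec![1.0, 2.0], vec![5.0, 6.0]];
    let avg = weighted_all_reduce(&grads, &[3, 1]);
    assert_eq!(avg, vec![2.0, 3.0]); // 0.75*[1,2] + 0.25*[5,6]
}
```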

Use with ElChe::batch_counts for automatic weighting (see ElChe for the full heterogeneous DDP strategy):

ddp.weighted_all_reduce_gradients(cadence.batch_counts())?;

pub fn world_size(&self) -> usize

Number of devices.

pub fn devices(&self) -> &[Device]

Devices in use.

pub fn setup<F, M, G, O>(model: &Graph, builder: F, optimizer: G) -> Result<()>
where
    F: Fn(Device) -> Result<M>,
    M: Module + 'static,
    G: Fn(&[Parameter]) -> O,
    O: Optimizer + 'static,

One-call setup: auto-detect GPUs, distribute the model, set the optimizer, and enable training mode.

  • Multi-GPU (2+ usable CUDA devices): replicates via Graph::distribute, creates per-replica optimizers, enables training.
  • Single-GPU / CPU: sets optimizer and training mode only (no DDP overhead).

Always prints a diagnostic summary to stderr showing detected hardware.

Ddp::setup(&model, |dev| build_model(dev), |p| Adam::new(p, 0.001))?;

// Training loop is identical for 1 or N GPUs:
for batch in model.epoch(epoch).activate() {
    let loss = model.forward_batch(&batch?)?;
    loss.backward()?;
    model.step()?;
}

pub fn setup_with<F, M, G, O>(model: &Graph, builder: F, optimizer: G, config: DdpConfig) -> Result<()>
where
    F: Fn(Device) -> Result<M>,
    M: Module + 'static,
    G: Fn(&[Parameter]) -> O,
    O: Optimizer + 'static,

One-call setup with explicit configuration.

Like setup(), but accepts a DdpConfig controlling ElChe cadence, speed hints, and overhead targets.

Ddp::setup_with(&model, builder, optimizer,
    DdpConfig::new().speed_hint(1, 2.3))?;

pub fn auto<F, M, G, O>(model: &Graph, builder: F, optimizer: G) -> Result<()>
where
    F: Fn(Device) -> Result<M>,
    M: Module + 'static,
    G: Fn(&[Parameter]) -> O,
    O: Optimizer + 'static,

👎 Deprecated since 0.3.0: renamed to Ddp::setup().

pub fn auto_with<F, M, G, O>(model: &Graph, builder: F, optimizer: G, config: DdpConfig) -> Result<()>
where
    F: Fn(Device) -> Result<M>,
    M: Module + 'static,
    G: Fn(&[Parameter]) -> O,
    O: Optimizer + 'static,

👎 Deprecated since 0.3.0: renamed to Ddp::setup_with().

pub fn builder<F, M, G, O, T>(model_factory: F, optim_factory: G, train_fn: T) -> DdpBuilder<F, M, G, O, T>
where
    F: Fn(Device) -> Result<M> + Send + Sync + 'static,
    M: Module + 'static,
    G: Fn(&[Parameter]) -> O + Send + Sync + 'static,
    O: Optimizer + 'static,
    T: Fn(&M, &[Tensor]) -> Result<Variable> + Send + Sync + 'static,

Create a builder for framework-managed multi-GPU training.

The framework owns the training loop, data pipeline, and epoch management. Each GPU gets its own model replica and optimizer. A coordinator triggers periodic parameter averaging based on the configured ApplyPolicy and AverageBackend.

Returns a DdpBuilder for fluent configuration. Call .run() to spawn training threads, then .join() on the returned DdpHandle to block until completion.

§Example
use flodl::*;

let handle = Ddp::builder(
    |dev| model_factory(dev),
    |params| Adam::new(params, 0.001),
    |model, batch| { /* forward + loss */ },
)
.dataset(dataset)
.batch_size(32)
.num_epochs(10)
.policy(ApplyPolicy::Cadence)
.backend(AverageBackend::Nccl)
.run()?;

let state = handle.join()?;

With fewer than 2 CUDA devices, training runs on the main thread with no coordination. The API is identical in both cases.

Auto Trait Implementations§

impl Freeze for Ddp
impl !RefUnwindSafe for Ddp
impl !Send for Ddp
impl !Sync for Ddp
impl Unpin for Ddp
impl UnsafeUnpin for Ddp
impl !UnwindSafe for Ddp

Blanket Implementations§

impl<T> Any for T where T: 'static + ?Sized
impl<T> Borrow<T> for T where T: ?Sized
impl<T> BorrowMut<T> for T where T: ?Sized
impl<T> From<T> for T
impl<T, U> Into<U> for T where U: From<T>
impl<T, U> TryFrom<U> for T where U: Into<T> (Error = Infallible)
impl<T, U> TryInto<U> for T where U: TryFrom<T> (Error = <U as TryFrom<T>>::Error)