pub struct Ddp { /* private fields */ }
Manual DDP coordinator for multi-GPU gradient synchronization.

For complex training patterns (GANs, RL, progressive training) where transparent Graph-level DDP doesn't fit. Provides explicit control over parameter broadcast and gradient averaging.

For standard training, use crate::graph::Graph::distribute instead.
Implementations

impl Ddp
pub fn wrap(models: &[&dyn Module], devices: &[Device]) -> Result<Self>
Wrap pre-created model replicas for manual DDP control.
Models must have identical architectures (same parameter counts and shapes). Each model should already reside on its target device.
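A minimal sketch of wrapping pre-created replicas; the model type `MyNet` and the `Device::cuda(i)` constructor are hypothetical stand-ins for whatever your crate provides:

```rust
// Sketch only: `MyNet` and `Device::cuda(i)` are hypothetical stand-ins.
let devices = vec![Device::cuda(0), Device::cuda(1)];
let models: Vec<MyNet> = devices.iter().map(|d| MyNet::new(d)).collect();
let refs: Vec<&dyn Module> = models.iter().map(|m| m as &dyn Module).collect();

let ddp = Ddp::wrap(&refs, &devices)?;
ddp.sync_params()?; // start all replicas from rank 0's weights
```

Calling sync_params immediately after wrap ensures every replica begins training from identical parameters, even if the per-device constructors used different random seeds.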
pub fn sync_params(&self) -> Result<()>
Broadcast all parameters and buffers from rank 0 to all replicas.
pub fn all_reduce_gradients(&self) -> Result<()>
AllReduce-average gradients across all replicas. Call after backward(), before optimizer.step().
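A sketch of one manual synchronization step, assuming the caller has built per-replica `models`, `optimizers`, and data `shards`; the `forward`/`backward` method names here are hypothetical:

```rust
// One training step with explicit gradient averaging (sketch).
for (rank, model) in models.iter().enumerate() {
    let loss = model.forward(&shards[rank])?; // per-replica forward
    loss.backward()?;                         // per-replica backward
}
ddp.all_reduce_gradients()?; // average gradients across all replicas
for opt in optimizers.iter_mut() {
    opt.step()?; // identical update on every replica
}
```

Because every replica applies the same averaged gradient, parameters stay in lockstep without re-broadcasting them each step.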
pub fn sync_buffers(&self) -> Result<()>
Broadcast buffers from rank 0 (e.g., BatchNorm running statistics).
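Buffers are updated locally on each replica and are not touched by gradient averaging, so they can drift; one option is to re-broadcast them periodically (sketch; `train_one_epoch` and `num_epochs` are hypothetical):

```rust
// Sketch: realign non-gradient state (e.g., BatchNorm running
// mean/variance) with rank 0 once per epoch.
for epoch in 0..num_epochs {
    train_one_epoch(&ddp, epoch)?; // hypothetical helper
    ddp.sync_buffers()?;
}
```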
pub fn weighted_all_reduce_gradients(
    &self,
    batch_counts: &[usize],
) -> Result<()>
AllReduce gradients weighted by per-device batch contribution.

For heterogeneous DDP where devices process different numbers of batches per sync step. Each replica's gradient is scaled by (batch_counts[rank] / total) before the AllReduce sum, producing the correct mean gradient.
Use with ElChe::batch_counts for automatic weighting (see ElChe for the full heterogeneous DDP strategy):

ddp.weighted_all_reduce_gradients(cadence.batch_counts())?;

pub fn world_size(&self) -> usize
Number of devices.
pub fn setup<F, M, G, O>(model: &Graph, builder: F, optimizer: G) -> Result<()>
One-call setup: auto-detect GPUs, distribute the model, set the optimizer, and enable training mode.
- Multi-GPU (2+ usable CUDA devices): replicates via Graph::distribute, creates per-replica optimizers, and enables training.
- Single-GPU / CPU: sets the optimizer and training mode only (no DDP overhead).
Always prints a diagnostic summary to stderr showing detected hardware.
Ddp::setup(&model, |dev| build_model(dev), |p| Adam::new(p, 0.001))?;
// Training loop is identical for 1 or N GPUs:
for batch in model.epoch(epoch).activate() {
    let loss = model.forward_batch(&batch?)?;
    loss.backward()?;
    model.step()?;
}

pub fn setup_with<F, M, G, O>(
    model: &Graph,
    builder: F,
    optimizer: G,
    config: DdpConfig,
) -> Result<()>

Like Ddp::setup, but takes an explicit DdpConfig.
pub fn auto<F, M, G, O>(model: &Graph, builder: F, optimizer: G) -> Result<()>

👎 Deprecated since 0.3.0: renamed to Ddp::setup().
pub fn auto_with<F, M, G, O>(
    model: &Graph,
    builder: F,
    optimizer: G,
    config: DdpConfig,
) -> Result<()>

👎 Deprecated since 0.3.0: renamed to Ddp::setup_with().
pub fn builder<F, M, G, O, T>(
    model_factory: F,
    optim_factory: G,
    train_fn: T,
) -> DdpBuilder<F, M, G, O, T>
Create a builder for framework-managed multi-GPU training.
The framework owns the training loop, data pipeline, and epoch management.
Each GPU gets its own model replica and optimizer. A coordinator triggers
periodic parameter averaging based on the configured ApplyPolicy and
AverageBackend.
Returns a DdpBuilder for fluent configuration. Call .run() to
spawn training threads, then .join() on the returned DdpHandle
to block until completion.
Example

use flodl::*;

let handle = Ddp::builder(
    |dev| model_factory(dev),
    |params| Adam::new(params, 0.001),
    |model, batch| { /* forward + loss */ },
)
.dataset(dataset)
.batch_size(32)
.num_epochs(10)
.policy(ApplyPolicy::Cadence)
.backend(AverageBackend::Nccl)
.run()?;

let state = handle.join()?;

With fewer than 2 CUDA devices, training runs on the main thread with no coordination. The API is identical in both cases.