Module data_parallel

Multi-GPU data parallelism for classification training

Provides DataParallelCoordinator, which splits each mini-batch across multiple GPUs, runs the forward and backward passes independently on each GPU, and averages the gradients on the CPU before the optimizer step.

§Architecture

Mini-batch [N samples]
  ├── Shard 0 [N/G samples] → GPU 0 → gradients₀
  ├── Shard 1 [N/G samples] → GPU 1 → gradients₁
  └── ...
       ↓ (CPU AllReduce: average LoRA gradients)
  Optimizer step (applied to all replicas)
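To illustrate why averaging per-shard gradients can reproduce a single-GPU step, here is a minimal CPU-only sketch. It assumes a mean-reduced loss and equal-sized shards. The toy model (`y = w * x`) and the names `mse_grad` and `sharded_grad` are hypothetical stand-ins, not part of this module.

```rust
/// Gradient of the mean squared error of the toy model `y = w * x` with respect to `w`.
/// Hypothetical stand-in for one GPU's forward/backward pass.
fn mse_grad(w: f64, samples: &[(f64, f64)]) -> f64 {
    let n = samples.len() as f64;
    samples
        .iter()
        .map(|&(x, y)| 2.0 * (w * x - y) * x)
        .sum::<f64>()
        / n
}

/// Split the batch into `num_gpus` equal contiguous shards, compute one gradient
/// per shard, then average them (the "CPU AllReduce" step in the diagram).
/// Assumes `samples.len()` is a non-zero multiple of `num_gpus`.
fn sharded_grad(w: f64, samples: &[(f64, f64)], num_gpus: usize) -> f64 {
    assert!(num_gpus > 0 && !samples.is_empty() && samples.len() % num_gpus == 0);
    let shard_len = samples.len() / num_gpus;
    let per_gpu: Vec<f64> = samples
        .chunks(shard_len)
        .map(|shard| mse_grad(w, shard))
        .collect();
    per_gpu.iter().sum::<f64>() / per_gpu.len() as f64
}
```

With equal shard sizes, the averaged result matches the full-batch gradient up to floating-point rounding. If shards differ in size, a plain average no longer equals the full-batch mean, and weighting each shard by its sample count would be needed.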

§Contract (C-DP-001)

  • Precondition: All pipelines have identical weights at each step start
  • Postcondition: All pipelines have identical weights after optimizer step
  • Invariant: Loss stays within 1% of an equivalent single-GPU run from step 100 onward
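The contract can be expressed as two small checks. This is a sketch under stated assumptions: replica weights are modeled as flat `f32` vectors, "identical" is taken to mean bitwise-equal, and the 1% bound is read as a relative difference. Both function names are hypothetical and not part of this module's API.

```rust
/// Pre/postcondition: every replica holds the same weights.
/// Compares bit patterns, so NaN payloads and signed zeros count as differences.
fn replicas_in_sync(replicas: &[Vec<f32>]) -> bool {
    match replicas.split_first() {
        None => true,
        Some((first, rest)) => rest.iter().all(|r| {
            r.len() == first.len()
                && r.iter().zip(first).all(|(a, b)| a.to_bits() == b.to_bits())
        }),
    }
}

/// Invariant: the multi-GPU loss is within `rel_tol` (e.g. 0.01 for 1%)
/// of the single-GPU reference loss.
fn loss_within_tolerance(multi_gpu: f32, single_gpu: f32, rel_tol: f32) -> bool {
    (multi_gpu - single_gpu).abs() <= rel_tol * single_gpu.abs()
}
```

A bitwise comparison is the strictest reading of "identical". It seems reasonable here because every replica receives the same averaged gradient, but an implementation could just as well use a small tolerance.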

§Why CPU AllReduce is fine

A rank-16 LoRA adapter on Qwen3-4B has ~5.9M parameters, or ~22MB of gradients. Copying that over PCIe takes <2ms, which is negligible next to the forward pass (~200ms per GPU).
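These figures can be sanity-checked with a little arithmetic. The sketch below assumes 4-byte (`f32`) gradient values and an effective PCIe bandwidth of 16 GB/s. Both numbers are assumptions for illustration, not measurements from this module, and the helper names are hypothetical.

```rust
/// Size of a gradient buffer in bytes, assuming `bytes_per_param` bytes per value.
fn gradient_bytes(num_params: u64, bytes_per_param: u64) -> u64 {
    num_params * bytes_per_param
}

/// One-way transfer time in milliseconds at `bandwidth_gb_per_s` (1 GB = 1e9 bytes).
fn transfer_ms(bytes: u64, bandwidth_gb_per_s: f64) -> f64 {
    bytes as f64 / (bandwidth_gb_per_s * 1e9) * 1e3
}
```

Under these assumptions, 5.9M parameters come to 23.6 million bytes (about 22.5 MiB), and one copy takes roughly 1.5 ms, which is under 1% of a 200 ms forward pass.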

Structs§

DataParallelCoordinator
Coordinates data-parallel training across multiple GPUs.

Functions§

average_gradients
Average gradient vectors from multiple workers.
has_non_finite
Check if any element is NaN or Inf.
shard_samples
Shard samples across N workers.
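
The three helpers above might look roughly like the following. This page does not show their real signatures, so the parameter and return types here are assumptions, as are the choices to give leftover samples to the first workers and to return `None` on mismatched gradient lengths.

```rust
/// Returns true if any element is NaN or infinite.
fn has_non_finite(values: &[f32]) -> bool {
    values.iter().any(|v| !v.is_finite())
}

/// Splits `samples` into `num_workers` contiguous shards whose lengths differ by at most one.
fn shard_samples<T: Clone>(samples: &[T], num_workers: usize) -> Vec<Vec<T>> {
    assert!(num_workers > 0, "need at least one worker");
    let base = samples.len() / num_workers;
    let extra = samples.len() % num_workers;
    let mut shards = Vec::with_capacity(num_workers);
    let mut start = 0;
    for worker in 0..num_workers {
        // The first `extra` workers take one additional sample.
        let len = base + usize::from(worker < extra);
        shards.push(samples[start..start + len].to_vec());
        start += len;
    }
    shards
}

/// Element-wise mean of the workers' gradient vectors.
/// Returns None if there are no workers or the vectors differ in length.
fn average_gradients(per_worker: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = per_worker.first()?;
    if per_worker.iter().any(|g| g.len() != first.len()) {
        return None;
    }
    let scale = 1.0 / per_worker.len() as f32;
    let mut avg = vec![0.0f32; first.len()];
    for grads in per_worker {
        for (a, g) in avg.iter_mut().zip(grads) {
            *a += g * scale;
        }
    }
    Some(avg)
}
```

A non-finite check of this kind is typically run on each worker's gradients before averaging, since a single NaN would otherwise propagate into every replica through the mean.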