Expand description
Multi-GPU data parallelism for classification training
Provides DataParallelCoordinator that splits mini-batches across multiple
GPUs, runs forward/backward independently per GPU, and averages gradients
on CPU before the optimizer step.
§Architecture
Mini-batch [N samples]
├── Shard 0 [N/G samples] → GPU 0 → gradients₀
├── Shard 1 [N/G samples] → GPU 1 → gradients₁
└── ...
↓ (CPU AllReduce: average LoRA gradients)
Optimizer step (applied to all replicas)§Contract (C-DP-001)
- Precondition: All pipelines have identical weights at each step start
- Postcondition: All pipelines have identical weights after optimizer step
- Invariant: Loss within 1% of equivalent single-GPU run at step 100+
§Why CPU AllReduce is fine
LoRA rank-16 on Qwen3-4B = ~5.9M params = ~22MB. PCIe transfer: <2ms. This is negligible vs forward pass (~200ms per GPU).
Structs§
- Data
Parallel Coordinator - Coordinates data-parallel training across multiple GPUs.
Functions§
- average_
gradients - Average gradient vectors from multiple workers.
- has_
non_ finite - Check if any element is NaN or Inf.
- shard_
samples - Shard samples across N workers.