Module data_parallel

Expand description

Multi-GPU data parallelism for classification training

Provides DataParallelCoordinator that splits mini-batches across multiple GPUs, runs forward/backward independently per GPU, and averages gradients on CPU before the optimizer step.

§Architecture

Mini-batch [N samples]
  ├── Shard 0 [N/G samples] → GPU 0 → gradients₀
  ├── Shard 1 [N/G samples] → GPU 1 → gradients₁
  └── ...
       ↓ (CPU AllReduce: average LoRA gradients)
  Optimizer step (applied to all replicas)

§Contract (C-DP-001)

Precondition: All pipelines have identical weights at each step start
Postcondition: All pipelines have identical weights after optimizer step
Invariant: Loss within 1% of equivalent single-GPU run at step 100+

§Why CPU AllReduce is fine

LoRA rank-16 on Qwen3-4B = ~5.9M params = ~22MB. PCIe transfer: <2ms. This is negligible vs forward pass (~200ms per GPU).

Structs§

DataParallelCoordinator: Coordinates data-parallel training across multiple GPUs.

Functions§

average_gradients: Average gradient vectors from multiple workers.
has_non_finite: Check if any element is NaN or Inf.
shard_samples: Shard samples across N workers.

Module data_parallel

Module data_parallel Copy item path

§Architecture

§Contract (C-DP-001)

§Why CPU AllReduce is fine

Structs§

Functions§

Module data_parallel