TCP gradient server for distributed training (coordinator side)
The GradientServer runs on the coordinator node and:
- Accepts worker connections
- Assigns shard ranges per training step
- Collects gradients from all workers
- Computes the AllReduce (element-wise average) and broadcasts the result to all workers
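The collect-and-average step in the list above can be sketched roughly as follows. This is a minimal illustration only: `average_gradients` is a hypothetical helper, not part of this module's documented API, and it assumes each worker's gradient arrives as a flat `Vec<f32>` of the same length.

```rust
/// Element-wise average of per-worker gradients.
///
/// Hypothetical helper for illustration; the real server's types and
/// wire format are not shown on this page.
/// Returns `None` if no worker reported or if the lengths disagree.
pub fn average_gradients(worker_grads: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = worker_grads.first()?;
    let len = first.len();
    // Every worker must report a gradient of the same length.
    if worker_grads.iter().any(|g| g.len() != len) {
        return None;
    }

    // Sum the gradients element by element.
    let mut avg = vec![0.0f32; len];
    for grad in worker_grads {
        for (acc, g) in avg.iter_mut().zip(grad) {
            *acc += *g;
        }
    }

    // Divide by the number of workers to get the mean.
    let n = worker_grads.len() as f32;
    for acc in avg.iter_mut() {
        *acc /= n;
    }
    Some(avg)
}
```

With two workers reporting `[1.0, 2.0]` and `[3.0, 6.0]`, the averaged gradient that would be broadcast is `[2.0, 4.0]`.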
§Contract: F-DP-001 (Weight Consistency)
After the server broadcasts the averaged gradients, every worker applies the same optimizer step, which keeps the weights consistent across workers.
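The reasoning behind this contract is that replicas starting from identical weights and applying the same deterministic update to the same averaged gradient end up with identical weights. The sketch below illustrates that with plain SGD, which is an assumption made for the example; the optimizer actually used is not stated here. Both `sgd_step` and `replicas_consistent` are hypothetical names.

```rust
/// Plain SGD update: w <- w - lr * g.
/// (Assumed optimizer, chosen only to keep the illustration short.)
pub fn sgd_step(weights: &mut [f32], avg_grad: &[f32], lr: f32) {
    assert_eq!(weights.len(), avg_grad.len(), "weight/gradient length mismatch");
    for (w, g) in weights.iter_mut().zip(avg_grad) {
        *w -= lr * *g;
    }
}

/// Returns true when every replica holds bit-identical weights.
/// Hypothetical check, comparing raw bits so that rounding differences
/// between replicas would be detected.
pub fn replicas_consistent(replicas: &[Vec<f32>]) -> bool {
    replicas.windows(2).all(|pair| {
        pair[0].len() == pair[1].len()
            && pair[0]
                .iter()
                .zip(&pair[1])
                .all(|(a, b)| a.to_bits() == b.to_bits())
    })
}
```

A test harness could call `sgd_step` on each replica with the broadcast gradient and then assert `replicas_consistent` after every step.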
§Contract: F-DP-003 (Gradient Stability)
If any worker sends NaN/Inf gradients, the server halts training (Jidoka).
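A minimal sketch of such a stop-the-line check is shown below, assuming gradients are flat `Vec<f32>` buffers indexed by worker. `check_finite` and `NonFiniteGradient` are hypothetical names introduced for illustration; how the real server reports or handles the halt is not described on this page.

```rust
/// Identifies the first non-finite gradient value that was found.
/// (Hypothetical error type for illustration.)
#[derive(Debug, PartialEq)]
pub struct NonFiniteGradient {
    pub worker: usize,
    pub index: usize,
}

/// Scans every worker's gradient and stops at the first NaN or Inf.
/// The caller is expected to halt training on `Err` rather than
/// average and broadcast a corrupted gradient.
pub fn check_finite(worker_grads: &[Vec<f32>]) -> Result<(), NonFiniteGradient> {
    for (worker, grad) in worker_grads.iter().enumerate() {
        // `is_finite` is false for NaN, +Inf and -Inf.
        if let Some(index) = grad.iter().position(|g| !g.is_finite()) {
            return Err(NonFiniteGradient { worker, index });
        }
    }
    Ok(())
}
```

Running this check before averaging means a single bad worker stops the step instead of contaminating every replica through the broadcast.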
Structs§
- AllReduceResult - Result of one AllReduce step across all workers.
- BlockAllReduceResult - Result of per-block AllReduce for DDP pretraining.
- GradientServer - Gradient server running on the coordinator node.
- NonBlockAllReduceResult - Result of non-block AllReduce for DDP pretraining.