Module gradient_server

TCP gradient server for distributed training (coordinator side)

The GradientServer runs on the coordinator node and:

  1. Accepts worker connections
  2. Assigns shard ranges per training step
  3. Collects gradients from all workers
  4. Computes the AllReduce (element-wise average of the collected gradients) and broadcasts the result to every worker
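
Steps 2 and 4 can be sketched as follows. This is a minimal illustration and not the module's actual code: `shard_range` and `all_reduce_average` are hypothetical names, the contiguous even split is an assumed sharding policy, and gradients are assumed to arrive as one flat `Vec<f32>` per worker.

```rust
use std::ops::Range;

/// Hypothetical helper (assumed policy): split `total` samples into
/// contiguous, near-equal ranges, one per worker. The first
/// `total % num_workers` workers each take one extra sample.
/// Panics if `num_workers` is zero.
fn shard_range(worker: usize, num_workers: usize, total: usize) -> Range<usize> {
    assert!(num_workers > 0, "at least one worker is required");
    let base = total / num_workers;
    let extra = total % num_workers;
    let start = worker * base + worker.min(extra);
    let len = base + usize::from(worker < extra);
    start..start + len
}

/// Hypothetical helper: element-wise mean of one gradient vector per worker.
/// Returns `None` when there are no workers or the vector lengths differ.
fn all_reduce_average(worker_grads: &[Vec<f32>]) -> Option<Vec<f32>> {
    let len = worker_grads.first()?.len();
    if worker_grads.iter().any(|g| g.len() != len) {
        return None;
    }
    // Accumulate the per-element sum across workers.
    let mut avg = vec![0.0f32; len];
    for grads in worker_grads {
        for (acc, g) in avg.iter_mut().zip(grads) {
            *acc += *g;
        }
    }
    // Divide by the worker count to turn the sum into a mean.
    let n = worker_grads.len() as f32;
    for acc in avg.iter_mut() {
        *acc /= n;
    }
    Some(avg)
}
```

With two workers reporting `[1.0, 2.0]` and `[3.0, 6.0]`, the averaged gradient that would be broadcast is `[2.0, 4.0]`.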

§Contract: F-DP-001 (Weight Consistency)

After the server broadcasts the averaged gradients, every worker applies the same optimizer step to the same gradients, so the weights stay identical across workers.
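
The reasoning behind this contract can be illustrated with a small sketch. It assumes plain SGD as the optimizer, which the source does not specify, and `sgd_step` is a hypothetical name: two replicas that start from identical weights and apply the same averaged gradient end up with identical weights.

```rust
/// Hypothetical plain SGD update (assumed optimizer): w <- w - lr * g.
/// Every worker would run this with the same broadcast `avg_grads`.
fn sgd_step(weights: &mut [f32], avg_grads: &[f32], lr: f32) {
    for (w, g) in weights.iter_mut().zip(avg_grads) {
        *w -= lr * g;
    }
}
```

Because the update is deterministic and every replica sees the same inputs, no extra weight synchronization is needed in this simplified picture.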

§Contract: F-DP-003 (Gradient Stability)

If any worker sends NaN/Inf gradients, the server halts training (Jidoka: stop on the first defect rather than propagate it).
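
A check of this kind might look like the sketch below. `first_non_finite_worker` is a hypothetical name, and the one-flat-`Vec<f32>`-per-worker layout is an assumption. The real server may detect and report the condition differently.

```rust
/// Hypothetical check: index of the first worker whose gradients contain
/// a NaN or infinite value, or `None` if every value is finite.
fn first_non_finite_worker(worker_grads: &[Vec<f32>]) -> Option<usize> {
    worker_grads
        .iter()
        .position(|grads| grads.iter().any(|g| !g.is_finite()))
}
```

A coordinator could run such a check before averaging and stop the training loop as soon as it returns `Some(worker)`.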

Structs§

AllReduceResult
Result of one AllReduce step across all workers.
BlockAllReduceResult
Result of per-block AllReduce for DDP pretraining.
GradientServer
Gradient server running on the coordinator node.
NonBlockAllReduceResult
Result of non-block AllReduce for DDP pretraining.