Module gradient_server

entrenar::finetune

TCP gradient server for distributed training (coordinator side)

The GradientServer runs on the coordinator node and:

Accepts worker connections
Assigns shard ranges per training step
Collects gradients from all workers
Averages the collected gradients (AllReduce) and broadcasts the result to all workers
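
The averaging step can be sketched as an element-wise mean over the gradient vectors received from the workers. This is a minimal illustration, not the crate's implementation: the function name `average_gradients` and the flat `Vec<f32>` gradient layout are assumptions made for the example.

```rust
/// Element-wise mean of per-worker gradient vectors (an averaging AllReduce).
/// Hypothetical helper: the name and the flat `Vec<f32>` layout are assumptions.
/// Returns `None` if there are no workers or the vectors differ in length.
pub fn average_gradients(worker_grads: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = worker_grads.first()?;
    let len = first.len();
    if worker_grads.iter().any(|g| g.len() != len) {
        return None;
    }

    // Accumulate the sum of each gradient component across workers.
    let mut sum = vec![0.0f32; len];
    for grads in worker_grads {
        for (acc, g) in sum.iter_mut().zip(grads) {
            *acc += *g;
        }
    }

    // Divide by the worker count to turn the sum into a mean.
    let n = worker_grads.len() as f32;
    for acc in sum.iter_mut() {
        *acc /= n;
    }
    Some(sum)
}
```

With two workers sending `[1.0, 2.0]` and `[3.0, 6.0]`, the averaged gradient broadcast back would be `[2.0, 4.0]`.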

Contract: F-DP-001 (Weight Consistency)

After the server broadcasts the averaged gradients, every worker applies the same optimizer step to them, so the model weights stay identical across workers.
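
The contract can be illustrated with a small sketch: replicas that start from the same weights and apply the same update with the same averaged gradients end up bit-identical. Plain SGD is assumed here purely for illustration, since the optimizer is not specified above, and both `sgd_step` and `replicas_consistent` are hypothetical helpers.

```rust
/// Plain SGD update: w <- w - lr * g.
/// SGD is an assumption made for illustration; the real optimizer may differ.
pub fn sgd_step(weights: &mut [f32], avg_grads: &[f32], lr: f32) {
    for (w, g) in weights.iter_mut().zip(avg_grads) {
        *w -= lr * *g;
    }
}

/// Returns true when every replica holds bit-identical weights.
/// Hypothetical check, comparing raw bit patterns of adjacent replicas.
pub fn replicas_consistent(replicas: &[Vec<f32>]) -> bool {
    replicas.windows(2).all(|pair| {
        pair[0].len() == pair[1].len()
            && pair[0]
                .iter()
                .zip(&pair[1])
                .all(|(a, b)| a.to_bits() == b.to_bits())
    })
}
```

Comparing bit patterns instead of using `==` is a deliberate choice in this sketch: it treats any divergence, however small, as a violation.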

Contract: F-DP-003 (Gradient Stability)

If any worker sends gradients containing NaN or Inf values, the server halts training (Jidoka, the stop-on-defect principle).
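
A check of this kind might look like the following sketch, which scans each worker's gradients and reports the first non-finite value so the caller can stop. The `NonFiniteGradient` type and `check_gradients` function are hypothetical names, not part of the documented API; only `f32::is_finite` comes from the standard library.

```rust
/// Hypothetical error describing where a non-finite gradient was found.
#[derive(Debug, PartialEq)]
pub struct NonFiniteGradient {
    /// Index of the worker that sent the bad value.
    pub worker: usize,
    /// Position of the bad value within that worker's gradient vector.
    pub index: usize,
}

/// Scans every worker's gradients and reports the first NaN or Inf value.
/// A caller would halt training on `Err` instead of averaging.
pub fn check_gradients(worker_grads: &[Vec<f32>]) -> Result<(), NonFiniteGradient> {
    for (worker, grads) in worker_grads.iter().enumerate() {
        // `is_finite` is false for NaN, +Inf, and -Inf.
        if let Some(index) = grads.iter().position(|g| !g.is_finite()) {
            return Err(NonFiniteGradient { worker, index });
        }
    }
    Ok(())
}
```

Returning the worker and element index, rather than a bare boolean, is a design choice for this sketch: it makes it possible to log which worker triggered the halt.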

Structs

AllReduceResult: Result of one AllReduce step across all workers.
BlockAllReduceResult: Result of per-block AllReduce for DDP pretraining.
GradientServer: Gradient server running on the coordinator node.
NonBlockAllReduceResult: Result of non-block AllReduce for DDP pretraining.