Skip to main content

Module coordinator

Module coordinator 

Source
Expand description

Checkpoint coordination for multi-node adapter training (GPU-SHARE Phase 3, §3.4).

The coordinator polls each node’s checkpoint directory for metadata, compares val_loss across adapters, maintains a leaderboard, and identifies the best adapter at end of training.

Remote nodes are polled via cat checkpoint_dir/best/metadata.json over SSH. Local nodes read the file directly.

Structs§

AdapterStatus
Status of a single adapter across the cluster.
CheckpointCoordinator
Coordinator for multi-node training checkpoint polling.
CheckpointMetadata
Metadata written by save_adapter_checkpoint() in multi_adapter_pipeline.
LeaderboardEntry
Leaderboard entry ranking adapters by loss.
NodeHealth
Result of a node health check.

Enums§

PollResult
Result of polling a single adapter’s checkpoint.

Functions§

build_launch_command
Build a remote launch command for an adapter job on a node.
check_cluster_health
Check health of all nodes in a cluster (GPU-SHARE §3.6).
exec_launch
Execute a training job on a remote or local node.