Checkpoint coordination for multi-node adapter training (GPU-SHARE Phase 3, §3.4).
The coordinator polls each node’s checkpoint directory for metadata, compares val_loss across adapters, maintains a leaderboard, and identifies the best adapter at the end of training.
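The ranking step described above can be illustrated with a minimal sketch. The `Entry` struct and both function names are hypothetical stand-ins for illustration; the crate's actual `LeaderboardEntry` may carry different fields and use a different ordering rule. The sketch assumes a lower val_loss is better and that NaN losses are excluded from the ranking.

```rust
/// Hypothetical stand-in for one leaderboard row; the real
/// `LeaderboardEntry` in this module may have other fields.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    adapter: String,
    val_loss: f64,
}

/// Rank adapters by ascending validation loss.
/// Assumption: entries whose loss is NaN are dropped rather than ranked.
fn rank_by_val_loss(mut entries: Vec<Entry>) -> Vec<Entry> {
    entries.retain(|e| !e.val_loss.is_nan());
    // `total_cmp` gives a total order on f64, so the sort cannot panic.
    entries.sort_by(|a, b| a.val_loss.total_cmp(&b.val_loss));
    entries
}

/// The best adapter is the head of the ranked leaderboard, if any.
fn best_adapter(entries: Vec<Entry>) -> Option<Entry> {
    rank_by_val_loss(entries).into_iter().next()
}
```

Dropping NaN entries is a design choice for this sketch: a diverged run should not be able to win or distort the ordering.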
Remote nodes are polled via cat checkpoint_dir/best/metadata.json over SSH.
Local nodes read the file directly.
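As a rough illustration of the local versus remote split, the sketch below reads the metadata file directly for a local node and shells out to `ssh <host> cat <path>` for a remote one. The `Node` enum, `metadata_path`, `ssh_cat_args`, and `fetch_metadata` are hypothetical names, not this module's API; only the `best/metadata.json` layout and the use of `cat` over SSH come from the description above. It assumes an `ssh` binary on the PATH and key-based authentication.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Hypothetical node descriptor; not the crate's actual type.
enum Node {
    Local,
    Remote { host: String },
}

/// Path of the best-checkpoint metadata file under a checkpoint directory.
fn metadata_path(checkpoint_dir: &Path) -> PathBuf {
    checkpoint_dir.join("best").join("metadata.json")
}

/// Arguments passed to `ssh` to print the remote metadata file.
/// Note: ssh joins these into one remote shell command, so a path
/// containing spaces would need quoting (not handled in this sketch).
fn ssh_cat_args(host: &str, checkpoint_dir: &Path) -> Vec<String> {
    vec![
        host.to_string(),
        "cat".to_string(),
        metadata_path(checkpoint_dir).display().to_string(),
    ]
}

/// Fetch the raw metadata JSON text from a local or remote node.
fn fetch_metadata(node: &Node, checkpoint_dir: &Path) -> io::Result<String> {
    match node {
        // Local nodes: read the file directly.
        Node::Local => fs::read_to_string(metadata_path(checkpoint_dir)),
        // Remote nodes: run `cat` over SSH and capture stdout.
        Node::Remote { host } => {
            let out = Command::new("ssh")
                .args(ssh_cat_args(host, checkpoint_dir))
                .output()?;
            if out.status.success() {
                Ok(String::from_utf8_lossy(&out.stdout).into_owned())
            } else {
                let msg = String::from_utf8_lossy(&out.stderr).into_owned();
                Err(io::Error::new(io::ErrorKind::Other, msg))
            }
        }
    }
}
```

Returning the raw text keeps the sketch free of a JSON dependency; a real poller would parse it into a metadata struct and treat a missing file as "no checkpoint yet" rather than a hard error.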
Structs§
- AdapterStatus - Status of a single adapter across the cluster.
- CheckpointCoordinator - Coordinator for multi-node training checkpoint polling.
- CheckpointMetadata - Metadata written by save_adapter_checkpoint() in multi_adapter_pipeline.
- LeaderboardEntry - Leaderboard entry ranking adapters by loss.
- NodeHealth - Result of a node health check.
Enums§
- PollResult - Result of polling a single adapter’s checkpoint.
Functions§
- build_launch_command - Build a remote launch command for an adapter job on a node.
- check_cluster_health - Check health of all nodes in a cluster (GPU-SHARE §3.6).
- exec_launch - Execute a training job on a remote or local node.