Expand description
GPU resource management and multi-node training infrastructure.
Implements the GPU sharing spec across three phases:
§Phase 1: VRAM Guard + Sequential Queue
- Ledger: flock-based VRAM reservation with lease expiry (GPU-SHARE-001)
- Guard: Pre-allocation VRAM check + post-init tracking (GPU-SHARE-002)
- Wait: Polling wait queue with timeout (GPU-SHARE-003)
- Profiler: Brick-phase profiler for GPU sharing ops (GPU-SHARE-005)
- MPS: Experimental CUDA MPS support (opt-in, §1.5)
§Phase 2: Multi-Adapter Single-Process
See crate::finetune::multi_adapter_pipeline.
§Phase 3: Multi-Node via Forjar
- Cluster: Cluster YAML config schema + validation (§3.2)
- Placement: Greedy job placement with FLOPS scoring (§3.3)
- Coordinator: Checkpoint polling + leaderboard (§3.4)
Modules§
- cluster
- Cluster configuration for multi-node GPU training (GPU-SHARE Phase 3, §3.2).
- coordinator
- Checkpoint coordination for multi-node adapter training (GPU-SHARE Phase 3, §3.4).
- error
- GPU sharing error types (GPU-SHARE-001/002/003).
- guard
- VRAM Guard (GPU-SHARE-002).
- ledger
- VRAM Reservation Ledger (GPU-SHARE-001).
- mps
- Experimental CUDA MPS (Multi-Process Service) support (GPU-SHARE §1.5).
- placement
- Job placement algorithm for multi-node adapter training (GPU-SHARE Phase 3, §3.3).
- profiler
- Brick-phase profiler for GPU sharing operations (GPU-SHARE-005).
- wait
- Wait-for-VRAM polling queue (GPU-SHARE-003).