Module gpu


GPU resource management and multi-node training infrastructure.

Implements the GPU sharing spec across three phases:

Phase 1: VRAM Guard + Sequential Queue

  • Ledger: flock-based VRAM reservation with lease expiry (GPU-SHARE-001)
  • Guard: Pre-allocation VRAM check + post-init tracking (GPU-SHARE-002)
  • Wait: Polling wait queue with timeout (GPU-SHARE-003)
  • Profiler: Brick-phase profiler for GPU sharing ops (GPU-SHARE-005)
  • MPS: Experimental CUDA MPS support (opt-in, §1.5)
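The reservation-with-lease idea behind the ledger can be sketched roughly as follows. This is a minimal in-memory illustration, not the crate's implementation: the `Ledger` and `Lease` types, their fields, and `try_reserve` are hypothetical names, time is passed in as plain seconds so the behaviour is easy to follow, and the flock-based file locking that the real ledger is described as using is deliberately left out.

```rust
use std::collections::HashMap;

/// Hypothetical reservation record (illustrative names, not the crate's API).
#[derive(Debug, Clone, PartialEq)]
struct Lease {
    vram_mb: u64,
    /// Time (in seconds) at which the reservation stops counting.
    expires_at: u64,
}

/// In-memory stand-in for a VRAM ledger. Assumption: the real ledger
/// persists this state in a file guarded by flock; that part is omitted.
#[derive(Debug)]
struct Ledger {
    capacity_mb: u64,
    leases: HashMap<String, Lease>,
}

impl Ledger {
    fn new(capacity_mb: u64) -> Self {
        Ledger { capacity_mb, leases: HashMap::new() }
    }

    /// Drop every lease whose expiry is at or before `now`.
    fn reap(&mut self, now: u64) {
        self.leases.retain(|_, lease| lease.expires_at > now);
    }

    /// Total VRAM currently held by unexpired leases.
    fn reserved_mb(&self) -> u64 {
        self.leases.values().map(|lease| lease.vram_mb).sum()
    }

    /// Try to reserve `vram_mb` for `owner` until `now + ttl`.
    /// On failure, returns the VRAM (in MB) that was actually available.
    /// A second call by the same owner replaces (renews) its lease.
    fn try_reserve(&mut self, owner: &str, vram_mb: u64, now: u64, ttl: u64) -> Result<(), u64> {
        self.reap(now);
        let held_by_others: u64 = self
            .leases
            .iter()
            .filter(|(name, _)| name.as_str() != owner)
            .map(|(_, lease)| lease.vram_mb)
            .sum();
        let free = self.capacity_mb.saturating_sub(held_by_others);
        if vram_mb > free {
            return Err(free);
        }
        self.leases.insert(owner.to_string(), Lease { vram_mb, expires_at: now + ttl });
        Ok(())
    }
}
```

With a 24,000 MB card, a 16,000 MB lease taken at `t = 0` with a 60-second TTL blocks a second 16,000 MB request at `t = 10`, but the same request succeeds at `t = 60` because the first lease has expired and is reaped. Expiry is what keeps a crashed process from holding VRAM in the ledger indefinitely.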

Phase 2: Multi-Adapter Single-Process

See crate::finetune::multi_adapter_pipeline.

Phase 3: Multi-Node via Forjar

  • Cluster: Cluster YAML config schema + validation (§3.2)
  • Placement: Greedy job placement with FLOPS scoring (§3.3)
  • Coordinator: Checkpoint polling + leaderboard (§3.4)
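To make "greedy job placement with FLOPS scoring" concrete, here is one possible reading of it. Everything in this sketch is an assumption: the `Node` and `Job` types, the `place` function, the largest-job-first ordering, and the use of raw TFLOPS as the score are illustrative choices, and the actual scoring rule in the placement module may differ.

```rust
/// Hypothetical node description (illustrative, not the crate's schema).
#[derive(Debug, Clone)]
struct Node {
    name: String,
    free_vram_mb: u64,
    tflops: f64,
}

/// Hypothetical job description.
#[derive(Debug, Clone)]
struct Job {
    name: String,
    vram_mb: u64,
}

/// Greedy placement: take jobs largest-first and give each one the
/// highest-TFLOPS node that still has enough free VRAM.
/// Returns (job, node) assignments and the names of jobs that did not fit.
fn place(jobs: &[Job], nodes: &mut [Node]) -> (Vec<(String, String)>, Vec<String>) {
    // Assumption: sorting by descending VRAM need, so big jobs are not
    // starved by small ones that grabbed the large nodes first.
    let mut order: Vec<&Job> = jobs.iter().collect();
    order.sort_by(|a, b| b.vram_mb.cmp(&a.vram_mb));

    let mut placed = Vec::new();
    let mut unplaced = Vec::new();
    for job in order {
        let best = nodes
            .iter_mut()
            .filter(|node| node.free_vram_mb >= job.vram_mb)
            .max_by(|a, b| a.tflops.total_cmp(&b.tflops));
        match best {
            Some(node) => {
                // Commit the choice immediately; greedy never revisits it.
                node.free_vram_mb -= job.vram_mb;
                placed.push((job.name.clone(), node.name.clone()));
            }
            None => unplaced.push(job.name.clone()),
        }
    }
    (placed, unplaced)
}
```

Given node `a` (24,000 MB free, 80 TFLOPS) and node `b` (48,000 MB free, 40 TFLOPS), a 40,000 MB job can only go to `b`, an 8,000 MB job then fits on both and goes to the faster `a`, and a 64,000 MB job is reported as unplaced. A greedy pass like this is simple and predictable but not guaranteed to be optimal.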

Modules

cluster
Cluster configuration for multi-node GPU training (GPU-SHARE Phase 3, §3.2).
coordinator
Checkpoint coordination for multi-node adapter training (GPU-SHARE Phase 3, §3.4).
error
GPU sharing error types (GPU-SHARE-001/002/003).
guard
VRAM Guard (GPU-SHARE-002).
ledger
VRAM Reservation Ledger (GPU-SHARE-001).
mps
Experimental CUDA MPS (Multi-Process Service) support (GPU-SHARE §1.5).
placement
Job placement algorithm for multi-node adapter training (GPU-SHARE Phase 3, §3.3).
profiler
Brick-phase profiler for GPU sharing operations (GPU-SHARE-005).
wait
Wait-for-VRAM polling queue (GPU-SHARE-003).
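
A polling wait with a timeout, as the wait module is described, might look roughly like the sketch below. The function name `wait_for_vram`, the `WaitTimeout` error type, and the closure-based probe are all hypothetical; the real module presumably queries the ledger or the device rather than taking a closure.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error returned when the deadline passes without enough VRAM.
#[derive(Debug, PartialEq)]
struct WaitTimeout {
    /// Free VRAM (MB) seen on the last poll before giving up.
    last_free_mb: u64,
}

/// Poll `free_mb` every `interval` until it reports at least `needed_mb`,
/// or until `timeout` has elapsed. Returns the free VRAM seen on success.
fn wait_for_vram<F>(
    mut free_mb: F,
    needed_mb: u64,
    interval: Duration,
    timeout: Duration,
) -> Result<u64, WaitTimeout>
where
    F: FnMut() -> u64,
{
    let deadline = Instant::now() + timeout;
    loop {
        let free = free_mb();
        if free >= needed_mb {
            return Ok(free);
        }
        // Check the deadline after polling so there is always at least one probe.
        if Instant::now() >= deadline {
            return Err(WaitTimeout { last_free_mb: free });
        }
        thread::sleep(interval);
    }
}
```

Passing the probe as a closure keeps the loop testable without a GPU: a caller can supply a counter that frees memory over successive polls, or a constant that forces the timeout path.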