Module gpu

GPU resource management and multi-node training infrastructure.

Implements the GPU sharing spec (GPU-SHARE) across three phases.

Modules

cluster: Cluster configuration for multi-node GPU training (GPU-SHARE Phase 3, §3.2).
coordinator: Checkpoint coordination for multi-node adapter training (GPU-SHARE Phase 3, §3.4).
error: GPU sharing error types (GPU-SHARE-001/002/003).
guard: VRAM Guard (GPU-SHARE-002).
ledger: VRAM Reservation Ledger (GPU-SHARE-001).
mps: Experimental CUDA MPS (Multi-Process Service) support (GPU-SHARE §1.5).
placement: Job placement algorithm for multi-node adapter training (GPU-SHARE Phase 3, §3.3).
profiler: Brick-phase profiler for GPU sharing operations (GPU-SHARE-005).
wait: Wait-for-VRAM polling queue (GPU-SHARE-003).
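To illustrate how the ledger, guard, and wait modules might fit together, here is a minimal sketch of a reservation ledger with a capacity check and a polling wait. Every name in it (`Ledger`, `ReserveError`, `try_reserve`, `release`, `wait_for_vram`) is hypothetical and chosen for illustration; none of it is taken from this crate's actual API, and the real modules may work quite differently.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Hypothetical error type; not the crate's `error` module.
#[derive(Debug, PartialEq)]
pub enum ReserveError {
    DuplicateJob,
    InsufficientVram { requested_mb: u64, available_mb: u64 },
}

/// Hypothetical in-memory ledger mapping job ids to reserved VRAM (MB).
#[derive(Debug)]
pub struct Ledger {
    capacity_mb: u64,
    reservations: HashMap<u32, u64>,
}

impl Ledger {
    pub fn new(capacity_mb: u64) -> Self {
        Self { capacity_mb, reservations: HashMap::new() }
    }

    /// VRAM not yet claimed by any reservation.
    pub fn available_mb(&self) -> u64 {
        let reserved: u64 = self.reservations.values().sum();
        self.capacity_mb - reserved
    }

    /// Guard step: record the reservation only if it fits in what is left.
    pub fn try_reserve(&mut self, job: u32, mb: u64) -> Result<(), ReserveError> {
        if self.reservations.contains_key(&job) {
            return Err(ReserveError::DuplicateJob);
        }
        let available_mb = self.available_mb();
        if mb > available_mb {
            return Err(ReserveError::InsufficientVram { requested_mb: mb, available_mb });
        }
        self.reservations.insert(job, mb);
        Ok(())
    }

    /// Drop a job's reservation, returning the amount that was freed.
    pub fn release(&mut self, job: u32) -> Option<u64> {
        self.reservations.remove(&job)
    }
}

/// Wait step: retry the reservation up to `max_polls` times, sleeping
/// `interval` between attempts, and return the last result.
pub fn wait_for_vram(
    ledger: &Mutex<Ledger>,
    job: u32,
    mb: u64,
    interval: Duration,
    max_polls: u32,
) -> Result<(), ReserveError> {
    let mut last = ledger.lock().unwrap().try_reserve(job, mb);
    for _ in 1..max_polls {
        // Only a VRAM shortage is worth retrying; anything else is final.
        if !matches!(last, Err(ReserveError::InsufficientVram { .. })) {
            break;
        }
        thread::sleep(interval);
        last = ledger.lock().unwrap().try_reserve(job, mb);
    }
    last
}
```

In this sketch the capacity check lives inside `try_reserve`, so a caller cannot record a reservation that oversubscribes the device, and the polling loop only retries on a shortage, since a duplicate job id will not resolve by waiting.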