Skip to main content

entrenar/gpu/
mod.rs

1//! GPU resource management and multi-node training infrastructure.
2//!
3//! Implements the GPU sharing spec across three phases:
4//!
5//! ## Phase 1: VRAM Guard + Sequential Queue
6//! - **Ledger**: flock-based VRAM reservation with lease expiry (GPU-SHARE-001)
7//! - **Guard**: Pre-allocation VRAM check + post-init tracking (GPU-SHARE-002)
8//! - **Wait**: Polling wait queue with timeout (GPU-SHARE-003)
9//! - **Profiler**: Brick-phase profiler for GPU sharing ops (GPU-SHARE-005)
10//! - **MPS**: Experimental CUDA MPS support (opt-in, §1.5)
11//!
12//! ## Phase 2: Multi-Adapter Single-Process
13//! See [`crate::finetune::multi_adapter_pipeline`].
14//!
15//! ## Phase 3: Multi-Node via Forjar
16//! - **Cluster**: Cluster YAML config schema + validation (§3.2)
17//! - **Placement**: Greedy job placement with FLOPS scoring (§3.3)
18//! - **Coordinator**: Checkpoint polling + leaderboard (§3.4)
19
20pub mod cluster;
21pub mod coordinator;
22pub mod error;
23pub mod guard;
24pub mod ledger;
25pub mod mps;
26pub mod placement;
27pub mod profiler;
28pub mod wait;