pub struct MemoryWatchdog { /* private fields */ }
Background monitor that pauses training when free VRAM drops below a threshold.
Create a watchdog, wrap it in an Arc, and call start to
spawn the monitoring thread. Between training batches, call
wait_if_paused to block until memory pressure
is resolved.
use std::sync::Arc;
use std::time::Duration;
use ferrotorch_gpu::memory_guard::MemoryWatchdog;
use ferrotorch_gpu::GpuDevice;
let device = Arc::new(GpuDevice::new(0).unwrap());
let watchdog = Arc::new(MemoryWatchdog::new(
    device,
    512 * 1024 * 1024, // pause when <512 MiB free
    Duration::from_secs(1),
));
let handle = Arc::clone(&watchdog).start();
// In training loop:
watchdog.wait_if_paused();
Implementations
impl MemoryWatchdog
pub fn new(
    device: Arc<GpuDevice>,
    pressure_threshold_bytes: usize,
    check_interval: Duration,
) -> Self
Create a new watchdog. Does not start monitoring until start
is called.
pub fn start(self: Arc<Self>) -> JoinHandle<()>
Start the monitoring thread. Returns a JoinHandle that can be used
to wait for shutdown (after calling stop).
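The start/stop/join lifecycle described above can be sketched with a self-contained stand-in. Everything here is hypothetical: `Monitor`, its fields, and `run_and_shut_down` are illustration-only names, and the `stop(&self)` signature is an assumption, since this page mentions `stop` without listing it. The sketch shows only how a `self: Arc<Self>` receiver lets the spawned thread share state with the caller, and how the returned JoinHandle is joined after a stop request.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical monitor; not the crate's implementation.
pub struct Monitor {
    stop_requested: AtomicBool,
    checks: AtomicUsize,
    interval: Duration,
}

impl Monitor {
    pub fn new(interval: Duration) -> Self {
        Self {
            stop_requested: AtomicBool::new(false),
            checks: AtomicUsize::new(0),
            interval,
        }
    }

    /// Spawns the monitoring thread. Taking `Arc<Self>` by value moves one
    /// reference into the thread while the caller keeps its own clone.
    pub fn start(self: Arc<Self>) -> JoinHandle<()> {
        thread::spawn(move || loop {
            // Placeholder for one memory check.
            self.checks.fetch_add(1, Ordering::Relaxed);
            if self.stop_requested.load(Ordering::Acquire) {
                break;
            }
            thread::sleep(self.interval);
        })
    }

    /// Asks the monitoring thread to exit after its current cycle (assumed signature).
    pub fn stop(&self) {
        self.stop_requested.store(true, Ordering::Release);
    }

    pub fn checks(&self) -> usize {
        self.checks.load(Ordering::Relaxed)
    }
}

/// Starts a monitor, lets it run, then stops it and joins the thread.
/// Returns how many check cycles ran.
pub fn run_and_shut_down(interval: Duration, run_for: Duration) -> usize {
    let monitor = Arc::new(Monitor::new(interval));
    let handle = Arc::clone(&monitor).start();
    thread::sleep(run_for);
    monitor.stop();
    handle.join().expect("monitor thread panicked");
    monitor.checks()
}
```

Joining after the stop request ensures the thread has exited before the caller drops resources it may still be reading.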
pub fn check_pressure(&self) -> bool
Returns true if the watchdog currently has training paused due to
memory pressure.
pub fn wait_if_paused(&self)
Block the calling thread until memory pressure is resolved. Call this between training batches.
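One common way to build this kind of blocking call is a Mutex-guarded flag paired with a Condvar. The sketch below assumes that design purely for illustration; the page does not say how MemoryWatchdog implements it. `PauseGate`, `set_paused`, and `run_batches` are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical pause state shared by a monitor thread and a training loop.
pub struct PauseGate {
    paused: Mutex<bool>,
    resumed: Condvar,
}

impl PauseGate {
    pub fn new() -> Self {
        Self {
            paused: Mutex::new(false),
            resumed: Condvar::new(),
        }
    }

    /// Called by the monitor when memory pressure starts or ends.
    pub fn set_paused(&self, paused: bool) {
        *self.paused.lock().unwrap() = paused;
        if !paused {
            // Wake every thread blocked in `wait_if_paused`.
            self.resumed.notify_all();
        }
    }

    pub fn is_paused(&self) -> bool {
        *self.paused.lock().unwrap()
    }

    /// Returns immediately when not paused; otherwise blocks until resumed.
    pub fn wait_if_paused(&self) {
        let mut paused = self.paused.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while *paused {
            paused = self.resumed.wait(paused).unwrap();
        }
    }
}

/// Runs `batches` placeholder training steps, checking the gate before each one.
pub fn run_batches(gate: &PauseGate, batches: usize) -> usize {
    let mut done = 0;
    for _ in 0..batches {
        gate.wait_if_paused();
        done += 1; // stand-in for one training step
    }
    done
}
```

Checking the gate only between batches means a pause never interrupts a step that is already running.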
pub fn wait_for_first_check(&self, timeout: Duration)
Block until the watchdog has completed at least one check cycle. Useful in tests to avoid timing races.
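A bounded wait like this can be sketched with `Condvar::wait_timeout_while` from the standard library. `FirstCheck` and its methods are hypothetical names, not the crate's implementation. Unlike the documented method, which returns nothing, this `wait` returns a bool so a test can tell a completed check from a timeout.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical "first check completed" latch.
pub struct FirstCheck {
    done: Mutex<bool>,
    signal: Condvar,
}

impl FirstCheck {
    pub fn new() -> Self {
        Self {
            done: Mutex::new(false),
            signal: Condvar::new(),
        }
    }

    /// Called by the monitor thread after its first check cycle.
    pub fn mark_done(&self) {
        *self.done.lock().unwrap() = true;
        self.signal.notify_all();
    }

    /// Blocks until the first check has completed or `timeout` elapses.
    /// Returns true if the check completed in time.
    pub fn wait(&self, timeout: Duration) -> bool {
        let guard = self.done.lock().unwrap();
        let (guard, _timed_out) = self
            .signal
            .wait_timeout_while(guard, timeout, |done| !*done)
            .unwrap();
        *guard
    }
}
```

In a test, waiting on such a latch before asserting on the pause state avoids racing the monitor thread's startup.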
pub fn pressure_threshold_bytes(&self) -> usize
The pressure threshold in bytes.
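Because the threshold is given in raw bytes, a small helper can make call sites easier to read. This is a sketch under stated assumptions: `MIB`, `mib`, and `is_under_pressure` are hypothetical names that do not come from the crate. The strict less-than comparison follows the "pause when <512 MiB free" comment in the example above.

```rust
/// Bytes per mebibyte.
pub const MIB: usize = 1024 * 1024;

/// Converts a size in MiB to bytes.
pub const fn mib(n: usize) -> usize {
    n * MIB
}

/// Hypothetical pressure test: true when free memory is strictly below the threshold.
pub fn is_under_pressure(free_bytes: usize, threshold_bytes: usize) -> bool {
    free_bytes < threshold_bytes
}
```

With this helper, `mib(512)` yields the same 536,870,912 bytes as the literal `512 * 1024 * 1024` passed to `new` in the example.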
Trait Implementations
Auto Trait Implementations
impl !Freeze for MemoryWatchdog
impl RefUnwindSafe for MemoryWatchdog
impl Send for MemoryWatchdog
impl Sync for MemoryWatchdog
impl Unpin for MemoryWatchdog
impl UnsafeUnpin for MemoryWatchdog
impl UnwindSafe for MemoryWatchdog
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> DistributionExt for T
where
    T: ?Sized,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.