pub struct AsyncCheckpointer { /* private fields */ }Expand description
Asynchronous checkpointer that writes shards to disk in a background thread.
Before writing, tensor data is staged (copied) into CPU memory so that training can continue on GPU without blocking. The actual file I/O happens in a separate thread.
§Usage
ⓘ
let ckpt = AsyncCheckpointer::new(
checkpoint_dir.clone(),
rank,
world_size,
shard_metadata.clone(),
);
// Non-blocking: training continues immediately.
let mut future = ckpt.save_async(&state_dict)?;
// ... continue training ...
// When you need to be sure it finished:
future.wait()?;Implementations§
Source§impl AsyncCheckpointer
impl AsyncCheckpointer
Sourcepub fn new(
dir: PathBuf,
rank: usize,
world_size: usize,
shard_spec: ShardMetadata,
) -> Self
pub fn new( dir: PathBuf, rank: usize, world_size: usize, shard_spec: ShardMetadata, ) -> Self
Create a new async checkpointer.
dir: directory where shard files will be written.rank: this process’s rank.world_size: total number of ranks.shard_spec: metadata describing how tensors are sharded.
Sourcepub fn world_size(&self) -> usize
pub fn world_size(&self) -> usize
Total number of ranks.
Sourcepub fn save_async(
&self,
state_dict: &HashMap<String, Tensor<f32>>,
) -> Result<CheckpointFuture, DistCheckpointError>
pub fn save_async( &self, state_dict: &HashMap<String, Tensor<f32>>, ) -> Result<CheckpointFuture, DistCheckpointError>
Start an asynchronous checkpoint.
This immediately copies all tensor data to CPU-owned Vecs (staging),
then spawns a background thread that serializes and writes the shard
file. For GPU tensors, the staging step transfers data to host memory.
Returns a CheckpointFuture that can be polled or waited on.
§Errors
Returns an error immediately if another async save is already in flight, or if staging (GPU-to-CPU copy) fails.
Auto Trait Implementations§
impl Freeze for AsyncCheckpointer
impl RefUnwindSafe for AsyncCheckpointer
impl Send for AsyncCheckpointer
impl Sync for AsyncCheckpointer
impl Unpin for AsyncCheckpointer
impl UnsafeUnpin for AsyncCheckpointer
impl UnwindSafe for AsyncCheckpointer
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value. Read more
Source§impl<T> DistributionExt for Twhere
T: ?Sized,
impl<T> DistributionExt for Twhere
T: ?Sized,
Source§impl<T> Instrument for T
impl<T> Instrument for T
Source§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
Source§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read more