pub struct Zero3CpuOffloadManager { /* private fields */ }
Main ZeRO-3 CPU offload manager that orchestrates all components
This is the primary interface for ZeRO-3 operations, providing a unified API that coordinates between all the specialized modules. It maintains the same interface as the original monolithic implementation for backward compatibility.
Implementations
impl Zero3CpuOffloadManager
pub fn new(
    config: Zero3CpuOffloadConfig,
    process_group: Arc<ProcessGroup>,
    model_parameters: &ConfigModelParameters,
) -> TorshResult<Self>
Create a new ZeRO-3 CPU offload manager
Initializes all component systems and establishes distributed coordination. The manager will automatically partition parameters, gradients, and optimizer states according to the ZeRO-3 algorithm.
Arguments
- config: Configuration for ZeRO-3 behavior and memory management
- process_group: Distributed process group for coordination
- model_parameters: Description of model parameters to be managed
Returns
Returns a configured ZeRO-3 manager ready for training operations.
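The partitioning mentioned above can be illustrated with a minimal sketch. It assumes a flat parameter buffer split as evenly as possible across ranks; `shard_range` is a hypothetical helper written for this illustration and is not part of the documented API, and the manager's actual partitioning scheme may differ.

```rust
/// Half-open element range `[start, end)` owned by `rank` when `numel`
/// elements are split as evenly as possible across `world_size` ranks.
/// Hypothetical helper for illustration only.
fn shard_range(numel: usize, world_size: usize, rank: usize) -> (usize, usize) {
    assert!(world_size > 0 && rank < world_size);
    let base = numel / world_size;
    let rem = numel % world_size;
    // The first `rem` ranks each take one extra element.
    let start = rank * base + rank.min(rem);
    let len = base + usize::from(rank < rem);
    (start, start + len)
}
```

With 10 elements and 4 ranks, ranks 0 and 1 own three elements each and ranks 2 and 3 own two each, so every element has exactly one owner.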
pub async fn forward_pass(
    &mut self,
    input: &Tensor<f32>,
    layer_names: &[String],
) -> TorshResult<Tensor<f32>>
Execute forward pass with ZeRO-3 CPU offloading
Processes each layer with intelligent parameter management:
- Prefetches parameters for upcoming layers
- Ensures current layer parameters are on GPU
- Executes layer computation
- Optionally offloads parameters back to CPU
- Performs memory optimization as needed
Arguments
- input: Input tensor for the forward pass
- layer_names: Ordered list of layer names to execute
Returns
Returns the output tensor after processing all layers.
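The per-layer sequence listed above can be sketched with a toy residency tracker. Everything here is an assumption made for illustration: `Residency`, `forward_sketch`, `prefetch_distance`, and `offload_after_use` are hypothetical names, the code is synchronous, and it records events instead of moving tensors.

```rust
use std::collections::HashSet;

/// Hypothetical tracker of which layers' parameters are resident on the GPU.
struct Residency {
    on_gpu: HashSet<String>,
    log: Vec<String>,
}

impl Residency {
    fn new() -> Self {
        Self { on_gpu: HashSet::new(), log: Vec::new() }
    }

    /// Brings a layer's parameters to the GPU unless they are already there.
    fn fetch(&mut self, layer: &str, reason: &str) {
        if self.on_gpu.insert(layer.to_string()) {
            self.log.push(format!("{} {}", reason, layer));
        }
    }

    /// Moves a layer's parameters back to the CPU if they are on the GPU.
    fn offload(&mut self, layer: &str) {
        if self.on_gpu.remove(layer) {
            self.log.push(format!("offload {}", layer));
        }
    }
}

/// Walks the layers in order and returns the sequence of events.
fn forward_sketch(
    layers: &[String],
    prefetch_distance: usize,
    offload_after_use: bool,
) -> Vec<String> {
    let mut r = Residency::new();
    for (i, layer) in layers.iter().enumerate() {
        // 1. Prefetch parameters for upcoming layers.
        for next in layers.iter().skip(i + 1).take(prefetch_distance) {
            r.fetch(next, "prefetch");
        }
        // 2. Ensure the current layer's parameters are on the GPU.
        r.fetch(layer, "fetch");
        // 3. Execute the layer (recorded only; no real computation here).
        r.log.push(format!("compute {}", layer));
        // 4. Optionally offload the parameters back to the CPU.
        if offload_after_use {
            r.offload(layer);
        }
    }
    r.log
}
```

In this sketch only the first layer needs a blocking fetch when `prefetch_distance` is at least 1; every later layer was already prefetched while its predecessor ran.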
pub async fn backward_pass(
    &mut self,
    grad_output: &Tensor<f32>,
    layer_names: &[String],
) -> TorshResult<()>
Execute backward pass with ZeRO-3 CPU offloading
Processes layers in reverse order for gradient computation:
- Ensures parameters are available for gradient computation
- Computes gradients for each layer
- Partitions and manages gradients according to ZeRO-3
- Performs all-reduce synchronization across ranks
Arguments
- grad_output: Gradient tensor from the loss function
- layer_names: Ordered list of layer names (processed in reverse)
Returns
Returns Ok(()) when backward pass completes successfully.
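The reverse traversal and gradient handling described above can be sketched in-process. This is an illustration under stated assumptions, not the manager's implementation: ranks are simulated as plain vectors, the all-reduce is assumed to average gradients, and `backward_order`, `all_reduce_mean`, and `owned_shard` are hypothetical helpers.

```rust
/// Order in which layers are visited during the backward pass.
fn backward_order(layer_names: &[String]) -> Vec<&str> {
    layer_names.iter().rev().map(String::as_str).collect()
}

/// Simulated all-reduce with averaging: returns the element-wise mean of
/// the per-rank gradients, which every rank would hold afterwards.
fn all_reduce_mean(per_rank: &[Vec<f32>]) -> Vec<f32> {
    assert!(!per_rank.is_empty());
    let world = per_rank.len() as f32;
    let n = per_rank[0].len();
    (0..n)
        .map(|i| per_rank.iter().map(|g| g[i]).sum::<f32>() / world)
        .collect()
}

/// Slice of the reduced gradient that `rank` keeps under ZeRO-3 partitioning
/// (assumes contiguous chunks of ceil(len / world_size) elements).
fn owned_shard(reduced: &[f32], world_size: usize, rank: usize) -> &[f32] {
    let chunk = (reduced.len() + world_size - 1) / world_size;
    let start = (rank * chunk).min(reduced.len());
    let end = (start + chunk).min(reduced.len());
    &reduced[start..end]
}
```

Each rank only needs to retain its own shard of the reduced gradient, which is why the sketch slices the result after the reduction.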
pub async fn optimizer_step(&mut self, learning_rate: f32) -> TorshResult<()>
Update optimizer states and parameters with ZeRO-3 partitioning
Performs optimizer step with intelligent state management:
- Gathers partitioned gradients for owned parameters
- Fetches optimizer states from CPU if needed
- Computes parameter updates using optimizer algorithm
- Updates parameters and stores back to appropriate location
- Broadcasts updates to all ranks that need them
Arguments
- learning_rate: Learning rate for parameter updates
Returns
Returns Ok(()) when optimizer step completes successfully.
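The update of owned parameters can be sketched as follows. This assumes plain SGD, which the source does not confirm as the optimizer in use, and it stands in for the final broadcast with a simulated all-gather that concatenates shards in rank order. `sgd_step_on_shard` and `all_gather` are hypothetical helpers.

```rust
/// Plain SGD applied to the shard this rank owns: p <- p - lr * g.
/// Assumption: the real optimizer algorithm may be different.
fn sgd_step_on_shard(params: &mut [f32], grads: &[f32], learning_rate: f32) {
    assert_eq!(params.len(), grads.len());
    for (p, g) in params.iter_mut().zip(grads) {
        *p -= learning_rate * g;
    }
}

/// Simulated all-gather: concatenates the updated shards in rank order so
/// that every rank could rebuild the full parameter buffer.
fn all_gather(shards: &[Vec<f32>]) -> Vec<f32> {
    shards.iter().flatten().copied().collect()
}
```

Because each rank updates only its own shard, optimizer state is needed only for that shard, which is the memory saving ZeRO-3 relies on.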
pub fn get_performance_stats(&self) -> Zero3PerformanceStats
Get comprehensive performance statistics
Returns detailed performance metrics including timing, throughput, memory usage, and efficiency measurements.
pub fn get_memory_stats(&self) -> Zero3MemoryStats
Get memory usage statistics
Returns current memory usage across CPU and GPU, including parameter distribution and compression effectiveness.
pub async fn force_memory_optimization(&self) -> TorshResult<()>
Force immediate memory optimization
Triggers aggressive memory optimization regardless of current pressure. Useful for cleaning up before checkpointing or when memory is critically low.
pub fn get_prefetch_status(&self) -> PrefetchQueueStatus
Get prefetch scheduler status
Returns information about current prefetch operations and queue status.
pub async fn adapt_performance(&self) -> TorshResult<()>
Adapt system performance based on runtime metrics
Analyzes recent performance and adjusts prefetch strategies, memory management policies, and other adaptive parameters.
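One way such adaptation could work is a simple feedback rule on the prefetch distance. The sketch below is purely illustrative: the `Metrics` fields, the thresholds, and `adapt_prefetch_distance` are all assumptions and do not describe the policy the manager actually applies.

```rust
/// Hypothetical runtime metrics; the field names are assumptions.
struct Metrics {
    /// Fraction of layers whose parameters were already on the GPU when needed.
    prefetch_hit_rate: f32,
    /// Fraction of GPU memory currently in use, between 0.0 and 1.0.
    gpu_memory_pressure: f32,
}

/// Returns the next prefetch distance given the current one and recent metrics.
fn adapt_prefetch_distance(current: usize, m: &Metrics) -> usize {
    const MAX_DISTANCE: usize = 8; // assumed upper bound
    if m.gpu_memory_pressure > 0.9 {
        // Memory is tight: prefetch less, but keep at least one layer ahead.
        current.saturating_sub(1).max(1)
    } else if m.prefetch_hit_rate < 0.8 {
        // Too many stalls: prefetch further ahead, up to the cap.
        (current + 1).min(MAX_DISTANCE)
    } else {
        current
    }
}
```

Memory pressure is checked first so that a low hit rate never pushes the distance up while the GPU is close to running out of memory.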
pub async fn reset_state(&self) -> TorshResult<()>
Clear all caches and reset state
Useful for testing or when switching between different models.
Auto Trait Implementations
impl !Freeze for Zero3CpuOffloadManager
impl !RefUnwindSafe for Zero3CpuOffloadManager
impl Send for Zero3CpuOffloadManager
impl Sync for Zero3CpuOffloadManager
impl Unpin for Zero3CpuOffloadManager
impl UnsafeUnpin for Zero3CpuOffloadManager
impl !UnwindSafe for Zero3CpuOffloadManager
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.