Crate runctl


runctl library

This library provides the core functionality for the runctl CLI, a unified tool for orchestrating ML training across multiple cloud providers (AWS, RunPod, Lyceum AI).

§Architecture

The library follows industry patterns from Terraform (plugin registry), Pulumi (component model), and Kubernetes (CRD extensibility). See docs/ARCHITECTURE.md for details.

§Key Modules

  • Provider System: provider and providers modules for multi-cloud abstraction
  • Error Handling: error module with structured error types and retry awareness
  • Resource Tracking: resource_tracking for cost awareness and lifecycle management
  • Retry Logic: retry module with exponential backoff for cloud API calls
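The retry-aware error handling and exponential backoff described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the crate's implementation: `ApiError`, `Retryable`, `backoff_delay`, and `retry_with_backoff` are hypothetical names, and the real `IsRetryable`, `RetryPolicy`, and `ExponentialBackoffPolicy` types may have different signatures.

```rust
use std::time::Duration;

/// Hypothetical error type for illustration (not runctl's `TrainctlError`).
#[derive(Debug, PartialEq)]
enum ApiError {
    Throttled,
    InvalidInput,
}

/// Hypothetical stand-in for a retry-awareness trait.
trait Retryable {
    fn is_retryable(&self) -> bool;
}

impl Retryable for ApiError {
    fn is_retryable(&self) -> bool {
        // Only transient failures are worth retrying.
        matches!(self, ApiError::Throttled)
    }
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Calls `op` up to `max_attempts` times, sleeping between retryable failures.
fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: F,
) -> Result<T, E>
where
    E: Retryable,
    F: FnMut(u32) -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(e) if e.is_retryable() && attempt + 1 < max_attempts => {
                std::thread::sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
            // Non-retryable error, or attempts exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

Separating "is this error retryable?" from "how long to wait?" keeps the backoff policy reusable across providers. Production code would usually also add jitter to the delay.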

§Usage

§Basic Example

use runctl::{Config, ResourceTracker};

// Load configuration (`?` and `.await` require an async function that returns a Result)
let config = Config::load(None)?;

// Track resources
let tracker = ResourceTracker::new();
let running = tracker.get_running().await;
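To illustrate what tracking running resources with cost awareness can involve, here is a small self-contained sketch. It does not show how `ResourceTracker` or `estimate_instance_cost` work: `State`, `Tracked`, `accrued_cost`, and `running` are hypothetical names, and it assumes cost is simply the hourly rate multiplied by uptime.

```rust
use std::time::Duration;

/// Hypothetical lifecycle state (the crate's `ResourceState` may differ).
#[derive(Debug, Clone, PartialEq)]
enum State {
    Running,
    Stopped,
}

/// Hypothetical tracked resource with a flat hourly price.
#[derive(Debug, Clone)]
struct Tracked {
    id: String,
    state: State,
    hourly_usd: f64,
    uptime: Duration,
}

impl Tracked {
    /// Cost accrued so far, assuming linear hourly billing.
    fn accrued_cost(&self) -> f64 {
        self.hourly_usd * self.uptime.as_secs_f64() / 3600.0
    }
}

/// Returns only the resources that are currently running.
fn running(resources: &[Tracked]) -> Vec<&Tracked> {
    resources
        .iter()
        .filter(|r| r.state == State::Running)
        .collect()
}
```

Real cloud billing often rounds by the second or the hour and varies by region, so a linear estimate like this is only a first approximation.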

§Using Convenience Re-exports

Common types are re-exported at the crate root for convenience:

use runctl::{Config, Result, TrainctlError};
use runctl::{CreateInstanceOptions, TrainInstanceOptions};

let config = Config::load(None)?;
// Use re-exported types directly

§Provider Trait (Future)

The provider trait system is defined but not yet used by the CLI. Once multi-cloud support is enabled, usage is expected to look like this:

use runctl::{Config, TrainingProvider};

let config = Config::load(None)?;
// let provider = config.get_provider("aws")?;
// let resource_id = provider.create_resource("g4dn.xlarge", options).await?;
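The plugin-registry pattern behind this design can be sketched as follows. This is a simplified illustration under stated assumptions: `Provider`, `MockProvider`, and `Registry` are hypothetical, and the trait is synchronous for brevity, whereas the commented-out calls above suggest the real `TrainingProvider` methods are async and take an options argument.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified provider trait (not runctl's `TrainingProvider`).
trait Provider {
    fn name(&self) -> &str;
    fn create_resource(&mut self, instance_type: &str) -> Result<String, String>;
}

/// In-memory provider that fabricates resource IDs, for demonstration only.
struct MockProvider {
    next_id: u32,
}

impl Provider for MockProvider {
    fn name(&self) -> &str {
        "mock"
    }

    fn create_resource(&mut self, instance_type: &str) -> Result<String, String> {
        if instance_type.is_empty() {
            return Err("instance type must not be empty".to_string());
        }
        self.next_id += 1;
        Ok(format!("mock-{}-{}", instance_type, self.next_id))
    }
}

/// Hypothetical registry keyed by provider name (not runctl's `ProviderRegistry`).
#[derive(Default)]
struct Registry {
    providers: HashMap<String, Box<dyn Provider>>,
}

impl Registry {
    /// Registers a provider under the name it reports.
    fn register(&mut self, provider: Box<dyn Provider>) {
        self.providers.insert(provider.name().to_string(), provider);
    }

    /// Looks up a provider by name, returning `None` if it was never registered.
    fn get_mut(&mut self, name: &str) -> Option<&mut Box<dyn Provider>> {
        self.providers.get_mut(name)
    }
}
```

Callers depend only on the trait, so adding a new cloud means registering another implementation rather than changing call sites.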

Re-exports§

pub use error::ConfigError;
pub use error::IsRetryable;
pub use error::Result;
pub use error::TrainctlError;
pub use provider::CreateResourceOptions;
pub use provider::ResourceState;
pub use provider::ResourceStatus;
pub use provider::TrainingJob;
pub use provider::TrainingProvider;
pub use providers::ProviderRegistry;
pub use resource_tracking::ResourceTracker;
pub use resource_tracking::ResourceUsage;
pub use resource_tracking::TrackedResource;
pub use retry::ExponentialBackoffPolicy;
pub use retry::RetryPolicy;
pub use safe_cleanup::safe_cleanup;
pub use safe_cleanup::CleanupResult;
pub use safe_cleanup::CleanupSafety;
pub use training::TrainingSession;
pub use training::TrainingStatus;
pub use validation::validate_path;
pub use validation::validate_path_path;
pub use aws::CreateInstanceOptions;
pub use aws::TrainInstanceOptions;
pub use config::Config;
pub use resources::estimate_instance_cost;

Modules§

aws
AWS EC2 operations module
aws_utils
Common AWS utilities shared across modules
checkpoint
Checkpoint management
config
Configuration management
dashboard
Interactive dashboard for monitoring resources and processes
data_transfer
Easy data transfer between local, S3, and training environments
diagnostics
Diagnostic and resource monitoring utilities
docker
Docker container support for runctl
ebs
EBS Volume Management for AWS
ebs_optimization
EBS Volume Optimization Utilities
error
Error types for runctl
error_helpers
Helper functions for creating actionable error messages
fast_data_loading
Fast data loading optimizations for training
local
Local training execution
monitor
Training monitoring
provider
Provider-agnostic trait definitions for cloud training platforms
providers
Provider implementations for different cloud platforms
resource_tracking
Resource tracking and cost awareness
resources
Resource management module
retry
Retry logic with exponential backoff
runpod
RunPod integration
s3
S3 operations module
safe_cleanup
Safe cleanup and teardown operations
ssh_sync
Native Rust SSH-based code syncing
training
Training session tracking
utils
Common utility functions for runctl
validation
Input validation utilities
workflow
Workflow commands for complete training workflows