runctl library
This library provides the core functionality for the runctl CLI, a unified tool for ML training orchestration across multiple cloud providers (AWS, RunPod, Lyceum AI).
§Architecture
The library follows industry patterns from Terraform (plugin registry), Pulumi (component model),
and Kubernetes (CRD extensibility). See docs/ARCHITECTURE.md for details.
§Key Modules
- Provider System: provider and providers modules for multi-cloud abstraction
- Error Handling: error module with structured error types and retry awareness
- Resource Tracking: resource_tracking for cost awareness and lifecycle management
- Retry Logic: retry module with exponential backoff for cloud API calls
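The retry-aware error handling and exponential backoff described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not runctl's actual implementation: `ApiError`, `backoff_delay`, and `retry_with_backoff` are hypothetical names invented for this example, and the crate's own `RetryPolicy`, `ExponentialBackoffPolicy`, and `IsRetryable` items may have different signatures and will most likely be async.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type for illustration only (not runctl's `TrainctlError`).
#[derive(Debug, PartialEq)]
enum ApiError {
    /// Transient failure, e.g. the cloud API throttled the request.
    Throttled,
    /// Permanent failure; retrying cannot help.
    InvalidRequest,
}

impl ApiError {
    /// Assumed stand-in for a "retry awareness" check.
    fn is_retryable(&self) -> bool {
        matches!(self, ApiError::Throttled)
    }
}

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Calls `op` until it succeeds, fails with a non-retryable error,
/// or has been attempted `max_attempts` times.
fn retry_with_backoff<T>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, ApiError>,
) -> Result<T, ApiError> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Only transient errors are retried, and only while attempts remain.
            Err(e) if e.is_retryable() && attempt + 1 < max_attempts => {
                thread::sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
            // Permanent errors and exhausted retries are returned to the caller.
            Err(e) => return Err(e),
        }
    }
}
```

Capping the delay keeps a long run of throttling responses from producing unbounded waits, and checking retryability first means a permanent error such as an invalid request is surfaced immediately instead of being retried.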
§Usage
§Basic Example
use runctl::{Config, ResourceTracker};
// Load configuration
let config = Config::load(None)?;
// Track resources
let tracker = ResourceTracker::new();
let running = tracker.get_running().await;
§Using Convenience Re-exports
Common types are re-exported at the crate root for convenience:
use runctl::{Config, Result, TrainctlError};
use runctl::{CreateInstanceOptions, TrainInstanceOptions};
let config = Config::load(None)?;
// Use re-exported types directly
§Provider Trait (Future)
The provider trait system is defined but not yet used by the CLI. When multi-cloud support is enabled:
use runctl::{Config, TrainingProvider};
let config = Config::load(None)?;
// let provider = config.get_provider("aws")?;
// let resource_id = provider.create_resource("g4dn.xlarge", options).await?;
Re-exports§
pub use error::ConfigError;
pub use error::IsRetryable;
pub use error::Result;
pub use error::TrainctlError;
pub use provider::CreateResourceOptions;
pub use provider::ResourceState;
pub use provider::ResourceStatus;
pub use provider::TrainingJob;
pub use provider::TrainingProvider;
pub use providers::ProviderRegistry;
pub use resource_tracking::ResourceTracker;
pub use resource_tracking::ResourceUsage;
pub use resource_tracking::TrackedResource;
pub use retry::ExponentialBackoffPolicy;
pub use retry::RetryPolicy;
pub use safe_cleanup::safe_cleanup;
pub use safe_cleanup::CleanupResult;
pub use safe_cleanup::CleanupSafety;
pub use training::TrainingSession;
pub use training::TrainingStatus;
pub use validation::validate_path;
pub use validation::validate_path_path;
pub use aws::CreateInstanceOptions;
pub use aws::TrainInstanceOptions;
pub use config::Config;
pub use resources::estimate_instance_cost;
Modules§
- aws: AWS EC2 operations module
- aws_utils: Common AWS utilities shared across modules
- checkpoint: Checkpoint management
- config: Configuration management
- dashboard: Interactive dashboard for monitoring resources and processes
- data_transfer: Easy data transfer between local, S3, and training environments
- diagnostics: Diagnostic and resource monitoring utilities
- docker: Docker container support for runctl
- ebs: EBS Volume Management for AWS
- ebs_optimization: EBS Volume Optimization Utilities
- error: Error types for runctl
- error_helpers: Helper functions for creating actionable error messages
- fast_data_loading: Fast data loading optimizations for training
- local: Local training execution
- monitor: Training monitoring
- provider: Provider-agnostic trait definitions for cloud training platforms
- providers: Provider implementations for different cloud platforms
- resource_tracking: Resource tracking and cost awareness
- resources: Resource management module
- retry: Retry logic with exponential backoff
- runpod: RunPod integration
- s3: S3 operations module
- safe_cleanup: Safe cleanup and teardown operations
- ssh_sync: Native Rust SSH-based code syncing
- training: Training session tracking
- utils: Common utility functions for runctl
- validation: Input validation utilities
- workflow: Workflow commands for complete training workflows