DAA Prime Trainer
Distributed SGD/FSDP trainer implementation for the Prime distributed machine learning framework. Provides fault-tolerant, scalable training coordination with built-in incentive mechanisms through DAA ecosystem integration.
Overview
DAA Prime Trainer implements a robust distributed training node that participates in federated learning and distributed training protocols. It provides:
- Distributed SGD: Scalable stochastic gradient descent across multiple nodes
- FSDP Support: Fully Sharded Data Parallel training for large models
- Fault Tolerance: Automatic recovery from node failures and network partitions
- DAA Integration: Token-based incentives and governance participation
- Flexible Architecture: Pluggable optimizers and aggregation strategies
Features
- 🚀 High Performance: Optimized gradient computation and communication
- 🔄 Fault Tolerant: Automatic failure detection and recovery
- 🏆 Incentivized: Token rewards for quality contributions
- 📊 Comprehensive Metrics: Detailed training and performance monitoring
- 🔒 Secure: Cryptographic verification of gradient updates
- 🌐 Network Agnostic: Works with any transport layer
Installation
Add this to your Cargo.toml:
[dependencies]
daa-prime-trainer = "0.2.1"
# companion workspace crates at the same version (names assumed; adjust to the crates you actually use)
daa-prime-core = "0.2.1"
daa-prime-dht = "0.2.1"
tokio = { version = "1.0", features = ["full"] }
Quick Start
Basic Training Node
use daa_prime_trainer::TrainerNode;
use daa_prime_core::Result; // import path assumed

#[tokio::main]
async fn main() -> Result<()> {
    let trainer = TrainerNode::new(/* ... */).await?; // constructor arguments elided
    trainer.start_training().await?;
    Ok(())
}
Custom Training Configuration
use daa_prime_trainer::{TrainerNode, TrainingConfig};

let config = TrainingConfig { /* ... */ }; // field values elided

// Create trainer with custom configuration
let trainer = TrainerNode::with_config(config).await?;
Running as Binary
The crate also provides a standalone binary for direct execution:
# Start a trainer node from a source checkout
cargo run --release

# Or install and run the crate's binary
cargo install daa-prime-trainer
Core Concepts
Training Lifecycle
use daa_prime_trainer::TrainerNode;

let trainer = TrainerNode::new(/* ... */).await?;
// Training goes through several phases:
// 1. Initialization - Set up local model and data
// 2. Gradient Computation - Compute local gradients
// 3. Communication - Share gradients with coordinators
// 4. Aggregation - Receive aggregated updates
// 5. Model Update - Apply updates to local model
trainer.start_training().await?;
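To make the phase sequence concrete, here is a minimal, self-contained sketch of one round as a state machine. The TrainingPhase enum and the transition function are illustrative assumptions, not types exported by this crate.

#[derive(Debug, Clone, Copy)]
enum TrainingPhase {
    Initialization,
    GradientComputation,
    Communication,
    Aggregation,
    ModelUpdate,
}

// Advance to the next phase of a round, or None when the round is complete.
fn next_phase(phase: TrainingPhase) -> Option<TrainingPhase> {
    use TrainingPhase::*;
    match phase {
        Initialization => Some(GradientComputation),
        GradientComputation => Some(Communication),
        Communication => Some(Aggregation),
        Aggregation => Some(ModelUpdate),
        ModelUpdate => None,
    }
}

fn main() {
    let mut phase = TrainingPhase::Initialization;
    loop {
        println!("entering phase: {:?}", phase);
        match next_phase(phase) {
            Some(next) => phase = next,
            None => break,
        }
    }
}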
Distributed Gradient Computation
use daa_prime_trainer::GradientComputer; // type name assumed from context
use daa_prime_core::GradientUpdate;      // import path assumed

// Compute gradients on local data
let computer = GradientComputer::new(/* ... */);
let local_gradients = computer.compute_batch_gradients(/* ... */).await?;

// Create gradient update for sharing
let update = GradientUpdate { /* ... */ };
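For intuition about what the local step computes, independent of the crate's API, the following self-contained example takes a mini-batch gradient of squared error for a one-dimensional linear model and applies plain SGD. All names and data here are illustrative.

// Illustrative local SGD step: mini-batch gradient of mean squared error
// for a linear model y ≈ w*x + b. No crate types are used here.
fn batch_gradient(w: f64, b: f64, batch: &[(f64, f64)]) -> (f64, f64) {
    let n = batch.len() as f64;
    let (mut dw, mut db) = (0.0, 0.0);
    for &(x, y) in batch {
        let err = (w * x + b) - y; // prediction error
        dw += 2.0 * err * x / n;   // d/dw of the mean squared error
        db += 2.0 * err / n;       // d/db of the mean squared error
    }
    (dw, db)
}

fn main() {
    let batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)];
    let (mut w, mut b) = (0.0, 0.0);
    let lr = 0.05;
    for _ in 0..1000 {
        let (dw, db) = batch_gradient(w, b, &batch);
        w -= lr * dw; // local SGD update; in distributed SGD these gradients
        b -= lr * db; // would be shared and aggregated before being applied
    }
    println!("w ≈ {:.2}, b ≈ {:.2}", w, b);
}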
Fault Tolerance
use daa_prime_trainer::{TrainerNode, FaultToleranceConfig};

let fault_config = FaultToleranceConfig { /* ... */ };
let trainer = TrainerNode::with_fault_tolerance(fault_config).await?; // argument list assumed
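The general pattern behind automatic recovery from transient failures is bounded retry with exponential backoff. The sketch below is generic Tokio code, not the crate's mechanism; flaky_rpc stands in for any call to a peer or coordinator.

use std::time::Duration;

// Simulated unreliable call: fails for the first two attempts, then succeeds.
async fn flaky_rpc(attempt: u32) -> Result<&'static str, &'static str> {
    if attempt < 3 { Err("peer unreachable") } else { Ok("gradient ack") }
}

// Retry with exponential backoff, giving up after `max_attempts`.
async fn call_with_retry(max_attempts: u32) -> Result<&'static str, &'static str> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match flaky_rpc(attempt).await {
            Ok(reply) => return Ok(reply),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(e) => {
                eprintln!("attempt {attempt} failed ({e}); retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    match call_with_retry(5).await {
        Ok(reply) => println!("recovered: {reply}"),
        Err(e) => println!("giving up: {e}"),
    }
}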
Advanced Usage
Custom Data Loading
use daa_prime_trainer::TrainerNode;
use async_trait::async_trait;

// Implement a custom data loader (see the sketch below for what the trait
// implementation typically looks like)

// Use the custom data loader with the trainer
let data_loader = CustomDataLoader { /* ... */ };
let trainer = TrainerNode::with_data_loader(data_loader).await?;
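The trait a custom loader implements is not shown above, so the following is a hedged sketch of what an async data-loading trait and implementation typically look like with async_trait. The DataLoader trait, its method, and the CustomDataLoader fields are assumptions for illustration.

use async_trait::async_trait;

// Hypothetical async data-loading trait; the real trait exposed by the
// crate may differ in name and shape.
#[async_trait]
trait DataLoader: Send + Sync {
    /// Return the next mini-batch of (features, label) pairs, or None when exhausted.
    async fn next_batch(&mut self, batch_size: usize) -> Option<Vec<(Vec<f32>, f32)>>;
}

// Example implementation backed by an in-memory dataset.
struct CustomDataLoader {
    data: Vec<(Vec<f32>, f32)>,
    cursor: usize,
}

#[async_trait]
impl DataLoader for CustomDataLoader {
    async fn next_batch(&mut self, batch_size: usize) -> Option<Vec<(Vec<f32>, f32)>> {
        if self.cursor >= self.data.len() {
            return None;
        }
        let end = (self.cursor + batch_size).min(self.data.len());
        let batch = self.data[self.cursor..end].to_vec();
        self.cursor = end;
        Some(batch)
    }
}

#[tokio::main]
async fn main() {
    let mut loader = CustomDataLoader {
        data: vec![(vec![0.0, 1.0], 1.0), (vec![1.0, 0.0], 0.0)],
        cursor: 0,
    };
    while let Some(batch) = loader.next_batch(1).await {
        println!("loaded batch of {} sample(s)", batch.len());
    }
}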
Federated Learning
use daa_prime_trainer::TrainerNode;
use daa_prime_core::FederatedConfig; // import path assumed

// Configure for federated learning
let fed_config = FederatedConfig { /* ... */ };
let trainer = TrainerNode::with_federated_config(fed_config).await?;

// Participate in federated training rounds
trainer.join_federated_round(/* round id */).await?;
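To make the aggregation side of a federated round concrete, here is the federated-averaging rule from McMahan et al. (see References) in plain Rust: each participant's parameters are weighted by its local sample count. This is a generic illustration and does not call the crate.

// Federated averaging (FedAvg): weight each client's parameters by the
// number of samples it trained on, then normalize by the total count.
fn fed_avg(clients: &[(Vec<f64>, usize)]) -> Vec<f64> {
    let dim = clients[0].0.len();
    let total: usize = clients.iter().map(|(_, n)| n).sum();
    let mut global = vec![0.0; dim];
    for (params, n) in clients {
        let weight = *n as f64 / total as f64;
        for (g, p) in global.iter_mut().zip(params) {
            *g += weight * p;
        }
    }
    global
}

fn main() {
    // Two clients with different amounts of local data.
    let clients = vec![
        (vec![1.0, 2.0], 100), // 100 samples
        (vec![3.0, 4.0], 300), // 300 samples
    ];
    // Expected result: 0.25*[1,2] + 0.75*[3,4] = [2.5, 3.5]
    println!("{:?}", fed_avg(&clients));
}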
Model Checkpointing
use daa_prime_trainer::TrainerNode;

let trainer = TrainerNode::new(/* ... */).await?;

// Save checkpoint
trainer.save_checkpoint(/* ... */).await?;

// Load checkpoint
trainer.load_checkpoint(/* ... */).await?;

// List available checkpoints
let checkpoints = trainer.list_checkpoints().await?;
for checkpoint in checkpoints {
    /* ... */
}
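Independent of the trainer's API, a checkpoint is just serialized model state on durable storage. The sketch below round-trips a small parameter struct with serde and serde_json; the Checkpoint struct and file path are illustrative.

use serde::{Deserialize, Serialize};
use std::fs;

// Illustrative checkpoint payload; real checkpoints would also carry
// optimizer state, the training round, and integrity metadata.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Checkpoint {
    round: u64,
    weights: Vec<f32>,
}

fn main() -> std::io::Result<()> {
    let ckpt = Checkpoint { round: 42, weights: vec![0.1, 0.2, 0.3] };

    // Save: serialize to JSON and write it to a temporary path.
    let path = std::env::temp_dir().join("trainer-checkpoint.json");
    fs::write(&path, serde_json::to_string(&ckpt).expect("serialize"))?;

    // Load: read the file back and deserialize.
    let restored: Checkpoint =
        serde_json::from_str(&fs::read_to_string(&path)?).expect("deserialize");
    assert_eq!(ckpt, restored);
    println!("restored checkpoint from round {}", restored.round);
    Ok(())
}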
Performance Monitoring
use daa_prime_trainer::TrainerNode;

let trainer = TrainerNode::new(/* ... */).await?;

// Start metrics collection (collector type name assumed)
let metrics_collector = MetricsCollector::new(/* ... */);
trainer.set_metrics_collector(metrics_collector).await?;

// Get real-time metrics
loop {
    /* poll and report trainer metrics here */
}
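As a stand-alone illustration of real-time monitoring (not the crate's collector), the loop below samples a shared step counter once per second and reports throughput in steps per second.

use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Shared counter incremented by the (simulated) training loop.
    let steps = Arc::new(AtomicU64::new(0));

    // Simulated training work: bump the counter every 10 ms.
    let worker_steps = Arc::clone(&steps);
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_millis(10)).await;
            worker_steps.fetch_add(1, Ordering::Relaxed);
        }
    });

    // Metrics loop: sample once per second and report steps/sec.
    let mut last = 0;
    for _ in 0..3 {
        tokio::time::sleep(Duration::from_secs(1)).await;
        let now = steps.load(Ordering::Relaxed);
        println!("throughput: {} steps/sec", now - last);
        last = now;
    }
}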
Integration with DHT
use daa_prime_trainer::TrainerNode;
use daa_prime_dht::Dht; // type name assumed

// Create trainer with DHT integration
let dht = Dht::new(/* ... */); // constructor shape assumed
let trainer = TrainerNode::with_dht(dht).await?;
// Trainer will automatically:
// - Store training checkpoints in DHT
// - Retrieve model updates from DHT
// - Share gradient updates via DHT
// - Discover other training nodes
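The behaviors above reduce to putting and getting content-addressed values. The sketch below shows one common scheme, keying a checkpoint blob by the SHA-256 hash of its contents, with an in-memory HashMap standing in for the DHT; it does not use daa-prime-dht's actual API.

use sha2::{Digest, Sha256};
use std::collections::HashMap;

// In-memory stand-in for a DHT: content hash -> value bytes.
type FakeDht = HashMap<Vec<u8>, Vec<u8>>;

// Store a blob under the SHA-256 of its contents and return the key, so any
// node that knows the key can fetch the blob and verify its integrity.
fn put(dht: &mut FakeDht, blob: Vec<u8>) -> Vec<u8> {
    let key = Sha256::digest(&blob).to_vec();
    dht.insert(key.clone(), blob);
    key
}

fn get<'a>(dht: &'a FakeDht, key: &[u8]) -> Option<&'a Vec<u8>> {
    dht.get(key)
}

fn main() {
    let mut dht = FakeDht::new();
    let checkpoint = b"round=7;weights=...".to_vec();
    let key = put(&mut dht, checkpoint);

    // Retrieval plus integrity check: the key doubles as the expected hash.
    if let Some(blob) = get(&dht, &key) {
        let verified = Sha256::digest(blob).to_vec() == key;
        println!("fetched {} bytes, hash verified: {}", blob.len(), verified);
    }
}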
Configuration
Training Parameters
use daa_prime_trainer::TrainingConfig;

let config = TrainingConfig { /* ... */ }; // field values elided
Network Configuration
use daa_prime_trainer::NetworkConfig; // import path assumed

let network_config = NetworkConfig { /* ... */ }; // field values elided
Command Line Interface
The trainer can be run as a standalone binary:
Typical invocations cover:
- basic usage with default settings
- supplying a custom configuration file
- specifying coordinator endpoints
- enabling verbose logging
- setting the data directory
- joining a specific training round
Configuration File
Trainer settings can also be supplied through a configuration file passed on the command line (see the invocations above).
Testing
Unit Tests
Run the unit test suite with cargo test.
Integration Tests
Integration tests also run under cargo test.
Property-Based Testing
use proptest::prelude::*;

proptest! {
    // property tests over randomly generated inputs (bodies elided)
}
Performance Optimization
Memory Management
use daa_prime_trainer::{TrainerNode, MemoryConfig};

let memory_config = MemoryConfig { /* ... */ };
let trainer = TrainerNode::with_memory_config(memory_config).await?;
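A generic memory-saving technique in this setting (separate from whatever MemoryConfig exposes) is gradient accumulation: sum gradients over several micro-batches and apply one optimizer step, so the effective batch size grows without the full batch ever being resident. A self-contained sketch:

// Gradient accumulation: accumulate micro-batch gradients and apply one
// update per `accum_steps` micro-batches. The model is a single scalar
// weight fitting y ≈ w * x, purely for illustration.
fn micro_batch_grad(w: f64, batch: &[(f64, f64)]) -> f64 {
    let n = batch.len() as f64;
    batch.iter().map(|&(x, y)| 2.0 * ((w * x) - y) * x / n).sum()
}

fn main() {
    let data: Vec<(f64, f64)> = (1..=8).map(|i| (i as f64, 3.0 * i as f64)).collect();
    let (mut w, lr, accum_steps) = (0.0_f64, 0.01, 4);

    for _epoch in 0..50 {
        let mut acc = 0.0;
        for (i, micro) in data.chunks(2).enumerate() {
            acc += micro_batch_grad(w, micro); // only one micro-batch in flight
            if (i + 1) % accum_steps == 0 {
                w -= lr * (acc / accum_steps as f64); // one optimizer step
                acc = 0.0;
            }
        }
    }
    println!("w ≈ {:.2} (target 3.0)", w);
}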
Parallel Processing
use daa_prime_trainer::ParallelConfig;
use rayon::prelude::*;

let parallel_config = ParallelConfig { /* ... */ };

// Parallel gradient computation
let gradients: Vec<_> = data_batches
    .par_iter()
    .map(|batch| compute_gradients(batch)) // placeholder for the per-batch gradient step
    .collect();
Benchmarking
Benchmarks, where present, run through the standard Cargo harness (cargo bench).
Troubleshooting
Common Issues
- Training Divergence: monitor gradient norms during training; if the norm climbs above a threshold (around 10.0 in the original example), lower the learning rate or clip gradients. A stand-alone sketch follows this list.
- Memory Issues: enable gradient checkpointing for large models via TrainingConfig.
- Network Timeouts: increase the network timeout in NetworkConfig for slow or high-latency networks.
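The divergence check above boils down to measuring the L2 norm of the gradient and rescaling when it exceeds a threshold. The helpers below echo the names in the snippet (calculate_gradient_norm, clip_gradients) but are generic Rust, not crate APIs.

// L2 norm of a flattened gradient vector.
fn calculate_gradient_norm(grad: &[f64]) -> f64 {
    grad.iter().map(|g| g * g).sum::<f64>().sqrt()
}

// Scale the gradient down so its norm is at most `max_norm`.
fn clip_gradients(grad: &mut [f64], max_norm: f64) {
    let norm = calculate_gradient_norm(grad);
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grad.iter_mut() {
            *g *= scale;
        }
    }
}

fn main() {
    let mut grad = vec![6.0, 8.0, 5.0]; // norm ≈ 11.2
    let norm = calculate_gradient_norm(&grad);
    if norm > 10.0 {
        println!("gradient norm {norm:.1} is large; clipping");
    }
    clip_gradients(&mut grad, 5.0);
    println!("clipped norm = {:.1}", calculate_gradient_norm(&grad)); // 5.0
}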
Roadmap
- GPU acceleration support
- Model parallel training (tensor parallelism)
- Advanced aggregation algorithms (Byzantine fault tolerance)
- Differential privacy integration
- Automated hyperparameter tuning
- Real-time model serving integration
Contributing
Contributions are welcome! Please see our Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Related Crates
- daa-prime-core: Core types and protocol definitions
- daa-prime-dht: Distributed hash table for model storage
- daa-prime-coordinator: Training coordination and governance
- daa-prime-cli: Command-line interface
References
- Federated Learning - McMahan et al.
- FSDP Paper - Fully Sharded Data Parallel
- Byzantine ML - Byzantine-robust distributed learning