Distributed training backend for PMetal.
Enables “Home Clusters” by synchronizing gradients across multiple devices (e.g., Mac Studio + MacBook Pro) over standard networks (TCP/IP, Wi-Fi).
§Features
- Zero-Configuration Discovery: Automatically finds peers using mDNS/Bonjour
- Ring All-Reduce: Bandwidth-optimal gradient synchronization
- Persistent Identity: Ed25519 keypairs stored at `~/.pmetal/node_keypair`
- Topology Awareness: Graph-based cluster management with petgraph
- Master Election: Distributed leader election for coordination
- Health Monitoring: Heartbeat-based peer health tracking
- Gradient Compression: TopK, quantization, and error feedback
- Network Isolation: PSK-based namespace isolation
- Observability: Comprehensive metrics and tracing
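The bandwidth-optimality claim behind ring all-reduce can be made concrete with a small, self-contained simulation (a sketch only, not the crate's `RingBackend`): with N nodes, each node sends and receives only 2*(N-1)/N of the buffer per all-reduce, independent of cluster size.

```rust
// Sketch: simulate ring all-reduce over n in-memory "nodes". Each node's
// buffer is split into n chunks; n-1 reduce-scatter steps sum each chunk
// around the ring, then n-1 all-gather steps circulate the finished chunks.
fn ring_all_reduce(buffers: &mut Vec<Vec<f32>>) {
    let n = buffers.len();
    let len = buffers[0].len();
    assert!(len % n == 0, "sketch assumes the buffer divides into n equal chunks");
    let chunk = len / n;

    // Reduce-scatter: at step s, node r receives chunk (r - 1 - s) mod n
    // from its left neighbour and adds it to its own copy.
    for s in 0..n - 1 {
        let snapshot = buffers.clone(); // model all nodes sending at once
        for r in 0..n {
            let src = (r + n - 1) % n; // left neighbour
            let c = (r + 2 * n - 1 - s) % n; // chunk received this step
            for i in 0..chunk {
                buffers[r][c * chunk + i] += snapshot[src][c * chunk + i];
            }
        }
    }

    // All-gather: at step s, node r receives the fully reduced chunk
    // (r - s) mod n from its left neighbour and overwrites its copy.
    for s in 0..n - 1 {
        let snapshot = buffers.clone();
        for r in 0..n {
            let src = (r + n - 1) % n;
            let c = (r + n - s) % n;
            for i in 0..chunk {
                buffers[r][c * chunk + i] = snapshot[src][c * chunk + i];
            }
        }
    }
}
```

Each phase moves only N-1 chunks per node, which is why the per-node traffic stays flat as the cluster grows.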
§Quick Start (Auto-Discovery)
```rust
use pmetal_distributed::{AutoDiscoveryBackend, DistributedContext};
use std::time::Duration;

// Create backend with automatic peer discovery
let backend = AutoDiscoveryBackend::new().await?;

// Wait for at least 1 peer to join
backend.wait_for_peers(1, Duration::from_secs(30)).await?;

// Create context for distributed operations
let ctx = DistributedContext::new(Box::new(backend));

// Synchronize gradients across the cluster
// (`gradient_buffer` is this node's local gradient buffer, e.g. a Vec<f32>)
ctx.all_reduce(&mut gradient_buffer).await?;
```

§Manual Configuration
For advanced use cases, you can manually configure peers:
```rust
use pmetal_distributed::{DistributedConfig, RingBackend, DistributedContext};

let config = DistributedConfig::new(
    vec!["192.168.1.10:52416".parse()?, "192.168.1.11:52416".parse()?],
    0, // This node's rank
);
let backend = RingBackend::new(config).await?;
let ctx = DistributedContext::new(Box::new(backend));
```

§Architecture
```text
┌────────────────────────────────────────────────────────────┐
│                    AutoDiscoveryBackend                    │
│                                                            │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│  │   Identity   │   │  Discovery   │   │   Topology   │    │
│  │  (Ed25519)   │   │    (mDNS)    │   │  (petgraph)  │    │
│  └──────────────┘   └──────────────┘   └──────────────┘    │
│         │                  │                  │            │
│         └──────────────────┼──────────────────┘            │
│                            ▼                               │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│  │   Election   │   │    Health    │   │  Collective  │    │
│  │   (Master)   │   │ (Heartbeat)  │   │ (Strategies) │    │
│  └──────────────┘   └──────────────┘   └──────────────┘    │
│         │                  │                  │            │
│         └──────────────────┼──────────────────┘            │
│                            ▼                               │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│  │ Compression  │   │   Metrics    │   │  Namespace   │    │
│  │ (TopK/Quant) │   │  (Observ.)   │   │    (PSK)     │    │
│  └──────────────┘   └──────────────┘   └──────────────┘    │
└────────────────────────────────────────────────────────────┘
```

§Re-exports
```rust
pub use activation_codec::ActivationCodec;
pub use activation_transport::ActivationMessage;
pub use activation_transport::DtypeTag;
pub use auto::AutoDiscoveryBackend;
pub use auto::AutoDiscoveryConfig;
pub use collective::AllReduceStrategy;
pub use collective::BroadcastStrategy;
pub use collective::CollectiveConfig;
pub use collective::ReduceStrategy;
pub use compression::CompressionStrategy;
pub use compression::GradientCompressor;
pub use compression::QuantizationType;
pub use config::DistributedConfig;
pub use election::ElectionConfig;
pub use election::ElectionEvent;
pub use election::ElectionManager;
pub use election::ElectionState;
pub use error::DistributedError;
pub use error::DistributedResult;
pub use health::HealthConfig;
pub use health::HealthEvent;
pub use health::HealthMonitor;
pub use health::HealthStatus;
pub use health::HealthSummary;
pub use identity::NodeIdentity;
pub use layer_assignment::assign_layers_bandwidth_aware;
pub use layer_assignment::assign_layers_proportional;
pub use metrics::DistributedMetrics;
pub use metrics::MetricsSnapshot;
pub use namespace::NetworkNamespace;
pub use pipeline::PipelineGenerationLoop;
pub use pipeline::PipelineStageConfig;
pub use pipeline::PipelineStageRuntime;
pub use pipeline::StreamMultiplexer;
pub use ring::RingBackend;
pub use topology::ClusterTopology;
pub use topology::ConnectionProfile;
pub use topology::NodeProfile;
```
§Modules
- activation_codec - Activation compression for pipeline inference.
- activation_transport - Activation transport for pipeline-parallel inference.
- auto - Auto-discovery distributed backend.
- cloud_bridge - PMetal Cloud-Bridge for local-to-cluster state transfer.
- collective - Configurable collective operations with pluggable strategies.
- compression - Gradient compression for bandwidth optimization.
- config
- discovery - Automatic peer discovery using mDNS (Bonjour/Avahi).
- election - Distributed master election algorithm.
- error - Comprehensive error types for distributed operations.
- health - Heartbeat and health check system for peer monitoring.
- identity - Persistent node identity management.
- layer_assignment - Layer assignment solvers for pipeline-parallel inference.
- metrics - Metrics and observability for distributed operations.
- namespace - Network namespace isolation via pre-shared keys.
- pipeline - Pipeline-parallel inference runtime.
- prelude - Prelude for convenient imports.
- ring
- solver - Topology-aware layer assignment solver.
- topology - Cluster topology management.
- transport - TCP transport layer for distributed training.
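To show the idea behind the compression module's TopK strategy with error feedback, here is a minimal sketch (the `TopKCompressor` type and its methods are illustrative assumptions, not the crate's `GradientCompressor` API):

```rust
// Sketch: top-k sparsification with error feedback. Only the k
// largest-magnitude gradient entries are transmitted; the untransmitted
// remainder ("error") is carried over and added to the next step's
// gradient, so no component is lost permanently.
struct TopKCompressor {
    k: usize,
    residual: Vec<f32>, // error-feedback buffer, one entry per gradient dim
}

impl TopKCompressor {
    fn new(dim: usize, k: usize) -> Self {
        Self { k, residual: vec![0.0; dim] }
    }

    /// Returns the (index, value) pairs to send and updates the residual.
    fn compress(&mut self, grad: &[f32]) -> Vec<(usize, f32)> {
        // Add the carried-over error to the fresh gradient.
        let corrected: Vec<f32> =
            grad.iter().zip(&self.residual).map(|(g, r)| g + r).collect();

        // Pick the k largest-magnitude entries.
        let mut idx: Vec<usize> = (0..corrected.len()).collect();
        idx.sort_by(|&a, &b| {
            corrected[b].abs().partial_cmp(&corrected[a].abs()).unwrap()
        });
        let selected = &idx[..self.k.min(idx.len())];

        // The residual keeps everything that was not sent.
        self.residual = corrected.clone();
        let mut out = Vec::with_capacity(selected.len());
        for &i in selected {
            out.push((i, corrected[i]));
            self.residual[i] = 0.0;
        }
        out
    }
}
```

Error feedback is what makes aggressive sparsification safe: anything dropped in one step is re-added before the next top-k selection, so small but persistent gradient components are eventually transmitted.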
§Structs
- DistributedContext - A handle to the distributed runtime.
§Enums
- ReduceOp - Reduction operation for all_reduce.
§Traits
- DistributedBackend - Interface for distributed operations.