Crate pmetal_distributed

Distributed training backend for PMetal.

Enables “Home Clusters” by synchronizing gradients across multiple devices (e.g., a Mac Studio and a MacBook Pro) over standard TCP/IP networks, including Wi-Fi.

§Features

  • Zero-Configuration Discovery: Automatically finds peers using mDNS/Bonjour
  • Ring All-Reduce: Bandwidth-optimal gradient synchronization
  • Persistent Identity: Ed25519 keypairs stored at ~/.pmetal/node_keypair
  • Topology Awareness: Graph-based cluster management with petgraph
  • Master Election: Distributed leader election for coordination
  • Health Monitoring: Heartbeat-based peer health tracking
  • Gradient Compression: TopK, quantization, and error feedback
  • Network Isolation: PSK-based namespace isolation
  • Observability: Comprehensive metrics and tracing
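
To make the Ring All-Reduce bullet concrete, here is a minimal in-process sketch of the general technique (reduce-scatter followed by all-gather). It is not this crate's implementation: `ring_all_reduce` is a hypothetical function, the "nodes" are plain vectors in one process, and no networking, compression, or error handling is involved.

```rust
/// Simulates a sum ring all-reduce over in-process "nodes".
/// Each inner Vec is one node's gradient buffer; all must have equal length.
/// Hypothetical illustration only, not the pmetal_distributed API.
fn ring_all_reduce(buffers: &mut [Vec<f32>]) {
    let n = buffers.len();
    if n < 2 {
        return; // Nothing to exchange with zero or one node.
    }
    let len = buffers[0].len();
    assert!(buffers.iter().all(|b| b.len() == len));

    // Chunk c covers indices bounds(c)..bounds(c + 1); some chunks may be empty.
    let bounds = |c: usize| c * len / n;

    // Phase 1 (offset 0, accumulate): reduce-scatter.
    //   At step s, node r sends chunk (r - s) mod n to node r + 1, which adds it.
    //   Afterwards node r holds the fully reduced chunk (r + 1) mod n.
    // Phase 2 (offset 1, overwrite): all-gather.
    //   At step s, node r sends chunk (r + 1 - s) mod n to node r + 1, which copies it.
    for (offset, accumulate) in [(0usize, true), (1usize, false)] {
        for step in 0..n - 1 {
            // Snapshot all outgoing chunks first so sends within a step are simultaneous.
            let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
                .map(|rank| {
                    let chunk = (rank + offset + n - step) % n;
                    let data = buffers[rank][bounds(chunk)..bounds(chunk + 1)].to_vec();
                    ((rank + 1) % n, chunk, data)
                })
                .collect();
            for (dst, chunk, data) in sends {
                let start = bounds(chunk);
                for (i, v) in data.into_iter().enumerate() {
                    if accumulate {
                        buffers[dst][start + i] += v;
                    } else {
                        buffers[dst][start + i] = v;
                    }
                }
            }
        }
    }
}

fn main() {
    let mut grads = vec![
        vec![1.0_f32, 2.0, 3.0, 4.0],
        vec![10.0, 20.0, 30.0, 40.0],
        vec![100.0, 200.0, 300.0, 400.0],
    ];
    ring_all_reduce(&mut grads);
    // Every node should now hold the element-wise sum.
    for (rank, g) in grads.iter().enumerate() {
        println!("node {rank}: {g:?}");
    }
}
```

Each node sends and receives roughly 2 * (n - 1) / n of the buffer in total, independent of cluster size, which is why the scheme is usually described as bandwidth-optimal.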

§Quick Start (Auto-Discovery)

use pmetal_distributed::{AutoDiscoveryBackend, DistributedContext};
use std::time::Duration;

// Create backend with automatic peer discovery
let backend = AutoDiscoveryBackend::new().await?;

// Wait for at least 1 peer to join
backend.wait_for_peers(1, Duration::from_secs(30)).await?;

// Create context for distributed operations
let ctx = DistributedContext::new(Box::new(backend));

// Synchronize gradients across the cluster.
// `gradient_buffer` is the caller's mutable gradient buffer (not defined in this snippet).
ctx.all_reduce(&mut gradient_buffer).await?;

§Manual Configuration

For advanced use cases, you can manually configure peers:

use pmetal_distributed::{DistributedConfig, RingBackend, DistributedContext};

let config = DistributedConfig::new(
    vec!["192.168.1.10:52416".parse()?, "192.168.1.11:52416".parse()?],
    0, // This node's rank
);

let backend = RingBackend::new(config).await?;
let ctx = DistributedContext::new(Box::new(backend));
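
As a rough illustration of what a rank means in a manually configured ring, the sketch below derives a node's previous and next peers from the address list. `ring_neighbors` is a hypothetical helper written for this example; the source does not say how `RingBackend` actually maps ranks to connections, so the "rank is an index into the peer list" convention is an assumption.

```rust
use std::net::SocketAddr;

/// Hypothetical helper (not part of pmetal_distributed): returns the
/// (previous, next) peer addresses for `rank`, assuming the rank is an
/// index into `peers` and the ring wraps around at both ends.
fn ring_neighbors(peers: &[SocketAddr], rank: usize) -> Option<(SocketAddr, SocketAddr)> {
    let n = peers.len();
    if n == 0 || rank >= n {
        return None; // Rank does not name a peer in the list.
    }
    let prev = peers[(rank + n - 1) % n];
    let next = peers[(rank + 1) % n];
    Some((prev, next))
}

fn main() {
    let peers: Vec<SocketAddr> = vec![
        "192.168.1.10:52416".parse().unwrap(),
        "192.168.1.11:52416".parse().unwrap(),
    ];
    let (prev, next) = ring_neighbors(&peers, 0).unwrap();
    println!("rank 0: prev = {prev}, next = {next}");
}
```

With only two peers, the previous and next neighbor are the same node, so the ring degenerates into a single bidirectional link.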

§Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     AutoDiscoveryBackend                         │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Identity   │  │  Discovery   │  │  Topology    │           │
│  │  (Ed25519)   │  │   (mDNS)     │  │  (petgraph)  │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
│          │                │                 │                    │
│          └────────────────┼─────────────────┘                    │
│                           ▼                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  Election    │  │   Health     │  │  Collective  │           │
│  │  (Master)    │  │  (Heartbeat) │  │  (Strategies)│           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
│          │                │                 │                    │
│          └────────────────┼─────────────────┘                    │
│                           ▼                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │ Compression  │  │   Metrics    │  │  Namespace   │           │
│  │  (TopK/Quant)│  │ (Observ.)    │  │  (PSK)       │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘
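
The Compression box in the diagram covers TopK sparsification with error feedback. The sketch below shows the general idea only: send the k largest-magnitude entries and carry everything unsent into the next step. `TopKCompressor` and its methods are hypothetical names for this example and should not be read as the API of `GradientCompressor` or `CompressionStrategy`.

```rust
/// Hypothetical top-k sparsifier with an error-feedback residual.
/// Illustration of the technique, not the pmetal_distributed implementation.
struct TopKCompressor {
    k: usize,
    /// Gradient mass that was not transmitted in earlier steps.
    residual: Vec<f32>,
}

impl TopKCompressor {
    fn new(len: usize, k: usize) -> Self {
        Self { k: k.min(len), residual: vec![0.0; len] }
    }

    /// Returns (index, value) pairs for the k largest-magnitude entries of
    /// `grad + residual`. Entries that are not returned stay in the residual.
    fn compress(&mut self, grad: &[f32]) -> Vec<(usize, f32)> {
        assert_eq!(grad.len(), self.residual.len());

        // Error feedback: fold the new gradient into what was left unsent.
        for (r, g) in self.residual.iter_mut().zip(grad) {
            *r += *g;
        }

        // Order indices by descending magnitude (stable, so ties keep index order).
        let mut order: Vec<usize> = (0..self.residual.len()).collect();
        order.sort_by(|&a, &b| self.residual[b].abs().total_cmp(&self.residual[a].abs()));

        // Emit the top k entries and clear them from the residual.
        let mut sent = Vec::with_capacity(self.k);
        for &i in order.iter().take(self.k) {
            sent.push((i, self.residual[i]));
            self.residual[i] = 0.0;
        }
        sent
    }
}

fn main() {
    let mut compressor = TopKCompressor::new(4, 1);
    let first = compressor.compress(&[0.5, -5.0, 0.25, 3.0]);
    let second = compressor.compress(&[0.0, 0.0, 0.0, 0.0]);
    println!("first = {first:?}, second = {second:?}");
}
```

The residual is what keeps small but persistent gradient components from being dropped forever: they accumulate until they are large enough to make the top k.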

§Re-exports

pub use auto::AutoDiscoveryBackend;
pub use auto::AutoDiscoveryConfig;
pub use collective::AllReduceStrategy;
pub use collective::BroadcastStrategy;
pub use collective::CollectiveConfig;
pub use collective::ReduceStrategy;
pub use compression::CompressionStrategy;
pub use compression::GradientCompressor;
pub use compression::QuantizationType;
pub use config::DistributedConfig;
pub use election::ElectionConfig;
pub use election::ElectionEvent;
pub use election::ElectionManager;
pub use election::ElectionState;
pub use error::DistributedError;
pub use error::DistributedResult;
pub use health::HealthConfig;
pub use health::HealthEvent;
pub use health::HealthMonitor;
pub use health::HealthStatus;
pub use health::HealthSummary;
pub use identity::NodeIdentity;
pub use metrics::DistributedMetrics;
pub use metrics::MetricsSnapshot;
pub use metrics::SharedMetrics;
pub use namespace::NetworkNamespace;
pub use ring::RingBackend;
pub use topology::ClusterTopology;
pub use topology::ConnectionProfile;
pub use topology::NodeProfile;
pub use topology::SharedTopology;

§Modules

auto
Auto-discovery distributed backend.
cloud_bridge
PMetal Cloud-Bridge for local-to-cluster state transfer.
collective
Configurable collective operations with pluggable strategies.
compression
Gradient compression for bandwidth optimization.
config
discovery
Automatic peer discovery using mDNS (Bonjour/Avahi).
election
Distributed master election algorithm.
error
Comprehensive error types for distributed operations.
health
Heartbeat and health check system for peer monitoring.
identity
Persistent node identity management.
metrics
Metrics and observability for distributed operations.
namespace
Network namespace isolation via pre-shared keys.
prelude
Prelude for convenient imports.
ring
topology
Cluster topology management.
transport
TCP transport layer for distributed training.

§Structs

DistributedContext
A handle to the distributed runtime.

§Traits

DistributedBackend
Interface for distributed operations.