
Crate pmetal_distributed


Distributed training backend for PMetal.

Enables “Home Clusters” by synchronizing gradients across multiple devices (e.g., Mac Studio + MacBook Pro) over standard networks (TCP/IP, Wi-Fi).

§Features

  • Zero-Configuration Discovery: Automatically finds peers using mDNS/Bonjour
  • Ring All-Reduce: Bandwidth-optimal gradient synchronization
  • Persistent Identity: Ed25519 keypairs stored at ~/.pmetal/node_keypair
  • Topology Awareness: Graph-based cluster management with petgraph
  • Master Election: Distributed leader election for coordination
  • Health Monitoring: Heartbeat-based peer health tracking
  • Gradient Compression: TopK, quantization, and error feedback
  • Network Isolation: PSK-based namespace isolation
  • Observability: Comprehensive metrics and tracing
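Ring all-reduce is bandwidth-optimal because each node only ever exchanges 1/n-sized chunks with its two ring neighbours: a reduce-scatter phase sums one chunk per node, then an all-gather phase circulates the finished chunks. The single-process sketch below simulates the chunk schedule for illustration only; the crate's real implementation (`ring::RingBackend`) moves chunks over TCP between peers.

```rust
/// Single-process simulation of ring all-reduce over `bufs.len()` ranks.
/// Illustrative sketch, not the crate's wire protocol.
fn ring_all_reduce(bufs: &mut [Vec<f32>]) {
    let n = bufs.len() as isize;
    let chunk = bufs[0].len() / n as usize; // sketch assumes len % n == 0
    let idx = |c: isize| ((c % n + n) % n) as usize;

    // Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n
    // to rank (r + 1) mod n. After n-1 steps, rank r holds the fully
    // summed chunk (r + 1) mod n.
    for step in 0..n - 1 {
        for r in 0..n {
            let c = idx(r - step);
            let dst = idx(r + 1);
            for i in c * chunk..(c + 1) * chunk {
                let v = bufs[r as usize][i];
                bufs[dst][i] += v;
            }
        }
    }

    // Phase 2: all-gather. Each rank forwards its completed chunk around
    // the ring until every rank holds every summed chunk.
    for step in 0..n - 1 {
        for r in 0..n {
            let c = idx(r + 1 - step);
            let dst = idx(r + 1);
            for i in c * chunk..(c + 1) * chunk {
                let v = bufs[r as usize][i];
                bufs[dst][i] = v;
            }
        }
    }
}

fn main() {
    // 4 ranks, each contributing a distinct constant gradient vector.
    let mut bufs: Vec<Vec<f32>> = (0..4).map(|r| vec![r as f32 + 1.0; 8]).collect();
    ring_all_reduce(&mut bufs);
    // Every rank now holds the elementwise sum 1 + 2 + 3 + 4 = 10.
    assert!(bufs.iter().all(|b| b.iter().all(|&x| x == 10.0)));
    println!("all ranks agree: {:?}", bufs[0]);
}
```

Each node transmits roughly 2·(n−1)/n of one buffer's worth of data regardless of cluster size, which is why the ring schedule is preferred on bandwidth-constrained links like Wi-Fi.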

§Quick Start (Auto-Discovery)

use pmetal_distributed::{AutoDiscoveryBackend, DistributedContext};
use std::time::Duration;

// This snippet runs inside an async runtime (e.g. #[tokio::main]);
// `?` propagates errors from the enclosing function.

// Create a backend with automatic peer discovery
let backend = AutoDiscoveryBackend::new().await?;

// Wait for at least 1 peer to join (30-second timeout)
backend.wait_for_peers(1, Duration::from_secs(30)).await?;

// Create a context for distributed operations
let ctx = DistributedContext::new(Box::new(backend));

// Synchronize gradients in place across the cluster
let mut gradient_buffer = vec![0.0f32; 1024];
ctx.all_reduce(&mut gradient_buffer).await?;

§Manual Configuration

For advanced use cases, you can manually configure peers:

use pmetal_distributed::{DistributedConfig, RingBackend, DistributedContext};

// As above, this runs inside an async runtime and propagates errors with `?`.
let config = DistributedConfig::new(
    vec!["192.168.1.10:52416".parse()?, "192.168.1.11:52416".parse()?],
    0, // This node's rank
);

let backend = RingBackend::new(config).await?;
let ctx = DistributedContext::new(Box::new(backend));
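In a ring topology each rank only ever communicates with its two immediate neighbours, which follow from the rank by modular arithmetic. A small sketch, assuming ranks 0..world_size-1 with one rank per node; `ring_neighbors` is a hypothetical helper for illustration, not part of the crate's API:

```rust
/// Previous and next neighbour of a rank in a ring of `world_size` nodes.
/// Hypothetical helper, not a pmetal_distributed API.
fn ring_neighbors(rank: usize, world_size: usize) -> (usize, usize) {
    let prev = (rank + world_size - 1) % world_size; // wraps 0 -> world_size-1
    let next = (rank + 1) % world_size;
    (prev, next)
}

fn main() {
    // A three-node ring: rank 0 receives from 2 and sends to 1, and so on.
    assert_eq!(ring_neighbors(0, 3), (2, 1));
    assert_eq!(ring_neighbors(1, 3), (0, 2));
    assert_eq!(ring_neighbors(2, 3), (1, 0));
    println!("rank 0 neighbours: {:?}", ring_neighbors(0, 3));
}
```

This is why every node in a manual configuration must agree on the peer list's order: the rank index determines who each node connects to.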

§Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     AutoDiscoveryBackend                         │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Identity   │  │  Discovery   │  │  Topology    │           │
│  │  (Ed25519)   │  │   (mDNS)     │  │  (petgraph)  │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
│          │                │                 │                    │
│          └────────────────┼─────────────────┘                    │
│                           ▼                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  Election    │  │   Health     │  │  Collective  │           │
│  │  (Master)    │  │  (Heartbeat) │  │  (Strategies)│           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
│          │                │                 │                    │
│          └────────────────┼─────────────────┘                    │
│                           ▼                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │ Compression  │  │   Metrics    │  │  Namespace   │           │
│  │  (TopK/Quant)│  │ (Observ.)    │  │  (PSK)       │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘
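The Compression stage in the diagram pairs sparsification with error feedback: coordinates dropped in one round are carried over and added back before the next selection, so no gradient mass is permanently lost. A minimal single-tensor sketch of TopK with error feedback, for illustration only; the crate's real interface is `compression::GradientCompressor`:

```rust
/// Top-k sparsification with error feedback. Illustrative sketch only.
struct TopKCompressor {
    k: usize,
    residual: Vec<f32>, // error feedback: values dropped in the last round
}

impl TopKCompressor {
    fn new(dim: usize, k: usize) -> Self {
        Self { k, residual: vec![0.0; dim] }
    }

    /// Returns the k largest-magnitude (index, value) pairs; everything
    /// else is accumulated into `residual` for the next call.
    fn compress(&mut self, grad: &[f32]) -> Vec<(usize, f32)> {
        // Add back the error carried over from the previous step.
        let corrected: Vec<f32> =
            grad.iter().zip(&self.residual).map(|(g, r)| g + r).collect();

        // Pick the k indices with the largest magnitude.
        let mut order: Vec<usize> = (0..corrected.len()).collect();
        order.sort_by(|&a, &b| {
            corrected[b].abs().partial_cmp(&corrected[a].abs()).unwrap()
        });

        // Transmit the selected coordinates; remember the rest as residual.
        self.residual = corrected.clone();
        let mut sent = Vec::with_capacity(self.k);
        for i in order.into_iter().take(self.k) {
            sent.push((i, corrected[i]));
            self.residual[i] = 0.0;
        }
        sent
    }
}

fn main() {
    let mut c = TopKCompressor::new(4, 2);
    let sent = c.compress(&[0.1, -2.0, 0.3, 1.5]);
    // The two largest-magnitude coordinates are transmitted...
    assert_eq!(sent, vec![(1, -2.0), (3, 1.5)]);
    // ...and the dropped ones are remembered as residual error.
    assert_eq!(c.residual, vec![0.1, 0.0, 0.3, 0.0]);
    println!("sent: {:?}", sent);
}
```

Sending k (index, value) pairs instead of the dense buffer trades a small convergence cost for a large bandwidth reduction, which matters most on the Wi-Fi links this crate targets.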

Re-exports§

pub use activation_codec::ActivationCodec;
pub use activation_transport::ActivationMessage;
pub use activation_transport::DtypeTag;
pub use auto::AutoDiscoveryBackend;
pub use auto::AutoDiscoveryConfig;
pub use collective::AllReduceStrategy;
pub use collective::BroadcastStrategy;
pub use collective::CollectiveConfig;
pub use collective::ReduceStrategy;
pub use compression::CompressionStrategy;
pub use compression::GradientCompressor;
pub use compression::QuantizationType;
pub use config::DistributedConfig;
pub use election::ElectionConfig;
pub use election::ElectionEvent;
pub use election::ElectionManager;
pub use election::ElectionState;
pub use error::DistributedError;
pub use error::DistributedResult;
pub use health::HealthConfig;
pub use health::HealthEvent;
pub use health::HealthMonitor;
pub use health::HealthStatus;
pub use health::HealthSummary;
pub use identity::NodeIdentity;
pub use layer_assignment::assign_layers_bandwidth_aware;
pub use layer_assignment::assign_layers_proportional;
pub use metrics::DistributedMetrics;
pub use metrics::MetricsSnapshot;
pub use metrics::SharedMetrics;
pub use namespace::NetworkNamespace;
pub use pipeline::PipelineGenerationLoop;
pub use pipeline::PipelineStageConfig;
pub use pipeline::PipelineStageRuntime;
pub use pipeline::StreamMultiplexer;
pub use ring::RingBackend;
pub use topology::ClusterTopology;
pub use topology::ConnectionProfile;
pub use topology::NodeProfile;
pub use topology::SharedTopology;

Modules§

activation_codec
Activation compression for pipeline inference.
activation_transport
Activation transport for pipeline-parallel inference.
auto
Auto-discovery distributed backend.
cloud_bridge
PMetal Cloud-Bridge for local-to-cluster state transfer.
collective
Configurable collective operations with pluggable strategies.
compression
Gradient compression for bandwidth optimization.
config
Distributed training configuration.
discovery
Automatic peer discovery using mDNS (Bonjour/Avahi).
election
Distributed master election algorithm.
error
Comprehensive error types for distributed operations.
health
Heartbeat and health check system for peer monitoring.
identity
Persistent node identity management.
layer_assignment
Layer assignment solvers for pipeline-parallel inference.
metrics
Metrics and observability for distributed operations.
namespace
Network namespace isolation via pre-shared keys.
pipeline
Pipeline-parallel inference runtime.
prelude
Prelude for convenient imports.
ring
Ring all-reduce backend.
solver
Topology-aware layer assignment solver.
topology
Cluster topology management.
transport
TCP transport layer for distributed training.

Structs§

DistributedContext
A handle to the distributed runtime.

Enums§

ReduceOp
Reduction operation for all_reduce.

Traits§

DistributedBackend
Interface for distributed operations.