pmetal-distributed 0.3.13

Distributed training backend for PMetal (Home Clusters)

pmetal-distributed

Distributed training backend for home clusters on Apple Silicon.

Overview

This crate provides peer-to-peer distributed training infrastructure designed for small Apple Silicon clusters (2-8 nodes). It features zero-configuration mDNS discovery, ring all-reduce gradient synchronization, and pluggable compression strategies for bandwidth-efficient training.

Architecture

                    ┌──────────────────────────┐
                    │  DistributedContext       │
                    │  (Backend-agnostic API)   │
                    └────────────┬─────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                                     │
              ▼                                     ▼
┌──────────────────────────┐         ┌──────────────────────────┐
│  AutoDiscoveryBackend    │         │  RingBackend             │
│  (Zero-config mDNS)      │         │  (Manual peer list)      │
└────────────┬─────────────┘         └──────────────────────────┘
             │
     ┌───────┼───────┬───────────┐
     ▼       ▼       ▼           ▼
 Identity  Discovery  Topology  Election
 (Ed25519) (libp2p)  (petgraph) (Seniority)
     │       │        │           │
     └───────┼────────┴───────────┘
             ▼
     ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
     │  Transport    │  │  Compression  │  │  Health       │
     │  (TCP Ring)   │  │  (TopK/Quant) │  │  (Heartbeat)  │
     └───────────────┘  └───────────────┘  └───────────────┘

Features

  • Zero-Configuration Discovery: Automatic peer detection via mDNS/Bonjour on local networks
  • Ring All-Reduce: Bandwidth-optimal gradient synchronization with scatter-reduce and all-gather phases
  • Persistent Identity: Ed25519 keypairs stored at ~/.pmetal/node_keypair
  • Topology Awareness: Graph-based cluster representation with node capability and connection profiling
  • Master Election: Seniority-based distributed leader election with PeerId tiebreaking
  • Health Monitoring: Heartbeat-based peer tracking with exponential moving average latency
  • Gradient Compression: TopK, random sparsification, FP16/BF16/INT8 quantization, and PowerSGD with error feedback
  • Network Isolation: SHA3-256 PSK namespacing to prevent cross-cluster communication
  • Metrics: Counters, gauges, and histograms for all-reduce duration, bytes processed, and failures
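
The two phases of ring all-reduce named above can be illustrated with a small in-memory simulation. This is a sketch only, not the crate's implementation: `ring_all_reduce` is a hypothetical function, the reduction is assumed to be a sum, and each "node" is just a `Vec<f32>` rather than a TCP peer.

```rust
/// Simulates a sum all-reduce over `n` in-memory "nodes" arranged in a ring.
/// Hypothetical illustration; a real backend exchanges chunks over the network.
fn ring_all_reduce(buffers: &mut [Vec<f32>]) {
    let n = buffers.len();
    if n < 2 {
        return;
    }
    let len = buffers[0].len();
    assert!(buffers.iter().all(|b| b.len() == len), "buffers must match in length");

    // Chunk `c` covers the half-open index range returned here.
    let bounds = |c: usize| (c * len / n, (c + 1) * len / n);

    // Phase 1: scatter-reduce. At step `s`, node `i` sends chunk (i - s) mod n
    // to its right neighbour, which adds it to its own copy of that chunk.
    for s in 0..n - 1 {
        // Snapshot outgoing chunks so all sends in a step happen "simultaneously".
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
            .map(|i| {
                let c = (i + n - s) % n;
                let (lo, hi) = bounds(c);
                ((i + 1) % n, c, buffers[i][lo..hi].to_vec())
            })
            .collect();
        for (dst, c, data) in sends {
            let (lo, _) = bounds(c);
            for (k, v) in data.iter().enumerate() {
                buffers[dst][lo + k] += *v;
            }
        }
    }

    // Node `i` now owns the fully reduced chunk (i + 1) mod n.
    // Phase 2: all-gather. At step `s`, node `i` forwards chunk (i + 1 - s) mod n
    // to its right neighbour, which overwrites its stale copy.
    for s in 0..n - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
            .map(|i| {
                let c = (i + 1 + n - s) % n;
                let (lo, hi) = bounds(c);
                ((i + 1) % n, c, buffers[i][lo..hi].to_vec())
            })
            .collect();
        for (dst, c, data) in sends {
            let (lo, _) = bounds(c);
            buffers[dst][lo..lo + data.len()].copy_from_slice(&data);
        }
    }
}
```

Each node sends only one chunk (roughly 1/n of the buffer) per step, which is why the per-node traffic stays nearly constant as the ring grows.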

Usage

Auto-Discovery (Zero-Config)

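// Runs inside an async function that returns a Result.
use pmetal_distributed::{AutoDiscoveryBackend, DistributedContext};
use std::time::Duration;

// Start mDNS discovery, then wait up to 30 seconds for at least one peer.
let backend = AutoDiscoveryBackend::new().await?;
backend.wait_for_peers(1, Duration::from_secs(30)).await?;
backend.establish_ring().await?;

// `gradient_buffer` is the caller's mutable gradient storage.
let ctx = DistributedContext::new(Box::new(backend));
ctx.all_reduce(&mut gradient_buffer).await?;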

Manual Configuration

// Runs inside an async function that returns a Result.
use pmetal_distributed::{DistributedConfig, RingBackend, DistributedContext};

// Explicit peer list; 52416 is the default gradient_port.
let config = DistributedConfig::new(
    vec!["192.168.1.10:52416".parse()?, "192.168.1.11:52416".parse()?],
    0, // This node's rank
);

let backend = RingBackend::new(config).await?;
let ctx = DistributedContext::new(Box::new(backend));

Gradient Compression

use pmetal_distributed::{GradientCompressor, CompressionStrategy};

// Keep the top 10% of gradient values by magnitude.
let mut compressor = GradientCompressor::new(
    CompressionStrategy::TopK { ratio: 0.1 },
    true, // enable error feedback
);

// `gradients` is the caller's gradient data.
let compressed = compressor.compress(&gradients);
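
To make the idea of TopK with error feedback concrete, here is a minimal standalone sketch. It does not reproduce the crate's `GradientCompressor`: the `TopKWithFeedback` type, its fields, and the `(index, value)` output format are all assumptions made for illustration. The key point is that values dropped in one round are carried into the next rather than lost.

```rust
/// Hypothetical stand-in for illustration; not the crate's `GradientCompressor`.
struct TopKWithFeedback {
    ratio: f32,         // fraction of entries to keep, e.g. 0.1 for 10%
    residual: Vec<f32>, // error carried over from previous rounds
}

impl TopKWithFeedback {
    fn new(ratio: f32, len: usize) -> Self {
        Self { ratio, residual: vec![0.0; len] }
    }

    /// Returns the kept entries as (index, value) pairs, sorted by index.
    fn compress(&mut self, grad: &[f32]) -> Vec<(usize, f32)> {
        assert_eq!(grad.len(), self.residual.len());
        if grad.is_empty() {
            return Vec::new();
        }
        // Add the error left over from earlier rounds before selecting.
        let corrected: Vec<f32> = grad
            .iter()
            .zip(&self.residual)
            .map(|(g, r)| g + r)
            .collect();

        // Keep at least one entry, at most all of them.
        let k = ((corrected.len() as f32 * self.ratio).ceil() as usize)
            .clamp(1, corrected.len());

        // Order indices by descending magnitude and take the first k.
        let mut order: Vec<usize> = (0..corrected.len()).collect();
        order.sort_by(|&a, &b| corrected[b].abs().total_cmp(&corrected[a].abs()));
        let mut kept: Vec<(usize, f32)> =
            order[..k].iter().map(|&i| (i, corrected[i])).collect();
        kept.sort_by_key(|&(i, _)| i);

        // Everything that was not sent becomes the new residual.
        self.residual = corrected;
        for &(i, _) in &kept {
            self.residual[i] = 0.0;
        }
        kept
    }
}
```

With `ratio = 0.25` on a four-element gradient, only one value is sent per round; a small entry that keeps being skipped accumulates in `residual` until it is large enough to be selected.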

Compression Strategies

Strategy    Description                           Ratio
TopK        Keep top-k% gradients by magnitude    Configurable
Random      Probabilistic sparsification          Configurable
FP16        Half-precision quantization           2x
BF16        Brain float quantization              2x
INT8        8-bit quantization                    4x
PowerSGD    Low-rank approximation                Rank-dependent
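
The 4x figure for INT8 follows from storing each value in one byte instead of four, assuming the gradients start as 32-bit floats. The sketch below shows one common way to do this, symmetric per-tensor quantization with a single scale factor. Whether the crate uses this exact scheme is not stated here, and `quantize_int8` and `dequantize_int8` are hypothetical names.

```rust
/// Symmetric per-tensor INT8 quantization (an assumed scheme, for illustration).
/// Returns the quantized values plus the scale needed to reconstruct them.
fn quantize_int8(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Map the largest magnitude to 127; avoid dividing by zero for all-zero input.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let quantized = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Reverses `quantize_int8`, up to a rounding error of at most half a step.
fn dequantize_int8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```

The payload shrinks from 4 bytes to 1 byte per element, plus one `f32` scale per tensor, and each reconstructed value differs from the original by at most `scale / 2`.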

Collective Operations

Strategy       Latency     Bandwidth        Best For
Ring           O(n)        O(1)/node        Large gradients, balanced clusters
Tree           O(log n)    O(log n)/node    Small messages, low latency
Centralized    O(n)        O(n)/root        Very small clusters (2-3 nodes)
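
The Ring and Centralized rows can be checked with a little arithmetic. The helpers below use the standard textbook cost model for ring all-reduce (2(n-1) steps, each moving 1/n of the payload) and for a root that gathers every other node's full payload. The function names are hypothetical and the model ignores per-message overhead and compression.

```rust
/// Number of communication steps in a ring all-reduce: (n - 1) for
/// scatter-reduce plus (n - 1) for all-gather. Grows linearly with n.
fn ring_steps(n: u64) -> u64 {
    if n < 2 { 0 } else { 2 * (n - 1) }
}

/// Bytes each node sends in a ring all-reduce: one chunk of size
/// payload / n per step. Always below 2 * payload, whatever n is.
fn ring_bytes_sent_per_node(n: u64, payload_bytes: u64) -> u64 {
    if n < 2 { 0 } else { 2 * (n - 1) * payload_bytes / n }
}

/// Bytes the root receives when every other node sends its full payload.
/// Grows linearly with n.
fn centralized_root_bytes_received(n: u64, payload_bytes: u64) -> u64 {
    n.saturating_sub(1) * payload_bytes
}
```

For a 1 MB gradient, a ring node sends 1.5 MB with 4 nodes and 1.75 MB with 8, while a centralized root receives 3 MB and 7 MB respectively, which matches the O(1)/node versus O(n)/root columns.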

Configuration

Parameter                Default    Description
gradient_port            52416      TCP port for gradient exchange
discovery_port           52415      mDNS discovery port
min_peers                1          Minimum peers before ring formation
peer_timeout             30s        Discovery timeout
connection_timeout_ms    5000       TCP connection timeout (milliseconds)
max_retries              3          Connection retry limit
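
For readers who want the defaults in code form, the table can be mirrored as a plain struct. This is a hypothetical sketch: `ClusterSettingsSketch` and its field types are assumptions, and the crate's real configuration type may be named and shaped differently. Only the parameter names and default values come from the table.

```rust
use std::time::Duration;

/// Hypothetical mirror of the defaults table; not the crate's config type.
#[derive(Debug, Clone, PartialEq)]
struct ClusterSettingsSketch {
    gradient_port: u16,
    discovery_port: u16,
    min_peers: usize,
    peer_timeout: Duration,
    connection_timeout_ms: u64,
    max_retries: u32,
}

impl Default for ClusterSettingsSketch {
    fn default() -> Self {
        Self {
            gradient_port: 52416,
            discovery_port: 52415,
            min_peers: 1,
            peer_timeout: Duration::from_secs(30),
            connection_timeout_ms: 5000,
            max_retries: 3,
        }
    }
}

impl ClusterSettingsSketch {
    /// Converts the millisecond field into a `Duration` for use with timers.
    fn connection_timeout(&self) -> Duration {
        Duration::from_millis(self.connection_timeout_ms)
    }
}
```

Struct-update syntax such as `ClusterSettingsSketch { min_peers: 3, ..Default::default() }` overrides a single value while keeping the remaining defaults.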

Modules

Module         Description
auto           Auto-discovery backend with mDNS
ring           Ring-based all-reduce for manual configuration
discovery      libp2p peer discovery service
transport      TCP transport with connection pooling
collective     Pluggable collective operation strategies
compression    Gradient compression with error feedback
topology       Cluster graph with node and connection profiles
identity       Persistent Ed25519 keypair management
election       Distributed master election
health         Heartbeat-based peer monitoring
namespace      PSK network isolation
metrics        Observability counters and gauges
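
As an illustration of the exponential moving average latency tracking attributed to heartbeat monitoring, the following sketch smooths round-trip samples per peer. It is not the `health` module's API: `LatencyEma`, its smoothing factor `alpha`, and the choice to seed the average with the first sample are all assumptions.

```rust
/// Hypothetical latency tracker; not the crate's `health` module API.
struct LatencyEma {
    alpha: f64,            // weight of the newest sample, in (0, 1]
    value_ms: Option<f64>, // None until the first heartbeat is observed
}

impl LatencyEma {
    fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { alpha, value_ms: None }
    }

    /// Folds one heartbeat round-trip time into the average and returns it.
    fn observe(&mut self, sample_ms: f64) -> f64 {
        let next = match self.value_ms {
            // Seed the average with the first sample.
            None => sample_ms,
            // ema = alpha * sample + (1 - alpha) * previous
            Some(prev) => self.alpha * sample_ms + (1.0 - self.alpha) * prev,
        };
        self.value_ms = Some(next);
        next
    }
}
```

A larger `alpha` reacts faster to latency spikes; a smaller one gives a steadier estimate that is less sensitive to a single slow heartbeat.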

License

MIT OR Apache-2.0