# pmetal-distributed
Distributed training backend for home clusters on Apple Silicon.
## Overview
This crate provides peer-to-peer distributed training infrastructure designed for small Apple Silicon clusters (2-8 nodes). It features zero-configuration mDNS discovery, ring all-reduce gradient synchronization, and pluggable compression strategies for bandwidth-efficient training.
## Architecture
```
                     ┌──────────────────────────┐
                     │    DistributedContext    │
                     │  (Backend-agnostic API)  │
                     └────────────┬─────────────┘
                                  │
               ┌──────────────────┼──────────────────┐
               │                                     │
               ▼                                     ▼
  ┌──────────────────────────┐          ┌──────────────────────────┐
  │   AutoDiscoveryBackend   │          │       RingBackend        │
  │    (Zero-config mDNS)    │          │    (Manual peer list)    │
  └────────────┬─────────────┘          └──────────────────────────┘
               │
    ┌──────────┼──────────┬───────────┐
    ▼          ▼          ▼           ▼
 Identity  Discovery  Topology    Election
(Ed25519)  (libp2p)  (petgraph)  (Seniority)
    │          │          │           │
    └──────────┼──────────┴───────────┘
               ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│   Transport   │ │  Compression  │ │    Health     │
│  (TCP Ring)   │ │ (TopK/Quant)  │ │  (Heartbeat)  │
└───────────────┘ └───────────────┘ └───────────────┘
```
## Features
- **Zero-Configuration Discovery**: Automatic peer detection via mDNS/Bonjour on local networks
- **Ring All-Reduce**: Bandwidth-optimal gradient synchronization with scatter-reduce and all-gather phases
- **Persistent Identity**: Ed25519 keypairs stored at `~/.pmetal/node_keypair`
- **Topology Awareness**: Graph-based cluster representation with node capability and connection profiling
- **Master Election**: Seniority-based distributed leader election with PeerId tiebreaking
- **Health Monitoring**: Heartbeat-based peer tracking with exponential moving average latency
- **Gradient Compression**: TopK, random sparsification, FP16/BF16/INT8 quantization, and PowerSGD with error feedback
- **Network Isolation**: SHA3-256 PSK namespacing to prevent cross-cluster communication
- **Metrics**: Counters, gauges, and histograms for all-reduce duration, bytes processed, and failures
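
The two phases named under Ring All-Reduce can be illustrated with a single-process simulation. This is a minimal sketch, not the crate's implementation: `ring_all_reduce` is a hypothetical function, each inner `Vec` stands in for one node's gradient buffer, and "sending" is a plain copy rather than a TCP transfer.

```rust
/// Simulates a ring all-reduce (sum) across in-memory "nodes".
/// Hypothetical helper for illustration; not part of pmetal-distributed.
fn ring_all_reduce(buffers: &mut [Vec<f32>]) {
    let n = buffers.len();
    if n < 2 {
        return;
    }
    let len = buffers[0].len();
    assert!(buffers.iter().all(|b| b.len() == len), "buffers must have equal length");

    // Chunk c covers the half-open range [bounds(c), bounds(c + 1)).
    let bounds = |c: usize| c * len / n;

    // Phase 1: scatter-reduce. In step s, node i sends chunk (i - s) mod n
    // to node i + 1, which adds it into its own copy of that chunk.
    for step in 0..n - 1 {
        // Snapshot outgoing chunks so every send in a step uses pre-step data.
        let outgoing: Vec<(usize, Vec<f32>)> = (0..n)
            .map(|i| {
                let c = (i + n - step) % n;
                (c, buffers[i][bounds(c)..bounds(c + 1)].to_vec())
            })
            .collect();
        for (i, (c, data)) in outgoing.into_iter().enumerate() {
            let dst = (i + 1) % n;
            for (x, y) in buffers[dst][bounds(c)..bounds(c + 1)].iter_mut().zip(data) {
                *x += y;
            }
        }
    }

    // Phase 2: all-gather. Node i now holds the fully reduced chunk
    // (i + 1) mod n and forwards it around the ring; receivers overwrite.
    for step in 0..n - 1 {
        let outgoing: Vec<(usize, Vec<f32>)> = (0..n)
            .map(|i| {
                let c = (i + 1 + n - step) % n;
                (c, buffers[i][bounds(c)..bounds(c + 1)].to_vec())
            })
            .collect();
        for (i, (c, data)) in outgoing.into_iter().enumerate() {
            let dst = (i + 1) % n;
            buffers[dst][bounds(c)..bounds(c + 1)].copy_from_slice(&data);
        }
    }
}
```

Each node only ever talks to its ring neighbor and moves one chunk per step, which is why the per-node traffic stays roughly constant as the cluster grows.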
## Usage
### Auto-Discovery (Zero-Config)
```rust
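use pmetal_distributed::{AutoDiscoveryBackend, DistributedContext};
use std::time::Duration;

// Discover peers over mDNS; no addresses need to be configured.
let backend = AutoDiscoveryBackend::new().await?;
// Wait up to 30 seconds for at least one peer, then form the ring.
backend.wait_for_peers(1, Duration::from_secs(30)).await?;
backend.establish_ring().await?;

let ctx = DistributedContext::new(Box::new(backend));
// `gradient_buffer` is the caller's local gradient buffer.
ctx.all_reduce(&mut gradient_buffer).await?;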
```
### Manual Configuration
```rust
use pmetal_distributed::{DistributedConfig, RingBackend, DistributedContext};

let config = DistributedConfig::new(
    vec!["192.168.1.10:52416".parse()?, "192.168.1.11:52416".parse()?],
    0, // This node's rank
);
let backend = RingBackend::new(config).await?;
let ctx = DistributedContext::new(Box::new(backend));
```
### Gradient Compression
```rust
use pmetal_distributed::{GradientCompressor, CompressionStrategy};

let mut compressor = GradientCompressor::new(
    CompressionStrategy::TopK { ratio: 0.1 },
    true, // enable error feedback
);
let compressed = compressor.compress(&gradients);
```
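
To show what TopK with error feedback does conceptually, here is a minimal standalone sketch. `TopKWithFeedback` and `SparseGrad` are hypothetical types written for this example and are not the crate's `GradientCompressor`; the assumption is that dropped values are accumulated in a residual and added back before the next selection.

```rust
/// Sparse gradient: (index, value) pairs for the entries that were kept.
struct SparseGrad {
    len: usize,
    entries: Vec<(usize, f32)>,
}

impl SparseGrad {
    /// Expands back to a dense vector, with zeros for dropped entries.
    fn to_dense(&self) -> Vec<f32> {
        let mut dense = vec![0.0; self.len];
        for &(i, v) in &self.entries {
            dense[i] = v;
        }
        dense
    }
}

/// Hypothetical TopK compressor with error feedback (illustration only).
struct TopKWithFeedback {
    ratio: f32,
    residual: Vec<f32>, // values dropped in earlier rounds
}

impl TopKWithFeedback {
    fn new(ratio: f32, len: usize) -> Self {
        Self { ratio, residual: vec![0.0; len] }
    }

    fn compress(&mut self, grad: &[f32]) -> SparseGrad {
        assert_eq!(grad.len(), self.residual.len());
        if grad.is_empty() {
            return SparseGrad { len: 0, entries: Vec::new() };
        }
        // Add back the error carried over from earlier rounds.
        let corrected: Vec<f32> =
            grad.iter().zip(&self.residual).map(|(g, r)| g + r).collect();

        // Keep the k largest entries by magnitude (at least one).
        let k = ((corrected.len() as f32 * self.ratio).ceil() as usize)
            .clamp(1, corrected.len());
        let mut order: Vec<usize> = (0..corrected.len()).collect();
        order.sort_by(|&a, &b| corrected[b].abs().total_cmp(&corrected[a].abs()));
        let mut entries: Vec<(usize, f32)> =
            order[..k].iter().map(|&i| (i, corrected[i])).collect();
        entries.sort_by_key(|e| e.0);

        // The new residual is everything that was not sent this round.
        self.residual = corrected;
        for &(i, _) in &entries {
            self.residual[i] = 0.0;
        }
        SparseGrad { len: grad.len(), entries }
    }
}
```

With error feedback, a small gradient component that is repeatedly dropped keeps growing in the residual until it is large enough to be selected, so information is delayed rather than lost.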
## Compression Strategies
| Strategy | Description | Compression Ratio |
|----------|-------------|-------------------|
| **TopK** | Keep top-k% gradients by magnitude | Configurable |
| **Random** | Probabilistic sparsification | Configurable |
| **FP16** | Half-precision quantization | 2x |
| **BF16** | Brain float quantization | 2x |
| **INT8** | 8-bit quantization | 4x |
| **PowerSGD** | Low-rank approximation | Rank-dependent |
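
The 4x figure for INT8 follows from storing one byte per value instead of four. The sketch below shows one common scheme, symmetric per-tensor quantization with a single `f32` scale; whether the crate uses this exact scheme is an assumption, and `quantize_int8` / `dequantize_int8` are hypothetical names.

```rust
/// Symmetric per-tensor INT8 quantization: one f32 scale plus one i8 per value.
/// Hypothetical helper for illustration; the crate's scheme may differ.
fn quantize_int8(values: &[f32]) -> (f32, Vec<i8>) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero for an all-zero tensor.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, q)
}

/// Restores approximate f32 values; the error per value is at most scale / 2.
fn dequantize_int8(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}
```

The payload shrinks from `4 * len` bytes to `len` bytes plus the 4-byte scale, which approaches 4x for large tensors.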
## Collective Operations
| Strategy | Latency | Bandwidth | Best For |
|----------|---------|-----------|----------|
| **Ring** | O(n) | O(1)/node | Large gradients, balanced clusters |
| **Tree** | O(log n) | O(log n)/node | Small messages, low latency |
| **Centralized** | O(n) | O(n)/root | Very small clusters (2-3 nodes) |
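
To make the bandwidth column concrete, the following sketch estimates bytes sent per all-reduce under simple textbook cost models. Both functions are hypothetical and ignore compression, headers, and retries; the centralized model assumes the root receives one full buffer from every other node and sends one full result back to each.

```rust
/// Bytes each node sends in one ring all-reduce of `payload_bytes`:
/// 2 phases x (n - 1) steps x one chunk of payload / n per step.
fn ring_bytes_sent_per_node(nodes: u64, payload_bytes: u64) -> u64 {
    if nodes < 2 {
        return 0;
    }
    // Integer division rounds down when the payload is not divisible by n.
    2 * (nodes - 1) * payload_bytes / nodes
}

/// Bytes moved through the root in a centralized reduce-then-broadcast
/// (assumed model: n - 1 full buffers in, n - 1 full buffers out).
fn centralized_bytes_at_root(nodes: u64, payload_bytes: u64) -> u64 {
    if nodes < 2 {
        return 0;
    }
    2 * (nodes - 1) * payload_bytes
}
```

For a 1000-byte payload, the ring estimate grows from 1500 bytes per node at 4 nodes to 1750 at 8, always staying below twice the payload, while the root in the centralized model handles 6000 bytes at 4 nodes and 14000 at 8.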
## Configuration
| Option | Default | Description |
|--------|---------|-------------|
| `gradient_port` | 52416 | TCP port for gradient exchange |
| `discovery_port` | 52415 | mDNS discovery port |
| `min_peers` | 1 | Minimum peers before ring formation |
| `peer_timeout` | 30s | Discovery timeout |
| `connection_timeout_ms` | 5000 | TCP connection timeout |
| `max_retries` | 3 | Connection retry limit |
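
As a reading aid, the defaults above could be modeled as a plain struct with a `Default` impl. `ExampleSettings` is a hypothetical type written for this README; the field names come from the table, but the field types and the crate's actual configuration type are assumptions.

```rust
use std::time::Duration;

/// Hypothetical mirror of the options table; not the crate's config type.
#[derive(Debug, Clone, PartialEq)]
struct ExampleSettings {
    gradient_port: u16,
    discovery_port: u16,
    min_peers: usize,
    peer_timeout: Duration,
    connection_timeout_ms: u64,
    max_retries: u32,
}

impl Default for ExampleSettings {
    fn default() -> Self {
        Self {
            gradient_port: 52416,
            discovery_port: 52415,
            min_peers: 1,
            peer_timeout: Duration::from_secs(30),
            connection_timeout_ms: 5000,
            max_retries: 3,
        }
    }
}
```

With this shape, overriding a single option uses struct update syntax, for example `ExampleSettings { min_peers: 3, ..Default::default() }` to wait for a 4-node cluster.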
## Modules
| Module | Description |
|--------|-------------|
| `auto` | Auto-discovery backend with mDNS |
| `ring` | Ring-based all-reduce for manual configuration |
| `discovery` | libp2p peer discovery service |
| `transport` | TCP transport with connection pooling |
| `collective` | Pluggable collective operation strategies |
| `compression` | Gradient compression with error feedback |
| `topology` | Cluster graph with node and connection profiles |
| `identity` | Persistent Ed25519 keypair management |
| `election` | Distributed master election |
| `health` | Heartbeat-based peer monitoring |
| `namespace` | PSK network isolation |
| `metrics` | Observability counters and gauges |
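
The seniority rule behind the `election` module can be sketched as a deterministic comparison that every node evaluates locally. The details here are assumptions: `Candidate` is a hypothetical type, seniority is modeled as the earliest join time, `peer_id` is a `String` stand-in for a libp2p `PeerId`, and ties go to the lexicographically smallest ID.

```rust
/// Hypothetical election candidate (illustration only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    peer_id: String,   // stand-in for a libp2p PeerId
    joined_at_ms: u64, // assumed seniority measure: earlier means more senior
}

/// Picks the most senior candidate, breaking ties by smallest peer ID.
/// Returns None when there are no candidates.
fn elect_master(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates.iter().min_by(|a, b| {
        a.joined_at_ms
            .cmp(&b.joined_at_ms)
            .then_with(|| a.peer_id.cmp(&b.peer_id))
    })
}
```

Because the ordering is total and depends only on data every node can see, all nodes with the same view of the cluster pick the same master without an extra voting round.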
## License
MIT OR Apache-2.0