//! Distributed Computing Module for NumRS2
//!
//! This module provides Pure Rust distributed computing capabilities for large-scale
//! numerical computations across multiple processes and machines.
//!
//! # Overview
//!
//! The distributed module implements MPI-like functionality in Pure Rust (no C bindings),
//! following the COOLJAPAN policy. It provides:
//!
//! - **Communication Layer**: Point-to-point and collective communication using tokio
//! - **Distributed Arrays**: Automatic partitioning and synchronization across processes
//! - **Collective Operations**: Reduce, broadcast, gather, scatter with various strategies
//! - **Distributed Linear Algebra**: Matrix operations distributed across processes
//! - **Process Management**: Process groups and communicators
//! - **Network Optimization**: Topology-aware communication and computation overlap
//! - **Distributed Training**: Data parallelism, model parallelism, and distributed optimizers
//!
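//! The collective operations listed above are modeled on MPI's collectives. As a rough
//! illustration, assuming MPI-style semantics, the sketch below simulates them sequentially in
//! one process and treats each inner `Vec` as one rank's buffer. The function names are
//! hypothetical and are not this module's API; the real operations exchange buffers
//! asynchronously over the communication layer.
//!
//! ```rust
//! // Hypothetical allreduce (sum): every rank receives the element-wise sum of all buffers.
//! fn allreduce_sum(buffers: &[Vec<f64>]) -> Vec<f64> {
//!     let len = buffers.first().map_or(0, |b| b.len());
//!     let mut out = vec![0.0; len];
//!     for buf in buffers {
//!         for (o, x) in out.iter_mut().zip(buf.iter()) {
//!             *o += *x;
//!         }
//!     }
//!     out
//! }
//!
//! // Hypothetical broadcast: the root's buffer is copied to every rank.
//! fn broadcast(root_data: &[f64], size: usize) -> Vec<Vec<f64>> {
//!     vec![root_data.to_vec(); size]
//! }
//!
//! // Hypothetical gather: the root concatenates all buffers in rank order.
//! fn gather(buffers: &[Vec<f64>]) -> Vec<f64> {
//!     buffers.concat()
//! }
//!
//! // Hypothetical scatter: the root splits its buffer into equal contiguous chunks.
//! // Returns `None` unless the length is a non-zero multiple of `size`.
//! fn scatter(root_data: &[f64], size: usize) -> Option<Vec<Vec<f64>>> {
//!     if size == 0 || root_data.is_empty() || root_data.len() % size != 0 {
//!         return None;
//!     }
//!     let chunk = root_data.len() / size;
//!     Some(root_data.chunks(chunk).map(|c| c.to_vec()).collect())
//! }
//!
//! fn main() {
//!     // Four simulated ranks, rank r holding [r, r].
//!     let buffers: Vec<Vec<f64>> = (0..4).map(|r| vec![r as f64; 2]).collect();
//!     assert_eq!(allreduce_sum(&buffers), vec![6.0, 6.0]);
//!     assert_eq!(gather(&buffers), vec![0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]);
//!     assert_eq!(broadcast(&[1.5], 3), vec![vec![1.5]; 3]);
//!     assert_eq!(scatter(&[1.0, 2.0, 3.0, 4.0], 2), Some(vec![vec![1.0, 2.0], vec![3.0, 4.0]]));
//! }
//! ```
//!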
//! # Architecture
//!
//! ```text
//! ┌─────────────────────────────────────────────────────────────────┐
//! │                        Application Layer                        │
//! ├──────────────────────┬──────────────────────────────────────────┤
//! │  Distributed Arrays  │        Distributed Linear Algebra        │
//! ├──────────────────────┴──────────────────────────────────────────┤
//! │                      Collective Operations                      │
//! ├──────────────────────┬──────────────────────────────────────────┤
//! │  Process Management  │           Communication Layer            │
//! ├──────────────────────┴──────────────────────────────────────────┤
//! │                      Network Optimization                       │
//! └─────────────────────────────────────────────────────────────────┘
//! ```
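//!
//! In this layering, collective operations sit on top of point-to-point messaging. The sketch
//! below illustrates that idea using standard-library threads and `std::sync::mpsc` channels as
//! stand-ins for processes and the tokio-based network layer. `Comm` and `allreduce_sum` are
//! hypothetical names, and reduce-to-root-then-broadcast is just one simple strategy, not
//! necessarily the one this crate uses.
//!
//! ```rust
//! use std::sync::mpsc::{channel, Receiver, Sender};
//! use std::thread;
//!
//! // Hypothetical point-to-point layer: each rank owns an inbox and a sender to every rank.
//! struct Comm {
//!     rank: usize,
//!     senders: Vec<Sender<(usize, f64)>>, // messages are (source rank, value)
//!     inbox: Receiver<(usize, f64)>,
//! }
//!
//! impl Comm {
//!     fn send(&self, dest: usize, value: f64) -> Result<(), String> {
//!         self.senders
//!             .get(dest)
//!             .ok_or("invalid destination rank")?
//!             .send((self.rank, value))
//!             .map_err(|e| e.to_string())
//!     }
//!
//!     fn recv(&self) -> Result<f64, String> {
//!         self.inbox.recv().map(|(_, v)| v).map_err(|e| e.to_string())
//!     }
//! }
//!
//! // Collective layer built only on send/recv: reduce to rank 0, then broadcast the total.
//! fn allreduce_sum(comm: &Comm, value: f64) -> Result<f64, String> {
//!     let size = comm.senders.len();
//!     if comm.rank == 0 {
//!         let mut total = value;
//!         for _ in 1..size {
//!             total += comm.recv()?;
//!         }
//!         for dest in 1..size {
//!             comm.send(dest, total)?;
//!         }
//!         Ok(total)
//!     } else {
//!         comm.send(0, value)?;
//!         comm.recv()
//!     }
//! }
//!
//! // Spawn `size` threads acting as ranks; rank r contributes the value r.
//! fn run(size: usize) -> Vec<Result<f64, String>> {
//!     let (senders, inboxes): (Vec<_>, Vec<_>) =
//!         (0..size).map(|_| channel::<(usize, f64)>()).unzip();
//!     let handles: Vec<_> = inboxes
//!         .into_iter()
//!         .enumerate()
//!         .map(|(rank, inbox)| {
//!             let comm = Comm { rank, senders: senders.clone(), inbox };
//!             thread::spawn(move || allreduce_sum(&comm, rank as f64))
//!         })
//!         .collect();
//!     handles
//!         .into_iter()
//!         .map(|h| h.join().unwrap_or_else(|_| Err("rank panicked".to_string())))
//!         .collect()
//! }
//!
//! fn main() {
//!     // Ranks contribute 0, 1, 2, 3, so every rank should end up with 6.
//!     let expected: Vec<Result<f64, String>> = vec![Ok(6.0); 4];
//!     assert_eq!(run(4), expected);
//! }
//! ```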
//!
//! # Features
//!
//! - **Pure Rust**: No MPI C bindings, fully safe Rust implementation
//! - **Async Communication**: Built on tokio for efficient async I/O
//! - **Type Safety**: Generic implementations with trait bounds
//! - **Error Handling**: Comprehensive error handling with `Result<T>`
//! - **No Unwrap**: Follows COOLJAPAN no-unwrap policy
//! - **Oxicode Serialization**: Fast binary serialization without C dependencies
//!
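//! To illustrate the binary-serialization and no-unwrap points together, the sketch below encodes
//! an `f64` buffer as a little-endian element count followed by the payload, and decodes it with
//! `Result`-based error handling instead of `unwrap`. It uses only the standard library and does
//! **not** reproduce oxicode's actual wire format; `encode` and `decode` are hypothetical helpers.
//!
//! ```rust
//! use std::convert::TryFrom;
//!
//! // Hypothetical wire format: u64 element count (little-endian), then each f64 little-endian.
//! fn encode(values: &[f64]) -> Vec<u8> {
//!     let mut out = Vec::with_capacity(8 + 8 * values.len());
//!     out.extend_from_slice(&(values.len() as u64).to_le_bytes());
//!     for v in values {
//!         out.extend_from_slice(&v.to_le_bytes());
//!     }
//!     out
//! }
//!
//! // Decode the format above; malformed input yields `Err` instead of a panic.
//! fn decode(bytes: &[u8]) -> Result<Vec<f64>, String> {
//!     if bytes.len() < 8 {
//!         return Err("missing length header".to_string());
//!     }
//!     let (header, body) = bytes.split_at(8);
//!     let header = <[u8; 8]>::try_from(header).map_err(|e| e.to_string())?;
//!     let len = usize::try_from(u64::from_le_bytes(header)).map_err(|e| e.to_string())?;
//!     let expected = len.checked_mul(8).ok_or("length overflow")?;
//!     if body.len() != expected {
//!         return Err(format!("expected {} payload bytes, found {}", expected, body.len()));
//!     }
//!     body.chunks_exact(8)
//!         .map(|chunk| {
//!             <[u8; 8]>::try_from(chunk)
//!                 .map(f64::from_le_bytes)
//!                 .map_err(|e| e.to_string())
//!         })
//!         .collect()
//! }
//!
//! fn main() {
//!     let data = vec![0.5, -2.0, 3.25];
//!     let bytes = encode(&data);
//!     assert_eq!(bytes.len(), 8 + 3 * 8);
//!     assert_eq!(decode(&bytes), Ok(data));
//!     // A truncated buffer is reported as an error rather than panicking.
//!     assert!(decode(&bytes[..10]).is_err());
//! }
//! ```
//!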
//! # Example
//!
//! ```rust,no_run
//! use numrs2::distributed::prelude::*;
//!
//! # async fn example() -> Result<(), Box<dyn std::error::Error>> {
//! // Initialize distributed environment
//! let world = init().await?;
//! let rank = world.rank();
//! let size = world.size();
//!
//! // Create distributed array
//! let local_data = vec![rank as f64; 100];
//! // Total size across all processes; this example assumes 4 ranks of 100 elements each.
//! let global_size = 400;
//! let dist_array = DistributedArray::from_local(
//!     local_data,
//!     DistributionStrategy::Block,
//!     global_size,
//!     &world,
//! )?;
//!
//! // Perform collective operation
//! let sum = allreduce(dist_array.local_data(), ReduceOp::Sum, &world).await?;
//!
//! // Distributed matrix multiplication
//! let result = distributed_matmul(&dist_array, &dist_array).await?;
//!
//! // Finalize
//! finalize(world).await?;
//! # Ok(())
//! # }
//! ```
//!
//! # Performance Considerations
//!
//! - Use block distribution for large contiguous arrays
//! - Use cyclic distribution to balance load across processes for irregular workloads
//! - Enable network optimization for topology-aware communication
//! - Overlap computation and communication using async operations
//! - Consider data compression for network-bound operations
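//!
//! The block versus cyclic trade-off comes down to which rank owns each global index. A minimal
//! sketch, assuming `n` elements over `p` ranks and that under block distribution the first
//! `n % p` ranks take one extra element (a common convention that may differ from how
//! `DistributionStrategy::Block` handles remainders here); `block_owner` and `cyclic_owner`
//! are hypothetical helpers:
//!
//! ```rust
//! // Rank owning global index `i` (requires i < n and p > 0) when `n` elements are split into
//! // `p` contiguous blocks; the first `n % p` ranks receive one extra element (assumed).
//! fn block_owner(i: usize, n: usize, p: usize) -> usize {
//!     let base = n / p;
//!     let extra = n % p;
//!     let split = extra * (base + 1); // indices below `split` sit in the larger blocks
//!     if i < split {
//!         i / (base + 1)
//!     } else {
//!         extra + (i - split) / base
//!     }
//! }
//!
//! // Rank owning global index `i` under cyclic (round-robin) distribution over `p` ranks.
//! fn cyclic_owner(i: usize, p: usize) -> usize {
//!     i % p
//! }
//!
//! fn main() {
//!     let (n, p) = (10, 3);
//!     let block: Vec<usize> = (0..n).map(|i| block_owner(i, n, p)).collect();
//!     let cyclic: Vec<usize> = (0..n).map(|i| cyclic_owner(i, p)).collect();
//!     // Block: contiguous runs of 4, 3 and 3 elements.
//!     assert_eq!(block, vec![0, 0, 0, 0, 1, 1, 1, 2, 2, 2]);
//!     // Cyclic: neighbouring indices land on different ranks.
//!     assert_eq!(cyclic, vec![0, 1, 2, 0, 1, 2, 0, 1, 2, 0]);
//! }
//! ```
//!
//! Under block distribution each rank holds one contiguous range, which suits large arrays that
//! are processed uniformly. Under cyclic distribution a costly region of the index space is
//! spread across all ranks, which is why it helps with irregular workloads.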
//!
//! # See Also
//!
//! - [`comm`]: Low-level communication primitives
//! - [`collective`]: High-level collective operations
//! - [`mod@array`]: Distributed array structures
//! - [`linalg`]: Distributed linear algebra
//! - [`process`]: Process management and communicators
/// Re-exports for convenient use