// This is currently impractical to test as we lack the capability to simulate mock processor configurations.
//! [Criterion][1] benchmark harness designed to compare different modes of distributing work in a
//! many-processor system with multiple memory regions. This helps highlight the performance impact of
//! cross-memory-region data transfers, cross-processor data transfers and multi-threaded logic.
//!
//! This is part of the [Folo project](https://github.com/folo-rs/folo) that provides mechanisms for
//! high-performance hardware-aware programming in Rust.
//!
//! # Execution model
//!
//! The benchmark harness selects **pairs of processors** that will execute each iteration of a
//! benchmark scenario, preparing and processing **payloads**. The iteration time is the maximum
//! duration of any worker (whichever worker takes longest to process the payload it is given).
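//!
//! As a rough illustration of "the maximum duration of any worker", the sketch below times two
//! closures on separate threads and reports the slower one. It is not the harness implementation:
//! `spawn_timed` and `pair_iteration_time` are hypothetical names, and the threads are not pinned
//! to any processor here, whereas the harness selects specific processor pairs.
//!
//! ```rust
//! use std::thread::{self, JoinHandle};
//! use std::time::{Duration, Instant};
//!
//! /// Hypothetical helper: runs `work` on a new thread and yields its elapsed time.
//! fn spawn_timed<F>(work: F) -> JoinHandle<Duration>
//! where
//!     F: FnOnce() + Send + 'static,
//! {
//!     thread::spawn(move || {
//!         let start = Instant::now();
//!         work();
//!         start.elapsed()
//!     })
//! }
//!
//! /// Hypothetical helper: the iteration time of a pair is that of its slowest worker.
//! fn pair_iteration_time<A, B>(first: A, second: B) -> Duration
//! where
//!     A: FnOnce() + Send + 'static,
//!     B: FnOnce() + Send + 'static,
//! {
//!     // Start both workers before waiting on either so that they run concurrently.
//!     let first = spawn_timed(first);
//!     let second = spawn_timed(second);
//!
//!     let first = first.join().expect("first worker panicked");
//!     let second = second.join().expect("second worker panicked");
//!
//!     first.max(second)
//! }
//! ```
//!
//! Taking the maximum rather than the sum or the average means that a single slow worker, for
//! example one reading from a remote memory region, determines the reported iteration time.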
//!
//! The criteria for processor pair selection are determined by the specified [`WorkDistribution`],
//! with the final selection randomized for each iteration if there are multiple equally valid candidate
//! processor pairs.
//!
//! # Usage
//!
//! For each benchmark scenario, define a type that implements the [`Payload`] trait. Executing a
//! benchmark scenario consists of the following major steps:
//!
//! 1. For each processor pair, [a payload pair is created][3].
//! 1. Each payload is moved to its assigned processor and [prepared][4]. This is where the payload data
//! set is typically generated.
//! 1. Depending on the work distribution mode, the payloads may now be exchanged between the assigned
//! processors, to ensure that we process "foreign" data on each processor.
//! 1. The payload is [processed][5] by each worker in the pair. This is the timed step.
//!
//! "Foreign" data here means that if the two workers are in different memory regions, the data
//! is likely to reside in a different memory region than the one local to the processor that
//! processes the payload.
//!
//! This is because physical memory pages of heap-allocated objects are allocated in the memory region
//! of the processor that initializes the memory (in the "prepare" step), so despite the payload later
//! being moved to a different worker's thread, any heap-allocated data referenced by the payload
//! remains where it is, which may be in physical memory modules that are not directly connected to
//! the processor that will process the payload.
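//!
//! The following sketch illustrates only the "data stays where it is" part of this reasoning,
//! using the standard library alone. A buffer is allocated and filled on one thread, then its
//! owner is moved to a second thread; the heap address observed by both threads is identical.
//! The function name `prepare_here_process_there` is hypothetical, the threads are not pinned to
//! processors, and which memory region the pages end up in depends on the operating system's
//! placement policy, which this sketch does not inspect.
//!
//! ```rust
//! use std::thread;
//!
//! /// Fills a buffer on one thread and reads it on another.
//! ///
//! /// Returns the buffer address seen by the preparing thread, the address seen by the
//! /// processing thread and the sum of all bytes.
//! fn prepare_here_process_there(len: usize) -> (usize, usize, u64) {
//!     // "Prepare": this thread allocates the buffer and writes to it first.
//!     let (buffer, prepared_at) = thread::spawn(move || {
//!         let buffer = vec![1_u8; len];
//!         let address = buffer.as_ptr() as usize;
//!         (buffer, address)
//!     })
//!     .join()
//!     .expect("prepare thread panicked");
//!
//!     // "Process": moving the `Vec` moves only its pointer, length and capacity.
//!     // The heap allocation itself is not copied or relocated.
//!     let (processed_at, sum) = thread::spawn(move || {
//!         let address = buffer.as_ptr() as usize;
//!         let sum = buffer.iter().map(|&byte| u64::from(byte)).sum::<u64>();
//!         (address, sum)
//!     })
//!     .join()
//!     .expect("process thread panicked");
//!
//!     (prepared_at, processed_at, sum)
//! }
//! ```
//!
//! Calling `prepare_here_process_there(4096)` yields two equal addresses and a sum of 4096,
//! showing that the second thread reads the very bytes the first thread wrote.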
//!
//! # Example
//!
//! A simple scenario that merely copies memory from a foreign buffer to a local one
//! (`benches/many_cpus_harness_demo.rs`):
//!
//! ```rust ignore (benchmark)
//! use std::hint::black_box;
//! use std::ptr;
//!
//! use many_cpus_benchmarking::Payload;
//!
//! const COPY_BYTES_LEN: usize = 64 * 1024 * 1024;
//!
//! /// Sample benchmark scenario that copies bytes between the two paired payloads.
//! ///
//! /// The source buffers are allocated in the "prepare" step and become local to the "prepare" worker.
//! /// The destination buffers are allocated in the "process" step. The end result is that we copy
//! /// from remote memory (allocated in the "prepare" step) to local memory in the "process" step.
//! ///
//! /// There is no deep meaning behind this scenario, just a sample benchmark that showcases comparing
//! /// different work distribution modes to identify performance differences from hardware-awareness.
//! #[derive(Debug, Default)]
//! struct CopyBytes {
//!     from: Option<Vec<u8>>,
//! }
//!
//! impl Payload for CopyBytes {
//!     fn new_pair() -> (Self, Self) {
//!         (Self::default(), Self::default())
//!     }
//!
//!     fn prepare(&mut self) {
//!         self.from = Some(vec![99; COPY_BYTES_LEN]);
//!     }
//!
//!     fn process(&mut self) {
//!         let from = self.from.as_ref().unwrap();
//!         let mut to = Vec::with_capacity(COPY_BYTES_LEN);
//!
//!         // SAFETY: `from` holds COPY_BYTES_LEN initialized bytes, `to` has capacity for
//!         // COPY_BYTES_LEN bytes, and the two are separate allocations that cannot overlap.
//!         unsafe {
//!             ptr::copy_nonoverlapping(from.as_ptr(), to.as_mut_ptr(), COPY_BYTES_LEN);
//!         }
//!
//!         // SAFETY: The copy above initialized the first COPY_BYTES_LEN bytes of `to`.
//!         unsafe {
//!             to.set_len(COPY_BYTES_LEN);
//!         }
//!
//!         // Read from the destination to prevent the compiler from optimizing the copy away.
//!         _ = black_box(to[0]);
//!     }
//! }
//! ```
//!
//! This scenario is executed in a Criterion benchmark by calling [`execute_runs()`][6] and providing
//! the desired work distribution modes to use:
//!
//! ```rust ignore (benchmark)
//! fn entrypoint(c: &mut Criterion) {
//!     execute_runs::<CopyBytes, 1>(c, WorkDistribution::all());
//! }
//! ```
//!
//! Example output (in `target/criterion/report` after benchmarking):
//!
//! <img src="https://media.githubusercontent.com/media/folo-rs/folo/refs/heads/main/packages/many_cpus_benchmarking/images/work_distribution_comparison.png">
//!
//! # Step-by-step guides for common scenarios
//!
//! The following sections provide detailed guidance for implementing the most important
//! multi-threaded benchmarking scenarios using this crate.
//!
//! ## Scenario 1: Multiple threads performing the same action on shared data
//!
//! This scenario is useful for measuring how memory locality affects performance when multiple
//! threads perform identical operations on shared data structures. Examples include concurrent
//! readers of a shared cache, multiple workers processing items from a shared queue using the
//! same algorithm, or parallel searchers scanning the same dataset.
//!
//! ### Step-by-step implementation
//!
//! 1. **Define the payload struct** with shared data wrapped in thread-safe containers:
//!
//! ```rust
//! use std::collections::HashMap;
//! use std::sync::{Arc, RwLock};
//!
//! use many_cpus_benchmarking::Payload;
//!
//! #[derive(Debug, Default)]
//! struct SharedDataSameAction {
//!     // Shared data structure accessible by both workers
//!     shared_map: Arc<RwLock<HashMap<u64, u64>>>,
//!
//!     // Flag to designate which worker initializes the data
//!     is_initializer: bool,
//! }
//! ```
//!
//! 2. **Implement `new_pair()`** to create connected payload instances:
//!
//! ```rust
//! # use std::collections::HashMap;
//! # use std::sync::{Arc, RwLock};
//! # use many_cpus_benchmarking::Payload;
//! # #[derive(Debug, Default)]
//! # struct SharedDataSameAction {
//! #     shared_map: Arc<RwLock<HashMap<u64, u64>>>,
//! #     is_initializer: bool,
//! # }
//! impl Payload for SharedDataSameAction {
//!     fn new_pair() -> (Self, Self) {
//!         let shared_map = Arc::new(RwLock::new(HashMap::new()));
//!
//!         let worker1 = Self {
//!             shared_map: Arc::clone(&shared_map),
//!             is_initializer: true,
//!         };
//!
//!         let worker2 = Self {
//!             shared_map,
//!             is_initializer: false,
//!         };
//!
//!         (worker1, worker2)
//!     }
//!
//!     fn prepare(&mut self) {
//!         // Only one worker initializes the shared data
//!         if self.is_initializer {
//!             let mut map = self.shared_map.write().unwrap();
//!             for i in 0..1000 {
//!                 map.insert(i, i * 2);
//!             }
//!         }
//!     }
//!
//!     fn process(&mut self) {
//!         // Both workers perform the same operation
//!         let map = self.shared_map.read().unwrap();
//!         for key in 0..1000 {
//!             std::hint::black_box(map.get(&key));
//!         }
//!     }
//! }
//! ```
//!
//! 3. **Choose appropriate work distribution modes**. Since both workers perform the same action,
//! payload exchange does not change the benchmark semantics, so you can exclude "self" modes:
//!
//! ```rust ignore (benchmark)
//! use criterion::Criterion;
//! use many_cpus_benchmarking::{WorkDistribution, execute_runs};
//!
//! fn benchmark_shared_reads(c: &mut Criterion) {
//!     // Focus on distribution modes that use different processors
//!     execute_runs::<SharedDataSameAction, 100>(
//!         c,
//!         WorkDistribution::all_with_unique_processors_without_self()
//!     );
//! }
//! ```
//!
//! ### Key considerations for this scenario
//!
//! - Use thread-safe containers like `Arc<RwLock<T>>` or lock-free data structures
//! - Designate one worker as the initializer to avoid race conditions during setup
//! - Both workers should perform identical operations in `process()`
//! - Consider using `WorkDistribution::all_with_unique_processors_without_self()` since payload
//! exchange does not affect the benchmark when workers do the same thing
//! - Memory locality effects will be most visible when workers are in different memory regions
//!
//! ## Scenario 2: Multiple threads performing different actions (producer-consumer pattern)
//!
//! This scenario measures performance when threads have complementary roles, such as
//! producer-consumer pairs, sender-receiver communication, or writer-reader patterns.
//! This is essential for understanding how memory locality affects inter-thread communication.
//!
//! ### Step-by-step implementation
//!
//! 1. **Define the payload struct** with communication mechanisms and role identifiers:
//!
//! ```rust
//! use std::sync::mpsc;
//!
//! use many_cpus_benchmarking::Payload;
//!
//! #[derive(Debug)]
//! struct ProducerConsumerPattern {
//!     // Communication channels
//!     sender: mpsc::Sender<u64>,
//!     receiver: mpsc::Receiver<u64>,
//!
//!     // Role identifier
//!     is_producer: bool,
//! }
//! ```
//!
//! 2. **Implement `new_pair()`** to create complementary worker roles:
//!
//! ```rust
//! # use std::sync::mpsc;
//! # use many_cpus_benchmarking::Payload;
//! # #[derive(Debug)]
//! # struct ProducerConsumerPattern {
//! #     sender: mpsc::Sender<u64>,
//! #     receiver: mpsc::Receiver<u64>,
//! #     is_producer: bool,
//! # }
//! impl Payload for ProducerConsumerPattern {
//!     fn new_pair() -> (Self, Self) {
//!         let (tx1, rx1) = mpsc::channel();
//!         let (tx2, rx2) = mpsc::channel();
//!
//!         let producer = Self {
//!             sender: tx1,
//!             receiver: rx2,
//!             is_producer: true,
//!         };
//!
//!         let consumer = Self {
//!             sender: tx2,
//!             receiver: rx1,
//!             is_producer: false,
//!         };
//!
//!         (producer, consumer)
//!     }
//!
//!     fn prepare(&mut self) {
//!         // Pre-populate channels to avoid deadlocks
//!         for i in 0..1000 {
//!             let _ = self.sender.send(i);
//!         }
//!     }
//!
//!     fn process(&mut self) {
//!         if self.is_producer {
//!             // Producer: mostly sends data
//!             for i in 0..5000 {
//!                 let _ = self.sender.send(i);
//!                 if i % 10 == 0 {
//!                     if let Ok(response) = self.receiver.try_recv() {
//!                         std::hint::black_box(response);
//!                     }
//!                 }
//!             }
//!         } else {
//!             // Consumer: mostly receives and processes data
//!             for _ in 0..5000 {
//!                 // Use blocking recv() to ensure consistent work per iteration
//!                 if let Ok(data) = self.receiver.recv() {
//!                     let processed = data * 2;
//!                     std::hint::black_box(processed);
//!                     if data % 5 == 0 {
//!                         let _ = self.sender.send(processed);
//!                     }
//!                 }
//!             }
//!         }
//!     }
//! }
//! ```
//!
//! 3. **Use all work distribution modes** since different worker roles make payload exchange meaningful:
//!
//! ```rust ignore (benchmark)
//! use criterion::Criterion;
//! use many_cpus_benchmarking::{WorkDistribution, execute_runs};
//!
//! fn benchmark_producer_consumer(c: &mut Criterion) {
//!     // All distribution modes are relevant for different worker roles
//!     execute_runs::<ProducerConsumerPattern, 200>(
//!         c,
//!         WorkDistribution::all_with_unique_processors()
//!     );
//! }
//! ```
//!
//! ### Key considerations for this scenario
//!
//! - Design complementary roles that represent realistic workload patterns
//! - Use communication primitives appropriate for your use case (channels, shared memory, etc.)
//! - Pre-populate communication channels in `prepare()` to avoid deadlocks
//! - Include all work distribution modes since role differences make payload exchange meaningful
//! - Consider bidirectional communication to create realistic interaction patterns
//! - Be aware that some distribution modes like `PinnedSameProcessor` may not be suitable
//! for scenarios requiring real-time collaboration between workers
//!
//! ## Choosing the right work distribution modes
//!
//! Different scenarios benefit from different work distribution mode selections:
//!
//! - **Same action scenarios**: Use `WorkDistribution::all_with_unique_processors_without_self()`
//! to focus on memory locality effects without payload exchange overhead
//! - **Different action scenarios**: Use `WorkDistribution::all_with_unique_processors()` to
//! include both memory locality and payload exchange effects
//! - **All scenarios**: Use `WorkDistribution::all()` to get the complete picture including
//! same-processor execution modes
//!
//! For complete working examples, see:
//! - `examples/shared_data_same_action.rs` - Multiple readers of shared `HashMap`
//! - `examples/shared_data_different_actions.rs` - Producer-consumer channel communication
//!
//! # Payload multiplier
//!
//! It may sometimes be desirable to multiply the size of a benchmark scenario, e.g. if a scenario is
//! very fast and completes too quickly for meaningful or comparable measurements due to the
//! worker orchestration overhead.
//!
//! Use the second generic parameter of `execute_runs` to apply a multiplier to the payload size. This
//! simply uses multiple payloads for each iteration (on the same worker), allowing the impact from the
//! benchmark harness overheads to be reduced, so the majority of the time is spent on payload
//! processing.
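//!
//! To get a feel for how a multiplier dilutes fixed overhead, the sketch below computes the share
//! of an iteration that is overhead under a deliberately simple model: one fixed orchestration
//! cost per iteration plus `multiplier` payloads processed back to back. The function
//! `overhead_share` and this cost model are assumptions made for illustration only; they are not
//! part of this crate and real overhead need not be constant.
//!
//! ```rust
//! use std::time::Duration;
//!
//! /// Hypothetical model: fraction of one iteration spent on harness overhead.
//! ///
//! /// Assumes a fixed `overhead` per iteration and `multiplier` payloads that each take
//! /// `per_payload` to process. The result is NaN if both durations are zero.
//! fn overhead_share(overhead: Duration, per_payload: Duration, multiplier: u32) -> f64 {
//!     let processing = per_payload * multiplier;
//!     let total = overhead + processing;
//!
//!     overhead.as_secs_f64() / total.as_secs_f64()
//! }
//! ```
//!
//! With made-up figures of 50 microseconds of overhead and 10 microseconds per payload, a
//! multiplier of 1 gives an overhead share of about 0.83, while a multiplier of 100 brings it
//! down to about 0.05.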
//!
//! [1]: https://bheisler.github.io/criterion.rs/book/index.html
//! [3]: crate::Payload::new_pair
//! [4]: crate::Payload::prepare
//! [5]: crate::Payload::process
//! [6]: crate::execute_runs
pub
pub use *;
pub use *;
pub use *;