many_cpus_benchmarking/lib.rs
//! [Criterion][1] benchmark harness designed to compare different modes of distributing work in a
//! many-processor system with multiple memory regions. This helps highlight the performance impact of
//! cross-memory-region data transfers, cross-processor data transfers, and multi-threaded logic.
//!
//! This is part of the [Folo project](https://github.com/folo-rs/folo) that provides mechanisms for
//! high-performance hardware-aware programming in Rust.
//!
//! # Execution model
//!
//! The benchmark harness selects **pairs of processors** that will execute each iteration of a
//! benchmark scenario, preparing and processing **payloads**. The iteration time is the maximum
//! duration of any worker (whichever worker takes longest to process the payload it is given).
//!
//! The criteria for selecting processor pairs are determined by the specified [`WorkDistribution`],
//! with the final selection randomized for each iteration if there are multiple equally valid
//! candidate processor pairs.
//!
//! # Usage
//!
//! For each benchmark scenario, define a type that implements the [`Payload`] trait. Executing a
//! benchmark scenario consists of the following major steps:
//!
//! 1. For each processor pair, [a payload pair is created][3].
//! 1. Each payload is moved to its assigned processor and [prepared][4]. This is where the payload
//!    data set is typically generated.
//! 1. Depending on the work distribution mode, the payloads may now be exchanged between the
//!    assigned processors, to ensure that we process "foreign" data on each processor.
//! 1. The payload is [processed][5] by each worker in the pair. This is the timed step.
//!
//! The reference to "foreign" data here means that if the two workers are in different memory
//! regions, the data is likely to reside in a different memory region than the one local to the
//! processor that processes the payload.
//!
//! This is because the physical memory pages of heap-allocated objects are allocated in the memory
//! region of the processor that first initializes that memory (in the "prepare" step). Even though
//! the payload is later moved to a different worker's thread, any heap-allocated data it references
//! remains where it is, potentially in physical memory modules that are not directly connected to
//! the processor that will process the payload.
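//!
//! In terms of code, the steps above exercise a trait surface along these lines, inferred from the
//! example below (a sketch only; the real trait in this crate may carry additional bounds or
//! provided items, and the `Send` bound here is an assumption based on payloads being moved
//! between worker threads):
//!
//! ```rust ignore (sketch)
//! /// Implemented by each benchmark scenario; every worker in a pair receives one instance.
//! pub trait Payload: Sized + Send {
//!     /// Creates the pair of payloads handed to the two paired workers.
//!     fn new_pair() -> (Self, Self);
//!
//!     /// Generates the data set on the assigned processor. Not timed.
//!     fn prepare(&mut self);
//!
//!     /// Performs the work being measured. This is the timed step.
//!     fn process(&mut self);
//! }
//! ```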
//!
//! # Example
//!
//! A simple scenario that merely copies memory from a foreign buffer to a local one
//! (`benches/many_cpus_harness_demo.rs`):
//!
//! ```rust ignore (benchmark)
//! use std::hint::black_box;
//! use std::ptr;
//!
//! use many_cpus_benchmarking::Payload;
//!
//! const COPY_BYTES_LEN: usize = 64 * 1024 * 1024;
//!
//! /// Sample benchmark scenario that copies bytes between the two paired payloads.
//! ///
//! /// The source buffers are allocated in the "prepare" step and become local to the "prepare" worker.
//! /// The destination buffers are allocated in the "process" step. The end result is that we copy
//! /// from remote memory (allocated in the "prepare" step) to local memory in the "process" step.
//! ///
//! /// There is no deep meaning behind this scenario, just a sample benchmark that showcases comparing
//! /// different work distribution modes to identify performance differences from hardware-awareness.
//! #[derive(Debug, Default)]
//! struct CopyBytes {
//!     from: Option<Vec<u8>>,
//! }
//!
//! impl Payload for CopyBytes {
//!     fn new_pair() -> (Self, Self) {
//!         (Self::default(), Self::default())
//!     }
//!
//!     fn prepare(&mut self) {
//!         self.from = Some(vec![99; COPY_BYTES_LEN]);
//!     }
//!
//!     fn process(&mut self) {
//!         let from = self.from.as_ref().unwrap();
//!         let mut to = Vec::with_capacity(COPY_BYTES_LEN);
//!
//!         // SAFETY: The pointers are valid, the length is correct, all is well.
//!         unsafe {
//!             ptr::copy_nonoverlapping(from.as_ptr(), to.as_mut_ptr(), COPY_BYTES_LEN);
//!         }
//!
//!         // SAFETY: We just filled these bytes, it is all good.
//!         unsafe {
//!             to.set_len(COPY_BYTES_LEN);
//!         }
//!
//!         // Read from the destination to prevent the compiler from optimizing the copy away.
//!         _ = black_box(to[0]);
//!     }
//! }
//! ```
//!
//! This scenario is executed in a Criterion benchmark by calling [`execute_runs()`][6] and providing
//! the desired work distribution modes to use:
//!
//! ```rust ignore (benchmark)
//! use criterion::Criterion;
//! use many_cpus_benchmarking::{WorkDistribution, execute_runs};
//!
//! fn entrypoint(c: &mut Criterion) {
//!     execute_runs::<CopyBytes, 1>(c, WorkDistribution::all());
//! }
//! ```
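//!
//! In a complete benchmark target, the entry point is then registered with Criterion's standard
//! macros, along the lines of this sketch (`benches` is an arbitrary group name):
//!
//! ```rust ignore (benchmark)
//! use criterion::{criterion_group, criterion_main};
//!
//! criterion_group!(benches, entrypoint);
//! criterion_main!(benches);
//! ```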
//!
//! Example output (in `target/criterion/report` after benchmarking):
//!
//! <img src="https://media.githubusercontent.com/media/folo-rs/folo/refs/heads/main/crates/many_cpus_benchmarking/images/work_distribution_comparison.png">
//!
//! # Payload multiplier
//!
//! It may sometimes be desirable to multiply the size of a benchmark scenario, e.g. if a scenario
//! is very fast and completes too quickly for meaningful or comparable measurements due to the
//! worker orchestration overhead.
//!
//! Use the second generic parameter of `execute_runs` to apply a multiplier to the payload size.
//! This simply uses multiple payloads for each iteration (on the same worker), reducing the impact
//! of the benchmark harness overhead so that the majority of the time is spent on payload
//! processing.
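//!
//! For example, to process eight payloads per worker in each iteration instead of one (the
//! multiplier value of 8 is illustrative, chosen so that payload processing dominates the
//! harness overhead):
//!
//! ```rust ignore (benchmark)
//! fn entrypoint(c: &mut Criterion) {
//!     execute_runs::<CopyBytes, 8>(c, WorkDistribution::all());
//! }
//! ```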
//!
//! [1]: https://bheisler.github.io/criterion.rs/book/index.html
//! [3]: crate::Payload::new_pair
//! [4]: crate::Payload::prepare
//! [5]: crate::Payload::process
//! [6]: crate::execute_runs

mod cache;
mod payload;
mod run;
mod work_distribution;

pub(crate) use cache::*;
pub use payload::*;
pub use run::*;
pub use work_distribution::*;