scx_utils 1.1.0

Utilities for sched_ext schedulers
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
// Copyright (c) Meta Platforms, Inc. and affiliates.

// This software may be used and distributed according to the terms of the
// GNU General Public License version 2.

//! # SCX Load Calculator
//!
//! A crate providing abstractions for calculating load between scheduling
//! domains.
//!
//! Load Balancing Primer
//! ---------------------
//!
//! Let's begin by doing a brief overview of load balancing. In order to
//! understand load balancing, we'll introduce the relevant terms, and explain
//! what load balancing is from there:
//!
//! *Weight*
//!
//! A positive integer value representing a task's priority in the system. The
//! meaning of weight is ultimately up to the caller to determine and apply, but
//! conceptually speaking weight is a linear scaling factor of a task's
//! perceived load (defined below) in the system.
//!
//! *Duty Cycle*
//!
//! A task's duty cycle is the proportion of time that it could have used a CPU
//! in some time window. For example, say that a task does nothing but spin in a
//! tight loop for 10ms, and then sleep for 10ms. Then in a 20ms time window,
//! its duty cycle would be 0.5.
//!
//! Note that this is not the same thing as the proportion of time it's
//! _runnable_. For example, if you had two such tasks sharing a CPU, one of
//! them may wait for 10ms to use the core while the other one runs, and then it
//! would run for its 10ms duty cycle. It was runnable for 20ms, but could only
//! actually use the CPU for 10ms, so its duty cycle was 10ms / 20ms = 0.5.
//!
//! *Load*
//!
//! A scheduling entity's load l_x is simply the product of its weight and duty
//! cycle:
//!
//! l_x = w_x * d_x
//!
//! *Infeasible Weights*
//!
//! At a conceptual level, the goal of a load balancer is of course to balance
//! load across the system. If a scheduling entity has load l_x, and the total
//! sum of all entities' loads on the system is L, then entity x should receive
//! l_x / L proportion of the available CPU capacity on the system.
//!
//! This formulation works fine in many cases, but can break down when an
//! entity's proportion of load (and thus allotted CPU capacity) exceeds the
//! amount of CPU it can consume.
//!
//! For example, say you were on a system with 2 sets of 2-core groups (for a
//! total of 4 cores on the system), eight tasks with the following duty cycle
//! and weight:
//!
//! ID          Weight          Duty Cycle              Load
//!
//! 0           1               1.0                     1
//! 1           1               1.0                     1
//! 2           1               1.0                     1
//! 3           1               1.0                     1
//! 4           1               1.0                     1
//! 5           1               1.0                     1
//! 6           1               1.0                     1
//! 7           10000           1.0                     10000
//!
//!
//! The load sum L of the system is 1 * 7 + 1 * 10000 = 10007. This means that
//! tasks 0-6 have a load proportion of 1 / 10007 ~= 0.0001, and task 7 has a
//! load proportion of 0.9993. In other words, tasks 0-6 are entitled to
//! 0.0001 * 4 = 0.0004 CPUs worth of capacity, and task 7 is entitled to
//! 0.9993 * 4 = 3.9972 CPUs worth of capacity.
//!
//! Task 7 can only consume at most 1.0 CPU due to its duty cycle being 1.0, so
//! its weight is "infeasible" as the amount of CPU capacity it would be
//! afforded exceeds what it can actually use.
//!
//! What we instead want is to find an adjusted weight w' that we can assign to
//! Task 7 such that it gets its full duty cycle of 1.0, but allows the
//! remaining 3 cores worth of compute to be equally spread amongst the
//! remaining tasks. The task of finding this weight w' is known as the
//! "infeasible weights problem", and is solved by this crate.
//!
//! More details on load balancing and the infeasible weights problem are
//! provided in the following Google Drive document:
//!
//! <https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv>
//!
//! Using the Crate
//! ---------------
//!
//! This crate has two primary sets of APIs:
//!
//! (1) APIs for aggregating domain loads specified by the caller
//! (2) APIs for querying those loads, after being adjusted for infeasibility
//!
//! LoadAggregator
//! --------------
//!
//! The object that addresses (1) is the LoadAggregator. This object receives
//! load passed by the user, and once all load has been received, runs the
//! numbers to create load sums and weights that are adjusted for infeasibility
//! as described above. LoadAggregator objects can be created and used as
//! follows:
//!
//! Assume we're on a 16-core (32-CPU) host with two core complexes:
//!
//!```rust
//!     use scx_utils::LoadAggregator;
//!     use log::info;
//!
//!     // Create a LoadAggregator object, specifying the number of CPUs on the
//!     // system, and whether it should only aggregate duty cycle.
//!     let mut aggregator = LoadAggregator::new(32, false);
//!
//!     // Domain 0, weight 1, has duty cycle 1.0.
//!     aggregator.record_dom_load(0, 1, 1.0);
//!
//!     // Domain 1, weight 1, has duty cycle 1.0.
//!     aggregator.record_dom_load(1, 1, 1.0);
//!     // ...
//!     aggregator.record_dom_load(31, 1, 1.0);
//!     // ...
//!     aggregator.record_dom_load(63, 1, 1.0);
//!
//!     // Domain 64, weight 10000, has duty cycle 1.0.
//!     aggregator.record_dom_load(64, 10000, 1.0);
//!
//!     // Note that it is allowed to record load for a domain more than once.
//!     // For a given domain you may only record load for a specific weight
//!     // once, but you may record loads for multiple different weights for a
//!     // single domain.
//!
//!     // Create the LoadLedger object
//!     let ledger = aggregator.calculate();
//!
//!     // Outputs: 66.06451612903226
//!     info!("{}", ledger.global_load_sum());
//!```
//!
//! In the above example, we have 65 domains, all with duty cycle 1.0. 64 of
//! them have weight 1, and one of them has weight 10000. If there were multiple
//! tasks per domain, we would record different or additional values. For
//! example, if we had two tasks with weight 1 in domain 0, and an additional
//! task with weight 100 in domain 1, we would record their loads as follows:
//!
//!```rust
//!     use scx_utils::LoadAggregator;
//!     let mut aggregator = LoadAggregator::new(32, false);
//!
//!     // In this version, domain 0 has 2 tasks with weight 1.0 and duty cycle
//!     // 1.0.
//!     aggregator.record_dom_load(0, 1, 2.0);
//!
//!     // In this version, domain 1 also has a task with weight 100 and duty
//!     // cycle 1.0.
//!     aggregator.record_dom_load(1, 100, 1.0);
//!```
//!
//! Note that the abstractions here are meant to be generic across schedulers.
//! LoadAggregator assumes nothing about the scheduling domains being passed to
//! it. For example, they may span layers defined in a scheduler, domains
//! specified by the user via cpumask strings, or domains that span L3 caches.
//!
//! LoadLedger
//! ----------
//!
//! Once you have fed all load values to the LoadAggregator, you can use it to
//! calculate load sums (including adjusting for weight infeasibility), and
//! create a read-only LoadLedger object that can be used to query the values as
//! described in (2).
//!
//! There are various values of interest that can be queried from a LoadLedger
//! object. For example, you may ask for the sum of load (adjusted for
//! infeasibility) for the whole system, or the sum of duty cycle for the whole
//! system, or the sum of load for each domain (adjusted for infeasibility):
//!
//! ```rust
//!     use scx_utils::LoadAggregator;
//!     use log::info;
//!     let mut aggregator = LoadAggregator::new(32, false);
//!     aggregator.record_dom_load(0, 1, 1.0);
//!     // ...
//!     aggregator.record_dom_load(63, 1, 1.0);
//!     aggregator.record_dom_load(64, 10000, 1.0);
//!
//!     let ledger = aggregator.calculate();
//!
//!     info!("Global load sum: {}", ledger.global_load_sum());
//!     info!("Global duty cycle sum: {}", ledger.global_dcycle_sum());
//!
//!     let dom_dcycles = ledger.dom_dcycle_sums();
//!     let dom_loads = ledger.dom_load_sums();
//!     let effective_max_weight = ledger.effective_max_weight();
//!
//!     // ...
//! ```

use anyhow::bail;
use anyhow::Result;
use std::collections::BTreeMap;

const MIN_WEIGHT: usize = 1;

#[derive(Debug)]
pub struct LoadLedger {
    dom_load_sums: Vec<f64>,
    dom_dcycle_sums: Vec<f64>,
    global_dcycle_sum: f64,
    global_load_sum: f64,
    effective_max_weight: f64,
}

impl LoadLedger {
    /// Return the global, host-wide sum of duty cycles.
    pub fn global_dcycle_sum(&self) -> f64 {
        self.global_dcycle_sum
    }

    /// Return the global, host-wide load sum; adjusted for infeasibility.
    pub fn global_load_sum(&self) -> f64 {
        self.global_load_sum
    }

    /// Return an array of domain duty cycle sums, indexed by ID.
    pub fn dom_dcycle_sums(&self) -> &[f64] {
        &self.dom_dcycle_sums
    }

    /// Return an array of domain load sums, indexed by ID, and adjusted for
    /// infeasibility.
    pub fn dom_load_sums(&self) -> &[f64] {
        &self.dom_load_sums
    }

    /// If applicable, return the adjusted weight for all infeasible scheduling
    /// entities.
    pub fn effective_max_weight(&self) -> f64 {
        self.effective_max_weight
    }
}

#[derive(Debug)]
struct Domain {
    loads: BTreeMap<usize, f64>,
    dcycle_sum: f64,
    load_sum: f64,
}

fn approx_eq(a: f64, b: f64) -> bool {
    (a - b).abs() <= 0.0001f64
}

fn approx_ge(a: f64, b: f64) -> bool {
    a > b || approx_eq(a, b)
}

#[derive(Debug)]
pub struct LoadAggregator {
    doms: BTreeMap<usize, Domain>,
    global_loads: BTreeMap<usize, f64>,
    nr_cpus: usize,
    max_weight: usize,
    global_dcycle_sum: f64,
    global_load_sum: f64,
    effective_max_weight: f64,
    dcycle_only: bool,
}

impl LoadAggregator {
    /// Create a LoadAggregator object that can be given domains and loads by
    /// the caller, and then used to create a LoadLedger object.
    pub fn new(nr_cpus: usize, dcycle_only: bool) -> LoadAggregator {
        LoadAggregator {
            doms: BTreeMap::new(),
            global_loads: BTreeMap::new(),
            nr_cpus,
            max_weight: 0,
            global_dcycle_sum: 0.0f64,
            global_load_sum: 0.0f64,
            effective_max_weight: 10000.0f64,
            dcycle_only,
        }
    }

    /// Given a LoadAggregator with recorded domain loads, compute the
    /// system-wide load, adjusting for infeasible weights when necessary.
    pub fn calculate(&mut self) -> LoadLedger {
        if !self.dcycle_only && approx_ge(self.max_weight as f64, self.infeasible_threshold()) {
            self.adjust_infeas_weights();
        }

        let mut dom_load_sums = Vec::new();
        let mut dom_dcycle_sums = Vec::new();

        for (_, dom) in self.doms.iter() {
            dom_load_sums.push(dom.load_sum);
            dom_dcycle_sums.push(dom.dcycle_sum);
        }

        LoadLedger {
            dom_load_sums,
            dom_dcycle_sums,
            global_dcycle_sum: self.global_dcycle_sum,
            global_load_sum: self.global_load_sum,
            effective_max_weight: self.effective_max_weight,
        }
    }

    /// Init a domain and set default load values.
    /// Does nothing if the domain already exists.
    pub fn init_domain(&mut self, dom_id: usize) {
        self.doms.entry(dom_id).or_insert(Domain {
            loads: BTreeMap::new(),
            dcycle_sum: 0.0f64,
            load_sum: 0.0f64,
        });
    }

    /// Record an instance of some domain's load (by specifying its weight and
    /// dcycle). Returns an error if duty cycle is specified more than once
    /// for a given (Domain, weight) tuple.
    pub fn record_dom_load(&mut self, dom_id: usize, weight: usize, dcycle: f64) -> Result<()> {
        if weight < MIN_WEIGHT {
            bail!(
                "weight {} is less than minimum weight {}",
                weight,
                MIN_WEIGHT
            );
        }

        let domain = self.doms.entry(dom_id).or_insert(Domain {
            loads: BTreeMap::new(),
            dcycle_sum: 0.0f64,
            load_sum: 0.0f64,
        });

        if domain.loads.insert(weight, dcycle).is_some() {
            bail!("Domain {} already had load for weight {}", dom_id, weight);
        }

        let weight_dcycle = self.global_loads.entry(weight).or_insert(0.0f64);
        *weight_dcycle += dcycle;

        let load = weight as f64 * dcycle;

        domain.dcycle_sum += dcycle;
        domain.load_sum += load;

        self.global_dcycle_sum += dcycle;
        self.global_load_sum += load;

        if weight > self.max_weight {
            self.max_weight = weight;
        }

        Ok(())
    }

    fn infeasible_threshold(&self) -> f64 {
        // If the sum of duty cycle on the system is >= P, any weight w_x of a
        // task that exceeds L / P is guaranteed to be infeasible. Furthermore,
        // if any weight w_x == L / P then we know that task t_x can get its
        // full duty cycle, as:
        //
        // c_x = P * (w_x * d_x) / L
        //     = P * (L/P * d_x) / L
        //     = d_x / L / L
        //     = d_x
        //
        // If there is no scheduling entity whose weight exceeds L / P that has
        // a nonzero duty cycle, then all weights are feasible and we can use
        // the data we collected above without having to adjust for
        // infeasibility. Otherwise, we have at least one infeasible weight.
        //
        // See the comment in adjust_infeas_weights() for a more comprehensive
        // description of the algorithm for adjusting for infeasible weights.
        self.global_load_sum / self.nr_cpus as f64
    }

    fn apply_infeasible_threshold(&mut self, lambda_x: f64) {
        self.effective_max_weight = lambda_x;
        self.global_load_sum = 0.0f64;

        for dom in self.doms.values_mut() {
            dom.load_sum = dom
                .loads
                .iter()
                .map(|(weight, dcycle)| (*weight as f64).min(lambda_x) * dcycle)
                .sum();

            self.global_load_sum += dom.load_sum;
        }
    }

    fn adjust_infeas_weights(&mut self) {
        // At this point we have the following data points:
        //
        // P : The number of cores on the system
        // L : The total load sum of the system before any adjustments for
        //     infeasibility
        // Lf: The load sum of all feasible scheduling entities
        // D : The total sum of duty cycles across all domains in the system
        // Di: The duty cycle sum of all infeasible tasks
        //
        // We need to find a weight lambda_x such that every infeasible
        // scheduling entity in the system will be granted a CPU allocation
        // equal to their duty cycle, and all the remaining compute capacity in
        // the system will be divided fairly amongst the feasible tasks
        // according to their load. Our goal is to find a value lambda_x such
        // that every infeasible entity is allocated its duty cycle, and the
        // remaining compute capacity is shared fairly amongst the feasible
        // entities on the system.
        //
        // If L' is the load sum on the system after clamping all weights
        // w_x > lambda_x to lambda_x, then lambda_x can be defined as follows:
        //
        // lambda_x = L' / P
        //
        // => L'                  = lambda_x * Di + Lf
        // => lambda_x * P'       = lambda_x * Di + Lf
        // => lambda_x (P' - D_I) = Lf
        // => lambda_x            = Lf / (P' - Di)
        //
        // Thus, need to iterate over different values of x until we find a
        // lambda_x such that:
        //
        //      w_x >= lambda_x >= w_x+1
        //
        // Once we find a lambda_x, we need to:
        //
        // 1. Adjust the maximum weights of any w_x > lambda_x -> lambda_x
        // 2. Subtract (w_i - lambda_x) from the load sums that the infeasible
        //    entities were contributing to.
        // 3. Re-calculate the per-domain load, and the global load average.
        //
        // Note that we should always find a lambda_x at this point, as we
        // verified in the caller that there is at least one infeasible entity
        // in the system.
        //
        // All of this is described and proven in detail in the following pdf:
        //
        // <https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv>
        let p = self.nr_cpus as f64;
        let mut curr_dcycle_sum = 0.0f64;
        let mut curr_load_sum = self.global_load_sum;
        let mut lambda_x = curr_load_sum / p;

        for (weight, dcycles) in self.global_loads.iter().rev() {
            if approx_ge(lambda_x, *weight as f64) {
                self.apply_infeasible_threshold(lambda_x);
                return;
            }

            curr_dcycle_sum += dcycles;
            curr_load_sum -= *weight as f64 * dcycles;
            lambda_x = curr_load_sum / (p - curr_dcycle_sum);
        }

        // We can fail to find an infeasible weight if the host is
        // under-utilized. In this case, just fall back to using weights. If
        // this is happening due to a stale system-wide util value due to the
        // tuner not having run recently enough, it is a condition that should
        // self-correct soon. If it is the result of the user configuring us to
        // use weights even when the system is under-utilized, they were warned
        // when the scheduler was launched.
    }
}