/*!
A lightweight micro-benchmarking library which:

* uses linear regression to screen off constant error;
* handles benchmarks which mutate state;
* is very easy to use!

Easybench is designed for benchmarks with a running time in the range `1 ns <
x < 1 ms` - results may be unreliable if benchmarks are very quick or very
slow. It's inspired by [criterion], but doesn't do as much sophisticated
analysis (no outlier detection, no HTML output).

[criterion]: https://hackage.haskell.org/package/criterion

```
use easybench::{bench,bench_env};

# fn fib(_: usize) -> usize { 0 }
#
// Simple benchmarks are performed with `bench`.
println!("fib 200: {}", bench(|| fib(200) ));
println!("fib 500: {}", bench(|| fib(500) ));

// If a function needs to mutate some state, use `bench_env`.
println!("reverse: {}", bench_env(vec![0;100], |xs| xs.reverse() ));
println!("sort:    {}", bench_env(vec![0;100], |xs| xs.sort()    ));
```

Running the above yields the following results:

```none
fib 200:         38 ns   (R²=1.000, 26053497 iterations in 154 samples)
fib 500:        110 ns   (R²=1.000, 9131584 iterations in 143 samples)
reverse:         54 ns   (R²=0.999, 5669992 iterations in 138 samples)
sort:            93 ns   (R²=1.000, 4685942 iterations in 136 samples)
```

Easy! However, please read the [caveats](#caveats) below before using.

# Benchmarking algorithm

An *iteration* is a single execution of your code. A *sample* is a measurement,
during which your code may be run many times. In other words: taking a sample
means performing some number of iterations and measuring the total time.

The first sample we take performs only 1 iteration, but as we continue taking
samples we increase the number of iterations exponentially. We stop when a
global time limit is reached (currently 1 second).
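
For example, with the scale factor of 1.1 used at the time of writing, the
per-sample iteration counts grow like this (a sketch, not part of the public
API):

```
// Sketch of the sampling schedule, assuming a scale factor of 1.1.
let iters_for_sample = |s: i32| 1.1f64.powi(s).round() as usize;
assert_eq!(iters_for_sample(0), 1);    // the first sample runs the code once
assert_eq!(iters_for_sample(50), 117); // by sample 50, ~117 iterations
```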

If a benchmark mutates some state, then before taking a sample we prepare
`n` fresh copies of the initial state, one per iteration, where `n` is the
number of iterations in that sample.

Once we have the data, we perform OLS linear regression to find out how
the sample time varies with the number of iterations in the sample. The
gradient of the regression line tells us how long it takes to perform a
single iteration of the benchmark. The R² value is a measure of how much
noise there is in the data.

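The fitted values are exposed as plain fields on the returned `Stats` value,
so you can inspect them programmatically as well as printing them:

```
# use easybench::bench;
let stats = bench(|| 42u64.pow(2));
// The gradient of the fitted line, i.e. nanoseconds per iteration:
let _ns = stats.ns_per_iter;
// R² for the fit; values close to 1.0 indicate low noise:
let _r2 = stats.goodness_of_fit;
```
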
# Caveats

## Caveat 1: Harness overhead

**TL;DR: Compile with `--release`; the overhead is likely to be within the
noise of your benchmark.**

Any work which easybench does once-per-sample is ignored (this is the purpose
of the linear regression technique described above). However, work which is
done once-per-iteration *will* be counted in the final times.

* In the case of [`bench`] this amounts to incrementing the loop counter.
* In the case of [`bench_env`], we also do a lookup into a big vector in
  order to get the environment for that iteration.
* If you compile your program unoptimised, there may be additional overhead.

[`bench`]: fn.bench.html
[`bench_env`]: fn.bench_env.html

The cost of the above operations depends on the details of your benchmark,
namely: (1) how large the return value is, and (2) whether the benchmark
evicts the environment vector from the CPU cache. In practice, these
conditions tend to hold only for longer-running benchmarks, which makes the
effects hard to measure.

If you have concerns about the results you're seeing, please take a look at
[the inner loop of `bench_env`][source]. The whole library `cloc`s in at
under 100 lines of code, so it's pretty easy to read.

[source]: ../src/easybench/lib.rs.html#229-237

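If you want a feel for the harness overhead on your machine, one quick check
(mirroring this crate's own `noop` test) is to benchmark an empty closure,
with and without an environment:

```
# use easybench::{bench,bench_env};
// An empty closure: the measured time is essentially the harness overhead.
println!("noop:     {}", bench(|| {}));
// With an environment, the per-iteration vector lookup is included too.
println!("noop env: {}", bench_env(vec![0u64; 64], |_| {}));
```
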
## Caveat 2: Sufficient data

**TL;DR: Measurements are unreliable when code takes too long (> 1 ms) to run.**

Each benchmark collects data for 1 second. This means that in order to
collect a statistically significant amount of data, your code should run
much faster than this.

When inspecting the results, make sure things look statistically
significant. In particular:

* Make sure the number of samples is big enough. More than 100 is probably OK.
* Make sure the R² isn't suspiciously low. It's easy to achieve a high R²
  value when the number of samples is small, so unfortunately the definition
  of "suspiciously low" depends on how many samples were taken. As a rule
  of thumb, expect values greater than 0.99.

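Both checks are easy to automate using the `samples` and `goodness_of_fit`
fields of the returned stats; the thresholds below are just the rules of
thumb above:

```
# use easybench::bench;
let stats = bench(|| (0..1000u64).sum::<u64>());
// Treat the result as suspect if there are few samples or a low R².
let looks_ok = stats.samples > 100 && stats.goodness_of_fit > 0.99;
# let _ = looks_ok;
```
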
## Caveat 3: Pure functions

**TL;DR: Return enough information to prevent the optimiser from eliminating
code from your benchmark.**

Benchmarking pure functions involves a nasty gotcha which users should be
aware of. Consider the following benchmarks:

```
# use easybench::{bench,bench_env};
#
# fn fib(_: usize) -> usize { 0 }
#
let fib_1 = bench(|| fib(500) );                     // fine
let fib_2 = bench(|| { fib(500); } );                // spoiler: NOT fine
let fib_3 = bench_env(0, |x| { *x = fib(500); } );   // also fine, but ugly
# let _ = (fib_1, fib_2, fib_3);
```

The results are a little surprising:

```none
fib_1:        110 ns   (R²=1.000, 9131585 iterations in 144 samples)
fib_2:          0 ns   (R²=1.000, 413289203 iterations in 184 samples)
fib_3:        109 ns   (R²=1.000, 9131585 iterations in 144 samples)
```

Oh, `fib_2`, why do you lie? The answer is: `fib(500)` is pure, and its
return value is immediately thrown away, so the optimiser replaces the call
with a no-op (which clocks in at 0 ns).

What about the other two? `fib_1` looks very similar, with one exception:
the closure which we're benchmarking returns the result of the `fib(500)`
call. When it runs your code, easybench takes the return value and tricks the
optimiser into thinking it's going to use it for something, before throwing
it away. This is why `fib_1` is safe from having code accidentally eliminated.

In the case of `fib_3`, we actually *do* use the return value: each
iteration we take the result of `fib(500)` and store it in the iteration's
environment. This has the desired effect, but looks a bit weird.
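
If returning the value is awkward, another option is to keep the call alive
yourself with [`black_box`] - the same std hint easybench uses internally:

```
# use easybench::bench;
# fn fib(_: usize) -> usize { 0 }
let fib_4 = bench(|| { std::hint::black_box(fib(500)); } );  // also fine
# let _ = fib_4;
```

[`black_box`]: https://doc.rust-lang.org/std/hint/fn.black_box.html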
*/

use std::f64;
use std::fmt::{self, Display, Formatter};
use std::time::*;

// Each time we take a sample we increase the number of iterations:
//      iters = ITER_SCALE_FACTOR ^ sample_no
const ITER_SCALE_FACTOR: f64 = 1.1;

// We try to spend this many seconds (roughly) in total on each benchmark.
const BENCH_TIME_LIMIT: Duration = Duration::from_secs(1);

/// Statistics for a benchmark run.
#[derive(Debug, PartialEq, Clone)]
pub struct Stats {
    /// The time, in nanoseconds, per iteration. If the benchmark generated
    /// fewer than 2 samples in the allotted time then this will be NaN.
    pub ns_per_iter: f64,
    /// The coefficient of determination, R².
    ///
    /// This is an indication of how noisy the benchmark was, where 1 is
    /// good and 0 is bad. Be suspicious of values below 0.9.
    pub goodness_of_fit: f64,
    /// How many times the benchmarked code was actually run.
    pub iterations: usize,
    /// How many samples were taken (ie. how many times we allocated the
    /// environment and measured the time).
    pub samples: usize,
}

impl Display for Stats {
    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
        if self.ns_per_iter.is_nan() {
            write!(
                f,
                "Only generated {} sample(s) - we can't fit a regression line to that! \
                 Try making your benchmark faster.",
                self.samples
            )
        } else {
            let per_iter = Duration::from_nanos(self.ns_per_iter as u64);
            let per_iter = format!("{:?}", per_iter);
            write!(
                f,
                "{:>11} (R²={:.3}, {} iterations in {} samples)",
                per_iter, self.goodness_of_fit, self.iterations, self.samples
            )
        }
    }
}

/// Run a benchmark.
///
/// The return value of `f` is not used, but we trick the optimiser into
/// thinking we're going to use it. Make sure to return enough information
/// to prevent the optimiser from eliminating code from your benchmark! (See
/// the module docs for more.)
pub fn bench<F, O>(mut f: F) -> Stats
where
    F: FnMut() -> O,
{
    bench_env((), |_| f())
}

/// Run a benchmark with an environment.
///
/// The value `env` is a cloneable prototype for the "benchmark
/// environment". Each iteration receives a freshly-cloned mutable copy of
/// this environment. The time taken to clone the environment is not included
/// in the results.
///
/// NB: it's very possible that we will end up allocating many (>10,000)
/// copies of `env` at the same time. Probably best to keep it small.
///
/// See `bench` and the module docs for more.
///
/// ## Overhead
///
/// Every iteration, `bench_env` performs a lookup into a big vector in
/// order to get the environment for that iteration. If your benchmark
/// is memory-intensive then this could, in the worst case, amount to a
/// systematic cache-miss (ie. this vector would have to be fetched from
/// DRAM at the start of every iteration). In this case the results could
/// be off by on the order of a hundred nanoseconds. This is a worst-case
/// scenario, however, and I haven't actually been able to trigger it in
/// practice... but it's good to be aware of the possibility.
pub fn bench_env<F, I, O>(env: I, f: F) -> Stats
where
    F: FnMut(&mut I) -> O,
    I: Clone,
{
    bench_gen_env(move || env.clone(), f)
}

/// Run a benchmark with a generated environment.
///
/// The function `gen_env` creates the "benchmark environment" for the
/// computation. Each iteration receives a freshly-created environment. The
/// time taken to create the environment is not included in the results.
///
/// NB: it's very possible that we will end up generating many (>10,000)
/// environments at the same time. Probably best to keep them small.
///
/// See `bench` and the module docs for more.
///
/// ## Overhead
///
/// Every iteration, `bench_gen_env` performs a lookup into a big vector
/// in order to get the environment for that iteration. If your benchmark
/// is memory-intensive then this could, in the worst case, amount to a
/// systematic cache-miss (ie. this vector would have to be fetched from
/// DRAM at the start of every iteration). In this case the results could
/// be off by on the order of a hundred nanoseconds. This is a worst-case
/// scenario, however, and I haven't actually been able to trigger it in
/// practice... but it's good to be aware of the possibility.
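///
/// ## Example
///
/// A minimal sketch: benchmark `Vec::pop` against a freshly generated
/// vector each iteration.
///
/// ```
/// use easybench::bench_gen_env;
///
/// let stats = bench_gen_env(|| vec![0u8; 1024], |v| v.pop());
/// println!("pop: {}", stats);
/// ```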
pub fn bench_gen_env<G, F, I, O>(mut gen_env: G, mut f: F) -> Stats
where
    G: FnMut() -> I,
    F: FnMut(&mut I) -> O,
{
    let mut data = Vec::new();
    // The time we started the benchmark (not used in results)
    let bench_start = Instant::now();

    // Collect data until BENCH_TIME_LIMIT is reached.
    while bench_start.elapsed() < BENCH_TIME_LIMIT {
        let iters = ITER_SCALE_FACTOR.powi(data.len() as i32).round() as usize;
        // Prepare the environments - one per iteration
        let mut xs = std::iter::repeat_with(&mut gen_env)
            .take(iters)
            .collect::<Vec<I>>();
        // Start the clock
        let iter_start = Instant::now();
        // We iterate over `&mut xs` rather than draining it, because we
        // don't want to drop the env values until after the clock has stopped.
        for x in &mut xs {
            // Run the code and pretend to use the output
            std::hint::black_box(f(x));
        }
        let time = iter_start.elapsed();
        data.push((iters, time));
    }

    // If the first iteration of a sample is consistently slow, that's fine:
    // that's exactly the constant error the linear regression screens off.
    // If the first *sample* is slower than the rest, however, that's not
    // fine. Therefore, we discard the first sample as a cache-warming
    // exercise.
    data.remove(0);

    // Compute some stats
    let (grad, r2) = regression(&data[..]);
    Stats {
        ns_per_iter: grad,
        goodness_of_fit: r2,
        iterations: data.iter().map(|&(x, _)| x).sum(),
        samples: data.len(),
    }
}

/// Compute the OLS linear regression line for the given data set, returning
/// the line's gradient and R². Requires at least 2 samples.
//
// Overflows:
//
// * sum(x * x): num_samples <= 0.5 * log_k (1 + 2 ^ 64 (FACTOR - 1))
fn regression(data: &[(usize, Duration)]) -> (f64, f64) {
    if data.len() < 2 {
        return (f64::NAN, f64::NAN);
    }
    // Do all the arithmetic in f64: if the benchmark is very fast we run a
    // huge number of iterations, and the squared sums below can overflow
    // integer arithmetic.
    let data: Vec<_> = data
        .iter()
        .map(|&(x, y)| (x as f64, y.as_nanos() as f64))
        .collect();
    let n = data.len() as f64;
    let nxbar = data.iter().map(|&(x, _)| x).sum::<f64>(); // iter_time > 5e-11 ns
    let nybar = data.iter().map(|&(_, y)| y).sum::<f64>(); // TIME_LIMIT < 2 ^ 64 ns
    let nxxbar = data.iter().map(|&(x, _)| x * x).sum::<f64>(); // num_iters < 13_000_000_000
    let nyybar = data.iter().map(|&(_, y)| y * y).sum::<f64>(); // TIME_LIMIT < 4.3 e9 ns
    let nxybar = data.iter().map(|&(x, y)| x * y).sum::<f64>();
    let ncovar = nxybar - ((nxbar * nybar) / n);
    let nxvar = nxxbar - ((nxbar * nxbar) / n);
    let nyvar = nyybar - ((nybar * nybar) / n);
    let gradient = ncovar / nxvar;
    let r2 = (ncovar * ncovar) / (nxvar * nyvar);
    assert!(r2.is_nan() || r2 <= 1.0);
    (gradient, r2)
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::thread;
    use std::time::Duration;

    // Iterative Fibonacci with wrapping arithmetic - a simple workload for
    // the benchmarks below.
    fn fib(n: usize) -> usize {
        let mut i = 0;
        let mut sum = 0;
        let mut last = 0;
        let mut curr = 1usize;
        // `i + 1 < n` rather than `i < n - 1`, which would underflow for n = 0.
        while i + 1 < n {
            sum = curr.wrapping_add(last);
            last = curr;
            curr = sum;
            i += 1;
        }
        sum
    }

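    // Sanity check for `regression` on synthetic, noiseless data: points on
    // the line `y = 100 ns * x` should give a gradient of 100 and R² = 1.
    #[test]
    fn regression_perfect_line() {
        let data: Vec<(usize, Duration)> = (1usize..=10)
            .map(|x| (x, Duration::from_nanos(100 * x as u64)))
            .collect();
        let (grad, r2) = regression(&data);
        assert!((grad - 100.0).abs() < 1e-9);
        assert!((r2 - 1.0).abs() < 1e-9);
    }
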
    // This is only here because doctests don't work with `--nocapture`.
    #[test]
    #[ignore]
    fn doctests_again() {
        println!();
        println!("fib 200: {}", bench(|| fib(200)));
        println!("fib 500: {}", bench(|| fib(500)));
        println!("reverse: {}", bench_env(vec![0; 100], |xs| xs.reverse()));
        println!("sort:    {}", bench_env(vec![0; 100], |xs| xs.sort()));

        // This is fine:
        println!("fib 1:   {}", bench(|| fib(500)));
        // This is NOT fine:
        println!(
            "fib 2:   {}",
            bench(|| {
                fib(500);
            })
        );
        // This is also fine, but a bit weird:
        println!(
            "fib 3:   {}",
            bench_env(0, |x| {
                *x = fib(500);
            })
        );
    }

    #[test]
    fn very_quick() {
        println!();
        println!("very quick: {}", bench(|| {}));
    }

    #[test]
    fn very_slow() {
        println!();
        println!(
            "very slow: {}",
            bench(|| thread::sleep(Duration::from_millis(400)))
        );
    }

    #[test]
    fn test_sleep() {
        println!();
        println!(
            "sleep 1 ms: {}",
            bench(|| thread::sleep(Duration::from_millis(1)))
        );
    }

    #[test]
    fn noop() {
        println!();
        println!("noop base: {}", bench(|| {}));
        println!("noop 0:    {}", bench_env(vec![0u64; 0], |_| {}));
        println!("noop 16:   {}", bench_env(vec![0u64; 16], |_| {}));
        println!("noop 64:   {}", bench_env(vec![0u64; 64], |_| {}));
        println!("noop 256:  {}", bench_env(vec![0u64; 256], |_| {}));
        println!("noop 512:  {}", bench_env(vec![0u64; 512], |_| {}));
    }

    #[test]
    fn ret_value() {
        println!();
        // The "no ret" variants discard the clone inside the closure (note
        // the semicolons); the "return" variants hand it back to the harness.
        println!(
            "no ret 32:    {}",
            bench_env(vec![0u64; 32], |x| { x.clone(); })
        );
        println!("return 32:    {}", bench_env(vec![0u64; 32], |x| x.clone()));
        println!(
            "no ret 256:   {}",
            bench_env(vec![0u64; 256], |x| { x.clone(); })
        );
        println!(
            "return 256:   {}",
            bench_env(vec![0u64; 256], |x| x.clone())
        );
        println!(
            "no ret 1024:  {}",
            bench_env(vec![0u64; 1024], |x| { x.clone(); })
        );
        println!(
            "return 1024:  {}",
            bench_env(vec![0u64; 1024], |x| x.clone())
        );
        println!(
            "no ret 4096:  {}",
            bench_env(vec![0u64; 4096], |x| { x.clone(); })
        );
        println!(
            "return 4096:  {}",
            bench_env(vec![0u64; 4096], |x| x.clone())
        );
        println!(
            "no ret 50000: {}",
            bench_env(vec![0u64; 50000], |x| { x.clone(); })
        );
        println!(
            "return 50000: {}",
            bench_env(vec![0u64; 50000], |x| x.clone())
        );
    }
}