This library supports reliable latency comparison between two functions/closures. This is trickier than it may seem, due to time-dependent random noise and ordering effects. This library provides commonly used latency metrics for the target functions (mean, standard deviation, median, percentiles, min, max), but it differentiates itself by providing support for the statistically rigorous comparison of latencies between two functions.
One could simply try to run tools like Criterion or Divan on each function and compare the results. However, these challenges arise:
- Time-dependent random noise – Random noise can and often does change over time. This can significantly distort the comparison. (This is also a contributor to the ordering effect – see below). Increasing the sample size (number of function executions) can narrow the variance of a function’s median or mean latency as measured at a point in time, but it is not effective by itself in mitigating the time-dependent variability.
- Ordering effect – When running two benchmarks, one after the other, it is not uncommon to observe that the first one may get an edge over the second one. This can also significantly distort the comparison.
Due to these effects, at the microseconds latency magnitude, a function that is known by construction to be 5% faster than another function can often show a mean latency and/or median latency that is higher than the corresponding metric for the slower function. This effect is less pronounced for latencies at the milliseconds magnitude or higher, but it can still distort results. In general, the smaller the difference between the latencies of the two functions, the harder it is to distinguish the faster function from the slower one with the usual benchmarking approaches.
One could tackle these challenges by running the individual benchmarks, one after the other, repeating that multiple times, and using appropriate statistical methods to compare the two observed latency distributions. This is, however, quite time consuming and requires careful planning and analysis.
The present library has been validated by extensive benchmark testing and provides a convenient, efficient, and statistically sound alternative to the above more cumbersome methodology.
For the mathematically inclined reader, a later section presents a simple model to reason about time-dependent random noise.
§Quick Start
To run a benchmark with `bench_diff`, follow these simple steps:

1. Create a bench file – see Simple Bench Example for a representative example.

2. Add the `bench_diff` dependency in `Cargo.toml`:

   ```toml
   [dev-dependencies]
   bench_diff = "1"
   ```

3. Create in `Cargo.toml` a `[[bench]]` section corresponding to the bench file. For example, given a file `benches/simple_bench.rs`, add the following to `Cargo.toml`:

   ```toml
   [[bench]]
   name = "simple_bench"
   harness = false
   ```

4. Execute the benchmark the usual way:

   ```sh
   cargo bench --bench simple_bench
   ```
§Simple Bench Example
This example includes two comparisons of latencies of two functions and prints statistics for each comparison and each function.

```rust
//! Simple example benchmark.
//!
//! To run the bench (assuming the source is at `benches/simple_bench.rs`):
//! ```
//! cargo bench --bench simple_bench
//! ```

use bench_diff::{DiffOut, LatencyUnit, bench_diff_with_status, stats_types::AltHyp};
use std::{thread, time::Duration};

/// This function's latency is at least 21ms.
fn f1() {
    thread::sleep(Duration::from_millis(21));
}

/// This function's latency is at least 20ms.
fn f2() {
    thread::sleep(Duration::from_millis(20));
}

// The latency ratio between these functions should be approximately 1.05.
// Because the magnitude of latencies involved is milliseconds, it is
// a good idea to use `LatencyUnit::Micro` below. As a rule of thumb, always use
// the closest finer-grained latency unit.

fn main() {
    // Output values are in the selected latency unit, i.e., microseconds.
    println!("*** 1st benchmark ***");
    {
        let out: DiffOut = bench_diff_with_status(LatencyUnit::Micro, f1, f2, 100, |_, _| {
            println!("Comparing latency of f1 vs. f2.");
            println!();
        });
        print_diff_out(&out);
    }

    println!("*** 2nd benchmark ***");
    {
        let out: DiffOut = bench_diff_with_status(LatencyUnit::Micro, f1, f1, 100, |_, _| {
            println!("Comparing latency of f1 vs. f1.");
            println!();
        });
        print_diff_out(&out);
    }
}

fn print_diff_out(out: &DiffOut) {
    const ALPHA: f64 = 0.05;

    println!();
    println!("ratio_medians_f1_f2={}", out.ratio_medians_f1_f2());
    println!("student_ratio_ci={:?}", out.student_ratio_ci(ALPHA));
    println!(
        "student_diff_ln_test_gt:{:?}",
        out.student_diff_ln_test(AltHyp::Gt, ALPHA)
    );
    println!();
    println!("summary_f1={:?}", out.summary_f1());
    println!();
    println!("summary_f2={:?}", out.summary_f2());
    println!();
}
```
§Implementation Approach
This library addresses the ordering effect and random noise challenges as follows. Given two functions `f1` and `f2`:

- It repeatedly executes duos of pairs (`f1`, `f2`), (`f2`, `f1`). This ensures the following:
  - Obviously, both functions are executed the same number of times.
  - Each function is executed as many times preceded by itself as preceded by the other function. This removes the ordering effect.
  - Because the function executions are paired, those executions are in close time proximity to each other, so the time-dependent variation is effectively neutralized in the comparison of latencies (even though it may persist in the latency distributions of the individual functions).
  - Latencies can be compared pairwise, as well as overall for both functions. This enables the analysis of data with statistical methods targeting either two independent samples or a single paired sample.
  - The number of function executions can be specified according to the desired levels of confidence and fine-grained discrimination. More on this in the Statistical Details section.
- It warms up by executing the above pattern for some time (3 seconds by default) before it starts tallying the results. This is similar to the Criterion warm-up strategy.
An important tool used for data collection in this library is the hdrhistogram crate, which provides an efficient histogram implementation with high fidelity for wide ranges of positive values. Additionally, accumulators are used to support some inferential statistics.
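The duo-of-pairs execution pattern described above can be sketched as follows. This is a minimal standalone illustration, not the library's actual implementation (which uses warm-up, hdrhistogram recording, and accumulators); `time_once` and `duo_pairs` are hypothetical helper names.

```rust
use std::time::Instant;

/// Times a single call of `f`, returning the elapsed time in nanoseconds.
fn time_once(f: &mut impl FnMut()) -> u128 {
    let start = Instant::now();
    f();
    start.elapsed().as_nanos()
}

/// Minimal sketch of the duo-of-pairs pattern: each iteration runs (f1, f2)
/// then (f2, f1), so both functions execute equally often and each function is
/// preceded by itself as often as by the other, cancelling the ordering
/// effect. Returns the paired latency observations for (f1, f2).
fn duo_pairs(
    mut f1: impl FnMut(),
    mut f2: impl FnMut(),
    exec_count: usize,
) -> Vec<(u128, u128)> {
    let mut pairs = Vec::with_capacity(exec_count);
    for _ in 0..exec_count / 2 {
        // First pair: f1 then f2.
        let l1a = time_once(&mut f1);
        let l2a = time_once(&mut f2);
        // Second pair: f2 then f1.
        let l2b = time_once(&mut f2);
        let l1b = time_once(&mut f1);
        pairs.push((l1a, l2a));
        pairs.push((l1b, l2b));
    }
    pairs
}

fn main() {
    let pairs = duo_pairs(|| (), || (), 100);
    assert_eq!(pairs.len(), 100);
    println!("collected {} latency pairs", pairs.len());
}
```

Because paired executions happen back to back, each pair samples the noise process at nearly the same time, which is what neutralizes time-dependent noise in the comparison.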
§Statistics
Standard summary statistics (mean, standard deviation, median, percentiles, min, max) and a variety of inferential statistics (e.g., hypothesis tests, confidence intervals, t values, p values) are implemented for the `DiffOut` struct, which is output from the library's core benchmarking functions.
Much of the statistical inference supported by this library is based on the assumption that latency distributions tend to have an approximately lognormal distribution. This is a widely made assumption, which is supported by theory as well as empirical data. While the lognormal assumption underlies much of the statistical inference capability in this library, benchmarks using functions with controlled latencies have been conducted (see Testing below) to validate the soundness of the statistics.
t-tests

The `DiffOut` struct has two sets of methods associated with t-tests:

- The methods whose names include the string `welch` correspond to statistics associated with the Welch two-sample t-test for the difference between the means of the natural logarithms of the latencies of `f1` and `f2`.
- The methods whose names include the string `student` correspond to statistics associated with the Student one-sample t-test for equality to 0 of the mean of the differences of the natural logarithms of the paired latencies of `f1` and `f2`.

In benchmark tests (with typical large sample sizes) the Welch and Student variants produced roughly the same results in terms of Type I and Type II error rates, though the Student confidence intervals tended to be a little tighter.
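The Student one-sample variant on log differences can be sketched as follows. This is an illustrative computation of the t statistic only, not the library's API; `student_t_diff_ln` is a hypothetical helper name.

```rust
/// Computes the Student one-sample t statistic for testing whether the mean of
/// ln(latency_f1) - ln(latency_f2), over paired observations, differs from 0.
/// The resulting t is compared against Student's t distribution with n-1
/// degrees of freedom.
fn student_t_diff_ln(pairs: &[(f64, f64)]) -> f64 {
    let n = pairs.len() as f64;
    // Differences of the natural logs of the paired latencies.
    let diffs: Vec<f64> = pairs.iter().map(|&(l1, l2)| l1.ln() - l2.ln()).collect();
    let mean = diffs.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction.
    let var = diffs.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    // t = mean / (s / sqrt(n)).
    mean / (var.sqrt() / n.sqrt())
}

fn main() {
    // f1 consistently ~5% slower than f2: expect a strongly positive t.
    let pairs: Vec<(f64, f64)> = (0..100)
        .map(|i| {
            let base = 100.0 + (i % 7) as f64;
            // Small deterministic wobble around the 1.05 latency ratio.
            let ratio = 1.05 + 0.001 * ((i % 5) as f64 - 2.0);
            (base * ratio, base)
        })
        .collect();
    let t = student_t_diff_ln(&pairs);
    assert!(t > 10.0, "expected a strongly positive t, got {t}");
    println!("t = {t:.1}");
}
```

Working on logarithms of latencies is what makes the lognormal assumption usable: the log differences are approximately normal, so the t machinery applies.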
§Testing
This framework has been extensively tested using both standard Rust tests and test benchmarks.
Tests of inferential statistics – The implementations of inferential statistics (hypothesis tests, confidence intervals, t values, p values, etc.) have been tested independently on their own, using Rust’s standard testing approach. Sources for comparison included published data sets with their corresponding statistics, as well as calculations using R.
Test benchmarks – Specific benchmarking scenarios were defined to test the library on its own and in comparison with “traditional” benchmarking.
- Comparative test benchmarks – Using deterministic functions with specified latency medians, we compared results from `bench_diff` against results from the traditional separate benchmarking of the functions.
  - Deterministic functions `f1` and `f2` were defined such that the relative difference between the latency of `f1` and the latency of `f2` was known by construction.
  - Separate comparative tests were conducted with relative differences of 1%, 2%, 5%, and 10%.
  - Two base median latencies were used: 100 microseconds and 20 milliseconds.
  - A sample size (`exec_count`) of 2,000 was used with the base median of 100 microseconds and a sample size of 200 was used with the base median of 20 milliseconds.
  - For each comparative test:
    - A traditional benchmarking suite was conducted by benchmarking `f1` with `exec_count` executions, then benchmarking `f2` with `exec_count` executions, then `f1` again, then `f2` again, and so on, until both `f1` and `f2` had been benchmarked 100 times.
    - A `bench_diff` benchmarking suite was conducted by running `bench_diff`, with `f1`, `f2`, and `exec_count` as arguments, 100 times.
- Standalone `bench_diff` test benchmarks – These used both deterministic functions with known latency medians and non-deterministic functions with specified latency medians and variances. Each benchmark was run repeatedly (100 times) to empirically verify the frequency of Type I and Type II statistical hypothesis testing errors.
  - Benchmarks included functions with relative latency differences of 0%, 1%, and 10%. The 0% cases are important to verify that the benchmark does not mistakenly conclude that one function is faster than the other when they in fact have the same median latency (i.e., a statistical Type I error).
  - Two base median latencies were used: 100 microseconds and 20 milliseconds.
  - A sample size (`exec_count`) of 2,000 was used with the base median of 100 microseconds and a sample size of 200 was used with the base median of 20 milliseconds.
  - Three variance levels were used: 0 (deterministic functions), low (standard deviation of the log of the distribution = ln(1.2)/2, which corresponds to a multiplicative effect of 1.095 at the 68th percentile for a lognormal distribution), and high (standard deviation of the log of the distribution = ln(2.4)/2, which corresponds to a multiplicative effect of 1.55 at the 68th percentile for a lognormal distribution).
General observations – Following are general observations from the test benchmark results:
- Comparative test benchmarks
  - The raw data shows that relative latency differences measured with `bench_diff` are invariably more accurate than those measured with the traditional method.
  - We use the following definitions in the comparison table below:
    - A median and/or mean reversal occurs when the measured median and/or mean latency of `f2` is higher than that of `f1`. Recall that, by construction, `f1` has higher latency than `f2`.
    - A median and/or mean anomaly occurs when the measured relative difference of the medians and/or means of the two functions' latencies differs by more than 40% from the known relative difference.
  - Comparison table – The following conclusions can be gleaned from the table below:
    - Reversals and anomalies were much more frequent with the traditional method than with `bench_diff`.
    - For latency magnitudes of hundreds of microseconds, reversals and anomalies were pervasive with the traditional method, while they were very low or non-existent with `bench_diff`. Furthermore, even when reversals were present with `bench_diff`, the t-tests always correctly identified which of the two functions was faster.
    - For latency magnitudes of tens of milliseconds, reversals and anomalies were lower for both the traditional method and `bench_diff`. They were practically non-existent with `bench_diff` for relative latency differences as low as 1%. By contrast, the frequencies of reversals and anomalies were still substantial with the traditional method for relative latency differences of 2% or lower.

| Benchmark type | Base median latency | Sample size | Relative difference | Reversals | Anomalies | t_test pass |
|----------------|---------------------|-------------|---------------------|-----------|-----------|-------------|
| traditional    | 100 µs | 2,000 | 1%  | 47 | 94 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 1%  | 2  | 13 | all |
| traditional    | 100 µs | 2,000 | 2%  | 35 | 89 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 2%  | 0  | 5  | all |
| traditional    | 100 µs | 2,000 | 5%  | 21 | 74 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 5%  | 0  | 1  | all |
| traditional    | 100 µs | 2,000 | 10% | 19 | 57 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 10% | 0  | 0  | all |
| traditional    | 20 ms  | 200   | 1%  | 31 | 82 | N/A |
| `bench_diff`   | 20 ms  | 200   | 1%  | 0  | 2  | all |
| traditional    | 20 ms  | 200   | 2%  | 24 | 75 | N/A |
| `bench_diff`   | 20 ms  | 200   | 2%  | 0  | 0  | all |
| traditional    | 20 ms  | 200   | 5%  | 0  | 6  | N/A |
| `bench_diff`   | 20 ms  | 200   | 5%  | 0  | 0  | all |
| traditional    | 20 ms  | 200   | 10% | 0  | 1  | N/A |
| `bench_diff`   | 20 ms  | 200   | 10% | 0  | 0  | all |
- Standalone `bench_diff` test benchmarks
  - For comparisons between deterministic functions (i.e., no randomness in the functions themselves):
    - Type I error frequency was in line (within 2 sigma) with a chosen alpha of 0.05 and the corresponding binomial distribution (p=0.05, n=100).
    - Type II error frequency was in line (within 2 sigma) with a beta of 0.05 and the corresponding binomial distribution (p=0.05, n=100) for the sample sizes (`exec_count`) used, which were the same as those used in the comparative test benchmarks.
  - For comparisons where at least one of the functions was non-deterministic, with base median of 100 microseconds and sample size of 2,000:
    - Type I error frequency was in line (within 2 sigma) with a chosen alpha of 0.05 and the corresponding binomial distribution (p=0.05, n=100).
    - In all but one case, Type II error frequency was in line (within 2 sigma) with a beta of 0.05 and the corresponding binomial distribution (p=0.05, n=100) for the sample size used. The only exception was a comparison involving a small relative latency difference (1%) and a high latency variance.
  - For comparisons where at least one of the functions was non-deterministic, with base median of 20 milliseconds and sample size of 200:
    - Type I error frequency was in line (within 2 sigma) with a chosen alpha of 0.05 and the corresponding binomial distribution (p=0.05, n=100).
    - In most cases, Type II error frequency was in line (within 2 sigma) with a beta of 0.05 and the corresponding binomial distribution (p=0.05, n=100) for the sample size used. The exceptions were the comparisons involving either a small relative latency difference (1%) together with low or high variance, or a moderate relative latency difference (10%) together with a high latency variance.
  - Unsurprisingly, the observed Type II errors are higher when the difference in means/medians is smaller and/or the variance is higher. Of course, Type II errors can be reduced by increasing the sample size (and testing time), which reduces beta.
§A Model of Time-Dependent Random Noise
Following is a simple mathematical model of time-dependent random noise. This model is useful for reasoning about time-dependent noise and helps in understanding why `bench_diff` is more effective than traditional benchmarking for the comparison of latencies between two functions.

Nonetheless, the model's fit with reality is not necessary to validate `bench_diff`, as the test benchmarks discussed previously provide independent validation of the library.
§The Model
The first section defines the model. Subsequent sections develop estimates for some relevant statistics and parameters.
Definitions and Assumptions
This section defines the model per se.
1. Let ln(x) be the natural logarithm of x.
2. Let L(f, t) be the latency of function f at time t.
3. Let λ1 be the baseline (ideal) latency of function f1 in the absence of noise; respectively, λ2 for f2.
4. Given a random variable χ, let E(χ) and Stdev(χ) be the expectation and standard deviation of χ, respectively.
5. Assume time-dependent noise is ν(t) =def α(t) * β(t), where:
   - α(t) is a differentiable deterministic function of t, such that there are positive constants AL, AU, and AD for which AL ≤ α(t) ≤ AU and |α’(t)| ≤ AD, for all t.
   - β(t) is a family of mutually independent log-normal random variables indexed by t, such that E(ln(β(t))) = 0 and Stdev(ln(β(t))) = σ, where σ is a constant that does not depend on t.
6. Assume L(f1, t) = λ1 * ν(t) and L(f2, t) = λ2 * ν(t) for all t.
A couple of points to notice about this model:

- λ1 and λ2 can't be determined independently of α. If we multiply λ1 and λ2 by a constant factor r and divide α by the same factor, we end up with an exactly equivalent model. Nonetheless, the ratio λ1/λ2 remains the same regardless of the multiplier r, so the model is useful to reason about the ratio of the function latencies or, equivalently, the difference of the logarithms of the latencies.
- This spurious degree of freedom in the model can be removed, without loss of generality, by assuming the lowest value of α(t) is 1 = AL.
- As a log-normally distributed random variable, β(t) can take arbitrarily small positive values, which implies that the function latencies can take arbitrarily small values. That is clearly not 100% realistic. It is, however, a reasonable approximation of reality, especially when σ is close to 0, in which case the values of β(t) are clustered around 1.

7. Let mean_diff_ln be the sample mean difference between the natural logarithms of the observed latencies.
8. When we measure f1's latency at a time t1, getting L(f1, t1), and right after we measure f2's latency, the measurement for f2 occurs at a time t2 that is very close to t1. We assume that t2 = t1 + L(f1, t1).
Game Plan

The goal is to obtain a bound on how closely mean_diff_ln estimates ln(λ1/λ2):

- Obtain an upper bound BE on |E(mean_diff_ln) - ln(λ1/λ2)|.
- Obtain an upper bound BA on E(|mean_diff_ln - ln(λ1/λ2)|).
- mean_diff_ln is approximately normal, so its median and mean are approximately the same.
- The median minimizes the mean absolute deviation, so E(|mean_diff_ln - E(mean_diff_ln)|) ≤ BA.
- A normal distribution with a mean absolute deviation from its mean ≤ BA has stdev ≤ BA * √(π/2).
- The 99% confidence interval around the mean for a normal distribution is stdev * 2.58.
- |mean_diff_ln - ln(λ1/λ2)| ≤ |mean_diff_ln - E(mean_diff_ln)| + |E(mean_diff_ln) - ln(λ1/λ2)|.
- With 99% confidence, |mean_diff_ln - ln(λ1/λ2)| ≤ BA * √(π/2) * 2.58 + BE.
Expansion of mean_diff_ln

This section expands mean_diff_ln (see Definition 7) to facilitate subsequent upper bound calculations.

1. When f1 is executed at time t1 and f2 is executed right after at time t2 = t1 + Δt1, using Assumptions 5, 6, 8:
   - L(f1, t1) = λ1 * α(t1) * β(t1)
   - L(f2, t2) = λ2 * α(t1 + Δt1) * β(t2)   [ where Δt1 = L(f1, t1) ]

2. Applying natural logarithms to Point 1:
   - ln(L(f1, t1)) = ln(λ1) + ln(α(t1)) + ln(β(t1))
   - ln(L(f2, t2)) = ln(λ2) + ln(α(t1 + Δt1)) + ln(β(t2))

3. By Lagrange's mean value theorem for ln(α(t)) and the bounds on α(t) and α’(t) from Assumption 5:
   - ln(α(t1 + Δt1))
     = ln(α(t1)) + Δt1 * α’(tp)/α(tp)   [ for some tp between t1 and t1 + Δt1 ]
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * α(t1)/α(tp)
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * (1 + α(t1)/α(tp) - α(t1)/α(t1))
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * (1 + α(t1) * (1/α(tp) - 1/α(t1)))
     By Lagrange's mean value theorem for 1/α(t):
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * (1 - α(t1) * α’(tq)/α(tq)² * (tp - t1))   [ for some tq between t1 and tp ]
     = ln(α(t1)) + Δt1 * ε1(t1)/α(t1)
     [ where ε1(t1) =def α’(tp) * (1 - α(t1)*α’(tq)/α(tq)²*(tp - t1)), and thus |ε1(t1)| ≤ AD*(1 + AD*AU/AL²*Δt1) ]

4. Applying Point 3 to Point 2:
   - ln(L(f1, t1)) = ln(λ1) + ln(α(t1)) + ln(β(t1))
   - ln(L(f2, t2))
     = ln(λ2) + ln(α(t1)) + Δt1 * ε1(t1)/α(t1) + ln(β(t2))
     Since Δt1 = L(f1, t1):
     = ln(λ2) + ln(α(t1)) + L(f1, t1) * ε1(t1)/α(t1) + ln(β(t2))
     Using Assumptions 5 and 6 to expand L(f1, t1):
     = ln(λ2) + ln(α(t1)) + (λ1 * α(t1) * β(t1)) * ε1(t1)/α(t1) + ln(β(t2))
     = ln(λ2) + ln(α(t1)) + λ1 * β(t1) * ε1(t1) + ln(β(t2))

5. Subtracting the second equation from the first in Point 4:
   - ln(L(f1, t1)) - ln(L(f2, t2))
     = ln(λ1) + ln(α(t1)) + ln(β(t1)) - ln(λ2) - ln(α(t1)) - λ1 * β(t1) * ε1(t1) - ln(β(t2))
     = ln(λ1/λ2) - λ1 * β(t1) * ε1(t1) + ln(β(t1)) - ln(β(t2))

6. Using Assumptions 5, 6, 8 to expand L(f1, t1) and rewrite the bound on |ε1(t1)|:
   - |ε1(t1)|
     ≤ AD*(1 + AD*AU/AL²*Δt1)
     = AD*(1 + AD*AU/AL²*L(f1, t1))
     = AD*(1 + AD*AU/AL²*(λ1 * α(t1) * β(t1)))
     ≤ AD*(1 + AD*AU/AL²*(λ1 * AU * β(t1)))
     = AD*(1 + λ1*AD*AU²/AL²*β(t1))

7. With `bench_diff`, measurements are done in pairs, with one half of the pairs having f1 followed by f2 and the other half having f2 followed by f1. The equation in Point 5 above pertains to the first case. The analogous equation for the second case is:
   - ln(L(f2, t2’)) - ln(L(f1, t1’)) = ln(λ2/λ1) - λ2 * β(t2’) * ε2(t2’) + ln(β(t1’)) - ln(β(t2’))
   Or, equivalently:
   - ln(L(f1, t1’)) - ln(L(f2, t2’)) = ln(λ1/λ2) + λ2 * β(t2’) * ε2(t2’) - ln(β(t1’)) + ln(β(t2’))
     [ where ε2(t2’) =def α’(tp’)*(1 - α(t2’)*α’(tq’)/α(tq’)²*(tp’ - t2’)), and thus |ε2(t2’)| ≤ AD*(1 + λ2*AD*AU²/AL²*β(t2’)) (see Point 6) ]

8. Assuming the number of latency observations for each function is n and considering the two cases as described in Point 7, we can calculate the sample mean difference between the natural logarithms of the observed latencies (see Definition 7):
   - mean_diff_ln = (1/n) * ∑i=1..n (ln(L(f1, t1,i)) - ln(L(f2, t2,i)))
     = (1/n) * (∑i:odd (ln(L(f1, t1,i)) - ln(L(f2, t2,i))) + ∑i:even (ln(L(f1, t1,i)) - ln(L(f2, t2,i))))
     = (1/n) * ∑i:odd (ln(λ1/λ2) - λ1 * β(t1,i) * ε1(t1,i) + ln(β(t1,i)) - ln(β(t2,i))) +
       (1/n) * ∑i:even (ln(λ1/λ2) + λ2 * β(t2,i) * ε2(t2,i) - ln(β(t1,i)) + ln(β(t2,i)))
     = ln(λ1/λ2) +
       (1/n) * ∑i:odd (-λ1 * β(t1,i) * ε1(t1,i) + ln(β(t1,i)) - ln(β(t2,i))) +
       (1/n) * ∑i:even (λ2 * β(t2,i) * ε2(t2,i) - ln(β(t1,i)) + ln(β(t2,i)))
Expectation of mean_diff_ln

This section calculates a bound on |E(mean_diff_ln) - ln(λ1/λ2)| (see Definition 7).

1. From Expansion Point 8:
   - E(mean_diff_ln) - ln(λ1/λ2)
     Due to the linearity of E() and the fact that E(ln(β(t))) = 0:
     = (1/n) * -λ1 * ∑i:odd E(β(t1,i) * ε1(t1,i)) + (1/n) * λ2 * ∑i:even E(β(t2,i) * ε2(t2,i))

2. From Point 1 and the bounds on ε1(t1,i) and ε2(t2,i) from Expansion Points 6 and 7:
   - |E(mean_diff_ln) - ln(λ1/λ2)|
     ≤ (1/n) * λ1 * ∑i:odd E(β(t1,i) * |ε1(t1,i)|) + (1/n) * λ2 * ∑i:even E(β(t2,i) * |ε2(t2,i)|)
     ≤ (1/n) * λ1 * ∑i:odd E(β(t1,i) * AD*(1 + λ1*AD*AU²/AL²*β(t1,i))) +
       (1/n) * λ2 * ∑i:even E(β(t2,i) * AD*(1 + λ2*AD*AU²/AL²*β(t2,i)))
     = (1/n) * λ1 * ∑i:odd (AD*E(β(t1,i)) + λ1*AD²*AU²/AL²*E(β(t1,i)²)) +
       (1/n) * λ2 * ∑i:even (AD*E(β(t2,i)) + λ2*AD²*AU²/AL²*E(β(t2,i)²))
     By the expectation of log-normal distributions:
     = (1/n) * λ1 * ∑i:odd (AD*exp(σ²/2) + λ1*AD²*AU²/AL²*exp((2σ)²/2)) +
       (1/n) * λ2 * ∑i:even (AD*exp(σ²/2) + λ2*AD²*AU²/AL²*exp((2σ)²/2))
     = (1/n) * λ1 * ∑i:odd (AD*exp(σ²/2) + λ1*AD²*AU²/AL²*exp(2σ²)) +
       (1/n) * λ2 * ∑i:even (AD*exp(σ²/2) + λ2*AD²*AU²/AL²*exp(2σ²))
     = λ1/2 * (AD*exp(σ²/2) + λ1*AD²*AU²/AL²*exp(2σ²)) +
       λ2/2 * (AD*exp(σ²/2) + λ2*AD²*AU²/AL²*exp(2σ²))
     = (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)

3. Define BE as the right-hand side of the immediately above equality. Thus:
   - |E(mean_diff_ln) - ln(λ1/λ2)| ≤ BE.
Mean Absolute Error of mean_diff_ln

This section calculates the expected absolute error of mean_diff_ln (see Definition 7) as an estimator of ln(λ1/λ2).

1. From Expansion Point 8:
   - absolute_error(mean_diff_ln) =def |mean_diff_ln - ln(λ1/λ2)|
     = |(1/n) * ∑i:odd (-λ1 * β(t1,i) * ε1(t1,i) + ln(β(t1,i)) - ln(β(t2,i))) +
        (1/n) * ∑i:even (λ2 * β(t2,i) * ε2(t2,i) - ln(β(t1,i)) + ln(β(t2,i)))|
     = |(1/n) * ∑i:odd -λ1 * β(t1,i) * ε1(t1,i) + (1/n) * ∑i:even λ2 * β(t2,i) * ε2(t2,i) +
        (1/n) * (∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i))))|
     ≤ (1/n) * |∑i:odd -λ1 * β(t1,i) * ε1(t1,i) + ∑i:even λ2 * β(t2,i) * ε2(t2,i)| +
       (1/n) * |∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|

2. From the above point, defining the mean absolute error:
   - mean_absolute_error(mean_diff_ln) =def E(absolute_error(mean_diff_ln))
     ≤ (1/n) * λ1 * ∑i:odd E(β(t1,i) * |ε1(t1,i)|) + (1/n) * λ2 * ∑i:even E(β(t2,i) * |ε2(t2,i)|) +
       (1/n) * E(|∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|)
     By Expectation Point 2, the first line on the right-hand side of the above inequality is bounded by the last line of Expectation Point 2. Thus:
     ≤ (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²) +
       (1/n) * E(|∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|)
     By Expectation Point 3:
     = BE +
       (1/n) * E(|∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|)
     On the line immediately above, the term inside the absolute value is the sum of 2*n independently distributed normal random variables with mean 0 and standard deviation σ. It is, therefore, normally distributed with mean 0 and standard deviation σ*√(2*n). From the formula for the expectation of the absolute value of a normal random variable, we get:
     = BE + (1/n) * σ*√(2*n) * √(2/π)
     = BE + 2 * σ / √(n*π)

3. Define BA as the right-hand side of the immediately above equality. Thus:
   - mean_absolute_error(mean_diff_ln) ≤ BA.

4. For a large sample size n, the above upper bound becomes, approximately:
   - BA ≈ BE = (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)
Confidence Interval

1. mean_diff_ln is approximately normal, so its median and mean are approximately the same.

2. The median minimizes the mean absolute deviation. From Point 1 and Mean Absolute Error Point 3:
   - E(|mean_diff_ln - E(mean_diff_ln)|)
     ≈ E(|mean_diff_ln - median(mean_diff_ln)|)
     ≤ mean_absolute_error(mean_diff_ln)
     ≤ BA

3. A normal distribution with a mean absolute deviation from its mean ≤ BA has stdev ≤ BA * √(π/2). Thus, from Points 1 and 2:
   - Stdev(mean_diff_ln) ≤ BA * √(π/2)

4. The 99% confidence interval around the mean for a normal distribution is stdev * 2.58. Thus:
   - |mean_diff_ln - E(mean_diff_ln)| ≤ BA * √(π/2) * 2.58   [ with 99% confidence ]

5. 99% confidence bound for |mean_diff_ln - ln(λ1/λ2)|:
   - |mean_diff_ln - ln(λ1/λ2)|
     ≤ |mean_diff_ln - E(mean_diff_ln)| + |E(mean_diff_ln) - ln(λ1/λ2)|
     From Point 4 and Expectation Point 3:
     ≤ BA * √(π/2) * 2.58 + BE   [ with 99% confidence ]
     ≤ (BE + 2 * σ / √(n*π)) * √(π/2) * 2.58 + BE   [ with 99% confidence ]
     = BE * (√(π/2) * 2.58 + 1) + 2 * σ / √(n*π) * √(π/2) * 2.58   [ with 99% confidence ]
     = BE * (√(π/2) * 2.58 + 1) + √(2/n) * σ * 2.58   [ with 99% confidence ]

6. Define BC as the right-hand side of the immediately above equality. Thus:
   - |mean_diff_ln - ln(λ1/λ2)| ≤ BC   [ with 99% confidence ]

7. For a large sample size n, the above confidence bound becomes, approximately:
   - BC ≈ BE * (√(π/2) * 2.58 + 1)
        = ((λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)) * (√(π/2) * 2.58 + 1)
Reasonable Parameter Values
Assume, without loss of generality, that the lowest value of α(t) is 1 = AL.
In practice, one could reasonably expect the time-dependent random noise parameters to be no greater (and likely substantially lower) than the values below:
- Rate of change of α(t) < 5% per second, which means AD < .00005 when time is measured in milliseconds and AD < .00000005 when time is measured in microseconds.
- AU/AL < 2
- σ < 0.3
From the above, given λ1 and λ2, we can compute BC (assuming a large sample size):
- λ1 = λ2 = 12 ms ⇒ BC < 0.003
- λ1 = λ2 = 120 ms ⇒ BC < 0.028
Notice that while the variability of the statistic mean_diff_ln depends on the sample size (exec_count), the BC upper bound does not, provided that the sample size is large enough (see Confidence Interval Item 7).
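The BC figures above can be checked numerically. The following is a standalone sketch (not part of the library's API; `bc_bound` is a hypothetical helper name) that plugs the "reasonable parameter values" into the large-sample BC formula, with time measured in milliseconds.

```rust
/// Computes the large-sample BC bound from the model:
/// BE = (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)
/// BC ≈ BE * (√(π/2) * 2.58 + 1)
fn bc_bound(lambda1: f64, lambda2: f64, ad: f64, au: f64, al: f64, sigma: f64) -> f64 {
    use std::f64::consts::PI;
    let be = (lambda1 + lambda2) / 2.0 * ad * (sigma * sigma / 2.0).exp()
        + (lambda1 * lambda1 + lambda2 * lambda2) / 2.0
            * ad * ad * au * au / (al * al)
            * (2.0 * sigma * sigma).exp();
    be * ((PI / 2.0).sqrt() * 2.58 + 1.0)
}

fn main() {
    // 5%/s drift rate => AD = 0.00005 per ms; AU/AL = 2 with AL = 1; σ = 0.3.
    let (ad, au, al, sigma) = (0.00005, 2.0, 1.0, 0.3);
    let bc12 = bc_bound(12.0, 12.0, ad, au, al, sigma);
    let bc120 = bc_bound(120.0, 120.0, ad, au, al, sigma);
    assert!(bc12 < 0.003);
    assert!(bc120 < 0.028);
    println!("BC(λ = 12 ms)  ≈ {bc12:.5}");
    println!("BC(λ = 120 ms) ≈ {bc120:.5}");
}
```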
§Comparative Example
We will define an example of the above model and compare how bench_diff
and the traditional benchmarking method fare with respect to the model. The example is admittedly contrived in order to facilitate approximate calculations and also to highlight the potential disparity of results between the two benchmarking methods.
Model parameters
- The two functions, f1 and f2, are identical (call it f), with λ1 = λ2 = λ = 12 ms.
- The number of executions of each function is exec_count = 2500. So, the total execution time, ignoring warm-up, is 1 minute.
- α(t) = 1.5 + 1/2 * sin(t * 2*π / 60000), where t is the number of milliseconds elapsed since the start of the benchmark.
- α’(t)
  = 1/2 * 2*π / 60000 * cos(t * 2*π / 60000)
  = π / 60000 * cos(t * 2*π / 60000)
  Therefore:
  - |α’(t)| ≤ π / 60000
  And we have the following bounds for α(t):
  - AL = 1
  - AU = 2
  - AD = π / 60000
- β(t) has σ = 0.28.
  - Therefore, E(β(t)) = exp(σ²/2) ≈ 1.0400.
- The baseline median latency of f is the median latency when α(t) = 1. Its value is λ = λ1 = λ2 = 12 ms.
- The baseline mean latency of f is the mean latency when α(t) = 1. Its value is μ = μ1 = μ2 = λ * exp(σ²/2) ≈ 12 * 1.0400 = 12.48 ms.
- The ratio of baseline medians is always equal to the ratio of baseline means, even when f1 ≠ f2, because σ does not depend on f1 or f2.
`bench_diff` calculations

- Given exec_count = 2500, the large sample size approximation for the BC bound can be used:
  - BC ≈ ((λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)) * (√(π/2) * 2.58 + 1)
       ≈ 0.00277
- From the model (Confidence Interval Point 7):
  - |mean_diff_ln - ln(λ1/λ2)| ≤ 0.00277   [ with 99% confidence ]
- So, the multiplicative error on the estimate of λ1/λ2 (= λ/λ = 1 in our example) is at most, with 99% confidence:
  - exp(0.00277) ≈ 1.00278, i.e., less than 3/10 of 1% error.
- Recall that the error bound does not depend on the number of executions, so it is the same with only half the number of executions. Also, given the high exec_count assumed, the `bench_diff` results during the first half should be very close to those obtained during the second half.
- If instead of λ = λ1 = λ2 = 12 ms and exec_count = 2500 we assume λ = λ1 = λ2 = 120 ms and exec_count = 250, then, with 99% confidence:
  - |mean_diff_ln - ln(λ1/λ2)| ≤ 0.0284
  - Multiplicative error ≤ 1.0289, i.e., about 3% error.
Traditional method calculations

- With the traditional method, we benchmark f1 with exec_count = 2500 and then benchmark f2 (which is the same as f1 in our example) with exec_count = 2500.
- The first benchmark of f takes place during the first 30 seconds. The calculated sample mean latency is approximately:
  - μ * 1/30000 * ∫[0, 30000] α(t) dt
    = μ/30000 * (45000 + 1/2 * (-cos(30000 * 2*π / 60000) + cos(0)) / (2*π / 60000))
    = 1.5*μ + μ/30000 * 1/2 * (-cos(π) + cos(0)) / (2*π / 60000)
    = 1.5*μ + μ * (-cos(π) + cos(0)) / (2*π)
    = 1.5*μ + μ * (1 + 1) / (2*π)
    = μ * (1.5 + 1/π)
    ≈ μ * 1.8183
- The second benchmark of f takes place during the second 30 seconds. The calculated sample mean latency is approximately:
  - μ * 1/30000 * ∫[30000, 60000] α(t) dt
    = μ/30000 * (45000 + 1/2 * (-cos(60000 * 2*π / 60000) + cos(30000 * 2*π / 60000)) / (2*π / 60000))
    = 1.5*μ + μ/30000 * 1/2 * (-cos(2*π) + cos(π)) / (2*π / 60000)
    = 1.5*μ + μ * (-cos(2*π) + cos(π)) / (2*π)
    = 1.5*μ + μ * (-1 + -1) / (2*π)
    = μ * (1.5 - 1/π)
    ≈ μ * 1.1817
- The estimated ratio of mean latencies is approximately 1.8183 / 1.1817 ≈ 1.5387, an estimating error of 54% in comparison with the baseline ratio of means μ1/μ2 = 1 (which is also equal to the baseline ratio of medians λ1/λ2).
- If instead of λ = λ1 = λ2 = 12 ms and exec_count = 2500 we assume λ = λ1 = λ2 = 120 ms and exec_count = 250, then the results are roughly the same, i.e., an estimating error of 54%.
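The two half-window averages of α(t) above can be verified numerically. This is a standalone sketch (the `avg` midpoint-rule integrator is an illustrative helper, not library code):

```rust
fn main() {
    use std::f64::consts::PI;

    // The example's noise factor: α(t) = 1.5 + 1/2 * sin(2πt / 60000), t in ms.
    let alpha = |t_ms: f64| 1.5 + 0.5 * (t_ms * 2.0 * PI / 60_000.0).sin();

    // Midpoint-rule numerical integration of α over [a, b], divided by (b - a).
    let avg = |a: f64, b: f64| {
        let n = 100_000;
        let h = (b - a) / n as f64;
        (0..n).map(|i| alpha(a + (i as f64 + 0.5) * h)).sum::<f64>() * h / (b - a)
    };

    let first_half = avg(0.0, 30_000.0);       // ≈ 1.5 + 1/π ≈ 1.8183
    let second_half = avg(30_000.0, 60_000.0); // ≈ 1.5 - 1/π ≈ 1.1817
    let ratio = first_half / second_half;      // ≈ 1.5387, a ~54% inflation

    assert!((first_half - (1.5 + 1.0 / PI)).abs() < 1e-4);
    assert!((second_half - (1.5 - 1.0 / PI)).abs() < 1e-4);
    assert!((ratio - 1.5387).abs() < 1e-3);
    println!(
        "first half ≈ {first_half:.4}, second half ≈ {second_half:.4}, ratio ≈ {ratio:.4}"
    );
}
```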
Closing comments for the example

- In this example, the estimate of λ1/λ2 (or equivalently μ1/μ2) provided by `bench_diff` is accurate to within 3/10 of 1% of the true value of 1. By contrast, the estimate of μ1/μ2 provided by the traditional method is inflated by 54%.
- If we “run” `bench_diff` during the model's first 30 seconds, we still get a very accurate estimate of λ1/λ2, but the individual (sample mean) estimates of μ1 and μ2 are both close to μ * 1.8183, just as with the traditional method. Likewise, if we “run” `bench_diff` during the model's last 30 seconds, we still get a very accurate estimate of λ1/λ2, but the individual (sample mean) estimates of μ1 and μ2 are both close to μ * 1.1817, just as with the traditional method.
- The key point of `bench_diff` is to repeatedly run both functions in close time proximity to each other, so that the ratios of the two functions' latencies are close to the baseline even if the individual latencies themselves are distorted by time-dependent noise.
- If instead of λ = λ1 = λ2 = 12 ms and exec_count = 2500 we assume λ = λ1 = λ2 = 120 ms and exec_count = 250, then the accuracy of the `bench_diff` results is worse, but it is still much better (at around a 3% error) than the traditional result (which remains about the same at around a 54% error).
§Limitations
This library works well for latencies at the microseconds or milliseconds order of magnitude, but not for latencies at the nanoseconds order of magnitude.
Modules§

- `statistics` (Deprecated) – Structs and enums for confidence intervals and hypothesis tests.
- `stats_types` – Structs and enums for confidence intervals and hypothesis tests.

Structs§

- `DiffOut` – Contains the data resulting from a benchmark comparing two closures `f1` and `f2`.
- `SummaryStats` – Common summary statistics useful in latency testing/benchmarking.

Enums§

- `LatencyUnit` – Unit of time used to record latencies. Used as an argument in benchmarking functions.

Functions§

- `bench_diff` – Compares latencies for two closures `f1` and `f2`.
- `bench_diff_with_status` – Compares latencies for two closures `f1` and `f2` and outputs information about the benchmark and its execution status. Execution status is output to `stderr`.
- `bench_diff_x` – Compares latencies for two closures `f1` and `f2` and optionally outputs information about the benchmark and its execution status.
- `get_warmup_millis` – The currently defined number of milliseconds used to “warm up” the benchmark. The default is 3,000 ms.
- `set_warmup_millis` – Changes the number of milliseconds used to “warm up” the benchmark. The default is 3,000 ms.