This library supports reliable latency comparison between two functions/closures. This is trickier than it may seem, due to time-dependent random noise and ordering effects. This library provides commonly used latency metrics for the target functions (mean, standard deviation, median, percentiles, min, max), but it differentiates itself by providing support for the statistically rigorous comparison of latencies between two functions.
One could simply try to run tools like Criterion or Divan on each function and compare the results. However, these challenges arise:
- Time-dependent random noise – Random noise can and often does change over time. This can significantly distort the comparison. (This is also a contributor to the ordering effect – see below). Increasing the sample size (number of function executions) can narrow the variance of a function’s median or mean latency as measured at a point in time, but it is not effective by itself in mitigating the time-dependent variability.
- Ordering effect – When running two benchmarks, one after the other, it is not uncommon to observe that the first one may get an edge over the second one. This can also significantly distort the comparison.
Due to these effects, at the microseconds latency magnitude, a function that is known by construction to be 5% faster than another function can often show a mean latency and/or median latency that is higher than the corresponding metric for the slower function. This effect is less pronounced for latencies at the milliseconds magnitude or higher, but it can still distort results. In general, the smaller the difference between the latencies of the two functions, the harder it is to distinguish the faster function from the slower one with the usual benchmarking approaches.
One could tackle these challenges by running the individual benchmarks, one after the other, repeating that multiple times, and using appropriate statistical methods to compare the two observed latency distributions. This is, however, quite time consuming and requires careful planning and analysis.
The present library has been validated by extensive benchmark testing and provides a convenient, efficient, and statistically sound alternative to the above more cumbersome methodology.
For the mathematically inclined reader, a later section presents a simple model to reason about time-dependent random noise.
§Quick Start
To run a benchmark with `bench_diff`, follow these simple steps:

1. Create a bench file – see Simple Bench Example for a representative example.

2. Add the `bench_diff` dependency in `Cargo.toml`:

   ```toml
   [dev-dependencies]
   bench_diff = "1"
   ```

3. Create in `Cargo.toml` a `[[bench]]` section corresponding to the bench file. For example, given a file `benches/simple_bench.rs`, add the following to `Cargo.toml`:

   ```toml
   [[bench]]
   name = "simple_bench"
   harness = false
   ```

4. Execute the benchmark the usual way:

   ```sh
   cargo bench --bench simple_bench
   ```
§Simple Bench Example
This example includes two comparisons of latencies of two functions and prints statistics for each comparison and each function.

```rust
//! Simple example benchmark.
//!
//! To run the bench (assuming the source is at `benches/simple_bench.rs`):
//! ```
//! cargo bench --bench simple_bench
//! ```

use bench_diff::{DiffOut, LatencyUnit, bench_diff_with_status, stats_types::AltHyp};
use std::{thread, time::Duration};

/// This function's latency is at least 21ms.
fn f1() {
    thread::sleep(Duration::from_millis(21));
}

/// This function's latency is at least 20ms.
fn f2() {
    thread::sleep(Duration::from_millis(20));
}

// The latency ratio between these functions should be approximately 1.05.
// Because the magnitude of latencies involved is milliseconds, it is
// a good idea to use `LatencyUnit::Micro` below. As a rule of thumb, always use
// the closest finer-grained latency unit.

fn main() {
    // Output values are in the selected latency unit, i.e., microseconds.
    println!("*** 1st benchmark ***");
    {
        let out: DiffOut = bench_diff_with_status(LatencyUnit::Micro, f1, f2, 100, |_, _| {
            println!("Comparing latency of f1 vs. f2.");
            println!();
        });
        print_diff_out(&out);
    }

    println!("*** 2nd benchmark ***");
    {
        let out: DiffOut = bench_diff_with_status(LatencyUnit::Micro, f1, f1, 100, |_, _| {
            println!("Comparing latency of f1 vs. f1.");
            println!();
        });
        print_diff_out(&out);
    }
}

fn print_diff_out(out: &DiffOut) {
    const ALPHA: f64 = 0.05;

    println!();
    println!("ratio_medians_f1_f2={}", out.ratio_medians_f1_f2());
    println!("student_ratio_ci={:?}", out.student_ratio_ci(ALPHA));
    println!(
        "student_diff_ln_test_gt:{:?}",
        out.student_diff_ln_test(AltHyp::Gt, ALPHA)
    );
    println!();
    println!("summary_f1={:?}", out.summary_f1());
    println!();
    println!("summary_f2={:?}", out.summary_f2());
    println!();
}
```
§Implementation Approach
This library addresses the ordering effect and random noise challenges as follows. Given two functions `f1` and `f2`:

- It repeatedly executes duos of pairs (`f1`, `f2`), (`f2`, `f1`). This ensures the following:
  - Obviously, both functions are executed the same number of times.
  - Each function is executed as many times preceded by itself as preceded by the other function. This removes the ordering effect.
  - Because the function executions are paired, those executions are in close time proximity to each other, so the time-dependent variation is effectively neutralized in the comparison of latencies (even though it may persist in the latency distributions of the individual functions).
  - Latencies can be compared pairwise, as well as overall for both functions. This enables the analysis of data with statistical methods targeting either two independent samples or a single paired sample.
  - The number of function executions can be specified according to the desired levels of confidence and fine-grained discrimination. More on this in the Statistical Details section.
- It warms up by executing the above pattern for some time (3 seconds by default) before it starts tallying the results. This is similar to the Criterion warm-up strategy.
An important tool used for data collection in this library is the hdrhistogram crate, which provides an efficient histogram implementation with high fidelity for wide ranges of positive values. Additionally, accumulators are used to support some inferential statistics.
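The duo-of-pairs execution pattern described above can be sketched as follows. This is a minimal standalone illustration, not the library's actual implementation (which uses warm-up, hdrhistogram recording, and accumulators); `time_once` and `duo_pairs` are hypothetical helper names.

```rust
use std::time::Instant;

/// Times a single call of `f`, returning the elapsed time in nanoseconds.
fn time_once(f: &mut impl FnMut()) -> u128 {
    let start = Instant::now();
    f();
    start.elapsed().as_nanos()
}

/// Minimal sketch of the duo-of-pairs pattern: each iteration runs (f1, f2)
/// then (f2, f1), so both functions execute equally often and each function is
/// preceded by itself as often as by the other, cancelling the ordering
/// effect. Returns the paired latency observations for (f1, f2).
fn duo_pairs(
    mut f1: impl FnMut(),
    mut f2: impl FnMut(),
    exec_count: usize,
) -> Vec<(u128, u128)> {
    let mut pairs = Vec::with_capacity(exec_count);
    for _ in 0..exec_count / 2 {
        // First pair: f1 then f2.
        let l1a = time_once(&mut f1);
        let l2a = time_once(&mut f2);
        // Second pair: f2 then f1.
        let l2b = time_once(&mut f2);
        let l1b = time_once(&mut f1);
        pairs.push((l1a, l2a));
        pairs.push((l1b, l2b));
    }
    pairs
}

fn main() {
    let pairs = duo_pairs(|| (), || (), 100);
    assert_eq!(pairs.len(), 100);
    println!("collected {} latency pairs", pairs.len());
}
```

Because paired executions happen back to back, each pair samples the noise process at nearly the same time, which is what neutralizes time-dependent noise in the comparison.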
§Statistics
Standard summary statistics (mean, standard deviation, median, percentiles, min, max) and a variety of inferential statistics (e.g., hypothesis tests, confidence intervals, t values, p values) are implemented for the `DiffOut` struct, which is output from the library's core benchmarking functions.
Much of the statistical inference supported by this library is based on the assumption that latency distributions tend to have an approximately lognormal distribution. This is a widely made assumption, which is supported by theory as well as empirical data. While the lognormal assumption underlies much of the statistical inference capability in this library, benchmarks using functions with controlled latencies have been conducted (see Testing below) to validate the soundness of the statistics.
t-tests

The `DiffOut` struct has two sets of methods associated with t-tests:

- The methods whose names include the string `welch` correspond to statistics associated with the Welch two-sample t-test for the difference between the means of the natural logarithms of the latencies of `f1` and `f2`.
- The methods whose names include the string `student` correspond to statistics associated with the Student one-sample t-test for equality to 0 of the mean of the differences of the natural logarithms of the paired latencies of `f1` and `f2`.

In benchmark tests (with typical large sample sizes) the Welch and Student variants produced roughly the same results in terms of Type I and Type II error rates, though the Student confidence intervals tended to be a little tighter.
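The Student one-sample variant on log differences can be sketched as follows. This is an illustrative computation of the t statistic only, not the library's API; `student_t_diff_ln` is a hypothetical helper name.

```rust
/// Computes the Student one-sample t statistic for testing whether the mean of
/// ln(latency_f1) - ln(latency_f2), over paired observations, differs from 0.
/// The resulting t is compared against Student's t distribution with n-1
/// degrees of freedom.
fn student_t_diff_ln(pairs: &[(f64, f64)]) -> f64 {
    let n = pairs.len() as f64;
    // Differences of the natural logs of the paired latencies.
    let diffs: Vec<f64> = pairs.iter().map(|&(l1, l2)| l1.ln() - l2.ln()).collect();
    let mean = diffs.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction.
    let var = diffs.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    // t = mean / (s / sqrt(n)).
    mean / (var.sqrt() / n.sqrt())
}

fn main() {
    // f1 consistently ~5% slower than f2: expect a strongly positive t.
    let pairs: Vec<(f64, f64)> = (0..100)
        .map(|i| {
            let base = 100.0 + (i % 7) as f64;
            // Small deterministic wobble around the 1.05 latency ratio.
            let ratio = 1.05 + 0.001 * ((i % 5) as f64 - 2.0);
            (base * ratio, base)
        })
        .collect();
    let t = student_t_diff_ln(&pairs);
    assert!(t > 10.0, "expected a strongly positive t, got {t}");
    println!("t = {t:.1}");
}
```

Working on logarithms of latencies is what makes the lognormal assumption usable: the log differences are approximately normal, so the t machinery applies.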
§Testing
This framework has been extensively tested using both standard Rust tests and test benchmarks.
Tests of inferential statistics – The implementations of inferential statistics (hypothesis tests, confidence intervals, t values, p values, etc.) have been tested independently on their own, using Rust’s standard testing approach. Sources for comparison included published data sets with their corresponding statistics, as well as calculations using R.
Test benchmarks – Specific benchmarking scenarios were defined to test the library on its own and in comparison with “traditional” benchmarking.
- Comparative test benchmarks – Using deterministic functions with specified latency medians, we compared results from `bench_diff` against results from the traditional separate benchmarking of the functions.
  - Deterministic functions `f1` and `f2` were defined such that the relative difference between the latency of `f1` and the latency of `f2` was known by construction.
  - Separate comparative tests were conducted with relative differences of 1%, 2%, 5%, and 10%.
  - Two base median latencies were used: 100 microseconds and 20 milliseconds.
  - A sample size (`exec_count`) of 2,000 was used with the base median of 100 microseconds and a sample size of 200 was used with the base median of 20 milliseconds.
  - For each comparative test:
    - A traditional benchmarking suite was conducted by benchmarking `f1` with `exec_count` executions, then benchmarking `f2` with `exec_count` executions, then `f1` again, then `f2` again, and so on, until both `f1` and `f2` had been benchmarked 100 times.
    - A `bench_diff` benchmarking suite was conducted by running `bench_diff`, with `f1`, `f2`, and `exec_count` as arguments, 100 times.
- Standalone `bench_diff` test benchmarks – These used both deterministic functions with known latency medians and non-deterministic functions with specified latency medians and variances. Each benchmark was run repeatedly (100 times) to empirically verify the frequency of Type I and Type II statistical hypothesis testing errors.
  - Benchmarks included functions with relative latency differences of 0%, 1%, and 10%. The 0% cases are important to verify that the benchmark does not mistakenly conclude that one function is faster than the other when they in fact have the same median latency (i.e., a statistical Type I error).
  - Two base median latencies were used: 100 microseconds and 20 milliseconds.
  - A sample size (`exec_count`) of 2,000 was used with the base median of 100 microseconds and a sample size of 200 was used with the base median of 20 milliseconds.
  - Three variance levels were used: 0 (deterministic functions), low (standard deviation of the log of the distribution = ln(1.2)/2, which corresponds to a multiplicative effect of 1.095 at the 68th percentile for a lognormal distribution), and high (standard deviation of the log of the distribution = ln(2.4)/2, which corresponds to a multiplicative effect of 1.55 at the 68th percentile for a lognormal distribution).
General observations – Following are general observations from the test benchmark results:
- Comparative test benchmarks
  - The raw data shows that relative latency differences measured with `bench_diff` are invariably more accurate than those measured with the traditional method.
  - We use the following definitions in the comparison table below:
    - A median and/or mean reversal occurs when the measured median and/or mean latency of `f2` is higher than that of `f1`. Recall that, by construction, `f1` has higher latency than `f2`.
    - A median and/or mean anomaly occurs when the measured relative difference of the medians and/or means of the two functions' latencies differs by more than 40% from the known relative difference.
  - Comparison table – The following conclusions can be gleaned from the table below:
    - Reversals and anomalies were much more frequent with the traditional method than with `bench_diff`.
    - For latency magnitudes of hundreds of microseconds, reversals and anomalies were pervasive with the traditional method, while they were very low or non-existent with `bench_diff`. Furthermore, even when reversals were present with `bench_diff`, the t-tests always correctly identified which of the two functions was faster.
    - For latency magnitudes of tens of milliseconds, reversals and anomalies were lower for both the traditional method and `bench_diff`. They were practically non-existent with `bench_diff` for relative latency differences as low as 1%. By contrast, the frequencies of reversals and anomalies were still substantial with the traditional method for relative latency differences of 2% or lower.

| Benchmark type | Base median latency | Sample size | Relative difference | Reversals | Anomalies | t_test pass |
|----------------|---------------------|-------------|---------------------|-----------|-----------|-------------|
| traditional    | 100 µs | 2,000 | 1%  | 47 | 94 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 1%  | 2  | 13 | all |
| traditional    | 100 µs | 2,000 | 2%  | 35 | 89 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 2%  | 0  | 5  | all |
| traditional    | 100 µs | 2,000 | 5%  | 21 | 74 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 5%  | 0  | 1  | all |
| traditional    | 100 µs | 2,000 | 10% | 19 | 57 | N/A |
| `bench_diff`   | 100 µs | 2,000 | 10% | 0  | 0  | all |
| traditional    | 20 ms  | 200   | 1%  | 31 | 82 | N/A |
| `bench_diff`   | 20 ms  | 200   | 1%  | 0  | 2  | all |
| traditional    | 20 ms  | 200   | 2%  | 24 | 75 | N/A |
| `bench_diff`   | 20 ms  | 200   | 2%  | 0  | 0  | all |
| traditional    | 20 ms  | 200   | 5%  | 0  | 6  | N/A |
| `bench_diff`   | 20 ms  | 200   | 5%  | 0  | 0  | all |
| traditional    | 20 ms  | 200   | 10% | 0  | 1  | N/A |
| `bench_diff`   | 20 ms  | 200   | 10% | 0  | 0  | all |
- Standalone `bench_diff` test benchmarks
  - For comparisons between deterministic functions (i.e., no randomness in the functions themselves):
    - Type I error frequency was in line (within 2 sigma) with a chosen alpha of 0.05 and the corresponding binomial distribution (p=0.05, n=100).
    - Type II error frequency was in line (within 2 sigma) with a beta of 0.05 and the corresponding binomial distribution (p=0.05, n=100) for the sample sizes (`exec_count`) used, which were the same as those used in the comparative test benchmarks.
  - For comparisons where at least one of the functions was non-deterministic, with base median of 100 microseconds and sample size of 2,000:
    - Type I error frequency was in line (within 2 sigma) with a chosen alpha of 0.05 and the corresponding binomial distribution (p=0.05, n=100).
    - In all but one case, Type II error frequency was in line (within 2 sigma) with a beta of 0.05 and the corresponding binomial distribution (p=0.05, n=100) for the sample size used. The only exception was a comparison involving a small relative latency difference (1%) and a high latency variance.
  - For comparisons where at least one of the functions was non-deterministic, with base median of 20 milliseconds and sample size of 200:
    - Type I error frequency was in line (within 2 sigma) with a chosen alpha of 0.05 and the corresponding binomial distribution (p=0.05, n=100).
    - In most cases, Type II error frequency was in line (within 2 sigma) with a beta of 0.05 and the corresponding binomial distribution (p=0.05, n=100) for the sample size used. The exceptions were the comparisons involving either a small relative latency difference (1%) together with low or high variance, or a moderate relative latency difference (10%) together with a high latency variance.
  - Unsurprisingly, the observed Type II errors are higher when the difference in means/medians is smaller and/or the variance is higher. Of course, Type II errors can be reduced by increasing the sample size (and testing time), which reduces beta.
§A Model of Time-Dependent Random Noise
Following is a simple mathematical model of time-dependent random noise. This model is useful for reasoning about time-dependent noise and helps in understanding why `bench_diff` is more effective than traditional benchmarking for the comparison of latencies between two functions.

Nonetheless, the model's fit with reality is not necessary to validate `bench_diff`, as the test benchmarks discussed previously provide independent validation of the library.
§The Model
The first section defines the model. Subsequent sections develop estimates for some relevant statistics and parameters.
Definitions and Assumptions
This section defines the model per se.
1. Let ln(x) be the natural logarithm of x.
2. Let L(f, t) be the latency of function f at time t.
3. Let λ1 be the baseline (ideal) latency of function f1 in the absence of noise; respectively, λ2 for f2.
4. Given a random variable χ, let E(χ) and Stdev(χ) be the expectation and standard deviation of χ, respectively.
5. Assume time-dependent noise is ν(t) =def α(t) * β(t), where:
   - α(t) is a differentiable deterministic function of t, such that there are positive constants AL, AU, and AD for which AL ≤ α(t) ≤ AU and |α’(t)| ≤ AD, for all t.
   - β(t) is a family of mutually independent log-normal random variables indexed by t, such that E(ln(β(t))) = 0 and Stdev(ln(β(t))) = σ, where σ is a constant that does not depend on t.
6. Assume L(f1, t) = λ1 * ν(t) and L(f2, t) = λ2 * ν(t) for all t.
A couple of points to notice about this model:

- λ1 and λ2 can't be determined independently of α. If we multiply λ1 and λ2 by a constant factor r and divide α by the same factor, we end up with an exactly equivalent model. Nonetheless, the ratio λ1/λ2 remains the same regardless of the multiplier r, so the model is useful to reason about the ratio of the function latencies or, equivalently, the difference of the logarithms of the latencies.
- This spurious degree of freedom in the model can be removed, without loss of generality, by assuming the lowest value of α(t) is 1 = AL.
- As a log-normally distributed random variable, β(t) can take arbitrarily small positive values, which implies that the function latencies can take arbitrarily small values. That is clearly not 100% realistic. It is, however, a reasonable approximation of reality, especially when σ is close to 0, in which case the values of β(t) are clustered around 1.

7. Let mean_diff_ln be the sample mean difference between the natural logarithms of the observed latencies.
8. When we measure f1's latency at a time t1, getting L(f1, t1), and right after we measure f2's latency, the measurement for f2 occurs at a time t2 that is very close to t1. We assume that t2 = t1 + L(f1, t1).
Game Plan

The goal is to obtain a bound on how closely mean_diff_ln estimates ln(λ1/λ2):

- Obtain an upper bound BE on |E(mean_diff_ln) - ln(λ1/λ2)|.
- Obtain an upper bound BA on E(|mean_diff_ln - ln(λ1/λ2)|).
- mean_diff_ln is approximately normal, so its median and mean are approximately the same.
- The median minimizes the mean absolute deviation, so E(|mean_diff_ln - E(mean_diff_ln)|) ≤ BA.
- A normal distribution with a mean absolute deviation from its mean ≤ BA has stdev ≤ BA * √(π/2).
- The 99% confidence interval around the mean for a normal distribution is stdev * 2.58.
- |mean_diff_ln - ln(λ1/λ2)| ≤ |mean_diff_ln - E(mean_diff_ln)| + |E(mean_diff_ln) - ln(λ1/λ2)|.
- With 99% confidence, |mean_diff_ln - ln(λ1/λ2)| ≤ BA * √(π/2) * 2.58 + BE.
Expansion of mean_diff_ln

This section expands mean_diff_ln (see Definition 7) to facilitate subsequent upper bound calculations.

1. When f1 is executed at time t1 and f2 is executed right after at time t2 = t1 + Δt1, using Assumptions 5, 6, 8:
   - L(f1, t1) = λ1 * α(t1) * β(t1)
   - L(f2, t2) = λ2 * α(t1 + Δt1) * β(t2)   [ where Δt1 = L(f1, t1) ]

2. Applying natural logarithms to Point 1:
   - ln(L(f1, t1)) = ln(λ1) + ln(α(t1)) + ln(β(t1))
   - ln(L(f2, t2)) = ln(λ2) + ln(α(t1 + Δt1)) + ln(β(t2))

3. By Lagrange's mean value theorem for ln(α(t)) and the bounds on α(t) and α’(t) from Assumption 5:
   - ln(α(t1 + Δt1))
     = ln(α(t1)) + Δt1 * α’(tp)/α(tp)   [ for some tp between t1 and t1 + Δt1 ]
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * α(t1)/α(tp)
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * (1 + α(t1)/α(tp) - α(t1)/α(t1))
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * (1 + α(t1) * (1/α(tp) - 1/α(t1)))
     By Lagrange's mean value theorem for 1/α(t):
     = ln(α(t1)) + Δt1 * α’(tp)/α(t1) * (1 - α(t1) * α’(tq)/α(tq)² * (tp - t1))   [ for some tq between t1 and tp ]
     = ln(α(t1)) + Δt1 * ε1(t1)/α(t1)
     [ where ε1(t1) =def α’(tp) * (1 - α(t1)*α’(tq)/α(tq)²*(tp - t1)), and thus |ε1(t1)| ≤ AD*(1 + AD*AU/AL²*Δt1) ]

4. Applying Point 3 to Point 2:
   - ln(L(f1, t1)) = ln(λ1) + ln(α(t1)) + ln(β(t1))
   - ln(L(f2, t2))
     = ln(λ2) + ln(α(t1)) + Δt1 * ε1(t1)/α(t1) + ln(β(t2))
     Since Δt1 = L(f1, t1):
     = ln(λ2) + ln(α(t1)) + L(f1, t1) * ε1(t1)/α(t1) + ln(β(t2))
     Using Assumptions 5 and 6 to expand L(f1, t1):
     = ln(λ2) + ln(α(t1)) + (λ1 * α(t1) * β(t1)) * ε1(t1)/α(t1) + ln(β(t2))
     = ln(λ2) + ln(α(t1)) + λ1 * β(t1) * ε1(t1) + ln(β(t2))

5. Subtracting the second equation from the first in Point 4:
   - ln(L(f1, t1)) - ln(L(f2, t2))
     = ln(λ1) + ln(α(t1)) + ln(β(t1)) - ln(λ2) - ln(α(t1)) - λ1 * β(t1) * ε1(t1) - ln(β(t2))
     = ln(λ1/λ2) - λ1 * β(t1) * ε1(t1) + ln(β(t1)) - ln(β(t2))

6. Using Assumptions 5, 6, 8 to expand L(f1, t1) and rewrite the bound on |ε1(t1)|:
   - |ε1(t1)|
     ≤ AD*(1 + AD*AU/AL²*Δt1)
     = AD*(1 + AD*AU/AL²*L(f1, t1))
     = AD*(1 + AD*AU/AL²*(λ1 * α(t1) * β(t1)))
     ≤ AD*(1 + AD*AU/AL²*(λ1 * AU * β(t1)))
     = AD*(1 + λ1*AD*AU²/AL²*β(t1))

7. With `bench_diff`, measurements are done in pairs, with one half of the pairs having f1 followed by f2 and the other half having f2 followed by f1. The equation in Point 5 above pertains to the first case. The analogous equation for the second case is:
   - ln(L(f2, t2’)) - ln(L(f1, t1’)) = ln(λ2/λ1) - λ2 * β(t2’) * ε2(t2’) + ln(β(t1’)) - ln(β(t2’))
   Or, equivalently:
   - ln(L(f1, t1’)) - ln(L(f2, t2’)) = ln(λ1/λ2) + λ2 * β(t2’) * ε2(t2’) - ln(β(t1’)) + ln(β(t2’))
     [ where ε2(t2’) =def α’(tp’)*(1 - α(t2’)*α’(tq’)/α(tq’)²*(tp’ - t2’)), and thus |ε2(t2’)| ≤ AD*(1 + λ2*AD*AU²/AL²*β(t2’)) (see Point 6) ]

8. Assuming the number of latency observations for each function is n and considering the two cases as described in Point 7, we can calculate the sample mean difference between the natural logarithms of the observed latencies (see Definition 7):
   - mean_diff_ln = (1/n) * ∑i=1..n (ln(L(f1, t1,i)) - ln(L(f2, t2,i)))
     = (1/n) * (∑i:odd (ln(L(f1, t1,i)) - ln(L(f2, t2,i))) + ∑i:even (ln(L(f1, t1,i)) - ln(L(f2, t2,i))))
     = (1/n) * ∑i:odd (ln(λ1/λ2) - λ1 * β(t1,i) * ε1(t1,i) + ln(β(t1,i)) - ln(β(t2,i))) +
       (1/n) * ∑i:even (ln(λ1/λ2) + λ2 * β(t2,i) * ε2(t2,i) - ln(β(t1,i)) + ln(β(t2,i)))
     = ln(λ1/λ2) +
       (1/n) * ∑i:odd (-λ1 * β(t1,i) * ε1(t1,i) + ln(β(t1,i)) - ln(β(t2,i))) +
       (1/n) * ∑i:even (λ2 * β(t2,i) * ε2(t2,i) - ln(β(t1,i)) + ln(β(t2,i)))
Expectation of mean_diff_ln

This section calculates a bound on |E(mean_diff_ln) - ln(λ1/λ2)| (see Definition 7).

1. From Expansion Point 8:
   - E(mean_diff_ln) - ln(λ1/λ2)
     Due to the linearity of E() and the fact that E(ln(β(t))) = 0:
     = (1/n) * -λ1 * ∑i:odd E(β(t1,i) * ε1(t1,i)) + (1/n) * λ2 * ∑i:even E(β(t2,i) * ε2(t2,i))

2. From Point 1 and the bounds on ε1(t1,i) and ε2(t2,i) from Expansion Points 6 and 7:
   - |E(mean_diff_ln) - ln(λ1/λ2)|
     ≤ (1/n) * λ1 * ∑i:odd E(β(t1,i) * |ε1(t1,i)|) + (1/n) * λ2 * ∑i:even E(β(t2,i) * |ε2(t2,i)|)
     ≤ (1/n) * λ1 * ∑i:odd E(β(t1,i) * AD*(1 + λ1*AD*AU²/AL²*β(t1,i))) +
       (1/n) * λ2 * ∑i:even E(β(t2,i) * AD*(1 + λ2*AD*AU²/AL²*β(t2,i)))
     = (1/n) * λ1 * ∑i:odd (AD*E(β(t1,i)) + λ1*AD²*AU²/AL²*E(β(t1,i)²)) +
       (1/n) * λ2 * ∑i:even (AD*E(β(t2,i)) + λ2*AD²*AU²/AL²*E(β(t2,i)²))
     By the expectation of log-normal distributions:
     = (1/n) * λ1 * ∑i:odd (AD*exp(σ²/2) + λ1*AD²*AU²/AL²*exp((2σ)²/2)) +
       (1/n) * λ2 * ∑i:even (AD*exp(σ²/2) + λ2*AD²*AU²/AL²*exp((2σ)²/2))
     = (1/n) * λ1 * ∑i:odd (AD*exp(σ²/2) + λ1*AD²*AU²/AL²*exp(2σ²)) +
       (1/n) * λ2 * ∑i:even (AD*exp(σ²/2) + λ2*AD²*AU²/AL²*exp(2σ²))
     = λ1/2 * (AD*exp(σ²/2) + λ1*AD²*AU²/AL²*exp(2σ²)) +
       λ2/2 * (AD*exp(σ²/2) + λ2*AD²*AU²/AL²*exp(2σ²))
     = (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)

3. Define BE as the right-hand side of the immediately above equality. Thus:
   - |E(mean_diff_ln) - ln(λ1/λ2)| ≤ BE.
Mean Absolute Error of mean_diff_ln

This section calculates the expected absolute error of mean_diff_ln (see Definition 7) as an estimator of ln(λ1/λ2).

1. From Expansion Point 8:
   - absolute_error(mean_diff_ln) =def |mean_diff_ln - ln(λ1/λ2)|
     = |(1/n) * ∑i:odd (-λ1 * β(t1,i) * ε1(t1,i) + ln(β(t1,i)) - ln(β(t2,i))) +
        (1/n) * ∑i:even (λ2 * β(t2,i) * ε2(t2,i) - ln(β(t1,i)) + ln(β(t2,i)))|
     = |(1/n) * ∑i:odd -λ1 * β(t1,i) * ε1(t1,i) + (1/n) * ∑i:even λ2 * β(t2,i) * ε2(t2,i) +
        (1/n) * (∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i))))|
     ≤ (1/n) * |∑i:odd -λ1 * β(t1,i) * ε1(t1,i) + ∑i:even λ2 * β(t2,i) * ε2(t2,i)| +
       (1/n) * |∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|

2. From the above point, defining the mean absolute error:
   - mean_absolute_error(mean_diff_ln) =def E(absolute_error(mean_diff_ln))
     ≤ (1/n) * λ1 * ∑i:odd E(β(t1,i) * |ε1(t1,i)|) + (1/n) * λ2 * ∑i:even E(β(t2,i) * |ε2(t2,i)|) +
       (1/n) * E(|∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|)
     By Expectation Point 2, the first line on the right-hand side of the above inequality is bounded by the last line of Expectation Point 2. Thus:
     ≤ (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²) +
       (1/n) * E(|∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|)
     By Expectation Point 3:
     = BE +
       (1/n) * E(|∑i:odd (ln(β(t1,i)) - ln(β(t2,i))) - ∑i:even (ln(β(t1,i)) - ln(β(t2,i)))|)
     On the line immediately above, the term inside the absolute value is the sum of 2*n independently distributed normal random variables with mean 0 and standard deviation σ. It is, therefore, normally distributed with mean 0 and standard deviation σ*√(2*n). From the formula for the expectation of the absolute value of a normal random variable, we get:
     = BE + (1/n) * σ*√(2*n) * √(2/π)
     = BE + 2 * σ / √(n*π)

3. Define BA as the right-hand side of the immediately above equality. Thus:
   - mean_absolute_error(mean_diff_ln) ≤ BA.

4. For a large sample size n, the above upper bound becomes, approximately:
   - BA ≈ BE = (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)
Confidence Interval

1. mean_diff_ln is approximately normal, so its median and mean are approximately the same.

2. The median minimizes the mean absolute deviation. From Point 1 and Mean Absolute Error Point 3:
   - E(|mean_diff_ln - E(mean_diff_ln)|)
     ≈ E(|mean_diff_ln - median(mean_diff_ln)|)
     ≤ mean_absolute_error(mean_diff_ln)
     ≤ BA

3. A normal distribution with a mean absolute deviation from its mean ≤ BA has stdev ≤ BA * √(π/2). Thus, from Points 1 and 2:
   - Stdev(mean_diff_ln) ≤ BA * √(π/2)

4. The 99% confidence interval around the mean for a normal distribution is stdev * 2.58. Thus:
   - |mean_diff_ln - E(mean_diff_ln)| ≤ BA * √(π/2) * 2.58   [ with 99% confidence ]

5. 99% confidence bound for |mean_diff_ln - ln(λ1/λ2)|:
   - |mean_diff_ln - ln(λ1/λ2)|
     ≤ |mean_diff_ln - E(mean_diff_ln)| + |E(mean_diff_ln) - ln(λ1/λ2)|
     From Point 4 and Expectation Point 3:
     ≤ BA * √(π/2) * 2.58 + BE   [ with 99% confidence ]
     ≤ (BE + 2 * σ / √(n*π)) * √(π/2) * 2.58 + BE   [ with 99% confidence ]
     = BE * (√(π/2) * 2.58 + 1) + 2 * σ / √(n*π) * √(π/2) * 2.58   [ with 99% confidence ]
     = BE * (√(π/2) * 2.58 + 1) + √(2/n) * σ * 2.58   [ with 99% confidence ]

6. Define BC as the right-hand side of the immediately above equality. Thus:
   - |mean_diff_ln - ln(λ1/λ2)| ≤ BC   [ with 99% confidence ]

7. For a large sample size n, the above confidence bound becomes, approximately:
   - BC ≈ BE * (√(π/2) * 2.58 + 1)
        = ((λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)) * (√(π/2) * 2.58 + 1)
Reasonable Parameter Values
Assume, without loss of generality, that the lowest value of α(t) is 1 = AL.
In practice, one could reasonably expect the time-dependent random noise parameters to be no greater (and likely substantially lower) than the values below:
- Rate of change of α(t) < 5% per second, which means AD < .00005 when time is measured in milliseconds and AD < .00000005 when time is measured in microseconds.
- AU/AL < 2
- σ < 0.3
From the above, given λ1 and λ2, we can compute BC (assuming a large sample size):
- λ1 = λ2 = 12 ms ⇒ BC < 0.003
- λ1 = λ2 = 120 ms ⇒ BC < 0.028
Notice that while the variability of the statistic mean_diff_ln depends on the sample size (exec_count), the BC upper bound does not, provided that the sample size is large enough (see Confidence Interval Item 7).
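The BC figures above can be checked numerically. The following is a standalone sketch (not part of the library's API; `bc_bound` is a hypothetical helper name) that plugs the "reasonable parameter values" into the large-sample BC formula, with time measured in milliseconds.

```rust
/// Computes the large-sample BC bound from the model:
/// BE = (λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)
/// BC ≈ BE * (√(π/2) * 2.58 + 1)
fn bc_bound(lambda1: f64, lambda2: f64, ad: f64, au: f64, al: f64, sigma: f64) -> f64 {
    use std::f64::consts::PI;
    let be = (lambda1 + lambda2) / 2.0 * ad * (sigma * sigma / 2.0).exp()
        + (lambda1 * lambda1 + lambda2 * lambda2) / 2.0
            * ad * ad * au * au / (al * al)
            * (2.0 * sigma * sigma).exp();
    be * ((PI / 2.0).sqrt() * 2.58 + 1.0)
}

fn main() {
    // 5%/s drift rate => AD = 0.00005 per ms; AU/AL = 2 with AL = 1; σ = 0.3.
    let (ad, au, al, sigma) = (0.00005, 2.0, 1.0, 0.3);
    let bc12 = bc_bound(12.0, 12.0, ad, au, al, sigma);
    let bc120 = bc_bound(120.0, 120.0, ad, au, al, sigma);
    assert!(bc12 < 0.003);
    assert!(bc120 < 0.028);
    println!("BC(λ = 12 ms)  ≈ {bc12:.5}");
    println!("BC(λ = 120 ms) ≈ {bc120:.5}");
}
```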
§Comparative Example
We will define an example of the above model and compare how bench_diff
and the traditional benchmarking method fare with respect to the model. The example is admittedly contrived in order to facilitate approximate calculations and also to highlight the potential disparity of results between the two benchmarking methods.
Model parameters
- The two functions, f1 and f2, are identical (call it f), with λ1 = λ2 = λ = 12 ms.
- The number of executions of each function is exec_count = 2500. So, the total execution time, ignoring warm-up, is 1 minute.
- α(t) = 1.5 + 1/2 * sin(t * 2*π / 60000), where t is the number of milliseconds elapsed since the start of the benchmark.
- α’(t)
  = 1/2 * 2*π / 60000 * cos(t * 2*π / 60000)
  = π / 60000 * cos(t * 2*π / 60000)
  Therefore:
  - |α’(t)| ≤ π / 60000
  And we have the following bounds for α(t):
  - AL = 1
  - AU = 2
  - AD = π / 60000
- β(t) has σ = 0.28.
  - Therefore, E(β(t)) = exp(σ²/2) ≈ 1.0400.
- The baseline median latency of f is the median latency when α(t) = 1. Its value is λ = λ1 = λ2 = 12 ms.
- The baseline mean latency of f is the mean latency when α(t) = 1. Its value is μ = μ1 = μ2 = λ * exp(σ²/2) ≈ 12 * 1.0400 = 12.48 ms.
- The ratio of baseline medians is always equal to the ratio of baseline means, even when f1 ≠ f2, because σ does not depend on f1 or f2.
`bench_diff` calculations

- Given exec_count = 2500, the large sample size approximation for the BC bound can be used:
  - BC ≈ ((λ1+λ2)/2 * AD*exp(σ²/2) + (λ1²+λ2²)/2 * AD²*AU²/AL²*exp(2σ²)) * (√(π/2) * 2.58 + 1)
       ≈ 0.00277
- From the model (Confidence Interval Point 7):
  - |mean_diff_ln - ln(λ1/λ2)| ≤ 0.00277   [ with 99% confidence ]
- So, the multiplicative error on the estimate of λ1/λ2 (= λ/λ = 1 in our example) is at most, with 99% confidence:
  - exp(0.00277) ≈ 1.00278, i.e., less than 3/10 of 1% error.
- Recall that the error bound does not depend on the number of executions, so it is the same with only half the number of executions. Also, given the high exec_count assumed, the `bench_diff` results during the first half should be very close to those obtained during the second half.
- If instead of λ = λ1 = λ2 = 12 ms and exec_count = 2500 we assume λ = λ1 = λ2 = 120 ms and exec_count = 250, then, with 99% confidence:
  - |mean_diff_ln - ln(λ1/λ2)| ≤ 0.0284
  - Multiplicative error ≤ 1.0289, i.e., about 3% error.
Traditional method calculations

- With the traditional method, we benchmark f1 with exec_count = 2500 and then benchmark f2 (which is the same as f1 in our example) with exec_count = 2500.
- The first benchmark of f takes place during the first 30 seconds. The calculated sample mean latency is approximately:
  - μ * 1/30000 * ∫[0, 30000] α(t) dt
    = μ/30000 * (45000 + 1/2 * (-cos(30000 * 2*π / 60000) + cos(0)) / (2*π / 60000))
    = 1.5*μ + μ/30000 * 1/2 * (-cos(π) + cos(0)) / (2*π / 60000)
    = 1.5*μ + μ * (-cos(π) + cos(0)) / (2*π)
    = 1.5*μ + μ * (1 + 1) / (2*π)
    = μ * (1.5 + 1/π)
    ≈ μ * 1.8183
- The second benchmark of f takes place during the second 30 seconds. The calculated sample mean latency is approximately:
  - μ * 1/30000 * ∫[30000, 60000] α(t) dt
    = μ/30000 * (45000 + 1/2 * (-cos(60000 * 2*π / 60000) + cos(30000 * 2*π / 60000)) / (2*π / 60000))
    = 1.5*μ + μ/30000 * 1/2 * (-cos(2*π) + cos(π)) / (2*π / 60000)
    = 1.5*μ + μ * (-cos(2*π) + cos(π)) / (2*π)
    = 1.5*μ + μ * (-1 + -1) / (2*π)
    = μ * (1.5 - 1/π)
    ≈ μ * 1.1817
- The estimated ratio of mean latencies is approximately 1.8183 / 1.1817 ≈ 1.5387, an estimating error of 54% in comparison with the baseline ratio of means μ1/μ2 = 1 (which is also equal to the baseline ratio of medians λ1/λ2).
- If instead of λ = λ1 = λ2 = 12 ms and exec_count = 2500 we assume λ = λ1 = λ2 = 120 ms and exec_count = 250, then the results are roughly the same, i.e., an estimating error of 54%.
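The two half-window averages of α(t) above can be verified numerically. This is a standalone sketch (the `avg` midpoint-rule integrator is an illustrative helper, not library code):

```rust
fn main() {
    use std::f64::consts::PI;

    // The example's noise factor: α(t) = 1.5 + 1/2 * sin(2πt / 60000), t in ms.
    let alpha = |t_ms: f64| 1.5 + 0.5 * (t_ms * 2.0 * PI / 60_000.0).sin();

    // Midpoint-rule numerical integration of α over [a, b], divided by (b - a).
    let avg = |a: f64, b: f64| {
        let n = 100_000;
        let h = (b - a) / n as f64;
        (0..n).map(|i| alpha(a + (i as f64 + 0.5) * h)).sum::<f64>() * h / (b - a)
    };

    let first_half = avg(0.0, 30_000.0);       // ≈ 1.5 + 1/π ≈ 1.8183
    let second_half = avg(30_000.0, 60_000.0); // ≈ 1.5 - 1/π ≈ 1.1817
    let ratio = first_half / second_half;      // ≈ 1.5387, a ~54% inflation

    assert!((first_half - (1.5 + 1.0 / PI)).abs() < 1e-4);
    assert!((second_half - (1.5 - 1.0 / PI)).abs() < 1e-4);
    assert!((ratio - 1.5387).abs() < 1e-3);
    println!(
        "first half ≈ {first_half:.4}, second half ≈ {second_half:.4}, ratio ≈ {ratio:.4}"
    );
}
```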
Closing comments for the example

- In this example, the estimate of λ1/λ2 (or equivalently μ1/μ2) provided by `bench_diff` is accurate to within 3/10 of 1% of the true value of 1. By contrast, the estimate of μ1/μ2 provided by the traditional method is inflated by 54%.
- If we “run” `bench_diff` during the model's first 30 seconds, we still get a very accurate estimate of λ1/λ2, but the individual (sample mean) estimates of μ1 and μ2 are both close to μ * 1.8183, just as with the traditional method. Likewise, if we “run” `bench_diff` during the model's last 30 seconds, we still get a very accurate estimate of λ1/λ2, but the individual (sample mean) estimates of μ1 and μ2 are both close to μ * 1.1817, just as with the traditional method.
- The key point of `bench_diff` is to repeatedly run both functions in close time proximity to each other, so that the ratios of the two functions' latencies are close to the baseline even if the individual latencies themselves are distorted by time-dependent noise.
- If instead of λ = λ1 = λ2 = 12 ms and exec_count = 2500 we assume λ = λ1 = λ2 = 120 ms and exec_count = 250, then the accuracy of the `bench_diff` results is worse, but it is still much better (at around a 3% error) than the traditional result (which remains about the same at around a 54% error).
§Limitations
This library works well for latencies at the microseconds or milliseconds order of magnitude, but not for latencies at the nanoseconds order of magnitude.
Modules§

- `statistics` (Deprecated) – Structs and enums for confidence intervals and hypothesis tests.
- `stats_types` – Structs and enums for confidence intervals and hypothesis tests.

Structs§

- `DiffOut` – Contains the data resulting from a benchmark comparing two closures `f1` and `f2`.
- `SummaryStats` – Common summary statistics useful in latency testing/benchmarking.

Enums§

- `LatencyUnit` – Unit of time used to record latencies. Used as an argument in benchmarking functions.

Functions§

- `bench_diff` – Compares latencies for two closures `f1` and `f2`.
- `bench_diff_with_status` – Compares latencies for two closures `f1` and `f2` and outputs information about the benchmark and its execution status. Execution status is output to `stderr`.
- `bench_diff_x` – Compares latencies for two closures `f1` and `f2` and optionally outputs information about the benchmark and its execution status.
- `get_warmup_millis` – The currently defined number of milliseconds used to “warm up” the benchmark. The default is 3,000 ms.
- `set_warmup_millis` – Changes the number of milliseconds used to “warm up” the benchmark. The default is 3,000 ms.