Crate latency_trace


§latency_trace

This library supports latency measurement for functions and code blocks, both sync and async.

Given code instrumented with the Rust tracing library, this library uses histograms from the hdrhistogram library to capture span latency timings.

Latencies are collected in microseconds for all spans with level trace or higher.

§Design goals

This framework should:

  • Be easy to use. It should only require a handful of lines of code to get latency metrics for code instrumented with the Rust tracing library.
  • Be self-contained, i.e., should not depend on the use of external tools like OpenTelemetry collectors, Jaeger, Grafana, etc.
  • Support both sync and async code.
  • Have low overhead, i.e., the latency associated with the collection of latency information should be low.

§Core concepts

This library collects latency information for spans. Span definitions are created in the code using macros and functions from the Rust tracing library which define span callsites, i.e., the places in the code where spans are defined. As the code is executed, a span definition in the code may be executed multiple times – each such execution is a span instance. Span instances arising from the same span definition are grouped into SpanGroups for latency information collection. Latencies are collected using Histograms from the hdrhistogram library.

The grouping of spans for latency collection is not exactly based on the span definitions in the code. Spans at runtime are structured as a set of span trees that correspond to the nesting of spans from code execution paths. The grouping of runtime spans for latency collection should respect the runtime parent-child relationships among spans.

Thus, SpanGroups form a forest of trees where some pairs of span groups have a parent-child relationship, corresponding to the parent-child relationships of the spans associated with the span groups. This means that if SpanGroup A is the parent of SpanGroup B then, for each span that was assigned to group B, its parent span was assigned to group A.

The coarsest-grained grouping of spans is characterized by a callsite path – a callsite and the (possibly empty) list of its ancestor callsites based on the different runtime execution paths (see span relationships). This is the default SpanGroup definition. Finer-grained groupings of spans can differentiate groups of spans with the same callsite path by taking into account values computed at runtime from the spans’ runtime Attributes.

While the granularity of latency information collection cannot be finer than a SpanGroup, the collected latency information can be subsequently aggregated further by grouping SpanGroups as needed (see Timings::aggregate.)
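As an illustration of such post-hoc aggregation, the sketch below collapses all span groups that share the same span name. This is a hypothetical sketch: it assumes Timings::aggregate takes a key-extraction closure over SpanGroups and that SpanGroup exposes a name() accessor; check the API docs for the exact signature.

```rust
use latency_trace::LatencyTrace;

fn main() {
    let latencies = LatencyTrace::activated_default()
        .unwrap()
        .measure_latencies(|| { /* instrumented code */ });

    // Hypothetical: merge the histograms of span groups with the same span
    // name into one Timing per name, regardless of callsite path.
    let by_name = latencies.aggregate(|span_group| span_group.name().to_string());
    for (name, timing) in by_name {
        // `Timing` is an alias of `hdrhistogram::Histogram<u64>`, which
        // provides `mean()` among other statistics.
        println!("{name}: mean = {:.1} µs", timing.mean());
    }
}
```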

§Key design choices

This framework uses hdrhistogram::Histogram to collect latency information as it provides an efficient data structure for high-fidelity data collection across wide latency value ranges.

Two other design choices support the low overhead goal.

  • The tracing library’s Registry is used to store temporary timing data at runtime. As noted in the documentation, “This registry is implemented using a lock-free sharded slab, and is highly optimized for concurrent access.”
  • Runtime data collection takes place independently on each thread, overwhelmingly without the need for inter-thread coordination. The only inter-thread coordination involves two mutex lock requests per thread (one at the beginning of data collection for the thread and one at the end) for the entire duration of the measurement, regardless of the number of spans executed. After the test execution has completed, information is post-processed for presentation purposes, with zero impact on the latency measurements. The thread_local_collect library is used to support this design approach.

§Usage modes

This framework is used to measure latencies for a sync or async function f that takes no arguments and contains code instrumented with the tracing framework. Any code to be measured can be wrapped by such a function.

The core LatencyTrace struct can be used directly as a Subscriber or indirectly as a Layer. The former usage is most straightforward, while the latter allows LatencyTrace to be combined with other tracing_subscriber::Layers to form a layered subscriber in more elaborate tracing scenarios. Using LatencyTrace as a tracing_subscriber::Layer also allows filtering to be added, regardless of whether it is combined with other layers.

The following modes of latency information reporting are supported:

  • Direct – Executes the instrumented function f and, after f terminates, returns with the reporting of latencies.
  • Probed – Executes the instrumented function f and immediately returns an object that can be used to report partial latency information while f is still running. The returned object also provides a method that waits until f terminates and reports the final latency information.

§Async runtimes

This framework supports tokio out-of-the-box (see LatencyTrace::measure_latencies_tokio and LatencyTrace::measure_latencies_probed_tokio) but other async runtimes can be used as well by simply wrapping the async code with the chosen async runtime and using one of the sync methods (LatencyTrace::measure_latencies or LatencyTrace::measure_latencies_probed). The source code for the above-mentioned tokio variants shows exactly how to do it.
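As a sketch of that wrapping approach (a minimal illustration, not taken from the crate's source), the async workload is driven to completion inside the sync closure passed to measure_latencies, here using tokio's standard Runtime API:

```rust
use latency_trace::{summary_stats, LatencyTrace};

async fn f() {
    // ... async code instrumented with `tracing` ...
}

fn main() {
    let latencies = LatencyTrace::activated_default()
        .unwrap()
        // Create the runtime inside the measured closure and block on `f`,
        // so the sync measurement method covers the whole async workload.
        .measure_latencies(|| {
            tokio::runtime::Runtime::new()
                .expect("runtime creation failed")
                .block_on(f());
        });

    println!("{:?}", latencies.map_values(summary_stats));
}
```

The same pattern works for any other async runtime that offers a blocking entry point.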

§Rust version requirements

This version of this library can be compiled with rustc 1.79.0 or higher. It may work with earlier rustc versions but that is not guaranteed.

§Default cargo feature

To include this library as a dependency without optional features in your Cargo.toml:

[dependencies]
latency_trace = "0"

§Optional cargo features

The optional feature flag “tokio” enables the tokio-related methods of LatencyTrace.

[dependencies]
latency_trace = { version = "0", features = ["tokio"] }

To run the doc_async_* examples from the latency_trace crate (see source repo), specify --features tokio or --all-features when invoking cargo run. For example, to run doc_async_probed.rs, do as follows:

cargo run --features tokio --example doc_async_probed

or

cargo run --all-features --example doc_async_probed

§Usage examples

§Simple sync Subscriber example

use latency_trace::{summary_stats, LatencyTrace};
use std::{thread, time::Duration};
use tracing::{info, instrument, trace, trace_span};

#[instrument(level = "info")]
fn f() {
    trace!("in f");
    for _ in 0..10 {
        trace_span!("loop_body").in_scope(|| {
            info!("in loop body");
            // Simulated work
            thread::sleep(Duration::from_micros(1200));

            g();
        });
    }
}

#[instrument(level = "info")]
fn g() {
    trace!("in g");
    // Simulated work
    thread::sleep(Duration::from_micros(800));
}

fn main() {
    let latencies = LatencyTrace::activated_default()
        .unwrap()
        .measure_latencies(f);

    println!("\nLatency stats below are in microseconds");
    for (span_group, stats) in latencies.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }

    // A shorter way to print the summary stats, with uglier formatting.
    println!("\nDebug print of `latencies.map_values(summary_stats)`:");
    println!("{:?}", latencies.map_values(summary_stats));
}

§Simple async Subscriber example

//! Requires feature flag "tokio".

use latency_trace::{summary_stats, LatencyTrace};
use std::time::Duration;
use tracing::{instrument, trace_span, Instrument};

#[instrument(level = "trace")]
async fn f() {
    for _ in 0..1000 {
        async {
            // Simulated work
            tokio::time::sleep(Duration::from_micros(1200)).await;

            g().await;
        }
        .instrument(trace_span!("loop_body"))
        .await
    }
}

#[instrument(level = "trace")]
async fn g() {
    // Simulated work
    tokio::time::sleep(Duration::from_micros(800)).await;
}

fn main() {
    let latencies = LatencyTrace::activated_default()
        .unwrap()
        .measure_latencies_tokio(f);

    println!("\nLatency stats below are in microseconds");
    for (span_group, stats) in latencies.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }

    // A shorter way to print the summary stats, with uglier formatting.
    println!("\nDebug print of `latencies.map_values(summary_stats)`:");
    println!("{:?}", latencies.map_values(summary_stats));
}

§Simple sync probed Subscriber example

use latency_trace::{summary_stats, LatencyTrace};
use std::{thread, time::Duration};
use tracing::{info, instrument, trace, trace_span};

#[instrument(level = "info")]
fn f() {
    trace!("in f");
    for _ in 0..10 {
        trace_span!("loop_body").in_scope(|| {
            info!("in loop body");
            // Simulated work
            thread::sleep(Duration::from_micros(1200));

            g();
        });
    }
}

#[instrument(level = "info")]
fn g() {
    trace!("in g");
    // Simulated work
    thread::sleep(Duration::from_micros(800));
}

fn main() {
    let probed = LatencyTrace::activated_default()
        .unwrap()
        .measure_latencies_probed(f)
        .unwrap();

    // Let the function run for some time before probing latencies.
    thread::sleep(Duration::from_micros(16000));

    let latencies1 = probed.probe_latencies();
    let latencies2 = probed.wait_and_report();

    println!("\nlatencies1 in microseconds");
    for (span_group, stats) in latencies1.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }

    println!("\nlatencies2 in microseconds");
    for (span_group, stats) in latencies2.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }
}

§Simple async probed Subscriber example

//! Requires feature flag "tokio".

use latency_trace::{summary_stats, LatencyTrace};
use std::{thread, time::Duration};
use tracing::{instrument, trace_span, Instrument};

#[instrument(level = "trace")]
async fn f() {
    for _ in 0..1000 {
        async {
            // Simulated work
            tokio::time::sleep(Duration::from_micros(1200)).await;

            g().await;
        }
        .instrument(trace_span!("loop_body"))
        .await
    }
}

#[instrument(level = "trace")]
async fn g() {
    // Simulated work
    tokio::time::sleep(Duration::from_micros(800)).await;
}

fn main() {
    let probed = LatencyTrace::activated_default()
        .unwrap()
        .measure_latencies_probed_tokio(f)
        .unwrap();

    // Let the function run for some time before probing latencies.
    thread::sleep(Duration::from_micros(48000));

    let latencies1 = probed.probe_latencies();
    let latencies2 = probed.wait_and_report();

    println!("\nlatencies1 in microseconds");
    for (span_group, stats) in latencies1.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }

    println!("\nlatencies2 in microseconds");
    for (span_group, stats) in latencies2.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }
}

§Simple sync Layer example

//! Demonstrates how to use [`LatencyTrace`] as a [`Layer`](tracing_subscriber::layer::Layer)
//! that can be composed with other layers to make a layered [`Subscriber`](tracing::Subscriber).
//!
//! Compare with examples `doc_sync` and `doc_sync_fmt` in the GitHub repo.

use latency_trace::{summary_stats, LatencyTrace};
use std::{thread, time::Duration};
use tracing::{info, instrument, level_filters::LevelFilter, trace, trace_span};
use tracing_subscriber::{fmt::format::FmtSpan, prelude::*, Registry};

#[instrument(level = "info")]
fn f() {
    trace!("in f");
    for _ in 0..10 {
        trace_span!("loop_body").in_scope(|| {
            info!("in loop body");
            // Simulated work
            thread::sleep(Duration::from_micros(1200));

            g();
        });
    }
}

#[instrument(level = "info")]
fn g() {
    trace!("in g");
    // Simulated work
    thread::sleep(Duration::from_micros(800));
}

fn main() {
    // LatencyTrace instance from which latency statistics will be extracted later.
    let lt = LatencyTrace::default();

    // Clone of the above that will be used as a `tracing_subscriber::layer::Layer` that can be
    // composed with other tracing layers. Add a filter so that only spans with level `INFO` or
    // higher priority (lower level) are aggregated.
    let ltl = lt.clone().with_filter(LevelFilter::INFO);

    // `tracing_subscriber::fmt::Layer` that can be composed with the above `LatencyTrace` layer.
    // Spans with level `TRACE` or higher priority (lower level) are displayed.
    let tfmt = tracing_subscriber::fmt::layer()
        .with_span_events(FmtSpan::FULL)
        .with_filter(LevelFilter::TRACE);

    // Instantiate a layered subscriber and set it as the global default.
    let layered = Registry::default().with(ltl).with(tfmt);
    layered.init();

    // Measure latencies.
    let latencies = lt.measure_latencies(f);

    println!("\nLatency stats below are in microseconds");
    for (span_group, stats) in latencies.map_values(summary_stats) {
        println!("  * {:?}, {:?}", span_group, stats);
    }

    // A shorter way to print the summary stats, with uglier formatting.
    println!("\nDebug print of `latencies.map_values(summary_stats)`:");
    println!("{:?}", latencies.map_values(summary_stats));
}

§Other Layer examples

The above Subscriber examples can be easily modified based on the above Layer example to create analogous layered examples. See also test_layer_sync.rs and test_layer_async.rs in the GitHub repo under the dev_support/tests directory.

§Benchmarks

Execution on my laptop of the benchmarks defined in the source repo indicates that there can be significant variability in latency tracing overhead measurements. The variability across runs was sometimes so large that nonsensical negative overheads were observed (using both the Divan and Criterion benchmarking frameworks), i.e., the median total latency of an instrumented function was lower than that of the uninstrumented version of the same function.

To more reliably compare latency differences between two functions, given the above-mentioned variability, I created a utility that separates positive and negative differences so that nonsensical results can be easily excluded. See bench_diff.rs.

Running the benchmark bench_overhead_simple_real_sync with a target function (simple_real_sync) that does heavy in-memory computation, the overhead per span varied from less than 1 µs up to 3 µs, depending on the intensity of the computation. The high end of that overhead range was observed with computations that took 800 µs to 1,200 µs per span, in a multi-threaded scenario with 6 threads and 100 iterations (instantiating 2 spans each) per thread. Thus, the overhead observed in these benchmarks was less than 3/800 (< 1%) on a per-span latency basis.

Where the target function had a thread::sleep() in lieu of real computation, higher overheads were observed, especially as the sleep time increased. It appears that thread context switches have a significant adverse impact on overheads, possibly due to the invalidation of processor caches.

Structs§

LatencyTrace
Core type supporting latency measurements.
LatencyTraceCfg
Configuration information for LatencyTrace. It is instantiated with its LatencyTraceCfg::default method and can be customized with its other methods.
ProbedTrace
Represents an ongoing collection of latency information with the ability to report on partial latencies before the instrumented function completes.
SpanGroup
Represents a set of tracing::Spans for which latency information should be collected as a group. It is the unit of latency information collection.
SummaryStats
Common summary statistics useful in latency testing/benchmarking.
Wrapper
Generic wrapper to facilitate the addition of new methods to the wrapped type.

Enums§

ActivationError
Error returned by LatencyTrace activation methods.

Functions§

default_span_grouper
Default span grouper. Used to group spans by callsite and ancestors, ignoring any span attributes.
group_by_all_fields
Custom span grouper used to group spans by callsite, ancestors, and all span fields and their values.
group_by_given_fields
Custom span grouper used to group spans by callsite, ancestors, and a given list of span field names. Spans with the same callsites, ancestors, and values for the specified fields are grouped together.
summary_stats
Computes a SummaryStats from a Timing.

Type Aliases§

Timing
Alias of Histogram<u64>.
Timings
Mapping of SpanGroups to the Timing information recorded for them; inherits all BTreeMap methods.
TimingsView
Wrapper of BTreeMap<K, Timing>; inherits all BTreeMap methods.