piano

Automatic instrumentation-based profiler for Rust. Measures self-time, call counts, and heap allocations per function across sync, threaded, and async code.

Given this program:

#[tokio::main]
async fn main() {
    let config = load_config("settings.toml");
    let data = fetch_all(&config.urls).await;
    let report = analyze(data);
    write_output(report);
}

piano reports:

Function                                       Self    Calls   Allocs  Alloc Bytes    Frees   Free Bytes
--------------------------------------------------------------------------------------------------------
fetch_all                                   341.21ms        1      840       62.5KB      780       58.1KB
analyze                                     141.13ms        1      329       24.1KB      310       22.8KB
load_config                                  12.44ms        1       52        4.1KB       48        3.9KB
write_output                                  3.02ms        1       12        1.2KB        8        0.9KB

Self-time is time spent in the function itself, excluding called functions. fetch_all took 341ms of its own work. Alloc and free columns track heap operations per function. Free columns appear when the program frees memory during profiling. The .await points and thread migrations are tracked automatically. Programs using threads (rayon, std::thread) and async runtimes (tokio, async-std) work automatically.

Output shows the top 10 functions by self-time. Use --all to see every instrumented function, or --top N to adjust the limit.

Install

cargo install piano

Requires Rust 1.88+.

Usage

Piano instruments binary crates only. Your project must have a src/main.rs or a [[bin]] target; library crates cannot be profiled directly.

Three steps: instrument, run, report.

$ piano build                                # instrument all functions and compile
$ piano run                                  # execute the instrumented binary
$ piano report                               # display results

If your program takes arguments, pass them after --:

$ piano run -- --input data.csv --verbose

piano profile chains all three steps into one command:

$ piano profile                              # build + run + report
$ piano profile -- --input data.csv          # with args passed to your binary

Narrow instrumentation to specific functions, files, or modules:

$ piano build --fn parse                     # functions matching "parse"
$ piano build --fn parse --exact             # only functions named exactly "parse"
$ piano build --fn "Parser::parse"           # specific impl method
$ piano build --file src/lexer.rs            # all functions in a file
$ piano build --mod resolver                 # all functions in a module

These flags work with piano profile too: piano profile --fn parse -- --input data.csv.

Add --cpu-time to measure per-thread CPU time alongside wall time (Linux + macOS, 64-bit only).

Timed profiling

Run the profiled program for a fixed duration, then stop and report:

$ piano profile --duration 10                   # stop after 10 seconds
$ piano profile --duration 2.5                  # fractional seconds supported

Structured output

$ piano profile --json                          # JSON to stdout (pipeable)
$ piano report --json                           # same for report
$ piano diff baseline current --json            # same for diff

Multi-binary projects

For workspace crates with multiple binaries:

$ piano profile --bin my-server                 # profile a specific binary

Piano discovers binaries from [[bin]] entries and from src/bin/ automatically.

Per-thread breakdown

$ piano report --threads                        # show per-thread timing

Raw timing data

By default, piano subtracts its own measurement overhead from reported times (about 8ns per call on x86-64). To see the uncorrected measured values:

$ piano report --uncorrected

Compare runs

Tag a run, make changes, profile again, diff:

$ piano tag baseline
# ... make changes ...
$ piano profile
$ piano tag current
$ piano diff baseline current
Function                                   baseline    current      Delta     Allocs    A.Delta
----------------------------------------------------------------------------------------------
parse                                      341.21ms    198.44ms   -142.77ms        640       -200
parse_item                                  94.58ms     88.12ms     -6.46ms       1240       -380
resolve                                    141.13ms    141.09ms     -0.04ms        329         +0

How it works

Runs cargo build --release with a compiler wrapper that instruments each crate during compilation. Your source is never modified; instrumented code goes to temporary files that are cleaned up automatically.
Injects piano-runtime via --extern flag (no changes to your Cargo.toml or lockfile)
Parses source with ra_ap_syntax and injects timing guards into matched functions (including impl Future functions, trait implementations, and functions generated by macro_rules! via compile-time expansion)
Wraps your global allocator for heap tracking. If the allocator is behind #[cfg(...)] (e.g., tikv-jemallocator on Linux only), piano wraps it under the same cfg gate and injects a fallback wrapper for all other platforms automatically.
Builds to target/piano/. The path is stable across runs, so cargo's incremental cache applies; dependencies compile once and only instrumented source files recompile.

The runtime has zero external dependencies to avoid conflicts with your project.

When to use something else

piano adds ~29ns per instrumented function call (measured on AMD Ryzen 7 3700X at 4.2 GHz). You can narrow instrumentation with --fn, --file, or --mod to reduce this. For programs with millions of short function calls, a sampling profiler like perf or cargo-flamegraph will add less overhead. piano is a better fit when you want exact call counts, self-time breakdowns, or allocation tracking without configuring OS-level tooling.

Limitations

Measurement bias is calibrated at startup (two rdtsc reads, ~8ns on x86-64). After bias correction, residual error is under 2ns. Functions faster than the residual floor will show inflated times. Use --uncorrected to see raw measured values.
const fn and extern fn are skipped (use piano build --list-skipped to see them)
Allocation tracking covers heap operations only (stack allocations are not tracked)
Binary crates only (no libraries)

License

MIT

piano 0.14.2