GPU profiling and benchmarking utilities for iro-cuda-ffi.
This crate provides tools for measuring GPU kernel performance with minimal overhead and comprehensive statistical analysis.
§Quick Start
use iro_cuda_ffi::prelude::*;
use iro_cuda_ffi_profile::prelude::*;
// One-shot timing
let ms = stream.timed_ms(|| {
    my_kernel(&stream, ...)?;
    Ok(())
})?;
// Reusable timer for hot loops
let timer = GpuTimer::new()?;
for _ in 0..100 {
    timer.start(&stream)?;
    my_kernel(&stream, ...)?;
    let ms = timer.stop_sync(&stream)?;
}
// Full benchmark with statistics
let result = Benchmark::new("my_kernel", &stream)
    .warmup(10)
    .iterations(100)
    .memory(MemoryAccess::f32(n, 3))
    .run(|s| my_kernel(s, ...))?;
println!("{}", result);
§Features
- GpuTimer: Reusable event pair for low-overhead timing in loops
- StreamTimingExt: Convenience extension for one-shot timing
- Benchmark: Full benchmark harness with warmup and iterations
- Stats: Comprehensive statistics including percentiles and outlier detection
- Report: Formatted output for benchmark results
§When to Use What
| Scenario | Tool |
|---|---|
| Quick one-off timing | stream.timed_ms() |
| Timing in a hot loop | GpuTimer |
| Full benchmark with stats | Benchmark::new().run() |
| Comparing two implementations | Comparison |
§Statistical Analysis
The Stats type provides:
- Basic statistics: min, max, mean, median, standard deviation
- Percentiles: P1, P5, P25, P50, P75, P95, P99
- Outlier detection using the IQR method
- Coefficient of variation for comparing variability
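As a rough illustration of what these quantities mean, the following is a minimal, standalone sketch and not the crate's implementation. The `Summary` struct and the `percentile` and `summarize` functions are hypothetical names. The sketch assumes linearly interpolated percentiles, a population standard deviation, and the common 1.5 × IQR fences for outliers; the actual Stats type may make different choices.

```rust
/// Hypothetical summary of timing samples in milliseconds (not the crate's `Stats`).
#[derive(Debug)]
struct Summary {
    min: f64,
    max: f64,
    mean: f64,
    median: f64,
    std_dev: f64,
    /// Coefficient of variation: std_dev / mean (unitless).
    cv: f64,
    /// Number of samples outside the 1.5 * IQR fences.
    outliers: usize,
}

/// Percentile `p` (0..=100) by linear interpolation between neighbouring ranks.
/// `sorted` must be non-empty and in ascending order.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = p / 100.0 * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo as f64)
}

/// Summarize a set of samples; returns `None` for an empty slice.
fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let n = sorted.len() as f64;
    let mean = sorted.iter().sum::<f64>() / n;
    // Population variance (divides by n, not n - 1); an assumption of this sketch.
    let variance = sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();

    // IQR method: flag samples below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
    let q1 = percentile(&sorted, 25.0);
    let q3 = percentile(&sorted, 75.0);
    let iqr = q3 - q1;
    let (low_fence, high_fence) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    let outliers = sorted
        .iter()
        .filter(|&&x| x < low_fence || x > high_fence)
        .count();

    Some(Summary {
        min: sorted[0],
        max: sorted[sorted.len() - 1],
        mean,
        median: percentile(&sorted, 50.0),
        std_dev,
        cv: std_dev / mean, // assumes a non-zero mean, as with positive timings
        outliers,
    })
}

fn main() {
    // Five similar timings plus one slow run that the IQR fences should flag.
    let samples = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0];
    let s = summarize(&samples).expect("samples are non-empty");
    println!("min={} max={} mean={:.3} median={:.3}", s.min, s.max, s.mean, s.median);
    println!("std_dev={:.3} cv={:.3} outliers={}", s.std_dev, s.cv, s.outliers);
}
```

The coefficient of variation is unitless, which is why it is useful for comparing the variability of kernels whose absolute run times differ widely.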
§Throughput Calculation
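Throughput is generally derived by dividing the work a kernel performs by its measured time. The following is a minimal sketch of that arithmetic, not the crate's implementation. The helpers `f32_bytes`, `gb_per_s`, and `gflop_per_s` are hypothetical, and the sketch assumes that MemoryAccess::f32(n, 3) means n elements × 3 accesses × 4 bytes, that one FMA counts as 2 floating-point operations, and that 1 GB = 10^9 bytes.

```rust
/// Bytes moved when `n` f32 elements are each accessed `accesses` times.
/// Assumption: 4 bytes per element, every access goes to memory.
fn f32_bytes(n: usize, accesses: usize) -> u64 {
    n as u64 * accesses as u64 * 4
}

/// Memory throughput in GB/s (1 GB = 1e9 bytes) for a time in milliseconds.
fn gb_per_s(bytes: u64, ms: f64) -> f64 {
    bytes as f64 / (ms * 1e-3) / 1e9
}

/// Compute throughput in GFLOP/s for a time in milliseconds.
fn gflop_per_s(flops: u64, ms: f64) -> f64 {
    flops as f64 / (ms * 1e-3) / 1e9
}

fn main() {
    // Vector add over 2^24 elements: read a, read b, write c.
    let n: usize = 1 << 24;
    let bytes = f32_bytes(n, 3);
    println!("{} bytes in 0.5 ms: {:.2} GB/s", bytes, gb_per_s(bytes, 0.5));

    // FMA chain: n elements, 64 FMAs each, 2 FLOPs per FMA (assumed).
    let flops = n as u64 * 64 * 2;
    println!("{} FLOPs in 0.5 ms: {:.2} GFLOP/s", flops, gflop_per_s(flops, 0.5));
}
```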
For memory-bound kernels:
let result = Benchmark::new("vector_add", &stream)
    .memory(MemoryAccess::f32(n, 3)) // read a, read b, write c
    .run(|s| vector_add(s, &a, &b, &mut c))?;
println!("Throughput: {:.2} GB/s", result.throughput_gbs().unwrap());
For compute-bound kernels:
let result = Benchmark::new("fma_chain", &stream)
    .compute(ComputeIntensity::fma(n, iters))
    .run(|s| fma_chain(s, ...))?;
println!("Compute: {:.2} GFLOP/s", result.throughput_gflops().unwrap());
§Re-exports
pub use bench::bench;
pub use bench::bench_memory;
pub use bench::BenchConfig;
pub use bench::BenchResult;
pub use bench::Benchmark;
pub use bench::ComputeIntensity;
pub use bench::MemoryAccess;
pub use report::format_bytes;
pub use report::format_count;
pub use report::format_gbs;
pub use report::format_gflops;
pub use report::format_ms;
pub use report::Comparison;
pub use report::Report;
pub use report::print_stats;
pub use stats::Stats;
pub use timer::GpuTimer;
pub use timer::StreamTimingExt;
pub use timer::TimingSamples;