Crate iro_cuda_ffi_profile

GPU profiling and benchmarking utilities for iro-cuda-ffi.

This crate provides tools for measuring GPU kernel performance with minimal overhead and comprehensive statistical analysis.

§Quick Start

use iro_cuda_ffi::prelude::*;
use iro_cuda_ffi_profile::prelude::*;

// One-shot timing
let ms = stream.timed_ms(|| {
    my_kernel(&stream, ...)?;
    Ok(())
})?;

// Reusable timer for hot loops
let timer = GpuTimer::new()?;
for _ in 0..100 {
    timer.start(&stream)?;
    my_kernel(&stream, ...)?;
    let ms = timer.stop_sync(&stream)?;
}

// Full benchmark with statistics
let result = Benchmark::new("my_kernel", &stream)
    .warmup(10)
    .iterations(100)
    .memory(MemoryAccess::f32(n, 3))
    .run(|s| my_kernel(s, ...))?;

println!("{}", result);

§Features

  • GpuTimer: Reusable event pair for low-overhead timing in loops
  • StreamTimingExt: Convenience extension for one-shot timing
  • Benchmark: Full benchmark harness with warmup and iterations
  • Stats: Comprehensive statistics including percentiles and outlier detection
  • Report: Formatted output for benchmark results

§When to Use What

Scenario                         Tool
Quick one-off timing             stream.timed_ms()
Timing in a hot loop             GpuTimer
Full benchmark with stats        Benchmark::new().run()
Comparing two implementations    Comparison

§Statistical Analysis

The Stats type provides:

  • Basic statistics: min, max, mean, median, standard deviation
  • Percentiles: P1, P5, P25, P50, P75, P95, P99
  • Outlier detection using the IQR method
  • Coefficient of variation (standard deviation divided by mean) for comparing variability across runs
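
To make the quantities above concrete, here is a minimal, self-contained sketch of how they could be computed from a slice of timing samples in milliseconds. It is not the crate's `Stats` implementation: the `Summary` type, the `summarize` and `percentile` functions, the linear-interpolation percentile rule, the sample (n - 1) standard deviation, and the conventional 1.5 × IQR outlier fences are all assumptions chosen for illustration.

```rust
/// Hypothetical summary type for illustration; not the crate's `Stats`.
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub mean: f64,
    pub std_dev: f64,
    pub median: f64,
    pub q1: f64,
    pub q3: f64,
    /// Coefficient of variation: std_dev / mean.
    pub cv: f64,
    /// Number of samples outside the IQR fences.
    pub outliers: usize,
}

/// Linearly interpolated percentile over an ascending-sorted, non-empty slice.
/// `p` is in the range 0.0..=100.0.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = p / 100.0 * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo as f64)
}

/// Summarizes timing samples; returns `None` for an empty slice.
pub fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let n = sorted.len() as f64;
    let mean = sorted.iter().sum::<f64>() / n;

    // Sample standard deviation (divides by n - 1); zero for a single sample.
    let variance = if sorted.len() > 1 {
        sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
    } else {
        0.0
    };
    let std_dev = variance.sqrt();

    let q1 = percentile(&sorted, 25.0);
    let q3 = percentile(&sorted, 75.0);
    let iqr = q3 - q1;

    // Assumed fences: anything beyond 1.5 * IQR from the quartiles is an outlier.
    let (low_fence, high_fence) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    let outliers = sorted
        .iter()
        .filter(|&&x| x < low_fence || x > high_fence)
        .count();

    Some(Summary {
        mean,
        std_dev,
        median: percentile(&sorted, 50.0),
        q1,
        q3,
        cv: if mean != 0.0 { std_dev / mean } else { 0.0 },
        outliers,
    })
}
```

For samples such as `[1.0, 2.0, 3.0, 4.0, 100.0]`, a single slow iteration pulls the mean well away from the median, which is why the median and percentiles are usually the more robust numbers to report for GPU timings.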

§Throughput Calculation

For memory-bound kernels:

let result = Benchmark::new("vector_add", &stream)
    .memory(MemoryAccess::f32(n, 3))  // read a, read b, write c
    .run(|s| vector_add(s, &a, &b, &mut c))?;

println!("Throughput: {:.2} GB/s", result.throughput_gbs().unwrap());
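
The arithmetic behind a GB/s figure can be sketched as follows. This is an illustration rather than the crate's code: the helper names `f32_bytes` and `gbs_from_ms` are hypothetical, the second argument of `MemoryAccess::f32(n, 3)` is assumed to mean "accesses per element" (as the comment in the example suggests), and 1 GB is taken to be 10^9 bytes.

```rust
/// Bytes moved when each of `n` f32 elements is read or written `accesses` times.
/// Hypothetical helper; assumes 4 bytes per element and no caching effects.
pub fn f32_bytes(n: usize, accesses: usize) -> u64 {
    (n as u64) * (accesses as u64) * std::mem::size_of::<f32>() as u64
}

/// Effective bandwidth in GB/s (10^9 bytes per second) for a kernel that
/// moved `bytes` in `elapsed_ms` milliseconds. Returns `None` when the
/// elapsed time is not positive, since the ratio would be meaningless.
pub fn gbs_from_ms(bytes: u64, elapsed_ms: f64) -> Option<f64> {
    if elapsed_ms <= 0.0 {
        return None;
    }
    let seconds = elapsed_ms * 1e-3;
    Some(bytes as f64 / seconds / 1e9)
}
```

Under these assumptions, a `vector_add` over 1,000,000 elements moves 12,000,000 bytes (two reads and one write of 4 bytes each), so a 1 ms run corresponds to roughly 12 GB/s.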

For compute-bound kernels:

let result = Benchmark::new("fma_chain", &stream)
    .compute(ComputeIntensity::fma(n, iters))
    .run(|s| fma_chain(s, ...))?;

println!("Compute: {:.2} GFLOP/s", result.throughput_gflops().unwrap());
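
The GFLOP/s figure follows the same pattern. The sketch below is an assumption about what such a calculation might look like, not the crate's implementation: `fma_flops` and `gflops_from_ms` are hypothetical names, and it assumes that `ComputeIntensity::fma(n, iters)` describes `iters` fused multiply-adds per element, with each FMA counted as two floating-point operations (one multiply, one add).

```rust
/// Floating-point operations for `iters` FMAs on each of `n` elements.
/// Hypothetical helper; counts each fused multiply-add as 2 FLOPs.
pub fn fma_flops(n: usize, iters: usize) -> u64 {
    2 * (n as u64) * (iters as u64)
}

/// Compute throughput in GFLOP/s (10^9 operations per second) for a kernel
/// that performed `flops` operations in `elapsed_ms` milliseconds.
/// Returns `None` when the elapsed time is not positive.
pub fn gflops_from_ms(flops: u64, elapsed_ms: f64) -> Option<f64> {
    if elapsed_ms <= 0.0 {
        return None;
    }
    let seconds = elapsed_ms * 1e-3;
    Some(flops as f64 / seconds / 1e9)
}
```

Returning `Option` here mirrors the `.unwrap()` calls in the examples above, which suggest that the crate's throughput accessors can also report "no value", for instance when no workload description was supplied.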

§Re-exports

pub use bench::bench;
pub use bench::bench_memory;
pub use bench::BenchConfig;
pub use bench::BenchResult;
pub use bench::Benchmark;
pub use bench::ComputeIntensity;
pub use bench::MemoryAccess;
pub use report::format_bytes;
pub use report::format_count;
pub use report::format_gbs;
pub use report::format_gflops;
pub use report::format_ms;
pub use report::Comparison;
pub use report::Report;
pub use report::print_stats;
pub use stats::Stats;
pub use timer::GpuTimer;
pub use timer::StreamTimingExt;
pub use timer::TimingSamples;

§Modules

  • bench: Benchmark harness for GPU kernel performance measurement.
  • prelude: Prelude module for convenient imports.
  • report: Reporting and formatting utilities for benchmark results.
  • stats: Statistical analysis utilities for benchmark results.
  • timer: GPU timing utilities with reusable events.