Crate iro_cuda_ffi_profile

GPU profiling and benchmarking utilities for iro-cuda-ffi.

This crate provides tools for measuring GPU kernel performance: low-overhead timers built on reusable events, a benchmark harness with warmup and iteration control, and statistical analysis of the collected timings.

§Quick Start

use iro_cuda_ffi::prelude::*;
use iro_cuda_ffi_profile::prelude::*;

// One-shot timing
let ms = stream.timed_ms(|| {
    my_kernel(&stream, ...)?;
    Ok(())
})?;

// Reusable timer for hot loops
let timer = GpuTimer::new()?;
for _ in 0..100 {
    timer.start(&stream)?;
    my_kernel(&stream, ...)?;
    let ms = timer.stop_sync(&stream)?;
}

// Full benchmark with statistics
let result = Benchmark::new("my_kernel", &stream)
    .warmup(10)
    .iterations(100)
    .memory(MemoryAccess::f32(n, 3))
    .run(|s| my_kernel(s, ...))?;

println!("{}", result);

§Features

GpuTimer: Reusable event pair for low-overhead timing in loops
StreamTimingExt: Convenience extension for one-shot timing
Benchmark: Full benchmark harness with warmup and iterations
Stats: Comprehensive statistics including percentiles and outlier detection
Report: Formatted output for benchmark results

§When to Use What

Scenario	Tool
Quick one-off timing	`stream.timed_ms()`
Timing in a hot loop	`GpuTimer`
Full benchmark with stats	`Benchmark::new().run()`
Comparing two implementations	`Comparison`

§Statistical Analysis

The Stats type provides:

Basic statistics: min, max, mean, median, standard deviation
Percentiles: P1, P5, P25, P50, P75, P95, P99
Outlier detection using the IQR method
Coefficient of variation (standard deviation divided by mean) for comparing variability between runs with different mean times
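
To make the quantities listed above concrete, here is a minimal, self-contained sketch of how percentiles, IQR-based outlier detection, and the coefficient of variation can be computed from a set of timing samples. It is not the crate's implementation: the function names are hypothetical, and the linear-interpolation percentile rule, the conventional 1.5 × IQR fences, and the use of the sample (n - 1) standard deviation are all assumptions. Stats may make different choices.

```rust
/// Percentile (p in 0..=100) of an ascending-sorted, non-empty slice.
/// Assumption: linear interpolation between the two nearest ranks.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = p / 100.0 * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;
    sorted[lo] + (sorted[hi] - sorted[lo]) * frac
}

/// Coefficient of variation: sample standard deviation divided by the mean.
/// Requires at least two samples and a non-zero mean.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    variance.sqrt() / mean
}

/// Samples lying outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
/// Assumption: the conventional 1.5 multiplier for the fences.
fn iqr_outliers(samples: &[f64]) -> Vec<f64> {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = percentile(&sorted, 25.0);
    let q3 = percentile(&sorted, 75.0);
    let iqr = q3 - q1;
    let (low_fence, high_fence) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    samples
        .iter()
        .copied()
        .filter(|x| *x < low_fence || *x > high_fence)
        .collect()
}
```

For example, with the samples 1.0, 1.1, 0.9, 1.0, 1.05 and 5.0 (milliseconds), the 5.0 ms sample falls above the upper fence and is the only value reported by iqr_outliers.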

§Throughput Calculation

For memory-bound kernels:

let result = Benchmark::new("vector_add", &stream)
    .memory(MemoryAccess::f32(n, 3))  // read a, read b, write c
    .run(|s| vector_add(s, &a, &b, &mut c))?;

println!("Throughput: {:.2} GB/s", result.throughput_gbs().unwrap());
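
The arithmetic behind a bandwidth figure like this can be sketched in a few lines. The helper below is hypothetical and is not part of the crate. It assumes that MemoryAccess::f32(n, 3) describes n elements of 4 bytes each, touched 3 times in total (two reads and one write), and that 1 GB means 10^9 bytes. The crate may define either differently.

```rust
/// Effective memory bandwidth in GB/s (1 GB = 1e9 bytes, by assumption).
///
/// `elements`: number of elements processed by the kernel
/// `accesses_per_element`: reads plus writes per element (3 for c = a + b)
/// `bytes_per_element`: 4 for f32
/// `elapsed_ms`: measured kernel time in milliseconds
fn bandwidth_gbs(
    elements: usize,
    accesses_per_element: usize,
    bytes_per_element: usize,
    elapsed_ms: f64,
) -> f64 {
    let total_bytes = (elements * accesses_per_element * bytes_per_element) as f64;
    let elapsed_s = elapsed_ms * 1e-3;
    total_bytes / elapsed_s / 1e9
}
```

Under these assumptions, a vector add over one million f32 elements that takes 1 ms moves 12 MB and therefore corresponds to roughly 12 GB/s.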

For compute-bound kernels:

let result = Benchmark::new("fma_chain", &stream)
    .compute(ComputeIntensity::fma(n, iters))
    .run(|s| fma_chain(s, ...))?;

println!("Compute: {:.2} GFLOP/s", result.throughput_gflops().unwrap());
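
The compute figure follows the same pattern. The helper below is a hypothetical sketch, not the crate's code. It assumes that ComputeIntensity::fma(n, iters) means n elements with iters fused multiply-adds each, and that one FMA counts as 2 floating-point operations, which is the usual convention but is not confirmed by this page.

```rust
/// Compute throughput in GFLOP/s for an FMA-dominated kernel.
///
/// `elements`: number of elements processed by the kernel
/// `fmas_per_element`: fused multiply-adds executed per element
/// `elapsed_ms`: measured kernel time in milliseconds
fn compute_gflops(elements: usize, fmas_per_element: usize, elapsed_ms: f64) -> f64 {
    // Assumption: each FMA counts as two floating-point operations.
    let flops = 2.0 * elements as f64 * fmas_per_element as f64;
    let elapsed_s = elapsed_ms * 1e-3;
    flops / elapsed_s / 1e9
}
```

Under these assumptions, one million elements with 100 FMAs each, completed in 2 ms, works out to roughly 100 GFLOP/s.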

§Re-exports

pub use bench::bench;
pub use bench::bench_memory;
pub use bench::BenchConfig;
pub use bench::BenchResult;
pub use bench::Benchmark;
pub use bench::ComputeIntensity;
pub use bench::MemoryAccess;
pub use report::format_bytes;
pub use report::format_count;
pub use report::format_gbs;
pub use report::format_gflops;
pub use report::format_ms;
pub use report::Comparison;
pub use report::Report;
pub use report::print_stats;
pub use stats::Stats;
pub use timer::GpuTimer;
pub use timer::StreamTimingExt;
pub use timer::TimingSamples;

§Modules

bench: Benchmark harness for GPU kernel performance measurement.
prelude: Prelude module for convenient imports.
report: Reporting and formatting utilities for benchmark results.
stats: Statistical analysis utilities for benchmark results.
timer: GPU timing utilities with reusable events.
