Dynamic parallelism support for device-side kernel launches.
CUDA dynamic parallelism allows kernels running on the GPU to launch child kernels without returning to the host. This module provides configuration, planning, and PTX code generation for nested kernel launches.
§Architecture requirements
Dynamic parallelism requires compute capability 3.5+ (sm_35). All
SmVersion variants in this crate are sm_75+, so they all support
dynamic parallelism.
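The requirement above amounts to a single comparison on the compute capability. The sketch below illustrates it with a hypothetical helper, `supports_dynamic_parallelism`, which is not part of this crate and takes the capability as a plain `(major, minor)` pair rather than an `SmVersion`.

```rust
/// Hypothetical helper (an assumption for illustration, not this crate's API):
/// returns `true` when a device with the given compute capability meets the
/// 3.5 minimum that dynamic parallelism requires.
fn supports_dynamic_parallelism(major: u32, minor: u32) -> bool {
    // Tuples compare lexicographically: major first, then minor.
    (major, minor) >= (3, 5)
}
```

Under this check, sm_75 and sm_80 both pass, which is consistent with the statement that every `SmVersion` variant in this crate supports dynamic parallelism.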
§CUDA nesting limits
- Maximum nesting depth: 24
- Default pending launch limit: 2048
- Each pending launch consumes device memory for bookkeeping
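As a rough illustration of how the limits listed above might be checked before planning a launch, here is a minimal, self-contained sketch. The names `check_nesting_depth`, `needs_raised_pending_limit`, and `DepthError` are hypothetical and are not this crate's API; only the two numeric limits come from the list above.

```rust
/// Limits quoted in the list above.
const MAX_NESTING_DEPTH: u32 = 24;
const DEFAULT_PENDING_LAUNCH_LIMIT: u32 = 2048;

/// Hypothetical error type: the requested depth exceeds the maximum.
#[derive(Debug, PartialEq)]
struct DepthError {
    requested: u32,
    max: u32,
}

/// Hypothetical check: accepts any depth up to and including the maximum.
fn check_nesting_depth(requested: u32) -> Result<(), DepthError> {
    if requested > MAX_NESTING_DEPTH {
        Err(DepthError { requested, max: MAX_NESTING_DEPTH })
    } else {
        Ok(())
    }
}

/// Hypothetical check: returns `true` when the requested number of pending
/// launches is above the default, so the limit would have to be raised.
fn needs_raised_pending_limit(requested: u32) -> bool {
    requested > DEFAULT_PENDING_LAUNCH_LIMIT
}
```

With these definitions, `check_nesting_depth(4)` succeeds and `check_nesting_depth(25)` returns an error carrying both the requested depth and the maximum. In the CUDA runtime, the pending launch limit is raised from the host with `cudaDeviceSetLimit` and `cudaLimitDevRuntimePendingLaunchCount`. Because each pending launch consumes device memory, raising it also increases the memory reserved for bookkeeping.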
§Example
use oxicuda_launch::dynamic_parallelism::{
    DynamicParallelismConfig, ChildKernelSpec, GridSpec,
    validate_dynamic_config, plan_dynamic_launch,
    generate_child_launch_ptx, generate_device_sync_ptx,
    estimate_launch_overhead, max_nesting_for_sm,
};
use oxicuda_launch::Dim3;
use oxicuda_ptx::arch::SmVersion;
use oxicuda_ptx::PtxType;

let config = DynamicParallelismConfig {
    max_nesting_depth: 4,
    max_pending_launches: 2048,
    sync_depth: 2,
    child_grid: Dim3::x(128),
    child_block: Dim3::x(256),
    child_shared_mem: 0,
    sm_version: SmVersion::Sm80,
};
validate_dynamic_config(&config).ok();
let plan = plan_dynamic_launch(&config).ok();

let child = ChildKernelSpec {
    name: "child_kernel".to_string(),
    param_types: vec![PtxType::U64, PtxType::U32],
    grid_dim: GridSpec::Fixed(Dim3::x(128)),
    block_dim: Dim3::x(256),
    shared_mem_bytes: 0,
};
let ptx = generate_child_launch_ptx("parent_kernel", &child, SmVersion::Sm80);
let sync_ptx = generate_device_sync_ptx(SmVersion::Sm80);
let overhead = estimate_launch_overhead(4, 2048);
let max_depth = max_nesting_for_sm(SmVersion::Sm80);
Structs§
- ChildKernelSpec - Specification for a child kernel to be launched from device code.
- DynamicLaunchPlan - A validated plan for a dynamic (device-side) kernel launch.
- DynamicParallelismConfig - Configuration for dynamic parallelism (device-side kernel launches).
Enums§
- GridSpec - Specifies how child kernel grid dimensions are determined.
Functions§
- estimate_launch_overhead - Estimates the device memory overhead for dynamic parallelism in bytes.
- generate_child_launch_ptx - Generates PTX code for a device-side child kernel launch.
- generate_device_sync_ptx - Generates PTX code for device-side synchronization.
- max_nesting_for_sm - Returns the maximum supported nesting depth for a given SM version.
- plan_dynamic_launch - Creates a validated launch plan from a dynamic parallelism configuration.
- validate_dynamic_config - Validates a dynamic parallelism configuration.