Module dynamic_parallelism

Dynamic parallelism support for device-side kernel launches.

CUDA dynamic parallelism allows kernels running on the GPU to launch child kernels without returning to the host. This module provides configuration, planning, and PTX code generation for nested kernel launches.

§Architecture requirements

Dynamic parallelism requires compute capability 3.5 (sm_35) or higher. Every SmVersion variant in this crate targets sm_75 or newer, so all of them support dynamic parallelism.
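
The capability requirement above can be expressed as a simple version comparison. The following is a minimal sketch, not part of this crate's API: the function name `supports_dynamic_parallelism` and its `(major, minor)` parameters are hypothetical and chosen only for illustration.

```rust
/// Hypothetical helper (an assumption, not an item exported by this crate).
/// Returns true when a compute capability `major.minor` is at least 3.5,
/// the minimum stated above for dynamic parallelism.
fn supports_dynamic_parallelism(major: u32, minor: u32) -> bool {
    // Tuples compare lexicographically: major first, then minor.
    (major, minor) >= (3, 5)
}
```

Under this check, `supports_dynamic_parallelism(7, 5)` and `supports_dynamic_parallelism(8, 0)` both hold, which is consistent with every sm_75+ target supporting device-side launches.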

§CUDA nesting limits

Maximum nesting depth: 24
Default pending launch limit: 2048
Each pending launch consumes device memory for bookkeeping
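
The limits listed above could be checked before building a configuration. This is a rough sketch under stated assumptions: the constant names and both helper functions are hypothetical, and the checks are not necessarily the ones `validate_dynamic_config` performs.

```rust
// Values quoted in this section; the constant names are hypothetical.
const MAX_NESTING_DEPTH: u32 = 24;
const DEFAULT_PENDING_LAUNCH_LIMIT: u32 = 2048;

/// Hypothetical check: rejects a nesting depth above the documented maximum.
fn check_nesting_depth(depth: u32) -> Result<(), String> {
    if depth > MAX_NESTING_DEPTH {
        Err(format!(
            "nesting depth {depth} exceeds the maximum of {MAX_NESTING_DEPTH}"
        ))
    } else {
        Ok(())
    }
}

/// Hypothetical check: reports whether a requested pending-launch count
/// is larger than the documented default limit.
fn exceeds_default_pending_limit(pending_launches: u32) -> bool {
    pending_launches > DEFAULT_PENDING_LAUNCH_LIMIT
}
```

With the values used in the example below (a depth of 4 and 2048 pending launches), `check_nesting_depth(4)` returns `Ok(())` and `exceeds_default_pending_limit(2048)` returns `false`.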

§Example

use oxicuda_launch::dynamic_parallelism::{
    DynamicParallelismConfig, ChildKernelSpec, GridSpec,
    validate_dynamic_config, plan_dynamic_launch,
    generate_child_launch_ptx, generate_device_sync_ptx,
    estimate_launch_overhead, max_nesting_for_sm,
};
use oxicuda_launch::Dim3;
use oxicuda_ptx::arch::SmVersion;
use oxicuda_ptx::PtxType;

let config = DynamicParallelismConfig {
    max_nesting_depth: 4,
    max_pending_launches: 2048,
    sync_depth: 2,
    child_grid: Dim3::x(128),
    child_block: Dim3::x(256),
    child_shared_mem: 0,
    sm_version: SmVersion::Sm80,
};

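// `.ok()` discards any error for brevity; real code should handle the failure case.
validate_dynamic_config(&config).ok();
// `plan` is `None` if the configuration could not be turned into a launch plan.
let plan = plan_dynamic_launch(&config).ok();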

let child = ChildKernelSpec {
    name: "child_kernel".to_string(),
    param_types: vec![PtxType::U64, PtxType::U32],
    grid_dim: GridSpec::Fixed(Dim3::x(128)),
    block_dim: Dim3::x(256),
    shared_mem_bytes: 0,
};

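// PTX for launching `child_kernel` from within `parent_kernel`.
let ptx = generate_child_launch_ptx("parent_kernel", &child, SmVersion::Sm80);
// PTX for device-side synchronization.
let sync_ptx = generate_device_sync_ptx(SmVersion::Sm80);
// Estimated device memory overhead in bytes; the arguments mirror the
// nesting depth (4) and pending launch count (2048) from the config above.
let overhead = estimate_launch_overhead(4, 2048);
// Maximum nesting depth supported on this SM version.
let max_depth = max_nesting_for_sm(SmVersion::Sm80);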

Structs§

ChildKernelSpec: Specification for a child kernel to be launched from device code.
DynamicLaunchPlan: A validated plan for a dynamic (device-side) kernel launch.
DynamicParallelismConfig: Configuration for dynamic parallelism (device-side kernel launches).

Enums§

GridSpec: Specifies how child kernel grid dimensions are determined.

Functions§

estimate_launch_overhead: Estimates the device memory overhead for dynamic parallelism in bytes.
generate_child_launch_ptx: Generates PTX code for a device-side child kernel launch.
generate_device_sync_ptx: Generates PTX code for device-side synchronization.
max_nesting_for_sm: Returns the maximum supported nesting depth for a given SM version.
plan_dynamic_launch: Creates a validated launch plan from a dynamic parallelism configuration.
validate_dynamic_config: Validates a dynamic parallelism configuration.
