oxicuda-launch
Type-safe GPU kernel launch infrastructure for the OxiCUDA ecosystem.
Part of the OxiCUDA project.
Overview
oxicuda-launch provides ergonomic, type-safe abstractions for launching
CUDA GPU kernels. It builds on top of oxicuda-driver and oxicuda-memory
to turn raw cuLaunchKernel calls into safe, builder-pattern Rust code.
The Kernel struct wraps a Function handle and holds an Arc<Module> to
ensure the PTX module remains loaded for the kernel's lifetime. Arguments
are passed through the KernelArgs trait, which converts typed Rust values
into the *mut c_void array that the driver expects. Tuple implementations
are provided for up to 24 Copy elements, covering virtually all practical
kernel signatures.
The launch! macro offers a concise syntax for kernel dispatch, while
LaunchParams and its builder provide explicit control over grid size,
block size, shared memory, and stream selection. The grid_size_for(n, block_size) helper computes the minimum number of blocks needed to cover
a given number of work items.
Modules
| Module | Description |
|---|---|
grid |
Dim3 struct, grid_size_for(), auto_grid_for() helpers |
params |
LaunchParams and LaunchParamsBuilder (builder pattern) |
kernel |
Kernel struct, KernelArgs trait (tuple impls up to 24) |
macros |
launch! macro for ergonomic kernel dispatch |
error |
Launch-specific error types and validation |
cooperative |
Cooperative kernel launch (cuLaunchCooperativeKernel) |
cluster |
Cluster launch support (Hopper+, CUlaunchConfig) |
graph_launch |
Graph-based launch recording and replay |
async_launch |
Async kernel launch with future/poll-based completion |
dynamic_parallelism |
Device-side kernel launch (nested kernels) |
multi_stream |
Multi-stream kernel dispatch helpers |
named_args |
Named kernel arguments (named_args!(), launch_named!()) |
arg_serialize |
Kernel argument serialization and debug display |
telemetry |
Launch telemetry: timing, occupancy achieved, register usage |
trace |
tracing crate span integration per kernel launch |
Quick Start
use Arc;
use ;
use ;
init?;
let dev = get?;
let ctx = new;
// Load PTX and create a kernel.
let module = new;
let kernel = from_module?;
// Configure launch dimensions.
let n: u32 = 1024;
let block_size = 256u32;
let grid = grid_size_for;
let stream = new?;
// Launch with the macro.
let = ;
launch!?;
stream.synchronize?;
# Ok::
Dim3 Conversions
Dim3 supports convenient From conversions for common dimension patterns:
| Input type | Result |
|---|---|
u32 |
Dim3 { x, y: 1, z: 1 } |
(u32, u32) |
Dim3 { x, y, z: 1 } |
(u32, u32, u32) |
Dim3 { x, y, z } |
Features
| Feature | Description |
|---|---|
gpu-tests |
Enable tests that require a physical GPU |
Platform Support
| Platform | Status |
|---|---|
| Linux | Full support (NVIDIA driver 525+) |
| Windows | Full support (NVIDIA driver 525+) |
| macOS | Compile only (UnsupportedPlatform at runtime) |
Status
| Item | Value |
|---|---|
| Version | 0.1.4 (2026-04-18) |
| Tests | 207 passing |
| Warnings | 0 |
unwrap() |
0 |
License
Apache-2.0 -- (C) 2026 COOLJAPAN OU (Team KitaSan)