Crate iro_cuda_ffi


§IRO CUDA FFI (iro-cuda-ffi) v1

A minimal, rigid ABI boundary that lets Rust orchestrate nvcc-compiled CUDA C++ kernels with no performance penalty relative to pure C++.

§Design Philosophy

  1. nvcc produces device code. iro-cuda-ffi never competes with nvcc.
  2. Rust owns host orchestration. Ownership, lifetimes, ordering, and errors are Rust responsibilities.
  3. FFI is constrained. The ABI boundary is small, stable, and verifiable.
  4. Patterns are mechanical. Humans and AI can generate wrappers safely via deterministic rules.
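The "mechanical pattern" idea can be sketched as a three-step wrapper: call across the boundary, receive an integer status, convert it to a `Result`. The block below is only an illustration under stated assumptions. `StatusError`, `fake_kernel_launch`, and `launch` are hypothetical names, and this local `check` is a stand-in that may differ from the crate's own `check`.

```rust
// Hypothetical stand-ins for illustration; the crate's real `check` and
// error types may differ.

/// Error carrying the raw nonzero status code returned across the boundary.
#[derive(Debug, PartialEq, Eq)]
pub struct StatusError(pub i32);

/// Map a C-style status code to a `Result`: 0 is success, anything else fails.
pub fn check(code: i32) -> Result<(), StatusError> {
    if code == 0 {
        Ok(())
    } else {
        Err(StatusError(code))
    }
}

/// Stand-in for an nvcc-compiled launcher. A real one would be declared in an
/// `extern "C"` block and linked from the CUDA object file.
unsafe extern "C" fn fake_kernel_launch(n: u32) -> i32 {
    // Report a nonzero status for an empty launch, success otherwise.
    if n == 0 {
        1
    } else {
        0
    }
}

/// Safe wrapper: call across the boundary, then convert the status.
pub fn launch(n: u32) -> Result<(), StatusError> {
    // SAFETY: the stand-in has no preconditions. A real wrapper would state
    // the pointer-validity and stream-lifetime requirements here.
    check(unsafe { fake_kernel_launch(n) })
}
```

Because every wrapper has the same shape, the only per-kernel decisions are the argument list and the safety comment.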

§Core Guarantees

  • No hidden device synchronization: Kernel launches never implicitly synchronize streams.
  • No implicit stream dependencies: You control all ordering via streams and events.
  • Typed transfer boundary: Host↔device copies are gated by IcffiPod for safety.
  • ABI verification: Layout asserts on both Rust and C++ sides catch mismatches at compile time.
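The last two guarantees can be illustrated with a short sketch. It assumes a hypothetical descriptor type `DemoBufferDesc` and a hypothetical marker trait `DemoPod` written in the spirit of `IcffiPod`. Neither is the crate's actual definition, and the real layouts and trait bounds may differ.

```rust
use std::mem::{align_of, size_of, size_of_val};

/// Hypothetical buffer descriptor passed across the FFI boundary.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct DemoBufferDesc {
    pub ptr: u64, // device address carried as an integer
    pub len: u64, // element count
}

// Compile-time layout checks: the build fails if the layout drifts.
// The C++ side would mirror these with `static_assert(sizeof(...) == 16)`.
const _: () = assert!(size_of::<DemoBufferDesc>() == 16);
const _: () = assert!(align_of::<DemoBufferDesc>() == align_of::<u64>());

/// Hypothetical marker trait: implementors promise they are plain bytes
/// with no padding, pointers to host memory, or drop glue.
pub unsafe trait DemoPod: Copy + 'static {}

unsafe impl DemoPod for f32 {}
unsafe impl DemoPod for DemoBufferDesc {}

/// View a slice of POD values as raw bytes, as a host-to-device copy would.
pub fn as_bytes<T: DemoPod>(data: &[T]) -> &[u8] {
    // SAFETY: `T: DemoPod` promises every byte of `T` is initialized, and the
    // returned slice borrows `data`, so it cannot outlive the source.
    unsafe { std::slice::from_raw_parts(data.as_ptr().cast::<u8>(), size_of_val(data)) }
}
```

With a bound like this, a call such as `as_bytes::<String>(..)` is rejected by the compiler, which is the sense in which a transfer boundary is "typed".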

§CUDA Version Requirements

iro-cuda-ffi requires CUDA 12.0 or later. Its CUDA Graph features use runtime APIs introduced between CUDA 11.4 and 12.0, so linking against an older runtime will fail.
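For reference, the CUDA runtime reports its version as a single integer, `1000 * major + 10 * minor` (so 12.0 is `12000`). The sketch below decodes that value and compares it against the 12.0 minimum. The function names are hypothetical helpers, not part of this crate, and obtaining the raw value (for example from `cudaRuntimeGetVersion`) is left out.

```rust
/// Split a raw CUDA version integer into (major, minor).
/// The runtime encodes versions as 1000 * major + 10 * minor.
pub fn decode_cuda_version(raw: i32) -> (i32, i32) {
    (raw / 1000, (raw % 1000) / 10)
}

/// True when the raw version is at least CUDA 12.0.
pub fn meets_minimum(raw: i32) -> bool {
    // Tuples compare lexicographically: major first, then minor.
    decode_cuda_version(raw) >= (12, 0)
}
```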

§Quick Start

use iro_cuda_ffi::prelude::*;

// Create a non-blocking stream
let stream = Stream::new()?;

// Allocate and initialize device memory (safe sync variant)
let input = DeviceBuffer::from_slice_sync(&stream, &[1.0f32, 2.0, 3.0, 4.0])?;
let mut output = DeviceBuffer::<f32>::zeros(4)?;

// Launch your kernel (extern "C" fn icffi_my_kernel(...) -> i32)
let blocks = (input.len() as u32 + 255) / 256;
let params = LaunchParams::new_1d(blocks, 256, stream.raw());
check(unsafe { icffi_my_kernel(params, input.as_in(), output.as_out()) })?;

// Read results (synchronizes automatically)
let results = output.to_vec(&stream)?;
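The block count in the example above is a ceiling division: enough 256-thread blocks to cover every element. A minimal sketch of that arithmetic follows, using a hypothetical helper `blocks_for` that is not part of this crate. Doing the division in `usize` before narrowing avoids the truncation that `len as u32 + 255` could hit on very large buffers.

```rust
/// Number of blocks needed so that `blocks * threads_per_block >= len`.
/// Panics if `threads_per_block` is zero or the result does not fit in `u32`.
pub fn blocks_for(len: usize, threads_per_block: u32) -> u32 {
    assert!(threads_per_block > 0, "threads_per_block must be nonzero");
    // Ceiling division in usize, then a checked narrowing to u32.
    let blocks = len.div_ceil(threads_per_block as usize);
    u32::try_from(blocks).expect("grid dimension exceeds u32")
}
```

Note that an empty input yields zero blocks, and CUDA rejects a launch with a zero-sized grid, so a caller would usually skip the launch in that case.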

Re-exports§

pub use prelude::*;

Modules§

abi
ABI types for the iro-cuda-ffi kernel interface.
device
CUDA device management.
error
Error handling for iro-cuda-ffi.
event
CUDA event primitives for synchronization and timing.
graph
CUDA graph capture and execution.
host_memory
Pinned host memory management.
memory
Device memory management.
pod
Plain Old Data (POD) traits for safe host↔device transfers.
prelude
Convenient re-exports for common iro-cuda-ffi usage.
stream
CUDA stream primitives for work ordering.
transfer
Async transfer guards for memory-safe DMA operations.