Module cudarc::driver

Expand description

Wrappers around the CUDA driver API, in three levels. See crate documentation for description of each.

safe api usage

Instantiate a CudaDevice:

let device = CudaDevice::new(0).unwrap();

Allocate device memory with host data with CudaDevice::htod_copy(), CudaDevice::alloc_zeros(), or CudaDevice::htod_sync_copy().

You can also copy data to CudaSlice using CudaDevice::htod_sync_copy_into()

let a_dev: CudaSlice<f32> = device.alloc_zeros(10).unwrap();
let b_dev: CudaSlice<f32> = device.htod_copy(vec![0.0; 10]).unwrap();
let c_dev: CudaSlice<f32> = device.htod_sync_copy(&[1.0, 2.0, 3.0]).unwrap();

Transfer to host memory with CudaDevice::sync_reclaim(), CudaDevice::dtoh_sync_copy(), or CudaDevice::dtoh_sync_copy_into()

let a_dev: CudaSlice<f32> = device.alloc_zeros(10).unwrap();
let mut a_buf: [f32; 10] = [1.0; 10];
device.dtoh_sync_copy_into(&a_dev, &mut a_buf);
assert_eq!(a_buf, [0.0; 10]);
let a_host: Vec<f32> = device.sync_reclaim(a_dev).unwrap();
assert_eq!(&a_host, &[0.0; 10]);

Mutating device memory - CudaFunction

See LaunchAsync and CudaFunction.

In order to mutate device data, you need to use cuda kernels.

Loading kernels is done with CudaDevice::load_ptx()

let ptx = compile_ptx("extern \"C\" __global__ void my_function(float *out) { }").unwrap();
let device = CudaDevice::new(0).unwrap();
device.load_ptx(ptx, "module_name", &["my_function"]).unwrap();

Retrieve the function using the registered module name & actual function name:

let func: CudaFunction = device.get_func("module_name", "my_function").unwrap();

Asynchronously execute the kernel:

let mut a = device.alloc_zeros::<f32>(10).unwrap();
let cfg = LaunchConfig::for_num_elems(10);
unsafe { func.launch(cfg, (&mut a,)) }.unwrap();

Note: Launching kernels is extremely unsafe. See LaunchAsync for more info.

For some operations, it is necessary to only operate on a small part of a single CudaSlice. For example, the slice may represent a batch of items, and you want to run separate kernels on each of the items in the batch.

Use CudaSlice::try_slice() and CudaSlice::try_slice_mut() for this. The returned views (CudaView and CudaViewMut hold references to the owning CudaSlice, so rust’s ownership system handles safety here.

These view structs can be used with CudaFunction.

let mut a: CudaSlice<f32> = device.alloc_zeros::<f32>(3 * 10).unwrap();
for i_batch in 0..3 {
    let mut a_sub_view: CudaViewMut<f32> = a.try_slice_mut(i_batch * 10..).unwrap();
    let f: CudaFunction = device.get_func("module_name", "my_function").unwrap();
    let cfg = LaunchConfig::for_num_elems(10);
    unsafe { f.launch(cfg, (&mut a_sub_view,)) }.unwrap();
}

A note on implementation

It would be possible to re-use CudaSlice itself for sub-slices, however that would involve adding another structure underneath the hood that is wrapped in an std::sync::Arc to minimize data cloning. Overall it seemed more complex than the current implementation.

Multi threading

In order to use a CudaDevice on multiple threads, you must call CudaDevice::bind_to_thread on each thread before you use the device.

Safety

There are a number of aspects to this, but at a high level this API utilizes std::sync::Arc to control when CudaDevice can be dropped.

Context/Stream lifetimes

The first part of safety is ensuring that crate::driver::sys::CUcontext, crate::driver::sys::CUdevice, and crate::driver::sys::CUstream all live the required amount of time (i.e. device outlives context, which outlives stream).

This is accomplished by putting all of them inside one struct, the CudaDevice. There are other ways, such as adding newtypes that carry lifetimes with them, but this approach was chosen to make working with device pointers easier.

Additionally, CudaDevice implements Drop as releasing all the data from the device in the expected way.

To this end every operation by default happens on the same stream.

Multi stream is supported via CudaStream, however it automatically synchronizes with the main stream on creation & on drop. It is still possible to be unsafe in a multi stream context though.

Re-exports

pub use safe::*;

Modules

result
A thin wrapper around sys.
safe
Safe abstractions over crate::driver::result provided by CudaSlice, CudaDevice, CudaStream, and more.
sys

Module cudarc::driver

safe api usage

Mutating device memory - CudaFunction

Sub slices of CudaSlice

A note on implementation

Multi threading

Safety

Context/Stream lifetimes

Device Data lifetimes

Host and Device Data lifetimes

Single stream operations

Re-exports

Modules