Module cust::memory


Access to CUDA’s memory allocation and transfer functions.

The memory module provides a safe wrapper around CUDA’s memory allocation and transfer functions. This includes access to device memory, unified memory, and page-locked host memory.

Device Memory

Device memory is just what it sounds like - memory allocated on the device. Device memory cannot be accessed from the host directly, but data can be copied to and from the device. cust exposes device memory through the DeviceBox and DeviceBuffer structures. Pointers to device memory are represented by DevicePointer, while slices in device memory are represented by DeviceSlice.
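A minimal sketch of the round trip described above, assuming a CUDA-capable machine with the `cust` crate available (`quick_init` here stands in for whatever context setup the application uses):

```rust
use cust::memory::{CopyDestination, DeviceBuffer};

fn main() -> Result<(), cust::error::CudaError> {
    // A CUDA context must be active before any memory calls.
    let _ctx = cust::quick_init()?;

    // Copy host data into a fresh device allocation.
    let device_buf = DeviceBuffer::from_slice(&[1u32, 2, 3, 4])?;

    // Device memory cannot be read directly from the host,
    // so copy it back into host memory to inspect it.
    let mut host = [0u32; 4];
    device_buf.copy_to(&mut host)?;
    assert_eq!(host, [1, 2, 3, 4]);
    Ok(())
}
```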

Unified Memory

Unified memory is a memory allocation which can be read from and written to by both the host and the device. When the host (or device) attempts to access a page of unified memory, it is seamlessly transferred from host RAM to device RAM or vice versa. The programmer may also choose to explicitly prefetch data to one side or another. cust exposes unified memory through the UnifiedBox and UnifiedBuffer structures, and pointers to unified memory are represented by UnifiedPointer. Since unified memory is accessible to the host, slices in unified memory are represented by normal Rust slices.

Unified memory is generally easier to use than device memory, but there are drawbacks. It is possible to allocate more memory than is available on the card, and this can result in very slow paging behavior. Additionally, it can require careful use of prefetching to achieve optimum performance. Finally, unified memory is not supported on some older systems.
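Because unified memory derefs to an ordinary Rust slice, host-side access looks like plain slice code. A sketch, assuming an initialized context (heed the warning below: the host must not touch the buffer while a kernel is using it):

```rust
use cust::memory::UnifiedBuffer;

fn main() -> Result<(), cust::error::CudaError> {
    let _ctx = cust::quick_init()?;

    // Allocate 16 elements of unified memory, initialized from a host value.
    let mut buf = UnifiedBuffer::new(&0u32, 16)?;

    // The buffer dereferences to &mut [u32], so the host can
    // read and write it directly with normal slice operations.
    for (i, x) in buf.iter_mut().enumerate() {
        *x = i as u32;
    }
    assert_eq!(buf[15], 15);
    Ok(())
}
```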

Warning

⚠️ On certain systems/OSes/GPUs, accessing unified memory from the CPU while the GPU is using it (e.g. before stream synchronization) will cause a page error/segfault. For this reason, we strongly suggest treating unified memory as exclusive to the GPU while it is being used by a kernel. ⚠️

This is not considered Undefined Behavior, because the outcome is always either "it works" or "it yields a page error/segfault"; it will never corrupt memory or cause other undesirable behavior.

Page-locked Host Memory

Page-locked memory is memory that the operating system has locked into physical RAM, and will not page out to disk. When copying data from the process’ memory space to the device, the CUDA driver needs to first copy the data to a page-locked region of host memory, then initiate a DMA transfer to copy the data to the device itself. Likewise, when transferring from device to host, the driver copies the data into page-locked host memory then into the normal memory space. This extra copy can be eliminated if the data is loaded or generated directly into page-locked memory. cust exposes page-locked memory through the LockedBuffer struct.

For example, if the programmer needs to read an array of bytes from disk and transfer it to the device, it would be best to create a LockedBuffer, load the bytes directly into the LockedBuffer, and then copy them to a DeviceBuffer. If the bytes are in a Vec<u8>, there would be no advantage to using a LockedBuffer.

However, since the OS cannot page out page-locked memory, excessive use can slow down the entire system (including other processes) as physical RAM is tied up. Therefore, page-locked memory should be used sparingly.
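The disk-to-device pattern above can be sketched as follows, assuming an initialized context (the file-reading step is elided and stands in for whatever loads data into the staging buffer):

```rust
use cust::memory::{DeviceBuffer, LockedBuffer};

fn main() -> Result<(), cust::error::CudaError> {
    let _ctx = cust::quick_init()?;

    // Load or generate data directly into page-locked memory...
    let mut staging = LockedBuffer::new(&0u8, 4096)?;
    staging[0] = 0xFF; // e.g. read file contents into `staging` here

    // ...then copy it to the device. Because the source is already
    // page-locked, the driver can DMA it without an extra internal copy.
    let _device_buf = DeviceBuffer::from_slice(&staging)?;
    Ok(())
}
```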

FFI Information

The internal representations of DevicePointer<T> and UnifiedPointer<T> are guaranteed to be the same as *mut T and they can be safely passed through an FFI boundary to code expecting raw pointers (though keep in mind that device-only pointers cannot be dereferenced on the CPU). This is important when launching kernels written in C.

As with regular Rust, all other types (e.g. DeviceBuffer or UnifiedBox) are not FFI-safe. Their internal representations are not guaranteed to be anything in particular, and may change between versions of cust. If you need to pass them through an FFI boundary, you must convert them to FFI-safe primitives yourself. For example, with UnifiedBuffer, use the as_unified_ptr() and len() functions to get the primitives, and mem::forget() the buffer so that it isn't dropped. The caller is then responsible for reconstructing the UnifiedBuffer using from_raw_parts() and dropping it, to ensure that the memory allocation is safely cleaned up.
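A sketch of the decompose/reconstruct dance described above, assuming an initialized context (the FFI call itself is elided):

```rust
use std::mem;

use cust::memory::UnifiedBuffer;

fn main() -> Result<(), cust::error::CudaError> {
    let _ctx = cust::quick_init()?;

    let mut buffer = UnifiedBuffer::new(&0f32, 8)?;

    // Decompose into FFI-safe primitives and forget the wrapper so
    // its Drop impl does not free the allocation out from under us.
    let ptr = buffer.as_unified_ptr();
    let len = buffer.len();
    mem::forget(buffer);

    // ... pass `ptr` (repr-compatible with *mut f32) and `len`
    //     across the FFI boundary here ...

    // Rebuild the buffer afterwards so the allocation is freed exactly once.
    let buffer = unsafe { UnifiedBuffer::from_raw_parts(ptr, len) };
    drop(buffer);
    Ok(())
}
```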

Re-exports

pub use bytemuck;

Modules

Routines for allocating and using CUDA Array Objects.

Structs

A pointer type for heap-allocation in CUDA device memory.

Fixed-size device-side buffer. Provides basic access to device memory.

A pointer to device memory.

Fixed-size device-side slice.

Wrapper around a variable on the host and a DeviceBox holding the variable on the device, allowing for easy synchronization and storage.

Page-locked box in host memory.

Fixed-size host-side buffer in page-locked memory.

A pointer type for heap-allocation in CUDA unified memory.

Fixed-size buffer in unified memory.

A pointer to unified memory.

Traits

Sealed trait implemented by types which can be the source or destination when copying data asynchronously to/from the device or from one device allocation to another.

Sealed trait implemented by types which can be the source or destination when copying data to/from the device or from one device allocation to another.

Marker trait for types which can safely be copied to or from a CUDA device.

A trait describing a region of memory on the device with a base pointer and a size, used to be generic over DeviceBox, DeviceBuffer, DeviceVariable etc.

A trait describing a generic pointer that can be accessed from the GPU. This could be either a UnifiedBox or a regular DeviceBox.

A trait describing a generic buffer that can be accessed from the GPU. This could be either a UnifiedBuffer or a regular DeviceBuffer.

Functions for advising the driver about certain uses of unified memory, such as prefetching memory or treating it as read-mostly.

Functions

Free memory allocated with cuda_malloc.

Unsafe wrapper around cuMemFreeAsync which queues a memory allocation free operation on a stream. Retains all of the unsafe semantics of cuda_free with the extra requirement that the memory must not be used after it is dropped. Therefore, proper stream ordering semantics must be respected.

Free page-locked memory allocated with cuda_malloc_host.

Free memory allocated with cuda_malloc_unified.

Unsafe wrapper around the cuMemAlloc function, which allocates some device memory and returns a DevicePointer pointing to it. The memory is not cleared.

Unsafe wrapper around cuMemAllocAsync which queues a memory allocation operation on a stream. Retains all of the unsafe semantics of cuda_malloc with the extra requirement that the memory must not be used until it is allocated on the stream. Therefore, proper stream ordering semantics must be respected.

Unsafe wrapper around the cuMemAllocHost function, which allocates some page-locked host memory and returns a raw pointer pointing to it. The memory is not cleared.

Unsafe wrapper around the cuMemAllocManaged function, which allocates some unified memory and returns a UnifiedPointer pointing to it. The memory is not cleared.

Get the current free and total memory.

Simple wrapper over cuMemcpyHtoD_v2.

Derive Macros