Access to CUDA’s memory allocation and transfer functions.
The memory module provides a safe wrapper around CUDA’s memory allocation and transfer functions. This includes access to device memory, unified memory, and page-locked host memory.
Device Memory
Device memory is just what it sounds like - memory allocated on the device. Device memory cannot be accessed from the host directly, but data can be copied to and from the device. cust exposes device memory through the DeviceBox and DeviceBuffer structures. Pointers to device memory are represented by DevicePointer, while slices in device memory are represented by DeviceSlice.
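As a minimal sketch of the round trip described above (assuming the cust crate and a CUDA-capable device; `quick_init` creates a context for the current thread):

```rust
use cust::memory::{CopyDestination, DeviceBuffer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize CUDA and create a context for this thread.
    let _ctx = cust::quick_init()?;

    // Allocate device memory and copy host data into it.
    let host_data = [1.0f32, 2.0, 3.0, 4.0];
    let device_buf = DeviceBuffer::from_slice(&host_data)?;

    // Device memory cannot be read directly from the host,
    // so copy it back into a host-side array instead.
    let mut out = [0.0f32; 4];
    device_buf.copy_to(&mut out)?;
    assert_eq!(out, host_data);
    Ok(())
}
```

The `copy_to`/`copy_from` methods come from the CopyDestination trait, which must be in scope.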
Unified Memory
Unified memory is a memory allocation which can be read from and written to by both the host and the device. When the host (or device) attempts to access a page of unified memory, it is seamlessly transferred from host RAM to device RAM or vice versa. The programmer may also choose to explicitly prefetch data to one side or the other. cust exposes unified memory through the UnifiedBox and UnifiedBuffer structures, and pointers to unified memory are represented by UnifiedPointer. Since unified memory is accessible to the host, slices in unified memory are represented by normal Rust slices.
Unified memory is generally easier to use than device memory, but there are drawbacks. It is possible to allocate more memory than is available on the card, and this can result in very slow paging behavior. Additionally, it can require careful use of prefetching to achieve optimum performance. Finally, unified memory is not supported on some older systems.
Warning
⚠️ On certain systems/OSes/GPUs, accessing unified memory from the CPU while the GPU is currently using it (e.g. before stream synchronization) will cause a page error/segfault. For this reason, we strongly suggest treating unified memory as exclusive to the GPU while it is being used by a kernel ⚠️
This is not considered Undefined Behavior because the outcome is always "either it works, or it yields a page error/segfault"; doing this will never corrupt memory or cause other undesirable behavior.
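Because unified memory is host-accessible, a UnifiedBuffer can be used like an ordinary mutable slice. A short sketch (assuming the cust crate and a CUDA-capable device):

```rust
use cust::memory::UnifiedBuffer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _ctx = cust::quick_init()?;

    // Allocate a unified buffer of 16 u32s, initialized to zero.
    let mut buf = UnifiedBuffer::new(&0u32, 16)?;

    // The buffer derefs to a plain Rust slice, so the host can
    // read and write it directly.
    for (i, x) in buf.iter_mut().enumerate() {
        *x = i as u32;
    }
    assert_eq!(buf[15], 15);

    // After launching a kernel that uses `buf`, synchronize the
    // stream before touching the buffer from the CPU again, per
    // the warning above.
    Ok(())
}
```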
Page-locked Host Memory
Page-locked memory is memory that the operating system has locked into physical RAM, and will
not page out to disk. When copying data from the process’ memory space to the device, the CUDA
driver needs to first copy the data to a page-locked region of host memory, then initiate a DMA
transfer to copy the data to the device itself. Likewise, when transferring from device to host,
the driver copies the data into page-locked host memory then into the normal memory space. This
extra copy can be eliminated if the data is loaded or generated directly into page-locked
memory. cust exposes page-locked memory through the LockedBuffer struct.
For example, if the programmer needs to read an array of bytes from disk and transfer it to the device, it would be best to create a LockedBuffer, load the bytes directly into the LockedBuffer, and then copy them to a DeviceBuffer. If the bytes are already in a Vec<u8>, there would be no advantage to using a LockedBuffer.
However, since the OS cannot page out page-locked memory, excessive use can slow down the entire system (including other processes) as physical RAM is tied up. Therefore, page-locked memory should be used sparingly.
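The disk-to-device pattern above can be sketched as follows (assuming the cust crate and a CUDA-capable device; the file-reading step is elided and replaced with an in-place write):

```rust
use cust::memory::{CopyDestination, DeviceBuffer, LockedBuffer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _ctx = cust::quick_init()?;

    // Allocate 4 KiB of page-locked host memory and fill it in
    // place (in real code, read the file directly into it).
    let mut pinned = LockedBuffer::new(&0u8, 4096)?;
    pinned[..4].copy_from_slice(b"data");

    // The host-to-device copy can now DMA straight from the
    // pinned buffer, skipping the driver's internal staging copy.
    let device_buf = DeviceBuffer::from_slice(pinned.as_slice())?;
    drop(device_buf);
    Ok(())
}
```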
FFI Information
The internal representations of DevicePointer<T> and UnifiedPointer<T> are guaranteed to be the same as *mut T, and they can be safely passed through an FFI boundary to code expecting raw pointers (though keep in mind that device-only pointers cannot be dereferenced on the CPU). This is important when launching kernels written in C.
As with regular Rust, all other types (e.g. DeviceBuffer or UnifiedBox) are not FFI-safe. Their internal representations are not guaranteed to be anything in particular, and are not guaranteed to be the same in different versions of cust. If you need to pass them through an FFI boundary, you must convert them to FFI-safe primitives yourself. For example, with UnifiedBuffer, use the as_unified_ptr() and len() functions to get the primitives, and mem::forget() the buffer so that it isn't dropped. Again, as with regular Rust, the caller is responsible for reconstructing the UnifiedBuffer using from_raw_parts() and dropping it to ensure that the memory allocation is safely cleaned up.
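The forget/reconstruct dance described above might look like this (a sketch, assuming the cust crate and a CUDA-capable device; the FFI call itself is elided):

```rust
use cust::memory::UnifiedBuffer;
use std::mem;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _ctx = cust::quick_init()?;

    let mut buf = UnifiedBuffer::new(&0f32, 256)?;

    // Decompose into FFI-safe primitives.
    let ptr = buf.as_unified_ptr();
    let len = buf.len();
    // Don't run the destructor: the allocation must outlive `buf`.
    mem::forget(buf);

    // ... pass the raw pointer and `len` across the FFI boundary ...

    // Later, reconstruct the buffer so that dropping it frees the
    // allocation. Safety: `ptr` and `len` came from a forgotten
    // UnifiedBuffer, so they describe a valid allocation.
    let buf = unsafe { UnifiedBuffer::from_raw_parts(ptr, len) };
    drop(buf);
    Ok(())
}
```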
Modules
Routines for allocating and using CUDA Array Objects.
Structs
A pointer type for heap-allocation in CUDA device memory.
Fixed-size device-side buffer. Provides basic access to device memory.
A pointer to device memory.
Fixed-size device-side slice.
Wrapper around a variable on the host and a DeviceBox holding the variable on the device, allowing for easy synchronization and storage.
Page-locked box in host memory.
Fixed-size host-side buffer in page-locked memory.
A pointer type for heap-allocation in CUDA unified memory.
Fixed-size buffer in unified memory.
A pointer to unified memory.
Traits
Sealed trait implemented by types which can be the source or destination when copying data asynchronously to/from the device or from one device allocation to another.
Sealed trait implemented by types which can be the source or destination when copying data to/from the device or from one device allocation to another.
Marker trait for types which can safely be copied to or from a CUDA device.
A trait describing a region of memory on the device with a base pointer and a size, used to be generic over DeviceBox, DeviceBuffer, DeviceVariable etc.
A trait describing a generic pointer that can be accessed from the GPU. This could be either a UnifiedBox or a regular DeviceBox.
A trait describing a generic buffer that can be accessed from the GPU. This could be either a UnifiedBuffer or a regular DeviceBuffer.
Functions for advising the driver about certain uses of unified memory. Such as advising the driver to prefetch memory or to treat memory as read-mostly.
Functions
Free memory allocated with cuda_malloc.
Unsafe wrapper around cuMemFreeAsync which queues a memory free operation on a stream. Retains all of the unsafe semantics of cuda_free, with the extra requirement that the memory must not be used after it is freed. Therefore, proper stream ordering semantics must be respected.
Free page-locked memory allocated with cuda_malloc_host.
Free memory allocated with cuda_malloc_unified.
Unsafe wrapper around the cuMemAlloc function, which allocates some device memory and returns a DevicePointer pointing to it. The memory is not cleared.
Unsafe wrapper around cuMemAllocAsync which queues a memory allocation operation on a stream. Retains all of the unsafe semantics of cuda_malloc, with the extra requirement that the memory must not be used until it is allocated on the stream. Therefore, proper stream ordering semantics must be respected.
Unsafe wrapper around the cuMemAllocHost function, which allocates some page-locked host memory and returns a raw pointer pointing to it. The memory is not cleared.
Unsafe wrapper around the cuMemAllocManaged function, which allocates some unified memory and returns a UnifiedPointer pointing to it. The memory is not cleared.
Get the current free and total memory.
Simple wrapper over cuMemcpyHtoD_v2.