Struct DeviceMemory

Source

pub struct DeviceMemory<T> { /* private fields */ }

Represents a region of owned CUDA device memory for elements of type T.

Implementations§

Source §

impl<T> DeviceMemory<T>

Associated utility functions.

Source

pub unsafe fn alloc(count: usize) -> Result<*mut T>

Allocates linear device memory for count elements of T (count * size_of::<T>() bytes) and returns a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. DeviceMemory::alloc returns crate::error::Status::OutOfMemory on allocation failure.

The device version of DeviceMemory::free cannot be used with a pointer allocated using the host API, and vice versa.

§Errors

Returns an error if the requested byte size overflows, CUDA cannot allocate device memory, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics such as crate::error::Status::NotInitialized, crate::error::Status::CallRequiresNewerDriver, or crate::error::Status::NoDevice.

§Safety

The returned pointer is uninitialized device memory. The caller must use it only for count elements of T and eventually free it with a compatible CUDA free function.
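
The overflow case listed under Errors arises when an element count is converted into a byte size. A minimal host-side sketch of that check follows. It uses only the standard library, and checked_byte_len is a hypothetical helper name, not a function of this crate.

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of this crate): converts an element count
/// into a byte size, returning `None` if `count * size_of::<T>()` would
/// overflow `usize`.
pub fn checked_byte_len<T>(count: usize) -> Option<usize> {
    count.checked_mul(size_of::<T>())
}
```

For example, checked_byte_len::<u32>(4) yields Some(16), while checked_byte_len::<u64>(usize::MAX) yields None, which a wrapper could map to an error before calling into CUDA.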

Source

pub unsafe fn alloc_managed( count: usize, flags: MemoryAttachFlags, ) -> Result<*mut T>

Source

pub unsafe fn free(ptr: *mut T) -> Result<()>

Frees the memory space pointed to by ptr, which must have been returned by a previous call to one of these allocation functions: DeviceMemory::alloc, sys::cudaMallocPitch, DeviceMemory::alloc_managed, DeviceMemory::alloc_async, or sys::cudaMallocFromPoolAsync.

This does not perform implicit synchronization when the pointer was allocated with DeviceMemory::alloc_async or sys::cudaMallocFromPoolAsync. Callers must ensure that all accesses to this pointer have completed before invoking DeviceMemory::free. For best performance and memory reuse, use DeviceMemory::free_async to free memory allocated via the stream ordered memory allocator. For all other pointers, this call may perform implicit synchronization.

If DeviceMemory::free has already been called on ptr, an error is returned. If ptr is null, no operation is performed.

The device version of DeviceMemory::free cannot be used with a pointer allocated using the host API, and vice versa.

§Errors

Returns an error if CUDA cannot free ptr, ptr has already been freed, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

ptr must be null or a live allocation returned by a compatible CUDA device allocation function, and no work may access it after it is freed.
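
The contract above (null is a no-op, a live pointer is freed once, a second free is an error) can be modelled on the host without CUDA. The sketch below is only an illustration of that contract. MockAllocator and its integer "addresses" are hypothetical and do not reflect how this crate or the CUDA runtime tracks allocations.

```rust
use std::collections::HashSet;

/// Hypothetical host-side model of an allocator (not this crate's API).
/// Addresses are plain integers; 0 plays the role of a null pointer.
pub struct MockAllocator {
    live: HashSet<usize>,
    next: usize,
}

impl MockAllocator {
    pub fn new() -> Self {
        Self { live: HashSet::new(), next: 0x1000 }
    }

    /// Hands out a fresh non-zero address and records it as live.
    pub fn alloc(&mut self) -> usize {
        let addr = self.next;
        self.next += 0x100;
        self.live.insert(addr);
        addr
    }

    /// Freeing null is a no-op; freeing an address that is not live
    /// (never allocated, or already freed) is reported as an error.
    pub fn free(&mut self, addr: usize) -> Result<(), &'static str> {
        if addr == 0 {
            return Ok(());
        }
        if self.live.remove(&addr) {
            Ok(())
        } else {
            Err("address is not a live allocation")
        }
    }
}
```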

Source

pub unsafe fn copy( dst: *mut T, src: *const T, count: usize, kind: MemoryCopyKind, ) -> Result<()>

Copies count elements from src to dst. The transfer direction is specified by MemoryCopyKind. MemoryCopyKind::Default is recommended when unified virtual addressing is available, in which case the transfer direction is inferred from the pointer values. Calling DeviceMemory::copy with dst and src pointers that do not match the direction of the copy results in undefined behavior.

Exhibits synchronous behavior for most use cases.

Each memory region passed to this call must be either entirely registered with CUDA or, in the case of host pageable transfers, not registered at all. A region that spans both registered and unregistered allocations is not supported, and the call returns crate::error::Status::InvalidValue.

§Errors

Returns an error if the requested byte count overflows, CUDA rejects the pointer combination or copy kind, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

src and dst must be valid for count elements of T according to kind, and the source and destination regions must not overlap unless CUDA permits that transfer.
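
The non-overlap requirement can be checked with simple address arithmetic when both pointers live in one address space (for example two host pointers, or any pointers under unified virtual addressing). The sketch below is a host-side illustration only. regions_overlap is a hypothetical helper, not part of this crate, and it never dereferences the pointers.

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of this crate): returns true if the byte
/// ranges covering `count` elements of `T` starting at `dst` and at `src`
/// share at least one byte. Only the addresses are compared.
pub fn regions_overlap<T>(dst: *const T, src: *const T, count: usize) -> bool {
    let bytes = count.saturating_mul(size_of::<T>());
    let d = dst as usize;
    let s = src as usize;
    // Two half-open ranges [d, d + bytes) and [s, s + bytes) intersect
    // exactly when each one starts before the other one ends.
    d < s.saturating_add(bytes) && s < d.saturating_add(bytes)
}
```

A zero-length copy never overlaps under this definition, because both ranges are empty.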

Source

pub unsafe fn set(dst: *mut T, value: u8, count: usize) -> Result<()>

Fills the first count elements (count * size_of::<T>() bytes) of the memory area pointed to by dst with the constant byte value.

This call is asynchronous with respect to the host unless dst refers to pinned host memory.

See the CUDA memset synchronization rules for when this operation blocks the host.

§Errors

Returns an error if the requested byte count overflows, CUDA rejects the pointer or size, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

dst must be valid for writes of count * size_of::<T>() bytes and must refer to memory that CUDA can memset.
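
Per the Safety section, the fill covers count * size_of::<T>() bytes, and value is a single byte, so the fill is byte-wise rather than element-wise. The host-side sketch below shows the resulting bit pattern using std::ptr::write_bytes, which has the same "count of elements, one byte value" shape. It does not call CUDA, and fill_u32_bytes is an illustrative name, not part of this crate.

```rust
use std::ptr;

/// Host-side analogue of a byte-wise fill (illustrative only): every byte
/// of every `u32` in `dst` is set to `value`.
pub fn fill_u32_bytes(dst: &mut [u32], value: u8) {
    // SAFETY: `dst` is a valid, exclusively borrowed slice, and every bit
    // pattern is a valid `u32`, so overwriting its bytes is sound.
    unsafe { ptr::write_bytes(dst.as_mut_ptr(), value, dst.len()) }
}
```

Filling with 0x01 therefore produces 0x0101_0101 in each u32, not 1. Only 0x00 and 0xFF give the "obvious" element values 0 and u32::MAX.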

Source

pub unsafe fn alloc_host(size: usize) -> Result<*mut ()>

Source

pub unsafe fn free_host(ptr: *mut ()) -> Result<()>

Frees host memory returned by DeviceMemory::alloc_host or DeviceMemory::alloc_pinned.

§Errors

Returns an error if CUDA cannot free the host allocation, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

ptr must be null or a live host allocation returned by a compatible CUDA host allocation function.

Source

pub unsafe fn alloc_pinned( size: usize, flags: HostAllocationFlags, ) -> Result<*mut ()>

Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the allocated virtual memory ranges and automatically accelerates calls such as DeviceMemory::copy. Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, use this sparingly to allocate staging areas for data exchange between host and device.

flags selects allocation options:

HostAllocationFlags::DEFAULT: equivalent to DeviceMemory::alloc_host.
HostAllocationFlags::PORTABLE: the memory returned by this call is considered pinned memory by all CUDA contexts, not just the one that performed the allocation.
HostAllocationFlags::MAPPED: maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling sys::cudaHostGetDevicePointer.
HostAllocationFlags::WRITE_COMBINED: allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers written by the CPU and read by the device via mapped pinned memory or host->device transfers.

All of these flags are orthogonal to one another: a developer may allocate memory that is portable, mapped and/or write-combined with no restrictions.

For HostAllocationFlags::MAPPED to have any effect, the CUDA context must support ContextFlags::MAP_HOST, which can be checked via Device::flags. ContextFlags::MAP_HOST is implicitly set for contexts created via the runtime API.

HostAllocationFlags::MAPPED may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to sys::cudaHostGetDevicePointer because the memory may be mapped into other CUDA contexts via HostAllocationFlags::PORTABLE.

Memory allocated by this method must be freed with DeviceMemory::free_host.

§Errors

Returns an error if CUDA cannot allocate pinned host memory, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

The returned pointer is uninitialized host memory. The caller must ensure it is accessed within size bytes and freed with DeviceMemory::free_host.
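
Because the allocation flags are orthogonal, they combine like independent bits. The sketch below models that with a small stand-in type. Flags is hypothetical and is not this crate's HostAllocationFlags. The bit values mirror the CUDA runtime's cudaHostAlloc* constants, and whether the crate uses the same representation is an assumption.

```rust
use std::ops::BitOr;

/// Hypothetical stand-in for a flags type (not this crate's definition).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Flags(u32);

impl Flags {
    pub const DEFAULT: Flags = Flags(0);
    pub const PORTABLE: Flags = Flags(1 << 0);
    pub const MAPPED: Flags = Flags(1 << 1);
    pub const WRITE_COMBINED: Flags = Flags(1 << 2);

    /// True if every bit set in `other` is also set in `self`.
    pub fn contains(self, other: Flags) -> bool {
        self.0 & other.0 == other.0
    }
}

impl BitOr for Flags {
    type Output = Flags;

    /// Combines two flag sets; each option occupies its own bit, so any
    /// combination is representable.
    fn bitor(self, rhs: Flags) -> Flags {
        Flags(self.0 | rhs.0)
    }
}
```

With this model, Flags::PORTABLE | Flags::MAPPED contains both options and not WRITE_COMBINED, and OR-ing in DEFAULT changes nothing.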

Source

pub unsafe fn register_host( ptr: *mut (), size: usize, flags: HostRegisterFlags, ) -> Result<()>

Page-locks the memory range specified by ptr and size, and maps it for the devices selected by flags. This memory range also is added to the same tracking mechanism as DeviceMemory::alloc_pinned to automatically accelerate calls to functions such as DeviceMemory::copy. Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory that has not been registered. Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, use this sparingly to register staging areas for data exchange between host and device.

On systems where DeviceProperties::pageable_memory_access_uses_host_page_tables is enabled, DeviceMemory::register_host does not page-lock the memory range specified by ptr and instead only populates unpopulated pages.

DeviceMemory::register_host is supported only on I/O coherent devices where DeviceProperties::host_register_supported is enabled.

flags selects registration options:

HostRegisterFlags::DEFAULT: on a system with unified virtual addressing, the memory is both mapped and portable. On a system with no unified virtual addressing, the memory is neither mapped nor portable.
HostRegisterFlags::PORTABLE: the memory returned by this call is considered pinned memory by all CUDA contexts, not just the one that performed the allocation.
HostRegisterFlags::MAPPED: maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling sys::cudaHostGetDevicePointer.
HostRegisterFlags::IO_MEMORY: the passed memory pointer is treated as pointing to some memory-mapped I/O space, for example belonging to a third-party PCIe device, and it is marked as non-cache-coherent and contiguous.
HostRegisterFlags::READ_ONLY: the passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms without DeviceProperties::pageable_memory_access_uses_host_page_tables, this flag is required to register memory mapped to the CPU as read-only. Query support with DeviceProperties::host_register_read_only_supported. Using this flag with a current context associated with a device that does not have this attribute set makes DeviceMemory::register_host return crate::error::Status::NotSupported.

All of these flags are orthogonal to one another: a developer may page-lock memory that is portable or mapped with no restrictions.

The CUDA context must have been created with ContextFlags::MAP_HOST for HostRegisterFlags::MAPPED to have any effect.

HostRegisterFlags::MAPPED may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to sys::cudaHostGetDevicePointer because the memory may be mapped into other CUDA contexts via HostRegisterFlags::PORTABLE.

On devices where DeviceProperties::can_use_host_pointer_for_registered_mem is enabled, the memory can also be accessed from the device using the original host pointer. Whether the device pointer returned by sys::cudaHostGetDevicePointer matches the original host pointer depends on the devices visible to the application. If every visible device has this property enabled, the returned device pointer matches the original pointer. If any visible device has it disabled, the returned device pointer does not match the original host pointer, but is suitable for use on all devices provided unified virtual addressing is enabled. On such systems, devices that have the property enabled may access the memory through either pointer, but each such device must use only one of the two pointers, not both.

The memory page-locked by this method must be unregistered with DeviceMemory::unregister_host.

§Errors

Returns an error if CUDA cannot register the host range, the pointer, size, or flags are invalid, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

ptr..ptr + size must be a valid host memory range and must remain valid until it is unregistered.

Source

pub unsafe fn unregister_host(ptr: *mut ()) -> Result<()>

Unmaps the memory range whose base address is specified by ptr, and makes it pageable again.

The base address must be the same one specified to DeviceMemory::register_host.

§Errors

Returns an error if CUDA cannot unregister the host range, ptr is not the base address of a registered range, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

ptr must be the base address of a host range registered with DeviceMemory::register_host and must not be unregistered twice.

Source

pub fn memory_info() -> Result<(usize, usize)>

Returns the total amount of memory available to the current context and the amount of memory free on the device. CUDA is not guaranteed to be able to allocate all of the memory that the OS reports as free. In a multi-tenant situation, the free-memory estimate is prone to a race condition: an allocation or free by another process or thread between estimation and reporting can make the reported free value differ from actual free memory.

The integrated GPU on Tegra shares memory with the CPU and other components of the SoC. The free and total values returned by this call exclude the swap space maintained by the OS on some platforms. The OS may move some memory pages into the swap area as the GPU or CPU allocates or accesses memory. See the Tegra app note for how to calculate total and free memory on Tegra.

§Errors

Returns an error if CUDA cannot query memory information, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.
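
Since the reported free value is only an estimate, one possible way to act on it is to keep some headroom rather than requesting everything that appears free. The sketch below is a hypothetical policy helper, not part of this crate. The percentage is an arbitrary caller choice, not a CUDA recommendation, and an allocation sized this way can still fail.

```rust
/// Hypothetical helper (not part of this crate): given a reported free-byte
/// estimate, returns a request size that keeps `headroom_percent` of it in
/// reserve. Percentages above 100 are treated as 100.
pub fn allocation_budget(reported_free: usize, headroom_percent: u8) -> usize {
    let keep = u128::from(headroom_percent.min(100));
    // Widen to u128 so the multiplication cannot overflow.
    let usable = (reported_free as u128) * (100 - keep) / 100;
    usable as usize
}
```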

Source

pub fn pointer_attributes(ptr: *const T) -> Result<PointerAttributes>

Returns the attributes of ptr. If ptr was not allocated in, mapped by, or registered with a context that supports unified addressing, crate::error::Status::InvalidValue is returned.

In CUDA 11.0 and later, passing an unregistered host pointer reports MemoryType::Unregistered in PointerAttributes::memory_type.

PointerAttributes::memory_type identifies the type of memory. It can be MemoryType::Unregistered for unregistered host memory, MemoryType::Host for registered host memory, MemoryType::Device for device memory, or MemoryType::Managed for managed memory.
PointerAttributes::device is the device against which ptr was allocated. If ptr has memory type MemoryType::Device, this identifies the device on which the memory physically resides. If ptr has memory type MemoryType::Host, this identifies the device that was current when the allocation was made, and if that device is deinitialized then this allocation will vanish with that device’s state.
PointerAttributes::device_pointer is the device pointer alias through which the memory referred to by ptr may be accessed on the current device. If the memory referred to by ptr cannot be accessed directly by the current device then this is null.
PointerAttributes::host_pointer is the host pointer alias through which the memory referred to by ptr may be accessed on the host. If the memory referred to by ptr cannot be accessed directly by the host then this is null.

§Errors

Returns an error if CUDA cannot query attributes for ptr, ptr is not known to a unified-addressing context, or CUDA reports runtime initialization diagnostics.

Source

pub unsafe fn alloc_async(count: usize, stream: &Stream) -> Result<*mut T>

Source

pub unsafe fn free_async(ptr: *mut T, stream: &Stream) -> Result<()>

Inserts a free operation into stream. The allocation must not be accessed after stream execution reaches the free. After this call returns, accessing the memory from any subsequent work launched on the GPU or querying its pointer attributes results in undefined behavior.

During stream capture, this creates a free node and must therefore be passed the address of a graph allocation.

§Errors

Returns an error if CUDA cannot enqueue the free on stream, ptr is invalid for asynchronous freeing, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

ptr must be null or a live stream-ordered CUDA allocation. No work may access it after stream reaches the enqueued free.

Source

pub unsafe fn copy_async( dst: *mut T, src: *const T, count: usize, kind: MemoryCopyKind, stream: &Stream, ) -> Result<()>

Source

pub unsafe fn set_async( dst: *mut T, value: u8, count: usize, stream: &Stream, ) -> Result<()>

Fills the first count elements (count * size_of::<T>() bytes) of the memory area pointed to by dst with the constant byte value.

DeviceMemory::set_async is asynchronous with respect to the host, so the call may return before the memset is complete. The operation is enqueued on stream. If stream is not the default stream, the operation may overlap with operations in other streams.

The device version only handles device-to-device copies and cannot be given local or shared pointers.

See the CUDA memset synchronization rules for when this operation blocks the host.

§Errors

Returns an error if the requested byte count overflows, CUDA cannot enqueue the memset on stream, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

§Safety

dst must be valid for writes of count * size_of::<T>() bytes until stream reaches the enqueued memset.

Source

pub fn prefetch_async( ptr: DevicePtr, count: usize, location: MemoryLocation, stream: &Stream, ) -> Result<()>

Prefetches memory to the specified destination location. ptr is the base device pointer of the memory to be prefetched, location specifies the destination location, count specifies the number of bytes to copy, and stream is the stream in which the operation is enqueued. The memory range must refer to managed memory allocated via DeviceMemory::alloc_managed or declared via __managed__ variables. It may also refer to memory allocated from a managed memory pool, or to system-allocated memory on systems where DeviceProperties::pageable_memory_access is enabled.

Setting MemoryLocation::kind to MemoryLocationKind::Device prefetches memory to the GPU identified by MemoryLocation::id. That device, and the device associated with stream, must support concurrent managed access. Setting MemoryLocation::kind to MemoryLocationKind::Host prefetches data to host memory. Applications can request prefetching memory to a specific host NUMA node by using MemoryLocationKind::Numa with a valid NUMA node identifier, or to the NUMA node closest to the current thread’s CPU by using MemoryLocationKind::NumaCurrent. When MemoryLocation::kind is MemoryLocationKind::Host or MemoryLocationKind::NumaCurrent, MemoryLocation::id is ignored.

The start and end addresses of the memory range are rounded down and up, respectively, to CPU page-size alignment before the prefetch operation is enqueued in the stream.
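
The outward rounding described above can be sketched with plain integer arithmetic. page_aligned_range is a hypothetical helper, not part of this crate, and the page size is passed in explicitly because the real CPU page size is platform-dependent.

```rust
/// Hypothetical helper (not part of this crate): rounds the byte range
/// [start, start + len) outward to page boundaries, returning
/// (rounded_start, rounded_end). Returns `None` if `page_size` is not a
/// power of two or the arithmetic would overflow.
pub fn page_aligned_range(
    start: usize,
    len: usize,
    page_size: usize,
) -> Option<(usize, usize)> {
    if !page_size.is_power_of_two() {
        return None;
    }
    let mask = page_size - 1;
    // Round the start address down to the containing page boundary.
    let begin = start & !mask;
    // Round the end address up to the next page boundary.
    let end = start.checked_add(len)?.checked_add(mask)? & !mask;
    Some((begin, end))
}
```

With a 4096-byte page, a 256-byte range starting at 0x1234 becomes the whole page 0x1000..0x2000, so a prefetch of a small range may touch more memory than the count argument suggests.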

If no physical memory has been allocated for this region, CUDA populates and maps it on the destination device. If there is insufficient memory to prefetch the desired region, the Unified Memory driver may evict pages from other DeviceMemory::alloc_managed allocations to host memory to make room. Device memory allocated using DeviceMemory::alloc or sys::cudaMallocArray is not evicted.

By default, mappings to the previous location of the migrated pages are removed and mappings for the new location are only set up at the destination. The exact behavior also depends on the settings applied to this memory range via cuMemAdvise as described below:

If read-mostly advice was set on any subset of this memory range, then that subset will create a read-only copy of the pages at the destination location. If the destination location is a host NUMA node, any pages of that subset that are already in another host NUMA node are transferred to the destination.

If preferred-location advice was set on any subset of this memory range, then the pages will migrate to location even if it is not the preferred location of every page in the range.

If accessed-by advice was set on any subset of this memory range, then mappings to those pages from all appropriate processors are updated to refer to the new location if establishing such a mapping is possible. Otherwise, those mappings are cleared.

This is not required for correctness; it improves performance by allowing the application to migrate data to a suitable location before access. Memory accesses to this range are always coherent and are allowed even when the data is actively being migrated.

This call is asynchronous with respect to the host and all work on other devices.

§Errors

Returns an error if CUDA cannot enqueue the prefetch on stream, the memory range or destination location is invalid, a previous asynchronous launch reports an error, or CUDA reports runtime initialization diagnostics.

Source §

impl<T> DeviceMemory<T>

Source

pub unsafe fn from_raw_parts(ptr: *mut T, length: usize) -> Self

Takes ownership of an existing device allocation.

§Safety

ptr must be null for an empty allocation or point to length live elements allocated by cudaMallocManaged or another CUDA allocation function compatible with cudaFree. length * size_of::<T>() must fit in usize. No other owner may free the pointer while the returned value is alive.

Source

pub unsafe fn from_slice_async(v: &[T], stream: &Stream) -> Result<Self>

§Safety

The caller must ensure v remains valid and unmodified until stream has completed the transfer.

§Errors

Returns an error if CUDA cannot allocate device memory or enqueue the host-to-device copy.

Source

pub const fn len(&self) -> usize

Source

pub const fn is_empty(&self) -> bool

Source

pub fn byte_len(&self) -> usize

Source

pub const fn as_ptr(&self) -> *const T

Source

pub const fn as_mut_ptr(&self) -> *mut T

Source

pub fn copy_from_host(&mut self, host_slice: &[T]) -> Result<()>

Source

pub fn copy_from_host_async<'scope, 'env>( &mut self, host_slice: &'env [T], stream: &StreamScope<'scope, 'env>, ) -> Result<()>

Source

pub unsafe fn copy_from_host_async_unchecked( &mut self, host_slice: &[T], stream: &Stream, ) -> Result<()>

§Safety

The caller must ensure self and host_slice both remain valid until stream has completed the transfer.
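
The unchecked variant leaves it to the caller to keep host_slice alive until the stream finishes. A scoped API can enforce the same rule through lifetimes instead. The std-only sketch below shows the idea with std::thread::scope, where a worker thread stands in for the stream. That StreamScope behaves the same way is an assumption based on its 'scope and 'env parameters, and scoped_copy is an illustrative name.

```rust
use std::thread;

/// Illustrative only: copies `src` into a new buffer on a worker thread.
/// The worker borrows `src` and `dst` directly; `thread::scope` joins the
/// worker before returning, so the borrows cannot outlive the data.
pub fn scoped_copy(src: &[u32]) -> Vec<u32> {
    let mut dst = vec![0u32; src.len()];
    thread::scope(|s| {
        // Stand-in for enqueuing an asynchronous transfer.
        s.spawn(|| dst.copy_from_slice(src));
    });
    // All work started inside the scope has completed here.
    dst
}
```

The compiler rejects any attempt to drop or mutate src while the scope is still running, which is the guarantee the unchecked variant asks the caller to uphold manually.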

Source

pub unsafe fn copy_from_host_operation<'a>( &'a mut self, host_slice: &'a [T], ) -> Result<MemoryCopyOperation<'a, T>>

Returns a capture operation that copies from host memory into this device allocation.

§Safety

Capturing this operation stores the host and device pointer addresses in the resulting CUDA graph. The caller must ensure self and host_slice remain valid whenever a captured graph using this operation is launched. The destination allocation must remain exclusive for the work ordered by those launches.

Source

pub fn copy_to_host(&self, host_slice: &mut [T]) -> Result<()>

Source

pub fn copy_to_host_async<'scope, 'env>( &self, host_slice: &'env mut [T], stream: &StreamScope<'scope, 'env>, ) -> Result<()>

Source

pub unsafe fn copy_to_host_async_unchecked( &self, host_slice: &mut [T], stream: &Stream, ) -> Result<()>

§Safety

The caller must ensure self and host_slice both remain valid until stream has completed the transfer.

Source

pub unsafe fn copy_to_host_operation<'a>( &'a self, host_slice: &'a mut [T], ) -> Result<MemoryCopyOperation<'a, T>>

Returns a capture operation that copies this allocation into host memory.

§Safety

Capturing this operation stores the device and host pointer addresses in the resulting CUDA graph. The caller must ensure self and host_slice remain valid whenever a captured graph using this operation is launched. The host destination must remain exclusive for the work ordered by those launches.

Source

pub fn copy_to_host_vec(&self) -> Result<Vec<T>>

Source

pub fn copy_from_device(&mut self, src: &Self) -> Result<()>

Source

pub fn copy_from_device_async<'scope, 'env>( &mut self, src: &Self, stream: &StreamScope<'scope, 'env>, ) -> Result<()>

Source

pub unsafe fn copy_from_device_async_unchecked( &mut self, src: &Self, stream: &Stream, ) -> Result<()>

§Safety

The caller must ensure self and src both remain valid until stream has completed the transfer.

Source

pub unsafe fn copy_from_device_operation<'a>( &'a mut self, src: &'a Self, ) -> Result<MemoryCopyOperation<'a, T>>

Returns a capture operation that copies from another device allocation into this allocation.

§Safety

Capturing this operation stores both device pointer addresses in the resulting CUDA graph. The caller must ensure self and src remain valid whenever a captured graph using this operation is launched. The destination allocation must remain exclusive for the work ordered by those launches.

Source

pub fn set_zeroes(&mut self) -> Result<()>

Source

pub fn set_value(&mut self, value: u8) -> Result<()>

Source

pub fn set_value_async<'scope, 'env>( &mut self, value: u8, stream: &StreamScope<'scope, 'env>, ) -> Result<()>

Source

pub unsafe fn set_value_async_unchecked( &mut self, value: u8, stream: &Stream, ) -> Result<()>

§Safety

The caller must ensure self remains valid until stream has completed the memset.

§Errors

Returns an error if CUDA cannot enqueue the memset on stream.

Source

pub unsafe fn set_value_operation<'a>( &'a mut self, value: u8, ) -> MemorySetOperation<'a, T>

Returns a capture operation that fills this device allocation with value.

§Safety

Capturing this operation stores this allocation’s pointer address in the resulting CUDA graph. The caller must ensure self remains valid and exclusive whenever a captured graph using this operation is launched.

Source

pub fn ipc_handle(&self) -> Result<IpcMemoryHandle>

Exports this allocation for use in another process. The allocation must be the base of an existing device memory allocation created with DeviceMemory::alloc. This is a lightweight operation and may be called multiple times on an allocation without adverse effects.

If a region of memory is freed with DeviceMemory::free and a subsequent call to DeviceMemory::alloc returns memory with the same device address, DeviceMemory::ipc_handle returns a unique handle for the new memory.

IPC is restricted to devices with unified-addressing support on Linux and Windows. IPC on Windows is supported for compatibility but is not recommended because of its performance cost. Check device IPC support through the device properties exposed by this crate, for example DeviceProperties::ipc_event_supported.