pub struct DeviceBuffer<T: DeviceRepr> { /* private fields */ }
Owned, typed allocation of device memory.
The underlying bytes are freed when the buffer drops. Neither Clone nor
Copy is implemented, deliberately: duplicating a buffer means copying
len * size_of::<T>() bytes of device memory, which is not free, so
baracuda makes the user spell it out as an explicit stream-ordered
D2D memcpy.
Implementations
impl<T: DeviceRepr> DeviceBuffer<T>
pub fn new(context: &Context, len: usize) -> Result<Self>
Allocate an uninitialized buffer of len elements on the given context’s device.
len == 0 (or a zero-sized T) short-circuits: CUDA rejects 0-byte
allocations with CUDA_ERROR_INVALID_VALUE, so we produce a sentinel
null-pointer buffer. Drop knows to skip the free on such buffers,
and every copy method below treats len == 0 as a no-op.
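The sentinel behaviour described above can be modelled on the host without touching CUDA. The following is a minimal sketch, not baracuda's implementation: `SentinelBuffer`, `FakeDevicePtr`, `FREE_CALLS`, and the address `0x1000` are all hypothetical names and values, and the "free" is just a counter increment so the Drop path can be observed.

```rust
use std::marker::PhantomData;
use std::mem::size_of;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a device pointer (a plain integer address).
type FakeDevicePtr = u64;

/// Counts stand-in "free" calls so the Drop behaviour can be observed.
static FREE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Host-only model of an owned, typed allocation (assumed layout).
struct SentinelBuffer<T> {
    ptr: FakeDevicePtr, // 0 marks the sentinel "nothing allocated" state
    len: usize,
    _marker: PhantomData<T>,
}

impl<T> SentinelBuffer<T> {
    fn new(len: usize) -> Self {
        let bytes = len * size_of::<T>();
        // A zero-byte request never reaches the allocator: hand back the
        // sentinel instead. 0x1000 is a made-up non-null address.
        let ptr = if bytes == 0 { 0 } else { 0x1000 };
        SentinelBuffer { ptr, len, _marker: PhantomData }
    }

    fn len(&self) -> usize {
        self.len
    }

    fn is_sentinel(&self) -> bool {
        self.ptr == 0
    }
}

impl<T> Drop for SentinelBuffer<T> {
    fn drop(&mut self) {
        // Sentinel buffers own nothing, so the free is skipped.
        if self.ptr != 0 {
            FREE_CALLS.fetch_add(1, Ordering::SeqCst);
        }
    }
}
```

In this model both `SentinelBuffer::<f32>::new(0)` and `SentinelBuffer::<()>::new(8)` take the sentinel path, because the byte count is zero in each case.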
pub fn new_async(context: &Context, len: usize, stream: &Stream) -> Result<Self>
Allocate len elements asynchronously on stream using the
device’s default memory pool. Requires CUDA 11.2+.
Unlike new, this call doesn’t block — the
allocation becomes usable for any subsequent operation on stream
in stream order. Use free_async to reclaim
on the same stream, or let Drop reclaim synchronously.
pub fn free_async(self, stream: &Stream) -> Result<()>
Free self asynchronously on stream. The free is stream-ordered: the
buffer becomes invalid once this call completes on the device, after
the work previously submitted to stream. Consumes self so Drop does
not also try to free.
Requires CUDA 11.2+.
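The "consumes self so Drop does not also try to free" pattern is a common Rust idiom and can be sketched without CUDA. This is a host-only illustration under stated assumptions, not baracuda's code: `OwnedAlloc`, `SYNC_FREES`, and `ASYNC_FREES` are hypothetical, the stream argument is omitted, and both frees are modelled as counter increments.

```rust
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counters standing in for the synchronous and stream-ordered free calls.
static SYNC_FREES: AtomicUsize = AtomicUsize::new(0);
static ASYNC_FREES: AtomicUsize = AtomicUsize::new(0);

/// Host-only model of an owning handle (hypothetical type).
struct OwnedAlloc {
    ptr: u64, // 0 would be the "nothing to free" sentinel
}

impl OwnedAlloc {
    /// Takes `self` by value, so the caller cannot use the handle afterwards.
    fn free_async(self) {
        // ManuallyDrop suppresses the Drop impl for this value, which
        // prevents a second (synchronous) free of the same pointer.
        let this = ManuallyDrop::new(self);
        if this.ptr != 0 {
            ASYNC_FREES.fetch_add(1, Ordering::SeqCst);
        }
    }
}

impl Drop for OwnedAlloc {
    fn drop(&mut self) {
        // Fallback path: reached only if free_async was never called.
        if self.ptr != 0 {
            SYNC_FREES.fetch_add(1, Ordering::SeqCst);
        }
    }
}
```

Because `free_async` takes ownership, any later use of the handle is a compile error rather than a use-after-free at run time.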
pub fn zeros(context: &Context, len: usize) -> Result<Self>
Allocate and fill with zero bytes. Zero-length allocations are a
no-op (no cuMemsetD8 call is issued).
pub fn zero(&self) -> Result<()>
Synchronously fill this buffer with zero bytes via cuMemsetD8.
Empty buffers are a no-op (no FFI call). Use this to reuse an
existing allocation when you want zeroed contents without paying
the allocation cost a second time.
pub fn zero_async(&self, stream: &Stream) -> Result<()>
Stream-ordered zero-fill via cuMemsetD8Async. Empty buffers are
a no-op. The fill is ordered with respect to other work submitted
to stream; synchronize the stream before reading from the host.
pub fn from_slice(context: &Context, src: &[T]) -> Result<Self>
Allocate and copy src synchronously from host memory. Empty
slices produce a sentinel zero-length buffer (no CUDA calls).
pub fn copy_from_host(&self, src: &[T]) -> Result<()>
Synchronous H2D copy. src.len() must equal self.len().
No-op when the buffer is empty — no cuMemcpy is issued.
pub fn copy_to_host(&self, dst: &mut [T]) -> Result<()>
Synchronous D2H copy. dst.len() must equal self.len().
No-op on empty buffers.
pub fn copy_from_host_async(&self, src: &[T], stream: &Stream) -> Result<()>
Asynchronous H2D copy on stream. No-op on empty buffers.
pub fn copy_to_host_async(&self, dst: &mut [T], stream: &Stream) -> Result<()>
Asynchronous D2H copy on stream. No-op on empty buffers.
pub fn copy_to_device(&self, dst: &DeviceBuffer<T>) -> Result<()>
Device-to-device copy into another buffer of the same length. No-op on empty buffers.
pub fn copy_to_device_async(&self, dst: &DeviceBuffer<T>, stream: &Stream) -> Result<()>
Asynchronous device-to-device copy on stream. No-op on empty buffers.
pub fn as_raw(&self) -> CUdeviceptr
Raw device pointer. Use with care — baracuda still owns the allocation.
pub fn as_slice(&self) -> DeviceSlice<'_, T>
Borrow the whole buffer as a DeviceSlice<'_, T>.
pub fn as_slice_mut(&mut self) -> DeviceSliceMut<'_, T>
Borrow the whole buffer as a DeviceSliceMut<'_, T>.
pub fn slice(&self, range: Range<usize>) -> DeviceSlice<'_, T>
Borrow a sub-range of the buffer as an immutable DeviceSlice.
Panics if the range is out of bounds or inverted. Element indices
are used — the byte offset is range.start * size_of::<T>().
let ctx = Context::new(&Device::get(0)?)?;
let buf: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, 1024)?;
let first_half = buf.slice(0..512);
let tail = buf.slice(512..1024);
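The element-to-byte arithmetic behind slice can be checked on the host. The helper below is a hypothetical sketch (`element_range_to_bytes` is not part of baracuda); it assumes only what the description states: indices count elements, the byte offset is range.start * size_of::<T>(), and out-of-bounds or inverted ranges panic.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Hypothetical helper: maps an element range within a buffer of `len`
/// elements of `T` to a `(byte_offset, byte_len)` pair.
///
/// Panics if the range is inverted or extends past `len`.
fn element_range_to_bytes<T>(len: usize, range: Range<usize>) -> (usize, usize) {
    assert!(range.start <= range.end, "slice range is inverted");
    assert!(range.end <= len, "slice range is out of bounds");
    let elem = size_of::<T>();
    // Offsets are scaled by the element size, never taken as raw bytes.
    (range.start * elem, (range.end - range.start) * elem)
}
```

For the 1024-element f32 buffer in the example above, the range 512..1024 maps to a byte offset of 2048 and a byte length of 2048.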
impl DeviceBuffer<u8>
pub fn view_as<U: DeviceRepr>(&self) -> DeviceSlice<'_, U>
Reinterpret the byte buffer as an immutable typed DeviceSlice<'_, U>.
The recommended primitive for layering safe typed APIs over a
byte-shaped storage substrate — e.g. a unified-binding table that
stores all device tensors as DeviceBuffer<u8> and only acquires
element types at the edges where it calls into typed CUDA
libraries.
Alignment is guaranteed: cuMemAlloc returns 256-byte-aligned
pointers, which satisfies any U: DeviceRepr we ship today and
any reasonable user type.
Panics
Panics if the buffer’s byte length isn’t an integer multiple of
size_of::<U>(). Zero-sized U produces a zero-length view.
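The length and alignment rules for view_as can be sketched as two small host-side checks. Both functions are hypothetical illustrations rather than baracuda internals: `typed_len` models the documented panic and the zero-sized case, and `is_aligned_for` shows why an address that is a multiple of 256 satisfies any alignment that divides 256.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: number of `U` elements in a view over `byte_len` bytes.
///
/// Panics if `byte_len` is not an integer multiple of `size_of::<U>()`.
/// A zero-sized `U` yields a zero-length view.
fn typed_len<U>(byte_len: usize) -> usize {
    let elem = size_of::<U>();
    if elem == 0 {
        return 0;
    }
    assert!(
        byte_len % elem == 0,
        "byte length {} is not a multiple of element size {}",
        byte_len,
        elem
    );
    byte_len / elem
}

/// Hypothetical helper: whether `addr` meets the alignment of `U`.
fn is_aligned_for<U>(addr: u64) -> bool {
    addr % (align_of::<U>() as u64) == 0
}
```

A 4096-byte buffer viewed as f32 has 1024 elements in this model, while a 10-byte buffer viewed as u32 trips the assertion.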
pub fn view_as_mut<U: DeviceRepr>(&mut self) -> DeviceSliceMut<'_, U>
Mutable counterpart of view_as.