
Module dlpack_cuda

GPU tensor passthrough via DLPack without CPU roundtrip.

Provides device-aware dispatch: CPU tensors are viewed zero-copy as ndarray, while CUDA/ROCm/Metal tensors are returned as dlpack_cuda::CudaTensorInfo without touching device memory.

When a DLPack capsule contains a tensor resident on CUDA (or another non-CPU device), naively trying to consume it as a CPU ndarray would either panic or silently trigger an unacceptable host-device copy.

This module provides:

  • CudaTensorInfo — metadata extracted from a CUDA DLPack tensor without triggering any data copy.
  • cuda_tensor_info_from_dltensor — pure-Rust function operating directly on a DLTensor; no Python runtime needed.
  • dlpack_auto_dispatch_f32 / dlpack_auto_dispatch_f64 — device-aware dispatch that returns an ndarray view for CPU tensors, or CudaTensorInfo for GPU tensors, with no host-device copy.
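As a sketch, a caller might consume the dispatch result like this. The variant and field names below mirror the descriptions above but are assumptions, not the module's exact definitions; in particular, the real CPU variant holds a zero-copy ndarray view rather than the `Vec<f32>` stand-in used here.

```rust
/// Illustrative mirror of the metadata-only GPU result (field names assumed).
struct CudaTensorInfo {
    device_id: i32,
    shape: Vec<i64>,
    byte_offset: u64,
}

/// Illustrative mirror of DLPackDispatchResult (variant names assumed).
enum DLPackDispatchResult {
    /// CPU tensor: host-visible data (Vec stands in for the zero-copy view).
    Cpu(Vec<f32>),
    /// GPU tensor: metadata only; no host-device copy was performed.
    Gpu(CudaTensorInfo),
}

/// A caller branches on the device without ever touching GPU memory.
fn describe(result: &DLPackDispatchResult) -> String {
    match result {
        DLPackDispatchResult::Cpu(data) => {
            format!("cpu tensor, {} elements", data.len())
        }
        DLPackDispatchResult::Gpu(info) => {
            format!("gpu tensor on device {}, shape {:?}", info.device_id, info.shape)
        }
    }
}

fn main() {
    let gpu = DLPackDispatchResult::Gpu(CudaTensorInfo {
        device_id: 0,
        shape: vec![2, 3],
        byte_offset: 0,
    });
    assert_eq!(describe(&gpu), "gpu tensor on device 0, shape [2, 3]");

    let cpu = DLPackDispatchResult::Cpu(vec![1.0, 2.0]);
    assert_eq!(describe(&cpu), "cpu tensor, 2 elements");
}
```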

§Design

The DLPack standard defines device type codes:

Code  Device
 1    CPU
 2    CUDA
 3    CUDA pinned host
 4    OpenCL
 7    Vulkan
 8    Metal
10    ROCm

For CPU tensors (type 1) the existing zero-copy array_from_dlpack_f32/f64 functions are used directly. For CUDA tensors (type 2) we extract shape, dtype, device_id, and byte_offset without touching the data pointer.
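The device-code split above can be sketched as a plain match over the standard DLPack codes. `DeviceClass` and `classify_device` are illustrative names, not part of this module's API; how it handles pinned-host, OpenCL, and Vulkan tensors is not stated above, so they are classified as unsupported here.

```rust
/// How a DLPack tensor should be consumed, based on its device type code.
#[derive(Debug, PartialEq)]
enum DeviceClass {
    /// Safe to view zero-copy as an ndarray.
    Cpu,
    /// Metadata-only extraction; the data pointer must not be dereferenced.
    Gpu,
    /// Device codes whose handling the docs above do not specify.
    Unsupported,
}

/// Map a DLPack device type code (kDLCPU = 1, kDLCUDA = 2, kDLMetal = 8,
/// kDLROCM = 10, ...) to a consumption strategy.
fn classify_device(device_type: i32) -> DeviceClass {
    match device_type {
        1 => DeviceClass::Cpu,            // CPU: zero-copy ndarray view
        2 | 8 | 10 => DeviceClass::Gpu,   // CUDA, Metal, ROCm: metadata only
        _ => DeviceClass::Unsupported,    // incl. pinned host (3), OpenCL (4), Vulkan (7)
    }
}

fn main() {
    assert_eq!(classify_device(1), DeviceClass::Cpu);
    assert_eq!(classify_device(2), DeviceClass::Gpu);
    assert_eq!(classify_device(10), DeviceClass::Gpu);
    assert_eq!(classify_device(4), DeviceClass::Unsupported);
    println!("device classification ok");
}
```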

§CUDA runtime linkage

Full GPU-to-GPU processing (e.g. copying the tensor buffer to a cudarc-managed allocation) requires the cuda_special cargo feature and CUDA runtime linkage, which is deliberately kept out of default features to preserve the Pure Rust build. With default features only the metadata extraction path is available.
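A manifest fragment showing how such an opt-in might look; the `cuda_special` feature name comes from the paragraph above, while the crate name is a placeholder, not the crate's actual published name.

```toml
[dependencies]
# Default: pure-Rust build, metadata-extraction path only.
dlpack-cuda-crate = { version = "0.1" }

# Opt in to CUDA runtime linkage for full GPU-to-GPU processing:
# dlpack-cuda-crate = { version = "0.1", features = ["cuda_special"] }
```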

Structs§

CudaTensorInfo
Metadata extracted from a CUDA-resident DLPack tensor.

Enums§

DLPackDispatchResult
The result of auto-dispatching a DLPack tensor based on its device.

Functions§

cuda_tensor_info
Extract CudaTensorInfo from a Python DLPack capsule object.
cuda_tensor_info_from_dltensor
Extract CudaTensorInfo from a raw DLTensor pointer.
dlpack_auto_dispatch_f32
Dispatch an f32 DLPack tensor to CPU or GPU path without a CPU roundtrip.
dlpack_auto_dispatch_f64
Dispatch an f64 DLPack tensor to CPU or GPU path without a CPU roundtrip.
get_cuda_tensor_info
Python-facing function: extract GPU tensor metadata from a DLPack capsule.
register_dlpack_cuda_module
Register the get_cuda_tensor_info function into a PyO3 module.