pub fn cuda_tensor_matmul<'py>(
_py: Python<'py>,
tensor_a: &Bound<'py, PyAny>,
_tensor_b: &Bound<'py, PyAny>,
) -> PyResult<Py<PyAny>>
Multiply two PyTorch/JAX tensors via the DLPack protocol.
This is the entry point for the zero-copy GPU path. When the cuda_bridge
Cargo feature is enabled (and cudarc is linked), the function accepts any
Python object implementing __dlpack__ and dispatches directly to a CUDA
GEMM kernel.
In the current CPU-only build this function returns a PyNotImplementedError
with a message directing callers to gpu_matmul().
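The documented behavior suggests a simple dispatch pattern on the caller's side. The helper below is hypothetical (not part of the scirs2 API); the two callables stand in for scirs2.cuda_tensor_matmul and a CPU path such as gpu_matmul():

```python
def matmul_with_fallback(a, b, gpu_fn, cpu_fn):
    """Try the GPU entry point; fall back to the CPU path when the
    extension was built without the cuda_bridge feature."""
    try:
        return gpu_fn(a, b)
    except NotImplementedError:
        return cpu_fn(a, b)

# Illustration with stand-in callables:
def gpu_stub(a, b):
    raise NotImplementedError("built without cuda_bridge; use gpu_matmul()")

result = matmul_with_fallback([[1.0]], [[2.0]], gpu_stub, lambda a, b: "cpu")
```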
§Python example
import torch, scirs2
a = torch.randn(512, 512, device='cuda')
b = torch.randn(512, 512, device='cuda')
# GPU path (when cuda_bridge feature is enabled):
c = scirs2.cuda_tensor_matmul(a, b)
# CPU fallback for all tensor sizes:
c_data = scirs2.gpu_matmul(a.flatten().tolist(), 512, 512, b.flatten().tolist(), 512)
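For reference, a pure-Python model of what a matmul over flattened arguments in the gpu_matmul() style computes. The row-major layout is an assumption based on the flatten() calls above; the scirs2 call remains the real entry point:

```python
def matmul_flat(a, m, k, b, n):
    """Multiply an m-by-k matrix by a k-by-n matrix, both given as
    row-major flattened lists; returns the m-by-n product, flattened."""
    out = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            out[i * n + j] = acc
    return out

# [[1, 2], [3, 4]] @ [[5, 6], [7, 8]]
small = matmul_flat([1, 2, 3, 4], 2, 2, [5, 6, 7, 8], 2)
```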