pyo3-dlpack 0.3.0

Zero-copy DLPack tensor interop for PyO3

This crate provides a safe and ergonomic way to exchange tensor data between Rust and Python ML frameworks (PyTorch, JAX, TensorFlow, CuPy, etc.) using the DLPack protocol.

Features

  • Zero-copy: Tensors are shared directly without copying data
  • PyO3 0.28+: Uses the modern API (no deprecation warnings)
  • Bidirectional: Import tensors from Python and export tensors to Python
  • Device-agnostic: Works with CPU, CUDA, ROCm, and other devices
  • DLPack 1.0: Versioned protocol with read-only tensors — auto-negotiated on import, fully backward-compatible with legacy producers
  • Benchmarked: zero-copy import and export stay O(1) in time with flat memory, whereas copy-based interop is O(n) in time and doubles peak memory (≈54,000× faster at 100M elements). Throughput is at parity with dlpark; see BENCHMARKS.md.

Installation

Add to your Cargo.toml:

[dependencies]
pyo3-dlpack = "0.3"
pyo3 = "0.28"

Usage

Importing a tensor from Python

use pyo3::prelude::*;
use pyo3_dlpack::PyTensor;

#[pyfunction]
fn process_tensor(py: Python<'_>, obj: &Bound<'_, PyAny>) -> PyResult<()> {
    let tensor = PyTensor::from_pyany(py, obj)?;

    println!("Shape: {:?}", tensor.shape());
    println!("Device: {:?}", tensor.device());
    println!("Dtype: {:?}", tensor.dtype());

    // Respect the producer's read-only flag (DLPack 1.0); legacy producers
    // always report `false`.
    if tensor.is_read_only() {
        // Treat the data as immutable.
    }

    if tensor.device().is_cpu() {
        // Host memory can be dereferenced directly. The f32 cast is only
        // valid when tensor.dtype() reports a 32-bit float.
        let ptr = tensor.data_ptr() as *const f32;
        // ... process the data
    }

    Ok(())
}
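
Once the device check passes, the raw pointer and the shape are enough to view the data as a slice. The following is a minimal sketch, assuming the tensor is CPU-resident, compact row-major, and of dtype f32. `element_count` and `sum_f32` are hypothetical helpers written for illustration and are not part of pyo3-dlpack.

```rust
use std::slice;

/// Number of elements implied by a DLPack-style shape (product of extents).
/// Returns None on a negative extent or on arithmetic overflow.
fn element_count(shape: &[i64]) -> Option<usize> {
    shape.iter().try_fold(1usize, |acc, &dim| {
        if dim < 0 {
            None
        } else {
            acc.checked_mul(dim as usize)
        }
    })
}

/// Hypothetical helper: sum a contiguous, CPU-resident f32 buffer.
///
/// # Safety
/// `ptr` must be aligned for f32 and point to at least
/// `element_count(shape)` initialized f32 values that stay alive for the
/// duration of the call.
unsafe fn sum_f32(ptr: *const f32, shape: &[i64]) -> Option<f32> {
    let len = element_count(shape)?;
    if len == 0 {
        return Some(0.0);
    }
    if ptr.is_null() {
        return None;
    }
    // View the foreign buffer in place; nothing is copied.
    let data = unsafe { slice::from_raw_parts(ptr, len) };
    Some(data.iter().sum())
}
```

In the import example above, the pointer would come from `tensor.data_ptr()` and the extents from `tensor.shape()`. The slice must not outlive the tensor that owns the memory.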

Exporting a tensor to Python

use pyo3::prelude::*;
use pyo3_dlpack::{IntoDLPack, TensorInfo, cuda_device, dtype_f32};
use std::ffi::c_void;

struct MyGpuTensor {
    device_ptr: u64,
    shape: Vec<i64>,
    device_id: i32,
}

impl IntoDLPack for MyGpuTensor {
    fn tensor_info(&self) -> TensorInfo {
        TensorInfo::contiguous(
            self.device_ptr as *mut c_void,
            cuda_device(self.device_id),
            dtype_f32(),
            self.shape.clone(),
        )
    }
}

#[pyfunction]
fn create_tensor(py: Python<'_>) -> PyResult<Py<PyAny>> {
    let tensor = MyGpuTensor {
        device_ptr: 0x12345678, // placeholder: substitute a real device pointer
        shape: vec![2, 3],
        device_id: 0,
    };
    tensor.into_dlpack(py)
}

Python side:

import torch

# create_tensor is the Rust #[pyfunction] above, exposed by your extension
# module; it returns a DLPack capsule.
capsule = create_tensor()

# Convert to PyTorch tensor (zero-copy)
tensor = torch.from_dlpack(capsule)
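
The export example describes its buffer with `TensorInfo::contiguous`. As a rough illustration of what a contiguous layout means in DLPack terms, the sketch below computes compact row-major strides, which DLPack counts in elements rather than bytes. `contiguous_strides` and `element_offset` are hypothetical helpers, not functions of pyo3-dlpack, and how the crate stores strides internally is not shown in this README.

```rust
/// Row-major (C-contiguous) strides for `shape`, counted in elements.
/// The last dimension varies fastest, so its stride is 1.
fn contiguous_strides(shape: &[i64]) -> Vec<i64> {
    let mut strides = vec![1i64; shape.len()];
    // Walk from the second-to-last dimension back to the first.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Offset, in elements, of a multi-dimensional index under `strides`.
fn element_offset(index: &[i64], strides: &[i64]) -> i64 {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}
```

For the `[2, 3]` shape used above, the strides are `[3, 1]`, so element `(1, 2)` sits at offset 5 from the data pointer.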

Read-only and versioned DLPack

pyo3-dlpack speaks both the legacy and the versioned (DLPack 1.0) protocol, and negotiation is automatic — you do not have to choose.

  • Import (PyTensor::from_pyany) advertises versioned support to the producer and transparently accepts either a legacy dltensor capsule or a versioned dltensor_versioned one. Call tensor.is_read_only() to check the read-only flag (always false for legacy producers, which cannot express it).
  • Export keeps into_dlpack unchanged (a writable legacy capsule, for maximum consumer compatibility). To export a read-only tensor, use into_dlpack_readonly, which emits a versioned capsule with the read-only flag set:

#[pyfunction]
fn create_readonly_tensor(py: Python<'_>) -> PyResult<Py<PyAny>> {
    let tensor = MyTensor { /* ... */ };
    tensor.into_dlpack_readonly(py)
}
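
To make the negotiation concrete, here is a minimal sketch of the decision a consumer faces: the capsule name tells it which struct layout to expect, and only the versioned layout carries a flags word. The capsule names and the read-only bit (bit 0 of `flags`) follow the DLPack 1.0 header. `CapsuleKind`, `classify_capsule`, and `is_read_only` are hypothetical names for illustration and do not reflect the crate's internals.

```rust
/// Read-only bit in the `flags` field of DLPack 1.0's versioned struct.
const DLPACK_FLAG_BITMASK_READ_ONLY: u64 = 1 << 0;

#[derive(Debug, PartialEq)]
enum CapsuleKind {
    Legacy,
    Versioned,
}

/// Classify a capsule by its name; any other name is not importable.
fn classify_capsule(name: &str) -> Option<CapsuleKind> {
    match name {
        "dltensor" => Some(CapsuleKind::Legacy),
        "dltensor_versioned" => Some(CapsuleKind::Versioned),
        _ => None,
    }
}

/// Legacy capsules have no flags word, so they are always reported writable.
fn is_read_only(kind: &CapsuleKind, flags: u64) -> bool {
    match kind {
        CapsuleKind::Legacy => false,
        CapsuleKind::Versioned => (flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0,
    }
}
```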

Supported Data Types

  • Float: f16, f32, f64, bf16
  • Integer: i8, i16, i32, i64
  • Unsigned: u8, u16, u32, u64
  • Boolean
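
For orientation, DLPack describes each of these types as a (code, bits, lanes) triple. The sketch below mirrors that triple with the type codes from the DLPack header and derives the per-element byte size. `DataType`, `scalar`, and `itemsize` are hypothetical names for illustration, not the crate's own types (the crate exposes helpers such as `dtype_f32`).

```rust
/// Type codes as defined in the DLPack header.
const KDL_INT: u8 = 0;
const KDL_UINT: u8 = 1;
const KDL_FLOAT: u8 = 2;
const KDL_BFLOAT: u8 = 4;
const KDL_BOOL: u8 = 6;

/// Illustrative mirror of DLPack's (code, bits, lanes) triple.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DataType {
    code: u8,
    bits: u8,
    lanes: u16,
}

/// A scalar (single-lane) data type.
fn scalar(code: u8, bits: u8) -> DataType {
    DataType { code, bits, lanes: 1 }
}

/// Bytes per element, rounding up to whole bytes.
fn itemsize(dt: DataType) -> usize {
    (dt.bits as usize * dt.lanes as usize + 7) / 8
}
```

Under this scheme f32 is `scalar(KDL_FLOAT, 32)`, bf16 is `scalar(KDL_BFLOAT, 16)`, u8 is `scalar(KDL_UINT, 8)`, and a boolean is `scalar(KDL_BOOL, 8)`.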

Supported Devices

  • CPU
  • CUDA
  • CUDA Host (pinned memory)
  • ROCm
  • Metal
  • Vulkan
  • And more (see DLDeviceType)
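
The devices above correspond to integer codes in DLPack's DLDeviceType enum. The sketch below lists the codes for the devices named here, taken from the DLPack header. `DeviceKind`, `device_kind_from_code`, and `is_host_accessible` are hypothetical names for illustration; whether the crate's own `is_cpu()` treats pinned CUDA host memory as CPU is not stated in this README.

```rust
/// Device codes from the DLPack header for the devices listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i32)]
enum DeviceKind {
    Cpu = 1,
    Cuda = 2,
    CudaHost = 3,
    Vulkan = 7,
    Metal = 8,
    Rocm = 10,
}

/// Map a raw device code to a known kind; unknown codes yield None.
fn device_kind_from_code(code: i32) -> Option<DeviceKind> {
    match code {
        1 => Some(DeviceKind::Cpu),
        2 => Some(DeviceKind::Cuda),
        3 => Some(DeviceKind::CudaHost),
        7 => Some(DeviceKind::Vulkan),
        8 => Some(DeviceKind::Metal),
        10 => Some(DeviceKind::Rocm),
        _ => None,
    }
}

/// Plain and pinned host memory can be dereferenced from the CPU;
/// device memory cannot.
fn is_host_accessible(kind: DeviceKind) -> bool {
    matches!(kind, DeviceKind::Cpu | DeviceKind::CudaHost)
}
```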

Performance

DLPack enables true zero-copy tensor sharing: only metadata is processed, never the data, so cost is constant regardless of tensor size. Copy-based interop is O(n) in both time and peak memory. Representative results on Apple M3 (see BENCHMARKS.md for the full methodology, the dlpark head-to-head, and reproduce commands):

Operation (1M f32)      pyo3-dlpack (zero-copy)   Copy baseline
Export Rust → Python    ~3.2 µs                   ~99 µs (Vec::clone / rust-numpy)
Import Python → Rust    ~2.5 µs

The gap widens with size. Importing zero-copy from Python stays flat at ~0.5 µs from 1M to 100M elements, while numpy.copy() grows to ~28 ms at 100M — roughly 54,000× faster — and a zero-copy import adds 0 MiB of resident memory where a copy adds the full buffer (≈191 MiB for a 191 MiB array).
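
The difference between the two strategies can be seen in plain Rust without any Python involved. This is only an analogy under simple assumptions (an f32 buffer on the host), and `share`, `copy`, `same_buffer`, and `copy_cost_bytes` are hypothetical helpers: sharing hands out a view of the same allocation, while copying allocates and fills a second buffer of the same size.

```rust
use std::mem::size_of;

/// Zero-copy: return a view of the same allocation. Cost is independent
/// of the buffer length.
fn share(buf: &[f32]) -> &[f32] {
    buf
}

/// Copy-based: allocate a second buffer and copy every element. Cost grows
/// linearly with the buffer length, in both time and memory.
fn copy(buf: &[f32]) -> Vec<f32> {
    buf.to_vec()
}

/// True when both slices refer to the same memory region.
fn same_buffer(a: &[f32], b: &[f32]) -> bool {
    a.as_ptr() == b.as_ptr() && a.len() == b.len()
}

/// Extra bytes a copy of `len` f32 elements has to allocate.
fn copy_cost_bytes(len: usize) -> usize {
    len * size_of::<f32>()
}
```

A copy of 1M f32 elements allocates 4,000,000 additional bytes, while the shared view allocates none.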

Against dlpark (the mature Rust DLPack crate), raw throughput is at parity — both are zero-copy capsule wrappers. See BENCHMARKS.md for the per-size head-to-head.

Run the benchmarks yourself (cargo bench needs Rust ≥ 1.85 for the dlpark dev-dependency):

  • make bench-rust — Rust criterion head-to-head (cargo bench --bench dlpack)
  • make bench-python — Python benchmarks; also python benchmarks/bench_dlpack.py --compare and --memory, and python benchmarks/interop_probe.py
  • make bench — all benchmarks

Testing

Validate correctness and zero-copy behavior:

  • make test - Rust unit tests + Python integration tests
  • Tests verify data pointers are preserved across transfers, capsule ownership (no double-free), and the versioned/read-only round-trip
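
The two properties named above, pointer preservation and single release, can be illustrated with a simplified stand-in for a DLPack managed tensor. This is a sketch under stated assumptions, not the crate's test suite: `ManagedBuffer`, `export`, `consume`, `free_vec_f32`, and `DELETER_CALLS` are hypothetical, and the struct keeps only the fields needed for the ownership handoff (data pointer, owner context, deleter callback).

```rust
use std::ffi::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts deleter invocations so a test can assert "exactly once".
static DELETER_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Simplified stand-in for a managed tensor: data plus an owner and deleter.
#[repr(C)]
struct ManagedBuffer {
    data: *mut c_void,
    manager_ctx: *mut c_void,
    deleter: Option<unsafe extern "C" fn(*mut ManagedBuffer)>,
}

/// Deleter: reclaims the owning Vec and then the struct itself.
unsafe extern "C" fn free_vec_f32(this: *mut ManagedBuffer) {
    DELETER_CALLS.fetch_add(1, Ordering::SeqCst);
    unsafe {
        drop(Box::from_raw((*this).manager_ctx as *mut Vec<f32>));
        drop(Box::from_raw(this));
    }
}

/// Producer side: move the Vec to the heap and expose its buffer in place.
fn export(values: Vec<f32>) -> *mut ManagedBuffer {
    let mut owner = Box::new(values);
    let data = owner.as_mut_ptr() as *mut c_void;
    let manager_ctx = Box::into_raw(owner) as *mut c_void;
    Box::into_raw(Box::new(ManagedBuffer {
        data,
        manager_ctx,
        deleter: Some(free_vec_f32),
    }))
}

/// Consumer side: read the first element, then release exactly once.
///
/// # Safety
/// `managed` must come from `export` with a non-empty Vec and must not be
/// used again after this call.
unsafe fn consume(managed: *mut ManagedBuffer) -> f32 {
    unsafe {
        let first = *((*managed).data as *const f32);
        if let Some(deleter) = (*managed).deleter {
            deleter(managed);
        }
        first
    }
}
```

A check in this style records the Vec's pointer before export, compares it with the exported data pointer, and asserts that the deleter counter increased by one after the consumer is done.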

Python environment

The test module is built with maturin using the same Python interpreter that runs the tests. Override it with PYTHON=/path/to/python if needed (e.g., a venv). The default test dependencies include PyTorch (pip install -e ".[test]"); for CI or lightweight runs, use pip install -e ".[test-lite]".

License

Licensed under the MIT license. See LICENSE for details.