pyo3-dlpack
Zero-copy DLPack tensor interop for PyO3.
This crate provides a safe and ergonomic way to exchange tensor data between Rust and Python ML frameworks (PyTorch, JAX, TensorFlow, CuPy, etc.) using the DLPack protocol.
Features
- Zero-copy: Tensors are shared directly without copying data
- PyO3 0.28+: Uses the modern API (no deprecation warnings)
- Bidirectional: Import tensors from Python and export tensors to Python
- Device-agnostic: Works with CPU, CUDA, ROCm, and other devices
- DLPack 1.0: Versioned protocol with read-only tensors — auto-negotiated on import, fully backward-compatible with legacy producers
- Benchmarked: zero-copy stays O(1) in time with flat memory, where copy-based interop is O(n) in time and doubles peak memory (≈54,000× faster at 100M elements), and is at parity with dlpark; see BENCHMARKS.md
Installation
Add to your Cargo.toml:
[dependencies]
pyo3-dlpack = "0.3"
pyo3 = "0.28"
Usage
Importing a tensor from Python
use *;
use PyTensor;
Exporting a tensor to Python
use *;
use ;
use c_void;
Python side:
# Call your Rust function that returns a DLPack capsule
=
# Convert to PyTorch tensor (zero-copy)
=
Read-only and versioned DLPack
pyo3-dlpack speaks both the legacy and the versioned (DLPack 1.0) protocol,
and negotiation is automatic — you do not have to choose.
- Import (PyTensor::from_pyany) advertises versioned support to the producer and transparently accepts either a legacy dltensor capsule or a versioned dltensor_versioned one. Call tensor.is_read_only() to check the read-only flag (always false for legacy producers, which cannot express it).
- Export keeps into_dlpack unchanged (a writable legacy capsule, for maximum consumer compatibility). To export a read-only tensor, use into_dlpack_readonly, which emits a versioned capsule with the read-only flag set.
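The negotiation described above can be sketched in plain Python. This is an illustration of the general DLPack consumer pattern (request a versioned capsule with max_version, fall back on TypeError), not the crate's actual implementation. The two producer classes, the (name, flags) tuples standing in for real PyCapsules, and import_tensor are all hypothetical; only the capsule names and the read-only bit (bit 0 of the flags field in dlpack.h) come from the protocol itself.

```python
READ_ONLY = 1 << 0  # DLPACK_FLAG_BITMASK_READ_ONLY in dlpack.h


class LegacyProducer:
    """Hypothetical pre-1.0 producer: __dlpack__ does not accept max_version."""

    def __dlpack__(self, stream=None):
        # Stand-in for a PyCapsule: (capsule name, flags).
        return ("dltensor", 0)


class VersionedProducer:
    """Hypothetical DLPack 1.0 producer that can mark its tensor read-only."""

    def __init__(self, read_only):
        self.read_only = read_only

    def __dlpack__(self, *, stream=None, max_version=None):
        # Simplification: hand out a legacy capsule to consumers that do not
        # advertise versioned support.
        if max_version is None or max_version[0] < 1:
            return ("dltensor", 0)
        return ("dltensor_versioned", READ_ONLY if self.read_only else 0)


def import_tensor(producer):
    """Advertise versioned support first, then fall back to the legacy call."""
    try:
        name, flags = producer.__dlpack__(max_version=(1, 0))
    except TypeError:
        # Legacy producers reject the unknown keyword argument.
        name, flags = producer.__dlpack__()
    versioned = name == "dltensor_versioned"
    # A legacy capsule has no flags field, so it can never report read-only.
    return {"versioned": versioned, "read_only": versioned and bool(flags & READ_ONLY)}
```

With this sketch, a legacy producer always reports read_only as False, which matches the behaviour of is_read_only() described above.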
Supported Data Types
- Float: f16, f32, f64, bf16
- Integer: i8, i16, i32, i64
- Unsigned: u8, u16, u32, u64
- Boolean
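At the protocol level, each of these types is described by a small DLDataType record of (code, bits, lanes). The sketch below models that record with ctypes and maps the type names from the list above onto it. The type codes and the 8-bit boolean follow dlpack.h; the DTYPES table and the dl_dtype helper are hypothetical names used only for illustration, and how the crate maps Rust types internally is not shown here.

```python
import ctypes
from enum import IntEnum


class DLDataTypeCode(IntEnum):
    # Values from dlpack.h (DLDataTypeCode).
    INT = 0
    UINT = 1
    FLOAT = 2
    OPAQUE_HANDLE = 3
    BFLOAT = 4
    COMPLEX = 5
    BOOL = 6


class DLDataType(ctypes.Structure):
    # Mirrors the C struct: uint8_t code; uint8_t bits; uint16_t lanes.
    _fields_ = [
        ("code", ctypes.c_uint8),
        ("bits", ctypes.c_uint8),
        ("lanes", ctypes.c_uint16),
    ]


# Hypothetical lookup table keyed by the names used in the list above.
DTYPES = {
    "f16": (DLDataTypeCode.FLOAT, 16),
    "f32": (DLDataTypeCode.FLOAT, 32),
    "f64": (DLDataTypeCode.FLOAT, 64),
    "bf16": (DLDataTypeCode.BFLOAT, 16),
    "i8": (DLDataTypeCode.INT, 8),
    "i16": (DLDataTypeCode.INT, 16),
    "i32": (DLDataTypeCode.INT, 32),
    "i64": (DLDataTypeCode.INT, 64),
    "u8": (DLDataTypeCode.UINT, 8),
    "u16": (DLDataTypeCode.UINT, 16),
    "u32": (DLDataTypeCode.UINT, 32),
    "u64": (DLDataTypeCode.UINT, 64),
    "bool": (DLDataTypeCode.BOOL, 8),  # booleans occupy one byte in DLPack
}


def dl_dtype(name):
    """Return the scalar (lanes = 1) DLDataType for a type name."""
    code, bits = DTYPES[name]
    return DLDataType(code=int(code), bits=bits, lanes=1)
```

Note that f16 and bf16 share a width of 16 bits and differ only in the type code, which is why the code field is needed in addition to the bit width.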
Supported Devices
- CPU
- CUDA
- CUDA Host (pinned memory)
- ROCm
- Metal
- Vulkan
- And more (see DLDeviceType)
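For reference, the integer device codes that travel with a DLPack tensor can be written out as follows. The numeric values are taken from dlpack.h; the Python member names and the device_label helper are hypothetical and exist only for this sketch. On the Python side, a producer's __dlpack_device__() returns a (device_type, device_id) tuple using these same codes.

```python
from enum import IntEnum


class DLDeviceType(IntEnum):
    # Numeric values from dlpack.h; member names are this sketch's own.
    CPU = 1
    CUDA = 2
    CUDA_HOST = 3  # pinned host memory
    OPENCL = 4
    VULKAN = 7
    METAL = 8
    VPI = 9
    ROCM = 10
    ROCM_HOST = 11
    EXT_DEV = 12
    CUDA_MANAGED = 13
    ONE_API = 14
    WEBGPU = 15
    HEXAGON = 16


def device_label(device_type, device_id):
    """Format a (device_type, device_id) pair as a short label."""
    return f"{DLDeviceType(device_type).name.lower()}:{device_id}"
```

For example, device_label(2, 0) yields "cuda:0" and device_label(10, 1) yields "rocm:1".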
Performance
DLPack enables true zero-copy tensor sharing: only metadata is processed, never the
data, so cost is constant regardless of tensor size. Copy-based interop is O(n) in
both time and peak memory. Representative results on Apple M3 (see
BENCHMARKS.md for the full methodology, the dlpark head-to-head,
and the commands to reproduce them):
| Operation (1M f32) | pyo3-dlpack (zero-copy) | Copy baseline |
|---|---|---|
| Export Rust → Python | ~3.2 µs | ~99 µs (Vec::clone / rust-numpy) |
| Import Python → Rust | ~2.5 µs | — |
The gap widens with size. Importing zero-copy from Python stays flat at ~0.5 µs
from 1M to 100M elements, while numpy.copy() grows to ~28 ms at 100M — roughly
54,000× faster — and a zero-copy import adds 0 MiB of resident memory where a
copy adds the full buffer (≈191 MiB for a 191 MiB array).
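To see why the zero-copy cost does not grow with tensor size, it helps to look at what is actually exchanged: a fixed-size DLTensor descriptor plus one shape and one stride entry per dimension. The sketch below models the descriptor layout from dlpack.h with ctypes and compares its size with the size of the data it points to. The helpers describe_f32, descriptor_nbytes, and data_nbytes are hypothetical, and the data pointer is left NULL because no real buffer is needed to make the point.

```python
import ctypes
from math import prod


class DLDevice(ctypes.Structure):
    _fields_ = [("device_type", ctypes.c_int), ("device_id", ctypes.c_int32)]


class DLDataType(ctypes.Structure):
    _fields_ = [("code", ctypes.c_uint8), ("bits", ctypes.c_uint8), ("lanes", ctypes.c_uint16)]


class DLTensor(ctypes.Structure):
    # Field order follows the DLTensor struct in dlpack.h.
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("device", DLDevice),
        ("ndim", ctypes.c_int32),
        ("dtype", DLDataType),
        ("shape", ctypes.POINTER(ctypes.c_int64)),
        ("strides", ctypes.POINTER(ctypes.c_int64)),
        ("byte_offset", ctypes.c_uint64),
    ]


def row_major_strides(shape):
    """Strides in elements (not bytes) for a compact row-major layout."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def describe_f32(shape):
    """Build a CPU f32 descriptor; data stays NULL in this sketch."""
    n = len(shape)
    return DLTensor(
        data=None,
        device=DLDevice(1, 0),       # kDLCPU, device 0
        ndim=n,
        dtype=DLDataType(2, 32, 1),  # kDLFloat, 32 bits, 1 lane
        shape=(ctypes.c_int64 * n)(*shape),
        strides=(ctypes.c_int64 * n)(*row_major_strides(shape)),
        byte_offset=0,
    )


def descriptor_nbytes(tensor):
    """Bytes of metadata: the struct plus its shape and strides arrays."""
    return ctypes.sizeof(DLTensor) + 2 * tensor.ndim * ctypes.sizeof(ctypes.c_int64)


def data_nbytes(shape, bits=32):
    """Bytes of tensor data that a copy-based transfer would have to move."""
    return prod(shape) * bits // 8
```

A one-dimensional tensor of 1M elements and one of 100M elements produce descriptors of identical size, while the data a copy would move grows from 4,000,000 to 400,000,000 bytes.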
Against dlpark (the mature Rust DLPack crate),
raw throughput is at parity — both are zero-copy capsule wrappers. See
BENCHMARKS.md for the per-size head-to-head.
Run the benchmarks yourself (cargo bench needs Rust ≥ 1.85 for the dlpark
dev-dependency):
- make bench-rust: Rust criterion head-to-head (cargo bench --bench dlpack)
- make bench-python: Python benchmarks; also python benchmarks/bench_dlpack.py --compare and --memory, and python benchmarks/interop_probe.py
- make bench: all benchmarks
Testing
Validate correctness and zero-copy behavior:
- make test: Rust unit tests + Python integration tests
- Tests verify that data pointers are preserved across transfers, capsule ownership (no double-free), and the versioned/read-only round-trip
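The pointer-preservation check can be illustrated with standard-library buffers alone. This is a sketch of the idea, not the project's test suite: a memoryview stands in for a zero-copy DLPack import, an array copy stands in for copy-based interop, and data_ptr is a hypothetical helper.

```python
import ctypes
from array import array


def data_ptr(buf):
    """Address of the first byte of a writable buffer-protocol object."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


source = array("f", [1.0, 2.0, 3.0, 4.0])
shared = memoryview(source)  # zero-copy: a view over the same memory
copied = array("f", source)  # copy: a new, independent allocation

# Zero-copy preserves the data pointer; a copy does not.
same_pointer = data_ptr(shared) == data_ptr(source)
copy_pointer_differs = data_ptr(copied) != data_ptr(source)

# Writes through the source are visible in the view but not in the copy.
source[0] = 9.0
view_sees_write = shared[0] == 9.0
copy_unchanged = copied[0] == 1.0
```

Comparing raw addresses is a stricter check than comparing values, since two buffers with equal contents can still be separate allocations.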
Python environment
The test module is built with maturin using the same interpreter as the tests.
Override it with PYTHON=/path/to/python if needed (e.g., a venv).
Default tests include PyTorch (pip install -e ".[test]"). For CI or lightweight runs, use pip install -e ".[test-lite]".
License
Licensed under the MIT license. See LICENSE for details.