# ndrs
ndrs is a NumPy‑like tensor library for Rust, providing multi‑dimensional array (tensor) operations with optional GPU acceleration via CUDA. It emphasizes zero‑copy views, efficient strided operations, and a flexible ownership model.
## 📑 Table of Contents
- Features
- Quick Start
- Core Concepts
- Cargo Features
- Testing
- Python Bindings
- Performance Considerations
- License
- Contributing
- Acknowledgments
## ✨ Features
- N‑dimensional tensors – shape, strides, and byte‑level data storage.
- View‑based operations – slicing, broadcasting, transposing, and reshaping without copying data.
- Efficient strided copy – fast data movement between non‑contiguous layouts.
- Thread‑local and thread‑safe variants:
  - `Rc<RefCell<Tensor>>` for single‑threaded speed
  - `Arc<ReentrantMutex<RefCell<Tensor>>>` for multi‑threading and Python bindings
- GPU acceleration – transparent CPU ↔ GPU transfer, CUDA kernels for element‑wise operations.
- CUDA stream support – asynchronous execution, events for timing and cross‑stream synchronization.
- Operator overloading – `+` and `+=` for tensors with broadcastable shapes.
- Python‑like slicing – intuitive `s!` macro: `s![1..4:2, ..]`.
- Broadcasting – automatic shape expansion.
- Dynamic CUDA kernels – compile and launch kernels from PTX or CUDA C++ source at runtime with `RawKernel`.
- Custom elementwise kernels – define your own per‑element operations (e.g., `out = a + b * c`) that work on tensors of any rank and dtype, automatically compiled for GPU.
- Structured dtypes – build compound types (similar to NumPy structured arrays) with named fields.
- NPY file I/O – load and save tensors from/to NumPy `.npy` files (preserving shape; supports `f32`/`i32`).
- Convenient `tensor!` macro – create tensors from nested literals with optional dtype and device specifiers.
- Python bindings – use ndrs from Python via PyO3, with full support for custom dtypes and operation overriding.
- Override operators with custom kernels – replace built‑in implementations (e.g., addition) with your own CPU/GPU kernels for maximum performance.
## 🚀 Quick Start
Add this to your `Cargo.toml`:

```toml
[dependencies]
ndrs = "0.4"
```
### Basic CPU usage with the `tensor!` macro

```rust
use ndrs::tensor; // import path sketched – the original snippet was elided

// Build tensors from nested literals; shape and dtype are inferred.
let a = tensor!([[1.0, 2.0], [3.0, 4.0]]);
let b = tensor!([[1.0, 2.0], [3.0, 4.0]]);
let c = a + b; // element‑wise addition via operator overloading
```
### Creating tensors with explicit dtype and device

The literal and specifier arguments below are sketches; see the `tensor!` macro docs for the exact grammar:

```rust
let t = tensor!([1, 2, 3]);                   // CPU, i32
let t = tensor!([1.0, 2.0, 3.0]);             // CPU, f32
let t = tensor!([[1, 2], [3, 4]]);            // CPU, auto‑dtype
let t = tensor!([1.0, 2.0], "f32", "cuda:0"); // GPU 0, f32
let t = tensor!([1, 2], "i32", "cuda:1");     // GPU 1, i32
```
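To make the macro's inference behavior concrete, here is a small Python sketch (illustrative only, not part of ndrs) of how a nested literal maps to a shape and an inferred dtype, mirroring what `tensor!` does with nested literals:

```python
def infer(nested):
    """Walk a nested list and return (shape, dtype) the way a
    nested-literal tensor constructor would."""
    shape = []
    node = nested
    while isinstance(node, list):
        shape.append(len(node))
        node = node[0]

    # Flatten and pick a dtype: any float promotes the whole literal to f32.
    def flatten(x):
        if isinstance(x, list):
            for y in x:
                yield from flatten(y)
        else:
            yield x

    values = list(flatten(nested))
    dtype = "f32" if any(isinstance(v, float) for v in values) else "i32"
    return shape, dtype

print(infer([[1, 2], [3, 4]]))    # ([2, 2], 'i32')
print(infer([[1.0, 2.0, 3.0]]))   # ([1, 3], 'f32')
```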
### GPU usage with CUDA streams

```rust
use ndrs::cuda; // import path sketched – the original snippet was elided

cuda::set_device(0);
let stream = cuda::Stream::new(0)?; // constructor form sketched
cuda::set_stream(stream);           // subsequent GPU ops run on this stream
```
### NPY file I/O (NumPy compatibility)

```rust
use ndrs::Tensor; // method names below are sketched – see the crate docs

let t = Tensor::new_cpu_from_slice::<f32>(&[1.0, 2.0, 3.0], &[3]);
t.save_npy("out.npy")?;
let loaded = Tensor::load_npy("out.npy")?;
```
### Custom primitive data types

```rust
use ndrs::{DType, register_dtype, register_add_op}; // import paths sketched
use std::sync::Arc;

// Pick an id outside the built‑in range.
const DTYPE_MY_TYPE: DType = 1000;

// Arguments were elided in the original: register the dtype (name,
// item size, …), then an addition kernel for it.
register_dtype(/* … */);
register_add_op(/* … */);
### Custom elementwise kernels (GPU) – Rust
Define per‑element operations that work on tensors of any rank and dtype, automatically compiled for GPU.
```rust
use ndrs::{tensor, Tensor};  // import paths sketched
use ndrs::ElementwiseKernel;
use ndrs::TensorViewOps;

let a = tensor!([1.0f32, 2.0, 3.0]);
let b = tensor!([10.0f32, 20.0, 30.0]);
let mut out = Tensor::new_contiguous(/* shape, dtype */)?.into_arc();

let a_view = a.into_arc().as_view();
let b_view = b.into_arc().as_view();
let mut out_view = out.as_view();

// Simple addition (expression and argument forms sketched)
elementwise_kernel(/* "out = a + b", &[&a_view, &b_view], &mut out_view */)?;

// Multi‑statement with local variables
elementwise_kernel(/* "T t = a * a; out = t + b", … */)?;
```
### Python custom elementwise kernels

The Python binding provides a CuPy‑like `ElementwiseKernel`:

```python
import ndrs as nd  # module name sketched

# Argument layout sketched after CuPy's ElementwiseKernel.
square_add = nd.ElementwiseKernel("X a, Y b", "Z out", "out = a * a + b")
a = nd.Tensor([1.0, 2.0, 3.0], device="cuda:0")
b = nd.Tensor([10.0, 20.0, 30.0], device="cuda:0")
out = square_add(a, b)
```

Multi‑statement example:

```python
fused = nd.ElementwiseKernel(
    "X a, X b", "X out",
    "X t = a * a; out = t + b",  # local variable, then the final statement
)
```
The kernel automatically handles broadcasting, strides, and per‑tensor offsets (e.g., from slicing). Type placeholders (X, Y, Z) are mapped to actual dtypes at call time.
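The bookkeeping described above can be sketched in plain Python (illustrative only): given an output shape plus per-tensor strides and offsets, an elementwise kernel walks the flat output index space and maps each coordinate into every input, with broadcasting expressed as a stride of 0:

```python
def elementwise(expr, shape, inputs):
    """Apply `expr` per element over strided inputs.
    Each input is (flat_buffer, strides, offset); strides are in elements.
    A broadcast dimension is simply a stride of 0."""
    n = 1
    for d in shape:
        n *= d
    out = [0] * n
    for flat in range(n):
        # Decompose the flat index into multi-dimensional coordinates.
        idx, rem = [], flat
        for d in reversed(shape):
            idx.append(rem % d)
            rem //= d
        idx.reverse()
        # Gather one element per input via offset + sum(idx[i] * strides[i]).
        args = []
        for buf, strides, offset in inputs:
            pos = offset + sum(i * s for i, s in zip(idx, strides))
            args.append(buf[pos])
        out[flat] = expr(*args)
    return out

a = [1, 2, 3, 4, 5, 6]   # logical shape (2, 3), row-major strides (3, 1)
b = [10, 20]             # shape (2, 1), broadcast over axis 1 via stride 0
res = elementwise(lambda x, y: x + y, (2, 3),
                  [(a, (3, 1), 0), (b, (1, 0), 0)])
print(res)  # [11, 12, 13, 24, 25, 26]
```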
### Low‑level `RawKernel` (dynamic CUDA kernels)

```rust
use ndrs::RawKernel;    // import paths sketched
use ndrs::LaunchConfig;
use ndrs::Tensor;

// Compile CUDA C++ (or load PTX) at runtime, then launch with an explicit
// configuration. Argument forms are sketched – see the RawKernel docs.
let kernel = RawKernel::from_source(/* source, entry point */)?;
kernel.launch(/* LaunchConfig, tensor arguments */)?;
```
## 🧠 Core Concepts
### Tensor
The raw data container. It owns a contiguous byte buffer (either on CPU or GPU) and stores shape, strides, data type, and device information. It does not implement operations directly – use TensorView for that.
Constructors:

- `Tensor::new_cpu_from_slice::<T>(&[T], shape)` – from a slice.
- `Tensor::new_from_bytes(bytes, shape, dtype, device)` – from raw bytes (CPU or GPU).
- `Tensor::new_contiguous(shape, dtype)` – zero‑initialized CPU tensor.
- `Tensor::from_string_literal(s)` – parse from a literal string (used by `tensor!`).
### TensorView
A view into a Tensor with an optional offset, shape, and strides. All mathematical operations (addition, slicing, broadcasting, device transfer) are defined on views.
Two concrete view types are provided:

- `RcTensorView` – thread‑local variant using `Rc<RefCell<Tensor>>`. Fast and lightweight for single‑threaded code.
- `ArcTensorView` – thread‑safe variant using `Arc<ReentrantMutex<RefCell<Tensor>>>`. Required for multi‑threaded environments and Python bindings.
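To illustrate why these views are zero-copy (this is a toy Python model, not ndrs code): a view is just an `(offset, shape, strides)` triple over one shared buffer, so an operation like transpose only permutes the stride tuple and never touches the data:

```python
class View:
    """Toy (offset, shape, strides) view over a shared flat buffer."""
    def __init__(self, buf, shape, strides, offset=0):
        self.buf, self.shape, self.strides, self.offset = buf, shape, strides, offset

    def __getitem__(self, idx):
        # Element address: offset + sum(i * stride). No copy ever happens.
        return self.buf[self.offset + sum(i * s for i, s in zip(idx, self.strides))]

    def t(self):
        # Transpose: reverse shape and strides; the buffer is untouched.
        return View(self.buf, self.shape[::-1], self.strides[::-1], self.offset)

buf = [1, 2, 3, 4, 5, 6]          # a 2×3 tensor, row-major
v = View(buf, (2, 3), (3, 1))
vt = v.t()                        # a 3×2 view sharing the same buffer
print(v[(1, 2)], vt[(2, 1)])      # 6 6 – the same element through both views
```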
### Slice macro `s!`

```rust
use ndrs::s; // import path sketched

// Range arguments are sketched from the s! syntax shown above.
let sub = view.slice(s![1..4:2, ..])?;       // rows 1 and 3, all columns
let row = view.slice(s![0, ..])?;            // first row
let every_other = view.slice(s![.., ..:2])?; // every other column
```
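The arithmetic a slice such as `1..4:2` performs on one axis can be sketched in Python (illustrative only): slicing rewrites the `(offset, length, stride)` triple for that axis and leaves the data untouched:

```python
def slice_axis(offset, length, stride, start, stop, step):
    """Apply a start..stop:step slice to one axis of a strided view.
    Returns the new (offset, length, stride) for that axis."""
    new_offset = offset + start * stride                   # first selected element
    new_length = max(0, (stop - start + step - 1) // step)  # ceiling division
    new_stride = stride * step                              # skip `step` elements per index
    return new_offset, new_length, new_stride

# A 1-D view of 6 elements, stride 1: apply 1..4:2 (selects indices 1 and 3).
print(slice_axis(0, 6, 1, 1, 4, 2))   # (1, 2, 2)
```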
### Broadcasting

```rust
use ndrs::broadcast_shapes; // import path sketched

// Compute the common target shape, then expand a view to it
// (argument forms sketched).
let target = broadcast_shapes(/* shapes of a_view and b_view */).unwrap();
let a_bcast = a_view.broadcast_to(/* &target */)?;
```
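The shape rule follows NumPy: shapes are aligned from the right, and each dimension pair must either match or contain a 1. A stdlib-Python sketch of that rule (illustrative, not the ndrs implementation):

```python
def broadcast_shapes(*shapes):
    """NumPy-style broadcast: align from the right; dims must match or be 1."""
    ndim = max(len(s) for s in shapes)
    # Left-pad every shape with 1s up to a common rank.
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    out = []
    for dims in zip(*padded):
        target = max(dims)
        if any(d != 1 and d != target for d in dims):
            raise ValueError(f"incompatible shapes: {shapes}")
        out.append(target)
    return tuple(out)

print(broadcast_shapes((2, 1, 3), (4, 3)))   # (2, 4, 3)
```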
### Device management

- `Device::Cpu` / `Device::Cuda(id)`
- `cuda::set_device(id)`, `cuda::get_device_count()`, `cuda::get_stream()`, `cuda::set_stream()`
### GPU streams and events

```rust
// Constructor and argument forms sketched – see the stream/event docs.
let stream = cuda::Stream::new(/* device id */)?;
let event = stream.record(/* … */)?; // record an event on this stream
stream2.wait_event(/* &event */)?;   // cross‑stream synchronization
```
## 📦 Cargo Features

- default – CPU only.
- cuda – enables GPU support (requires the CUDA toolkit and `cudarc`). Enables the `cuda::*` functions, `ArcTensorView` GPU transfers, GPU‑accelerated addition, and the `ElementwiseKernel`/`RawKernel` facilities.
## 🧪 Testing

Run the test suite with `cargo test`; enable the GPU paths with `cargo test --features cuda` (requires a CUDA‑capable device).
## 🐍 Python Bindings

The `ndrs-python` crate provides Python bindings using PyO3.
### Installation

Build and install the bindings from the `ndrs-python` crate using the standard PyO3/maturin workflow.
### Basic usage

```python
import ndrs as nd  # module name sketched
import numpy as np

a = nd.Tensor([[1.0, 2.0], [3.0, 4.0]])
b = nd.Tensor(np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32))
c = a + b
print(c.numpy())
# [[2. 4.], [6. 8.]]
```
### CUDA streams and events from Python

```python
stream = nd.cuda.Stream(0)  # constructors per the API reference below
event = nd.cuda.Event(0)

a = nd.Tensor([1.0, 2.0, 3.0], device="cuda:0")  # device string sketched
b = nd.Tensor([4.0, 5.0, 6.0], device="cuda:0")
c = a + b     # runs on the current stream
d = c * 2.0
print(d.numpy())
# [10. 14. 18.]
```
### Custom dtypes and operations

Structured dtypes:

```python
# Field specification sketched – see nd.dtype.from_fields in the reference below.
point = nd.dtype.from_fields({"x": "f32", "y": "f32"})
t = nd.Tensor([(1.0, 2.0), (3.0, 4.0)], dtype=point)
print(t.dtype)
```
Override addition:

```python
my_dtype = nd.register_dtype("my_type", 4)  # plain 4‑byte dtype, returns its id

def my_add(a, b, out):
    # Custom CPU addition callback (exact signature sketched).
    return 0

nd.register_binary_op(my_dtype, nd.BINARY_OP_ADD, "cpu", my_add)
```
### Python API reference

| Function / Class | Description |
|---|---|
| `nd.Tensor(data, dtype=None, device=None)` | Create tensor from list or NumPy array. |
| `nd.Tensor.from_numpy(array, device=None)` | Alternative constructor from NumPy array. |
| `tensor.shape` / `tensor.dtype` / `tensor.device` | Basic properties. |
| `tensor.numpy()` | Copy data to a NumPy array. |
| `tensor.to(device)` | Move tensor to another device. |
| `nd.dtype.from_fields(fields)` | Create a structured dtype. |
| `nd.register_dtype(name, itemsize)` | Register a plain dtype; returns its id. |
| `nd.register_binary_op(dtype, kind, device, callback)` | Override a binary op (e.g., `nd.BINARY_OP_ADD`). |
| `nd.cuda.get_device()`, `nd.cuda.set_device(device_str)` | Get/set the current CUDA device. |
| `nd.cuda.Stream(device_id)` | Create a CUDA stream. |
| `nd.cuda.Event(device_id)` | Create a CUDA event. |
## ⚙️ Performance Considerations
- Views are cheap: they copy only shape, strides, offset, and a handle.
- Strided copy is optimized for CPU and GPU; non‑contiguous copies use an iterative kernel but are still efficient.
- GPU addition uses a highly optimized stride‑aware CUDA kernel.
- `ElementwiseKernel` compiles a dedicated kernel for the given expression, shape, and dtypes; subsequent calls with the same signature reuse the compiled kernel.
- Automatic event tracking is enabled by default to ensure safe multi‑stream synchronization; you can disable it for maximum throughput if you manage dependencies manually.
## 📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Please open an issue or pull request on GitHub. For major changes, please discuss first.