ndrs
ndrs is a NumPy‑like tensor library for Rust, providing multi‑dimensional array (tensor) operations with optional GPU acceleration via CUDA. It emphasizes zero‑copy views, efficient strided operations, and a flexible ownership model.
✨ Features
- N‑dimensional tensors – shape, strides, and byte‑level data storage.
- View‑based operations – slicing, broadcasting, transposing, and reshaping without copying data.
- Efficient strided copy – fast data movement between non‑contiguous layouts.
- Thread‑local and thread‑safe variants – `Rc<RefCell<Tensor>>` for single‑threaded speed; `Arc<ReentrantMutex<RefCell<Tensor>>>` for multi‑threading and Python bindings.
- GPU acceleration – transparent CPU ↔ GPU transfer, CUDA kernels for element‑wise addition.
- CUDA stream support – asynchronous execution, events for timing and cross‑stream synchronization (`wait_event`).
- Operator overloading – `+` and `+=` for tensors with broadcastable shapes.
- Python‑like slicing – intuitive `s!` macro: `s![1..4:2, ..]`.
- Broadcasting – automatic shape expansion.
- Custom data types – register your own primitive or structured types with user‑defined addition operations.
- Structured dtypes – build compound types (similar to NumPy structured arrays) with named fields.
- NPY file I/O – load and save tensors from/to NumPy `.npy` files (preserving shape; supports `f32`/`i32`).
- Convenient `tensor!` macro – create tensors from nested literals with optional dtype and device specifiers.
- Python bindings – use ndrs from Python via PyO3, with full support for custom dtypes and operation overriding.
- Override operators with custom kernels – replace built‑in implementations (e.g., addition) with your own CPU/GPU kernels for maximum performance.
🚀 Quick Start
Add this to your Cargo.toml:
```toml
[dependencies]
ndrs = "0.4"
```
Basic CPU usage with the tensor! macro
```rust
use ndrs::tensor;

// Values are illustrative.
let a = tensor!([[1, 2], [3, 4]]);
let b = tensor!([[5, 6], [7, 8]]);

// `+` is overloaded for tensors with broadcastable shapes.
let c = &a + &b;
```
Creating tensors with explicit dtype and device
The `tensor!` macro accepts optional `; dtype` and `; "device"` specifiers:
```rust
// Literal values are illustrative; specifiers follow the `data; dtype; "device"` pattern.
let t = tensor!([[1, 2], [3, 4]]; i32);                   // CPU, i32
let t = tensor!([[1.0, 2.0], [3.0, 4.0]]; f32);           // CPU, f32
let t = tensor!([[1, 2], [3, 4]]);                        // CPU, auto‑dtype
let t = tensor!([[1.0, 2.0], [3.0, 4.0]]; f32; "cuda:0"); // GPU 0, f32
let t = tensor!([[1, 2], [3, 4]]; i32; "cuda:1");         // GPU 1, i32
```
GPU usage with CUDA streams
```rust
use ndrs::{cuda, tensor};

// Sketch only: exact call sites may differ.
let stream = cuda::Stream::new()?;
cuda::set_stream(stream);

let a = tensor!([[1.0, 2.0], [3.0, 4.0]]; f32; "cuda:0");
let b = tensor!([[1.0, 1.0], [1.0, 1.0]]; f32; "cuda:0");
let c = &a + &b; // launched asynchronously on the current stream
```
NPY file I/O (NumPy compatibility)
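Since ndrs reads and writes the standard `.npy` format, files round‑trip with NumPy itself. A quick NumPy‑side sketch of the shapes and dtypes involved (file paths are illustrative):

```python
import numpy as np

# .npy round-trip: shape and dtype are preserved, matching what ndrs loads/saves.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
np.save("/tmp/ndrs_demo_f32.npy", arr)
loaded = np.load("/tmp/ndrs_demo_f32.npy")
assert loaded.shape == (2, 3) and loaded.dtype == np.float32
assert np.array_equal(loaded, arr)

# i32 is the other dtype the README lists as supported.
ints = np.array([[1, 2], [3, 4]], dtype=np.int32)
np.save("/tmp/ndrs_demo_i32.npy", ints)
assert np.array_equal(np.load("/tmp/ndrs_demo_i32.npy"), ints)
```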
Custom primitive data types
You can register your own primitive (non‑structured) types and define addition for them:
```rust
use ndrs::{register_dtype, register_add_op, DType};
use std::sync::Arc;

const DTYPE_MY_TYPE: DType = 1000;

// Register the type and its addition (item size and closure signature are illustrative).
register_dtype(DTYPE_MY_TYPE, /* itemsize */ 8);
register_add_op(DTYPE_MY_TYPE, Arc::new(|a: &[u8], b: &[u8], out: &mut [u8]| {
    // reinterpret the raw bytes of `a` and `b` and write their sum into `out`
}));
```
Custom structured dtypes (Python)
In the Python bindings, you can define structured dtypes similar to NumPy’s structured arrays:
```python
import numpy as np
import ndrs as nd  # module alias follows the `nd.` prefix used in the API table

# Define a complex dtype composed of two float32 fields
# (field names and the FLOAT32 constant are illustrative)
complex64 = nd.dtype.from_fields([("re", nd.FLOAT32), ("im", nd.FLOAT32)])

# Create a tensor from a NumPy structured array
arr = np.array([(1.0, 2.0), (3.0, 4.0)],
               dtype=[("re", np.float32), ("im", np.float32)])
t = nd.Tensor.from_numpy(arr)

# Access fields after conversion back to NumPy
out = t.numpy()
print(out["re"])  # [1.0, 3.0]
print(out["im"])  # [2.0, 4.0]
```
🧠 Core Concepts
Tensor
The raw data container. It owns a contiguous byte buffer (either on CPU or GPU) and stores shape, strides, data type, and device information. It does not implement operations directly – use TensorView for that.
Constructors:
- `Tensor::new_cpu_from_slice<T>(&[T], shape)` – from a slice.
- `Tensor::new_from_bytes(bytes, shape, dtype, device)` – from raw bytes (CPU or GPU).
- `Tensor::new_contiguous(shape, dtype)` – zero‑initialized CPU tensor.
- `Tensor::from_string_literal(s)` – parse from a literal string (used by `tensor!`).
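The byte‑buffer‑plus‑metadata model can be illustrated with NumPy, which uses the same layout (values are illustrative):

```python
import numpy as np

# A tensor couples a flat byte buffer with shape/strides/dtype metadata.
raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=np.float32).tobytes()
t = np.frombuffer(raw, dtype=np.float32).reshape(2, 3)

assert len(raw) == 24        # 6 elements * 4 bytes (f32)
assert t.strides == (12, 4)  # row stride: 3 * 4 bytes; column stride: 4 bytes
assert t[1, 2] == 6.0        # offset = 1*12 + 2*4 = 20 bytes into the buffer
```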
TensorView
A view into a Tensor with an optional offset, shape, and strides. All mathematical operations (addition, slicing, broadcasting, device transfer) are defined on views.
Two concrete view types are provided:
- `RcTensorView` – thread‑local variant using `Rc<RefCell<Tensor>>`. Fast and lightweight for single‑threaded code; all operations are non‑blocking and cheap.
- `ArcTensorView` – thread‑safe variant using `Arc<ReentrantMutex<RefCell<Tensor>>>`. Required for multi‑threaded environments and Python bindings; locking is automatic and reentrant.
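The zero‑copy behavior these view types provide is the same one NumPy slices exhibit, which makes for a quick illustration:

```python
import numpy as np

base = np.zeros((4, 4), dtype=np.float32)
view = base[1:3, 1:3]      # slicing returns a view, not a copy
view[:] = 7.0              # writing through the view...
assert base[1, 1] == 7.0   # ...mutates the shared underlying buffer
assert view.base is base   # the view holds a handle, not copied data
```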
Slice macro s!
Creates a slice descriptor for the `.slice()` method. Supports ranges, steps, single indices, and `..` (all).
```rust
// Argument forms are illustrative.
let sub = view.slice(s![1..4, 2..6])?;         // rows 1..4, cols 2..6
let row = view.slice(s![1, ..])?;              // single row (dimension reduced)
let col = view.slice(s![.., 2])?;              // single column
let every_other = view.slice(s![0..6:2, ..])?; // every second row
```
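For readers coming from NumPy, the `s!` forms above correspond to ordinary NumPy slicing (array values are illustrative):

```python
import numpy as np

a = np.arange(48).reshape(6, 8)
sub = a[1:4, 2:6]        # s![1..4, 2..6]
row = a[1]               # s![1, ..]  (dimension reduced)
col = a[:, 2]            # s![.., 2]
every_other = a[1:4:2]   # s![1..4:2, ..]

assert sub.shape == (3, 4)
assert row.shape == (8,)
assert col.shape == (6,)
assert every_other.shape == (2, 8)
```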
Broadcasting
Use `broadcast_shapes` to compute the target shape for two tensors, then `broadcast_to` to expand a view.
```rust
use ndrs::broadcast_shapes;

// Shapes [3, 1] and [1, 4] broadcast to [3, 4]; values and view construction are illustrative.
let a = Tensor::new_cpu_from_f32(&[1.0, 2.0, 3.0], &[3, 1]);
let b = Tensor::new_cpu_from_f32(&[10.0, 20.0, 30.0, 40.0], &[1, 4]);
let (a_view, b_view) = (a.view(), b.view());

let target = broadcast_shapes(a_view.shape(), b_view.shape()).unwrap(); // [3, 4]
let a_bcast = a_view.broadcast_to(&target)?;
let b_bcast = b_view.broadcast_to(&target)?;
```
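The shape rule `broadcast_shapes` applies is NumPy's: align trailing dimensions and stretch size‑1 axes. A minimal Python sketch of that rule:

```python
from itertools import zip_longest

def broadcast_shapes(a, b):
    """NumPy-style rule: align trailing dims; a size-1 axis stretches to match."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"incompatible shapes {a} and {b}")
        out.append(max(x, y))
    return list(reversed(out))

assert broadcast_shapes([3, 1], [1, 4]) == [3, 4]  # the example above
assert broadcast_shapes([2, 3], [3]) == [2, 3]
```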
Device management
- `Device::Cpu` – host memory.
- `Device::Cuda(id)` – CUDA device with the given index.
- `cuda::set_device(id)` – sets the default device for context creation.
- `cuda::get_device_count()` – returns the number of CUDA‑capable devices.
- `cuda::get_stream()` / `cuda::set_stream()` – thread‑local current CUDA stream.
GPU streams and events
- Streams allow asynchronous command submission. Use `cuda::Stream::new()` to create a custom non‑default stream.
- Events record points in a stream and can be used for timing (`elapsed_since`) or cross‑stream dependencies (`wait_event`).
```rust
let stream = cuda::Stream::new()?;
// ... launch kernels, copy data ...
let event = stream.record()?;
stream2.wait_event(&event)?; // make another stream wait on this point
```
📦 Cargo Features
- default – CPU only.
- cuda – enables GPU support (requires the CUDA toolkit and `cudarc`). Enables the `cuda::*` functions, `ArcTensorView` GPU transfers, and GPU‑accelerated addition.
🧪 Testing
Run all tests (CPU only; GPU tests are ignored by default when no device is present):

```shell
cargo test
```

To run GPU tests (requires a CUDA device), enable the `cuda` feature and include ignored tests:

```shell
cargo test --features cuda -- --ignored
```
🐍 Python Bindings
The ndrs-python crate provides Python bindings using PyO3. Install from source with maturin, the standard PyO3 build tool (the crate path may differ):

```shell
pip install maturin
cd ndrs-python
maturin develop --release
```
Then in Python:
```python
import ndrs as nd  # module alias follows the `nd.` prefix used in the API table

# Create a tensor from a nested list (auto‑detects dtype)
a = nd.Tensor([[1.0, 2.0], [3.0, 4.0]])

# Move to GPU and add
b = a.to("cuda:0")
c = b + b

# Convert back to NumPy (requires `numpy` installed)
print(c.numpy())
# [[2. 4.]
#  [6. 8.]]
```
Customizing operations from Python (overriding built‑in kernels)
You may want to replace ndrs’s default implementation of a binary operation (e.g., Add) with your own highly optimized kernel – either for a built‑in dtype like float32 or for a custom dtype.
The register_binary_op function allows you to supply a Python callback that will be invoked for the given dtype, operation, and device. The callback receives raw pointers to the input and output buffers, the number of elements, a device code, and an optional stream pointer. You can use ctypes + numpy to access the data and perform the computation.
For performance‑critical custom kernels, you can write a C/CUDA function, compile it into a shared library, then call it from Python via ctypes. This gives you full control over the kernel without sacrificing speed.
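As a minimal sketch of that ctypes pattern, here the system math library stands in for your own compiled kernel library:

```python
import ctypes
import ctypes.util

# Load a shared library and declare the C signature before calling it.
# libm stands in here for a shared library you compiled yourself.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

assert libm.cos(0.0) == 1.0
```

Declaring `argtypes`/`restype` up front is what keeps the FFI call type‑safe; the same applies to a kernel entry point taking raw buffer pointers.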
Example: Replace the CPU addition for float32 with a faster (or vectorized) implementation
```python
import ctypes
import numpy as np

def fast_add_f32(a_ptr, b_ptr, out_ptr, n, device, stream):
    # Assume the data is contiguous (ndrs will pass contiguous tensors
    # if you call .contiguous() first).
    p = ctypes.POINTER(ctypes.c_float)
    a = np.ctypeslib.as_array(ctypes.cast(a_ptr, p), (n,))
    b = np.ctypeslib.as_array(ctypes.cast(b_ptr, p), (n,))
    out = np.ctypeslib.as_array(ctypes.cast(out_ptr, p), (n,))
    np.add(a, b, out=out)  # NumPy's vectorized (SIMD-optimized) add
    return 0  # success

# Override the default CPU addition for float32
# (the FLOAT32 constant name is illustrative)
nd.register_binary_op(nd.FLOAT32, nd.BINARY_OP_ADD, "cpu", fast_add_f32)
```
For CUDA, you can write a .ptx or .cubin kernel, load it via ctypes or cupy, and invoke it inside the callback.
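To check the pointer‑wrapping pattern in isolation, independent of ndrs, you can drive the same kind of callback with plain NumPy buffers (the function name and signature are illustrative):

```python
import ctypes
import numpy as np

def add_f32(a_ptr, b_ptr, out_ptr, n, device=0, stream=None):
    """Illustrative callback: wrap raw float32 pointers and add in place."""
    p = ctypes.POINTER(ctypes.c_float)
    a = np.ctypeslib.as_array(ctypes.cast(a_ptr, p), (n,))
    b = np.ctypeslib.as_array(ctypes.cast(b_ptr, p), (n,))
    out = np.ctypeslib.as_array(ctypes.cast(out_ptr, p), (n,))
    np.add(a, b, out=out)  # writes directly into the caller's output buffer
    return 0

x = np.array([1.0, 2.0], dtype=np.float32)
y = np.array([3.0, 4.0], dtype=np.float32)
z = np.zeros(2, dtype=np.float32)
assert add_f32(x.ctypes.data, y.ctypes.data, z.ctypes.data, 2) == 0
assert z.tolist() == [4.0, 6.0]
```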
Overriding from Rust
The Rust API also lets you override operations. This is useful when you want to integrate a kernel written directly in Rust (e.g., using ndarray or rayon) or a third‑party CUDA kernel.
```rust
use ndrs::{register_binary_op, BinaryOpFn, Device};
use std::sync::Arc;

// Closure signature and enum/const names are illustrative.
let op: BinaryOpFn = Arc::new(|a_ptr, b_ptr, out_ptr, n, device, stream| {
    // your Rust (or FFI) kernel goes here
    0
});
register_binary_op(DTYPE_F32, BinaryOp::Add, Device::Cpu, op);
```
After registration, all tensor additions using that dtype and device will route through your custom kernel.
Important notes for custom kernels
- Contiguity: the kernel may be called with arbitrary strides. To simplify your implementation, you can first call `.contiguous()` on the tensors inside the kernel (or require the user to do so), but this adds a copy overhead. For maximum performance, your kernel should be stride‑aware (like ndrs’s default CPU and GPU kernels).
- Thread safety: the callback may be invoked from multiple threads; it must be safe to use concurrently.
- Device‑specific: you can register different kernels for CPU and CUDA, allowing specialized GPU kernels alongside a CPU fallback.
- Performance gains: by overriding built‑in operations, you can integrate hand‑tuned CPU vectorization (e.g., `avx2` intrinsics) or highly optimized CUDA kernels (e.g., using tensor cores) without waiting for ndrs to support them natively.
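The stride‑aware idea can be sketched in NumPy terms: instead of assuming contiguous memory, walk every index and let each element's offset be computed from the strides (helper name is illustrative):

```python
import numpy as np

def strided_add(a, b, out):
    # Walk every multi-index; indexing resolves each element's byte offset
    # as sum(i * stride_i), so arbitrary (non-contiguous) layouts work.
    for idx in np.ndindex(out.shape):
        out[idx] = a[idx] + b[idx]

a = np.arange(6, dtype=np.float32).reshape(2, 3).T  # transpose: non-contiguous
b = np.ones((3, 2), dtype=np.float32)
out = np.empty((3, 2), dtype=np.float32)
strided_add(a, b, out)

assert not a.flags["C_CONTIGUOUS"]
assert np.array_equal(out, a + b)
```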
Python API reference
| Function / Class | Description |
|---|---|
| `nd.Tensor(data, dtype=None, device=None)` | Create a tensor from a Python list or NumPy array. |
| `nd.Tensor.from_numpy(array, device=None)` | Alternative constructor from a NumPy array. |
| `tensor.shape` | Returns the shape as a list of ints. |
| `tensor.dtype` | Returns the dtype id (int), or a `DType` object for custom dtypes. |
| `tensor.device` | Returns the device string (e.g., `"cpu"`, `"cuda:0"`). |
| `tensor.numpy()` | Returns a NumPy array (copy). |
| `tensor.to(device)` | Moves the tensor to another device (returns a new tensor). |
| `nd.dtype.from_fields(fields)` | Create a custom structured dtype; `fields` is a list of `(name, dtype_id)`. |
| `nd.register_dtype(name, itemsize)` | Register a plain (non‑structured) dtype; returns the dtype id. |
| `nd.register_binary_op(dtype, kind, device, callback)` | Register a binary operation callback for a specific dtype and device. `kind` is one of `nd.BINARY_OP_ADD`, `nd.BINARY_OP_SUB`, … |
⚙️ Performance Considerations
- Views are cheap: they copy only shape, strides, offset, and a handle to the underlying `Tensor`.
- Strided copy is optimized for CPU and GPU; non‑contiguous copies use a fallback iterative kernel but remain efficient.
- GPU addition uses a highly optimized CUDA kernel that respects arbitrary strides, achieving near‑peak memory bandwidth.
- Automatic event tracking is enabled by default to ensure safe multi‑stream synchronization; you can disable it for maximum throughput if you manage dependencies manually.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
🤝 Contributing
Contributions are welcome! Please open an issue or pull request on GitHub. For major changes, please discuss first.