# ndrs

ndrs is a NumPy‑like tensor library for Rust, providing multi‑dimensional array (tensor) operations with optional GPU acceleration via CUDA. It emphasizes zero‑copy views, efficient strided operations, and a flexible ownership model.
## ✨ Features
- N‑dimensional tensors – shape, strides, and byte‑level data storage.
- View‑based operations – slicing, broadcasting, transposing, and reshaping without copying data.
- Efficient strided copy – fast data movement between non‑contiguous layouts.
- Thread‑local and thread‑safe variants:
  - `Rc<RefCell<Tensor>>` for single‑threaded speed
  - `Arc<ReentrantMutex<RefCell<Tensor>>>` for multi‑threading and Python bindings
- GPU acceleration – transparent CPU ↔ GPU transfer, CUDA kernels for element‑wise addition.
- CUDA stream support – asynchronous execution, events for timing and cross‑stream synchronization (`wait_event`).
- Operator overloading – `+` and `+=` for tensors with broadcastable shapes.
- Python‑like slicing – intuitive `s!` macro: `s![1..4:2, ..]`.
- Broadcasting – automatic shape expansion.
- Custom data types – register your own types with user‑defined addition operations.
- NPY file I/O – load and save tensors from/to NumPy `.npy` files (preserving shape; supports `f32`/`i32`).
- Python bindings – use ndrs from Python via PyO3 (optional).
## 🚀 Quick Start
Add this to your `Cargo.toml`:

```toml
[dependencies]
ndrs = "0.1"
```
### Basic CPU usage with the `tensor!` macro
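The feature list above names the `tensor!` macro and operator overloading; a minimal, uncompilable sketch of what basic usage might look like (the import path and every call below are assumptions, not the crate's verified API):

```rust
use ndrs::tensor; // illustrative import path

// Two 2×2 tensors; `+` is defined for broadcastable shapes.
let a = tensor!([[1.0f32, 2.0], [3.0, 4.0]]);
let b = tensor!([[10.0f32, 20.0], [30.0, 40.0]]);
let c = &a + &b; // element‑wise addition
```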
### GPU usage with CUDA streams
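A hedged sketch of GPU usage based on the stream and device concepts described later in this README (the method names, such as `to_device`, are assumptions):

```rust
use ndrs::cuda::Stream; // illustrative import path

let stream = Stream::new()?;               // custom non‑default stream
let a_gpu = a.to_device(Device::Cuda(0))?; // method name assumed
let b_gpu = b.to_device(Device::Cuda(0))?;
let c_gpu = &a_gpu + &b_gpu;               // CUDA element‑wise addition
let c = c_gpu.to_device(Device::Cpu)?;     // copy back to host
```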
### NPY file I/O (NumPy compatibility)
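A hedged sketch of the round trip through NumPy's `.npy` format (the function names are assumptions; only the format support itself is stated in the feature list):

```rust
// Load an .npy file (f32/i32, shape preserved), then save it back.
let t = load_npy("input.npy")?;   // name assumed
save_npy("output.npy", &t)?;      // name assumed
```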
### Custom data types
You can register your own types and define addition for them:
```rust
use std::sync::Arc;

// Custom dtypes get IDs outside the built‑in range.
const DTYPE_MY_TYPE: DType = 1000;

// Register the type and its addition kernel (argument lists omitted):
register_dtype(/* … */);
register_add_op(/* … */);
```
## 🧠 Core Concepts

### `Tensor`
The raw data container. It owns a contiguous byte buffer (either on CPU or GPU) and stores shape, strides, data type, and device information. It does not implement operations directly – use TensorView for that.
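Independent of ndrs' internals, the relationship between shape and strides can be made concrete with a small self‑contained sketch of row‑major (C‑order) byte strides, the convention NumPy uses by default:

```rust
/// Row‑major (C‑order) byte strides for a given shape and element size.
/// This mirrors the usual NumPy convention; ndrs' internal layout may differ.
fn row_major_strides(shape: &[usize], elem_size: usize) -> Vec<usize> {
    let mut strides = vec![0; shape.len()];
    let mut acc = elem_size;
    // Walk the shape from the innermost axis outward, accumulating sizes.
    for (i, &dim) in shape.iter().enumerate().rev() {
        strides[i] = acc;
        acc *= dim;
    }
    strides
}

fn main() {
    // A [2, 3, 4] tensor of f32 (4‑byte elements): one step along the last
    // axis advances 4 bytes, the middle axis 16 bytes, the first axis 48.
    assert_eq!(row_major_strides(&[2, 3, 4], 4), vec![48, 16, 4]);
    println!("{:?}", row_major_strides(&[2, 3, 4], 4)); // [48, 16, 4]
}
```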
### `TensorView`
A view into a Tensor with an optional offset, shape, and strides. All mathematical operations (addition, slicing, broadcasting, device transfer) are defined on views.
Two concrete view types are provided:
- `RcTensorView` – thread‑local variant using `Rc<RefCell<Tensor>>`. Fast and lightweight for single‑threaded code; all operations are non‑blocking and cheap.
- `ArcTensorView` – thread‑safe variant using `Arc<ReentrantMutex<RefCell<Tensor>>>`. Required for multi‑threaded environments and Python bindings; locking is automatic and reentrant.
### Slice macro `s!`
Creates a slice descriptor for the `.slice()` method. Supports ranges, steps, single indices, and `..` (all). The index arguments below are illustrative:

```rust
let sub = view.slice(s![1..4, 2..6])?;        // rows 1..4, cols 2..6
let row = view.slice(s![1, ..])?;             // single row (dimension reduced)
let col = view.slice(s![.., 2])?;             // single column
let every_other = view.slice(s![..:2, ..])?;  // every second row
```
### Broadcasting
Use `broadcast_shapes` to compute the target shape for two tensors, then `broadcast_to` to expand a view. Arguments marked `/* … */` are not shown here:

```rust
use broadcast_shapes;

let a = new_cpu_from_f32(/* shape, data */);
let b = new_cpu_from_f32(/* shape, data */);

let target = broadcast_shapes(/* a and b shapes */).unwrap(); // [3, 4]
let a_bcast = a_view.broadcast_to(/* target */)?;
let b_bcast = b_view.broadcast_to(/* target */)?;
```
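The broadcasting rule itself is simple enough to sketch in plain Rust, independent of the crate (NumPy‑style right‑aligned dimension matching; this is an illustration, not ndrs' actual implementation):

```rust
/// Compute the broadcast shape of two shapes using NumPy‑style rules:
/// align trailing dimensions; a dimension of 1 stretches to match.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Read dimensions right‑aligned; missing leading dims act as 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible shapes
        };
    }
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shapes(&[3, 1], &[1, 4]), Some(vec![3, 4]));
    assert_eq!(broadcast_shapes(&[3, 4], &[4]), Some(vec![3, 4]));
    assert_eq!(broadcast_shapes(&[3, 2], &[4]), None);
    println!("ok");
}
```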
### Device management

- `Device::Cpu` – host memory.
- `Device::Cuda(id)` – CUDA device with the given index.
- `set_current_device(id)` – sets the default device for context creation.
- `get_device_count()` – returns the number of CUDA‑capable devices.
- `get_stream()` / `set_stream()` – thread‑local current CUDA stream.
### GPU streams and events

- Streams allow asynchronous command submission. Use `Stream::new()` to create a custom non‑default stream.
- Events record points in a stream and can be used for timing (`elapsed_since`) or cross‑stream dependencies (`wait_event`).
```rust
let stream = Stream::new()?;
// ... launch kernels, copy data ...
let event = stream.record(/* … */)?;
stream2.wait_event(/* … */)?; // wait on another stream
```
## 📦 Cargo Features

- `default` – CPU only.
- `cuda` – enables GPU support (requires the CUDA toolkit and `cudarc`). Enables the `cuda::*` functions, `ArcTensorView` GPU transfers, and GPU‑accelerated addition.
## 🧪 Testing

Run all tests (CPU only; GPU tests are ignored by default if no device is present):

```sh
cargo test
```

To run GPU tests (requires a CUDA device), enable the `cuda` feature and include the ignored tests, e.g.:

```sh
cargo test --features cuda -- --ignored
```
## 🐍 Python Bindings

The ndrs-python crate provides Python bindings using PyO3. Install from source, e.g. with maturin (the usual build tool for PyO3 projects; the exact command may differ):

```sh
pip install maturin
maturin develop --release
```
Then in Python (the method names below are illustrative, chosen to match the preserved output; check the crate's docs for the exact API):

```python
import ndrs

# Create a tensor from a nested list (auto‑detects dtype)
a = ndrs.tensor([[1.0, 2.0], [3.0, 4.0]])

# Move to GPU and add
b = a.cuda()
c = b + b

# Convert back to NumPy (requires `numpy` installed)
print(c.numpy())
# [[2. 4.]
#  [6. 8.]]
```
The Python API mirrors the Rust API: slicing, broadcasting, and arithmetic operators are supported.
## ⚙️ Performance Considerations

- Views are cheap: they copy only shape, strides, offset, and a handle to the underlying `Tensor`.
- Strided copy is optimized for both CPU and GPU; non‑contiguous copies use a fallback iterative kernel but remain efficient.
- GPU addition uses a highly optimized CUDA kernel that respects arbitrary strides, achieving near‑peak memory bandwidth.
- Automatic event tracking is enabled by default to ensure safe multi‑stream synchronization; you can disable it for maximum throughput if you manage dependencies manually.
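To make the strided‑copy point concrete, here is a tiny self‑contained CPU illustration of gathering a non‑contiguous 2‑D view into a contiguous buffer (ndrs' real kernels generalize this to N dimensions and GPU memory):

```rust
/// Copy a strided (possibly non‑contiguous) 2‑D view into a fresh
/// contiguous buffer. Strides are in elements for simplicity.
fn strided_copy_2d(src: &[f32], offset: usize, shape: (usize, usize),
                   strides: (usize, usize)) -> Vec<f32> {
    let (rows, cols) = shape;
    let mut out = Vec::with_capacity(rows * cols);
    for r in 0..rows {
        for c in 0..cols {
            // Each element's position is offset + linear combination of strides.
            out.push(src[offset + r * strides.0 + c * strides.1]);
        }
    }
    out
}

fn main() {
    // A 3×4 row‑major buffer; view every other column:
    // shape (3, 2), row stride 4, column stride 2.
    let buf: Vec<f32> = (0..12).map(|x| x as f32).collect();
    let v = strided_copy_2d(&buf, 0, (3, 2), (4, 2));
    assert_eq!(v, vec![0.0, 2.0, 4.0, 6.0, 8.0, 10.0]);
    println!("{:?}", v);
}
```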
## 📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Please open an issue or pull request on GitHub. For major changes, please discuss first.