ndrs

ndrs is a NumPy‑like tensor library for Rust, providing multi‑dimensional array (tensor) operations with optional GPU acceleration via CUDA. It emphasizes zero‑copy views, efficient strided operations, and a flexible ownership model.

📑 Table of Contents

Features
Quick Start
Core Concepts
Cargo Features
Testing
Python Bindings
Performance Considerations
License
Contributing
Acknowledgments

✨ Features

N‑dimensional tensors – shape, strides, and byte‑level data storage.
View‑based operations – slicing, broadcasting, transposing, and reshaping without copying data.
Efficient strided copy – fast data movement between non‑contiguous layouts.
Thread‑local and thread‑safe variants
- Rc<RefCell<Tensor>> for single‑threaded speed
- Arc<ReentrantMutex<RefCell<Tensor>>> for multi‑threading and Python bindings
GPU acceleration – transparent CPU ↔ GPU transfer, CUDA kernels for element‑wise operations.
CUDA stream support – asynchronous execution, events for timing and cross‑stream synchronization.
Operator overloading – + and += for tensors with broadcastable shapes.
Python‑like slicing – intuitive s! macro: s![1..4:2, ..].
Broadcasting – automatic shape expansion.
Dynamic CUDA kernels – compile and launch kernels from PTX or CUDA C++ source at runtime with RawKernel.
Custom elementwise kernels – define your own per‑element operations (e.g., out = a + b * c) that work on tensors of any rank and dtype, automatically compiled for GPU.
Structured dtypes – build compound types (similar to NumPy structured arrays) with named fields.
NPY file I/O – load and save tensors from/to NumPy .npy files (preserving shape, supports f32/i32).
Convenient tensor! macro – create tensors from nested literals with optional dtype and device specifiers.
Python bindings – use ndrs from Python via PyO3, with full support for custom dtypes and operation overriding.
Override operators with custom kernels – replace built‑in implementations (e.g., addition) with your own CPU/GPU kernels for maximum performance.

🚀 Quick Start

Add this to your Cargo.toml:

[dependencies]
ndrs = "0.4"

Basic CPU usage with the `tensor!` macro

use ndrs::{Tensor, s, tensor};

fn main() -> Result<(), String> {
    let a = tensor!([[1, -2], [3, 4]]);      // automatically i32 on CPU
    let b = tensor!([[5, 6], [7, 8]]);
    let a_view = a.into_rc().as_view();
    let b_view = b.into_rc().as_view();
    let c_view = a_view + b_view;
    assert_eq!(c_view.shape(), &[2, 2]);
    let result = c_view.to_vec::<i32>()?;
    assert_eq!(result, vec![6, 4, 10, 12]);
    Ok(())
}

Creating tensors with explicit dtype and device

let t = tensor!([[1, 2], [3, 4]]);               // CPU, i32
let t = tensor!([[1, 2], [3, 4]]; f32);         // CPU, f32
let t = tensor!([[1, 2], [3, 4]]; "cpu");       // CPU, auto‑dtype
let t = tensor!([[1.0, 2.0], [3.0, 4.0]]; "cuda:0");   // GPU 0, f32
let t = tensor!([[1, 2], [3, 4]]; i32; "cuda:1");       // GPU 1, i32

GPU usage with CUDA streams

use ndrs::{tensor, cuda};

fn gpu_stream_example() -> Result<(), String> {
    if !cuda::is_available() { return Ok(()); }
    cuda::set_device(0)?;
    let stream1 = cuda::Stream::new(Some(0))?;
    let stream2 = cuda::Stream::new(Some(0))?;

    let a = tensor!([[1.0, 2.0], [3.0, 4.0]]; f32; "cuda:0");
    let b = tensor!([[5.0, 6.0], [7.0, 8.0]]; f32; "cuda:0");
    let a_view = a.into_arc().as_view();
    let b_view = b.into_arc().as_view();
    let mut out_gpu = a_view.create_output()?;

    let start = stream1.record()?;
    a_view.add(&b_view, &mut out_gpu)?;
    let end = stream1.record()?;
    stream1.synchronize()?;
    println!("GPU addition took {:?}", end.elapsed_since(&start)?);

    let event = stream1.record()?;
    stream2.wait_event(&event)?;
    let mut out2_gpu = out_gpu.create_output()?;
    out_gpu.add(&out_gpu, &mut out2_gpu)?;
    stream2.synchronize()?;
    let result = out2_gpu.to_cpu()?.to_vec::<f32>()?;
    assert_eq!(result, vec![12.0, 16.0, 20.0, 24.0]);
    Ok(())
}

NPY file I/O (NumPy compatibility)

use ndrs::{tensor, load_npy, save_npy};

fn npy_example() -> Result<(), String> {
    let t = tensor!([[1.0, 2.0], [3.0, 4.0]]);
    save_npy(&t, "example.npy")?;
    let loaded = load_npy("example.npy")?;
    assert_eq!(loaded.to_vec::<f32>()?, vec![1.0, 2.0, 3.0, 4.0]);
    Ok(())
}

Custom primitive data types

use ndrs::{register_dtype, register_add_op, TypeInfo, DType, Device};
use std::sync::Arc;

const DTYPE_MY_TYPE: DType = 1000;

#[repr(C)]
#[derive(Clone, Copy)]
struct MyType { a: i32, b: i32 }

fn mytype_add_op(/* ... */) -> Result<(), String> { /* ... */ }

register_dtype(DTYPE_MY_TYPE, TypeInfo { size: std::mem::size_of::<MyType>(), name: "mytype" });
register_add_op(DTYPE_MY_TYPE, Arc::new(mytype_add_op));

Custom elementwise kernels (GPU) – Rust

Define per‑element operations that work on tensors of any rank and dtype, automatically compiled for GPU.

use ndrs::{tensor, ArcTensorView, cuda, Device, Tensor};
use ndrs::backend::cuda::ElementwiseKernel;
use ndrs::view::TensorViewOps;

let a = tensor!([[1.0, 2.0], [3.0, 4.0]]; f32; "cuda:0");
let b = tensor!([[5.0, 6.0], [7.0, 8.0]]; f32; "cuda:0");
let mut out = Tensor::new_contiguous(vec![2,2], a.dtype(), Device::Cuda(0))?.into_arc();

let a_view = a.into_arc().as_view();
let b_view = b.into_arc().as_view();
let mut out_view = out.as_view();

// Simple addition
ArcTensorView::elementwise_kernel(&mut out_view, "out = a + b", vec![&a_view, &b_view])?;

// Multi‑statement with local variables
ArcTensorView::elementwise_kernel(
    &mut out_view,
    "float tmp = a * 2.0; out = tmp + b",
    vec![&a_view, &b_view],
)?;

Python custom elementwise kernels

The Python binding provides a CuPy‑like ElementwiseKernel:

import ndrs as nd
import numpy as np

kernel = nd.cuda.ElementwiseKernel(
    "X x, Y y",          # input parameters: type placeholder X, variable name x
    "Z z",               # output parameter: type placeholder Z, variable name z
    "z = (x - y) * (x - y)",  # expression
    "squared_diff"       # kernel name (optional)
)

a = nd.Tensor([1.0, 2.0, 3.0], dtype=nd.float32, device="cuda:0")
b = nd.Tensor([4.0, 5.0, 6.0], dtype=nd.float32, device="cuda:0")
c = kernel(a, b)
np.testing.assert_allclose(c.contiguous().numpy(), [9.0, 9.0, 9.0])

Multi‑statement example:

kernel = nd.cuda.ElementwiseKernel(
    "X x, Y y", "Z z",
    "X a = x + 1; Y b = y * 2; z = a + b",
    "multi_stmt"
)

The kernel automatically handles broadcasting, strides, and per‑tensor offsets (e.g., from slicing). Type placeholders (X, Y, Z) are mapped to actual dtypes at call time.

Low‑level RawKernel (dynamic CUDA kernels)

use ndrs::cuda::{RawKernel, get_stream, set_device};
use cudarc::driver::LaunchConfig;
use ndrs::tensor::Tensor;

fn raw_kernel_example() -> anyhow::Result<()> {
    set_device(0)?;
    let stream = get_stream()?;
    let ctx = stream.inner().context().clone();

    let kernel_src = r#"
        extern "C" __global__ void times_two(float* out, const float* in, const int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = 2.0f * in[i];
        }
    "#;
    let kernel = RawKernel::from_source(kernel_src, "times_two", &ctx)?;

    let in_tensor = Tensor::new_cpu_from_f32(vec![1.0, 2.0, 3.0], vec![3]);
    let in_gpu = in_tensor.into_arc().as_view().to_gpu(0)?;
    let mut out_gpu = in_gpu.create_output()?;

    let n = in_gpu.size();
    let block = 256;
    let grid = (n + block - 1) / block;
    let cfg = LaunchConfig { grid_dim: (grid as u32, 1, 1), block_dim: (block, 1, 1), shared_mem_bytes: 0 };

    let mut builder = kernel.launch_builder(stream.inner());
    builder.arg(&mut out_gpu.handle().0.lock().borrow_mut().as_gpu_slice_mut().unwrap());
    builder.arg(&in_gpu.handle().0.lock().borrow().as_gpu_slice().unwrap());
    builder.arg(&n);
    unsafe { builder.launch(cfg)?; }

    let result = out_gpu.to_cpu()?.to_vec::<f32>()?;
    assert_eq!(result, vec![2.0, 4.0, 6.0]);
    Ok(())
}

🧠 Core Concepts

`Tensor`

The raw data container. It owns a contiguous byte buffer (either on CPU or GPU) and stores shape, strides, data type, and device information. It does not implement operations directly – use TensorView for that.

Constructors:

Tensor::new_cpu_from_slice<T>(&[T], shape) – from slice.
Tensor::new_from_bytes(bytes, shape, dtype, device) – from raw bytes (CPU or GPU).
Tensor::new_contiguous(shape, dtype) – zero‑initialized CPU tensor.
Tensor::from_string_literal(s) – parse from a literal string (used by tensor!).

`TensorView`

A view into a Tensor with an optional offset, shape, and strides. All mathematical operations (addition, slicing, broadcasting, device transfer) are defined on views.

Two concrete view types are provided:

RcTensorView – thread‑local variant using Rc<RefCell<Tensor>>.
Fast and lightweight for single‑threaded code.
ArcTensorView – thread‑safe variant using Arc<ReentrantMutex<RefCell<Tensor>>>.
Required for multi‑threaded environments and Python bindings.

Slice macro `s!`

let sub = view.slice(&s![1..4, 2..6])?;
let row = view.slice(&s![2, ..])?;
let every_other = view.slice(&s![0..8:2, ..])?;

Broadcasting

use ndrs::broadcast_shapes;
let target = broadcast_shapes(a.shape(), b.shape()).unwrap();
let a_bcast = a_view.broadcast_to(&target)?;

Device management

Device::Cpu / Device::Cuda(id)
cuda::set_device(id), cuda::get_device_count(), cuda::get_stream(), cuda::set_stream()

GPU streams and events

let stream = cuda::Stream::new(Some(0))?;
let event = stream.record()?;
stream2.wait_event(&event)?;

📦 Cargo Features

default – CPU only.
cuda – enables GPU support (requires CUDA toolkit and cudarc).
Enables cuda::* functions, ArcTensorView GPU transfers, GPU‑accelerated addition, and the ElementwiseKernel / RawKernel facilities.

🧪 Testing

cargo test               # CPU tests only
cargo test -- --ignored  # includes GPU tests (if CUDA available)

🐍 Python Bindings

The ndrs-python crate provides Python bindings using PyO3.

Installation

cd python
maturin develop          # CPU only
maturin develop --features cuda   # with CUDA support

Basic usage

import ndrs as nd
import numpy as np

t = nd.Tensor([[1, 2], [3, 4]], dtype=nd.float32)
t2 = t.to("cuda:0")
t3 = t2 + t2
print(t3.numpy())   # [[2. 4.], [6. 8.]]

CUDA streams and events from Python

import ndrs as nd

nd.cuda.set_device("cuda:0")
stream1 = nd.cuda.Stream(device_id=0)
stream2 = nd.cuda.Stream(device_id=0)

a = nd.Tensor([1.0, 2.0, 3.0], dtype=nd.float32, device="cuda:0")
b = nd.Tensor([4.0, 5.0, 6.0], dtype=nd.float32, device="cuda:0")

nd.cuda.set_stream(stream1)
c = a + b
event = stream1.record_event()

nd.cuda.set_stream(stream2)
stream2.wait_event(event)
d = c * 2.0
stream2.synchronize()

print(d.numpy())  # [10. 14. 18.]

Custom dtypes and operations

Structured dtypes:

complex_dtype = nd.dtype.from_fields([('re', nd.float32), ('im', nd.float32)])
data = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=complex_dtype.to_numpy_dtype())
t = nd.Tensor(data, dtype=complex_dtype)

Override addition:

def my_add(a_ptr, b_ptr, out_ptr, n, device_code, stream):
    import ctypes, numpy as np
    a = np.ctypeslib.as_array(ctypes.cast(a_ptr, ctypes.POINTER(ctypes.c_float)), shape=(n,))
    b = np.ctypeslib.as_array(ctypes.cast(b_ptr, ctypes.POINTER(ctypes.c_float)), shape=(n,))
    out = np.ctypeslib.as_array(ctypes.cast(out_ptr, ctypes.POINTER(ctypes.c_float)), shape=(n,))
    np.add(a, b, out=out)
    return 0

nd.register_binary_op(nd.float32, nd.BINARY_OP_ADD, "cpu", my_add)

Python API reference

Function / Class	Description
`nd.Tensor(data, dtype=None, device=None)`	Create tensor from list or NumPy array.
`nd.Tensor.from_numpy(array, device=None)`	Alternative constructor from NumPy array.
`tensor.shape` / `tensor.dtype` / `tensor.device`	Basic properties.
`tensor.numpy()`	Copy data to NumPy array.
`tensor.to(device)`	Move tensor to another device.
`nd.dtype.from_fields(fields)`	Create structured dtype.
`nd.register_dtype(name, itemsize)`	Register plain dtype, returns id.
`nd.register_binary_op(dtype, kind, device, callback)`	Override binary op (e.g., `nd.BINARY_OP_ADD`).
`nd.cuda.get_device()`, `set_device(device_str)`	Get/set current CUDA device.
`nd.cuda.Stream(device_id)`	Create a CUDA stream.
`nd.cuda.Event(device_id)`	Create a CUDA event.

⚙️ Performance Considerations

Views are cheap: they copy only shape, strides, offset, and a handle.
Strided copy is optimized for CPU and GPU; non‑contiguous copies use an iterative kernel but are still efficient.
GPU addition uses a highly optimized stride‑aware CUDA kernel.
ElementwiseKernel compiles a dedicated kernel for the given expression, shape, and dtypes; subsequent calls with the same signature reuse the compiled kernel.
Automatic event tracking is enabled by default to ensure safe multi‑stream synchronization; you can disable it for maximum throughput if you manage dependencies manually.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please open an issue or pull request on GitHub. For major changes, please discuss first.

🙏 Acknowledgments

Inspired by NumPy, PyTorch, and the ndarray crate.
Uses cudarc for CUDA bindings and bytemuck for safe byte casts.

ndrs 0.5.0