# numr Architecture Guide
This document describes the internal architecture of numr for contributors and
adopters migrating from ndarray, nalgebra, or PyTorch-like workflows.
---
## Overview
numr is a multi-backend tensor library. The same user code runs on CPU, CUDA,
and WebGPU without modification — backends are selected at compile time via
feature flags, and tensor operations dispatch to backend-specific kernels
through Rust's trait system.
```
User code: client.add(&a, &b)
│
┌──────────┼──────────┐
▼ ▼ ▼
CPU CUDA WebGPU
(SIMD) (PTX/nvcc) (WGSL)
```
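To make the dispatch concrete, here is a minimal end-to-end sketch of backend-agnostic user code. The prelude import, `from_slice` constructor, and `shape()` accessor are assumptions used only for illustration; `DefaultRuntime`, `default_device`, `default_client`, and `add` come from the traits described in the sections below.
```rust
use numr::prelude::*; // assumed prelude re-exporting Tensor, DefaultRuntime, op traits

fn main() -> numr::Result<()> {
    // DefaultRuntime resolves to CpuRuntime, CudaRuntime, or WgpuRuntime
    // depending on which feature flags were enabled at compile time.
    let device = DefaultRuntime::default_device();
    let client = DefaultRuntime::default_client(&device);

    // `from_slice` is a hypothetical constructor used here for illustration.
    let a = Tensor::<DefaultRuntime>::from_slice(&[1.0f32, 2.0, 3.0], &[3], &device)?;
    let b = Tensor::<DefaultRuntime>::from_slice(&[4.0f32, 5.0, 6.0], &[3], &device)?;

    // The same call compiles and runs on every backend; the client's concrete
    // type decides which kernel is launched.
    let c = client.add(&a, &b)?;
    println!("{:?}", c.shape());
    Ok(())
}
```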
---
## Runtime Trait Hierarchy
Every backend implements three traits that together define a compute target.
### `Runtime` — backend identity
```
src/runtime/traits/runtime.rs
```
```rust
pub trait Runtime: Clone + Send + Sync + 'static {
    type Device: Device;
    type Client: RuntimeClient<Self>;
    type Allocator: Allocator;
    type RawHandle: Send + Sync;

    fn name() -> &'static str;
    fn allocate(size_bytes: usize, device: &Self::Device) -> Result<u64>;
    fn deallocate(ptr: u64, size_bytes: usize, device: &Self::Device);
    fn copy_to_device(src: &[u8], dst: u64, device: &Self::Device) -> Result<()>;
    fn copy_from_device(src: u64, dst: &mut [u8], device: &Self::Device) -> Result<()>;
    fn copy_within_device(src: u64, dst: u64, size_bytes: usize, device: &Self::Device) -> Result<()>;
    fn default_device() -> Self::Device;
    fn default_client(device: &Self::Device) -> Self::Client;
    // ...
}
```
`Runtime` owns the raw memory interface. It is purely a type-level marker
with static methods — no instances are created.
Concrete implementations: `CpuRuntime`, `CudaRuntime`, `WgpuRuntime`.
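Because everything on `Runtime` is an associated function, backend-generic helpers can be written against `R: Runtime` alone. A minimal sketch, assuming the `numr::runtime::traits` re-export path and a `numr::Result` alias (the helper itself is illustrative, not library code):
```rust
use numr::runtime::traits::Runtime; // assumed re-export path
use numr::Result;                   // assumed Result alias from error.rs

/// Upload a byte buffer to a runtime's default device using only the
/// `Runtime` trait's associated functions. Works for any backend.
fn upload_bytes<R: Runtime>(bytes: &[u8]) -> Result<(u64, R::Device)> {
    let device = R::default_device();
    let ptr = R::allocate(bytes.len(), &device)?;
    R::copy_to_device(bytes, ptr, &device)?;
    Ok((ptr, device)) // caller is responsible for calling R::deallocate later
}
```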
### `Device` — a specific GPU or CPU
```
src/runtime/traits/device.rs
```
```rust
pub trait Device: Clone + Send + Sync + 'static {
    fn id(&self) -> usize;
    fn name(&self) -> String;
}
```
A lightweight handle identifying a particular piece of hardware. For CPU this
is a singleton; for CUDA it maps to a device ordinal.
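A sketch of what a backend's device handle can look like. The struct and field below are illustrative assumptions, not numr's actual `CudaDevice` definition:
```rust
#[derive(Clone)]
pub struct CudaDevice {
    index: usize, // CUDA device ordinal
}

impl Device for CudaDevice {
    fn id(&self) -> usize {
        self.index
    }
    fn name(&self) -> String {
        format!("cuda:{}", self.index)
    }
}
```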
### `RuntimeClient` — operation dispatcher
```
src/runtime/traits/client.rs
```
```rust
pub trait RuntimeClient<R: Runtime>: Clone + Send + Sync {
    fn device(&self) -> &R::Device;
    fn synchronize(&self);
}
```
The client owns any per-device state (CUDA stream, WebGPU queue, parallelism
config) and is the receiver for all operation trait methods.
**All tensor operations are methods on the client**, not on tensors:
```rust
let result = client.add(&a, &b)?; // BinaryOps::add
let reduced = client.sum(&a, &[0], false)?; // ReduceOps::sum
```
This design makes it impossible to accidentally mix backends — the client's
type determines which kernels run.
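The same property lets you write helpers that are generic over the backend, constrained only by the op traits they need. A sketch with assumed import paths (the trait names match the document; the module paths are assumptions):
```rust
use numr::ops::traits::BinaryOps; // assumed path to the op trait
use numr::runtime::traits::{Runtime, RuntimeClient};
use numr::{Result, Tensor};

/// Works with any backend whose client implements BinaryOps; the client
/// argument alone determines where the kernels run.
fn add_twice<R, C>(client: &C, a: &Tensor<R>, b: &Tensor<R>) -> Result<Tensor<R>>
where
    R: Runtime,
    C: RuntimeClient<R> + BinaryOps<R>,
{
    let sum = client.add(a, b)?;
    client.add(&sum, b) // (a + b) + b, purely to show chained dispatch
}
```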
---
## Tensor Layout
```
src/tensor/core.rs — Tensor<R> struct
src/tensor/storage.rs — Storage<R>, reference-counted device memory
src/tensor/layout.rs — Layout (shape + strides + offset)
```
### `Tensor<R: Runtime>`
```rust
pub struct Tensor<R: Runtime> {
    id: TensorId,        // unique ID for autograd tracking
    storage: Storage<R>, // Arc-wrapped device memory
    layout: Layout,      // shape, strides, offset
}
```
### `Storage<R>`
```rust
struct StorageInner<R: Runtime> {
    ptr: u64,          // raw device pointer (GPU address or CPU ptr)
    len: usize,        // number of elements
    dtype: DType,      // element type
    device: R::Device, // device where memory lives
    owned: bool,       // if true, deallocate on drop
}
```
Storage is `Arc`-wrapped. Multiple tensors can share the same allocation —
this is how zero-copy views work. Memory is freed when the last reference
drops, via `Runtime::deallocate()` in the `Drop` impl.
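A minimal sketch of that cleanup path, assuming the field names shown above and a `size_bytes()` helper on `DType` (both the helper and the exact placement of the `Drop` impl are assumptions):
```rust
impl<R: Runtime> Drop for StorageInner<R> {
    fn drop(&mut self) {
        // Only owning storage frees memory; borrowed pointers are left alone.
        if self.owned {
            let size_bytes = self.len * self.dtype.size_bytes(); // size_bytes() assumed
            R::deallocate(self.ptr, size_bytes, &self.device);
        }
    }
}
```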
### `Layout`
```rust
pub struct Layout {
    shape: Shape,     // size along each dimension
    strides: Strides, // distance in elements between consecutive indices along each dim
    offset: usize,    // starting element index in storage
}
```
Strides follow row-major convention: shape `[2, 3, 4]` produces strides
`[12, 4, 1]`.
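The rule is easy to verify with a few lines of plain Rust (self-contained, independent of numr's `Layout` type):
```rust
// Row-major stride computation and flat indexing for shape [2, 3, 4].
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for d in (0..shape.len().saturating_sub(1)).rev() {
        strides[d] = strides[d + 1] * shape[d + 1];
    }
    strides
}

fn flat_index(idx: &[usize], strides: &[usize], offset: usize) -> usize {
    offset + idx.iter().zip(strides).map(|(i, s)| i * s).sum::<usize>()
}

fn main() {
    let strides = contiguous_strides(&[2, 3, 4]);
    assert_eq!(strides, vec![12, 4, 1]);
    // Element [1, 2, 3] lives at 0 + 1*12 + 2*4 + 3*1 = 23.
    assert_eq!(flat_index(&[1, 2, 3], &strides, 0), 23);
}
```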
---
## Zero-Copy Views
These operations create a new `Tensor` sharing the same `Storage`, only
changing the `Layout`:
| Operation | Layout change |
|-----------|---------------|
| `reshape` | New shape + recomputed strides (contiguous input only) |
| `transpose(d0, d1)` | Swaps shape[d0]/shape[d1] and strides[d0]/strides[d1] |
| `permute` | Arbitrary dimension reordering via stride permutation |
| `unsqueeze(dim)` | Inserts size-1 dimension (stride = next dim's stride × size) |
| `squeeze(dim)` | Removes size-1 dimension |
| `narrow(dim, start, len)` | Adjusts offset + shape along one dimension |
| `broadcast_to` | Sets stride=0 for broadcast dimensions |
| `flip` | Negates stride, adjusts offset |
No data is copied. The resulting tensor is a view into the original storage.
If an operation needs contiguous memory (e.g., a kernel launch), call
`.contiguous()`, which returns a new tensor backed by freshly allocated
contiguous storage, or hands back the tensor unchanged if it is already contiguous.
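To make the view mechanics from the table concrete, here is the layout bookkeeping behind `transpose`, written as self-contained Rust independent of numr's types:
```rust
// Transpose as pure layout bookkeeping: swap two shape entries and the
// corresponding strides; the underlying storage is untouched.
fn transpose_layout(shape: &mut [usize], strides: &mut [usize], d0: usize, d1: usize) {
    shape.swap(d0, d1);
    strides.swap(d0, d1);
}

fn main() {
    let mut shape = [2, 3, 4];
    let mut strides = [12, 4, 1];
    transpose_layout(&mut shape, &mut strides, 0, 2);
    assert_eq!(shape, [4, 3, 2]);
    assert_eq!(strides, [1, 4, 12]); // no longer contiguous: a view, not a copy
}
```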
---
## Operation Architecture
### Three-Layer Dispatch (Primitive Ops)
Primitive operations like `add`, `exp`, `sum` follow this pattern:
```
1. Trait definition — src/ops/traits/{op}.rs
2. Backend impl — src/ops/{backend}/{op}.rs
3. Backend kernel — src/runtime/cpu/kernels/{op}.rs (CPU)
src/runtime/cuda/kernels/{op}.cu (CUDA)
src/runtime/wgpu/shaders/{op}.wgsl (WebGPU)
```
**Concrete example: `client.add(&a, &b)`**
```
src/ops/traits/binary.rs trait BinaryOps<R> { fn add(...) }
│
├─ src/ops/cpu/binary.rs impl BinaryOps<CpuRuntime> for CpuClient
│ │
│ └─ src/runtime/cpu/helpers/binary.rs shape validation, broadcast
│ │
│ └─ src/runtime/cpu/kernels/binary.rs SIMD kernel (AVX2/NEON)
│
├─ src/ops/cuda/binary.rs impl BinaryOps<CudaRuntime> for CudaClient
│ │
│ └─ launches PTX kernel: binary.ptx → add_f32
│
└─ src/ops/wgpu/binary.rs impl BinaryOps<WgpuRuntime> for WgpuClient
│
└─ dispatches WGSL shader: binary.wgsl → add entry point
```
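In code, layers 1 and 2 look roughly like this. A sketch: the trait signature follows the calls shown earlier, while the `binary_op_cpu` helper name is an assumption standing in for the layer-2 glue code.
```rust
// Layer 1: trait definition (src/ops/traits/binary.rs)
pub trait BinaryOps<R: Runtime> {
    fn add(&self, a: &Tensor<R>, b: &Tensor<R>) -> Result<Tensor<R>>;
}

// Layer 2: backend impl (src/ops/cpu/binary.rs)
impl BinaryOps<CpuRuntime> for CpuClient {
    fn add(&self, a: &Tensor<CpuRuntime>, b: &Tensor<CpuRuntime>) -> Result<Tensor<CpuRuntime>> {
        // Validate/broadcast shapes, allocate the output, then hand raw
        // pointers to the layer-3 SIMD kernel. `binary_op_cpu` is hypothetical.
        binary_op_cpu(self, a, b, BinaryOp::Add)
    }
}
```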
### Four-Layer Dispatch (Composite Ops)
Composite operations (softmax, layernorm, unfold) add `impl_generic/` to
guarantee the same algorithm across all backends:
```
1. Trait definition — src/ops/traits/{op}.rs
2. Generic algorithm — src/ops/impl_generic/{op}.rs
3. Backend impl — src/ops/{backend}/{op}.rs (delegates to impl_generic)
4. Optional fused kernel
```
The generic algorithm calls only primitive ops, so all backends execute the
same sequence:
```rust
// src/ops/impl_generic/shape.rs
pub fn unfold_impl<R: Runtime, C: ShapeOps<R>>(
    client: &C,
    tensor: &Tensor<R>,
    dim: isize,
    size: usize,
    step: usize,
) -> Result<Tensor<R>> {
    // Uses narrow (primitive) + stack (primitive) + permute (view)
    // Same algorithm regardless of backend
}
```
Backend impls delegate:
```rust
impl ShapeOps<CudaRuntime> for CudaClient {
    fn unfold(&self, tensor: &Tensor<CudaRuntime>, ...) -> Result<...> {
        unfold_impl(self, tensor, dim, size, step) // same code path
    }
}
```
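As a further illustration, a composite like softmax can be written in `impl_generic` style purely in terms of primitives. The sketch below assumes `max`, `sub`, `exp`, and `div` live on `ReduceOps`/`BinaryOps`/`UnaryOps`; that trait placement is an assumption, not numr's confirmed API.
```rust
pub fn softmax_impl<R, C>(client: &C, x: &Tensor<R>, dim: usize) -> Result<Tensor<R>>
where
    R: Runtime,
    C: ReduceOps<R> + BinaryOps<R> + UnaryOps<R>,
{
    // Numerically stable softmax: exp(x - max(x)) / sum(exp(x - max(x))).
    // Every step is a primitive op, so all backends run the same sequence.
    let max = client.max(x, &[dim], true)?;      // keepdim so it broadcasts back
    let shifted = client.sub(x, &max)?;
    let exp = client.exp(&shifted)?;
    let denom = client.sum(&exp, &[dim], true)?;
    client.div(&exp, &denom)
}
```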
### Why This Matters
- Adding a new primitive op = new files, not modifying existing files
- Composite ops produce identical numerical results across backends
- Optional fused kernels (CUDA softmax, etc.) must match `impl_generic` output
---
## Backend Kernel Mechanisms
### CPU: SIMD Kernels
```
src/runtime/cpu/kernels/ — kernel entry points
src/runtime/cpu/kernels/simd/ — AVX2/AVX-512/NEON implementations
```
CPU kernels dispatch on dtype and architecture:
```rust
pub unsafe fn binary_op_kernel<T: Element>(
    op: BinaryOp,
    a: *const T,
    b: *const T,
    out: *mut T,
    len: usize,
) {
    #[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
    match T::DTYPE {
        DType::F32 => { simd::binary::binary_f32(op, a, b, out, len); return; }
        DType::F64 => { simd::binary::binary_f64(op, a, b, out, len); return; }
        _ => {}
    }
    binary_op_scalar(op, a, b, out, len); // scalar fallback
}
```
Parallelism is controlled via `ParallelismConfig` on `CpuClient`, which
configures thread count and chunk size for rayon-based parallel iteration.
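The per-kernel parallel loop follows the usual rayon chunking pattern. A self-contained sketch: the `chunk_size` argument mirrors what `ParallelismConfig` would supply, but the config's actual fields are not shown in this document.
```rust
use rayon::prelude::*;

/// Apply `f` element-wise, splitting the work into chunks that rayon
/// distributes across its thread pool.
fn parallel_unary(input: &[f32], output: &mut [f32], chunk_size: usize, f: impl Fn(f32) -> f32 + Sync) {
    output
        .par_chunks_mut(chunk_size)
        .zip(input.par_chunks(chunk_size))
        .for_each(|(out_chunk, in_chunk)| {
            for (o, &x) in out_chunk.iter_mut().zip(in_chunk) {
                *o = f(x);
            }
        });
}
```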
### CUDA: PTX Kernel Loading
```
build.rs — compiles .cu → .ptx via nvcc
src/runtime/cuda/kernels/*.cu — CUDA C++ source (templated per dtype)
src/runtime/cuda/kernels/loader.rs — loads PTX, caches modules per device
```
**Lifecycle:**
1. `build.rs` runs `nvcc -ptx -O3 -arch=sm_75` on each `.cu` file
2. PTX files written to `$OUT_DIR`, path stored in `CUDA_KERNEL_DIR` env var
3. At runtime, first use loads PTX via `Ptx::from_file()` and creates a `CudaModule`
4. Module cached in a global `HashMap<(device_index, module_name), Arc<CudaModule>>`
5. Kernel functions retrieved from module by name (e.g., `"add_f32"`)
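The cache in step 4 is ordinary shared-state Rust. A self-contained sketch of its shape; the `CudaModule` placeholder and `load_ptx_module` helper stand in for the real cudarc-based loading code and are assumptions:
```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Placeholders standing in for the real module handle and PTX loading code.
struct CudaModule;
fn load_ptx_module(_device_index: usize, _module_name: &str) -> CudaModule {
    CudaModule // real code: read the PTX from $CUDA_KERNEL_DIR and load it
}

// Global cache keyed by (device ordinal, module name), as described above.
static MODULE_CACHE: OnceLock<Mutex<HashMap<(usize, String), Arc<CudaModule>>>> = OnceLock::new();

fn get_or_load_module(device_index: usize, module_name: &str) -> Arc<CudaModule> {
    let cache = MODULE_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry((device_index, module_name.to_string()))
        .or_insert_with(|| Arc::new(load_ptx_module(device_index, module_name)))
        .clone()
}
```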
CUDA kernels use C++ templates with `extern "C"` linkage for per-dtype
instantiation:
```cuda
// Shared per-thread body lives in a __device__ template (a __global__ kernel
// cannot be called directly from another kernel); each extern "C" wrapper
// gives one dtype instantiation a stable, unmangled name loadable from PTX.
template<typename T>
__device__ void add_kernel(const T* a, const T* b, T* out, unsigned int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = a[idx] + b[idx];
}

extern "C" {
__global__ void add_f32(const float* a, const float* b, float* out, unsigned int n)
    { add_kernel<float>(a, b, out, n); }
__global__ void add_f64(const double* a, const double* b, double* out, unsigned int n)
    { add_kernel<double>(a, b, out, n); }
}
```
### WebGPU: WGSL Shader Dispatch
```
src/runtime/wgpu/shaders/ — WGSL source (embedded as Rust strings)
src/runtime/wgpu/shaders/pipeline.rs — shader compilation + pipeline cache
```
**Lifecycle:**
1. WGSL source is embedded in Rust code as string constants
2. First use: `device.create_shader_module()` compiles WGSL → `ShaderModule`
3. A `ComputePipeline` is created with bind group layout (buffer bindings)
4. Both module and pipeline cached in `PipelineCache` (keyed by shader name + entry point)
5. Dispatch: create bind group → encode compute pass → `queue.submit()`
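Sizing the dispatch in step 5 is a ceiling division of the element count by the shader's workgroup size. A small sketch; 256 is an assumed `@workgroup_size`, not necessarily what numr's shaders declare:
```rust
const WORKGROUP_SIZE: u64 = 256; // must match the shader's @workgroup_size

/// Number of workgroups needed to cover `num_elements` elements.
fn dispatch_size(num_elements: u64) -> u32 {
    num_elements.div_ceil(WORKGROUP_SIZE) as u32
}
```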
WebGPU supports F32, I32, U32 natively, plus F16 with the
`shader-f16` feature. Unsupported dtypes return `Error::UnsupportedDType`.
---
## Autograd
```
src/autograd/var.rs — Var<R> struct
src/autograd/grad_fn.rs — GradFn trait
src/autograd/backward.rs — backward() traversal
src/autograd/var_ops/ — differentiable operations (var_add, var_matmul, etc.)
```
### `Var<R: Runtime>`
```rust
pub struct Var<R: Runtime> {
    tensor: Tensor<R>,                   // underlying data
    id: TensorId,                        // graph node identity
    requires_grad: bool,                 // leaf flag
    grad_fn: Option<Arc<dyn GradFn<R>>>, // backward function (None for leaves)
}
```
`Var` wraps `Tensor` with gradient-tracking metadata. During the forward pass,
`var_*` functions create new `Var` nodes with `grad_fn` closures that capture
references to parent nodes.
### Backward Pass
`backward(&loss, &client)` performs reverse-mode AD:
1. Topological sort of the computation graph from `loss` to leaves
2. Walk in reverse order, calling each node's `grad_fn` to propagate gradients
3. Return `GradStore` mapping `TensorId → Tensor<R>` (gradient tensors)
Gradients are regular tensors — they use the same backend and operations as
the forward pass.
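A hedged usage sketch of the whole flow: `Var::new`, `id()`, and `GradStore::get` are assumed names layered on the API described above (`var_ops`, `backward`), and the imports are assumed to come from `numr::autograd` and the prelude.
```rust
fn grad_example(
    client: &CpuClient,
    x_tensor: Tensor<CpuRuntime>,
    w_tensor: Tensor<CpuRuntime>,
) -> Result<()> {
    let x = Var::new(x_tensor, true);     // leaf, requires_grad = true
    let w = Var::new(w_tensor, true);     // leaf, requires_grad = true

    let y = var_matmul(client, &x, &w)?;  // forward: y carries a grad_fn
    let loss = var_add(client, &y, &y)?;  // any further var_* op extends the graph

    let grads = backward(&loss, client)?; // GradStore: TensorId -> Tensor<R>
    let _dw = grads.get(w.id());          // gradient with respect to w
    Ok(())
}
```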
---
## DType Dispatch
```
src/dtype/mod.rs — DType enum
src/dtype/element.rs — Element trait (type-level ↔ value-level bridge)
```
Every operation must handle all supported dtypes at runtime. The
`dispatch_dtype!` macro bridges from the `DType` enum to generic `T: Element`
code:
```rust
dispatch_dtype!(tensor.dtype(), T => {
    kernels::binary_op::<T>(op, a, b, out)?;
}, "add");
```
This generates a match statement that monomorphizes the kernel for each dtype.
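Roughly what the invocation above expands to. This is illustrative: the real expansion covers every supported dtype, and the `unsupported_dtype_error` helper is a hypothetical stand-in for however the macro constructs `Error::UnsupportedDType`.
```rust
match tensor.dtype() {
    DType::F32 => kernels::binary_op::<f32>(op, a, b, out)?,
    DType::F64 => kernels::binary_op::<f64>(op, a, b, out)?,
    DType::I32 => kernels::binary_op::<i32>(op, a, b, out)?,
    // ...one arm per supported dtype...
    _ => return Err(unsupported_dtype_error("add")), // error construction assumed
}
```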
---
## Design Rationale
### Why traits, not enum dispatch?
Trait-based dispatch provides:
- **Compile-time safety**: missing backend implementations are compile errors
- **Zero-cost abstraction**: no runtime vtable lookup for operation dispatch
- **Independent compilation**: each backend compiles separately, no cross-deps
- **Extensibility**: new backends implement existing traits without modifying core
### Why operations on client, not on Tensor?
- Client carries backend state (CUDA stream, WebGPU queue, thread pool config)
- Prevents accidentally mixing backends in one expression
- Makes the compute target explicit in every call
### Why no vendor library dependencies?
numr uses native kernels exclusively — no cuBLAS, MKL, or vendor wrappers.
This ensures:
- Code works on any hardware the backend supports
- No 10GB+ SDK installation requirements
- Full portability to new backends (WebGPU, ROCm)
- Predictable, auditable kernel behavior
---
## Module Map
```
src/
├── lib.rs — entry point, prelude, DefaultRuntime
├── error.rs — Error enum (thiserror)
├── dtype/ — DType, Element, Complex64/128, dispatch macros
├── tensor/ — Tensor<R>, Storage<R>, Layout
├── runtime/
│ ├── traits/ — Runtime, Device, RuntimeClient
│ ├── cpu/ — CpuRuntime, CpuClient, SIMD kernels
│ ├── cuda/ — CudaRuntime, CudaClient, PTX loader
│ └── wgpu/ — WgpuRuntime, WgpuClient, WGSL pipelines
├── ops/
│ ├── traits/ — one file per operation category
│ ├── impl_generic/ — shared algorithms for composite ops
│ ├── cpu/ — CPU trait impls
│ ├── cuda/ — CUDA trait impls
│ └── wgpu/ — WebGPU trait impls
├── algorithm/ — FFT, linalg, special functions, polynomials
├── autograd/ — Var<R>, GradFn, backward, var_ops/
└── sparse/ — SparseTensor, COO/CSR/CSC (feature-gated)
```