# numr
**THE foundational numerical computing library for Rust.**
numr provides dense tensors, linear algebra, FFT, statistics, advanced random number generation, and automatic differentiation—with the same API and algorithms across CPU, CUDA, and WebGPU backends.
## Why numr?
The Rust numerical computing ecosystem is fragmented. You need one library for tensors (ndarray), another for linear algebra (nalgebra/faer), another for FFT (rustfft), another for random numbers, another for statistics. They don't interoperate. They don't have GPU support. They're not optimized together.
numr consolidates everything:
| Tensors | ndarray | Tensor<R> |
| Linear algebra | nalgebra / faer | numr::linalg |
| FFT | rustfft | numr::fft |
| Sparse | sprs / ndsparse | numr::sparse (feature-gated) |
| Statistics | statrs | numr::statistics |
| Random numbers | rand + manual distributions | numr::random + multivariate |
| GPU support | None | CPU, CUDA, WebGPU |
| Automatic differentiation | None | numr::autograd |
A Rust developer should never need to look elsewhere for numerical computing.
## Architecture
numr is designed with a simple principle: **same code, any backend**.
```
┌──────────────────────────────────────────────────────────────┐
│ Your Application │
│ (any backend-agnostic code) │
└──────────────────────────────────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌───▼────┐
│ CPU │ │ CUDA │ │ WebGPU │
│ Runtime │ │ Runtime │ │Runtime │
└────┬────┘ └────┬────┘ └───┬────┘
│ │ │
┌────▼──────────┬─────────┴───────┬───────────▼───┐
│ Trait │ │ │
│ Implemen- │ Same Algorithm │ Different │
│ tations │ Different Code │ Hardware │
└───────────────┴─────────────────┴───────────────┘
```
## Operations
numr implements a comprehensive set of tensor operations across CPU, CUDA, and WebGPU:
### Core Arithmetic
- **UnaryOps**: neg, abs, sqrt, exp, log, sin, cos, tan, sinh, cosh, tanh, floor, ceil, round, and more
- **BinaryOps**: add, sub, mul, div, pow, maximum, minimum (all with NumPy-style broadcasting)
- **ScalarOps**: tensor-scalar arithmetic
### Shape and Data Movement
- **ShapeOps**: cat, stack, split, chunk, repeat, pad, roll
- **IndexingOps**: gather, scatter, index_select, masked_select, masked_fill, embedding_lookup
- **SortingOps**: sort, argsort, topk, unique, nonzero, searchsorted
### Reductions
- **ReduceOps**: sum, mean, max, min, prod (with precision variants)
- **CumulativeOps**: cumsum, cumprod, logsumexp
### Comparisons and Logical
- **CompareOps**: eq, ne, lt, le, gt, ge
- **LogicalOps**: logical_and, logical_or, logical_xor, logical_not
- **ConditionalOps**: where (ternary conditional)
### Neural Network Operations
- **ActivationOps**: relu, sigmoid, silu, gelu, leaky_relu, elu, softmax
- **NormalizationOps**: rms_norm, layer_norm
### Linear Algebra
- **MatmulOps**: matmul, matmul_bias (fused GEMM+bias)
- **LinalgOps**: solve, lstsq, pinverse, inverse, det, trace, matrix_rank, diag, matrix_norm, kron, khatri_rao
### Statistics and Probability
- **StatisticalOps**: var, std, skew, kurtosis, quantile, percentile, median, cov, corrcoef
- **RandomOps**: rand, randn, randint, multinomial, bernoulli, poisson, binomial, beta, gamma, exponential, chi_squared, student_t, f_distribution
- **MultivariateRandomOps**: multivariate_normal, wishart, dirichlet
- **QuasirandomOps**: Sobol, Halton sequences
### Distance Metrics
- **DistanceOps**: euclidean, manhattan, cosine, hamming, jaccard, minkowski, chebyshev, correlation
### Algorithm Modules
**Linear Algebra (`numr::linalg`):**
- **Decompositions**: LU, QR, Cholesky, SVD, Schur, full eigendecomposition, generalized eigenvalues
- **Solvers**: solve, lstsq, pinverse
- **Matrix functions**: exp, log, sqrt, sign
- **Utilities**: det, trace, rank, matrix norms
**Fast Fourier Transform (`numr::fft`):**
- FFT/IFFT (1D, 2D, ND) - Stockham algorithm
- Real FFT (RFFT/IRFFT)
**Matrix Multiplication (`numr::matmul`):**
- Tiled GEMM with register blocking
- Bias fusion support
**Special Functions (`numr::special`):**
- Gamma functions: gamma, lgamma, digamma, polygamma
- Error functions: erf, erfc, erfcinv
- Bessel functions: J0, J1, Jn, Y0, Y1, Yn
- Inverse special functions: erfcinv
**Sparse Tensors (`numr::sparse`, feature-gated):**
- Formats: CSR, CSC, COO
- Operations: SpGEMM (sparse matrix multiplication), SpMV (sparse matrix-vector), DSMM (dense-sparse matrix)
## Dtypes
numr supports a wide range of numeric types:
| f64 | 8B | ✓ | ✓ | ✗ | - |
| f32 | 4B | ✓ | ✓ | ✓ | - |
| f16 | 2B | ✓ | ✓ | ✓ | `f16` |
| bf16 | 2B | ✓ | ✓ | ✗ | `f16` |
| fp8e4m3 | 1B | ✓ | ✓ | ✗ | `fp8` |
| fp8e5m2 | 1B | ✓ | ✓ | ✗ | `fp8` |
| i64 | 8B | ✓ | ✓ | ✗ | - |
| i32 | 4B | ✓ | ✓ | ✓ | - |
| i16 | 2B | ✓ | ✓ | ✗ | - |
| i8 | 1B | ✓ | ✓ | ✗ | - |
| u64 | 8B | ✓ | ✓ | ✗ | - |
| u32 | 4B | ✓ | ✓ | ✓ | - |
| u16 | 2B | ✓ | ✓ | ✗ | - |
| u8 | 1B | ✓ | ✓ | ✓ | - |
| bool | 1B | ✓ | ✓ | ✓ | - |
Every operation supports every compatible dtype. No hardcoded f32-only kernels.
## Backends
All backends implement identical algorithms with native kernels—no cuBLAS, MKL, or vendor library dependencies.
| CPU (x86-64) | CPU | cpu (default) | ✓ | AVX-512/AVX2 SIMD |
| CPU (ARM) | CPU | cpu | Planned | NEON SIMD |
| NVIDIA GPU | CUDA | cuda | ✓ | Native PTX kernels |
| AMD GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| Intel GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| Apple GPU | WebGPU | wgpu | ✓ | WGSL shaders |
| AMD GPU | ROCm | - | Planned | Native HIP kernels |
### Why Native Kernels?
1. **Fewer dependencies**: No 2GB+ CUDA toolkit, no MKL installation
2. **Portability**: Same code on CPU, NVIDIA, AMD, Intel, Apple
3. **Transparency**: Understand exactly what code runs on your hardware
4. **Maintainability**: Your code doesn't break when vendor updates drop
5. **Performance**: Kernels optimize for YOUR workloads, not generic cases
## Quick Start
### CPU Example
```rust
use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;
fn main() -> Result<()> {
// Create tensors
let a = Tensor::<CpuRuntime>::from_slice(
&[1.0, 2.0, 3.0, 4.0],
&[2, 2],
)?;
let b = Tensor::<CpuRuntime>::from_slice(
&[5.0, 6.0, 7.0, 8.0],
&[2, 2],
)?;
// Arithmetic (with broadcasting)
let c = a.add(&b)?;
let d = a.mul(&b)?;
// Matrix multiplication
let e = a.matmul(&b)?;
// Reductions
let sum = c.sum()?;
let mean = c.mean()?;
let max = c.max()?;
// Element-wise functions
let exp = a.exp()?;
let sqrt = a.sqrt()?;
// Reshaping (zero-copy)
let flat = c.reshape(&[4])?;
let transposed = c.transpose()?;
Ok(())
}
```
### GPU Example (CUDA)
```rust
use numr::prelude::*;
use numr::runtime::cuda::CudaRuntime;
fn main() -> Result<()> {
// Create on GPU
let device = CudaRuntime::default_device()?;
let a = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
let b = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
// Operations run on GPU (native CUDA kernels)
let c = a.matmul(&b)?;
// Transfer result to CPU when needed
let cpu_result = c.to_cpu()?;
let data = cpu_result.to_vec::<f32>()?;
Ok(())
}
```
### Backend-Generic Code
```rust
use numr::prelude::*;
use numr::runtime::Runtime;
use numr::tensor::Tensor;
// Works on CPU, CUDA, or WebGPU
fn matrix_operations<R: Runtime>(
a: &Tensor<R>,
b: &Tensor<R>,
client: &R::Client,
) -> Result<Tensor<R>> {
// Same code, any backend
let c = client.add(a, b)?;
let d = client.matmul(&c, a)?;
client.sum(&d)
}
// Use the same function on different hardware
fn main() -> Result<()> {
let a_cpu = Tensor::<CpuRuntime>::randn(&[128, 128], &device_cpu)?;
let b_cpu = Tensor::<CpuRuntime>::randn(&[128, 128], &device_cpu)?;
let result_cpu = matrix_operations(&a_cpu, &b_cpu, &client_cpu)?;
#[cfg(feature = "cuda")]
{
let device_cuda = CudaRuntime::default_device()?;
let a_cuda = Tensor::<CudaRuntime>::randn(&[128, 128], &device_cuda)?;
let b_cuda = Tensor::<CudaRuntime>::randn(&[128, 128], &device_cuda)?;
let result_cuda = matrix_operations(&a_cuda, &b_cuda, &client_cuda)?;
}
Ok(())
}
```
### Linear Algebra
```rust
use numr::prelude::*;
use numr::algorithm::linalg::{LinalgOps, Decomposition};
fn main() -> Result<()> {
let a = Tensor::<CpuRuntime>::randn(&[64, 64], &device)?;
// LU decomposition
let (p, l, u) = client.lu(&a)?;
// QR decomposition
let (q, r) = client.qr(&a)?;
// SVD
let (u, s, vt) = client.svd(&a)?;
// Eigendecomposition
let (eigenvalues, eigenvectors) = client.eig(&a)?;
// Solve linear system: Ax = b
let b = Tensor::<CpuRuntime>::randn(&[64, 32], &device)?;
let x = client.solve(&a, &b)?;
// Determinant, trace, rank
let det = client.det(&a)?;
let tr = client.trace(&a)?;
let rank = client.matrix_rank(&a)?;
Ok(())
}
```
### FFT
```rust
use numr::prelude::*;
use numr::algorithm::fft::FftOps;
fn main() -> Result<()> {
let x = Tensor::<CpuRuntime>::randn(&[1024], &device)?;
// Complex FFT
let fft_result = client.fft(&x)?;
let inverse = client.ifft(&fft_result)?;
// Real FFT (more efficient for real-valued inputs)
let rfft_result = client.rfft(&x)?;
let irfft_result = client.irfft(&rfft_result, 1024)?;
// 2D FFT
let image = Tensor::<CpuRuntime>::randn(&[256, 256], &device)?;
let fft_2d = client.fft_2d(&image)?;
Ok(())
}
```
### Statistics and Distributions
```rust
use numr::prelude::*;
fn main() -> Result<()> {
let data = Tensor::<CpuRuntime>::randn(&[1000], &device)?;
// Descriptive statistics
let mean = client.mean(&data)?;
let std = client.std(&data)?;
let var = client.var(&data)?;
let median = client.median(&data)?;
let q25 = client.quantile(&data, 0.25)?;
// Statistical measures
let skewness = client.skew(&data)?;
let kurtosis = client.kurtosis(&data)?;
// Covariance and correlation
let x = Tensor::<CpuRuntime>::randn(&[100, 5], &device)?;
let y = Tensor::<CpuRuntime>::randn(&[100, 5], &device)?;
let cov = client.cov(&x)?;
let corr = client.corrcoef(&x)?;
// Random distributions
let normal = Tensor::<CpuRuntime>::randn(&[1000], &device)?; // mean=0, std=1
let uniform = Tensor::<CpuRuntime>::rand(&[1000], &device)?; // [0, 1)
let gamma = client.gamma(&[1000], shape, scale, &device)?;
let poisson = client.poisson(&[1000], lambda, &device)?;
// Multivariate distributions
let mvn = client.multivariate_normal(&[100], &mean, &cov)?;
let wishart = client.wishart(&[10], df, &scale_matrix)?;
Ok(())
}
```
## Installation
### CPU-only (default)
```toml
[dependencies]
numr = "*"
```
### With GPU Support
```toml
[dependencies]
# NVIDIA CUDA (requires CUDA 12.0+)
numr = { version = "*", features = ["cuda"] }
# Cross-platform GPU (NVIDIA, AMD, Intel, Apple)
numr = { version = "*", features = ["wgpu"] }
```
### With Optional Features
```toml
[dependencies]
numr = { version = "*", features = [
"cuda", # NVIDIA GPU support
"wgpu", # Cross-platform GPU (WebGPU)
"f16", # Half-precision (F16, BF16)
"fp8", # 8-bit floating point
"sparse", # Sparse tensors
] }
```
## Feature Flags
| `cpu` | CPU backend (AVX-512/AVX2 on x86-64, NEON planned) | ✓ |
| `cuda` | NVIDIA CUDA backend | ✗ |
| `wgpu` | Cross-platform GPU (WebGPU) | ✗ |
| `rayon` | Multi-threaded CPU via Rayon | ✓ |
| `f16` | Half-precision floats (F16, BF16) | ✗ |
| `fp8` | 8-bit floats (FP8E4M3, FP8E5M2) | ✗ |
| `sparse` | Sparse tensor support (CSR, CSC, COO) | ✗ |
## Building from Source
```bash
# CPU only
cargo build --release
# With CUDA
cargo build --release --features cuda
# With WebGPU
cargo build --release --features wgpu
# With all features
cargo build --release --features cuda,wgpu,f16,fp8,sparse
# Run tests
cargo test --release
cargo test --release --features cuda
cargo test --release --features wgpu
# Run benchmarks
cargo bench
```
## How numr Fits in the Stack
numr is the **foundation** that everything else builds on:
```
┌────────────────────────────────────┐
│ Applications (oxidizr, blazr) │
│ Your domain-specific code │
└────────────────┬───────────────────┘
│
┌────────────────▼───────────────────┐
│ boostr - ML Framework │
│ (neural networks, attention) │
│ Builds on numr ops │
└────────────────┬───────────────────┘
│
┌────────────────▼───────────────────┐
│ solvr - Scientific Computing │
│ (optimization, ODE, interpolation)│
│ Builds on numr ops and linalg │
└────────────────┬───────────────────┘
│
┌────────────────▼───────────────────┐
│ numr - Foundations │
│ (tensors, linalg, FFT, random) │
│ Native CPU, CUDA, WebGPU kernels │
└────────────────────────────────────┘
```
When numr's kernels improve, everything above improves automatically.
## Kernels and Extensibility
numr provides default kernels for all operations. You can also:
- **Use default kernels**: All operations work out of the box with optimized SIMD (CPU), PTX (CUDA), and WGSL (WebGPU) kernels
- **Replace specific kernels**: Swap in your own optimized kernels for performance-critical paths
- **Add new operations**: Define new traits and implement kernels for all backends
For detailed guidance on writing custom kernels, adding new operations, and backend-specific optimization techniques, see **[docs/extending-numr.md](docs/extending-numr.md)**.
## License
Apache-2.0