numr 0.0.0-beta.1

# numr

**THE foundational numerical computing library for Rust.**

numr provides dense tensors, linear algebra, FFT, statistics, advanced random number generation, and automatic differentiation—with the same API and algorithms across CPU, CUDA, and WebGPU backends.

## Why numr?

The Rust numerical computing ecosystem is fragmented. You need one library for tensors (ndarray), another for linear algebra (nalgebra/faer), another for FFT (rustfft), another for random numbers, another for statistics. They don't interoperate. They don't have GPU support. They're not optimized together.

numr consolidates everything:

| Task                      | Old Ecosystem               | numr                         |
| ------------------------- | --------------------------- | ---------------------------- |
| Tensors                   | ndarray                     | `Tensor<R>`                  |
| Linear algebra            | nalgebra / faer             | numr::linalg                 |
| FFT                       | rustfft                     | numr::fft                    |
| Sparse                    | sprs / ndsparse             | numr::sparse (feature-gated) |
| Statistics                | statrs                      | numr::statistics             |
| Random numbers            | rand + manual distributions | numr::random + multivariate  |
| GPU support               | None                        | CPU, CUDA, WebGPU            |
| Automatic differentiation | None                        | numr::autograd               |

A Rust developer should never need to look elsewhere for numerical computing.

## Architecture

numr is designed with a simple principle: **same code, any backend**.

```
┌──────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│               (any backend-agnostic code)                    │
└──────────────────────────────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
   ┌────▼────┐          ┌────▼────┐          ┌───▼────┐
   │ CPU     │          │ CUDA    │          │ WebGPU │
   │ Runtime │          │ Runtime │          │Runtime │
   └────┬────┘          └────┬────┘          └───┬────┘
        │                    │                   │
   ┌────▼──────────┬─────────┴───────┬───────────▼───┐
   │     Trait     │                 │               │
   │  Implemen-    │  Same Algorithm │  Different    │
   │  tations      │  Different Code │  Hardware     │
   └───────────────┴─────────────────┴───────────────┘
```
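
One way to picture the layering above is a trait that every backend implements, with application code written only against that trait. The sketch below is illustrative and does not reproduce numr's actual traits: `Backend`, `ScalarCpu`, `ChunkedCpu`, and `add_twice` are hypothetical names, and both "backends" run on the CPU so the example stays self-contained.

```rust
/// Hypothetical backend trait (not numr's real API): one operation, many implementations.
trait Backend {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

/// Reference implementation: one element at a time.
struct ScalarCpu;

/// Same algorithm, different code: walks the data in blocks of four,
/// loosely imitating how a SIMD kernel traverses memory.
struct ChunkedCpu;

impl Backend for ScalarCpu {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        assert_eq!(a.len(), b.len(), "length mismatch");
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

impl Backend for ChunkedCpu {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        assert_eq!(a.len(), b.len(), "length mismatch");
        let mut out = Vec::with_capacity(a.len());
        for (block_a, block_b) in a.chunks(4).zip(b.chunks(4)) {
            out.extend(block_a.iter().zip(block_b).map(|(x, y)| x + y));
        }
        out
    }
}

/// Backend-agnostic application code: computes (a + b) + a on whichever backend it is given.
fn add_twice<B: Backend>(backend: &B, a: &[f32], b: &[f32]) -> Vec<f32> {
    let sum = backend.add(a, b);
    backend.add(&sum, a)
}
```

Calling `add_twice(&ScalarCpu, ..)` and `add_twice(&ChunkedCpu, ..)` on the same inputs should give identical results, which is the property the diagram describes.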

## Operations

numr implements a comprehensive set of tensor operations across CPU, CUDA, and WebGPU:

### Core Arithmetic

- **UnaryOps**: neg, abs, sqrt, exp, log, sin, cos, tan, sinh, cosh, tanh, floor, ceil, round, and more
- **BinaryOps**: add, sub, mul, div, pow, maximum, minimum (all with NumPy-style broadcasting)
- **ScalarOps**: tensor-scalar arithmetic
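
For readers unfamiliar with NumPy-style broadcasting, the shape rule can be sketched in plain Rust. This is a standalone illustration of the rule itself, not numr's implementation; `broadcast_shape` is a hypothetical helper name.

```rust
/// Computes the broadcast shape of two operands under NumPy's rule:
/// align shapes from the right; two dimensions are compatible when they
/// are equal or one of them is 1. Returns `None` when they are not.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = vec![0; rank];
    for i in 0..rank {
        // Read dimensions from the right; missing leading dimensions count as 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[rank - 1 - i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        };
    }
    Some(out)
}
```

For example, shapes `[2, 3]` and `[3]` broadcast to `[2, 3]`, while `[2, 3]` and `[4]` are incompatible.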

### Shape and Data Movement

- **ShapeOps**: cat, stack, split, chunk, repeat, pad, roll
- **IndexingOps**: gather, scatter, index_select, masked_select, masked_fill, embedding_lookup
- **SortingOps**: sort, argsort, topk, unique, nonzero, searchsorted

### Reductions

- **ReduceOps**: sum, mean, max, min, prod (with precision variants)
- **CumulativeOps**: cumsum, cumprod, logsumexp
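
As background on why `logsumexp` is usually offered as its own operation rather than composed from `exp`, `sum`, and `log`: the naive composition overflows for large inputs. A minimal scalar sketch of the standard max-shift trick follows; it illustrates the technique only and makes no claim about how numr's kernels are written.

```rust
/// Numerically stable log(sum(exp(x_i))) using the max-shift identity:
/// logsumexp(x) = m + ln(sum(exp(x_i - m))), where m = max(x).
fn logsumexp(xs: &[f64]) -> f64 {
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    // Empty input or all -inf gives -inf; a +inf entry gives +inf.
    if !max.is_finite() {
        return max;
    }
    let sum: f64 = xs.iter().map(|x| (x - max).exp()).sum();
    max + sum.ln()
}
```

With inputs such as `[1000.0, 1000.0]`, `exp(1000.0)` overflows to infinity in `f64`, whereas the shifted form evaluates to `1000 + ln 2`.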

### Comparisons and Logical

- **CompareOps**: eq, ne, lt, le, gt, ge
- **LogicalOps**: logical_and, logical_or, logical_xor, logical_not
- **ConditionalOps**: where (ternary conditional)

### Neural Network Operations

- **ActivationOps**: relu, sigmoid, silu, gelu, leaky_relu, elu, softmax
- **NormalizationOps**: rms_norm, layer_norm

### Linear Algebra

- **MatmulOps**: matmul, matmul_bias (fused GEMM+bias)
- **LinalgOps**: solve, lstsq, pinverse, inverse, det, trace, matrix_rank, diag, matrix_norm, kron, khatri_rao

### Statistics and Probability

- **StatisticalOps**: var, std, skew, kurtosis, quantile, percentile, median, cov, corrcoef
- **RandomOps**: rand, randn, randint, multinomial, bernoulli, poisson, binomial, beta, gamma, exponential, chi_squared, student_t, f_distribution
- **MultivariateRandomOps**: multivariate_normal, wishart, dirichlet
- **QuasirandomOps**: Sobol, Halton sequences
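
To give a sense of what a quasirandom sequence is, here is a small sketch of the Halton construction via the radical inverse. It is a textbook formulation in plain Rust, not numr's kernel; `radical_inverse` and `halton_2d` are hypothetical names.

```rust
/// Radical inverse of `index` in the given base: write `index` in that base
/// and mirror its digits around the radix point.
fn radical_inverse(mut index: u32, base: u32) -> f64 {
    assert!(base >= 2, "base must be at least 2");
    let inv_base = 1.0 / base as f64;
    let mut factor = inv_base;
    let mut result = 0.0;
    while index > 0 {
        result += (index % base) as f64 * factor;
        index /= base;
        factor *= inv_base;
    }
    result
}

/// First `n` points of the 2D Halton sequence, using bases 2 and 3.
/// Indexing starts at 1 so the origin is skipped.
fn halton_2d(n: u32) -> Vec<[f64; 2]> {
    (1..=n)
        .map(|i| [radical_inverse(i, 2), radical_inverse(i, 3)])
        .collect()
}
```

Unlike pseudorandom draws, successive points fill the unit square evenly: in base 2 the first coordinates are 0.5, 0.25, 0.75, and so on.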

### Distance Metrics

- **DistanceOps**: euclidean, manhattan, cosine, hamming, jaccard, minkowski, chebyshev, correlation
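
For reference, several of these metrics written out for a single pair of vectors. These are the standard definitions in plain Rust, offered as a sketch; the function names and signatures are illustrative and are not numr's API. Inputs are assumed to have equal length.

```rust
/// Euclidean (L2) distance.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Manhattan (L1) distance.
fn manhattan(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Chebyshev (L-infinity) distance: the largest coordinate difference.
fn chebyshev(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f64::max)
}

/// Minkowski distance of order `p`; p = 1 is Manhattan, p = 2 is Euclidean.
fn minkowski(a: &[f64], b: &[f64], p: f64) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs().powf(p))
        .sum::<f64>()
        .powf(1.0 / p)
}

/// Cosine distance: 1 - cos(angle between a and b).
/// Returns NaN if either vector has zero norm.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}
```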

### Algorithm Modules

**Linear Algebra (`numr::linalg`):**

- **Decompositions**: LU, QR, Cholesky, SVD, Schur, full eigendecomposition, generalized eigenvalues
- **Solvers**: solve, lstsq, pinverse
- **Matrix functions**: exp, log, sqrt, sign
- **Utilities**: det, trace, rank, matrix norms

**Fast Fourier Transform (`numr::fft`):**

- FFT/IFFT (1D, 2D, ND) - Stockham algorithm
- Real FFT (RFFT/IRFFT)
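
As a reference for what these transforms compute, the sketch below is a naive O(n²) discrete Fourier transform in plain Rust. It is deliberately not the Stockham algorithm named above and not numr code; it only pins down the forward/inverse convention assumed here (negative exponent forward, 1/n scaling on the inverse). The name `dft` and the `(re, im)` tuple representation are illustrative.

```rust
use std::f64::consts::PI;

/// Naive O(n^2) DFT over (re, im) pairs.
/// Forward: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n).
/// Inverse: flips the exponent sign and scales by 1/n.
fn dft(input: &[(f64, f64)], inverse: bool) -> Vec<(f64, f64)> {
    let n = input.len();
    let sign = if inverse { 1.0 } else { -1.0 };
    let scale = if inverse { 1.0 / n as f64 } else { 1.0 };
    (0..n)
        .map(|k| {
            let mut acc = (0.0, 0.0);
            for (t, &(re, im)) in input.iter().enumerate() {
                let angle = sign * 2.0 * PI * (k * t) as f64 / n as f64;
                let (s, c) = angle.sin_cos();
                // Complex multiply (re + i*im) * (c + i*s), accumulated.
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            (acc.0 * scale, acc.1 * scale)
        })
        .collect()
}
```

For real-valued input, bins `k` and `n - k` are complex conjugates, which is why a real FFT only needs to store `n / 2 + 1` bins and why an inverse real FFT has to be told the output length.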

**Matrix Multiplication (`numr::matmul`):**

- Tiled GEMM with register blocking
- Bias fusion support
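
To illustrate the two ideas in this list, here is a small cache-tiled matrix multiply with the bias folded into the output initialization. It is a sketch in safe scalar Rust under simplifying assumptions (row-major `f32` slices, a fixed tile size of 4, no register blocking or SIMD) and is not numr's kernel; `matmul_tiled` and `TILE` are hypothetical names.

```rust
const TILE: usize = 4;

/// C = A * B (+ bias broadcast over rows), all row-major.
/// A is m x k, B is k x n, bias (if present) has length n.
fn matmul_tiled(
    a: &[f32],
    b: &[f32],
    bias: Option<&[f32]>,
    m: usize,
    k: usize,
    n: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must be m x k");
    assert_eq!(b.len(), k * n, "B must be k x n");
    let mut c = vec![0.0f32; m * n];

    // Bias fusion: start each output row from the bias instead of zero,
    // so no separate pass over C is needed afterwards.
    if let Some(bias) = bias {
        assert_eq!(bias.len(), n, "bias must have length n");
        for i in 0..m {
            c[i * n..(i + 1) * n].copy_from_slice(bias);
        }
    }

    // Walk the iteration space tile by tile so each block of A, B, and C
    // is reused while it is still hot in cache.
    for i0 in (0..m).step_by(TILE) {
        for p0 in (0..k).step_by(TILE) {
            for j0 in (0..n).step_by(TILE) {
                for i in i0..(i0 + TILE).min(m) {
                    for p in p0..(p0 + TILE).min(k) {
                        let a_ip = a[i * k + p];
                        for j in j0..(j0 + TILE).min(n) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}
```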

**Special Functions (`numr::special`):**

- Gamma functions: gamma, lgamma, digamma, polygamma
- Error functions: erf, erfc
- Bessel functions: J0, J1, Jn, Y0, Y1, Yn
- Inverse special functions: erfcinv

**Sparse Tensors (`numr::sparse`, feature-gated):**

- Formats: CSR, CSC, COO
- Operations: SpGEMM (sparse matrix-matrix multiplication), SpMV (sparse matrix-vector multiplication), DSMM (dense-sparse matrix multiplication)
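
For readers new to these formats, the sketch below shows the CSR layout and a sparse matrix-vector product over it. It is a minimal plain-Rust illustration, not the `numr::sparse` API; `CsrMatrix`, `from_dense`, and `spmv` are hypothetical names.

```rust
/// Compressed Sparse Row storage: the nonzeros of row `r` live at positions
/// `row_ptr[r]..row_ptr[r + 1]` of `col_idx` and `values`.
struct CsrMatrix {
    rows: usize,
    cols: usize,
    row_ptr: Vec<usize>,
    col_idx: Vec<usize>,
    values: Vec<f64>,
}

impl CsrMatrix {
    /// Builds a CSR matrix from dense rows, dropping exact zeros.
    fn from_dense(dense: &[Vec<f64>]) -> Self {
        let cols = dense.first().map_or(0, |row| row.len());
        let mut row_ptr = vec![0];
        let mut col_idx = Vec::new();
        let mut values = Vec::new();
        for row in dense {
            assert_eq!(row.len(), cols, "all rows must have the same length");
            for (c, &v) in row.iter().enumerate() {
                if v != 0.0 {
                    col_idx.push(c);
                    values.push(v);
                }
            }
            row_ptr.push(values.len());
        }
        CsrMatrix { rows: dense.len(), cols, row_ptr, col_idx, values }
    }

    /// SpMV: y = A * x, touching only the stored nonzeros.
    fn spmv(&self, x: &[f64]) -> Vec<f64> {
        assert_eq!(x.len(), self.cols, "vector length must match column count");
        (0..self.rows)
            .map(|r| {
                (self.row_ptr[r]..self.row_ptr[r + 1])
                    .map(|k| self.values[k] * x[self.col_idx[k]])
                    .sum::<f64>()
            })
            .collect()
    }
}
```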

## Dtypes

numr supports a wide range of numeric types:

| Type    | Size | CPU | CUDA | WebGPU | Feature |
| ------- | ---- | --- | ---- | ------ | ------- |
| f64     | 8B   | ✓   | ✓    | ✗      | -       |
| f32     | 4B   | ✓   | ✓    | ✓      | -       |
| f16     | 2B   | ✓   | ✓    | ✓      | `f16`   |
| bf16    | 2B   | ✓   | ✓    | ✗      | `f16`   |
| fp8e4m3 | 1B   | ✓   | ✓    | ✗      | `fp8`   |
| fp8e5m2 | 1B   | ✓   | ✓    | ✗      | `fp8`   |
| i64     | 8B   | ✓   | ✓    | ✗      | -       |
| i32     | 4B   | ✓   | ✓    | ✓      | -       |
| i16     | 2B   | ✓   | ✓    | ✗      | -       |
| i8      | 1B   | ✓   | ✓    | ✗      | -       |
| u64     | 8B   | ✓   | ✓    | ✗      | -       |
| u32     | 4B   | ✓   | ✓    | ✓      | -       |
| u16     | 2B   | ✓   | ✓    | ✗      | -       |
| u8      | 1B   | ✓   | ✓    | ✓      | -       |
| bool    | 1B   | ✓   | ✓    | ✓      | -       |

Every operation supports every compatible dtype. No hardcoded f32-only kernels.

## Backends

All backends implement identical algorithms with native kernels—no cuBLAS, MKL, or vendor library dependencies.

| Hardware     | Backend | Feature       | Status  | Notes              |
| ------------ | ------- | ------------- | ------- | ------------------ |
| CPU (x86-64) | CPU     | cpu (default) | ✓       | AVX-512/AVX2 SIMD  |
| CPU (ARM)    | CPU     | cpu           | Planned | NEON SIMD          |
| NVIDIA GPU   | CUDA    | cuda          | ✓       | Native PTX kernels |
| AMD GPU      | WebGPU  | wgpu          | ✓       | WGSL shaders       |
| Intel GPU    | WebGPU  | wgpu          | ✓       | WGSL shaders       |
| Apple GPU    | WebGPU  | wgpu          | ✓       | WGSL shaders       |
| AMD GPU      | ROCm    | -             | Planned | Native HIP kernels |

### Why Native Kernels?

1. **Fewer dependencies**: No 2GB+ CUDA toolkit, no MKL installation
2. **Portability**: Same code on CPU, NVIDIA, AMD, Intel, Apple
3. **Transparency**: Understand exactly what code runs on your hardware
4. **Maintainability**: Your code doesn't break when vendor updates drop
5. **Performance**: Kernels optimize for YOUR workloads, not generic cases

## Quick Start

### CPU Example

```rust
use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn main() -> Result<()> {
    // Create tensors
    let a = Tensor::<CpuRuntime>::from_slice(
        &[1.0, 2.0, 3.0, 4.0],
        &[2, 2],
    )?;
    let b = Tensor::<CpuRuntime>::from_slice(
        &[5.0, 6.0, 7.0, 8.0],
        &[2, 2],
    )?;

    // Arithmetic (with broadcasting)
    let c = a.add(&b)?;
    let d = a.mul(&b)?;

    // Matrix multiplication
    let e = a.matmul(&b)?;

    // Reductions
    let sum = c.sum()?;
    let mean = c.mean()?;
    let max = c.max()?;

    // Element-wise functions
    let exp = a.exp()?;
    let sqrt = a.sqrt()?;

    // Reshaping (zero-copy)
    let flat = c.reshape(&[4])?;
    let transposed = c.transpose()?;

    Ok(())
}
```

### GPU Example (CUDA)

```rust
use numr::prelude::*;
use numr::runtime::cuda::CudaRuntime;

fn main() -> Result<()> {
    // Create on GPU
    let device = CudaRuntime::default_device()?;
    let a = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;
    let b = Tensor::<CudaRuntime>::randn(&[1024, 1024], &device)?;

    // Operations run on GPU (native CUDA kernels)
    let c = a.matmul(&b)?;

    // Transfer result to CPU when needed
    let cpu_result = c.to_cpu()?;
    let data = cpu_result.to_vec::<f32>()?;

    Ok(())
}
```

### Backend-Generic Code

```rust
use numr::prelude::*;
use numr::runtime::Runtime;
use numr::runtime::cpu::CpuRuntime;
#[cfg(feature = "cuda")]
use numr::runtime::cuda::CudaRuntime;
use numr::tensor::Tensor;

// Works on CPU, CUDA, or WebGPU
fn matrix_operations<R: Runtime>(
    a: &Tensor<R>,
    b: &Tensor<R>,
    client: &R::Client,
) -> Result<Tensor<R>> {
    // Same code, any backend
    let c = client.add(a, b)?;
    let d = client.matmul(&c, a)?;
    client.sum(&d)
}

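// Use the same function on different hardware.
// `device_cpu`, `client_cpu`, and `client_cuda` are placeholders for a device
// and client created for each runtime; their construction is omitted here.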
fn main() -> Result<()> {
    let a_cpu = Tensor::<CpuRuntime>::randn(&[128, 128], &device_cpu)?;
    let b_cpu = Tensor::<CpuRuntime>::randn(&[128, 128], &device_cpu)?;
    let result_cpu = matrix_operations(&a_cpu, &b_cpu, &client_cpu)?;

    #[cfg(feature = "cuda")]
    {
        let device_cuda = CudaRuntime::default_device()?;
        let a_cuda = Tensor::<CudaRuntime>::randn(&[128, 128], &device_cuda)?;
        let b_cuda = Tensor::<CudaRuntime>::randn(&[128, 128], &device_cuda)?;
        let result_cuda = matrix_operations(&a_cuda, &b_cuda, &client_cuda)?;
    }

    Ok(())
}
```

### Linear Algebra

```rust
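use numr::prelude::*;
use numr::algorithm::linalg::{LinalgOps, Decomposition};
use numr::runtime::cpu::CpuRuntime;

fn main() -> Result<()> {
    // `device` and `client` are placeholders for a CPU device and its client;
    // their construction is omitted here.
    let a = Tensor::<CpuRuntime>::randn(&[64, 64], &device)?;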

    // LU decomposition
    let (p, l, u) = client.lu(&a)?;

    // QR decomposition
    let (q, r) = client.qr(&a)?;

    // SVD
    let (u, s, vt) = client.svd(&a)?;

    // Eigendecomposition
    let (eigenvalues, eigenvectors) = client.eig(&a)?;

    // Solve linear system: Ax = b
    let b = Tensor::<CpuRuntime>::randn(&[64, 32], &device)?;
    let x = client.solve(&a, &b)?;

    // Determinant, trace, rank
    let det = client.det(&a)?;
    let tr = client.trace(&a)?;
    let rank = client.matrix_rank(&a)?;

    Ok(())
}
```

### FFT

```rust
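use numr::prelude::*;
use numr::algorithm::fft::FftOps;
use numr::runtime::cpu::CpuRuntime;

fn main() -> Result<()> {
    // `device` and `client` are placeholders for a CPU device and its client;
    // their construction is omitted here.
    let x = Tensor::<CpuRuntime>::randn(&[1024], &device)?;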

    // Complex FFT
    let fft_result = client.fft(&x)?;
    let inverse = client.ifft(&fft_result)?;

    // Real FFT (more efficient for real-valued inputs)
    let rfft_result = client.rfft(&x)?;
    let irfft_result = client.irfft(&rfft_result, 1024)?;

    // 2D FFT
    let image = Tensor::<CpuRuntime>::randn(&[256, 256], &device)?;
    let fft_2d = client.fft_2d(&image)?;

    Ok(())
}
```

### Statistics and Distributions

```rust
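use numr::prelude::*;
use numr::runtime::cpu::CpuRuntime;

fn main() -> Result<()> {
    // `device` and `client` are placeholders for a CPU device and its client;
    // their construction is omitted here.
    let data = Tensor::<CpuRuntime>::randn(&[1000], &device)?;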

    // Descriptive statistics
    let mean = client.mean(&data)?;
    let std = client.std(&data)?;
    let var = client.var(&data)?;
    let median = client.median(&data)?;
    let q25 = client.quantile(&data, 0.25)?;

    // Statistical measures
    let skewness = client.skew(&data)?;
    let kurtosis = client.kurtosis(&data)?;

    // Covariance and correlation
    let x = Tensor::<CpuRuntime>::randn(&[100, 5], &device)?;
    let y = Tensor::<CpuRuntime>::randn(&[100, 5], &device)?;
    let cov = client.cov(&x)?;
    let corr = client.corrcoef(&x)?;

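    // Random distributions
    let normal = Tensor::<CpuRuntime>::randn(&[1000], &device)?; // mean=0, std=1
    let uniform = Tensor::<CpuRuntime>::rand(&[1000], &device)?; // [0, 1)
    let (shape, scale) = (2.0, 1.0); // illustrative gamma parameters
    let gamma = client.gamma(&[1000], shape, scale, &device)?;
    let lambda = 4.0; // illustrative Poisson rate
    let poisson = client.poisson(&[1000], lambda, &device)?;

    // Multivariate distributions.
    // `mean_vec` (length-d mean vector), `cov_matrix` and `scale_matrix`
    // (d x d, positive definite), and `df` (degrees of freedom) are
    // placeholders to define for your problem.
    let mvn = client.multivariate_normal(&[100], &mean_vec, &cov_matrix)?;
    let wishart = client.wishart(&[10], df, &scale_matrix)?;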

    Ok(())
}
```
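
For reference, a quantile such as the `q25` value above can be defined by linear interpolation between order statistics, which is NumPy's default method. Whether numr uses the same interpolation rule is an assumption here; the standalone sketch below only shows that definition, and `quantile` is an illustrative free function rather than the client method.

```rust
/// Quantile by linear interpolation between the two nearest order statistics.
/// Returns `None` for empty input or `q` outside [0, 1].
fn quantile(data: &[f64], q: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Fractional position of the quantile within the sorted data.
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}
```

Under this rule the median of `[4, 1, 3, 2]` is 2.5 and its 25th percentile is 1.75.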

## Installation

### CPU-only (default)

```toml
[dependencies]
# A pre-release must be named explicitly: "*" does not match pre-release versions.
numr = "0.0.0-beta.1"
```

### With GPU Support

```toml
[dependencies]
# Pick one of the following; a manifest cannot declare `numr` twice.

# NVIDIA CUDA (requires CUDA 12.0+)
numr = { version = "0.0.0-beta.1", features = ["cuda"] }

# Cross-platform GPU (NVIDIA, AMD, Intel, Apple)
# numr = { version = "0.0.0-beta.1", features = ["wgpu"] }
```

### With Optional Features

```toml
[dependencies]
numr = { version = "0.0.0-beta.1", features = [
    "cuda",      # NVIDIA GPU support
    "wgpu",      # Cross-platform GPU (WebGPU)
    "f16",       # Half-precision (F16, BF16)
    "fp8",       # 8-bit floating point
    "sparse",    # Sparse tensors
] }
```

## Feature Flags

| Feature  | Description                                        | Default |
| -------- | -------------------------------------------------- | ------- |
| `cpu`    | CPU backend (AVX-512/AVX2 on x86-64, NEON planned) | ✓       |
| `cuda`   | NVIDIA CUDA backend                                | ✗       |
| `wgpu`   | Cross-platform GPU (WebGPU)                        | ✗       |
| `rayon`  | Multi-threaded CPU via Rayon                       | ✓       |
| `f16`    | Half-precision floats (F16, BF16)                  | ✗       |
| `fp8`    | 8-bit floats (FP8E4M3, FP8E5M2)                    | ✗       |
| `sparse` | Sparse tensor support (CSR, CSC, COO)              | ✗       |

## Building from Source

```bash
# CPU only
cargo build --release

# With CUDA
cargo build --release --features cuda

# With WebGPU
cargo build --release --features wgpu

# With all features
cargo build --release --features cuda,wgpu,f16,fp8,sparse

# Run tests
cargo test --release
cargo test --release --features cuda
cargo test --release --features wgpu

# Run benchmarks
cargo bench
```

## How numr Fits in the Stack

numr is the **foundation** that everything else builds on:

```
┌────────────────────────────────────┐
│  Applications (oxidizr, blazr)     │
│  Your domain-specific code         │
└────────────────┬───────────────────┘
                 │
┌────────────────▼───────────────────┐
│  boostr - ML Framework             │
│  (neural networks, attention)      │
│  Builds on numr ops                │
└────────────────┬───────────────────┘
                 │
┌────────────────▼───────────────────┐
│  solvr - Scientific Computing      │
│  (optimization, ODE, interpolation)│
│  Builds on numr ops and linalg     │
└────────────────┬───────────────────┘
                 │
┌────────────────▼───────────────────┐
│  numr - Foundations                │
│  (tensors, linalg, FFT, random)    │
│  Native CPU, CUDA, WebGPU kernels  │
└────────────────────────────────────┘
```

When numr's kernels improve, everything above improves automatically.

## Kernels and Extensibility

numr provides default kernels for all operations. You can:

- **Use default kernels**: All operations work out of the box with optimized SIMD (CPU), PTX (CUDA), and WGSL (WebGPU) kernels
- **Replace specific kernels**: Swap in your own optimized kernels for performance-critical paths
- **Add new operations**: Define new traits and implement kernels for all backends

For detailed guidance on writing custom kernels, adding new operations, and backend-specific optimization techniques, see **[docs/extending-numr.md](docs/extending-numr.md)**.

## License

Apache-2.0