rlx 0.2.1 - Docs.rs

# rlx

A small ML compiler and runtime for transformer inference and training.
JAX-shaped IR + autodiff + transforms (`jvp`, `hvp`, `vmap`) on top of
backend-specific kernels for CPU, Apple Silicon (Metal / MLX), NVIDIA
(CUDA), AMD (ROCm), Google TPU, and cross-platform GPU (wgpu).

This is the **prelude crate**: it pulls in `rlx-ir`, `rlx-opt`, and
`rlx-runtime` and re-exports their common types, so most code needs
only one `use rlx::prelude::*;`.

## Install

```toml
[dependencies]
rlx = { version = "0.2", features = ["cpu"] }
```

For common platforms, a single aggregate flag enables the right
combination of features:

```toml
# Pick ONE of these lines under [dependencies]:
rlx = { version = "0.2", features = ["apple-silicon"] }   # cpu + metal + blas-accelerate
rlx = { version = "0.2", features = ["nvidia"] }          # cpu + cuda
rlx = { version = "0.2", features = ["edge"] }            # cpu + cortexm
rlx = { version = "0.2", features = ["all-cpu"] }         # cpu + gguf + linalg
```

> **`mlx` and `rocm` features.** `rlx-mlx` and `rlx-rocm` aren't published
> on crates.io (they rely on a vendor-bundled submodule and
> workspace-relative kernel sources), so enabling either feature on a
> crates.io dependency fails to resolve. Use a git source instead:
>
> ```toml
> rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }
> ```

## Quickstart

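```rust
use rlx::prelude::*;

// Build a one-op graph: y = x · w, with x: [1, 4] and w: [4, 2].
let mut g = Graph::new("hello");
let x = g.input("x", Shape::new(&[1, 4], DType::F32));   // fed at run time
let w = g.param("w", Shape::new(&[4, 2], DType::F32));   // bound after compile, below
let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32)); // output shape given explicitly
g.set_outputs(vec![y]);

// Compile for the CPU backend, bind the 4×2 weight (8 values), and run.
let mut compiled = Session::new(Device::Cpu).compile(g);
compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
// Assuming row-major layout, y should come out as [[4.0, 6.0]].
let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);
```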
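To check the Quickstart result by hand, here is a plain-Rust reference matmul with no rlx dependency. It assumes rlx reads `set_param` data and inputs in row-major order (an assumption, not something the Quickstart states); under that layout `w` is `[[1, 0], [0, 1], [1, 0], [0, 1]]` and `y` is `[4.0, 6.0]`.

```rust
/// Reference matmul over row-major buffers: `a` is m×k, `b` is k×n, result is m×n.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of `a` with column j of `b`.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    out
}

fn main() {
    let x = [1.0f32, 2.0, 3.0, 4.0]; // shape [1, 4]
    let w = [1.0f32, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]; // shape [4, 2], row-major (assumed)
    let y = matmul(&x, &w, 1, 4, 2);
    println!("{y:?}"); // prints [4.0, 6.0]
}
```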

## Prelude + namespaces

| import                       | gives you                                                                |
|------------------------------|--------------------------------------------------------------------------|
| `use rlx::prelude::*;`       | `Graph`, `Session`, `DType`, `Device`, `Result`, `Activation`, `BinaryOp`, `jvp`, `vmap`, … |
| `use rlx::ops::*;`           | IR helper enums: `Activation`, `BinaryOp`, `CmpOp`, `MaskKind`, `ChainStep`, `ChainOperand` |
| `use rlx::quant::*;`         | `QuantScheme`, `QuantMap`                                                |
| `use rlx::gguf::*;`          | GGUF parser + dequant (`gguf` feature)                                   |
| `use rlx::autodiff::*;`      | `jvp`, `hvp`, `vmap`                                                     |
| `use rlx::ir::…`             | full `rlx-ir` surface (everything the prelude doesn't lift)              |
| `use rlx::runtime::…`        | full `rlx-runtime` surface (backends, custom Session config)             |
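For readers new to the transforms: `jvp` computes a Jacobian-vector product, returning f(x) together with J_f(x)·v for a tangent direction v in one forward pass. The sketch below shows that idea with dual numbers in plain Rust. It does not use rlx; the `Dual` type and this `jvp` signature are illustrative only and are not rlx's API or implementation.

```rust
use std::ops::{Add, Mul};

/// A dual number `val + tan·ε` with ε² = 0: the tangent rides along with the value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    val: f64,
    tan: f64,
}

impl Add for Dual {
    type Output = Dual;
    // Sum rule: tangents add.
    fn add(self, o: Dual) -> Dual {
        Dual { val: self.val + o.val, tan: self.tan + o.tan }
    }
}

impl Mul for Dual {
    type Output = Dual;
    // Product rule: (a + a'ε)(b + b'ε) = ab + (a'b + ab')ε
    fn mul(self, o: Dual) -> Dual {
        Dual { val: self.val * o.val, tan: self.tan * o.val + self.val * o.tan }
    }
}

/// Forward-mode JVP for f: R^n -> R. Seeds input i with tangent v[i]
/// and returns (f(x), J_f(x)·v).
fn jvp<F>(f: F, x: &[f64], v: &[f64]) -> (f64, f64)
where
    F: Fn(&[Dual]) -> Dual,
{
    assert_eq!(x.len(), v.len());
    let xs: Vec<Dual> = x.iter().zip(v).map(|(&val, &tan)| Dual { val, tan }).collect();
    let out = f(&xs);
    (out.val, out.tan)
}

fn main() {
    // f(x0, x1) = x0 * x1 + x0, so ∇f = (x1 + 1, x0).
    let f = |x: &[Dual]| x[0] * x[1] + x[0];
    // At x = (3, 2): f = 9 and ∇f = (3, 3); with v = (1, 0), J·v = 3.
    let (y, dy) = jvp(f, &[3.0, 2.0], &[1.0, 0.0]);
    println!("f = {y}, J·v = {dy}");
}
```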

`rlx::Result<T>` and `rlx::Error` are aliases for `anyhow::Result<T>`
and `anyhow::Error`; every layer of the stack returns them.

## Feature matrix

### Backends

| feature             | backend                              | platform                  |
|---------------------|--------------------------------------|---------------------------|
| `cpu` *(default)*   | NEON / AVX + Accelerate / OpenBLAS   | every host                |
| `metal`             | Metal Performance Shaders + MSL      | macOS (Apple Silicon)     |
| `mlx`               | Apple MLX (vendored)                 | macOS (Apple Silicon)     |
| `gpu`               | wgpu (Vulkan / DX12 / WebGPU / Metal)| cross-platform            |
| `cuda`              | cuBLAS / cuDNN / NVRTC               | Linux / Windows + NVIDIA  |
| `rocm`              | hipBLAS / MIOpen                     | Linux + AMD               |
| `tpu`               | libtpu PJRT plugin                   | Linux + GCP TPU           |
| `blas-accelerate`   | macOS Accelerate                     | macOS                     |
| `blas-mkl`          | Intel MKL                            | Intel / AMD CPUs          |
| `blas-openblas`     | OpenBLAS                             | cross-platform CPU        |

### Companion crates

Off by default; turn on per workload:

| feature    | what                                                              |
|------------|-------------------------------------------------------------------|
| `gguf`     | GGUF v1 / v2 / v3 parser + dequant → `rlx::gguf`                  |
| `bench`    | uniform benchmark harness → `rlx::bench`                          |
| `sparse`   | sparse linear algebra (custom-op scaffold) → `rlx::sparse`        |
| `linalg`   | dense linalg via LAPACK (custom-op scaffold) → `rlx::linalg`      |
| `cortexm`  | INT8 ARMv7E-M kernels → `rlx::cortexm` (no `Backend` impl)        |
| `fpga`     | IR → SystemVerilog datapath synthesis → `rlx::fpga` (no `Backend`)|

`cortexm` and `fpga` don't go through the `Session` / `Backend`
pipeline — they're specialty targets exposed for direct use.

### Convenience aggregates

| feature           | expands to                              |
|-------------------|-----------------------------------------|
| `apple-silicon`   | `cpu` + `metal` + `blas-accelerate`     |
| `nvidia`          | `cpu` + `cuda`                          |
| `edge`            | `cpu` + `cortexm`                       |
| `all-cpu`         | `cpu` + `gguf` + `linalg`               |

`mlx` and `rocm` aren't part of any aggregate because their crates are
vendor-bundled and not on crates.io. To opt in, add the feature
explicitly to a git-source dependency:

```toml
rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }
```
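Cargo features are additive, so the enabled set is the union of every requested feature's expansion. The plain-Rust sketch below restates the aggregate table that way to show what the git-source example above turns on. `expand` and `resolve` are hypothetical helpers for illustration only; they are not part of rlx or Cargo, and they expand just one level because the aggregates in the table don't nest.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: maps one feature name to the features it enables,
/// following the aggregate table. Non-aggregate names map to themselves.
fn expand(feature: &str) -> Vec<&str> {
    match feature {
        "apple-silicon" => vec!["cpu", "metal", "blas-accelerate"],
        "nvidia" => vec!["cpu", "cuda"],
        "edge" => vec!["cpu", "cortexm"],
        "all-cpu" => vec!["cpu", "gguf", "linalg"],
        other => vec![other],
    }
}

/// Union of all expansions; BTreeSet removes duplicates and sorts by name.
fn resolve<'a>(requested: &[&'a str]) -> BTreeSet<&'a str> {
    requested.iter().flat_map(|&f| expand(f)).collect()
}

fn main() {
    // `apple-silicon` plus an explicit `mlx`, as in the git-source example.
    let on = resolve(&["apple-silicon", "mlx"]);
    println!("{on:?}"); // prints {"blas-accelerate", "cpu", "metal", "mlx"}
}
```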

## Documentation

- API reference: <https://docs.rs/rlx>
- Workspace overview + per-crate READMEs: <https://github.com/MIT-RLX/rlx>

## License

GPL-3.0-only. See [`LICENSE`](https://github.com/MIT-RLX/rlx/blob/main/LICENSE).