rlx

A small ML compiler and runtime for transformer inference and training. JAX-shaped IR + autodiff + transforms (jvp, hvp, vmap) on top of backend-specific kernels for CPU, Apple Silicon (Metal / MLX), NVIDIA (CUDA), AMD (ROCm), Google TPU, and cross-platform GPU (wgpu).
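For readers new to these transforms, the plain-Rust sketch below shows what jvp and hvp compute. It does not use the rlx API. It implements forward mode with dual numbers, which is one common technique and not necessarily how rlx builds its transforms, on a hand-picked function f(x, y) = x²y + y³. The gradient is written out by hand here (rlx would derive it from the IR), and hvp is computed as the jvp of that gradient.

```rust
use std::ops::{Add, Mul};

/// Dual number val + tan·ε with ε² = 0; `tan` carries the directional derivative.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    val: f64,
    tan: f64,
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, o: Dual) -> Dual {
        Dual { val: self.val + o.val, tan: self.tan + o.tan }
    }
}

impl Mul for Dual {
    type Output = Dual;
    fn mul(self, o: Dual) -> Dual {
        // Product rule: (a + a'ε)(b + b'ε) = ab + (a'b + ab')ε
        Dual { val: self.val * o.val, tan: self.tan * o.val + self.val * o.tan }
    }
}

/// A constant has zero tangent.
fn constant(x: f64) -> Dual {
    Dual { val: x, tan: 0.0 }
}

/// Example function: f(x, y) = x²·y + y³.
fn f(x: Dual, y: Dual) -> Dual {
    x * x * y + y * y * y
}

/// Gradient of f, derived by hand: (2xy, x² + 3y²).
fn grad_f(x: Dual, y: Dual) -> (Dual, Dual) {
    (constant(2.0) * x * y, x * x + constant(3.0) * y * y)
}

/// jvp: returns (f(p), ∇f(p)·v) from a single forward pass.
fn jvp(p: (f64, f64), v: (f64, f64)) -> (f64, f64) {
    let out = f(Dual { val: p.0, tan: v.0 }, Dual { val: p.1, tan: v.1 });
    (out.val, out.tan)
}

/// hvp: H(p)·v, computed as the jvp of the gradient.
fn hvp(p: (f64, f64), v: (f64, f64)) -> (f64, f64) {
    let (gx, gy) = grad_f(Dual { val: p.0, tan: v.0 }, Dual { val: p.1, tan: v.1 });
    (gx.tan, gy.tan)
}

fn main() {
    println!("{:?}", jvp((1.0, 2.0), (1.0, 0.0))); // (10.0, 4.0)
    println!("{:?}", hvp((1.0, 2.0), (1.0, 0.0))); // (4.0, 2.0)
}
```

At p = (1, 2), f = 10 and ∇f = (4, 13), so the jvp along v = (1, 0) is 4. The Hessian is [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]], and hvp along the same v returns its first column, (4, 2).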

This is the prelude crate: it pulls in rlx-ir, rlx-opt, and rlx-runtime and re-exports the common types. Most code needs only one import: use rlx::prelude::*;

Install

[dependencies]
rlx = { version = "0.2", features = ["cpu"] }

For common platforms, single-flag aggregates turn on the right combination of features:

rlx = { version = "0.2", features = ["apple-silicon"] }   # cpu + metal + Accelerate
rlx = { version = "0.2", features = ["nvidia"] }          # cpu + cuda
rlx = { version = "0.2", features = ["edge"] }            # cpu + cortexm
rlx = { version = "0.2", features = ["all-cpu"] }         # cpu + gguf + linalg

mlx and rocm features: rlx-mlx and rlx-rocm aren't published on crates.io (they rely on a vendor-bundled submodule and workspace-relative kernel sources), so enabling either feature on a crates.io dependency fails to resolve. Use a git source instead:
rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }

Quickstart

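use rlx::prelude::*;

fn main() {
    // Build a graph computing y = x · w.
    let mut g = Graph::new("hello");
    let x = g.input("x", Shape::new(&[1, 4], DType::F32)); // runtime input, 1×4
    let w = g.param("w", Shape::new(&[4, 2], DType::F32)); // parameter, 4×2, bound after compile
    let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32)); // output shape given explicitly
    g.set_outputs(vec![y]);

    // Compile for the CPU backend, bind the 8 weight values, then run with an input.
    let mut compiled = Session::new(Device::Cpu).compile(g);
    compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
    let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);
}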
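The quickstart doesn't say what out holds. Assuming set_param takes w in row-major order (the README does not state the layout), w is the 4×2 matrix [[1, 0], [0, 1], [1, 0], [0, 1]] and y = x·w = [4.0, 6.0]. This standalone reference matmul, independent of rlx, is one way to cross-check a backend's result under that assumption:

```rust
/// Reference row-major matmul: a is m×k, b is k×n, result is m×n.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "a must be m×k");
    assert_eq!(b.len(), k * n, "b must be k×n");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of a with column j of b.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    out
}

fn main() {
    let x: [f32; 4] = [1.0, 2.0, 3.0, 4.0]; // 1×4
    let w: [f32; 8] = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]; // 4×2, row-major (assumed)
    let y = matmul(&x, &w, 1, 4, 2);
    println!("{:?}", y); // [4.0, 6.0]
}
```

If the backend instead read w column-major, the same eight values would give [5.0, 5.0], so comparing against this reference also pins down the layout.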

Prelude + namespaces

import	gives you
`use rlx::prelude::*;`	`Graph`, `Session`, `DType`, `Device`, `Result`, `Activation`, `BinaryOp`, `jvp`, `vmap`, …
`use rlx::ops::*;`	IR helper enums: `Activation`, `BinaryOp`, `CmpOp`, `MaskKind`, `ChainStep`, `ChainOperand`
`use rlx::quant::*;`	`QuantScheme`, `QuantMap`
`use rlx::gguf::*;`	GGUF parser + dequant (`gguf` feature)
`use rlx::autodiff::*;`	`jvp`, `hvp`, `vmap`
`use rlx::ir::…`	full `rlx-ir` surface (everything the prelude doesn't lift)
`use rlx::runtime::…`	full `rlx-runtime` surface (backends, custom Session config)
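vmap turns a per-example function into one that runs over a leading batch axis. The plain-Rust sketch below illustrates only those semantics with an explicit loop; vmap_rows and per_example are hypothetical helpers, not rlx functions, and a JAX-style vmap transforms the traced program rather than looping over rows at run time.

```rust
/// Hypothetical per-example function: dot product of one row with a weight vector.
fn per_example(x: &[f32], w: &[f32]) -> f32 {
    x.iter().zip(w).map(|(a, b)| a * b).sum()
}

/// Hypothetical helper: apply `f` independently to each row of a row-major
/// [batch, dim] buffer, mimicking vmap's semantics with a loop.
fn vmap_rows<F: Fn(&[f32]) -> f32>(f: F, batch: &[f32], dim: usize) -> Vec<f32> {
    batch.chunks_exact(dim).map(|row| f(row)).collect()
}

fn main() {
    let w = [1.0f32, -1.0];
    let xs = [1.0f32, 2.0, 3.0, 5.0, 0.5, 0.5]; // shape [3, 2]
    let ys = vmap_rows(|x| per_example(x, &w), &xs, 2);
    println!("{:?}", ys); // [-1.0, -2.0, 0.0]
}
```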

rlx::Result<T> and rlx::Error are aliases for anyhow::Result<T> and anyhow::Error; every crate in the stack returns them.

Feature matrix

Backends

feature	backend	platform
`cpu` (default)	NEON / AVX + Accelerate / OpenBLAS	every host
`metal`	Metal Performance Shaders + MSL	macOS (Apple Silicon)
`mlx`	Apple MLX (vendored)	macOS (Apple Silicon)
`gpu`	wgpu (Vulkan / DX12 / WebGPU / Metal)	cross-platform
`cuda`	cuBLAS / cuDNN / NVRTC	Linux / Windows + NVIDIA
`rocm`	hipBLAS / MIOpen	Linux + AMD
`tpu`	libtpu PJRT plugin	Linux + GCP TPU
`blas-accelerate`	macOS Accelerate	macOS
`blas-mkl`	Intel MKL	Intel / AMD CPUs
`blas-openblas`	OpenBLAS	cross-platform CPU

Companion crates

Off by default; turn on per workload:

feature	what
`gguf`	GGUF v1 / v2 / v3 parser + dequant → `rlx::gguf`
`bench`	uniform benchmark harness → `rlx::bench`
`sparse`	sparse linear algebra (custom-op scaffold) → `rlx::sparse`
`linalg`	dense linalg via LAPACK (custom-op scaffold) → `rlx::linalg`
`cortexm`	INT8 ARMv7E-M kernels → `rlx::cortexm` (no `Backend` impl)
`fpga`	IR → SystemVerilog datapath synthesis → `rlx::fpga` (no `Backend`)

cortexm and fpga bypass the Session / Backend pipeline; they are specialty targets exposed for direct use.
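Both the gguf dequant path and INT8 kernels like those behind cortexm rest on the same idea: int8 values plus a scale. As a generic illustration (not rlx code), the sketch below follows the layout of GGUF's Q8_0 type, blocks of 32 int8 values sharing one scale, except that the scale is kept as f32 instead of the f16 GGUF stores. The i32-accumulated dot product is the usual way integer kernels consume such blocks; it is not a description of rlx's ARMv7E-M kernels.

```rust
/// Block size of GGUF's Q8_0 type: 32 int8 values share one scale.
const QK8_0: usize = 32;

/// One Q8_0-style block. GGUF stores `d` as f16; this sketch uses f32.
struct BlockQ8 {
    d: f32,
    qs: [i8; QK8_0],
}

/// Symmetric quantization of one block: d = max|x| / 127, q = round(x / d).
fn quantize_block(x: &[f32; QK8_0]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let inv = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; QK8_0];
    for (q, v) in qs.iter_mut().zip(x) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { d, qs }
}

/// Dequantization: x ≈ d · q.
fn dequantize_block(b: &BlockQ8) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (o, &q) in out.iter_mut().zip(&b.qs) {
        *o = b.d * q as f32;
    }
    out
}

/// Integer dot product of two blocks: accumulate in i32, apply both scales once.
fn dot_blocks(a: &BlockQ8, b: &BlockQ8) -> f32 {
    let acc: i32 = a.qs.iter().zip(&b.qs).map(|(&x, &y)| x as i32 * y as i32).sum();
    acc as f32 * a.d * b.d
}

fn main() {
    let x: [f32; QK8_0] = std::array::from_fn(|i| i as f32 - 16.0); // -16 ..= 15
    let qx = quantize_block(&x);
    let back = dequantize_block(&qx);
    let max_err = x.iter().zip(&back).map(|(a, b)| (a - b).abs()).fold(0.0f32, f32::max);
    println!("scale = {}, max round-trip error = {}", qx.d, max_err);
    println!("quantized x·x = {} (exact: 2736)", dot_blocks(&qx, &qx));
}
```

Rounding bounds each round-trip error by d/2, and keeping the inner loop in integers means the two float scales are applied once per block rather than once per element.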

Convenience aggregates

feature	expands to
`apple-silicon`	`cpu` + `metal` + `blas-accelerate`
`nvidia`	`cpu` + `cuda`
`edge`	`cpu` + `cortexm`
`all-cpu`	`cpu` + `gguf` + `linalg`

mlx and rocm aren't part of any aggregate because their kernel sources are vendor-bundled and not published on crates.io. To opt in, add the feature explicitly to a git-source dependency:

rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }

Documentation

API reference: https://docs.rs/rlx
Workspace overview + per-crate READMEs: https://github.com/MIT-RLX/rlx

License

GPL-3.0-only. See LICENSE.

rlx 0.2.0
