rlx-runtime

User-facing API for RLX — Session::new(Device).compile(graph) → CompiledGraph, which holds the executable, the arena, the weights, and the device handle.

What's here

Session — entry point; selects a backend via Device.
CompiledGraph (compiled.rs) — run / set_param / set_input. Zero allocation per call.
Backend trait + ExecutableGraph — every backend (CPU, Metal, MLX, CUDA, ROCm, wgpu, TPU) implements these. Every backend declares its supported OpKinds, and legalize_for_backend rejects unsupported graphs at compile time.
registry.rs / op_registry.rs — backend factory + per-op registration plumbing for downstream extension.
Device lives in rlx-driver::device; this crate just consumes it. Variants: Cpu, Metal, Mlx, Ane, Cuda, Rocm, Tpu, Gpu (wgpu), Vulkan, OpenGl, DirectX, WebGpu.
device_ext.rs — Device::is_available() lookup against the registry (keeps the runtime→driver dep direction one-way).
weights.rs — WeightLoader trait + BytesWeightLoader. Promote to registry per plan #24 / #56.
arena.rs — device-side arena buffer.
CompileCache (compile_cache.rs) — graph-fingerprint → compiled-artifact cache.
subgraph.rs — run_if / run_while helpers; the IR has If/While ops but executor wiring is pending (see Op::If/While docstring).
PrecisionPolicy — re-export from rlx-opt. AMP / always-f16 / always-f32 / always-bf16.
trace.rs — runtime tracing (verbose env-gated).
cost.rs — heterogeneous cost model that picks Cpu vs. Metal vs. MLX per graph.
FFT dispatch — Op::Fft on CPU / Metal / MLX / CUDA / ROCm / wgpu / TPU. Pow-2 f32 uses native GPU kernels where available; other shapes and dtypes use partial host sync. Graph helpers (rfft, irfft, stft, …) live in rlx_ir::ops::fft_ops.
stream.rs — async command stream (Metal-side; CPU is sync).
paged_kv — paged KV cache + continuous batching primitives.

Re-exports: Tick, time_ns from rlx_ir::measure. Use these for any sub-ms timing in the user-facing layer.

Cargo features

feature	backend
`cpu` (default)	`rlx-cpu`
`metal`	`rlx-metal` (macOS)
`mlx`	`rlx-mlx` (macOS)
`gpu`	`rlx-wgpu` (cross-platform)
`cuda`	`rlx-cuda`
`rocm`	`rlx-rocm`
`tpu`	`rlx-tpu`
`blas-accelerate`	macOS Accelerate
`blas-mkl`	Intel MKL
`blas-openblas`	OpenBLAS

Install

[dependencies]
rlx-runtime = { version = "0.2", features = ["cpu"] }

Heads-up. The mlx and rocm features pull in rlx-mlx and rlx-rocm, which aren't on crates.io for 0.1.0 (workspace- relative submodule / kernel-source paths). Enabling those features on a crates.io build of rlx-runtime will fail to resolve. Use a git source on the whole workspace instead:
rlx-runtime = { git = "https://github.com/MIT-RLX/rlx", features = ["mlx"] }

Most users want the rlx prelude crate; it re-exports rlx_runtime::Session and friends at the top level.

Quickstart

use rlx_ir::{DType, Graph, Shape};
use rlx_runtime::{Device, Session};

let mut g = Graph::new("hello");
let x = g.input("x", Shape::new(&[1, 4], DType::F32));
let w = g.param("w", Shape::new(&[4, 2], DType::F32));
let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32));
g.set_outputs(vec![y]);

let mut compiled = Session::new(Device::Cpu).compile(g);
compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);

Build / test

cargo build -p rlx-runtime --features cpu                       # CPU only
cargo build -p rlx-runtime --features cpu,metal                 # +Metal
cargo test  -p rlx-runtime --release

Gotchas

Backend selection is feature-gated. --features metal is mandatory to instantiate Device::Metal; otherwise Session::new(Metal) panics at registry lookup. Same applies to cuda, rocm, mlx, wgpu.
set_param accepts &[f32] of the declared shape's element count. Mismatched len is a runtime panic, not a compile-time error.
Compile cache key includes the graph fingerprint and the precision policy — bumping precision invalidates entries.
For long-running serving paths, prefer CompileCache over recompiling per request.

License

GPL-3.0-only.

rlx-runtime 0.2.2