rlx-runtime
User-facing API for RLX — Session::new(Device).compile(graph) →
CompiledGraph, which holds the executable, the arena, the weights,
and the device handle.
What's here
- Session: entry point; selects a backend via Device.
- CompiledGraph (compiled.rs): run / set_param / set_input. Zero allocation per call.
- Backend trait + ExecutableGraph: every backend (CPU, Metal, MLX, CUDA, ROCm, wgpu, TPU) implements these. Every backend declares its supported OpKinds, and legalize_for_backend rejects unsupported graphs at compile time.
- registry.rs / op_registry.rs: backend factory + per-op registration plumbing for downstream extension.
- Device lives in rlx-driver::device; this crate just consumes it. Variants: Cpu, Metal, Mlx, Ane, Cuda, Rocm, Tpu, Gpu (wgpu), Vulkan, OpenGl, DirectX, WebGpu.
- device_ext.rs: Device::is_available() lookup against the registry (keeps the runtime→driver dep direction one-way).
- weights.rs: WeightLoader trait + BytesWeightLoader. Promote to registry per plan #24 / #56.
- arena.rs: device-side arena buffer.
- CompileCache (compile_cache.rs): graph-fingerprint → compiled-artifact cache.
- subgraph.rs: run_if / run_while helpers; the IR has If/While ops but executor wiring is pending (see Op::If/While docstring).
- PrecisionPolicy: re-export from rlx-opt. AMP / always-f16 / always-f32 / always-bf16.
- trace.rs: runtime tracing (verbose, env-gated).
- cost.rs: heterogeneous cost model that picks Cpu vs. Metal vs. MLX per graph.
- FFT dispatch: Op::Fft on CPU / Metal / MLX / CUDA / ROCm / wgpu / TPU. Power-of-two f32 sizes use native GPU kernels where available; other shapes and dtypes use partial host sync. Graph helpers (rfft, irfft, stft, …) live in rlx_ir::ops::fft_ops.
- stream.rs: async command stream (Metal-side; CPU is sync).
- paged_kv: paged KV cache + continuous batching primitives.
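
The compile-time rejection described for the Backend trait can be pictured with a small, self-contained sketch. Everything here is a hypothetical stand-in: the toy `OpKind` variants, `BackendCaps`, and `legalize` are illustrative names only and are not the actual rlx-runtime types or the real `legalize_for_backend` signature.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the crate's op-kind enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum OpKind {
    MatMul,
    Add,
    Fft,
}

/// Toy capability record: a backend advertises the op kinds it can run.
pub struct BackendCaps {
    pub name: &'static str,
    pub supported: HashSet<OpKind>,
}

impl BackendCaps {
    pub fn new(name: &'static str, ops: &[OpKind]) -> Self {
        Self { name, supported: ops.iter().copied().collect() }
    }
}

/// Walks the graph's ops and reports the first one the backend lacks,
/// so the failure surfaces before anything is executed.
pub fn legalize(ops: &[OpKind], backend: &BackendCaps) -> Result<(), String> {
    for op in ops {
        if !backend.supported.contains(op) {
            return Err(format!("{} does not support {:?}", backend.name, op));
        }
    }
    Ok(())
}
```

In this sketch a graph containing `OpKind::Fft` fails against a backend that only lists `MatMul` and `Add`, which mirrors the idea of rejecting unsupported graphs at compile time rather than at run time.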
Re-exports: Tick, time_ns from rlx_ir::measure. Use these for any
sub-ms timing in the user-facing layer.
Cargo features
| feature | backend |
|---|---|
| cpu (default) | rlx-cpu |
| metal | rlx-metal (macOS) |
| mlx | rlx-mlx (macOS) |
| gpu | rlx-wgpu (cross-platform) |
| cuda | rlx-cuda |
| rocm | rlx-rocm |
| tpu | rlx-tpu |
| blas-accelerate | macOS Accelerate |
| blas-mkl | Intel MKL |
| blas-openblas | OpenBLAS |
Install
    [dependencies]
    rlx-runtime = { version = "0.2", features = ["cpu"] }
Heads-up. The mlx and rocm features pull in rlx-mlx and rlx-rocm, which aren't on crates.io for 0.1.0 (workspace-relative submodule / kernel-source paths). Enabling those features on a crates.io build of rlx-runtime will fail to resolve. Use a git source on the whole workspace instead:

    rlx-runtime = { git = "https://github.com/MIT-RLX/rlx", features = ["mlx"] }
Most users want the rlx prelude
crate; it re-exports rlx_runtime::Session and friends at the top
level.
Quickstart
use ;
use ;
let mut g = new;
let x = g.input;
let w = g.param;
let y = g.matmul;
g.set_outputs;
let mut compiled = new.compile;
compiled.set_param;
let out = compiled.run;
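
Since set_param panics when the slice length does not match the declared shape, a caller may want to validate the buffer before handing it over. The helpers below are a minimal sketch of that check; `element_count` and `check_param_len` are hypothetical names, not part of rlx-runtime, and the shape is assumed to be a plain list of dimensions.

```rust
/// Number of elements implied by a shape (product of its dimensions).
/// An empty shape is treated as a scalar with one element.
pub fn element_count(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// Compares a parameter buffer against its declared shape and returns a
/// descriptive error instead of panicking on a mismatch.
pub fn check_param_len(shape: &[usize], data: &[f32]) -> Result<(), String> {
    let expected = element_count(shape);
    if data.len() == expected {
        Ok(())
    } else {
        Err(format!(
            "shape {:?} needs {} elements, got {}",
            shape,
            expected,
            data.len()
        ))
    }
}
```

Running such a check at the boundary where weights are loaded turns a runtime panic into an error the serving code can handle.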
Build / test
Gotchas
- Backend selection is feature-gated. --features metal is mandatory to instantiate Device::Metal; otherwise Session::new(Metal) panics at registry lookup. Same applies to cuda, rocm, mlx, gpu (wgpu).
- set_param accepts &[f32] of the declared shape's element count. A mismatched length is a runtime panic, not a compile-time error.
- The compile cache key includes the graph fingerprint and the precision policy, so changing the precision policy invalidates existing entries.
- For long-running serving paths, prefer CompileCache over recompiling per request.
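
To illustrate why the precision policy matters for cache hits, here is a minimal sketch of a cache keyed on (graph fingerprint, precision). All names (`Precision`, `CacheKey`, `ToyCompileCache`, `get_or_compile`) are hypothetical and do not reflect the real CompileCache API; a `String` stands in for the compiled artifact and a `u64` for the fingerprint.

```rust
use std::collections::HashMap;

// Hypothetical mirror of the four precision modes (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Precision {
    Amp,
    F16,
    F32,
    Bf16,
}

// The key combines both inputs, so either one changing is a miss.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub fingerprint: u64,
    pub precision: Precision,
}

pub struct ToyCompileCache {
    entries: HashMap<CacheKey, String>,
    /// Counts how many times the (pretend) compiler actually ran.
    pub compiles: usize,
}

impl ToyCompileCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), compiles: 0 }
    }

    /// Returns the cached artifact, compiling only on a key miss.
    pub fn get_or_compile(&mut self, fingerprint: u64, precision: Precision) -> &String {
        let key = CacheKey { fingerprint, precision };
        let compiles = &mut self.compiles;
        self.entries.entry(key).or_insert_with(|| {
            *compiles += 1;
            format!("artifact-{:x}-{:?}", fingerprint, precision)
        })
    }
}
```

In this toy version, repeated requests for the same graph and precision compile once, while the same graph under a different precision compiles again. That is the behavior a serving path should expect when it switches policies.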
License
GPL-3.0-only.