Crate rlx

§RLX

A small ML compiler and runtime for transformer inference and training. It provides a JAX-shaped IR, autodiff, and transforms (jvp, hvp, vmap) on top of the following backends: CPU, Apple Silicon (Metal / MLX), NVIDIA (CUDA), AMD (ROCm), Google TPU, cross-platform GPU (wgpu), FPGA, and Cortex-M.

This is the prelude crate: it pulls in the framework-level workspace members and re-exports the common types, so a one-line use rlx::prelude::*; covers most usage.

§Three usage patterns

§1. Build + run a graph by hand

use rlx::prelude::*;

// Declare the graph: one input, one parameter, two outputs.
let mut g = Graph::new("hello");
let x = g.input("x", Shape::new(&[1, 4], DType::F32));
let w = g.param("w", Shape::new(&[4, 2], DType::F32));
let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32));
// Bind the literal first so the two builder calls do not both borrow g mutably at once.
let two = g.constant(2.0, DType::F32); // GraphExt literal
let scaled = g.mul(x, two);
g.set_outputs(vec![y, scaled]);

// Compile for the CPU backend, bind the parameter, and run.
let mut compiled = Session::new(Device::Cpu).compile(g);
compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);
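
As a sanity check on the example above, the same arithmetic can be worked out in plain Rust without rlx. This is a sketch under one assumption the example does not state: that the eight values passed to set_param fill the [4, 2] weight in row-major order. The helper names matmul_row_major and expected_outputs are hypothetical and not part of the rlx API.

```rust
/// Naive row-major matmul: `a` is [m, k], `b` is [k, n], result is [m, n].
/// Hypothetical helper for illustration; not part of the rlx API.
fn matmul_row_major(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                out[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    out
}

/// Recomputes both graph outputs from the example by hand.
/// Assumes `w` is stored row-major (an assumption, see the text above).
fn expected_outputs() -> (Vec<f32>, Vec<f32>) {
    let x = [1.0f32, 2.0, 3.0, 4.0];
    let w = [1.0f32, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0];
    let y = matmul_row_major(&x, &w, 1, 4, 2);
    let scaled: Vec<f32> = x.iter().map(|v| v * 2.0).collect();
    (y, scaled)
}
```

Under that layout assumption, y comes out as [4.0, 6.0] and scaled as [2.0, 4.0, 6.0, 8.0].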

§Module map

Every workspace crate is reachable as a module on rlx:

| path | crate | what |
| --- | --- | --- |
| rlx::ir | rlx-ir | IR types, ops, graph builder |
| rlx::opt | rlx-opt | facade: rlx-fusion + rlx-autodiff + rlx-compile |
| rlx::driver | rlx-driver | Device enum, registries |
| rlx::runtime | rlx-runtime | Session, CompiledGraph |
| rlx::macros | rlx-macros | #[rlx_model] proc macro |
| rlx::gguf | rlx-gguf | GGUF parser + dequant (feature gguf) |
| rlx::onnx | rlx-onnx | ONNX Runtime .onnx inference (feature onnx) |
| rlx::bench | rlx-bench | benchmark harness (feature bench) |
| rlx::sparse | rlx-sparse | downstream: sparse linalg (feature sparse) |
| rlx::splat | rlx-splat | 3D Gaussian splatting (feature splat): register(), decomposed IR ops |
| rlx::linalg | rlx-linalg | downstream: dense linalg via LAPACK (feature linalg) |
| rlx::cortexm | rlx-cortexm | INT8 ARMv7E-M kernels (feature cortexm); no Backend impl, kernels only |
| rlx::fpga | rlx-fpga | IR → SystemVerilog datapath synthesis (feature fpga); no Backend impl |

§Convenience namespaces

Grouped re-exports for related concerns — use these when you want one focused subset without star-importing the whole prelude:

| namespace | what |
| --- | --- |
| rlx::quant | QuantScheme, QuantMap (IR quantization metadata) |
| rlx::ops | Activation, BinaryOp, CmpOp, MaskKind, ChainStep, ChainOperand |
| rlx::autodiff | jvp, hvp, vmap + the autodiff entry points |
| rlx::prelude | star-import target covering the 95% case |

§Backend feature gates

Pick the ones that match your hardware. Multiple backends can be enabled at once; the runtime picks one per Session.

| feature | backend | platform |
| --- | --- | --- |
| cpu (default) | NEON / AVX + Accelerate / OpenBLAS | every host |
| metal | Metal Performance Shaders + MSL | macOS (Apple Silicon) |
| mlx | Apple MLX (vendored) | macOS (Apple Silicon) |
| gpu | wgpu (Vulkan / DX12 / WebGPU / Metal) | cross-platform |
| cuda | cuBLAS / cuDNN / NVRTC | Linux / Windows + NVIDIA |
| rocm | hipBLAS / MIOpen | Linux + AMD |
| tpu | libtpu PJRT plugin | Linux + GCP TPU |
| blas-accelerate | macOS Accelerate | macOS |
| blas-mkl | Intel MKL | Intel / AMD CPUs |
| blas-openblas | OpenBLAS | cross-platform CPU |

§Convenience aggregates

Single-flag setups for common platforms. Each composes the fragments most users want for that target.

| feature | expands to |
| --- | --- |
| apple-silicon | cpu + metal + blas-accelerate |
| nvidia | cpu + cuda |
| edge | cpu + cortexm |
| all-cpu | cpu + gguf + linalg |
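
To illustrate how the aggregates compose, the sketch below expands a list of requested features into the set of fragments that ends up enabled. It relies on Cargo features being additive, so a fragment such as cpu that appears in several aggregates is enabled once. The function names are hypothetical; this models the table above and is not how rlx declares its features.

```rust
use std::collections::BTreeSet;

/// Fragments implied by one aggregate feature, copied from the table above.
/// Names that are not aggregates are passed through unchanged.
fn expand_feature(name: &str) -> Vec<&str> {
    match name {
        "apple-silicon" => vec!["cpu", "metal", "blas-accelerate"],
        "nvidia" => vec!["cpu", "cuda"],
        "edge" => vec!["cpu", "cortexm"],
        "all-cpu" => vec!["cpu", "gguf", "linalg"],
        other => vec![other],
    }
}

/// Union of fragments across every requested feature. Duplicates such as
/// `cpu` collapse into a single entry, matching additive feature semantics.
fn enabled_fragments<'a>(requested: &[&'a str]) -> BTreeSet<&'a str> {
    requested.iter().flat_map(|&f| expand_feature(f)).collect()
}
```

For instance, requesting apple-silicon together with mlx yields the fragments blas-accelerate, cpu, metal, and mlx.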

mlx and rocm aren’t in any aggregate because their crates aren’t on crates.io (vendor-bundled submodule / workspace-relative kernel sources). To opt in, depend on the workspace via git and add the feature explicitly:

rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }

Re-exports§

pub use rlx_ir as ir;
pub use rlx_opt as opt;
pub use rlx_driver as driver;
pub use rlx_runtime as runtime;
pub use rlx_macros as macros;
pub use rlx_gguf as gguf;
pub use rlx_bench as bench;
pub use rlx_sparse as sparse;
pub use rlx_linalg as linalg;
pub use rlx_optim as optim;
pub use rlx_cortexm as cortexm;
pub use rlx_fpga as fpga;

Modules§

autodiff
Autodiff + transforms — re-exports the public entry points from rlx_opt. Use these when computing gradients or doing vmap / jvp / hvp over a graph.
ops
Op-builder helper enums — the variants the graph builder methods (g.binary, g.compare, g.activation, g.attention_kind, …) take as their first argument, plus the fused-chain primitives used by Op::ElementwiseRegion.
prelude
Star-import target covering the 95% case.
quant
Quantization metadata — schemes the IR carries per-tensor, plus the QuantMap graph-level annotation. Use these when wiring Op::DequantMatMul or attaching quant info to your own ops.
vmap
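
To make the jvp transform mentioned under autodiff concrete: a Jacobian-vector product carries a tangent through the computation alongside each primal value. The sketch below does this for scalar functions using dual numbers in plain Rust. It illustrates the forward-mode idea only; Dual and jvp_scalar are hypothetical names, and rlx's own jvp operates on graphs rather than closures.

```rust
use std::ops::{Add, Mul};

/// A primal value paired with its tangent (directional derivative).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    primal: f64,
    tangent: f64,
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, rhs: Dual) -> Dual {
        // Sum rule: tangents add.
        Dual {
            primal: self.primal + rhs.primal,
            tangent: self.tangent + rhs.tangent,
        }
    }
}

impl Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // Product rule: d(ab) = a * db + da * b.
        Dual {
            primal: self.primal * rhs.primal,
            tangent: self.primal * rhs.tangent + self.tangent * rhs.primal,
        }
    }
}

/// Evaluates `f` at `x` and returns (primal output, tangent output) for the
/// input perturbation `v`.
fn jvp_scalar(f: impl Fn(Dual) -> Dual, x: f64, v: f64) -> (f64, f64) {
    let out = f(Dual { primal: x, tangent: v });
    (out.primal, out.tangent)
}
```

For f(x) = x³ + x at x = 2 with tangent 1, jvp_scalar(|x| x * x * x + x, 2.0, 1.0) returns (10.0, 13.0), matching f(2) = 10 and f′(2) = 13.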

Macros§

register_backends
Register optional custom backends and companion custom-op crates.

Structs§

BackendsManifest
Parsed backends_manifest.json emitted by build.rs.
CompilePipeline
End-to-end compiler pipeline configuration.
CompileResult
End-to-end compiler output: optimized LIR + fusion diagnostics.
CompiledGraph
A compiled graph ready for execution.
DeviceBenchResult
One backend timed after compile + repeated execution.
DeviceCandidate
One row of backend introspection for UIs and logs.
DeviceFallbackError
Errors collected when every backend in the chain fails at run time.
DevicePolicy
Which backends a process may use — intersected with compile-time features and runtime availability.
DeviceRouter
Production helper: compile all viable backends up front, run with fallback chain.
Element
Per-element semantics that don’t fit into a flat DType enum (plan #40). Mirrors MAX’s layout/element.mojo Element type: DType says “f8”, but two FP8 variants exist (e4m3 and e5m2) with different range/precision tradeoffs. Saturation policy (clamp on overflow vs. wrap) is similarly orthogonal.
FlexibleSession
Compile-time settings without a fixed Device.
FusionOptions
Per-target fusion toggles (env-driven on Metal today).
FusionReport
Before/after fusion statistics and missed-pattern tally.
Graph
A computation graph — the core IR data structure.
GraphDevices
A graph plus lazy per-device compiled executables.
GraphModule
Unified model module — primary builder surface above HIR/MIR/LIR.
HirModule
High-level module — model builder output.
LirModule
Low-level module — backend compile input after optimization.
MirModule
Mid-level module — optimizer input.
MissedFusion
A single fusion opportunity that remains in the graph.
Node
A single node in the computation graph.
NodeId
Stable identifier for a node in the graph. Indices are never reused.
NodeOrigin
Where a MIR node came from and how it was produced.
ParseDeviceError
Failed to parse a device name.
PipelineInspect
Text dump of each compiler pipeline stage.
Session
A session manages graph compilation and execution on a device.
Shape
Tensor shape: ordered list of dimensions + element type.
Tick
Opaque tick reading. Subtract two of these to get a Duration.
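
The NodeId entry above says indices are never reused. One common way to get that property is an append-only arena in which removal leaves a tombstone. The sketch below shows the idea in plain Rust; Id and Arena are hypothetical names, and the source does not say that rlx implements its guarantee this way.

```rust
/// Index into the arena. Hypothetical stand-in for a stable node identifier.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Id(usize);

/// Append-only arena: removal leaves a tombstone (`None`), so an old id can
/// never alias a node inserted later.
struct Arena<T> {
    slots: Vec<Option<T>>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    /// Always allocates a fresh slot at the end; freed slots are not recycled.
    fn insert(&mut self, value: T) -> Id {
        self.slots.push(Some(value));
        Id(self.slots.len() - 1)
    }

    /// Removes the value but keeps the slot, returning what was stored.
    fn remove(&mut self, id: Id) -> Option<T> {
        self.slots.get_mut(id.0).and_then(|slot| slot.take())
    }

    /// Looks up a live value; removed or unknown ids yield `None`.
    fn get(&self, id: Id) -> Option<&T> {
        self.slots.get(id.0).and_then(|slot| slot.as_ref())
    }
}
```

The trade-off in this design is memory: tombstones accumulate, but any id held elsewhere stays unambiguous for the lifetime of the arena.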

Enums§

DType
Scalar element type. Matches hardware-supported types.
Device
Target device for graph execution.
DevicePickStrategy
How GraphDevices::resolve_with_inputs picks a backend when no hint is set.
FusionPolicy
How HIR block ops lower to MIR.
FusionTarget
Compile target that selects a fusion pipeline.
GraphStage
Which stage of the HIR → MIR → LIR pipeline a GraphModule holds.
HirOp
High-level operation — blocks and escape hatches.
MissReason
Why a recognizable fusion pattern was not collapsed.
Op
An operation in the RLX IR graph.
OpKind
PLAN L4: discriminant for each Op variant. Used by Op::kind + the Backend::supported_ops trait method to declare which ops a backend can lower; the LegalizeForBackend pass in rlx-opt checks the graph against this set and fails the compile when an unsupported op is present (instead of silent fallback).
Precision
Which numeric precision to use for an op. (Subset of DType — only the ones we currently dispatch on.)
PrecisionPolicy
Declarative precision policy for graph compilation.
QuantScheme
How a tensor is quantized. Mirrors the schemes RLX needs for LLM inference on Apple Silicon: blockwise int8 (GPTQ-style), blockwise int4 (Q4_K), and per-tensor fp8 (e4m3 / e5m2).
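
As background for the blockwise schemes named under QuantScheme, the sketch below shows a generic symmetric absmax int8 quantizer with one scale per block. It is a textbook scheme chosen for illustration, not GPTQ or Q4_K, and not rlx's implementation; QuantBlock, quantize_blockwise, and dequantize_blockwise are hypothetical names.

```rust
/// One quantized block: `scale` maps int8 codes back to real values.
struct QuantBlock {
    scale: f32,
    codes: Vec<i8>,
}

/// Symmetric absmax quantization with one scale per `block_size` values.
fn quantize_blockwise(values: &[f32], block_size: usize) -> Vec<QuantBlock> {
    assert!(block_size > 0, "block_size must be positive");
    values
        .chunks(block_size)
        .map(|chunk| {
            let absmax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            // An all-zero block gets scale 1.0 to avoid dividing by zero.
            let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
            let codes = chunk
                .iter()
                .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
                .collect();
            QuantBlock { scale, codes }
        })
        .collect()
}

/// Inverse mapping: code * scale, concatenated across blocks.
fn dequantize_blockwise(blocks: &[QuantBlock]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.codes.iter().map(move |&c| c as f32 * b.scale))
        .collect()
}
```

Because each block has its own scale, a block of small values keeps fine resolution even when another block contains large outliers; the round-trip error per element is bounded by roughly half that block's scale.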

Traits§

GraphExt
Extension trait for shape-inferred graph building.
Pass
A graph-to-graph transformation pass.
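
The OpKind and supports_op descriptions on this page imply a simple legalization rule: an empty supported-op list means the backend makes no claim, and otherwise every op in the graph must be listed or the compile fails. The sketch below models that rule as a pass over a toy graph in plain Rust. Kind, ToyPass, and Legalize are hypothetical names; the real Pass trait and LegalizeForBackend signatures are not shown in this page and may differ.

```rust
/// Hypothetical op discriminants; the real OpKind has one variant per Op.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Kind {
    MatMul,
    Mul,
    Attention,
}

/// Mirrors the documented rule: an empty slice means "no claim", so every
/// op is accepted; otherwise the kind must be listed.
fn supports(supported: &[Kind], kind: Kind) -> bool {
    supported.is_empty() || supported.contains(&kind)
}

/// A graph-to-graph pass over a toy graph (just a list of op kinds).
trait ToyPass {
    fn run(&self, graph: Vec<Kind>) -> Result<Vec<Kind>, String>;
}

/// Rejects the graph on the first op the backend did not claim, instead of
/// silently falling back.
struct Legalize<'a> {
    supported: &'a [Kind],
}

impl ToyPass for Legalize<'_> {
    fn run(&self, graph: Vec<Kind>) -> Result<Vec<Kind>, String> {
        let unsupported = graph
            .iter()
            .copied()
            .find(|k| !supports(self.supported, *k));
        if let Some(bad) = unsupported {
            return Err(format!("unsupported op: {:?}", bad));
        }
        Ok(graph)
    }
}
```

Returning an error rather than a rewritten graph keeps the failure visible at compile time, which is the behaviour the OpKind description calls out.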

Functions§

available_devices
Every variant currently available — Cargo-feature-gated or runtime-registered.
benchmark_devices
Compile (if needed) and time each backend; returns results sorted fastest first.
device_chain_from_env
Ordered fallback chain from RLX_DEVICE_CHAIN (cuda,gpu,cpu).
device_from_env
Default device hint from RLX_DEVICE (or PREFIX_DEVICE via device_from_env_key).
device_label
Stable lower-case label for device (matches Cargo feature names).
device_report
Explain which backends are viable for graph under policy.
devices_for
Intersection of available_devices and supports_graph. Use with crate::GraphDevices or crate::DevicePolicy to restrict the set.
devices_for_with_policy
Backends on this host that can lower graph, filtered by policy.
fastest_device
Highest-priority backend that is compiled in and live on this host.
fastest_device_for
Pick the fastest backend for graph on this host.
fusion_passes
Return the ordered fusion passes for target.
fusion_passes_for_supported
Return the ordered fusion passes allowed for supported.
graph_param_names
Param names declared in graph (Op::Param).
hvp
Hessian-vector product via forward-over-reverse.
inspect_graph
Annotated graph dump (MIR body). Alias for pretty_print.
inspect_graph_diff
Summarize graph changes between pipeline stages.
inspect_hir
Annotated HIR module dump.
inspect_hir_stats
One-line HIR summary (header + op histogram).
inspect_lir
Annotated LIR dump: optimized MIR + buffer plan + schedule.
inspect_mir
Annotated MIR module dump (optimized tensor DAG).
inspect_mir_diff
Diff two MIR snapshots (typically pre/post fusion).
inspect_mir_stats
One-line MIR summary.
inspect_pipeline
Inspect every lowering stage for hir through pipeline.
is_available
jvp
Compute the JVP graph for forward, perturbing each Input / Param named in tangent_for. Returns a new graph whose outputs are [primals..., tangents...], in the order forward listed them.
maybe_dump_pipeline
Write a full pipeline dump when RLX_IR_DUMP is set (path prefix or directory).
node_label
Best-effort label for diagnostics (origin label, node name, or id).
parse_device
Lower-case Cargo feature names and common aliases → Device.
parse_device_list
Parse comma/semicolon/whitespace-separated device lists (RLX_DEVICES=cpu,metal).
resolve_device
Resolve the backend to use: explicit hint → env → fastest for graph.
resolve_device_chain
First device in chain that is viable under policy for graph.
run_with_fallback
Try chain in order; return the first successful result from run.
scalar_constant_bytes
Encode a scalar as little-endian bytes for crate::op::Op::Constant.
supported_for_target
Per-target op claims used when a backend doesn’t supply an explicit supported_ops slice. Must stay aligned with each backend’s *_SUPPORTED_OPS in rlx-runtime/src/backend.rs.
supports_op
True when supported is empty (no claim) or contains kind.
vmap
Vectorize forward over a leading batch axis.
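
The device helpers above (parse_device_list, run_with_fallback) describe a parse-then-try-in-order flow. The sketch below reproduces that flow in plain Rust with no environment access, so the behaviour is easy to follow. Dev, parse_dev, parse_dev_list, and first_success are hypothetical names, and the error types are simplified to strings; the real functions' signatures are not shown on this page.

```rust
/// Hypothetical device set; the real Device enum has more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Dev {
    Cpu,
    Gpu,
    Cuda,
}

/// Maps a lower-case name to a device, or reports the unknown name.
fn parse_dev(name: &str) -> Result<Dev, String> {
    match name {
        "cpu" => Ok(Dev::Cpu),
        "gpu" => Ok(Dev::Gpu),
        "cuda" => Ok(Dev::Cuda),
        other => Err(format!("unknown device: {}", other)),
    }
}

/// Splits on commas, semicolons, and whitespace; empty pieces are skipped.
fn parse_dev_list(list: &str) -> Result<Vec<Dev>, String> {
    list.split(|c: char| c == ',' || c == ';' || c.is_whitespace())
        .filter(|piece| !piece.is_empty())
        .map(parse_dev)
        .collect()
}

/// Tries each device in order and returns the first success, or every
/// error collected along the way if all of them fail.
fn first_success<T>(
    chain: &[Dev],
    mut run: impl FnMut(Dev) -> Result<T, String>,
) -> Result<T, Vec<String>> {
    let mut errors = Vec::new();
    for &dev in chain {
        match run(dev) {
            Ok(value) => return Ok(value),
            Err(e) => errors.push(e),
        }
    }
    Err(errors)
}
```

With the chain cuda,gpu,cpu and a run closure that only succeeds on Dev::Cpu, first_success returns the CPU result after recording nothing from the failed attempts; if every device fails, the caller receives all three error messages.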

Type Aliases§

CalibrationRecord
Map of tap NodeId → calibrated quant params.
Error
Crate-wide error type — alias of anyhow::Error.
Result
Crate-wide result type — alias of anyhow::Result<T>. Use this in main() and library boundaries.