Crate rlx

§RLX

A small ML compiler and runtime for transformer inference and training. It provides a JAX-shaped IR, autodiff, and graph transforms (jvp, hvp, vmap), and targets CPU, Apple Silicon (Metal / MLX), NVIDIA (CUDA), AMD (ROCm), Google TPU, cross-platform GPU (wgpu), FPGA, and Cortex-M.

This is the prelude crate: it pulls in the framework-level workspace members and re-exports the common types, so a one-line `use rlx::prelude::*;` covers most usage.

§Three usage patterns

§1. Build + run a graph by hand

use rlx::prelude::*;

let mut g = Graph::new("hello");
// x is a [1, 4] input; w is a [4, 2] named parameter.
let x = g.input("x", Shape::new(&[1, 4], DType::F32));
let w = g.param("w", Shape::new(&[4, 2], DType::F32));
let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32));
// Build the constant first so `g` is not mutably borrowed twice in one call.
let two = g.constant(2.0, DType::F32); // GraphExt literal
let scaled = g.mul(x, two);
g.set_outputs(vec![y, scaled]);

let mut compiled = Session::new(Device::Cpu).compile(g);
compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);
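
As a sanity check on the example above, the expected values can be worked out by hand. The sketch below is plain Rust with no rlx dependency. It assumes the data passed to set_param is laid out row-major, which this page does not state, and `matmul_row_major` is a hypothetical helper, not part of the rlx API.

```rust
/// Hypothetical helper (not part of rlx): multiply an [m, k] matrix by a
/// [k, n] matrix, both stored row-major in flat slices.
fn matmul_row_major(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                out[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    // Same numbers as the graph example: x is [1, 4], w is [4, 2].
    let x = [1.0f32, 2.0, 3.0, 4.0];
    let w = [1.0f32, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0];

    // y = x @ w, assuming row-major w (an assumption, see above).
    let y = matmul_row_major(&x, &w, 1, 4, 2);
    // scaled = x * 2.0, element-wise.
    let scaled: Vec<f32> = x.iter().map(|v| v * 2.0).collect();

    println!("y = {:?}", y);
    println!("scaled = {:?}", scaled);
}
```

Under the row-major assumption, w has rows [1, 0], [0, 1], [1, 0], [0, 1], so y sums the odd-position and even-position entries of x separately: [1 + 3, 2 + 4] = [4, 6]. The scaled output is [2, 4, 6, 8] regardless of layout.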

§Module map

Every workspace crate is reachable as a module on rlx:

path	crate	what
`rlx::ir`	`rlx-ir`	IR types, ops, graph builder
`rlx::opt`	`rlx-opt`	facade: `rlx-fusion` + `rlx-autodiff` + `rlx-compile`
`rlx::driver`	`rlx-driver`	`Device` enum, registries
`rlx::runtime`	`rlx-runtime`	`Session`, `CompiledGraph`
`rlx::macros`	`rlx-macros`	`#[rlx_model]` proc macro
`rlx::gguf`	`rlx-gguf`	GGUF parser + dequant (feature `gguf`)
`rlx::onnx`	`rlx-onnx`	ONNX Runtime `.onnx` inference (feature `onnx`)
`rlx::bench`	`rlx-bench`	benchmark harness (feature `bench`)
`rlx::sparse`	`rlx-sparse`	downstream: sparse linalg (feature `sparse`)
`rlx::splat`	`rlx-splat`	3D Gaussian splatting (feature `splat`) — `register()`, decomposed IR ops
`rlx::linalg`	`rlx-linalg`	downstream: dense linalg via LAPACK (feature `linalg`)
`rlx::cortexm`	`rlx-cortexm`	INT8 ARMv7E-M kernels (feature `cortexm`) — no `Backend` impl, kernels only
`rlx::fpga`	`rlx-fpga`	IR → SystemVerilog datapath synthesis (feature `fpga`) — no `Backend` impl

§Convenience namespaces

Grouped re-exports for related concerns — use these when you want one focused subset without star-importing the whole prelude:

namespace	what
`rlx::quant`	`QuantScheme`, `QuantMap` (IR quantization metadata)
`rlx::ops`	`Activation`, `BinaryOp`, `CmpOp`, `MaskKind`, `ChainStep`, `ChainOperand`
`rlx::autodiff`	`jvp`, `hvp`, `vmap` + the autodiff entry points
`rlx::prelude`	star-import target covering the 95% case

§Backend feature gates

Pick the ones that match your hardware. Multiple backends can be enabled at once; the runtime picks one per Session.

feature	backend	platform
`cpu` (default)	NEON / AVX + Accelerate / OpenBLAS	every host
`metal`	Metal Performance Shaders + MSL	macOS (Apple Silicon)
`mlx`	Apple MLX (vendored)	macOS (Apple Silicon)
`gpu`	wgpu (Vulkan / DX12 / WebGPU / Metal)	cross-platform
`cuda`	cuBLAS / cuDNN / NVRTC	Linux / Windows + NVIDIA
`rocm`	hipBLAS / MIOpen	Linux + AMD
`tpu`	libtpu PJRT plugin	Linux + GCP TPU
`blas-accelerate`	macOS Accelerate	macOS
`blas-mkl`	Intel MKL	Intel / AMD CPUs
`blas-openblas`	OpenBLAS	cross-platform CPU

§Convenience aggregates

Single-flag setups for common platforms. Each composes the fragments most users want for that target.

feature	expands to
`apple-silicon`	`cpu` + `metal` + `blas-accelerate`
`nvidia`	`cpu` + `cuda`
`edge`	`cpu` + `cortexm`
`all-cpu`	`cpu` + `gguf` + `linalg`

`mlx` and `rocm` aren’t in any aggregate because their crates aren’t on crates.io (vendor-bundled submodule / workspace-relative kernel sources). To opt in, depend on the workspace via git and add the feature explicitly:

rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }

Re-exports§

pub use rlx_ir as ir;
pub use rlx_opt as opt;
pub use rlx_driver as driver;
pub use rlx_runtime as runtime;
pub use rlx_macros as macros;
pub use rlx_gguf as gguf;
pub use rlx_bench as bench;
pub use rlx_sparse as sparse;
pub use rlx_linalg as linalg;
pub use rlx_optim as optim;
pub use rlx_cortexm as cortexm;
pub use rlx_fpga as fpga;

Modules§

autodiff: Autodiff + transforms — re-exports the public entry points from rlx_opt. Use these when computing gradients or doing vmap / jvp / hvp over a graph.
ops: Op-builder helper enums — the variants the graph builder methods (g.binary, g.compare, g.activation, g.attention_kind, …) take as their first argument, plus the fused-chain primitives used by Op::ElementwiseRegion.
prelude: Star-import target covering the 95% case.
quant: Quantization metadata — schemes the IR carries per-tensor, plus the QuantMap graph-level annotation. Use these when wiring Op::DequantMatMul or attaching quant info to your own ops.
vmap
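
For readers new to the transforms listed under autodiff, a JVP (Jacobian-vector product) pushes a tangent through a computation alongside the primal value. The sketch below shows the idea with scalar dual numbers in plain Rust. It does not use or describe rlx's graph-level jvp, and `Dual` and `forward` are hypothetical names introduced only for illustration.

```rust
/// Dual number: a primal value paired with its tangent (directional derivative).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    primal: f64,
    tangent: f64,
}

impl Dual {
    /// Sum rule: tangents add.
    fn add(self, rhs: Dual) -> Dual {
        Dual {
            primal: self.primal + rhs.primal,
            tangent: self.tangent + rhs.tangent,
        }
    }

    /// Product rule: d(a * b) = da * b + a * db.
    fn mul(self, rhs: Dual) -> Dual {
        Dual {
            primal: self.primal * rhs.primal,
            tangent: self.tangent * rhs.primal + self.primal * rhs.tangent,
        }
    }
}

/// Hypothetical forward function: f(x, w) = x * w + x.
fn forward(x: Dual, w: Dual) -> Dual {
    x.mul(w).add(x)
}

fn main() {
    // Perturb only w (tangent 1.0); x is held fixed (tangent 0.0).
    let x = Dual { primal: 3.0, tangent: 0.0 };
    let w = Dual { primal: 2.0, tangent: 1.0 };
    let out = forward(x, w);
    println!("primal = {}, tangent = {}", out.primal, out.tangent);
}
```

With x = 3 and w = 2 the primal is 3 * 2 + 3 = 9, and the tangent is df/dw = x = 3. Choosing which inputs carry a nonzero tangent plays the same role as naming the inputs to perturb.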

Macros§

register_backends: Register optional custom backends and companion custom-op crates.

Structs§

BackendsManifest: Parsed backends_manifest.json emitted by build.rs.
CompilePipeline: End-to-end compiler pipeline configuration.
CompileResult: End-to-end compiler output: optimized LIR + fusion diagnostics.
CompiledGraph: A compiled graph ready for execution.
DeviceBenchResult: One backend timed after compile + repeated execution.
DeviceCandidate: One row of backend introspection for UIs and logs.
DeviceFallbackError: Errors collected when every backend in the chain fails at run time.
DevicePolicy: Which backends a process may use — intersected with compile-time features and runtime availability.
DeviceRouter: Production helper: compile all viable backends up front, run with fallback chain.
Element: Per-element semantics that don’t fit into a flat DType enum (plan #40). Mirrors MAX’s layout/element.mojo Element type: DType says “f8”, but two FP8 variants exist (e4m3 and e5m2) with different range/precision tradeoffs. Saturation policy (clamp on overflow vs. wrap) is similarly orthogonal.
FlexibleSession: Compile-time settings without a fixed Device.
FusionOptions: Per-target fusion toggles (env-driven on Metal today).
FusionReport: Before/after fusion statistics and missed-pattern tally.
Graph: A computation graph — the core IR data structure.
GraphDevices: A graph plus lazy per-device compiled executables.
GraphModule: Unified model module — primary builder surface above HIR/MIR/LIR.
HirModule: High-level module — model builder output.
LirModule: Low-level module — backend compile input after optimization.
MirModule: Mid-level module — optimizer input.
MissedFusion: A single fusion opportunity that remains in the graph.
Node: A single node in the computation graph.
NodeId: Stable identifier for a node in the graph. Indices are never reused.
NodeOrigin: Where a MIR node came from and how it was produced.
ParseDeviceError: Failed to parse a device name.
PipelineInspect: Text dump of each compiler pipeline stage.
Session: A session manages graph compilation and execution on a device.
Shape: Tensor shape: ordered list of dimensions + element type.
Tick: Opaque tick reading. Subtract two of these to get a Duration.

Enums§

DType: Scalar element type. Matches hardware-supported types.
Device: Target device for graph execution.
DevicePickStrategy: How GraphDevices::resolve_with_inputs picks a backend when no hint is set.
FusionPolicy: How HIR block ops lower to MIR.
FusionTarget: Compile target that selects a fusion pipeline.
GraphStage: Which stage of the HIR → MIR → LIR pipeline a GraphModule holds.
HirOp: High-level operation — blocks and escape hatches.
MissReason: Why a recognizable fusion pattern was not collapsed.
Op: An operation in the RLX IR graph.
OpKind: PLAN L4: discriminant for each Op variant. Used by Op::kind + the Backend::supported_ops trait method to declare which ops a backend can lower; the LegalizeForBackend pass in rlx-opt checks the graph against this set and fails the compile when an unsupported op is present (instead of silent fallback).
Precision: Which numeric precision to use for an op. (Subset of DType — only the ones we currently dispatch on.)
PrecisionPolicy: Declarative precision policy for graph compilation.
QuantScheme: How a tensor is quantized. Mirrors the schemes RLX needs for LLM inference on Apple Silicon: blockwise int8 (GPTQ-style), blockwise int4 (Q4_K), and per-tensor fp8 (e4m3 / e5m2).

Traits§

GraphExt: Extension trait for shape-inferred graph building.
Pass: A graph-to-graph transformation pass.

Functions§

available_devices: Every variant currently available — Cargo-feature-gated or runtime-registered.
benchmark_devices: Compile (if needed) and time each backend; returns results sorted fastest first.
device_chain_from_env: Ordered fallback chain from RLX_DEVICE_CHAIN (cuda,gpu,cpu).
device_from_env: Default device hint from RLX_DEVICE (or PREFIX_DEVICE via device_from_env_key).
device_label: Stable lower-case label for device (matches Cargo feature names).
device_report: Explain which backends are viable for graph under policy.
devices_for: Intersection of available_devices and supports_graph. Use with crate::GraphDevices or crate::DevicePolicy to restrict the set.
devices_for_with_policy: Backends on this host that can lower graph, filtered by policy.
fastest_device: Highest-priority backend that is compiled in and live on this host.
fastest_device_for: Pick the fastest backend for graph on this host.
fusion_passes: Return the ordered fusion passes for target.
fusion_passes_for_supported: Return the ordered fusion passes allowed for supported.
graph_param_names: Param names declared in graph (Op::Param).
hvp: Hessian-vector product via forward-over-reverse.
inspect_graph: Annotated graph dump (MIR body). Alias for pretty_print.
inspect_graph_diff: Summarize graph changes between pipeline stages.
inspect_hir: Annotated HIR module dump.
inspect_hir_stats: One-line HIR summary (header + op histogram).
inspect_lir: Annotated LIR dump: optimized MIR + buffer plan + schedule.
inspect_mir: Annotated MIR module dump (optimized tensor DAG).
inspect_mir_diff: Diff two MIR snapshots (typically pre/post fusion).
inspect_mir_stats: One-line MIR summary.
inspect_pipeline: Inspect every lowering stage for hir through pipeline.
is_available
jvp: Compute the JVP graph for forward, perturbing each Input / Param named in tangent_for. Returns a new graph whose outputs are [primals..., tangents...], in the order forward listed them.
maybe_dump_pipeline: Write a full pipeline dump when RLX_IR_DUMP is set (path prefix or directory).
node_label: Best-effort label for diagnostics (origin label, node name, or id).
parse_device: Lower-case Cargo feature names and common aliases → Device.
parse_device_list: Parse comma/semicolon/whitespace-separated device lists (RLX_DEVICES=cpu,metal).
resolve_device: Resolve the backend to use: explicit hint → env → fastest for graph.
resolve_device_chain: First device in chain that is viable under policy for graph.
run_with_fallback: Try chain in order; return the first successful result from run.
scalar_constant_bytes: Encode a scalar as little-endian bytes for crate::op::Op::Constant.
supported_for_target: Per-target op claims used when a backend doesn’t supply an explicit supported_ops slice. Must stay aligned with each backend’s *_SUPPORTED_OPS in rlx-runtime/src/backend.rs.
supports_op: True when supported is empty (no claim) or contains kind.
vmap: Vectorize forward over a leading batch axis.
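
Several of the functions above (parse_device_list, device_chain_from_env, run_with_fallback) describe a parse-then-try-in-order pattern. The sketch below illustrates that pattern in plain Rust. It is not rlx's implementation: `parse_list` and `try_chain` are hypothetical names, devices are plain strings rather than the Device enum, and the error type is an assumption.

```rust
/// Hypothetical stand-in for a device-list parser: split on commas,
/// semicolons, or whitespace, lower-case each entry, drop empty ones.
fn parse_list(s: &str) -> Vec<String> {
    s.split(|c: char| c == ',' || c == ';' || c.is_whitespace())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_ascii_lowercase())
        .collect()
}

/// Hypothetical fallback runner: try each device in order and return the
/// first success together with the device that produced it. If every
/// device fails, return all collected (device, error) pairs.
fn try_chain<T, F>(chain: &[String], mut run: F) -> Result<(String, T), Vec<(String, String)>>
where
    F: FnMut(&str) -> Result<T, String>,
{
    let mut errors = Vec::new();
    for dev in chain {
        match run(dev) {
            Ok(value) => return Ok((dev.clone(), value)),
            Err(e) => errors.push((dev.clone(), e)),
        }
    }
    Err(errors)
}

fn main() {
    let chain = parse_list("cuda, gpu;cpu");
    // Pretend only the CPU backend is live on this host.
    let result = try_chain(&chain, |dev| {
        if dev == "cpu" {
            Ok(42)
        } else {
            Err(format!("{dev} unavailable"))
        }
    });
    println!("{:?}", result);
}
```

Collecting every error rather than only the last one matches the description of DeviceFallbackError, which reports what went wrong on each backend in the chain.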

Type Aliases§

CalibrationRecord: Map of tap NodeId → calibrated quant params.
Error: Crate-wide error type — alias of anyhow::Error.
Result: Crate-wide result type — alias of anyhow::Result<T>. Use this in main() and library boundaries.
