// RLX — versatile ML compiler + runtime.
// Copyright (C) 2026 Eugene Hauptmann, Nataliya Kosmyna.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, version 3.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.
//! # RLX
//!
//! A small ML compiler + runtime for transformer inference and training,
//! with a JAX-shaped IR + autodiff + transforms (`jvp`, `hvp`, `vmap`)
//! on top of CPU / Apple Silicon (Metal / MLX) / NVIDIA (CUDA) / AMD
//! (ROCm) / Google TPU / cross-platform GPU (wgpu) / FPGA / Cortex-M
//! backends.
//!
//! This is the **prelude crate**: it pulls in the framework-level
//! workspace members and re-exports the common types, so a one-line
//! `use rlx::prelude::*;` covers most usage.
//!
//! ## Three usage patterns
//!
//! ### 1. Build + run a graph by hand
//!
//! ```ignore
//! use rlx::prelude::*;
//!
//! let mut g = Graph::new("hello");
//! let x = g.input("x", Shape::new(&[1, 4], DType::F32));
//! let w = g.param("w", Shape::new(&[4, 2], DType::F32));
//! let y = g.matmul(x, w, Shape::new(&[1, 2], DType::F32));
//! g.set_outputs(vec![y]);
//!
//! let mut compiled = Session::new(Device::Cpu).compile(g);
//! compiled.set_param("w", &[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]);
//! let out = compiled.run(&[("x", &[1.0, 2.0, 3.0, 4.0])]);
//! ```
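//!
//! As a sanity check on the numbers above, the same product can be worked
//! out in plain Rust with no RLX calls. This is only a sketch: it assumes
//! the eight values passed to `set_param` fill the `[4, 2]` weight in
//! row-major order (the layout is not stated here), and `matmul_row_major`
//! is a hypothetical helper, not part of this crate.
//!
//! ```rust
//! /// Hypothetical reference matmul: `a` is `[m, k]`, `b` is `[k, n]`, both row-major.
//! fn matmul_row_major(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
//!     assert_eq!(a.len(), m * k);
//!     assert_eq!(b.len(), k * n);
//!     let mut out = vec![0.0f32; m * n];
//!     for i in 0..m {
//!         for j in 0..n {
//!             for p in 0..k {
//!                 out[i * n + j] += a[i * k + p] * b[p * n + j];
//!             }
//!         }
//!     }
//!     out
//! }
//!
//! fn main() {
//!     let x = [1.0, 2.0, 3.0, 4.0]; // input "x", shape [1, 4]
//!     let w = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]; // param "w", shape [4, 2], assumed row-major
//!     let y = matmul_row_major(&x, &w, 1, 4, 2);
//!     assert_eq!(y, vec![4.0, 6.0]); // 1 + 3 and 2 + 4
//! }
//! ```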
//!
//! ## Module map
//!
//! Every workspace crate is reachable as a module on `rlx`:
//!
//! | path | crate | what |
//! |-----------------|-----------------|---------------------------------------------------------------------------------|
//! | `rlx::ir` | `rlx-ir` | IR types, ops, graph builder |
//! | `rlx::opt` | `rlx-opt` | facade: `rlx-fusion` + `rlx-autodiff` + `rlx-compile` |
//! | `rlx::driver` | `rlx-driver` | `Device` enum, registries |
//! | `rlx::runtime` | `rlx-runtime` | `Session`, `CompiledGraph` |
//! | `rlx::macros` | `rlx-macros` | `#[rlx_model]`, `pipeline_schedule!` proc macros |
//! | `rlx::gguf` | `rlx-gguf` | GGUF parser + dequant *(feature `gguf`)* |
//! | `rlx::bench` | `rlx-bench` | benchmark harness *(feature `bench`)* |
//! | `rlx::sparse` | `rlx-sparse` | downstream: sparse linalg *(feature `sparse`)* |
//! | `rlx::splat` | `rlx-splat` | 3D Gaussian splatting *(feature `splat`)* — `register()`, decomposed IR ops |
//! | `rlx::linalg` | `rlx-linalg` | downstream: dense linalg via LAPACK *(feature `linalg`)* |
//! | `rlx::umap` | `rlx-umap` | downstream: UMAP / fast-umap custom ops (k-NN from pairwise distances) |
//! | `rlx::cortexm` | `rlx-cortexm` | INT8 ARMv7E-M kernels *(feature `cortexm`)* — no `Backend` impl, kernels only |
//! | `rlx::fpga` | `rlx-fpga` | IR → SystemVerilog datapath synthesis *(feature `fpga`)* — no `Backend` impl |
//!
//! ## Convenience namespaces
//!
//! Grouped re-exports for related concerns — use these when you want
//! one focused subset without star-importing the whole prelude:
//!
//! | namespace | what |
//! |----------------------|-------------------------------------------------------------------------------|
//! | [`rlx::quant`] | `QuantScheme`, `QuantMap` (IR quantization metadata) |
//! | [`rlx::ops`] | `Activation`, `BinaryOp`, `CmpOp`, `MaskKind`, `ChainStep`, `ChainOperand` |
//! | [`rlx::autodiff`] | `jvp`, `hvp`, `vmap` + the autodiff entry points |
//! | [`rlx::prelude`] | star-import target covering the 95% case |
//!
//! ## Backend feature gates
//!
//! Pick the ones that match your hardware. Multiple backends can be
//! enabled at once; the runtime picks one per `Session`.
//!
//! | feature | backend | platform |
//! |---------------------|--------------------------------------|---------------------------|
//! | `cpu` *(default)* | NEON / AVX + Accelerate / OpenBLAS | every host |
//! | `metal` | Metal Performance Shaders + MSL | macOS (Apple Silicon) |
//! | `mlx` | Apple MLX (vendored) | macOS (Apple Silicon) |
//! | `gpu` | wgpu (Vulkan / DX12 / WebGPU / Metal)| cross-platform |
//! | `cuda` | cuBLAS / cuDNN / NVRTC | Linux / Windows + NVIDIA |
//! | `rocm` | hipBLAS / MIOpen | Linux + AMD |
//! | `tpu` | libtpu PJRT plugin | Linux + GCP TPU |
//! | `blas-accelerate` | macOS Accelerate | macOS |
//! | `blas-mkl` | Intel MKL | Intel / AMD CPUs |
//! | `blas-openblas` | OpenBLAS | cross-platform CPU |
//!
//! ## Convenience aggregates
//!
//! Single-flag setups for common platforms. Each composes the
//! fragments most users want for that target.
//!
//! | feature | expands to |
//! |-------------------|---------------------------------------------|
//! | `apple-silicon` | `cpu` + `metal` + `blas-accelerate` |
//! | `nvidia` | `cpu` + `cuda` |
//! | `edge` | `cpu` + `cortexm` |
//! | `all-cpu` | `cpu` + `gguf` + `linalg` |
//!
//! `mlx` and `rocm` aren't in any aggregate because their crates
//! aren't on crates.io (vendor-bundled submodule /
//! workspace-relative kernel sources). To opt in, depend on the
//! workspace via git and add the feature explicitly:
//!
//! ```toml
//! rlx = { git = "https://github.com/MIT-RLX/rlx", features = ["apple-silicon", "mlx"] }
//! ```
// ── Module re-exports ───────────────────────────────────────────
/// Tensor IR — types, shapes, ops, graph builder.
/// See [`rlx-ir`](https://crates.io/crates/rlx-ir).
pub use rlx_ir as ir;
/// Graph rewrites + autodiff + vmap.
/// See [`rlx-opt`](https://crates.io/crates/rlx-opt).
pub use rlx_opt as opt;
/// Device enum + cross-cutting types.
/// See [`rlx-driver`](https://crates.io/crates/rlx-driver).
pub use rlx_driver as driver;
/// User-facing `Session` / `CompiledGraph`.
/// See [`rlx-runtime`](https://crates.io/crates/rlx-runtime).
pub use rlx_runtime as runtime;
/// Procedural macros (`#[rlx_model]`, `pipeline_schedule!`).
/// See [`rlx-macros`](https://crates.io/crates/rlx-macros).
pub use rlx_macros as macros;
/// GGUF v1 / v2 / v3 parser + dequant.
/// See [`rlx-gguf`](https://crates.io/crates/rlx-gguf).
#[cfg(feature = "gguf")]
pub use rlx_gguf as gguf;
/// Uniform benchmark harness.
/// See [`rlx-bench`](https://crates.io/crates/rlx-bench).
#[cfg(feature = "bench")]
pub use rlx_bench as bench;
/// Downstream: sparse linear algebra (custom-op scaffold).
/// See [`rlx-sparse`](https://crates.io/crates/rlx-sparse).
#[cfg(feature = "sparse")]
pub use rlx_sparse as sparse;
/// Downstream: dense linalg via LAPACK (custom-op scaffold).
/// See [`rlx-linalg`](https://crates.io/crates/rlx-linalg).
#[cfg(feature = "linalg")]
pub use rlx_linalg as linalg;
/// Downstream: 3D Gaussian splatting (CPU reference render custom op).
/// See [`rlx-splat`](https://crates.io/crates/rlx-splat).
#[cfg(feature = "splat")]
pub use rlx_splat as splat;
/// Downstream: UMAP / fast-umap custom ops (k-NN from pairwise distances).
pub use rlx_umap as umap;
/// `no_std` ARMv7E-M INT8 kernels (Cortex-M4F / M7). Doesn't
/// implement `Backend` — call the kernels (`dense`, `conv2d`,
/// `maxpool`, `relu`, `argmax`) directly.
/// See [`rlx-cortexm`](https://crates.io/crates/rlx-cortexm).
#[cfg(feature = "cortexm")]
pub use rlx_cortexm as cortexm;
/// IR → SystemVerilog datapath synthesis. Doesn't implement
/// `Backend` — synth + P&R takes minutes; the entry point is
/// `rlx::fpga::codegen::emit_model`.
/// See [`rlx-fpga`](https://crates.io/crates/rlx-fpga).
#[cfg(feature = "fpga")]
pub use rlx_fpga as fpga;
// ── Error types ─────────────────────────────────────────────────
//
// The whole stack returns `anyhow::Result<T>` — `rlx::Result` /
// `rlx::Error` make that the obvious choice for downstream code
// without forcing an explicit `anyhow` dep at the call site.
/// Crate-wide result type — alias of `anyhow::Result<T>`. Use this
/// in `main()` and library boundaries.
pub type Result<T, E = Error> = anyhow::Result<T, E>;
/// Crate-wide error type — alias of `anyhow::Error`.
pub type Error = anyhow::Error;
// ── Flat re-exports for the most-common types ───────────────────
//
// These cover ~90% of user code: build a graph with rlx_ir types,
// compile + run it through Session, then read back outputs. Less
// common types stay reachable via the module re-exports above.
pub use Device;
pub use QuantScheme;
pub use ;
pub use ;
pub use ;
pub use ;
// ── Grouped namespaces ──────────────────────────────────────────
/// Quantization metadata — schemes the IR carries per-tensor, plus
/// the `QuantMap` graph-level annotation. Use these when wiring
/// `Op::DequantMatMul` or attaching quant info to your own ops.
///
/// ```ignore
/// use rlx::quant::QuantScheme;
///
/// let scheme = QuantScheme::GgufQ4K; // GGUF Q4_K super-block
/// assert!(scheme.is_gguf());
/// assert_eq!(scheme.gguf_block_bytes(), 144);
/// ```
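///
/// The 144 in the example above matches the GGUF Q4_K super-block layout
/// as defined by llama.cpp's `block_q4_K`. A rough sketch of the
/// arithmetic follows; the constant and function names are illustrative
/// only and are not part of `rlx::quant`:
///
/// ```rust
/// const QK_K: usize = 256;        // weights per Q4_K super-block
/// const K_SCALE_SIZE: usize = 12; // 8 sub-blocks x (6-bit scale + 6-bit min), packed
/// const F16_BYTES: usize = 2;     // `d` and `dmin` are each stored as f16
///
/// /// Bytes per super-block: d + dmin + packed scales/mins + 4-bit quants.
/// fn q4_k_block_bytes() -> usize {
///     2 * F16_BYTES + K_SCALE_SIZE + QK_K / 2
/// }
///
/// fn main() {
///     assert_eq!(q4_k_block_bytes(), 144);
///     // 144 bytes * 8 bits / 256 weights = 4.5 bits per weight (9 half-bits).
///     assert_eq!(q4_k_block_bytes() * 8 * 2 / QK_K, 9);
/// }
/// ```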
/// Op-builder helper enums — the variants the graph builder methods
/// (`g.binary`, `g.compare`, `g.activation`, `g.attention_kind`, …)
/// take as their first argument, plus the fused-chain primitives
/// used by `Op::ElementwiseRegion`.
///
/// ```ignore
/// use rlx::{Graph, Shape, DType};
/// use rlx::ops::{Activation, BinaryOp};
///
/// let mut g = Graph::new("ex");
/// let x = g.input("x", Shape::new(&[4], DType::F32));
/// let y = g.input("y", Shape::new(&[4], DType::F32));
/// let s = g.binary(BinaryOp::Add, x, y, Shape::new(&[4], DType::F32));
/// let r = g.activation(Activation::Silu, s, Shape::new(&[4], DType::F32));
/// g.set_outputs(vec![r]);
/// ```
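///
/// For reference, the two-op graph above should compute `silu(x + y)`
/// elementwise, assuming `Activation::Silu` is the usual
/// `v * sigmoid(v)` (that definition is an assumption, not confirmed
/// here). A plain-Rust sketch with hypothetical helper names, useful for
/// checking backend output against expected values:
///
/// ```rust
/// /// SiLU written as v / (1 + e^-v), which equals v * sigmoid(v).
/// fn silu(v: f32) -> f32 {
///     v / (1.0 + (-v).exp())
/// }
///
/// /// Elementwise `silu(x + y)` over two equal-length slices.
/// fn add_then_silu(x: &[f32], y: &[f32]) -> Vec<f32> {
///     assert_eq!(x.len(), y.len());
///     x.iter().zip(y).map(|(a, b)| silu(a + b)).collect()
/// }
///
/// fn main() {
///     let out = add_then_silu(&[0.0, 1.0, -1.0, 2.0], &[0.0, 0.0, 1.0, -1.0]);
///     assert_eq!(out[0], 0.0); // silu(0) = 0
///     assert!((out[1] - 0.731_058_6).abs() < 1e-6); // silu(1)
///     assert_eq!(out[1], out[3]); // 1 + 0 and 2 - 1 both give silu(1)
/// }
/// ```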
/// Autodiff + transforms — re-exports the public entry points from
/// `rlx_opt`. Use these when computing gradients or doing
/// `vmap` / `jvp` / `hvp` over a graph.
///
/// ```ignore
/// use rlx::autodiff::{jvp, vmap};
/// ```
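///
/// As a reminder of what a JVP is: for a function `f`, a point `x` and a
/// tangent `v`, it returns `f(x)` together with the directional
/// derivative `f'(x) * v`. The sketch below illustrates the math with
/// scalar dual numbers. `Dual` and `jvp_scalar` are hypothetical names
/// and say nothing about how `rlx::autodiff::jvp` is implemented or what
/// its signature looks like.
///
/// ```rust
/// /// A value paired with its tangent (forward-mode dual number).
/// #[derive(Clone, Copy, Debug, PartialEq)]
/// struct Dual {
///     val: f64,
///     tan: f64,
/// }
///
/// impl std::ops::Add for Dual {
///     type Output = Dual;
///     fn add(self, rhs: Dual) -> Dual {
///         Dual { val: self.val + rhs.val, tan: self.tan + rhs.tan }
///     }
/// }
///
/// impl std::ops::Mul for Dual {
///     type Output = Dual;
///     fn mul(self, rhs: Dual) -> Dual {
///         // Product rule: (a * b)' = a' * b + a * b'.
///         Dual { val: self.val * rhs.val, tan: self.tan * rhs.val + self.val * rhs.tan }
///     }
/// }
///
/// /// Returns `(f(x), f'(x) * v)` by pushing the tangent `v` through `f`.
/// fn jvp_scalar(f: impl Fn(Dual) -> Dual, x: f64, v: f64) -> (f64, f64) {
///     let out = f(Dual { val: x, tan: v });
///     (out.val, out.tan)
/// }
///
/// fn main() {
///     // f(x) = x * x + x, so f'(x) = 2x + 1; at x = 3 that is 12 and 7.
///     let (y, dy) = jvp_scalar(|x| x * x + x, 3.0, 1.0);
///     assert_eq!((y, dy), (12.0, 7.0));
/// }
/// ```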
// ── Prelude — single `use rlx::prelude::*;` for the 95% case ────
//
// Includes the graph-building / runtime types, common IR helper
// enums, and autodiff entry points. Skips less-common
// types — those stay reachable via the module re-exports above.
/// Star-import target covering the 95% case:
///
/// ```ignore
/// use rlx::prelude::*;
///
/// // graph building
/// let mut g = Graph::new("ex");
/// let x = g.input("x", Shape::new(&[1, 4], DType::F32));
///
/// // compile + run
/// let mut compiled = Session::new(Device::Cpu).compile(g);
/// let out = compiled.run(&[("x", &[1.0; 4])]);
///
/// ```