RLX Runtime — the user-facing API.
Provides a unified Session that compiles and executes IR graphs
on the selected backend. Backend selection is via Cargo features:
[dependencies]
rlx-runtime = { version = "0.1", features = ["cpu"] } # CPU (default)
rlx-runtime = { version = "0.1", features = ["blas-accelerate"] } # CPU + Apple Accelerate
rlx-runtime = { version = "0.1", features = ["blas-mkl"] } # CPU + Intel MKL
# rlx-runtime = { version = "0.1", features = ["gpu"] } # GPU via wgpu
# rlx-runtime = { version = "0.1", features = ["cuda"] } # GPU via CUDA
§Example
use rlx_runtime::*;
use rlx_ir::*;
// Build a graph
let mut g = Graph::new("example");
let x = g.input("x", Shape::new(&[2, 4], DType::F32));
let w = g.param("w", Shape::new(&[4, 3], DType::F32));
let b = g.param("b", Shape::new(&[3], DType::F32));
let mm = g.matmul(x, w, Shape::new(&[2, 3], DType::F32));
let out = g.binary(op::BinaryOp::Add, mm, b, Shape::new(&[2, 3], DType::F32));
g.set_outputs(vec![out]);
// Compile and execute
let session = Session::new(Device::Cpu);
let mut compiled = session.compile(g);
compiled.set_param("w", &[1.0f32; 12]);
compiled.set_param("b", &[0.0f32; 3]);
let result = compiled.run(&[("x", &[1.0f32; 8])]);
Re-exports§
pub use aot_cache::AotCache;
pub use aot_cache::AotCacheError;
pub use backend::Backend;
pub use backend::ExecutableGraph;
pub use backend::compile_hir;
pub use backend::compile_module;
pub use compile_cache::BucketedCompileCache;
pub use compile_cache::CacheRunInput;
pub use compile_cache::CompileCache;
pub use compile_cache::DynamicDimCompileCache;
pub use compile_cache::pad_rows;
pub use compile_cache::slice_rows;
pub use compiled::CompiledGraph;
pub use device_ext::available_devices;
pub use device_ext::dispatch_report_for_device;
pub use device_ext::dispatch_report_for_device_with_options;
pub use device_ext::first_unsupported_op;
pub use device_ext::first_unsupported_op_with_options;
pub use device_ext::full_name;
pub use device_ext::is_available;
pub use device_ext::legalize_graph_for_device;
pub use device_ext::legalize_graph_for_device_with_options;
pub use device_ext::legalize_graph_for_device_with_report;
pub use device_ext::supports;
pub use device_ext::supports_graph;
pub use device_ext::supports_graph_with_options;
pub use expert_pool::ExpertPool;
pub use expert_pool::ExpertPoolConfig;
pub use expert_pool::ExpertPoolStats;
pub use expert_pool::ExpertRefreshPolicy;
pub use expert_pool::ExpertRefreshResult;
pub use expert_pool::MoEExecMode;
pub use expert_pool::gpu_expert_budget_from_vram;
pub use kv_cache::LayerKvCache;
pub use memory_estimate::MoeOffloadEstimate;
pub use memory_estimate::estimate_moe_offload;
pub use model_pipeline::ModelCompilePipeline;
pub use options::CompileOptions;
pub use precision::Precision;
pub use reflect::ModelReflection;
pub use reflect::load_hir_template_with_extensions;
pub use reflect::specialize_entry;
pub use registry::BackendFactory;
pub use registry::backend_for;
pub use registry::register_backend;
pub use registry::registered_devices;
pub use session::Session;
pub use stages::compile_graph_stages;
pub use stages::compile_graph_stages_for_backend;
pub use stages::compile_hir_stages;
pub use stages::compile_module_stages;
pub use stages::fusion_target_for;
pub use stages::graph_from_lir;
pub use stages::maybe_log_fusion;
pub use stages::options_with_supported_ops;
pub use stages::pipeline_for;
pub use subgraph::SubgraphCache;
pub use subgraph::run_if;
pub use subgraph::run_while;
pub use expert_pool::merged_resident_mask;
pub use expert_pool::per_layer_resident_masks;
pub use moe_expert_store::ExpertStackF32;
pub use moe_expert_store::LayerMoeWeights;
pub use moe_expert_store::MoeExpertStore;
pub use weight_registry::WeightEntry;
pub use weight_registry::WeightHandle;
pub use weight_registry::WeightKind;
pub use weight_registry::WeightRegistry;
pub use weights::BytesWeightLoader;
pub use weights::WeightLoader;
Modules§
- aot_cache
- AOT cache — persist optimized LIR modules and reload for backend compile.
- attn_mask
- Attention-mask helpers for bucketed decode (pad-to-upper, slice-back).
- backend
- Backend trait — abstraction over CPU/GPU/CUDA execution.
- compile_cache
- Shape-bucketed compile cache.
- compiled
- Compiled graph — the hot-path execution object.
- cost
- Cross-backend cost interface.
- custom_ops
- Custom-op extensibility (plan #25).
- device_ext
- Engine-layer extensions for rlx_driver::Device (plan #58).
- env
- Unified RLX_* configuration — readable from code overrides or process env.
- expert_pool
- MoE expert residency pool (TIDE-style predictive offload).
- hwinfo
- Hardware introspection (plan #47).
- jacfwd
- Forward-mode Jacobian materialization.
- kernel_trace
- Compile-time gated kernel tracing (plan #7).
- kv_cache
- Per-layer K/V cache for autoregressive decode (Whisper, Qwen, Gemma, …).
- logit_verify
- Logit / output verification (plan #61).
- lora_scheduler
- LoRA-aware request scheduling (plan #33).
- memory_estimate
- Pre-load memory estimation (plan #35).
- mock_requests
- Mock request payloads for tests (plan #64).
- model_pipeline
- Three-step model compile pipeline (template → specialize → backend).
- moe_expert_store
- Per-expert F32 weight slabs for MoE offload (TIDE-style migration source).
- nan_check
- NaN/inf check epilogue (plan #18).
- op
- Operation types — every tensor op in the RLX IR.
- op_registry
- Op registry — re-exported from rlx-ir.
- options
- Unified compile options.
- paged_kv
- Paged KV cache + continuous batching (plan #31).
- perfetto
- PLAN L3 — Perfetto / chrome-trace JSON tracing. Lives in rlx-ir (alongside the Tick cycle counter it depends on) so every backend can instrument per-thunk without crate-dep gymnastics. Re-exported here so callers see one consistent rlx_runtime::perfetto::TraceSpan. PLAN L3: Perfetto / chrome-trace JSON output for cross-backend timeline capture.
- phase
- Phase-aware streaming inference (plan #16).
- precision
- Precision selection for graph execution.
- record_replay
- Record/replay middleware (plan #63).
- reflect
- Model reflection services (Slang compiler/runtime API §5).
- registry
- Backend registry — a single registration point for all backends.
- router
- Multi-protocol request router (plan #32).
- session
- Session — the main entry point for compiling and executing graphs.
- spec_decode
- Speculative decoding scheduling pattern (plan #34).
- stages
- Shared HIR → MIR → LIR compile stages for runtime backends.
- subgraph
- Sub-graph execution helper.
- telemetry
- Telemetry primitives (plan #65).
- trace
- Tracing API — build IR graphs by recording operations on traced tensors.
- validators
- Composable request validators (plan #84).
- weight_registry
- Named weights registry (plan #24).
- weights
- Weight-loading abstraction.
- worker_pool
- Worker pool with isolation primitives (plan #36).
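The compile_cache and attn_mask entries describe shape bucketing: pad a batch up to the nearest compiled bucket, run, then slice the result back. The following is a minimal sketch of that idea on flat row-major f32 buffers. The names pick_bucket, pad_rows_sketch, and slice_rows_sketch are hypothetical; this page does not show the signatures of the crate's own pad_rows and slice_rows, so nothing here should be read as their API.

```rust
/// Smallest bucket that can hold `rows`, or None if every bucket is too small.
/// Assumption: `buckets` is sorted in ascending order.
fn pick_bucket(buckets: &[usize], rows: usize) -> Option<usize> {
    buckets.iter().copied().find(|&b| b >= rows)
}

/// Pad a row-major [rows, cols] buffer with zero rows up to `target_rows`.
fn pad_rows_sketch(data: &[f32], rows: usize, cols: usize, target_rows: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols);
    assert!(target_rows >= rows);
    let mut out = Vec::with_capacity(target_rows * cols);
    out.extend_from_slice(data);
    // Fill the extra rows with zeros.
    out.resize(target_rows * cols, 0.0);
    out
}

/// Keep only the first `rows` rows of a row-major buffer.
fn slice_rows_sketch(data: &[f32], rows: usize, cols: usize) -> &[f32] {
    &data[..rows * cols]
}
```

With buckets of 1, 4, and 8 rows, a 3-row input would be padded to 4 rows before execution and sliced back to 3 rows afterwards, so one compiled graph per bucket can serve every smaller batch.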
Macros§
- ktrace
- Compile-time gated kernel trace. Expands to a no-op call without the kernel-trace feature; the optimizer removes it entirely.
- pipeline_schedule
- Compile-time pipeline scheduler (plan #11). See pipeline_schedule_impl in this crate’s private pipeline module for the full grammar.
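One common way to get a trace call that costs nothing when a feature is off is to branch on cfg!, which expands to a constant boolean so the optimizer can drop the dead branch. The sketch below illustrates that pattern only. The macro name ktrace_sketch is hypothetical, and the crate's actual ktrace expansion is not shown on this page and may differ.

```rust
/// Hypothetical stand-in for a feature-gated trace macro (not the crate's `ktrace!`).
/// `cfg!(feature = "kernel-trace")` is a compile-time constant, so without the
/// feature the branch is `if false { .. }` and is removed by the optimizer.
macro_rules! ktrace_sketch {
    ($($arg:tt)*) => {
        if cfg!(feature = "kernel-trace") {
            eprintln!($($arg)*);
        }
    };
}

/// Example kernel wrapper: traces its input size, then does the work.
fn traced_sum(xs: &[f32]) -> f32 {
    ktrace_sketch!("summing {} elements", xs.len());
    xs.iter().sum()
}
```

The arguments are still type-checked when the feature is off, which keeps trace statements from silently rotting.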
Structs§
- BarrierToken
- Opaque ticket returned by AsyncCopy::issue. Pass back to AsyncCopy::wait to block until the corresponding copy is done. Tokens are scoped to one engine — don’t pass them across.
- BindingManifest
- Full I/O + parameter manifest for a compiled graph.
- Buffer
- A buffer that knows where its bytes live.
- BufferHandle
- External, persistent buffer reference. Created once, bound at compile, carried across many compiled.run() invocations.
- CacheBuster
- Cache-busting buffer — sized to evict L1+L2 on Apple Silicon (M-series: 192 KB L1d / core, 16 MB L2 shared per cluster). Borrowed from MAX’s internal_utils/_cache_busting.mojo (#19).
- DoubleBuffer
- Two-buffer ring. current() is what compute reads this step; next_mut() is where the next async copy should land. Call swap() after waiting on the current copy to advance.
- Graph
- A computation graph — the core IR data structure.
- HirReflection
- Introspection of an unspecialized HirModule (loadModule analogue).
- IoBindingEntry
- One named graph boundary with arena layout after buffer planning.
- KernelDispatchConfig
- Per-compile overrides on top of KernelDispatchPolicy.
- LocalTransport
- Single-machine in-process transport. All num_ranks “ranks” share one SymmetricHeap instance, so put / get are just locks + memcpy. Useful for unit tests and for algorithm-correctness checking of collective ops without a real cluster.
- ManifestDiff
- Compare template vs specialized manifests (dims / arena may differ).
- ModelComponent
- Full specialization + binding bundle (Slang shader-component analogue).
- ModelVariant
- Concrete shape bucket for compile-once / specialize-at-runtime workflows.
- MoeResidencyStats
- MoeTopkCapture
- Shared capture buffer — one entry per MoE router TopK in schedule order.
- Node
- A single node in the computation graph.
- NodeId
- Stable identifier for a node in the graph. Indices are never reused.
- PipelineInspect
- Text dump of each compiler pipeline stage.
- Rank
- Identifier for a participant in a collective. Ranks are 0..num_ranks and stay stable for the lifetime of a transport.
- RlxEnv
- Bulk builder for code-side RLX_* overrides.
- RuntimeOverrides
- RAII guard: installs overrides on construction, restores previous values on drop.
- Shape
- Tensor shape: ordered list of dimensions + element type.
- SymmetricBuffer
- (rank, offset, len) view into a symmetric heap. The same (offset, len) pair is valid on every rank — that’s what “symmetric” means.
- SymmetricHeap
- Per-rank symmetric memory: a Vec<u8> per rank, all the same size. Owned by the LocalTransport.
- SyncCopy
- CPU “async” copy — actually synchronous. issue() does a memcpy immediately and returns a fresh token; wait() is a no-op. Useful as the test fixture and for code paths that don’t actually need overlap.
- SyncStream
- Default implementation for synchronous backends — work has already happened by the time submit is called.
- Tick
- Opaque tick reading. Subtract two of these to get a Duration.
- WeightBlock
- Nested parameter block (Slang PerFrame / material grouping).
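The DoubleBuffer entry describes a two-buffer ring with current(), next_mut(), and swap(). A minimal sketch of that access pattern might look like the following. The type name DoubleBufferSketch, its fields, and its constructor are assumptions made for illustration; only the three method names come from the description, and the crate's real type may be shaped differently.

```rust
/// Hypothetical two-buffer ring (not the crate's `DoubleBuffer`).
struct DoubleBufferSketch<T> {
    bufs: [Vec<T>; 2],
    /// Index of the buffer that compute reads during this step.
    cur: usize,
}

impl<T: Clone + Default> DoubleBufferSketch<T> {
    /// Allocate two buffers of `len` default-initialized elements.
    fn new(len: usize) -> Self {
        Self {
            bufs: [vec![T::default(); len], vec![T::default(); len]],
            cur: 0,
        }
    }

    /// Buffer that compute reads during the current step.
    fn current(&self) -> &[T] {
        &self.bufs[self.cur]
    }

    /// Buffer that the next copy should write into.
    fn next_mut(&mut self) -> &mut [T] {
        &mut self.bufs[1 - self.cur]
    }

    /// Advance the ring. Call only after the pending copy has completed.
    fn swap(&mut self) {
        self.cur = 1 - self.cur;
    }
}
```

A typical step would write the next tile into next_mut() while compute reads current(), wait for the copy, then call swap() so the freshly filled buffer becomes current.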
Enums§
- CollectiveError
- CompilationMode
- When the backend executable is produced relative to the host loop.
- DType
- Scalar element type. Matches hardware-supported types.
- Device
- Target device for graph execution.
- KernelDispatchPolicy
- When to use native backend kernels vs the shared IR common body.
- ModelPhase
- Coarse execution phase (prefill vs decode vs encoder).
- Op
- An operation in the RLX IR graph.
- OpKind
- High-level op categorization for precision policies.
- PrecisionPolicy
- Declarative precision policy for graph compilation.
- ReduceKind
- Element-wise reduction operator for collectives.
Traits§
- AsyncCopy
- Pluggable async-copy engine. Backends (SyncCopy for CPU, future MetalBlitCopy for GPU) implement this.
- CommandStream
- Per-backend command stream.
- DeviceArena
- Per-backend arena interface.
- SymmetricTransport
- One-sided operation surface. put(buf, src) writes src into buf.rank’s memory at buf.offset; get(buf, dst) reads from buf.rank’s memory into dst. Both calls block until completion (a future async impl can return a future).
Functions§
- all_gather
- AllGather: every rank ends up with the concatenation of all per-rank local slices, in rank order.
- all_reduce
- AllReduce: every rank ends up with op({values from every rank}).
- apply_hir_extensions
- Apply all registered extensions in order.
- inspect_graph
- Annotated graph dump (MIR body). Alias for pretty_print.
- inspect_hir
- Annotated HIR module dump.
- inspect_hir_stats
- One-line HIR summary (header + op histogram).
- inspect_lir
- Annotated LIR dump: optimized MIR + buffer plan + schedule.
- inspect_mir
- Annotated MIR module dump (optimized tensor DAG).
- inspect_mir_stats
- One-line MIR summary.
- inspect_pipeline
- Inspect every lowering stage for hir through pipeline.
- reduce_scatter
- ReduceScatter: equivalent to AllReduce followed by partition — every rank ends up with one chunk_size-element slice of the reduced result. Rank r gets element indices [r*chunk_size, (r+1)*chunk_size).
- register_hir_extension
- Register a named extension (call from init or model crate startup).
- registered_hir_extensions
- Registered extension names in registration order.
- time_ns
- Time f, returning (result, elapsed_ns). Inlined so the surrounding loop can keep the closure body in registers.
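The three collectives listed above can be pinned down with a reference model that treats each rank's buffer as one entry in a slice and computes what every rank would hold afterwards. This is a sketch of the semantics only, using a sum reduction. The function names are hypothetical and do not reflect the signatures of the crate's all_gather, all_reduce, or reduce_scatter, which take a transport and are not shown on this page.

```rust
/// AllGather: concatenate every rank's local slice in rank order.
/// Returns the buffer that every rank would hold afterwards.
fn all_gather_sketch(locals: &[Vec<f32>]) -> Vec<f32> {
    locals.iter().flat_map(|l| l.iter().copied()).collect()
}

/// AllReduce with a sum operator: element-wise sum across ranks.
/// Every rank would end up with this same result.
fn all_reduce_sum_sketch(locals: &[Vec<f32>]) -> Vec<f32> {
    let len = locals.first().map_or(0, |l| l.len());
    let mut acc = vec![0.0f32; len];
    for l in locals {
        assert_eq!(l.len(), len, "all ranks must contribute equal-length buffers");
        for (a, v) in acc.iter_mut().zip(l) {
            *a += *v;
        }
    }
    acc
}

/// ReduceScatter with a sum operator: reduce as above, then rank r keeps
/// elements [r * chunk_size, (r + 1) * chunk_size). Entry r of the result
/// is rank r's output slice.
fn reduce_scatter_sum_sketch(locals: &[Vec<f32>], chunk_size: usize) -> Vec<Vec<f32>> {
    assert!(chunk_size > 0);
    let reduced = all_reduce_sum_sketch(locals);
    assert_eq!(reduced.len(), locals.len() * chunk_size);
    reduced.chunks(chunk_size).map(|c| c.to_vec()).collect()
}
```

A model like this is handy as an oracle in unit tests: run the real collective over an in-process transport and compare each rank's buffer against the sketch's output.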
Type Aliases§
- HirExtensionFn
- Transform applied after model flow build, before MIR lower.
Attribute Macros§
- rlx_model
- AOT compilation macro for RLX models.