Crate rlx_runtime

RLX Runtime — the user-facing API.

Provides a unified Session that compiles and executes IR graphs on the selected backend. Select the backend with Cargo features:

[dependencies]
# Keep exactly one rlx-runtime entry; TOML rejects duplicate keys.
rlx-runtime = { version = "0.1", features = ["cpu"] }                # CPU (default)
# rlx-runtime = { version = "0.1", features = ["blas-accelerate"] }  # CPU + Apple Accelerate
# rlx-runtime = { version = "0.1", features = ["blas-mkl"] }         # CPU + Intel MKL
# rlx-runtime = { version = "0.1", features = ["gpu"] }              # GPU via wgpu
# rlx-runtime = { version = "0.1", features = ["cuda"] }             # GPU via CUDA

§Example

use rlx_runtime::*;
use rlx_ir::*;

// Build a graph computing out = matmul(x, w) + b
let mut g = Graph::new("example");
let x = g.input("x", Shape::new(&[2, 4], DType::F32)); // runtime input, 2x4
let w = g.param("w", Shape::new(&[4, 3], DType::F32)); // weight, 4x3
let b = g.param("b", Shape::new(&[3], DType::F32));    // bias, 3 elements
let mm = g.matmul(x, w, Shape::new(&[2, 3], DType::F32));
// Add the [3] bias to the [2, 3] matmul result
let out = g.binary(op::BinaryOp::Add, mm, b, Shape::new(&[2, 3], DType::F32));
g.set_outputs(vec![out]);

// Compile and execute
let session = Session::new(Device::Cpu);
let mut compiled = session.compile(g);
// Parameter buffers are flat: 4 * 3 = 12 values for w, 3 for b
compiled.set_param("w", &[1.0f32; 12]);
compiled.set_param("b", &[0.0f32; 3]);
// Input buffer is flat as well: 2 * 4 = 8 values for x
let result = compiled.run(&[("x", &[1.0f32; 8])]);
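
As a sanity check on what the example computes, the sketch below reproduces the same arithmetic in plain Rust with no RLX dependency. It assumes that matmul is an ordinary row-major matrix product and that adding b broadcasts the bias across rows; neither is stated on this page. reference_linear is a hypothetical helper, not part of rlx_runtime.

```rust
/// Hypothetical reference: out[i, j] = b[j] + sum_p x[i, p] * w[p, j].
/// Assumes row-major layout and a bias broadcast over rows.
fn reference_linear(x: &[f32], w: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(x.len(), m * k);
    assert_eq!(w.len(), k * n);
    assert_eq!(b.len(), n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Start from the bias, then accumulate the dot product.
            let mut acc = b[j];
            for p in 0..k {
                acc += x[i * k + p] * w[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    out
}

fn main() {
    // Same sizes and fill values as the example: x is 2x4, w is 4x3, b has 3 elements.
    let out = reference_linear(&[1.0; 8], &[1.0; 12], &[0.0; 3], 2, 4, 3);
    println!("{:?}", out);
}
```

Under those assumptions each of the six output elements is 4.0 (four products of 1.0 × 1.0 plus a zero bias), which is what result would be expected to hold.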

Re-exports§

pub use aot_cache::AotCache;
pub use aot_cache::AotCacheError;
pub use backend::Backend;
pub use backend::ExecutableGraph;
pub use backend::compile_hir;
pub use backend::compile_module;
pub use compile_cache::BucketedCompileCache;
pub use compile_cache::CacheRunInput;
pub use compile_cache::CompileCache;
pub use compile_cache::DynamicDimCompileCache;
pub use compile_cache::pad_rows;
pub use compile_cache::slice_rows;
pub use compiled::CompiledGraph;
pub use device_ext::available_devices;
pub use device_ext::dispatch_report_for_device;
pub use device_ext::dispatch_report_for_device_with_options;
pub use device_ext::first_unsupported_op;
pub use device_ext::first_unsupported_op_with_options;
pub use device_ext::full_name;
pub use device_ext::is_available;
pub use device_ext::legalize_graph_for_device;
pub use device_ext::legalize_graph_for_device_with_options;
pub use device_ext::legalize_graph_for_device_with_report;
pub use device_ext::supports;
pub use device_ext::supports_graph;
pub use device_ext::supports_graph_with_options;
pub use expert_pool::ExpertPool;
pub use expert_pool::ExpertPoolConfig;
pub use expert_pool::ExpertPoolStats;
pub use expert_pool::ExpertRefreshPolicy;
pub use expert_pool::ExpertRefreshResult;
pub use expert_pool::MoEExecMode;
pub use expert_pool::gpu_expert_budget_from_vram;
pub use kv_cache::LayerKvCache;
pub use memory_estimate::MoeOffloadEstimate;
pub use memory_estimate::estimate_moe_offload;
pub use model_pipeline::ModelCompilePipeline;
pub use options::CompileOptions;
pub use precision::Precision;
pub use reflect::ModelReflection;
pub use reflect::load_hir_template_with_extensions;
pub use reflect::specialize_entry;
pub use registry::BackendFactory;
pub use registry::backend_for;
pub use registry::register_backend;
pub use registry::registered_devices;
pub use session::Session;
pub use stages::compile_graph_stages;
pub use stages::compile_graph_stages_for_backend;
pub use stages::compile_hir_stages;
pub use stages::compile_module_stages;
pub use stages::fusion_target_for;
pub use stages::graph_from_lir;
pub use stages::maybe_log_fusion;
pub use stages::options_with_supported_ops;
pub use stages::pipeline_for;
pub use subgraph::SubgraphCache;
pub use subgraph::run_if;
pub use subgraph::run_while;
pub use expert_pool::merged_resident_mask;
pub use expert_pool::per_layer_resident_masks;
pub use moe_expert_store::ExpertStackF32;
pub use moe_expert_store::LayerMoeWeights;
pub use moe_expert_store::MoeExpertStore;
pub use weight_registry::WeightEntry;
pub use weight_registry::WeightHandle;
pub use weight_registry::WeightKind;
pub use weight_registry::WeightRegistry;
pub use weights::BytesWeightLoader;
pub use weights::WeightLoader;

Modules§

aot_cache
AOT cache — persist optimized LIR modules and reload for backend compile.
attn_mask
Attention-mask helpers for bucketed decode (pad-to-upper, slice-back).
backend
Backend trait — abstraction over CPU/GPU/CUDA execution.
compile_cache
Shape-bucketed compile cache.
compiled
Compiled graph — the hot-path execution object.
cost
Cross-backend cost interface.
custom_ops
Custom-op extensibility (plan #25).
device_ext
Engine-layer extensions for rlx_driver::Device (plan #58).
env
Unified RLX_* configuration — readable from code overrides or process env.
expert_pool
MoE expert residency pool (TIDE-style predictive offload).
hwinfo
Hardware introspection (plan #47).
jacfwd
Forward-mode Jacobian materialization.
kernel_trace
Compile-time gated kernel tracing (plan #7).
kv_cache
Per-layer K/V cache for autoregressive decode (Whisper, Qwen, Gemma, …).
logit_verify
Logit / output verification (plan #61).
lora_scheduler
LoRA-aware request scheduling (plan #33).
memory_estimate
Pre-load memory estimation (plan #35).
mock_requests
Mock request payloads for tests (plan #64).
model_pipeline
Three-step model compile pipeline (template → specialize → backend).
moe_expert_store
Per-expert F32 weight slabs for MoE offload (TIDE-style migration source).
nan_check
NaN/inf check epilogue (plan #18).
op
Operation types — every tensor op in the RLX IR.
op_registry
Op registry — re-exported from rlx-ir.
options
Unified compile options.
paged_kv
Paged KV cache + continuous batching (plan #31).
perfetto
Perfetto / chrome-trace JSON tracing for cross-backend timeline capture (PLAN L3). Lives in rlx-ir (alongside the Tick cycle counter it depends on) so every backend can instrument per-thunk without crate-dep gymnastics. Re-exported here so callers see one consistent rlx_runtime::perfetto::TraceSpan.
phase
Phase-aware streaming inference (plan #16).
precision
Precision selection for graph execution.
record_replay
Record/replay middleware (plan #63).
reflect
Model reflection services (Slang compiler/runtime API §5).
registry
Backend registry — a single registration point for all backends.
router
Multi-protocol request router (plan #32).
session
Session — the main entry point for compiling and executing graphs.
spec_decode
Speculative decoding scheduling pattern (plan #34).
stages
Shared HIR → MIR → LIR compile stages for runtime backends.
subgraph
Sub-graph execution helper.
telemetry
Telemetry primitives (plan #65).
trace
Tracing API — build IR graphs by recording operations on traced tensors.
validators
Composable request validators (plan #84).
weight_registry
Named weights registry (plan #24).
weights
Weight-loading abstraction.
worker_pool
Worker pool with isolation primitives (plan #36).
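
The compile_cache and attn_mask entries above describe shape bucketing: pad an input up to the bucket's upper bound, run the graph compiled for that bucket, then slice the result back to the real size. The sketch below illustrates that idea only. pick_bucket, pad_rows_sketch and slice_rows_sketch are hypothetical helpers; the signatures of the real pad_rows and slice_rows are not shown on this page and may differ.

```rust
/// Smallest bucket that can hold `rows`, or None if it exceeds every bucket.
/// `buckets` must be sorted in ascending order.
fn pick_bucket(buckets: &[usize], rows: usize) -> Option<usize> {
    buckets.iter().copied().find(|&b| b >= rows)
}

/// Pad a row-major [rows, cols] buffer with zero rows up to [bucket, cols].
fn pad_rows_sketch(data: &[f32], rows: usize, cols: usize, bucket: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols);
    assert!(bucket >= rows);
    let mut padded = data.to_vec();
    padded.resize(bucket * cols, 0.0);
    padded
}

/// Keep only the first `rows` rows of a row-major buffer with `cols` columns.
fn slice_rows_sketch(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    data[..rows * cols].to_vec()
}

fn main() {
    let buckets = [4, 8, 16];
    let (rows, cols) = (3, 2);
    let input: Vec<f32> = (0..rows * cols).map(|v| v as f32).collect();

    let bucket = pick_bucket(&buckets, rows).expect("input larger than every bucket");
    let padded = pad_rows_sketch(&input, rows, cols, bucket);
    // Stand-in for running the graph compiled for `bucket` rows: an
    // element-wise op (doubling), so padded rows cannot affect real rows.
    let output: Vec<f32> = padded.iter().map(|v| v * 2.0).collect();
    let result = slice_rows_sketch(&output, rows, cols);
    println!("bucket = {bucket}, result = {result:?}");
}
```

The point of the pattern is that one compiled graph per bucket serves every input size up to that bucket, at the cost of some wasted work on the padding rows.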

Macros§

ktrace
Compile-time gated kernel trace. Expands to a no-op call without the kernel-trace feature; the optimizer removes it entirely.
pipeline_schedule
Compile-time pipeline scheduler (plan #11). See pipeline_schedule_impl in this crate’s private pipeline module for the full grammar.

Structs§

BarrierToken
Opaque ticket returned by AsyncCopy::issue. Pass back to AsyncCopy::wait to block until the corresponding copy is done. Tokens are scoped to one engine; don’t pass a token to a different engine than the one that issued it.
BindingManifest
Full I/O + parameter manifest for a compiled graph.
Buffer
A buffer that knows where its bytes live.
BufferHandle
External, persistent buffer reference. Created once, bound at compile, carried across many compiled.run() invocations.
CacheBuster
Cache-busting buffer — sized to evict L1+L2 on Apple Silicon (M-series: 192 KB L1d / core, 16 MB L2 shared per cluster). Borrowed from MAX’s internal_utils/_cache_busting.mojo (#19).
DoubleBuffer
Two-buffer ring. current() is what compute reads this step; next_mut() is where the next async copy should land. Call swap() after waiting on the current copy to advance.
Graph
A computation graph — the core IR data structure.
HirReflection
Introspection of an unspecialized HirModule (loadModule analogue).
IoBindingEntry
One named graph boundary with arena layout after buffer planning.
KernelDispatchConfig
Per-compile overrides on top of KernelDispatchPolicy.
LocalTransport
Single-machine in-process transport. All num_ranks “ranks” share one SymmetricHeap instance, so put / get are just locks + memcpy. Useful for unit tests and for algorithm-correctness checking of collective ops without a real cluster.
ManifestDiff
Compare template vs specialized manifests (dims / arena may differ).
ModelComponent
Full specialization + binding bundle (Slang shader-component analogue).
ModelVariant
Concrete shape bucket for compile-once / specialize-at-runtime workflows.
MoeResidencyStats
MoeTopkCapture
Shared capture buffer — one entry per MoE router TopK in schedule order.
Node
A single node in the computation graph.
NodeId
Stable identifier for a node in the graph. Indices are never reused.
PipelineInspect
Text dump of each compiler pipeline stage.
Rank
Identifier for a participant in a collective. Ranks are 0..num_ranks and stay stable for the lifetime of a transport.
RlxEnv
Bulk builder for code-side RLX_* overrides.
RuntimeOverrides
RAII guard: installs overrides on construction, restores previous values on drop.
Shape
Tensor shape: ordered list of dimensions + element type.
SymmetricBuffer
(rank, offset, len) view into a symmetric heap. The same (offset, len) pair is valid on every rank — that’s what “symmetric” means.
SymmetricHeap
Per-rank symmetric memory: a Vec<u8> per rank, all the same size. Owned by the LocalTransport.
SyncCopy
CPU “async” copy — actually synchronous. issue() does a memcpy immediately and returns a fresh token; wait() is a no-op. Useful as the test fixture and for code paths that don’t actually need overlap.
SyncStream
Default implementation for synchronous backends — work has already happened by the time submit is called.
Tick
Opaque tick reading. Subtract two of these to get a Duration.
WeightBlock
Nested parameter block (Slang PerFrame / material grouping).
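
The DoubleBuffer entry above describes a two-buffer ring with current(), next_mut() and swap(). A minimal sketch of that access pattern follows. DoubleBufferSketch is a hypothetical stand-in written for illustration, not the crate's type, and a synchronous fill takes the place of an asynchronous copy.

```rust
/// Hypothetical two-buffer ring mirroring the description of DoubleBuffer.
struct DoubleBufferSketch<T> {
    bufs: [Vec<T>; 2],
    cur: usize, // index of the buffer that compute reads this step
}

impl<T: Clone + Default> DoubleBufferSketch<T> {
    fn new(len: usize) -> Self {
        Self {
            bufs: [vec![T::default(); len], vec![T::default(); len]],
            cur: 0,
        }
    }

    /// What compute reads this step.
    fn current(&self) -> &[T] {
        &self.bufs[self.cur]
    }

    /// Where the copy for the next step should land.
    fn next_mut(&mut self) -> &mut [T] {
        &mut self.bufs[1 - self.cur]
    }

    /// Advance the ring; call only once the pending copy has finished.
    fn swap(&mut self) {
        self.cur = 1 - self.cur;
    }
}

fn main() {
    let mut ring = DoubleBufferSketch::<f32>::new(2);
    for step in 0..3 {
        // "Issue" the copy for the next step. The fill is synchronous here,
        // so there is nothing to wait on before swap().
        ring.next_mut().fill(step as f32 + 1.0);
        println!("step {step}: compute reads {:?}", ring.current());
        ring.swap();
    }
}
```

With a real asynchronous engine, the wait on the outstanding copy would sit between the compute on current() and the call to swap().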

Enums§

CollectiveError
CompilationMode
When the backend executable is produced relative to the host loop.
DType
Scalar element type. Matches hardware-supported types.
Device
Target device for graph execution.
KernelDispatchPolicy
When to use native backend kernels vs the shared IR common body.
ModelPhase
Coarse execution phase (prefill vs decode vs encoder).
Op
An operation in the RLX IR graph.
OpKind
High-level op categorization for precision policies.
PrecisionPolicy
Declarative precision policy for graph compilation.
ReduceKind
Element-wise reduction operator for collectives.

Traits§

AsyncCopy
Pluggable async-copy engine. Backends (SyncCopy for CPU, future MetalBlitCopy for GPU) implement this.
CommandStream
Per-backend command stream.
DeviceArena
Per-backend arena interface.
SymmetricTransport
One-sided operation surface. put(buf, src) writes src into buf.rank’s memory at buf.offset; get(buf, dst) reads from buf.rank’s memory into dst. Both calls block until completion (a future async impl can return a future).
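
The one-sided put / get semantics described for SymmetricTransport, together with the "locks + memcpy" note on LocalTransport, can be sketched as follows. SymBuf and LocalHeapSketch are hypothetical types written for this illustration; the crate's actual trait methods and signatures are not shown on this page and may differ.

```rust
use std::sync::Mutex;

/// Hypothetical (rank, offset, len) view; the same offset/len is valid on every rank.
#[derive(Clone, Copy)]
struct SymBuf {
    rank: usize,
    offset: usize,
    len: usize,
}

/// Hypothetical in-process transport: one equally sized byte vector per rank.
struct LocalHeapSketch {
    ranks: Vec<Mutex<Vec<u8>>>,
}

impl LocalHeapSketch {
    fn new(num_ranks: usize, bytes_per_rank: usize) -> Self {
        Self {
            ranks: (0..num_ranks)
                .map(|_| Mutex::new(vec![0u8; bytes_per_rank]))
                .collect(),
        }
    }

    /// Write `src` into `buf.rank`'s memory at `buf.offset`; returns when done.
    fn put(&self, buf: SymBuf, src: &[u8]) {
        assert_eq!(src.len(), buf.len);
        let mut heap = self.ranks[buf.rank].lock().unwrap();
        heap[buf.offset..buf.offset + buf.len].copy_from_slice(src);
    }

    /// Read from `buf.rank`'s memory at `buf.offset` into `dst`; returns when done.
    fn get(&self, buf: SymBuf, dst: &mut [u8]) {
        assert_eq!(dst.len(), buf.len);
        let heap = self.ranks[buf.rank].lock().unwrap();
        dst.copy_from_slice(&heap[buf.offset..buf.offset + buf.len]);
    }
}

fn main() {
    let t = LocalHeapSketch::new(2, 16);
    // Write into rank 1's heap, then read the same region back.
    let remote = SymBuf { rank: 1, offset: 4, len: 3 };
    t.put(remote, &[10, 20, 30]);
    let mut back = [0u8; 3];
    t.get(remote, &mut back);
    println!("rank 1: {back:?}");
    // The same (offset, len) on rank 0 is a valid region that was not written.
    let mut local = [0u8; 3];
    t.get(SymBuf { rank: 0, ..remote }, &mut local);
    println!("rank 0: {local:?}");
}
```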

Functions§

all_gather
AllGather: every rank ends up with the concatenation of all per-rank local slices, in rank order.
all_reduce
AllReduce: every rank ends up with op({values from every rank}).
apply_hir_extensions
Apply all registered extensions in order.
inspect_graph
Annotated graph dump (MIR body). Alias for pretty_print.
inspect_hir
Annotated HIR module dump.
inspect_hir_stats
One-line HIR summary (header + op histogram).
inspect_lir
Annotated LIR dump: optimized MIR + buffer plan + schedule.
inspect_mir
Annotated MIR module dump (optimized tensor DAG).
inspect_mir_stats
One-line MIR summary.
inspect_pipeline
Inspect every lowering stage for hir through pipeline.
reduce_scatter
ReduceScatter: equivalent to AllReduce followed by partition — every rank ends up with one chunk_size-element slice of the reduced result. Rank r gets element indices [r*chunk_size, (r+1)*chunk_size).
register_hir_extension
Register a named extension (call from init or model crate startup).
registered_hir_extensions
Registered extension names in registration order.
time_ns
Time f, returning (result, elapsed_ns). Inlined so the surrounding loop can keep the closure body in registers.
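
To make the collective semantics above concrete, the sketch below models each rank's local data as one Vec and computes what the ranks would end up with. It is an illustration only: the function names and signatures are hypothetical, sum is assumed as the reduction, and no transport is involved, unlike the crate's all_reduce, reduce_scatter and all_gather.

```rust
/// Element-wise sum across ranks; every rank would end up with this vector.
fn all_reduce_sum(per_rank: &[Vec<f32>]) -> Vec<f32> {
    assert!(!per_rank.is_empty());
    let len = per_rank[0].len();
    assert!(per_rank.iter().all(|v| v.len() == len));
    (0..len)
        .map(|i| per_rank.iter().map(|v| v[i]).sum::<f32>())
        .collect()
}

/// Entry r of the result is what rank r keeps: elements
/// [r*chunk_size, (r+1)*chunk_size) of the reduced vector.
fn reduce_scatter_sum(per_rank: &[Vec<f32>], chunk_size: usize) -> Vec<Vec<f32>> {
    assert!(chunk_size > 0);
    let reduced = all_reduce_sum(per_rank);
    assert_eq!(reduced.len(), per_rank.len() * chunk_size);
    reduced.chunks(chunk_size).map(|c| c.to_vec()).collect()
}

/// Concatenation of all local slices in rank order; every rank would get it.
fn all_gather_concat(per_rank: &[Vec<f32>]) -> Vec<f32> {
    per_rank.iter().flatten().copied().collect()
}

fn main() {
    // Two ranks, each holding a 4-element vector.
    let per_rank = vec![
        vec![1.0, 2.0, 3.0, 4.0],
        vec![10.0, 20.0, 30.0, 40.0],
    ];
    println!("all_reduce:     {:?}", all_reduce_sum(&per_rank));
    println!("reduce_scatter: {:?}", reduce_scatter_sum(&per_rank, 2));
    println!("all_gather:     {:?}", all_gather_concat(&per_rank));
}
```

With two ranks and chunk_size = 2, rank 0 keeps elements 0..2 of the reduced vector and rank 1 keeps elements 2..4, matching the index range given for reduce_scatter.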

Type Aliases§

HirExtensionFn
Transform applied after model flow build, before MIR lower.

Attribute Macros§

rlx_model
AOT compilation macro for RLX models.