Crate rlx_cuda

RLX CUDA backend — NVIDIA GPUs via the pure-Rust cudarc crate.

Same overall shape as rlx-wgpu (device singleton, arena buffer, per-op kernels, command-stream-per-forward-pass) but targeting CUDA via cudarc::driver for memory + dispatch and cudarc::cublas for matmul. Element-wise / reduction / shape kernels are CUDA C++ source strings compiled at init time via NVRTC — same pattern as rlx-wgpu’s WGSL kernels.
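The "source string compiled at init time" pattern can be sketched as a compile-once cache. This is only an illustration: `ADD_F32_SRC`, `KernelCache`, and `get_or_compile` are hypothetical names, the `compile` closure stands in for a real NVRTC call, and the crate's actual kernels module may be organised differently.

```rust
use std::collections::HashMap;

/// CUDA C++ source for an element-wise add, held as a Rust string constant.
/// (Hypothetical kernel, not taken from the crate.)
pub const ADD_F32_SRC: &str = r#"
extern "C" __global__ void add_f32(const float* a, const float* b,
                                   float* out, unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = a[i] + b[i]; }
}
"#;

/// Compile-once cache keyed by kernel name.
/// `M` stands in for whatever handle a loaded module is represented by.
pub struct KernelCache<M> {
    modules: HashMap<&'static str, M>,
    /// Number of times the compile step actually ran.
    pub compile_count: usize,
}

impl<M> KernelCache<M> {
    pub fn new() -> Self {
        KernelCache { modules: HashMap::new(), compile_count: 0 }
    }

    /// Return the cached module for `name`, compiling `src` only on first use.
    /// `compile` is an assumption: in a real backend it would invoke NVRTC
    /// and load the resulting module.
    pub fn get_or_compile<E>(
        &mut self,
        name: &'static str,
        src: &str,
        compile: impl FnOnce(&str) -> Result<M, E>,
    ) -> Result<&M, E> {
        if !self.modules.contains_key(name) {
            let module = compile(src)?;
            self.compile_count += 1;
            self.modules.insert(name, module);
        }
        Ok(self.modules.get(name).expect("module was just inserted"))
    }
}
```

Keeping the compile step behind a closure lets the cache logic be exercised on a host with no CUDA driver at all.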

The crate uses cudarc’s fallback-dynamic-loading feature, so it compiles on macOS (and on any other host without a CUDA SDK). is_available() returns false when libcuda can’t be dlopen()’d; every other entry point checks this and degrades cleanly.
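A caller can follow the same "check, then degrade" approach when choosing a backend. The sketch below is an assumption about how calling code might look, not part of this crate: `Backend` and `select_backend` are hypothetical, and the boolean argument is expected to come from `rlx_cuda::is_available()`.

```rust
/// Backends a caller might choose between (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cuda,
    Cpu,
}

/// Pick a backend from the result of the availability probe.
/// In a real program `cuda_available` would be `rlx_cuda::is_available()`;
/// it is a plain argument here so the sketch runs without the crate.
pub fn select_backend(cuda_available: bool) -> Backend {
    if cuda_available {
        Backend::Cuda
    } else {
        // No reachable libcuda: fall back instead of failing.
        Backend::Cpu
    }
}
```

Probing once at startup and passing the result down keeps the fallback decision in one place.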

Layout:

  • device: CudaContext singleton (per-process), driver init
  • arena: device buffer + per-node offsets
  • kernels: CUDA C++ source strings + NVRTC compile + cuModule cache
  • backend: CudaExecutable (IR lowering, schedule, run)
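The "device buffer + per-node offsets" idea can be illustrated with a simple bump planner that assigns each node a byte offset into one contiguous allocation. This is a minimal sketch under stated assumptions: `align_up` and `plan_offsets` are hypothetical helpers, the alignment value is chosen by the caller, and the crate's real arena may reuse memory or lay buffers out differently.

```rust
/// Round `n` up to the next multiple of `align` (which must be a power of two).
fn align_up(n: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

/// Assign each node a byte offset into one contiguous buffer.
/// Returns `(offsets, total_bytes)`; no buffer reuse is attempted.
pub fn plan_offsets(node_sizes: &[usize], align: usize) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(node_sizes.len());
    let mut cursor = 0usize;
    for &size in node_sizes {
        // The cursor is always aligned here, so the node starts aligned.
        offsets.push(cursor);
        cursor = align_up(cursor + size, align);
    }
    (offsets, cursor)
}
```

With the plan computed up front, a single device allocation of `total_bytes` suffices and each kernel launch only needs base pointer plus offset.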

Re-exports

pub use backend::CompileMode;
pub use backend::CudaExecutable;
pub use backend::ExecMode;

Modules

arena
CUDA device-memory arena.
backend
CudaExecutable — lowers an rlx-ir Graph into a sequence of CUDA kernel launches against a pre-allocated device buffer.
calibrate
On-disk CUDA calibration for cost-model ranking.
device
Per-process CUDA context singleton.
fft_dispatch
fft_host
gdn_host
Host-side Op::GatedDeltaNet for CUDA device arenas (D2H → CPU → H2D).
gguf_gpu
GPU GGUF K-quant dequant + cuBLAS matmul for Op::DequantMatMul and grouped MoE Op::DequantGroupedMatMul.
gguf_host
Host-side GGUF K-quant Op::DequantMatMul for CUDA device arenas.
host_staging
im2col_host
kernels
CUDA C++ kernel sources + NVRTC compilation cache.
llada2_gate_host
Host-side Op::Custom("llada2.group_limited_gate") for CUDA arenas.
log_mel_backward_host
log_mel_host
sam_ops_host
Legacy host-side SAM conv/norm ops (D2H → CPU → H2D). Superseded by native kernels in kernels/layer_norm2d.cu and kernels/conv_transpose2d.cu; kept for manual debugging.
splat_host
Host-side Op::GaussianSplatRender / backward for CUDA arenas (D2H → CPU → H2D).
training_bwd_host
Host-side training backward ops for CUDA device arenas (D2H → CPU → H2D).
umap_knn_host
Host-side Op::Custom("umap.knn") for CUDA arenas.
unfuse
IR-level “unfusion” pass for the CUDA backend.
welch_peaks_dispatch
welch_peaks_host
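
Several of the `*_host` modules above describe a D2H → CPU → H2D round trip. The shape of that fallback can be sketched as follows. Everything here is hypothetical: `FakeArena` is a plain host `Vec<f32>` standing in for a device arena so the example runs without a GPU, and `run_host_op` is not a function from this crate.

```rust
/// Stand-in for a device arena, backed by host memory for illustration only.
pub struct FakeArena {
    pub data: Vec<f32>,
}

impl FakeArena {
    /// "D2H": copy `len` elements starting at `offset` out of the arena.
    pub fn download(&self, offset: usize, len: usize) -> Vec<f32> {
        self.data[offset..offset + len].to_vec()
    }

    /// "H2D": copy host data back into the arena at `offset`.
    pub fn upload(&mut self, offset: usize, host: &[f32]) {
        self.data[offset..offset + host.len()].copy_from_slice(host);
    }
}

/// Run an op on the CPU against arena-resident data:
/// download the input, compute on the host, upload the result.
pub fn run_host_op(
    arena: &mut FakeArena,
    in_offset: usize,
    len: usize,
    out_offset: usize,
    op: impl Fn(&[f32]) -> Vec<f32>,
) {
    let input = arena.download(in_offset, len);
    let output = op(&input);
    arena.upload(out_offset, &output);
}
```

The round trip costs two transfers per op, which is presumably why the page marks some host modules as superseded by native kernels.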

Functions

is_available
True if a CUDA driver is reachable. With dynamic loading, this returns false on hosts without libcuda (Macs, headless boxes, CI runners without GPUs): the crate still compiles, but no kernel dispatch is possible.