Crate rlx_cuda

RLX CUDA backend — NVIDIA GPUs via the pure-Rust cudarc crate.

Same overall shape as rlx-wgpu (device singleton, arena buffer, per-op kernels, command-stream-per-forward-pass) but targeting CUDA via cudarc::driver for memory + dispatch and cudarc::cublas for matmul. Element-wise / reduction / shape kernels are CUDA C++ source strings compiled at init time via NVRTC — same pattern as rlx-wgpu’s WGSL kernels.

The crate uses cudarc’s fallback-dynamic-loading feature, so it compiles on macOS and on any other host without a CUDA SDK. is_available() returns false when libcuda cannot be loaded with dlopen(); every other entry point checks this and degrades cleanly.

Layout:

device — CudaContext singleton (per-process), driver init
arena — device buffer + per-node offsets
kernels — CUDA C++ source strings + NVRTC compile + cuModule cache
backend — CudaExecutable: IR lowering, schedule, run
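
To make the "device buffer + per-node offsets" idea concrete, here is a minimal sketch of how one offset per graph node could be planned with a bump allocator. The names ArenaPlan, align_up and plan_arena are hypothetical and are not taken from the crate; the real arena module may use a different alignment and may reuse memory between nodes.

```rust
/// Hypothetical arena plan: one byte offset per graph node plus the total size.
/// Illustrative only; not the crate's actual type.
#[derive(Debug, PartialEq)]
pub struct ArenaPlan {
    pub offsets: Vec<usize>,
    pub total_bytes: usize,
}

/// Round `n` up to the next multiple of `align` (`align` must be a power of two).
fn align_up(n: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

/// Assign each node an aligned byte offset into one contiguous buffer.
/// This is a plain bump allocator: no buffer is ever reused.
pub fn plan_arena(node_sizes: &[usize], align: usize) -> ArenaPlan {
    let mut offsets = Vec::with_capacity(node_sizes.len());
    let mut cursor = 0usize;
    for &size in node_sizes {
        cursor = align_up(cursor, align);
        offsets.push(cursor);
        cursor += size;
    }
    ArenaPlan {
        offsets,
        total_bytes: align_up(cursor, align),
    }
}
```

With a plan like this, a single device allocation of total_bytes would back every node, and each kernel launch would address its operands as base pointer plus offset.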

Re-exports

pub use backend::CompileMode;
pub use backend::CudaExecutable;
pub use backend::ExecMode;

Modules

arena: CUDA device-memory arena.
backend: CudaExecutable — lowers an rlx-ir Graph into a sequence of CUDA kernel launches against a pre-allocated device buffer.
calibrate: On-disk CUDA calibration for cost-model ranking.
device: Per-process CUDA context singleton.
fft_dispatch
fft_host
gdn_host: Host-side Op::GatedDeltaNet for CUDA device arenas (D2H → CPU → H2D).
gguf_gpu: GPU GGUF K-quant dequant + cuBLAS matmul for Op::DequantMatMul and grouped MoE Op::DequantGroupedMatMul.
gguf_host: Host-side GGUF K-quant Op::DequantMatMul for CUDA device arenas.
host_staging
im2col_host
kernels: CUDA C++ kernel sources + NVRTC compilation cache.
llada2_gate_host: Host-side Op::Custom("llada2.group_limited_gate") for CUDA arenas.
log_mel_backward_host
log_mel_host
sam_ops_host: Legacy host-side SAM conv/norm ops (D2H → CPU → H2D). Superseded by native kernels in kernels/layer_norm2d.cu and kernels/conv_transpose2d.cu; kept for manual debugging.
splat_host: Host-side Op::GaussianSplatRender / backward for CUDA arenas (D2H → CPU → H2D).
training_bwd_host: Host-side training backward ops for CUDA device arenas (D2H → CPU → H2D).
umap_knn_host: Host-side Op::Custom("umap.knn") for CUDA arenas.
unfuse: IR-level “unfusion” pass for the CUDA backend.
welch_peaks_dispatch
welch_peaks_host
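
Several of the *_host modules above describe the same round trip: D2H → CPU → H2D. The sketch below illustrates that pattern under loud assumptions: MockArena is a hypothetical stand-in backed by a Vec so the example runs without a GPU, and run_host_op is an invented helper. The real modules would copy through cudarc device memory instead.

```rust
/// Stand-in for a device arena. In the real backend this would be GPU memory;
/// a Vec is used here (an assumption) so the sketch runs anywhere.
pub struct MockArena {
    data: Vec<f32>,
}

impl MockArena {
    pub fn new(len: usize) -> Self {
        Self { data: vec![0.0; len] }
    }

    /// "D2H": copy `len` elements starting at `offset` out of the arena.
    pub fn download(&self, offset: usize, len: usize) -> Vec<f32> {
        self.data[offset..offset + len].to_vec()
    }

    /// "H2D": copy host values back into the arena at `offset`.
    pub fn upload(&mut self, offset: usize, host: &[f32]) {
        self.data[offset..offset + host.len()].copy_from_slice(host);
    }
}

/// Run a CPU implementation of an op against arena-resident data.
pub fn run_host_op<F>(
    arena: &mut MockArena,
    in_off: usize,
    len: usize,
    out_off: usize,
    cpu_op: F,
) where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let host_in = arena.download(in_off, len); // D2H
    let host_out = cpu_op(&host_in); // CPU
    arena.upload(out_off, &host_out); // H2D
}
```

The cost of this pattern is two transfers per op, which is presumably why the sam_ops_host entry notes that its host path was superseded by native kernels.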

Functions

is_available: True if a CUDA driver is reachable. With dynamic loading, this returns false on hosts without libcuda (Macs, headless boxes, CI runners without GPUs): the crate still compiles, but no kernel dispatch is possible.
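
A caller would typically probe is_available once and pick a fallback when it returns false. The sketch below shows that gating, assuming is_available takes no arguments and returns bool (the page implies this but does not show the signature). Backend and select_backend are hypothetical names for illustration; the probe result is passed in as a plain bool so the example does not depend on the crate.

```rust
/// Hypothetical backend choice; not part of rlx_cuda.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Backend {
    Cuda,
    Cpu,
}

/// Pick a backend from the result of an availability probe.
/// In real code the argument would come from `rlx_cuda::is_available()`
/// (signature assumed: no arguments, returns `bool`).
pub fn select_backend(cuda_available: bool) -> Backend {
    if cuda_available {
        Backend::Cuda
    } else {
        // No libcuda on this host: fall back rather than fail.
        Backend::Cpu
    }
}
```

In an application this might be called as select_backend(rlx_cuda::is_available()) at startup, so that the same binary runs on GPU hosts and on machines without a driver.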
