RLX CUDA backend — NVIDIA GPUs via the pure-Rust cudarc crate.
Same overall shape as rlx-wgpu (device singleton, arena buffer, per-op
kernels, command-stream-per-forward-pass) but targeting CUDA via
cudarc::driver for memory + dispatch and cudarc::cublas for matmul.
Element-wise / reduction / shape kernels are CUDA C++ source strings
compiled at init time via NVRTC — same pattern as rlx-wgpu’s WGSL
kernels.
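As a rough illustration of the "source string, compiled once, then cached" pattern described above, here is a minimal sketch using only the standard library. The kernel text `ADD_F32_SRC`, the `CompiledKernel` handle, and the `KernelCache` type are all hypothetical stand-ins, not this crate's API. The compile step is simulated so the sketch runs without a GPU; in a real backend that step would go through NVRTC.

```rust
use std::collections::HashMap;

// Hypothetical element-wise kernel, stored as a CUDA C++ source string.
// A real backend would hand this text to NVRTC at init time.
const ADD_F32_SRC: &str = r#"
extern "C" __global__ void add_f32(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = a[i] + b[i]; }
}
"#;

// Stand-in for a compiled module handle (assumption: not a cudarc type).
#[derive(Debug, Clone, PartialEq)]
struct CompiledKernel {
    name: String,
    source_len: usize,
}

// Cache keyed by kernel name, so each source string is compiled at most once.
struct KernelCache {
    compiled: HashMap<String, CompiledKernel>,
    compile_count: usize,
}

impl KernelCache {
    fn new() -> Self {
        KernelCache { compiled: HashMap::new(), compile_count: 0 }
    }

    // Return the cached kernel, running the (simulated) compile step on a miss.
    fn get_or_compile(&mut self, name: &str, src: &str) -> &CompiledKernel {
        if !self.compiled.contains_key(name) {
            self.compile_count += 1;
            let kernel = CompiledKernel { name: name.to_string(), source_len: src.len() };
            self.compiled.insert(name.to_string(), kernel);
        }
        &self.compiled[name]
    }
}
```

Keying the cache by kernel name keeps repeated forward passes from paying the compile cost again; whether the actual crate keys by name, by source hash, or by something else is not stated here.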
The crate uses cudarc’s fallback-dynamic-loading feature so it
compiles on Mac (and any other host without a CUDA SDK). is_available()
returns false when libcuda can’t be dlopen()’d — every other entry
point checks this and degrades cleanly.
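A caller-side view of this behaviour might look like the sketch below. It is an assumption-laden illustration, not the crate's code: `Backend`, `select_backend`, and `run_on_cuda` are hypothetical names, and the availability flag is passed in as a plain `bool` (in real code it would come from this crate's `is_available()`) so the example runs on hosts without a GPU.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cuda,
    Cpu,
}

// Pick a backend from the availability flag.
// Assumption: the flag is the result of the crate's `is_available()`.
fn select_backend(cuda_available: bool) -> Backend {
    if cuda_available { Backend::Cuda } else { Backend::Cpu }
}

// Hypothetical entry point that "degrades cleanly": it returns an error
// instead of panicking when no CUDA driver can be loaded.
fn run_on_cuda(cuda_available: bool, input: &[f32]) -> Result<Vec<f32>, String> {
    if !cuda_available {
        return Err("CUDA driver not reachable".to_string());
    }
    // Placeholder for real kernel dispatch: echo the input back.
    Ok(input.to_vec())
}
```

Checking availability once up front and routing to a CPU path keeps the rest of the program free of GPU-specific error handling.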
Layout:
- device: CudaContext singleton (per-process), driver init
- arena: device buffer + per-node offsets
- kernels: CUDA C++ source strings + NVRTC compile + cuModule cache
- backend: CudaExecutable (IR lowering, schedule, run)
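To make "device buffer + per-node offsets" concrete, the following sketch plans byte offsets for a list of node output sizes inside one contiguous buffer. It assumes simple bump allocation with a power-of-two alignment and no buffer reuse; the function names and the alignment value used in the usage note are illustrative assumptions, not taken from the crate.

```rust
// Round `n` up to the next multiple of `align` (which must be a power of two).
fn align_up(n: usize, align: usize) -> usize {
    (n + align - 1) & !(align - 1)
}

// Assign each node a byte offset within one contiguous buffer.
// Returns (per-node offsets, total bytes to allocate).
fn plan_offsets(node_sizes: &[usize], align: usize) -> (Vec<usize>, usize) {
    assert!(align.is_power_of_two());
    let mut offsets = Vec::with_capacity(node_sizes.len());
    let mut cursor = 0usize;
    for &size in node_sizes {
        // Each node starts at the current aligned cursor position.
        offsets.push(cursor);
        // Advance past this node and re-align for the next one.
        cursor = align_up(cursor + size, align);
    }
    (offsets, cursor)
}
```

For example, `plan_offsets(&[100, 256, 4], 256)` yields offsets `[0, 256, 512]` and a total of 768 bytes. With offsets fixed ahead of time, a single device allocation can serve a whole forward pass.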
Re-exports
pub use backend::CompileMode;
pub use backend::CudaExecutable;
pub use backend::ExecMode;
Modules
- arena
  CUDA device-memory arena.
- backend
  CudaExecutable: lowers an rlx-ir Graph into a sequence of CUDA kernel launches against a pre-allocated device buffer.
- calibrate
  On-disk CUDA calibration for cost-model ranking.
- device
  Per-process CUDA context singleton.
- fft_dispatch
- fft_host
- gdn_host
  Host-side Op::GatedDeltaNet for CUDA device arenas (D2H → CPU → H2D).
- gguf_gpu
  GPU GGUF K-quant dequant + cuBLAS matmul for Op::DequantMatMul and grouped MoE Op::DequantGroupedMatMul.
- gguf_host
  Host-side GGUF K-quant Op::DequantMatMul for CUDA device arenas.
- host_staging
- im2col_host
- kernels
  CUDA C++ kernel sources + NVRTC compilation cache.
- llada2_gate_host
  Host-side Op::Custom("llada2.group_limited_gate") for CUDA arenas.
- log_mel_backward_host
- log_mel_host
- sam_ops_host
  Legacy host-side SAM conv/norm ops (D2H → CPU → H2D). Superseded by native kernels in kernels/layer_norm2d.cu and kernels/conv_transpose2d.cu; kept for manual debugging.
- splat_host
  Host-side Op::GaussianSplatRender / backward for CUDA arenas (D2H → CPU → H2D).
- training_bwd_host
  Host-side training backward ops for CUDA device arenas (D2H → CPU → H2D).
- umap_knn_host
  Host-side Op::Custom("umap.knn") for CUDA arenas.
- unfuse
  IR-level “unfusion” pass for the CUDA backend.
- welch_peaks_dispatch
- welch_peaks_host
Functions
- is_available
  True if a CUDA driver is reachable. With dynamic loading, this returns false on hosts without libcuda (Mac, headless boxes, CI runners without GPUs); the crate still compiles, but no kernel dispatch is possible.