1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
// RLX — versatile ML compiler + runtime.
// Copyright (C) 2026 Eugene Hauptmann, Nataliya Kosmyna.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, version 3.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.
//! RLX CUDA backend — NVIDIA GPUs via the pure-Rust `cudarc` crate.
//
// FFI shim helpers (cudnn_conv*, cublaslt_matmul_fused, etc.) inherently
// take many arguments — they mirror the underlying C API surface.
// Suppressing the lint at crate scope avoids drowning out signal warnings.
//!
//! Same overall shape as rlx-wgpu (device singleton, arena buffer, per-op
//! kernels, command-stream-per-forward-pass) but targeting CUDA via
//! `cudarc::driver` for memory + dispatch and `cudarc::cublas` for matmul.
//! Element-wise / reduction / shape kernels are CUDA C++ source strings
//! compiled at init time via NVRTC — same pattern as rlx-wgpu's WGSL
//! kernels.
//!
//! The crate uses `cudarc`'s `fallback-dynamic-loading` feature so it
//! compiles on Mac (and any other host without a CUDA SDK). `is_available()`
//! returns false when libcuda can't be dlopen()'d — every other entry
//! point checks this and degrades cleanly.
//!
//! Layout:
//! - `device` — `CudaContext` singleton (per-process), driver init
//! - `arena` — device buffer + per-node offsets
//! - `kernels` — CUDA C++ source strings + NVRTC compile + cuModule cache
//! - `backend` — `CudaExecutable`: IR lowering, schedule, run
pub use ;
/// HIP-CPU validation path — runs `.cu` kernels on CPU threads so we
/// can numerically validate them on Mac/Docker without renting a CUDA
/// box. Strictly a dev feature; never enabled in production.
/// True if a CUDA driver is reachable. With `dynamic-loading`, this
/// returns false on hosts without `libcuda` (Mac, headless boxes, CI
/// runners without GPUs) — the crate still compiled, but no kernel
/// dispatch is possible.