rlx-wgpu
Cross-platform GPU backend via the wgpu crate. Single backend serves Metal (macOS), Vulkan (Linux), DirectX 12 (Windows), and WebGPU (browsers). WGSL kernels, pure Rust deps — no FFI, no submodules.
What's here
- WGSL kernels: fp32 matmul (8×8 tile), cooperative-matrix matmul (32×32 tile, simdgroup_matrix/KHR_cooperative_matrix), f16-storage matmul.
- device.rs: wgpu instance/adapter/device singleton. Sync wrapper via pollster::block_on so the rest of the backend matches the rlx-cpu / rlx-metal / rlx-mlx synchronous shape.
- buffer.rs / Arena: single contiguous storage buffer; per-node offsets from rlx-opt::memory::plan_memory_aligned. f32 host I/O via queue.write_buffer / staging-buffer-mapped readback.
- kernels/matmul.wgsl: fp32 matmul, one workgroup per 8×8 output tile. Functional, not optimized.
- kernels/mod.rs: OnceLock-cached pipeline + bind-group layout. First dispatch pays the WGSL → SPIR-V/MSL/HLSL translation cost (~ms); subsequent dispatches reuse the compiled pipeline.
- backend.rs: WgpuExecutable. Anything not in the supported op set panics at compile time with a clear "fall back to CPU/Metal/MLX" diagnostic.
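The Arena layout above (one contiguous buffer, aligned per-node offsets) can be illustrated with a small sketch. This is not the rlx-opt planner: `plan_offsets` and `align_up` are hypothetical names, the 256-byte alignment in the usage note is only an illustrative value, and the sketch does no lifetime-based reuse of memory.

```rust
/// Round `n` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(n: usize, align: usize) -> usize {
    n.div_ceil(align) * align
}

/// Hypothetical stand-in for a memory planner: assign each node a byte
/// offset into one contiguous buffer, aligning every offset to `align`.
/// Returns the per-node offsets and the total (aligned) buffer size.
fn plan_offsets(sizes: &[usize], align: usize) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(sizes.len());
    let mut cursor = 0;
    for &size in sizes {
        // Each node starts at the next aligned position after the previous one.
        cursor = align_up(cursor, align);
        offsets.push(cursor);
        cursor += size;
    }
    (offsets, align_up(cursor, align))
}
```

For example, `plan_offsets(&[100, 300, 4], 256)` places the three nodes at byte offsets 0, 256, and 768 in a 1024-byte buffer.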
Op coverage
Today: MatMul (2D), Op::Input, Op::Param, Op::Constant.
Anything else fails at compile time with a clear "fall back to
CPU/Metal/MLX" diagnostic.
The roadmap is to land ops in BERT-shaped order: element-wise binary,
layer norm, softmax, attention, gather, transpose. Adding an op means:
WGSL source, a MatmulPipeline-style cache entry, a Step variant, a
dispatch in run. PRs welcome.
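As a rough illustration of the "Step variant plus dispatch in run" part of that checklist, here is a CPU-only sketch. The `Step` fields, the `Add` variant, and the element-indexed arena are all assumptions made for the example; the real types in backend.rs may look different, and a real arm would bind a cached pipeline and dispatch a WGSL kernel instead of looping on the host.

```rust
/// Hypothetical step list. Offsets are element indices into one shared arena.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step {
    MatMul { a: usize, b: usize, out: usize, m: usize, k: usize, n: usize },
    /// The variant a new element-wise add op would introduce.
    Add { lhs: usize, rhs: usize, out: usize, len: usize },
}

/// Execute the steps in order. Each match arm stands in for one GPU dispatch.
fn run(steps: &[Step], arena: &mut [f32]) {
    for step in steps {
        match *step {
            Step::MatMul { a, b, out, m, k, n } => {
                for row in 0..m {
                    for col in 0..n {
                        let mut acc = 0.0f32;
                        for i in 0..k {
                            acc += arena[a + row * k + i] * arena[b + i * n + col];
                        }
                        arena[out + row * n + col] = acc;
                    }
                }
            }
            // The new op only needs its own arm here.
            Step::Add { lhs, rhs, out, len } => {
                for i in 0..len {
                    arena[out + i] = arena[lhs + i] + arena[rhs + i];
                }
            }
        }
    }
}
```

Keeping every op as one enum variant means an unsupported op has no variant to lower to, which is one natural place to raise the fallback diagnostic.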
Install
[]
= "0.1"
Or via rlx's gpu feature.
Build / test
Through rlx-runtime:
Status
Functional, but less battle-tested than rlx-metal / rlx-mlx on Apple
Silicon. Coop-matrix paths are under active validation. The matmul kernel
is correctness-first: an order of magnitude slower than what's possible.
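To make the shape of the correctness-first kernel concrete, the following is a CPU emulation of the scheme this README describes: one workgroup per 8×8 output tile, with each thread computing one output element by looping over K. It is a sketch, not a port of matmul.wgsl. The function names, the x-to-columns / y-to-rows mapping, and the bounds guard on edge tiles are assumptions.

```rust
const TILE: usize = 8;

/// Workgroups needed along (x, y) for an m×n output, one per 8×8 tile.
fn workgroup_counts(m: usize, n: usize) -> (usize, usize) {
    (n.div_ceil(TILE), m.div_ceil(TILE))
}

/// Row-major C = A × B, iterating the way the dispatch grid would:
/// outer loops are workgroups, inner loops are threads within a workgroup.
fn matmul_tiled_reference(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    let (groups_x, groups_y) = workgroup_counts(m, n);
    for gy in 0..groups_y {
        for gx in 0..groups_x {
            for ly in 0..TILE {
                for lx in 0..TILE {
                    let row = gy * TILE + ly;
                    let col = gx * TILE + lx;
                    // Threads in edge tiles that fall outside the output do nothing.
                    if row >= m || col >= n {
                        continue;
                    }
                    // Plain K loop: no register blocking, no shared-memory tiling.
                    let mut acc = 0.0f32;
                    for i in 0..k {
                        acc += a[row * k + i] * b[i * n + col];
                    }
                    c[row * n + col] = acc;
                }
            }
        }
    }
    c
}
```

Every thread re-reads a full row of A and a full column of B from memory, which is why this shape leaves so much performance on the table compared with a tiled kernel that stages operands in workgroup memory.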
Gotchas
- wgpu is async; we wrap with pollster::block_on for sync semantics. Future work: an async commit_no_wait-style API to amortize submit latency, mirroring rlx-metal.
- The matmul kernel is correctness-first. It loops over K per thread with no register blocking or shared-memory tiling, an order of magnitude slower than what's possible. Optimization comes after the op set is broad enough to run a real model.
- Shader compilation is lazy + cached via OnceLock. First dispatch pays the WGSL → SPIR-V/MSL/HLSL translation cost (~ms); subsequent dispatches reuse the compiled pipeline.
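The lazy, cached compilation pattern can be sketched with std::sync::OnceLock alone. `Pipeline`, `compile`, and `matmul_pipeline` are hypothetical stand-ins; the real cache would hold wgpu pipeline and bind-group-layout objects, and the counter exists only to make the "compile once" behaviour observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for a compiled pipeline (hypothetical; no wgpu objects here).
struct Pipeline {
    label: &'static str,
}

/// Counts how many times the expensive "compile" step actually ran.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);
static MATMUL_PIPELINE: OnceLock<Pipeline> = OnceLock::new();

/// Placeholder for the costly shader translation step.
fn compile(label: &'static str) -> Pipeline {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    Pipeline { label }
}

/// The first caller pays for compilation; later callers get the cached value.
fn matmul_pipeline() -> &'static Pipeline {
    MATMUL_PIPELINE.get_or_init(|| compile("matmul"))
}
```

Calling `matmul_pipeline()` repeatedly returns the same reference and runs `compile` only once, even if several threads race on the first call.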
License
GPL-3.0-only.