# scry-gpu
Compute-only GPU dispatch for Rust. No render passes, no swapchains, no framebuffers — upload data, run a WGSL shader, read results back.
```rust
use scry_gpu::Device;

let gpu = Device::auto()?;

let input = gpu.upload(&[1.0f32, 2.0, 3.0, 4.0])?;
let output = gpu.upload(&[0.0f32; 4])?; // zero-filled buffer for the results

let shader = "
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if i < arrayLength(&input) {
        output[i] = input[i] * 2.0;
    }
}";

// Workgroup count is derived automatically from the invocation count
// and the shader's @workgroup_size.
gpu.dispatch(shader, &[&input, &output], 4)?;

let result: Vec<f32> = output.download()?;
assert_eq!(result, vec![2.0, 4.0, 6.0, 8.0]);
```
Status: 0.x, pre-1.0 — the API may break between minor versions. The Vulkan backend is the most exercised path. CUDA is supported via cudarc + NVRTC for deployments that prefer it; Metal is on the roadmap.
## Why this exists
wgpu is a great GPU library, but its dependency tree is built around graphics: window surfaces, swapchains, render pipelines. For pure compute workloads (machine learning, data-parallel math, simulation), most of that is dead weight in compile time and binary size. scry-gpu skips the graphics surface entirely and goes directly to ash (Vulkan) and naga (WGSL → SPIR-V), with optional cudarc for CUDA.
The result is a smaller dependency footprint, faster builds, and an API shaped around dispatching kernels rather than drawing pixels.
## Design

- **Compute only.** No graphics state. The public type surface is `Device`, `Buffer<T>`, `Kernel`, and `Batch`, plus `dispatch`.
- **Auto-dispatch.** Workgroup counts are computed from the invocation count and the shader's `@workgroup_size`; no manual `ceil(n / 64)` at every call site (see the sketch after this list).
- **Typed buffers.** `Buffer<f32>::upload(&[f32])` and `Buffer<f32>::download() -> Vec<f32>`. Staging, alignment, and host-device sync are internal.
- **Pipeline cache.** Compiled pipelines are cached, keyed by SPIR-V module hash, at `~/.cache/scry-gpu/<vendor>-<device>.bin`, so kernel compilation pays once across runs.
- **Thread-safe.** `Device` is `Send + Sync`; submission is internally serialized.
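As a rough illustration of what auto-dispatch computes — a standalone sketch, not the crate's internals — the 1-D workgroup count is a ceiling division of the invocation count by the shader's `@workgroup_size`:

```rust
/// Ceiling division of the invocation count by the workgroup size.
/// Illustrative only: scry-gpu performs this internally per dispatch.
fn workgroup_count(invocations: u32, workgroup_size: u32) -> u32 {
    invocations.div_ceil(workgroup_size)
}

// 1,000,000 invocations at @workgroup_size(64) -> 15,625 workgroups.
assert_eq!(workgroup_count(1_000_000, 64), 15_625);
```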
## Backends

| Backend | Feature | Status |
|---|---|---|
| Vulkan | `vulkan` (default) | Primary. Tested on AMD, NVIDIA, Intel iGPU. |
| CUDA | `cuda` | Optional. cudarc + NVRTC + cuBLAS for matmul. |
| Metal | — | Planned. |
```toml
[dependencies]
scry-gpu = "0.1"                                     # vulkan only (default)
scry-gpu = { version = "0.1", features = ["cuda"] }  # vulkan + cuda
```
`Device::auto()` picks the fastest available backend. Use `Device::with_backend(BackendKind::Vulkan)` to pin one explicitly.
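A minimal pin-to-Vulkan sketch, assuming `Device` and `BackendKind` are both exported at the crate root:

```rust
use scry_gpu::{BackendKind, Device};

// Require Vulkan explicitly instead of probing: this errors out if no
// Vulkan driver is available rather than falling back to another backend.
let gpu = Device::with_backend(BackendKind::Vulkan)?;
```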
## Built-in shaders
`scry_gpu::shaders` exposes WGSL strings for the kernels that ML workloads hit most often:
- Tiled matrix multiply — 16×16 shared-memory tiles, plus a coarse 64×64 variant for large matrices
- Pairwise squared Euclidean distance — the `n_q × n_t` distance matrix from row-major query and reference matrices
When the `cuda` feature is on, the same kernels are exposed as CUDA C strings for NVRTC compilation. For matmul on CUDA, prefer `Device::cublas_matmul` over the custom kernels: cuBLAS reaches >80% of peak out of the box.
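A sketch of wiring one of these strings into a dispatch. The constant name `PAIRWISE_SQ_EUCLIDEAN` is a placeholder (check the `shaders` module docs for the real identifiers), and how the kernel receives the matrix dimensions is not shown:

```rust
use scry_gpu::{shaders, Device};

let gpu = Device::auto()?;

// PAIRWISE_SQ_EUCLIDEAN is a hypothetical constant name for the
// pairwise squared-Euclidean WGSL source in scry_gpu::shaders.
let queries = gpu.upload(&query_rows)?;              // n_q × d, row-major
let refs    = gpu.upload(&ref_rows)?;                // n_t × d, row-major
let dists   = gpu.upload(&vec![0.0f32; n_q * n_t])?; // output, n_q × n_t

// One invocation per output cell; dimension plumbing (n_q, n_t, d) omitted.
gpu.dispatch(shaders::PAIRWISE_SQ_EUCLIDEAN, &[&queries, &refs, &dists], n_q * n_t)?;
```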
## Multi-dispatch batches
For workloads with many small kernel launches (e.g. neural network forward/backward passes), `Device::batch()` records dispatches into a single command buffer with one fence wait at the end:
```rust
let mut batch = gpu.batch()?;
batch.run(shader_a, &[&x, &y], n)?;
batch.run(shader_b, &[&y, &z], n)?;
batch.submit()?; // single fence wait for every recorded dispatch
```
This avoids the per-dispatch submission overhead that dominates small-kernel timelines.
## Requirements
- Vulkan 1.2+ drivers (default backend). On Linux, Mesa 22+ for AMD/Intel; NVIDIA proprietary or open-kernel drivers.
- CUDA 12.0+ if using the `cuda` feature. cudarc detects the toolkit at build time.
- Rust 1.85+ (edition 2021, workspace MSRV).
Tests fail gracefully (`GpuError::NoDevice`) on systems without a usable GPU rather than hanging or panicking.
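A sketch of that pattern in a test, assuming `GpuError::NoDevice` is a unit variant and `Device::auto()` returns `Result<Device, GpuError>`:

```rust
use scry_gpu::{Device, GpuError};

#[test]
fn doubles_on_gpu() {
    // Skip cleanly on machines with no usable GPU (CI runners, etc.).
    let gpu = match Device::auto() {
        Ok(gpu) => gpu,
        Err(GpuError::NoDevice) => return,
        Err(e) => panic!("unexpected GPU init failure: {e}"),
    };
    let buf = gpu.upload(&[1.0f32, 2.0]).unwrap();
    assert_eq!(buf.download().unwrap(), vec![1.0, 2.0]);
}
```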
## Benchmarks
```sh
# Backend comparisons
cargo run --release --example bench_compute
```
The `bench_compute` example covers SAXPY, reduction, and matmul under steady-state dispatch with cached kernels — the regime that matters for sustained ML training, not first-call latency.
## License
Dual-licensed under MIT or Apache-2.0, at your option.