cartan-gpu
Portable GPU compute primitives for the cartan ecosystem.
Part of the cartan workspace, but its own detached Cargo workspace because the build depends on C/C++ toolchains (VkFFT, glslang) and on optional CUDA / NVIDIA driver components.
What this crate does
cartan-gpu exposes a small, opinionated GPU surface to the rest of the
cartan stack:
- Vulkan path (always on): wgpu 29 device + queue + typed GpuBuffer<T> storage, with a kernel-loading scaffold.
- VkFFT path (vkfft, default on): 1D / 2D / 3D forward and inverse FFTs through VkFftBackend, using a vendored VkFFT 1.3.4 submodule for the actual GPU code.
- CUDA path (cuda): CudaDevice over cudarc 0.19 (driver API only; no FFT yet at this level).
- cuFFT path (cufft): mirror of the VkFFT path on the CUDA side, CuFftBackend over cudarc::cufft, with cuBLAS-based on-device normalisation.
- Unified FFT trait: both VkFftBackend and CuFftBackend implement the same Fft trait with an associated Buffer type, so call-site code is identical across backends.
- Runtime backend selection: UniFftBackend + UniBuffer enum dispatch let you defer the Vulkan-vs-CUDA choice to session start.
- Zero-copy interop (vkfft + cufft, Linux): SharedFftBuffer exports VkDeviceMemory via VK_KHR_external_memory_fd and imports it into CUDA via cuImportExternalMemory, so VkFFT and cuFFT can operate on the same physical bytes without any host roundtrip or staging buffer.
Forward-then-inverse is identity on every path: VkFFT uses
cfg.normalize = 1, cuFFT post-scales by 1/N with cublasSscal_v2.
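The round-trip property can be illustrated on the CPU with a naive DFT. The sketch below is not cartan-gpu code: the C32 type, the function names, and the componentwise L-inf convention are assumptions made for illustration. It only shows why an unscaled forward-then-inverse transform needs a 1/N post-scale to return the input.

```rust
use std::f32::consts::PI;

/// Minimal complex number; a hypothetical stand-in for Complex32 so the sketch needs no crates.
#[derive(Clone, Copy, Debug, PartialEq)]
struct C32 {
    re: f32,
    im: f32,
}

/// Naive O(N^2) DFT. `inverse` flips the sign of the exponent; no scaling is applied here.
fn dft(input: &[C32], inverse: bool) -> Vec<C32> {
    let n = input.len();
    let sign = if inverse { 1.0 } else { -1.0 };
    (0..n)
        .map(|k| {
            let mut acc = C32 { re: 0.0, im: 0.0 };
            for (j, x) in input.iter().enumerate() {
                // Reduce k*j modulo n so the angle stays within one turn.
                let angle = sign * 2.0 * PI * ((k * j) % n) as f32 / n as f32;
                let (s, c) = angle.sin_cos();
                // Complex multiply: x * (c + i*s).
                acc.re += x.re * c - x.im * s;
                acc.im += x.re * s + x.im * c;
            }
            acc
        })
        .collect()
}

/// Post-scale by 1/N, the step that makes forward-then-inverse an identity.
fn scale_by_inv_n(data: &mut [C32]) {
    let inv_n = 1.0 / data.len() as f32;
    for x in data.iter_mut() {
        x.re *= inv_n;
        x.im *= inv_n;
    }
}

/// Componentwise L-infinity distance between two equally sized complex slices.
fn linf(a: &[C32], b: &[C32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x.re - y.re).abs().max((x.im - y.im).abs()))
        .fold(0.0, f32::max)
}

/// Forward DFT, inverse DFT, 1/N scale, then compare against the input.
fn round_trip_error(n: usize) -> f32 {
    let host: Vec<C32> = (0..n)
        .map(|i| C32 { re: (i as f32 * 0.1).sin(), im: 0.0 })
        .collect();
    let spectrum = dft(&host, false);
    let mut back = dft(&spectrum, true);
    scale_by_inv_n(&mut back);
    linf(&host, &back)
}

fn main() {
    println!("round-trip L-inf error: {:e}", round_trip_error(64));
}
```

Without the call to scale_by_inv_n, every element of the result would be N times the input, which is the factor the two GPU backends remove in their own ways.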
Quick start (Vulkan FFT)
use ;
use Complex32;
let dev = new.unwrap;
let mut fft = new.unwrap;
let n = 1024_usize;
let host: = .map.collect;
let mut buf = from_slice.unwrap;
fft.fft_1d.unwrap;
fft.fft_1d.unwrap;
let back = buf.to_vec.unwrap;
// `back` equals `host` to within numerical precision.
The cuFFT path has the same shape:
use ;
let cuda = new.unwrap;
let mut fft = new.unwrap;
let mut buf = from_slice.unwrap;
fft.fft_1d.unwrap;
fft.fft_1d.unwrap;
let back = buf.to_vec.unwrap;
Runtime backend selection
UniFftBackend lets a single call site work against either backend:
use ;
let engine = if var.is_ok else ;
let mut buf = from_slice.unwrap;
let mut engine = engine; // mut for the trait method
engine.fft_1d.unwrap;
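The enum-dispatch pattern behind this kind of runtime selection can be sketched in isolation. Everything in the block below is hypothetical: the Transform trait, BackendA / BackendB, the UniBackend / UniBuffer enums, and the DEMO_USE_B environment variable are placeholders and do not reflect the actual cartan-gpu API. The point is only to show how a trait with an associated buffer type can be wrapped in a pair of enums so one call site serves either backend.

```rust
use std::env;

/// Hypothetical unified trait: each backend names its own buffer type.
trait Transform {
    type Buffer;
    fn apply(&mut self, buf: &mut Self::Buffer) -> Result<(), String>;
}

/// Placeholder backends; each just negates its buffer in place.
struct BackendA;
struct BackendB;

impl Transform for BackendA {
    type Buffer = Vec<f32>;
    fn apply(&mut self, buf: &mut Vec<f32>) -> Result<(), String> {
        buf.iter_mut().for_each(|x| *x = -*x);
        Ok(())
    }
}

impl Transform for BackendB {
    type Buffer = Box<[f32]>;
    fn apply(&mut self, buf: &mut Box<[f32]>) -> Result<(), String> {
        buf.iter_mut().for_each(|x| *x = -*x);
        Ok(())
    }
}

/// One enum wraps the backends, a second wraps their buffer types.
enum UniBackend { A(BackendA), B(BackendB) }
enum UniBuffer { A(Vec<f32>), B(Box<[f32]>) }

impl UniBackend {
    /// Pick a backend at runtime.
    fn select(use_b: bool) -> Self {
        if use_b { UniBackend::B(BackendB) } else { UniBackend::A(BackendA) }
    }

    /// Allocate a buffer of the variant that matches this backend.
    fn buffer_from_slice(&self, data: &[f32]) -> UniBuffer {
        match self {
            UniBackend::A(_) => UniBuffer::A(data.to_vec()),
            UniBackend::B(_) => UniBuffer::B(Box::from(data)),
        }
    }
}

impl UniBuffer {
    fn to_vec(&self) -> Vec<f32> {
        match self {
            UniBuffer::A(v) => v.clone(),
            UniBuffer::B(v) => v.to_vec(),
        }
    }
}

impl Transform for UniBackend {
    type Buffer = UniBuffer;
    fn apply(&mut self, buf: &mut UniBuffer) -> Result<(), String> {
        // Dispatch only when backend and buffer variants agree.
        match (self, buf) {
            (UniBackend::A(b), UniBuffer::A(v)) => b.apply(v),
            (UniBackend::B(b), UniBuffer::B(v)) => b.apply(v),
            _ => Err("buffer belongs to a different backend".to_string()),
        }
    }
}

/// Single call site that works against either backend.
fn run(use_b: bool, data: &[f32]) -> Result<Vec<f32>, String> {
    let mut engine = UniBackend::select(use_b);
    let mut buf = engine.buffer_from_slice(data);
    engine.apply(&mut buf)?;
    Ok(buf.to_vec())
}

fn main() {
    // DEMO_USE_B is a made-up variable name for this sketch.
    let use_b = env::var("DEMO_USE_B").is_ok();
    println!("{:?}", run(use_b, &[1.0, 2.0]));
}
```

A mismatched backend and buffer pair is reported as an error rather than a panic in this sketch; whether the real crate does the same is not stated here.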
Zero-copy Vulkan ↔ CUDA
On Linux with vkfft + cufft enabled, a single GPU allocation is
addressable from both APIs. The Vulkan VkFFT path writes and the CUDA
cuFFT path reads, with no host roundtrip:
use ;
let vk = new.unwrap;
let cuda = new.unwrap;
let mut vk_fft = new.unwrap;
let mut cu_fft = new.unwrap;
let n = 1024_usize;
let mut buf = new.unwrap;
buf.upload.unwrap;
vk_fft.fft_1d_shared.unwrap;
cu_fft.fft_1d_shared.unwrap;
let back = buf.download.unwrap;
// `back` ≈ `host`; the FFT data never touched the CPU between Vk and CUDA.
Verified on NVIDIA RTX 5060 Laptop (Blackwell SM 12.0, CUDA 13.1,
driver 580.x): Vk Forward → CUDA Inverse round-trip L-inf = 9e-7.
Cross-API memory sharing is gated by a same-GPU UUID match (Vulkan
VkPhysicalDeviceIDProperties.deviceUUID vs CUDA cuDeviceGetUuid) so
multi-GPU hosts can't silently fall into a broken non-shared mapping.
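The UUID gate can be sketched as a plain byte comparison, since both Vulkan's deviceUUID and CUDA's CUuuid are 16-byte values. The function and type names below are hypothetical and the UUIDs are passed in as plain arrays; querying them from the Vulkan and CUDA drivers is out of scope for this sketch, and the crate's actual check may be structured differently.

```rust
/// Both Vulkan's deviceUUID (VK_UUID_SIZE) and CUDA's CUuuid are 16 bytes.
type GpuUuid = [u8; 16];

/// Render a UUID as lowercase hex for error messages.
fn hex(uuid: &GpuUuid) -> String {
    uuid.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Return the index of the CUDA device whose UUID matches the Vulkan device,
/// or an error so the caller never shares memory across different GPUs.
fn find_matching_cuda_device(
    vk_uuid: &GpuUuid,
    cuda_uuids: &[GpuUuid],
) -> Result<usize, String> {
    cuda_uuids
        .iter()
        .position(|u| u == vk_uuid)
        .ok_or_else(|| format!("no CUDA device matches Vulkan UUID {}", hex(vk_uuid)))
}

fn main() {
    let gpu0: GpuUuid = [0x11; 16];
    let gpu1: GpuUuid = [0x22; 16];
    // Vulkan picked gpu1; CUDA enumerates gpu0 first, then gpu1.
    println!("{:?}", find_matching_cuda_device(&gpu1, &[gpu0, gpu1]));
}
```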
Features
| Feature | Default | Pulls in | Notes |
|---|---|---|---|
| vkfft | yes | cartan-gpu-sys, ash 0.38 | Vulkan FFT via vendored VkFFT 1.3.4 |
| cuda | no | cudarc 0.19 (driver) | CudaDevice only |
| cufft | no | cuda + cudarc/cufft + cudarc/cublas | CuFftBackend |
The cuda-NNNNN ABI feature of cudarc is pinned to cuda-13010 (CUDA
13.1) at the cartan-gpu level; consumers on a different CUDA installation
can patch the dep or switch to cudarc/cuda-version-from-build-system.
Tests + benchmark
cargo test --features "vkfft cuda cufft" --tests
cargo bench --features "vkfft cuda cufft" --bench fft_compare
On the development machine (RTX 5060), the bench at n=1024 shows rustfft ≈ 711 ns CPU, VkFFT ≈ 160 µs, cuFFT ≈ 24.5 µs. At small N the GPU backends are dominated by launch overhead; they pull ahead sharply at larger N.
Known limits and follow-ups
- The shared-memory path is Linux-only (uses OPAQUE_FD); Windows would need a parallel OPAQUE_WIN32 implementation.
- Cross-backend handoff still uses CPU-side queue_wait_idle / stream.synchronize. External-semaphore sync (importing a VkSemaphore into CUDA as a cuExternalSemaphore) is the next performance optimisation; the FFT compute cost dominates for any reasonable problem size, so the CPU-side wait is correct, if not optimal.
- 2D batched FFTs go through plan_many on the cuFFT side; the VkFFT path supports batched 1D out of the box and would need additional shim wiring for batched higher dimensions.