wgsl-fft
GPU-accelerated FFT in Rust using wgpu compute shaders.
Implements the FFT — a two-buffer ping-pong formulation
where each stage reads from one buffer and writes to the other. This eliminates the separate
bit-reversal pass and removes all inter-stage memory hazards, allowing the entire transform
to run in a single GPU compute pass with one queue.submit() call.
GPU-accelerated: This library uses wgpu compute shaders for GPU acceleration. If no GPU is available, wgpu's fallback adapter provides CPU-based software rendering.
The WGSL compute kernels are embedded as raw WGSL strings in src/shaders.rs.
Usage
[]
= "0.1"
= "0.4"
Forward FFT (Batch Processing)
use GpuFft;
use Complex;
// Create FFT instance - returns Result since GPU might not be available
let fft = new?;
// Single FFT (pass vector with one element for backward compatibility)
let single_input = vec!;
let single_spectrum = fft.fft?;
assert_eq!;
assert_eq!;
// Batch FFT (process multiple signals efficiently)
let batch_inputs = vec!;
let batch_spectra = fft.fft?;
assert_eq!; // Two FFT results
assert_eq!; // Each FFT has 1024 bins
Inverse FFT (Batch Processing)
// Compute inverse FFT (automatically scaled by 1/N)
let reconstructed_batch = fft.ifft?;
assert_eq!; // Two IFFT results
// Roundtrip: FFT(IFFT(x)) ≈ x (within numerical precision)
let roundtrip_error = batch_inputs.iter.zip
.map
.fold;
println!;
Batch Processing Features
NEW: The library now supports efficient batch processing of multiple FFTs/IFFTs:
- Single Vector Processing: Pass a vector containing one element:
fft.fft(&[single_vector]) - Batch Processing: Pass multiple vectors:
fft.fft(&[vector1, vector2, vector3]) - Performance: Batch processing provides ~1.2x speedup for multiple transforms
- Memory Efficiency: Processes vectors sequentially to minimize GPU memory usage
- Backward Compatibility: Existing code works by wrapping single vectors in an array
Batch Processing Example
// Process 8 signals of 4096 samples each
let batch_size = 8;
let fft_size = 4096;
let signals: =
.map
.collect;
// Batch FFT - much faster than processing individually
let spectra = fft.fft?;
// Process results
for in spectra.iter.enumerate
Requirements
- A wgpu-capable GPU (Vulkan, Metal, DX12, or WebGPU).
- Input length must be a power of two and non-empty.
- All vectors in a batch must have the same length.
Module Structure
The crate is organized into the following modules:
| Module | Purpose |
|---|---|
fft |
Main FFT implementation (GpuFft, FftExecutor, SizeCache) |
pipelines |
Pre-compiled FFT pipelines (FftPipelines, FftDirection) |
buffer |
Ping-pong buffer infrastructure (PingPongState, PingPongBuffers) |
pipeline |
Streaming pipeline builder and compute stages |
shaders |
WGSL compute shader source code |
benchmark |
Benchmarking utilities |
rivals |
Alternative FFT implementations for comparison |
All public types are re-exported from the crate root, so you can use wgsl_fft::GpuFft directly.
Algorithm
Stockham autosort formulation with single-pass execution:
- Single compute pass: All log₂(N) butterfly stages execute in one
queue.submit()call - Ping-pong buffers: Even stages read from buffer A and write to buffer B; odd stages read from B and write to A
- Natural order output: No separate bit-reversal pass needed due to autosort property
- Memory efficient: No inter-stage synchronization required since consecutive stages access different buffers
Performance characteristics (release build, NVIDIA GPU):
Single FFT Performance
| Size (N) | Throughput | GFLOPS | Latency |
|---|---|---|---|
| 256 | 2.7 MSamples/s | 0.11 | 0.095 ms |
| 1,024 | 9.6 MSamples/s | 0.48 | 0.107 ms |
| 4,096 | 32.0 MSamples/s | 1.92 | 0.128 ms |
| 16,384 | 86.2 MSamples/s | 6.04 | 0.190 ms |
| 65,536 | 145 MSamples/s | 11.60 | 0.452 ms |
| 262,144 | 160 MSamples/s | 14.45 | 1.632 ms |
Batch Processing Performance
Batch processing provides significant throughput improvements:
- 2x speedup: Processing 8 signals in batch vs. individually
- Reduced overhead: Single GPU kernel launch for entire batch
- Better utilization: Keeps GPU busy with continuous work
Accuracy: max element-wise L₂ error vs. rustfft is below 1e-3 for N = 1024 single-precision inputs.
Batch processing maintains identical numerical accuracy to single-vector processing.
Limitations
- Single-precision (
f32) only.
Shader development
The canonical shader source is in src/shaders.rs as raw WGSL strings.
The Stockham autosort kernel implements the entire FFT in a single compute shader
that processes all log₂(N) stages sequentially.
To verify correctness and run performance benchmarks:
# Accuracy test (vs rustfft) - now uses batch API
# Performance benchmark - tests batch processing
# Batch processing example
Performance Optimization
The implementation includes several optimizations:
- Single-pass execution: All FFT stages run in one compute pass
- Optimized synchronization: Reduced CPU-GPU synchronization overhead
- Efficient memory access: Ping-pong buffers eliminate memory hazards
- Automatic fallback: Uses wgpu's software fallback adapter when no GPU is available
For best performance:
- Use
--releasebuilds - Target larger FFT sizes (≥4096) where GPU advantages are most pronounced
- Ensure your system has a modern wgpu-compatible GPU
License
MIT — see LICENSE.