wgsl-fft 0.3.1

GPU-accelerated FFT using Webgpu compute shaders
Documentation

wgsl-fft

GPU-accelerated FFT in Rust using wgpu compute shaders.

Implements the FFT — a two-buffer ping-pong formulation where each stage reads from one buffer and writes to the other. This eliminates the separate bit-reversal pass and removes all inter-stage memory hazards, allowing the entire transform to run in a single GPU compute pass with one queue.submit() call.

GPU-accelerated: This library uses wgpu compute shaders for GPU acceleration. If no GPU is available, wgpu's fallback adapter provides CPU-based software rendering.

The WGSL compute kernels are embedded as raw WGSL strings in src/shaders.rs.

Usage

[dependencies]
wgsl-fft = "0.1"
num-complex  = "0.4"

Forward FFT (Batch Processing)

use wgsl_fft::GpuFft;
use num_complex::Complex;

// Create FFT instance - returns Result since GPU might not be available
let fft = GpuFft::new()?;

// Single FFT (pass vector with one element for backward compatibility)
let single_input = vec![vec![Complex { re: (i as f32 * 0.1).sin(), im: 0.0 }; 1024]];
let single_spectrum = fft.fft(&single_input)?;
assert_eq!(single_spectrum.len(), 1);
assert_eq!(single_spectrum[0].len(), 1024);

// Batch FFT (process multiple signals efficiently)
let batch_inputs = vec![
    vec![Complex { re: (i as f32 * 0.1).sin(), im: 0.0 }; 1024],
    vec![Complex { re: (i as f32 * 0.2).sin(), im: 0.0 }; 1024],
];
let batch_spectra = fft.fft(&batch_inputs)?;
assert_eq!(batch_spectra.len(), 2); // Two FFT results
assert_eq!(batch_spectra[0].len(), 1024); // Each FFT has 1024 bins

Inverse FFT (Batch Processing)

// Compute inverse FFT (automatically scaled by 1/N)
let reconstructed_batch = fft.ifft(&batch_spectra)?;
assert_eq!(reconstructed_batch.len(), 2); // Two IFFT results

// Roundtrip: FFT(IFFT(x)) ≈ x (within numerical precision)
let roundtrip_error = batch_inputs[0].iter().zip(reconstructed_batch[0].iter())
    .map(|(a, b)| ((a.re - b.re).powi(2) + (a.im - b.im).powi(2)).sqrt())
    .fold(0.0, f32::max);
println!("Max roundtrip error: {roundtrip_error:.2e}");

Batch Processing Features

NEW: The library now supports efficient batch processing of multiple FFTs/IFFTs:

  • Single Vector Processing: Pass a vector containing one element: fft.fft(&[single_vector])
  • Batch Processing: Pass multiple vectors: fft.fft(&[vector1, vector2, vector3])
  • Performance: Batch processing provides ~1.2x speedup for multiple transforms
  • Memory Efficiency: Processes vectors sequentially to minimize GPU memory usage
  • Backward Compatibility: Existing code works by wrapping single vectors in an array

Batch Processing Example

// Process 8 signals of 4096 samples each
let batch_size = 8;
let fft_size = 4096;
let signals: Vec<_> = (0..batch_size)
    .map(|i| vec![Complex::new(i as f32 * 0.1, 0.0); fft_size])
    .collect();

// Batch FFT - much faster than processing individually
let spectra = fft.fft(&signals)?;

// Process results
for (i, spectrum) in spectra.iter().enumerate() {
    println!("Signal {} FFT completed: {} bins", i, spectrum.len());
}

Requirements

  • A wgpu-capable GPU (Vulkan, Metal, DX12, or WebGPU).
  • Input length must be a power of two and non-empty.
  • All vectors in a batch must have the same length.

Module Structure

The crate is organized into the following modules:

Module Purpose
fft Main FFT implementation (GpuFft, FftExecutor, SizeCache)
pipelines Pre-compiled FFT pipelines (FftPipelines, FftDirection)
buffer Ping-pong buffer infrastructure (PingPongState, PingPongBuffers)
pipeline Streaming pipeline builder and compute stages
shaders WGSL compute shader source code
benchmark Benchmarking utilities
rivals Alternative FFT implementations for comparison

All public types are re-exported from the crate root, so you can use wgsl_fft::GpuFft directly.

Algorithm

Stockham autosort formulation with single-pass execution:

  • Single compute pass: All log₂(N) butterfly stages execute in one queue.submit() call
  • Ping-pong buffers: Even stages read from buffer A and write to buffer B; odd stages read from B and write to A
  • Natural order output: No separate bit-reversal pass needed due to autosort property
  • Memory efficient: No inter-stage synchronization required since consecutive stages access different buffers

Performance characteristics (release build, NVIDIA GPU):

Single FFT Performance

Size (N) Throughput GFLOPS Latency
256 2.7 MSamples/s 0.11 0.095 ms
1,024 9.6 MSamples/s 0.48 0.107 ms
4,096 32.0 MSamples/s 1.92 0.128 ms
16,384 86.2 MSamples/s 6.04 0.190 ms
65,536 145 MSamples/s 11.60 0.452 ms
262,144 160 MSamples/s 14.45 1.632 ms

Batch Processing Performance

Batch processing provides significant throughput improvements:

  • 2x speedup: Processing 8 signals in batch vs. individually
  • Reduced overhead: Single GPU kernel launch for entire batch
  • Better utilization: Keeps GPU busy with continuous work

Accuracy: max element-wise L₂ error vs. rustfft is below 1e-3 for N = 1024 single-precision inputs. Batch processing maintains identical numerical accuracy to single-vector processing.

Limitations

  • Single-precision (f32) only.

Shader development

The canonical shader source is in src/shaders.rs as raw WGSL strings. The Stockham autosort kernel implements the entire FFT in a single compute shader that processes all log₂(N) stages sequentially.

To verify correctness and run performance benchmarks:

# Accuracy test (vs rustfft) - now uses batch API
cargo test test_gpu_fft_matches_rustfft --release

# Performance benchmark - tests batch processing
cargo test fft_throughput --release -- --nocapture

# Batch processing example
cargo run --example test_batch_processing --release

Performance Optimization

The implementation includes several optimizations:

  • Single-pass execution: All FFT stages run in one compute pass
  • Optimized synchronization: Reduced CPU-GPU synchronization overhead
  • Efficient memory access: Ping-pong buffers eliminate memory hazards
  • Automatic fallback: Uses wgpu's software fallback adapter when no GPU is available

For best performance:

  • Use --release builds
  • Target larger FFT sizes (≥4096) where GPU advantages are most pronounced
  • Ensure your system has a modern wgpu-compatible GPU

License

MIT — see LICENSE.