wgsl-fft 0.1.0

GPU-accelerated FFT using Webgpu compute shaders
Documentation

wgsl-fft

GPU-accelerated FFT in Rust using wgpu compute shaders.

Implements the FFT — a two-buffer ping-pong formulation where each stage reads from one buffer and writes to the other. This eliminates the separate bit-reversal pass and removes all inter-stage memory hazards, allowing the entire transform to run in a single GPU compute pass with one queue.submit() call.

GPU-accelerated: This library uses wgpu compute shaders for GPU acceleration. If no GPU is available, wgpu's fallback adapter provides CPU-based software rendering.

The WGSL compute kernels are embedded as raw WGSL strings in src/shaders.rs.

Usage

[dependencies]
wgsl-fft = "0.1"
num-complex  = "0.4"

Forward FFT (Batch Processing)

use wgsl_fft::GpuFft;
use num_complex::Complex;

// Create FFT instance - returns Result since GPU might not be available
let fft = GpuFft::new()?;

// Single FFT (pass vector with one element for backward compatibility)
let single_input = vec![vec![Complex { re: (i as f32 * 0.1).sin(), im: 0.0 }; 1024]];
let single_spectrum = fft.fft(&single_input)?;
assert_eq!(single_spectrum.len(), 1);
assert_eq!(single_spectrum[0].len(), 1024);

// Batch FFT (process multiple signals efficiently)
let batch_inputs = vec![
    vec![Complex { re: (i as f32 * 0.1).sin(), im: 0.0 }; 1024],
    vec![Complex { re: (i as f32 * 0.2).sin(), im: 0.0 }; 1024],
];
let batch_spectra = fft.fft(&batch_inputs)?;
assert_eq!(batch_spectra.len(), 2); // Two FFT results
assert_eq!(batch_spectra[0].len(), 1024); // Each FFT has 1024 bins

Inverse FFT (Batch Processing)

// Compute inverse FFT (automatically scaled by 1/N)
let reconstructed_batch = fft.ifft(&batch_spectra)?;
assert_eq!(reconstructed_batch.len(), 2); // Two IFFT results

// Roundtrip: FFT(IFFT(x)) ≈ x (within numerical precision)
let roundtrip_error = batch_inputs[0].iter().zip(reconstructed_batch[0].iter())
    .map(|(a, b)| ((a.re - b.re).powi(2) + (a.im - b.im).powi(2)).sqrt())
    .fold(0.0, f32::max);
println!("Max roundtrip error: {roundtrip_error:.2e}");

Batch Processing Features

NEW: The library now supports efficient batch processing of multiple FFTs/IFFTs:

  • Single Vector Processing: Pass a vector containing one element: fft.fft(&[single_vector])
  • Batch Processing: Pass multiple vectors: fft.fft(&[vector1, vector2, vector3])
  • Performance: Batch processing provides ~1.2x speedup for multiple transforms
  • Memory Efficiency: Processes vectors sequentially to minimize GPU memory usage
  • Backward Compatibility: Existing code works by wrapping single vectors in an array

Batch Processing Example

// Process 8 signals of 4096 samples each
let batch_size = 8;
let fft_size = 4096;
let signals: Vec<_> = (0..batch_size)
    .map(|i| vec![Complex::new(i as f32 * 0.1, 0.0); fft_size])
    .collect();

// Batch FFT - much faster than processing individually
let spectra = fft.fft(&signals)?;

// Process results
for (i, spectrum) in spectra.iter().enumerate() {
    println!("Signal {} FFT completed: {} bins", i, spectrum.len());
}

Requirements

  • A wgpu-capable GPU (Vulkan, Metal, DX12, or WebGPU).
  • Input length must be a power of two and non-empty.
  • All vectors in a batch must have the same length.

Algorithm

Stockham autosort formulation with single-pass execution:

  • Single compute pass: All log₂(N) butterfly stages execute in one queue.submit() call
  • Ping-pong buffers: Even stages read from buffer A and write to buffer B; odd stages read from B and write to A
  • Natural order output: No separate bit-reversal pass needed due to autosort property
  • Memory efficient: No inter-stage synchronization required since consecutive stages access different buffers

Performance characteristics (release build, NVIDIA GPU):

Single FFT Performance

Size (N) Throughput GFLOPS Latency
256 2.7 MSamples/s 0.11 0.095 ms
1,024 9.6 MSamples/s 0.48 0.107 ms
4,096 32.0 MSamples/s 1.92 0.128 ms
16,384 86.2 MSamples/s 6.04 0.190 ms
65,536 145 MSamples/s 11.60 0.452 ms
262,144 160 MSamples/s 14.45 1.632 ms

Batch Processing Performance

Batch processing provides significant throughput improvements:

  • 2x speedup: Processing 8 signals in batch vs. individually
  • Reduced overhead: Single GPU kernel launch for entire batch
  • Better utilization: Keeps GPU busy with continuous work

Accuracy: max element-wise L₂ error vs. rustfft is below 1e-3 for N = 1024 single-precision inputs. Batch processing maintains identical numerical accuracy to single-vector processing.

Limitations

  • Single-precision (f32) only.

Shader development

The canonical shader source is in src/shaders.rs as raw WGSL strings. The Stockham autosort kernel implements the entire FFT in a single compute shader that processes all log₂(N) stages sequentially.

To verify correctness and run performance benchmarks:

# Accuracy test (vs rustfft) - now uses batch API
cargo test test_gpu_fft_matches_rustfft --release

# Performance benchmark - tests batch processing
cargo test fft_throughput --release -- --nocapture

# Batch processing example
cargo run --example test_batch_processing --release

Performance Optimization

The implementation includes several optimizations:

  • Single-pass execution: All FFT stages run in one compute pass
  • Optimized synchronization: Reduced CPU-GPU synchronization overhead
  • Efficient memory access: Ping-pong buffers eliminate memory hazards
  • Automatic fallback: Uses wgpu's software fallback adapter when no GPU is available

For best performance:

  • Use --release builds
  • Target larger FFT sizes (≥4096) where GPU advantages are most pronounced
  • Ensure your system has a modern wgpu-compatible GPU

License

MIT — see LICENSE.