wgpu-algorithms

A high-performance, safe WebGPU sorting and scanning library for Rust.

Safe Rust API. All memory management, bind groups, and synchronization are handled internally. No unsafe blocks in library code.

Performance Benchmarks

Benchmarks run on Apple M3 Max (Metal backend).

CPU: Rayon Parallel Sort (par_sort_unstable)
GPU Resident: Sorts data already on VRAM (Pipeline use-case)
GPU Round-Trip: Upload -> Sort -> Download (Utility use-case)

Items	CPU (Rayon)	GPU (Resident)	GPU (Round-Trip)	Verdict
100k	0.52 ms	6.0 ms	7.2 ms	❌ CPU Wins (Driver Overhead)
1M	4.5 ms	9.1 ms	10.1 ms	❌ CPU Wins
10M	44.1 ms	31.3 ms	40.9 ms	✅ GPU Wins (1.4x)
100M	506 ms	273 ms	407 ms	🚀 GPU Domination (1.85x)

Throughput (100M items):

Scan: ~5.2 Billion items/sec
Sort: ~365 Million items/sec

Prefix Scan (Inclusive Add)

Benchmarks include driver submission overhead (queue.submit + device.poll).

Items	Time	Throughput	Bandwidth (Effective)
100 M	19.2 ms	5.2 Gelem/s	~41.6 GB/s

Note: Bandwidth calculated as Read + Write (4 bytes * 2 * items / time).

Architecture

The library implements state-of-the-art parallel algorithms tailored for the WebGPU execution model:

LSD Radix Sort: A 2-bit pass (4 bins) decoupling "Counting" and "Scattering" kernels.
Hierarchical Scan: A "Reduce-Then-Scan" approach using 3 separate kernels (Downsweep, Scan-Aux, Upsweep) to handle arbitrary input sizes.
Vector Tiling (VT): Automatically adjusts items-per-thread based on GPU capability (e.g., VT=8 for Desktop, VT=4 for Mobile) to saturate memory bandwidth.

Features

Adaptive Sorting: Automatically switches between CPU (latency-optimized) and GPU (throughput-optimized) based on input size (< 1M items uses CPU).
Zero-Allocation Hot Loop: Reuses internal workspace buffers and pre-baked BindGroups to minimize driver pressure during animation loops.
WGPU Safe: Runs on Metal, Vulkan, DX12, and WebGPU without experimental features.

Usage

use wgpu_algorithms::{Context, Sorter};

#[tokio::main]
async fn main() {
    // 1. Initialize GPU Context
    let ctx = Context::init().await.unwrap();
    let mut sorter = Sorter::new(&ctx);

    // 2. Data
    let data: Vec<u32> = (0..50_000_000).map(|_| rand::random()).collect();

    // 3. Sort (Auto-selects CPU or GPU based on size)
    let sorted = sorter.sort(&data).await;
    
    // 4. Verification
    assert!(sorted[0] <= sorted[1]);
}

Installation

Add this to your Cargo.toml:

[dependencies]
wgpu-algorithms = "0.1.0"