forge-sort 0.2.3

GPU radix sort for Apple Silicon — 4,800+ Mkeys/s, 6 data types, zero-copy
docs.rs failed to build forge-sort-0.2.3
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

forge-sort

The fastest GPU radix sort on Apple Silicon. 28x faster than Rust's sort_unstable() (ipnsort), 8x faster than Rayon parallel sort, and 8x faster than the next best published Apple Silicon implementation.

use forge_sort::GpuSorter;

let mut sorter = GpuSorter::new()?;
sorter.sort_u32(&mut data)?;

Benchmarks

Measured on Apple M4 Pro (20-core GPU, 48GB unified memory). Median of 7 runs on random u32 data.

vs Rust sort_unstable (ipnsort)

Head-to-head comparison: Rust's standard library sort_unstable() — the fastest general-purpose sort in Rust (ipnsort, a pattern-defeating quicksort) — vs forge-sort's GPU radix sort. Same random u32 data, same process. Run with cargo run --release --example vs_stdlib.

Size sort_unstable (CPU) forge-sort (GPU) Speedup
10K 0.08 ms / 119 Mk/s 0.67 ms / 15 Mk/s CPU 7.9x
50K 0.46 ms / 108 Mk/s 0.73 ms / 69 Mk/s CPU 1.6x
100K 0.96 ms / 104 Mk/s 0.72 ms / 139 Mk/s GPU 1.3x
500K 4.58 ms / 109 Mk/s 0.97 ms / 513 Mk/s GPU 4.7x
1M 8.96 ms / 112 Mk/s 1.28 ms / 779 Mk/s GPU 7.0x
4M 39.6 ms / 101 Mk/s 3.17 ms / 1,260 Mk/s GPU 12.5x
16M 173.8 ms / 93 Mk/s 6.21 ms / 2,576 Mk/s GPU 28.0x

GPU crossover at ~100K elements. Below that, CPU wins due to GPU dispatch overhead (~0.6ms). Above that, GPU radix sort dominates -- O(n) radix vs O(n log n) comparison-based, with 273 GB/s memory bandwidth.

vs Rayon parallel sort

Size sort_unstable rayon par_sort forge-sort vs CPU vs Rayon
1M 8.82 ms / 113 Mk/s 2.67 ms / 374 Mk/s 1.20 ms / 836 Mk/s 7.4x 2.2x
4M 38.57 ms / 104 Mk/s 11.35 ms / 353 Mk/s 1.77 ms / 2,255 Mk/s 21.7x 6.4x
16M 168.19 ms / 95 Mk/s 44.08 ms / 363 Mk/s 5.75 ms / 2,784 Mk/s 29.3x 7.7x

sort_buffer (zero-copy, pure GPU speed)

When data is already in a Metal buffer, skip the memcpy for up to 1.8x more throughput:

Size sort_u32 (w/ memcpy) sort_buffer (zero-copy) Speedup
100K 428 Mk/s 428 Mk/s 1.00x
1M 2,510 Mk/s 3,432 Mk/s 1.37x
4M 3,097 Mk/s 5,242 Mk/s 1.69x
16M 2,693 Mk/s 4,131 Mk/s 1.53x

GPU wins above ~200K elements. Sweet spot is 4M where data fits in the System Level Cache.

Multi-type performance (all 6 types @ 16M elements)

Type sort (w/ memcpy) sort_buffer (zero-copy) argsort sort_pairs
u32 2,802 Mk/s 4,131 Mk/s 1,147 Mk/s 882 Mk/s
i32 2,208 Mk/s 3,255 Mk/s 1,101 Mk/s 804 Mk/s
f32 2,188 Mk/s 3,286 Mk/s 1,118 Mk/s 800 Mk/s
u64 918 Mk/s 1,174 Mk/s 536 Mk/s 450 Mk/s
i64 811 Mk/s 1,034 Mk/s 490 Mk/s 424 Mk/s
f64 813 Mk/s 1,025 Mk/s 503 Mk/s -

32-bit types sort at ~2,200-2,800 Mk/s. 64-bit types sort at ~800-900 Mk/s (2x data, ~3x slower due to halved tile size + extra passes). Float types match integer performance with uniform bit distribution. Narrow-range floats (e.g. all values in [-1000, 1000]) are slower due to MSD bucket imbalance.

vs other implementations

Implementation Platform 1M Mk/s 16M Mk/s
forge-sort (zero-copy) M4 Pro (Metal) 3,432 4,131
forge-sort (sort_u32) M4 Pro (Metal) 836 2,784
VSort M4 Pro (CPU+Metal) 99 -
VkRadixSort RTX 3070 (Vulkan) 53 -
MPS Apple Silicon no sort function no sort function

Metal Performance Shaders does not include a sort primitive. forge-sort fills that gap.

Algorithm

MSD+fused-inner 8-bit radix sort executed in 4 GPU dispatches within a single command encoder:

  1. MSD Histogram -- count byte[24:31] distribution across all tiles
  2. MSD Prep -- exclusive prefix sum + bucket descriptors
  3. Atomic MSD Scatter -- scatter to 256 buckets via atomic_fetch_add (no spin-wait)
  4. Fused Inner Sort -- 3-pass LSD (bytes 0-2) per bucket, self-contained histograms

All dispatches share one encoder with implicit device memory barriers between them. The inner sort kernel computes its own histograms for all 3 passes during its first scan, eliminating a separate precompute dispatch.

Why it's fast

  • 4 total passes over data (1 MSD + 3 inner LSD) vs 4-8 in traditional implementations
  • Atomic scatter replaces decoupled lookback -- zero spin-waiting, no tile status buffers
  • Fused inner kernel -- 3 LSD passes execute within a single dispatch, keeping data in SLC
  • ~22 KB threadgroup memory -- fits M4 Pro's 32 KB budget with room to spare
  • Per-simdgroup atomic histograms -- 8 parallel accumulators reduce TG memory contention by 8x

Requirements

  • macOS with Apple Silicon (M1 or later)
  • Metal 3.2 support (for atomic_thread_fence with device scope)
  • Rust 1.70+
  • Xcode Command Line Tools (for xcrun metal shader compiler)

Usage

Add to your Cargo.toml:

[dependencies]
forge-sort = "0.2"

Simple (sort_u32)

Copies data to GPU buffer and back. Easiest to use:

use forge_sort::{GpuSorter, SortError};

fn main() -> Result<(), SortError> {
    let mut sorter = GpuSorter::new()?;

    let mut data: Vec<u32> = vec![5, 3, 8, 1, 9, 2, 7, 4, 6, 0];
    sorter.sort_u32(&mut data)?;
    assert_eq!(data, vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);

    Ok(())
}

Zero-copy (sort_buffer)

Write directly into GPU-visible memory. No memcpy, maximum throughput:

use forge_sort::{GpuSorter, SortError};

fn main() -> Result<(), SortError> {
    let mut sorter = GpuSorter::new()?;

    // Allocate a buffer in unified memory (shared CPU/GPU)
    let mut buf = sorter.alloc_sort_buffer(16_000_000);

    // Write data directly into GPU-visible memory
    let slice = buf.as_mut_slice();
    for i in 0..16_000_000 {
        slice[i] = (16_000_000 - i) as u32;
    }
    buf.set_len(16_000_000);

    // Sort in-place — zero memcpy
    sorter.sort_buffer(&buf)?;

    // Read sorted results directly from GPU memory
    assert_eq!(buf.as_slice()[0], 1);

    Ok(())
}

API

pub struct GpuSorter { /* ... */ }
pub struct SortBuffer<T> { /* ... */ }

impl GpuSorter {
    pub fn new() -> Result<Self, SortError>;

    // Sort slices in-place (copies data to/from GPU)
    pub fn sort_u32(&mut self, data: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_i32(&mut self, data: &mut [i32]) -> Result<(), SortError>;
    pub fn sort_f32(&mut self, data: &mut [f32]) -> Result<(), SortError>;
    pub fn sort_u64(&mut self, data: &mut [u64]) -> Result<(), SortError>;
    pub fn sort_i64(&mut self, data: &mut [i64]) -> Result<(), SortError>;
    pub fn sort_f64(&mut self, data: &mut [f64]) -> Result<(), SortError>;

    // Zero-copy sort (data stays in GPU memory)
    pub fn sort_buffer(&mut self, buf: &SortBuffer<u32>) -> Result<(), SortError>;
    pub fn sort_f32_buffer(&mut self, buf: &SortBuffer<f32>) -> Result<(), SortError>;
    pub fn sort_i32_buffer(&mut self, buf: &SortBuffer<i32>) -> Result<(), SortError>;
    pub fn sort_u64_buffer(&mut self, buf: &SortBuffer<u64>) -> Result<(), SortError>;
    pub fn sort_i64_buffer(&mut self, buf: &SortBuffer<i64>) -> Result<(), SortError>;
    pub fn sort_f64_buffer(&mut self, buf: &SortBuffer<f64>) -> Result<(), SortError>;

    // Argsort -- returns index permutation
    pub fn argsort_u32(&mut self, data: &[u32]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_i32(&mut self, data: &[i32]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_f32(&mut self, data: &[f32]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_u64(&mut self, data: &[u64]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_i64(&mut self, data: &[i64]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_f64(&mut self, data: &[f64]) -> Result<Vec<u32>, SortError>;

    // Key-value pair sort -- co-sorts values by key order
    pub fn sort_pairs_u32(&mut self, keys: &mut [u32], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_i32(&mut self, keys: &mut [i32], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_f32(&mut self, keys: &mut [f32], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_u64(&mut self, keys: &mut [u64], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_i64(&mut self, keys: &mut [i64], values: &mut [u32]) -> Result<(), SortError>;

    // Buffer allocation
    pub fn alloc_sort_buffer<T>(&self, capacity: usize) -> SortBuffer<T>;
}

Running benchmarks

cargo bench -p forge-sort

Running tests

cargo test -p forge-sort -- --test-threads=1      # all 165 tests (serial for GPU)
cargo test -p forge-sort --release -- --test-threads=1  # faster
cargo test -p forge-sort -- --nocapture --test-threads=1  # see output

165 tests covering all 6 data types (u32/i32/f32/u64/i64/f64), argsort, sort_pairs, 8 sizes (100 to 16M), edge cases (empty, single, all-same, pre-sorted, reverse, non-aligned), buffer reuse, zero-copy SortBuffer API, interleaved usage, and performance sanity.

Architecture

forge-sort/
  build.rs              # compiles sort.metal with -std=metal3.2
  shaders/sort.metal    # 4 Metal compute kernels (~870 lines)
  src/lib.rs            # GpuSorter + SortBuffer (~3,300 lines)
  tests/correctness.rs  # 165 integration tests
  benches/sort_benchmark.rs

Part of the metal-forge-compute workspace. Depends on forge-primitives for Metal device initialization, buffer allocation, and PSO caching.

Memory usage

Elements GPU buffers Total
1M 2 x 4 MB + 6 KB ~8 MB
16M 2 x 64 MB + 6 KB ~128 MB
32M 2 x 128 MB + 6 KB ~256 MB

Buffers are allocated via Apple Silicon unified memory (shared CPU/GPU). They grow on demand and are never shrunk, so repeated calls with the same or smaller sizes have zero allocation overhead. sort_buffer() only allocates scratch space (1 data-sized buffer + 6 KB metadata).

Limitations

  • Synchronous -- all sort methods block until the GPU completes
  • Apple Silicon only -- requires Metal 3.2 (M1+)
  • u32 values for sort_pairs -- value array is always u32 (use argsort for arbitrary reordering)

License

Dual-licensed.

  • Open source: AGPL-3.0 — free for open-source projects that comply with AGPL terms (your code must also be open-sourced under a compatible license).
  • Commercial: Proprietary license available for closed-source / commercial use. Contact kavanagh.patrick@gmail.com for pricing.