docs.rs failed to build forge-sort-0.2.3
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

forge-sort

The fastest GPU radix sort on Apple Silicon. 28x faster than Rust's sort_unstable() (ipnsort), 8x faster than Rayon parallel sort, and 8x faster than the next best published Apple Silicon implementation.

use forge_sort::GpuSorter;

let mut sorter = GpuSorter::new()?;
sorter.sort_u32(&mut data)?;

Benchmarks

Measured on Apple M4 Pro (20-core GPU, 48GB unified memory). Median of 7 runs on random u32 data.

vs Rust `sort_unstable` (ipnsort)

Head-to-head comparison: Rust's standard library sort_unstable() — the fastest general-purpose sort in Rust (ipnsort, a pattern-defeating quicksort) — vs forge-sort's GPU radix sort. Same random u32 data, same process. Run with cargo run --release --example vs_stdlib.

Size	`sort_unstable` (CPU)	forge-sort (GPU)	Speedup
10K	0.08 ms / 119 Mk/s	0.67 ms / 15 Mk/s	CPU 7.9x
50K	0.46 ms / 108 Mk/s	0.73 ms / 69 Mk/s	CPU 1.6x
100K	0.96 ms / 104 Mk/s	0.72 ms / 139 Mk/s	GPU 1.3x
500K	4.58 ms / 109 Mk/s	0.97 ms / 513 Mk/s	GPU 4.7x
1M	8.96 ms / 112 Mk/s	1.28 ms / 779 Mk/s	GPU 7.0x
4M	39.6 ms / 101 Mk/s	3.17 ms / 1,260 Mk/s	GPU 12.5x
16M	173.8 ms / 93 Mk/s	6.21 ms / 2,576 Mk/s	GPU 28.0x

GPU crossover at ~100K elements. Below that, CPU wins due to GPU dispatch overhead (~0.6ms). Above that, GPU radix sort dominates -- O(n) radix vs O(n log n) comparison-based, with 273 GB/s memory bandwidth.

vs Rayon parallel sort

Size	`sort_unstable`	`rayon par_sort`	forge-sort	vs CPU	vs Rayon
1M	8.82 ms / 113 Mk/s	2.67 ms / 374 Mk/s	1.20 ms / 836 Mk/s	7.4x	2.2x
4M	38.57 ms / 104 Mk/s	11.35 ms / 353 Mk/s	1.77 ms / 2,255 Mk/s	21.7x	6.4x
16M	168.19 ms / 95 Mk/s	44.08 ms / 363 Mk/s	5.75 ms / 2,784 Mk/s	29.3x	7.7x

sort_buffer (zero-copy, pure GPU speed)

When data is already in a Metal buffer, skip the memcpy for up to 1.8x more throughput:

Size	sort_u32 (w/ memcpy)	sort_buffer (zero-copy)	Speedup
100K	428 Mk/s	428 Mk/s	1.00x
1M	2,510 Mk/s	3,432 Mk/s	1.37x
4M	3,097 Mk/s	5,242 Mk/s	1.69x
16M	2,693 Mk/s	4,131 Mk/s	1.53x

GPU wins above ~200K elements. Sweet spot is 4M where data fits in the System Level Cache.

Multi-type performance (all 6 types @ 16M elements)

Type	sort (w/ memcpy)	sort_buffer (zero-copy)	argsort	sort_pairs
u32	2,802 Mk/s	4,131 Mk/s	1,147 Mk/s	882 Mk/s
i32	2,208 Mk/s	3,255 Mk/s	1,101 Mk/s	804 Mk/s
f32	2,188 Mk/s	3,286 Mk/s	1,118 Mk/s	800 Mk/s
u64	918 Mk/s	1,174 Mk/s	536 Mk/s	450 Mk/s
i64	811 Mk/s	1,034 Mk/s	490 Mk/s	424 Mk/s
f64	813 Mk/s	1,025 Mk/s	503 Mk/s	-

32-bit types sort at ~2,200-2,800 Mk/s. 64-bit types sort at ~800-900 Mk/s (2x data, ~3x slower due to halved tile size + extra passes). Float types match integer performance with uniform bit distribution. Narrow-range floats (e.g. all values in [-1000, 1000]) are slower due to MSD bucket imbalance.

vs other implementations

Implementation	Platform	1M Mk/s	16M Mk/s
forge-sort (zero-copy)	M4 Pro (Metal)	3,432	4,131
forge-sort (sort_u32)	M4 Pro (Metal)	836	2,784
VSort	M4 Pro (CPU+Metal)	99	-
VkRadixSort	RTX 3070 (Vulkan)	53	-
MPS	Apple Silicon	no sort function	no sort function

Metal Performance Shaders does not include a sort primitive. forge-sort fills that gap.

Algorithm

MSD+fused-inner 8-bit radix sort executed in 4 GPU dispatches within a single command encoder:

MSD Histogram -- count byte[24:31] distribution across all tiles
MSD Prep -- exclusive prefix sum + bucket descriptors
Atomic MSD Scatter -- scatter to 256 buckets via atomic_fetch_add (no spin-wait)
Fused Inner Sort -- 3-pass LSD (bytes 0-2) per bucket, self-contained histograms

All dispatches share one encoder with implicit device memory barriers between them. The inner sort kernel computes its own histograms for all 3 passes during its first scan, eliminating a separate precompute dispatch.

Why it's fast

4 total passes over data (1 MSD + 3 inner LSD) vs 4-8 in traditional implementations
Atomic scatter replaces decoupled lookback -- zero spin-waiting, no tile status buffers
Fused inner kernel -- 3 LSD passes execute within a single dispatch, keeping data in SLC
~22 KB threadgroup memory -- fits M4 Pro's 32 KB budget with room to spare
Per-simdgroup atomic histograms -- 8 parallel accumulators reduce TG memory contention by 8x

Requirements

macOS with Apple Silicon (M1 or later)
Metal 3.2 support (for atomic_thread_fence with device scope)
Rust 1.70+
Xcode Command Line Tools (for xcrun metal shader compiler)

Usage

Add to your Cargo.toml:

[dependencies]
forge-sort = "0.2"

Simple (sort_u32)

Copies data to GPU buffer and back. Easiest to use:

use forge_sort::{GpuSorter, SortError};

fn main() -> Result<(), SortError> {
    let mut sorter = GpuSorter::new()?;

    let mut data: Vec<u32> = vec![5, 3, 8, 1, 9, 2, 7, 4, 6, 0];
    sorter.sort_u32(&mut data)?;
    assert_eq!(data, vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);

    Ok(())
}

Zero-copy (sort_buffer)

Write directly into GPU-visible memory. No memcpy, maximum throughput:

use forge_sort::{GpuSorter, SortError};

fn main() -> Result<(), SortError> {
    let mut sorter = GpuSorter::new()?;

    // Allocate a buffer in unified memory (shared CPU/GPU)
    let mut buf = sorter.alloc_sort_buffer(16_000_000);

    // Write data directly into GPU-visible memory
    let slice = buf.as_mut_slice();
    for i in 0..16_000_000 {
        slice[i] = (16_000_000 - i) as u32;
    }
    buf.set_len(16_000_000);

    // Sort in-place — zero memcpy
    sorter.sort_buffer(&buf)?;

    // Read sorted results directly from GPU memory
    assert_eq!(buf.as_slice()[0], 1);

    Ok(())
}

API

pub struct GpuSorter { /* ... */ }
pub struct SortBuffer<T> { /* ... */ }

impl GpuSorter {
    pub fn new() -> Result<Self, SortError>;

    // Sort slices in-place (copies data to/from GPU)
    pub fn sort_u32(&mut self, data: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_i32(&mut self, data: &mut [i32]) -> Result<(), SortError>;
    pub fn sort_f32(&mut self, data: &mut [f32]) -> Result<(), SortError>;
    pub fn sort_u64(&mut self, data: &mut [u64]) -> Result<(), SortError>;
    pub fn sort_i64(&mut self, data: &mut [i64]) -> Result<(), SortError>;
    pub fn sort_f64(&mut self, data: &mut [f64]) -> Result<(), SortError>;

    // Zero-copy sort (data stays in GPU memory)
    pub fn sort_buffer(&mut self, buf: &SortBuffer<u32>) -> Result<(), SortError>;
    pub fn sort_f32_buffer(&mut self, buf: &SortBuffer<f32>) -> Result<(), SortError>;
    pub fn sort_i32_buffer(&mut self, buf: &SortBuffer<i32>) -> Result<(), SortError>;
    pub fn sort_u64_buffer(&mut self, buf: &SortBuffer<u64>) -> Result<(), SortError>;
    pub fn sort_i64_buffer(&mut self, buf: &SortBuffer<i64>) -> Result<(), SortError>;
    pub fn sort_f64_buffer(&mut self, buf: &SortBuffer<f64>) -> Result<(), SortError>;

    // Argsort -- returns index permutation
    pub fn argsort_u32(&mut self, data: &[u32]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_i32(&mut self, data: &[i32]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_f32(&mut self, data: &[f32]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_u64(&mut self, data: &[u64]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_i64(&mut self, data: &[i64]) -> Result<Vec<u32>, SortError>;
    pub fn argsort_f64(&mut self, data: &[f64]) -> Result<Vec<u32>, SortError>;

    // Key-value pair sort -- co-sorts values by key order
    pub fn sort_pairs_u32(&mut self, keys: &mut [u32], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_i32(&mut self, keys: &mut [i32], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_f32(&mut self, keys: &mut [f32], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_u64(&mut self, keys: &mut [u64], values: &mut [u32]) -> Result<(), SortError>;
    pub fn sort_pairs_i64(&mut self, keys: &mut [i64], values: &mut [u32]) -> Result<(), SortError>;

    // Buffer allocation
    pub fn alloc_sort_buffer<T>(&self, capacity: usize) -> SortBuffer<T>;
}

Running benchmarks

cargo bench -p forge-sort

Running tests

cargo test -p forge-sort -- --test-threads=1      # all 165 tests (serial for GPU)
cargo test -p forge-sort --release -- --test-threads=1  # faster
cargo test -p forge-sort -- --nocapture --test-threads=1  # see output

165 tests covering all 6 data types (u32/i32/f32/u64/i64/f64), argsort, sort_pairs, 8 sizes (100 to 16M), edge cases (empty, single, all-same, pre-sorted, reverse, non-aligned), buffer reuse, zero-copy SortBuffer API, interleaved usage, and performance sanity.

Architecture

forge-sort/
  build.rs              # compiles sort.metal with -std=metal3.2
  shaders/sort.metal    # 4 Metal compute kernels (~870 lines)
  src/lib.rs            # GpuSorter + SortBuffer (~3,300 lines)
  tests/correctness.rs  # 165 integration tests
  benches/sort_benchmark.rs

Part of the metal-forge-compute workspace. Depends on forge-primitives for Metal device initialization, buffer allocation, and PSO caching.

Memory usage

Elements	GPU buffers	Total
1M	2 x 4 MB + 6 KB	~8 MB
16M	2 x 64 MB + 6 KB	~128 MB
32M	2 x 128 MB + 6 KB	~256 MB

Buffers are allocated via Apple Silicon unified memory (shared CPU/GPU). They grow on demand and are never shrunk, so repeated calls with the same or smaller sizes have zero allocation overhead. sort_buffer() only allocates scratch space (1 data-sized buffer + 6 KB metadata).

Limitations

Synchronous -- all sort methods block until the GPU completes
Apple Silicon only -- requires Metal 3.2 (M1+)
u32 values for sort_pairs -- value array is always u32 (use argsort for arbitrary reordering)

License

Dual-licensed.

Open source: AGPL-3.0 — free for open-source projects that comply with AGPL terms (your code must also be open-sourced under a compatible license).
Commercial: Proprietary license available for closed-source / commercial use. Contact kavanagh.patrick@gmail.com for pricing.

forge-sort 0.2.3