# forge-sort
The fastest GPU radix sort on Apple Silicon. 28x faster than Rust's sort_unstable() (ipnsort), 8x faster than Rayon parallel sort, and 8x faster than the next best published Apple Silicon implementation.
```rust
use forge_sort::GpuSorter;

// data: &mut Vec<u32>
let mut sorter = GpuSorter::new()?;
sorter.sort_u32(&mut data)?;
```
## Benchmarks
Measured on Apple M4 Pro (20-core GPU, 48GB unified memory). Median of 7 runs on random u32 data.
### vs Rust sort_unstable (ipnsort)
Head-to-head comparison: Rust's standard library sort_unstable() — the fastest general-purpose sort in Rust (ipnsort, a pattern-defeating quicksort) — vs forge-sort's GPU radix sort. Same random u32 data, same process. Run with cargo run --release --example vs_stdlib.
| Size | sort_unstable (CPU) | forge-sort (GPU) | Speedup |
|---|---|---|---|
| 10K | 0.08 ms / 119 Mk/s | 0.67 ms / 15 Mk/s | CPU 7.9x |
| 50K | 0.46 ms / 108 Mk/s | 0.73 ms / 69 Mk/s | CPU 1.6x |
| 100K | 0.96 ms / 104 Mk/s | 0.72 ms / 139 Mk/s | GPU 1.3x |
| 500K | 4.58 ms / 109 Mk/s | 0.97 ms / 513 Mk/s | GPU 4.7x |
| 1M | 8.96 ms / 112 Mk/s | 1.28 ms / 779 Mk/s | GPU 7.0x |
| 4M | 39.6 ms / 101 Mk/s | 3.17 ms / 1,260 Mk/s | GPU 12.5x |
| 16M | 173.8 ms / 93 Mk/s | 6.21 ms / 2,576 Mk/s | GPU 28.0x |
GPU crossover at ~100K elements. Below that, CPU wins due to GPU dispatch overhead (~0.6ms). Above that, GPU radix sort dominates -- O(n) radix vs O(n log n) comparison-based, with 273 GB/s memory bandwidth.
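The Mk/s (million keys per second) figures in these tables are simply elements divided by wall-clock time. A small helper makes the conversion explicit (illustrative only, not part of the crate):

```rust
// Converts an element count and elapsed milliseconds into the
// Mk/s (million keys per second) figure used in the tables.
fn mkeys_per_sec(elements: u64, millis: f64) -> f64 {
    elements as f64 / 1.0e6 / (millis / 1.0e3)
}

fn main() {
    // 16M keys in 6.21 ms -> ~2,576 Mk/s, matching the 16M GPU row above.
    println!("{:.0} Mk/s", mkeys_per_sec(16_000_000, 6.21));
}
```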
### vs Rayon parallel sort
| Size | sort_unstable | rayon par_sort | forge-sort | vs CPU | vs Rayon |
|---|---|---|---|---|---|
| 1M | 8.82 ms / 113 Mk/s | 2.67 ms / 374 Mk/s | 1.20 ms / 836 Mk/s | 7.4x | 2.2x |
| 4M | 38.57 ms / 104 Mk/s | 11.35 ms / 353 Mk/s | 1.77 ms / 2,255 Mk/s | 21.7x | 6.4x |
| 16M | 168.19 ms / 95 Mk/s | 44.08 ms / 363 Mk/s | 5.75 ms / 2,784 Mk/s | 29.3x | 7.7x |
### sort_buffer (zero-copy, pure GPU speed)

When data is already in a Metal buffer, skip the memcpy for up to ~1.7x more throughput:
| Size | sort_u32 (w/ memcpy) | sort_buffer (zero-copy) | Speedup |
|---|---|---|---|
| 100K | 428 Mk/s | 428 Mk/s | 1.00x |
| 1M | 2,510 Mk/s | 3,432 Mk/s | 1.37x |
| 4M | 3,097 Mk/s | 5,242 Mk/s | 1.69x |
| 16M | 2,693 Mk/s | 4,131 Mk/s | 1.53x |
GPU wins above ~200K elements. Sweet spot is 4M where data fits in the System Level Cache.
### Multi-type performance (all 6 types @ 16M elements)
| Type | sort (w/ memcpy) | sort_buffer (zero-copy) | argsort | sort_pairs |
|---|---|---|---|---|
| u32 | 2,802 Mk/s | 4,131 Mk/s | 1,147 Mk/s | 882 Mk/s |
| i32 | 2,208 Mk/s | 3,255 Mk/s | 1,101 Mk/s | 804 Mk/s |
| f32 | 2,188 Mk/s | 3,286 Mk/s | 1,118 Mk/s | 800 Mk/s |
| u64 | 918 Mk/s | 1,174 Mk/s | 536 Mk/s | 450 Mk/s |
| i64 | 811 Mk/s | 1,034 Mk/s | 490 Mk/s | 424 Mk/s |
| f64 | 813 Mk/s | 1,025 Mk/s | 503 Mk/s | - |
32-bit types sort at ~2,200-2,800 Mk/s. 64-bit types sort at ~800-900 Mk/s (2x data, ~3x slower due to halved tile size + extra passes). Float types match integer performance with uniform bit distribution. Narrow-range floats (e.g. all values in [-1000, 1000]) are slower due to MSD bucket imbalance.
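Float keys can reuse the integer radix pipeline thanks to the standard order-preserving bit transform for IEEE-754 values. A sketch of the mapping (illustrative, not the crate's actual kernel code):

```rust
// Maps an f32 to a u32 whose unsigned order matches the float's numeric
// order: negative values get all bits flipped (reversing their order),
// positive values get only the sign bit flipped (placing them above).
fn f32_to_ordered_u32(f: f32) -> u32 {
    let bits = f.to_bits();
    if bits & 0x8000_0000 != 0 {
        !bits
    } else {
        bits ^ 0x8000_0000
    }
}

fn main() {
    let mut xs = [1.5f32, -2.0, 0.0, -0.25, 3.0];
    // Sorting by the transformed key yields ascending numeric order.
    xs.sort_by_key(|&x| f32_to_ordered_u32(x));
    println!("{:?}", xs);
}
```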
### vs other implementations
| Implementation | Platform | 1M Mk/s | 16M Mk/s |
|---|---|---|---|
| forge-sort (zero-copy) | M4 Pro (Metal) | 3,432 | 4,131 |
| forge-sort (sort_u32) | M4 Pro (Metal) | 836 | 2,784 |
| VSort | M4 Pro (CPU+Metal) | 99 | - |
| VkRadixSort | RTX 3070 (Vulkan) | 53 | - |
| MPS | Apple Silicon | no sort function | no sort function |
Metal Performance Shaders does not include a sort primitive. forge-sort fills that gap.
## Algorithm

MSD+fused-inner 8-bit radix sort executed in 4 GPU dispatches within a single command encoder:

- MSD Histogram -- count byte[24:31] distribution across all tiles
- MSD Prep -- exclusive prefix sum + bucket descriptors
- Atomic MSD Scatter -- scatter to 256 buckets via atomic_fetch_add (no spin-wait)
- Fused Inner Sort -- 3-pass LSD (bytes 0-2) per bucket, self-contained histograms
All dispatches share one encoder with implicit device memory barriers between them. The inner sort kernel computes its own histograms for all 3 passes during its first scan, eliminating a separate precompute dispatch.
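For intuition, the four phases map onto this single-threaded CPU reference. This is a sketch of the dataflow only; the real version runs as tiled Metal kernels with atomic scatters and shared threadgroup histograms:

```rust
// CPU reference of the MSD + fused-inner-LSD dataflow: one MSD pass on
// the top byte into 256 buckets, then a 3-pass stable LSD sort
// (bytes 0..2) inside each bucket.
fn radix_sort_u32(data: &mut [u32]) {
    let n = data.len();
    if n <= 1 {
        return;
    }
    // Dispatch 1: MSD histogram over bits 24..31.
    let mut hist = [0usize; 256];
    for &x in data.iter() {
        hist[(x >> 24) as usize] += 1;
    }
    // Dispatch 2: exclusive prefix sum -> bucket start offsets.
    let mut offsets = [0usize; 256];
    let mut sum = 0;
    for b in 0..256 {
        offsets[b] = sum;
        sum += hist[b];
    }
    // Dispatch 3: scatter into 256 buckets (atomic_fetch_add on the GPU).
    let mut scratch = vec![0u32; n];
    let mut cursor = offsets;
    for &x in data.iter() {
        let b = (x >> 24) as usize;
        scratch[cursor[b]] = x;
        cursor[b] += 1;
    }
    // Dispatch 4: fused inner sort -- 3 stable counting-sort passes per bucket.
    for b in 0..256 {
        let bucket = &mut scratch[offsets[b]..offsets[b] + hist[b]];
        for pass in 0..3 {
            let shift = pass * 8;
            let mut h = [0usize; 256];
            for &x in bucket.iter() {
                h[((x >> shift) & 0xFF) as usize] += 1;
            }
            let mut offs = [0usize; 256];
            let mut s = 0;
            for d in 0..256 {
                offs[d] = s;
                s += h[d];
            }
            let mut tmp = vec![0u32; bucket.len()];
            for &x in bucket.iter() {
                let d = ((x >> shift) & 0xFF) as usize;
                tmp[offs[d]] = x;
                offs[d] += 1;
            }
            bucket.copy_from_slice(&tmp);
        }
    }
    data.copy_from_slice(&scratch);
}

fn main() {
    let mut v = vec![0xFFAA_0011u32, 3, 7, 7, 0, 42, 0x8000_0000];
    radix_sort_u32(&mut v);
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
    println!("{:?}", v);
}
```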
## Why it's fast
- 4 total passes over data (1 MSD + 3 inner LSD) vs 4-8 in traditional implementations
- Atomic scatter replaces decoupled lookback -- zero spin-waiting, no tile status buffers
- Fused inner kernel -- 3 LSD passes execute within a single dispatch, keeping data in SLC
- ~22 KB threadgroup memory -- fits M4 Pro's 32 KB budget with room to spare
- Per-simdgroup atomic histograms -- 8 parallel accumulators reduce TG memory contention by 8x
## Requirements

- macOS with Apple Silicon (M1 or later)
- Metal 3.2 support (for atomic_thread_fence with device scope)
- Rust 1.70+
- Xcode Command Line Tools (for the xcrun metal shader compiler)
## Usage

Add to your Cargo.toml:

```toml
[dependencies]
forge-sort = "0.2"
```
### Simple (sort_u32)

Copies data to GPU buffer and back. Easiest to use:

```rust
use forge_sort::GpuSorter;

// Sketch: sort_u32 sorts a u32 slice in place.
let mut data: Vec<u32> = (0..1_000_000).rev().collect();
let mut sorter = GpuSorter::new()?;
sorter.sort_u32(&mut data)?;
```
### Zero-copy (sort_buffer)

Write directly into GPU-visible memory. No memcpy, maximum throughput:

```rust
use forge_sort::{GpuSorter, SortBuffer};

// Sketch only -- write keys into the SortBuffer's GPU-visible memory,
// then sort it in place with sort_buffer(); see the crate docs for
// the exact signatures.
let mut sorter = GpuSorter::new()?;
```
## API

## Running benchmarks

## Running tests
165 tests covering all 6 data types (u32/i32/f32/u64/i64/f64), argsort, sort_pairs, 8 sizes (100 to 16M), edge cases (empty, single, all-same, pre-sorted, reverse, non-aligned), buffer reuse, zero-copy SortBuffer API, interleaved usage, and performance sanity.
## Architecture

```
forge-sort/
  build.rs                   # compiles sort.metal with -std=metal3.2
  shaders/sort.metal         # 4 Metal compute kernels (~870 lines)
  src/lib.rs                 # GpuSorter + SortBuffer (~3,300 lines)
  tests/correctness.rs       # 165 integration tests
  benches/sort_benchmark.rs
```
Part of the metal-forge-compute workspace. Depends on forge-primitives for Metal device initialization, buffer allocation, and PSO caching.
## Memory usage
| Elements | GPU buffers | Total |
|---|---|---|
| 1M | 2 x 4 MB + 6 KB | ~8 MB |
| 16M | 2 x 64 MB + 6 KB | ~128 MB |
| 32M | 2 x 128 MB + 6 KB | ~256 MB |
Buffers are allocated via Apple Silicon unified memory (shared CPU/GPU). They grow on demand and are never shrunk, so repeated calls with the same or smaller sizes have zero allocation overhead. sort_buffer() only allocates scratch space (1 data-sized buffer + 6 KB metadata).
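The table's totals follow directly from two data-sized u32 buffers plus the fixed metadata. A hypothetical helper shows the arithmetic (`sort_u32_mem_bytes` is illustrative, not a crate API):

```rust
// Scratch footprint of a u32 sort: two element-sized buffers
// (input + ping-pong) plus ~6 KB of histogram/descriptor metadata.
fn sort_u32_mem_bytes(elements: usize) -> usize {
    2 * elements * std::mem::size_of::<u32>() + 6 * 1024
}

fn main() {
    // 1M elements -> ~8 MB, matching the first row of the table.
    println!("{} bytes", sort_u32_mem_bytes(1_000_000));
}
```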
## Limitations
- Synchronous -- all sort methods block until the GPU completes
- Apple Silicon only -- requires Metal 3.2 (M1+)
- u32 values for sort_pairs -- value array is always u32 (use argsort for arbitrary reordering)
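Since sort_pairs only carries u32 payloads, arbitrary payloads are reordered on the CPU from the index array argsort produces. A sketch of that step (`apply_permutation` is illustrative, not a crate function):

```rust
// Reorders `values` according to the index array an argsort produces:
// output[i] = values[indices[i]].
fn apply_permutation<T: Clone>(indices: &[u32], values: &[T]) -> Vec<T> {
    indices.iter().map(|&i| values[i as usize].clone()).collect()
}

fn main() {
    // Keys [30, 10, 20] argsort to indices [1, 2, 0];
    // applying that permutation puts the payloads in key order.
    let values = vec!["thirty", "ten", "twenty"];
    println!("{:?}", apply_permutation(&[1, 2, 0], &values));
}
```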
## License
Dual-licensed.
- Open source: AGPL-3.0 — free for open-source projects that comply with AGPL terms (your code must also be open-sourced under a compatible license).
- Commercial: Proprietary license available for closed-source / commercial use. Contact kavanagh.patrick@gmail.com for pricing.