Overclocked Parallel Sort (Rust)
A blisteringly fast parallel Counting Sort for Rust. It utilizes architectural optimizations such as Auto-Tuning, Prefix-Sum, Zero-Runtime Dynamic Thread Stealing, SIMD Auto-vectorization, and L2 Cache-oblivious chunk blocking.
It defeats standard Rust sort_unstable() and normal Counting Sort on large arrays under various scenarios (Uniform, Skewed Distributions, Reversed datasets).
Quick Start
use overclocked_parallel_sort;
Features
- Auto-Tuning: Automatically discovers and scales
num_threadsby detecting the logical cores. - Cache-oblivious Algorithms: Partitions large counts into perfect 256KB block loops, ensuring data is prefetched and retained efficiently in CPU's L2/L3 Caches.
- SIMD Data-level Vectorization: Bypasses the compiler scheduler to automatically invoke wide AVX/AVX-512 register instructions
std::slice::fill(val)to bulk-write 256-bit elements unconditionally to contiguous RAM. - Dynamic Work-Stealing Load Balancer: Resolves single-thread bottlenecks on highly skewed distributions (ex: 80 million identical values being stuck on 1 core) by slicing large buckets and letting all cores atomically steal the remaining batches without OS contention.
Usage
Simply add it to your Cargo.toml:
[]
= "0.1"