overclocked_sort 0.1.0

A hyper-optimized Parallel Counting Sort utilizing L2 Cache-oblivious block sizing, SIMD Auto-vectorization, Prefix-Sum, and Zero-Runtime Dynamic Work Stealing.

Overclocked Parallel Sort (Rust)

A blisteringly fast parallel Counting Sort for Rust. It combines architectural optimizations such as Auto-Tuning, Prefix-Sum, Zero-Runtime Dynamic Work Stealing, SIMD Auto-vectorization, and L2 Cache-oblivious chunk blocking.

It outperforms the standard library's sort_unstable() and a conventional Counting Sort on large arrays across a range of scenarios (uniform, skewed, and reversed distributions).
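At its core this is a counting sort: a histogram pass, a prefix-sum over the counts, then a bulk-write output pass. The following is a minimal single-threaded sketch of that algorithm, not the crate's actual internals (the `counting_sort` helper here is illustrative only):

```rust
// Illustrative sequential counting sort. The running `offset` is the
// prefix sum of the histogram, computed on the fly.
fn counting_sort(input: &[usize], max_val: usize) -> Vec<usize> {
    // Histogram pass: count occurrences of each value.
    let mut counts = vec![0usize; max_val + 1];
    for &v in input {
        counts[v] += 1;
    }
    // Output pass: for each value, bulk-write `count` copies into the
    // contiguous run starting at its prefix-sum offset. `slice::fill`
    // on a contiguous run is the kind of loop compilers auto-vectorize.
    let mut out = vec![0usize; input.len()];
    let mut offset = 0;
    for (value, &count) in counts.iter().enumerate() {
        out[offset..offset + count].fill(value);
        offset += count;
    }
    out
}

fn main() {
    let data = [120, 5, 1, 99, 5, 2];
    assert_eq!(counting_sort(&data, 120), vec![1, 2, 5, 5, 99, 120]);
}
```

The parallel version splits the histogram and output passes across threads; the sketch above only shows the sequential skeleton.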

Quick Start

use overclocked_sort::overclocked_parallel_sort;

fn main() {
    let large_array = vec![120, 5, 1, 99, 1337, 5, 2, 8];
    // Supply the known maximum value in the input
    let max_val = 1337;
    let sorted_data = overclocked_parallel_sort(&large_array, max_val);

    assert_eq!(sorted_data, vec![1, 2, 5, 5, 8, 99, 120, 1337]);
}

Features

  • Auto-Tuning: Automatically scales the number of worker threads to match the logical cores detected at runtime.
  • Cache-oblivious Algorithms: Partitions large count arrays into 256 KB blocks, so the working set is prefetched and retained efficiently in the CPU's L2/L3 caches.
  • SIMD Auto-vectorization: Writes each run of identical output values with std::slice::fill, a contiguous bulk write the compiler can auto-vectorize into wide AVX/AVX-512 stores.
  • Dynamic Work-Stealing Load Balancer: Resolves single-thread bottlenecks on highly skewed distributions (e.g., 80 million identical values landing in one bucket) by slicing large buckets into batches that all cores claim atomically, avoiding locks and OS-level contention.
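The work-stealing bullet can be sketched with a single shared atomic batch counter: each worker claims the next batch with one fetch_add, so no thread ever idles while batches remain. This is a stand-alone illustration, not the crate's code (`process_batches` and its parameters are hypothetical):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Split `total_items` of work into fixed-size batches and let `workers`
// threads steal batches via an atomic counter. Returns the number of
// items processed (stand-in for filling an output bucket).
fn process_batches(total_items: usize, batch: usize, workers: usize) -> usize {
    let next = AtomicUsize::new(0);
    let done = AtomicUsize::new(0);
    let n_batches = (total_items + batch - 1) / batch;
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the next unprocessed batch; no locks involved.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= n_batches {
                    break;
                }
                let start = i * batch;
                let end = (start + batch).min(total_items);
                // Real code would do e.g. `out[start..end].fill(value)` here.
                done.fetch_add(end - start, Ordering::Relaxed);
            });
        }
    });
    done.load(Ordering::Relaxed)
}

fn main() {
    // Every item is processed exactly once regardless of thread timing.
    assert_eq!(process_batches(1_000_003, 4096, 8), 1_000_003);
}
```

Because a claim is a single fetch_add, fast threads naturally absorb more batches from a skewed bucket instead of waiting on a stalled peer.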

Usage

Add it to your Cargo.toml:

[dependencies]
overclocked_sort = "0.1"