overclocked_sort 0.1.0

A hyper-optimized Parallel Counting Sort utilizing L2 Cache-oblivious block sizing, SIMD Auto-vectorization, Prefix-Sum, and Zero-Runtime Dynamic Work Stealing.

Overclocked Parallel Sort (Rust)

A blisteringly fast parallel Counting Sort for Rust. It combines architectural optimizations such as Auto-Tuning, Prefix-Sum, Zero-Runtime Dynamic Work Stealing, SIMD Auto-vectorization, and L2 Cache-oblivious chunk blocking.

It outperforms the standard library's sort_unstable() and a conventional Counting Sort on large arrays across a range of scenarios (uniform, skewed, and reversed distributions).
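At its core this is a counting sort: a histogram pass, a prefix-sum over the counts, then a bulk-write output pass. The following is a minimal single-threaded sketch of that algorithm, not the crate's actual internals (the `counting_sort` helper here is illustrative only):

```rust
// Illustrative sequential counting sort. The running `offset` is the
// prefix sum of the histogram, computed on the fly.
fn counting_sort(input: &[usize], max_val: usize) -> Vec<usize> {
    // Histogram pass: count occurrences of each value.
    let mut counts = vec![0usize; max_val + 1];
    for &v in input {
        counts[v] += 1;
    }
    // Output pass: for each value, bulk-write `count` copies into the
    // contiguous run starting at its prefix-sum offset. `slice::fill`
    // on a contiguous run is the kind of loop compilers auto-vectorize.
    let mut out = vec![0usize; input.len()];
    let mut offset = 0;
    for (value, &count) in counts.iter().enumerate() {
        out[offset..offset + count].fill(value);
        offset += count;
    }
    out
}

fn main() {
    let data = [120, 5, 1, 99, 5, 2];
    assert_eq!(counting_sort(&data, 120), vec![1, 2, 5, 5, 99, 120]);
}
```

The parallel version splits the histogram and output passes across threads; the sketch above only shows the sequential skeleton.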

Quick Start

use overclocked_sort::overclocked_parallel_sort;

fn main() {
    let large_array = vec![120, 5, 1, 99, 1337, 5, 2, 8];
    // Supply the known maximum value in the input
    let max_val = 1337;
    let sorted_data = overclocked_parallel_sort(&large_array, max_val);

    assert_eq!(sorted_data, vec![1, 2, 5, 5, 8, 99, 120, 1337]);
}

Features

  • Auto-Tuning: Automatically scales the number of worker threads to match the logical cores detected at runtime.
  • Cache-oblivious Algorithms: Partitions large count arrays into 256 KB blocks, so the working set is prefetched and retained efficiently in the CPU's L2/L3 caches.
  • SIMD Auto-vectorization: Writes each run of identical output values with std::slice::fill, a contiguous bulk write the compiler can auto-vectorize into wide AVX/AVX-512 stores.
  • Dynamic Work-Stealing Load Balancer: Resolves single-thread bottlenecks on highly skewed distributions (e.g., 80 million identical values landing in one bucket) by slicing large buckets into batches that all cores claim atomically, avoiding locks and OS-level contention.
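The work-stealing bullet can be sketched with a single shared atomic batch counter: each worker claims the next batch with one fetch_add, so no thread ever idles while batches remain. This is a stand-alone illustration, not the crate's code (`process_batches` and its parameters are hypothetical):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Split `total_items` of work into fixed-size batches and let `workers`
// threads steal batches via an atomic counter. Returns the number of
// items processed (stand-in for filling an output bucket).
fn process_batches(total_items: usize, batch: usize, workers: usize) -> usize {
    let next = AtomicUsize::new(0);
    let done = AtomicUsize::new(0);
    let n_batches = (total_items + batch - 1) / batch;
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the next unprocessed batch; no locks involved.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= n_batches {
                    break;
                }
                let start = i * batch;
                let end = (start + batch).min(total_items);
                // Real code would do e.g. `out[start..end].fill(value)` here.
                done.fetch_add(end - start, Ordering::Relaxed);
            });
        }
    });
    done.load(Ordering::Relaxed)
}

fn main() {
    // Every item is processed exactly once regardless of thread timing.
    assert_eq!(process_batches(1_000_003, 4096, 8), 1_000_003);
}
```

Because a claim is a single fetch_add, fast threads naturally absorb more batches from a skewed bucket instead of waiting on a stalled peer.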

Usage

Add it to your Cargo.toml:

[dependencies]
overclocked_sort = "0.1"