# Overclocked Parallel Sort (Rust)
A fast parallel counting sort for Rust.
It combines several optimizations: thread-count auto-tuning, prefix sums over the count array, dynamic work stealing across threads, SIMD auto-vectorized writes, and chunk blocking sized for the L2 cache.
It outperforms Rust's standard `sort_unstable()` and a plain sequential counting sort on large arrays across uniform, skewed, and reversed inputs. Like any counting sort, it needs integer keys with a known upper bound (`max_val`).
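For readers new to the technique, here is a minimal sequential sketch of the counting sort and prefix-sum steps that the crate parallelizes. It is an illustration only, not the crate's implementation: the function name `counting_sort` and the `usize` element type are assumptions made for this example.

```rust
/// Sequential counting sort for values in `0..=max_val`.
/// Hypothetical baseline for illustration; not part of the crate's API.
fn counting_sort(data: &[usize], max_val: usize) -> Vec<usize> {
    // Histogram: counts[v] is the number of occurrences of v.
    let mut counts = vec![0usize; max_val + 1];
    for &v in data {
        counts[v] += 1; // panics if v > max_val
    }

    // Exclusive prefix sum: offsets[v] is the index where the run of v starts.
    let mut offsets = vec![0usize; max_val + 1];
    let mut running = 0;
    for (v, &c) in counts.iter().enumerate() {
        offsets[v] = running;
        running += c;
    }

    // Write each run of equal values into its slot of the output.
    let mut out = vec![0usize; data.len()];
    for (v, &c) in counts.iter().enumerate() {
        out[offsets[v]..offsets[v] + c].fill(v);
    }
    out
}
```

For example, `counting_sort(&[3, 1, 2, 1], 3)` returns `[1, 1, 2, 3]`. Because every run's start offset is known after the prefix sum, the runs can be written independently, which is what makes the write-out phase easy to parallelize.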
## Quick Start
```rust
use overclocked_sort::overclocked_parallel_sort;
fn main() {
    let large_array = vec![120, 5, 1, 99, 1337, 5, 2, 8];
    // Supply the known maximum value
    let max_val = 1337;
    let sorted_data = overclocked_parallel_sort(&large_array, max_val);
    assert_eq!(sorted_data[0], 1);
}
```
## Features
* **Auto-Tuning**: Detects the number of logical cores and sets `num_threads` accordingly.
* **Cache-Friendly Blocking**: Splits large count ranges into 256 KB blocks so that each block's working set stays resident in the CPU's L2/L3 cache while it is processed.
* **SIMD Auto-Vectorization**: Writes each run of equal values with the slice method `fill(val)`, which the compiler can lower to wide vector stores (for example 256-bit AVX or 512-bit AVX-512, when those target features are enabled) over contiguous memory.
* **Dynamic Work-Stealing Load Balancer**: Avoids single-thread bottlenecks on highly skewed distributions (e.g. 80 million identical values landing on one core) by slicing large buckets into batches that all cores claim atomically, without OS-level lock contention.
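The sketch below shows one way the auto-tuning, `fill`-based write-out, and atomic batch claiming described above could fit together, using only the standard library. It is a simplified illustration under stated assumptions, not the crate's code: `parallel_fill`, `BATCH`, and the `usize` element type are hypothetical, and the scheme is a shared atomic task counter rather than per-thread stealing deques. Each batch sits in a `Mutex` only so the example stays in safe Rust; every batch is claimed exactly once, so those locks are never contended.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical batch size in elements; a real implementation would tune this.
const BATCH: usize = 64 * 1024;

/// Expands a histogram (`counts[v]` = occurrences of `v`) into sorted output.
/// Worker threads claim batches through a shared atomic counter.
fn parallel_fill(counts: &[usize]) -> Vec<usize> {
    let total: usize = counts.iter().sum();
    let mut out = vec![0usize; total];
    {
        // Cut the output into per-value runs, then cut each run into batches,
        // so one huge bucket becomes many small tasks.
        let mut tasks: Vec<Mutex<(usize, &mut [usize])>> = Vec::new();
        let mut rest: &mut [usize] = &mut out;
        for (value, &count) in counts.iter().enumerate() {
            let (run, tail) = std::mem::take(&mut rest).split_at_mut(count);
            rest = tail;
            for batch in run.chunks_mut(BATCH) {
                tasks.push(Mutex::new((value, batch)));
            }
        }

        // Auto-tuning: one worker per logical core, falling back to one.
        let workers = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);

        // Index of the next unclaimed task.
        let next = AtomicUsize::new(0);
        thread::scope(|s| {
            for _ in 0..workers {
                s.spawn(|| loop {
                    // Atomically claim the next batch; stop when none remain.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    match tasks.get(i) {
                        Some(task) => {
                            let mut guard = task.lock().unwrap();
                            let value = guard.0;
                            // Bulk write the compiler may auto-vectorize.
                            guard.1.fill(value);
                        }
                        None => break,
                    }
                });
            }
        });
    }
    out
}
```

With this layout, a histogram such as `[0, 200_000, 1]` yields four batches for the value `1` instead of a single 200,000-element task, so idle threads can pick up the remainder of a skewed bucket.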
## Usage
Add it to your `Cargo.toml`:
```toml
[dependencies]
overclocked_sort = "0.1"
```