overclocked_sort 0.1.0

A hyper-optimized Parallel Counting Sort utilizing L2 Cache-oblivious block sizing, SIMD Auto-vectorization, Prefix-Sum, and Zero-Runtime Dynamic Work Stealing.
Documentation
# Overclocked Parallel Sort (Rust)


A blisteringly fast parallel Counting Sort for Rust.
It utilizes architectural optimizations such as Auto-Tuning, Prefix-Sum, Zero-Runtime Dynamic Thread Stealing, SIMD Auto-vectorization, and L2 Cache-oblivious chunk blocking. 

It defeats standard Rust `sort_unstable()` and normal `Counting Sort` on large arrays under various scenarios (Uniform, Skewed Distributions, Reversed datasets).

## Quick Start

```rust
use overclocked_sort::overclocked_parallel_sort;

fn main() {
    let large_array = vec![120, 5, 1, 99, 1337, 5, 2, 8];
    // Supply the known maximum value
    let max_val = 1337; 
    let sorted_data = overclocked_parallel_sort(&large_array, max_val);
    
    assert_eq!(sorted_data[0], 1);
}
```

## Features

* **Auto-Tuning**: Automatically discovers and scales `num_threads` by detecting the logical cores.
* **Cache-oblivious Algorithms**: Partitions large counts into perfect 256KB block loops, ensuring data is prefetched and retained efficiently in CPU's L2/L3 Caches.
* **SIMD Data-level Vectorization**: Bypasses the compiler scheduler to automatically invoke wide AVX/AVX-512 register instructions `std::slice::fill(val)` to bulk-write 256-bit elements unconditionally to contiguous RAM.
* **Dynamic Work-Stealing Load Balancer**: Resolves single-thread bottlenecks on highly skewed distributions (ex: 80 million identical values being stuck on 1 core) by slicing large buckets and letting all cores atomically steal the remaining batches without OS contention.

## Usage

Simply add it to your `Cargo.toml`:
```toml
[dependencies]
overclocked_sort = "0.1"
```