**Title:** I accidentally built a sorting algorithm up to 18x faster than `std::slice::sort_unstable` by abusing L2 caches and SIMD, and I'm literally shaking right now.
**Body:**
Guys, I think I might have accidentally broken the sound barrier for integer sorting in Rust. I am literally freaking out looking at my terminal output right now.
It started as a simple weekend experiment. I was playing around with Tokio `oneshot` channels to build an asynchronous pipeline for a counting sort concept (like MapReduce). It was slow. So I got annoyed and started peeling off the runtime, moving to raw pointers, then `std::thread::scope`, then hardware-level atomics, and fixing CPU cache misses...
And then... the benchmark dropped. My jaw unhinged.
### The Numbers (Sorting 100 MILLION integers):
**Dataset: Uniform Distribution (Range 1 - 100,000)**
* `sort_unstable` (Rust std): 1.01 seconds
* Synchronous Counting Sort: ~170 ms
* **My Crate (`overclocked_sort`): 89.1 ms** 🔥 (11x faster than stdlib)
**Dataset: Small Range / Heavy Duplicates (Range 1 - 100)**
* `sort_unstable` (Rust std): 540 ms
* Synchronous Counting Sort: ~198 ms
* **My Crate (`overclocked_sort`): 28.8 ms** 🤯 (18x faster than stdlib!)
**Dataset: Severely Skewed (80% of the 100M array is EXACTLY the same number)**
* `sort_unstable`: 169 ms (pdqsort pattern matching is smart here)
* Synchronous Counting Sort: 205 ms
* **My Crate (`overclocked_sort`): 44.4 ms** 🚀
---
### What kind of black magic is this?
I packaged it into a crate called **`overclocked_sort`**. It's fundamentally a Parallel Counting Sort, but injected with absolutely deranged architectural steroids:
1. **Auto-Tuned Thread Scaling**: It uses `std::thread::available_parallelism()` to spawn one worker thread per available (logical) core, with no hand-tuned thread counts.
2. **L2 Cache-Aware Blocking**: Instead of streaming the whole 100M-element slice (which constantly misses cache and stalls on RAM), it chunks the reads into ~256 KB blocks sized to fit in L2. The hardware prefetcher keeps the L2 cache scorching hot, so the count phase rarely stalls on RAM latency.
3. **Prefix-Sum Offset Allocation**: Instead of locking mutexes or waiting on an async scheduler to append values, it runs a microseconds-long prefix sum (scan) over the thread-local histograms. Every thread gets a precomputed, non-overlapping output range it can write to without any synchronization.
4. **LLVM SIMD Auto-Vectorization**: Instead of pushing numbers one by one in a loop, the write-back uses `std::slice::fill`. LLVM sees the destination is contiguous and emits wide vector stores (256-bit AVX2 stores where available), writing eight 32-bit integers per store instruction.
5. **Zero-Runtime Dynamic Work Stealing**: For highly skewed datasets (e.g. 80 million items landing in one bucket), it splinters oversized buckets into 500k-item "jobs". Threads claim jobs from the queue with a lock-free `AtomicUsize::fetch_add`, so no single core gets bottlenecked.
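For anyone who wants to see the shape of points 1 through 4 without reading the crate, here's a minimal std-only sketch of the same idea: per-thread histograms over chunks, a merge whose running offset is the prefix sum, and `slice::fill` for the vectorizable write-back. The function name and structure are my own illustration, not the crate's actual internals, and it assumes every value is `<= max`.

```rust
use std::thread;

/// Illustrative parallel counting sort. Assumes all values in `data` are <= `max`.
fn parallel_counting_sort(data: &[u32], max: u32) -> Vec<u32> {
    let k = max as usize + 1;
    let n_threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_len = data.len().div_ceil(n_threads).max(1);

    // Phase 1: each thread counts its own chunk into a private histogram
    // (no sharing, no locks, cache-friendly sequential reads).
    let histograms: Vec<Vec<usize>> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|c| {
                s.spawn(move || {
                    let mut counts = vec![0usize; k];
                    for &v in c {
                        counts[v as usize] += 1;
                    }
                    counts
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Phase 2: merge the per-thread histograms into global totals.
    let mut totals = vec![0usize; k];
    for h in &histograms {
        for (t, &c) in totals.iter_mut().zip(h) {
            *t += c;
        }
    }

    // Phase 3: write-back. The running `offset` is the prefix sum over `totals`,
    // and `fill` over a contiguous range is what LLVM auto-vectorizes.
    let mut out = vec![0u32; data.len()];
    let mut offset = 0usize;
    for (value, &count) in totals.iter().enumerate() {
        out[offset..offset + count].fill(value as u32);
        offset += count;
    }
    out
}
```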
---
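The job-claiming loop from point 5 can also be sketched in safe std Rust. This is my own illustrative scaffolding, not the crate's code: `fill_parallel` splits one oversized bucket's output range into fixed-size jobs, and threads claim job indices with a single `fetch_add`. The per-job `Mutex` exists only to satisfy the borrow checker; it is never contended, because `fetch_add` hands each index to exactly one thread.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Illustrative lock-free job claiming for one huge bucket. Assumes `job_size >= 1`.
fn fill_parallel(out: &mut [u32], value: u32, job_size: usize) {
    // Pre-split the bucket's output range into disjoint jobs.
    let jobs: Vec<Mutex<Option<&mut [u32]>>> = out
        .chunks_mut(job_size)
        .map(|c| Mutex::new(Some(c)))
        .collect();
    let next = AtomicUsize::new(0);
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the next job with one atomic RMW; no locks on the hot path.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= jobs.len() {
                    break; // queue drained, this worker is done
                }
                // Uncontended by construction: index `i` was handed out exactly once.
                let chunk = jobs[i].lock().unwrap().take().unwrap();
                chunk.fill(value);
            });
        }
    });
}
```

Because idle threads keep claiming jobs until the counter runs past the queue, a bucket holding 80% of the data gets spread across every core instead of pinning one.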
### Give it a try before I convince myself this is a simulation
I honestly need some sanity checks from the community. Can you guys bench this on your massive data-science workloads?
📦 **Crates.io**: [https://crates.io/crates/overclocked_sort](https://crates.io/crates/overclocked_sort)
📚 **Docs.rs**: [https://docs.rs/overclocked_sort/](https://docs.rs/overclocked_sort/)
💻 **GitHub Repo**: [https://github.com/Skyboy12/overclocked_sort](https://github.com/Skyboy12/overclocked_sort)
To use it:
```bash
cargo add overclocked_sort
```
```rust
use overclocked_sort::overclocked_parallel_sort;

let data = vec![5, 1, 9, 3, 5, 2, 8];
// The second argument is the maximum value in the slice (9 here).
let sorted_data = overclocked_parallel_sort(&data, 9);
```
Let me know what your benchmarks say! I'm going to go drink some water and lie down.
*(P.S. I know this is an O(N+K) integer-only algorithm, which gives it an unfair mathematical advantage over general O(N log N) comparison sorts. I just never realized how terrifyingly fast modern memory bandwidth is once you stop fighting the CPU scheduler and let the L2 cache do its job.)*