# Learned Partition Sort — Design Document
*Breaking the O(N log N) barrier with distribution-based sorting.*
## The Problem
Comparison-based sorting algorithms (Quicksort, Mergesort) are fundamentally limited:
| **O(N log N) floor** | Mathematical limit of comparison model |
| **Branch misprediction** | Every `if (A > B)` stalls the CPU pipeline |
| **Ignores value information** | Treats data as opaque boxes |
## The Solution: Calculate, Don't Compare
LPS asks *"Where does this element belong in the distribution?"* instead of *"Is A > B?"*
### Algorithm Overview
| 0 | Guard: `if N < 32768 → sort_unstable()` | O(1) |
| 1 | Find `min`, `max` via full scan | O(N) |
| 2 | Count elements per bucket | O(N) |
| 3 | Prefix sum → bucket offsets | O(B) |
| 4 | Scatter to auxiliary buffer | O(N) |
| 5 | Parallel sort buckets (Rayon) | O(N) amortized |
| 6 | Copy back | O(N) |
**Result:** O(N) for well-distributed data, degrades gracefully to O(N log N) on adversarial inputs.
## Key Design Decisions
### Hybridization Threshold
For N < 32768 (32K), setup cost exceeds sorting benefit. Benchmarks show LPS wins starting at ~100K elements.
### Bucket Sizing
Target ~512 elements per bucket to fit in L2 cache. This makes the refine step extremely fast.
### Safety Valve
If a bucket exceeds 4× expected size, the distribution model failed. We detect this and use `sort_unstable` as a robust fallback.
### Memory Trade-off
We accept O(N) auxiliary memory for sequential write performance:
- 100K elements: ~1.6 MB
- 1M elements: ~16 MB
- 10M elements: ~160 MB
- 100M elements: **~1.6 GB** (requires ~2.4 GB peak with cloning)
In-place permutation causes cache thrashing—speed is king.
### Unsafe Usage
Limited to two hot paths:
- `scatter()`: Uses `get_unchecked_mut` for writes
- `SendPtr`: Thread-safe raw pointer wrapper for Rayon
### SIMD Optimization (NOT IMPLEMENTED)
Analyzed but not implemented—low ROI:
- Min/max scan is memory-bound, not compute-bound
- Scatter has data dependencies unsuitable for SIMD
- Real performance comes from Rayon parallelization
## Performance Results
| 100K | Uniform | **+37% faster** |
| 100K | Skewed | **+64% faster** |
| 1M | Uniform | **+22% faster** |
| 10M | Uniform | **+45% faster** |
| Any | Adversarial (all same) | Tie (graceful fallback) |
See [BENCHMARKS.md](BENCHMARKS.md) for full results.
## Roadmap
- [x] **Phase 1:** Core two-pass logic with parallel bucket refinement
- [x] **Phase 2:** Raise threshold to 32K based on benchmarks
- [x] ~~Phase 3: SIMD optimization~~ (skipped—low ROI)
- [/] **Phase 4:** Benchmark suite against 100M+ elements