learned-partition-sort 0.1.0

# Learned Partition Sort — Design Document


*Breaking the O(N log N) barrier with distribution-based sorting.*

## The Problem


Comparison-based sorting algorithms (Quicksort, Mergesort) are fundamentally limited:

| Issue | Impact |
|-------|--------|
| **O(N log N) floor** | Mathematical limit of comparison model |
| **Branch misprediction** | Every `if (A > B)` stalls the CPU pipeline |
| **Ignores value information** | Treats data as opaque boxes |

## The Solution: Calculate, Don't Compare


LPS asks *"Where does this element belong in the distribution?"* instead of *"Is A > B?"*

### Algorithm Overview


| Step | Action | Complexity |
|------|--------|------------|
| 0 | Guard: `if N < 32768 → sort_unstable()` | O(1) |
| 1 | Find `min`, `max` via full scan | O(N) |
| 2 | Count elements per bucket | O(N) |
| 3 | Prefix sum → bucket offsets | O(B) |
| 4 | Scatter to auxiliary buffer | O(N) |
| 5 | Parallel sort buckets (Rayon) | O(N) amortized |
| 6 | Copy back | O(N) |

**Result:** O(N) for well-distributed data, degrades gracefully to O(N log N) on adversarial inputs.

## Key Design Decisions


### Hybridization Threshold

For N < 32768 (32K), setup cost exceeds sorting benefit. Benchmarks show LPS wins starting at ~100K elements.

### Bucket Sizing

Target ~512 elements per bucket to fit in L2 cache. This makes the refine step extremely fast.

### Safety Valve

If a bucket exceeds 4× expected size, the distribution model failed. We detect this and use `sort_unstable` as a robust fallback.

### Memory Trade-off

We accept O(N) auxiliary memory for sequential write performance:
- 100K elements: ~1.6 MB
- 1M elements: ~16 MB
- 10M elements: ~160 MB
- 100M elements: **~1.6 GB** (requires ~2.4 GB peak with cloning)

In-place permutation causes cache thrashing—speed is king.

### Unsafe Usage

Limited to two hot paths:
- `scatter()`: Uses `get_unchecked_mut` for writes
- `SendPtr`: Thread-safe raw pointer wrapper for Rayon

### SIMD Optimization (NOT IMPLEMENTED)

Analyzed but not implemented—low ROI:
- Min/max scan is memory-bound, not compute-bound
- Scatter has data dependencies unsuitable for SIMD
- Real performance comes from Rayon parallelization

## Performance Results


| Size | Distribution | LPS vs `sort_unstable` |
|------|--------------|------------------------|
| 100K | Uniform | **+37% faster** |
| 100K | Skewed | **+64% faster** |
| 1M | Uniform | **+22% faster** |
| 10M | Uniform | **+45% faster** |
| Any | Adversarial (all same) | Tie (graceful fallback) |

See [BENCHMARKS.md](BENCHMARKS.md) for full results.

## Roadmap


- [x] **Phase 1:** Core two-pass logic with parallel bucket refinement
- [x] **Phase 2:** Raise threshold to 32K based on benchmarks
- [x] ~~Phase 3: SIMD optimization~~ (skipped—low ROI)
- [/] **Phase 4:** Benchmark suite against 100M+ elements