# Peakbagger

Bag sequential local elevation maxima.

## TODO

- [x] Benchmark performance
- [x] Profile benchmark
- [x] Optimize performance (7.91s → 3.35s, 58% faster)
- [ ] Handle bad SRTM data

## Optimizations

### Replace `HashMap` with `FxHashMap` in `TileCache`

Profiling showed 27.5% of CPU time spent in `HashMap` hashing (`hash_one`
20.4% + `Hasher::write` 7.1%). `TileCoord` is only 4 bytes (two `i16`s), but
the default `HashMap` hasher is SipHash. Switched to `FxHashMap` from the
`rustc-hash` crate, which uses a fast multiply-xor hash.

- Baseline: 7.91s
- After: 5.16s
- **Improvement: 35%**
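The swap itself is a one-line type change. As a self-contained illustration of why it is cheap, here is a hypothetical fx-style hasher (a simplified per-byte stand-in for `rustc-hash`'s real per-word `FxHasher`) plugged into a std `HashMap`; `TileCoord`'s field names are assumed:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical fx-style hasher: one rotate-xor-multiply per byte instead of
// SipHash rounds. (The real FxHasher in rustc-hash processes usize words.)
#[derive(Default)]
struct FxStyleHasher(u64);

impl Hasher for FxStyleHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0.rotate_left(5) ^ b as u64).wrapping_mul(0x517cc1b727220a95);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Assumed shape of the 4-byte tile key.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TileCoord {
    lat: i16,
    lon: i16,
}

// Drop-in replacement for HashMap<TileCoord, V> with the fast hasher.
type FastMap<V> = HashMap<TileCoord, V, BuildHasherDefault<FxStyleHasher>>;

fn lookup_works() -> bool {
    let mut tiles: FastMap<&str> = FastMap::default();
    tiles.insert(TileCoord { lat: 46, lon: 7 }, "N46E007");
    tiles.get(&TileCoord { lat: 46, lon: 7 }) == Some(&"N46E007")
}

fn main() {
    assert!(lookup_works());
}
```

In real code the change is just `HashMap<TileCoord, Tile>` → `rustc_hash::FxHashMap<TileCoord, Tile>`; the custom hasher above only exists to keep the example dependency-free.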

### Cache current tile pointer in `find_next_peak`

The ring search visits many consecutive cells in the same tile, but each
`cache.elevation()` call did a hash lookup. Added a `CachedTile` that
stores a raw pointer to the current tile's elevation data and skips the
hash lookup when the tile coordinate hasn't changed. The pointer is
stable because `Vec` heap data doesn't move when the `HashMap` resizes.

- Baseline: 5.16s
- After: 4.61s
- **Improvement: 11%**
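A minimal sketch of the idea, with assumed names (`CachedTile`, `elevation`) and a `Vec<i16>` per tile standing in for the real elevation data:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct TileCoord {
    lat: i16,
    lon: i16,
}

// Remembers a raw pointer into the last-used tile's elevation Vec so that
// consecutive reads from the same tile skip the hash lookup entirely.
struct CachedTile {
    coord: TileCoord,
    data: *const i16, // points into a Vec owned by `TileCache::tiles`
    len: usize,
}

struct TileCache {
    tiles: HashMap<TileCoord, Vec<i16>>,
    cached: Option<CachedTile>,
}

impl TileCache {
    fn elevation(&mut self, coord: TileCoord, idx: usize) -> Option<i16> {
        // Fast path: same tile as the previous call -> no hash lookup.
        if let Some(c) = &self.cached {
            if c.coord == coord && idx < c.len {
                // SAFETY: the pointer targets the Vec's heap allocation, which
                // does not move when the HashMap resizes; tiles are never
                // removed or mutated while a CachedTile is live.
                return Some(unsafe { *c.data.add(idx) });
            }
        }
        // Slow path: hash lookup, then refresh the cached pointer.
        let tile = self.tiles.get(&coord)?;
        self.cached = Some(CachedTile { coord, data: tile.as_ptr(), len: tile.len() });
        tile.get(idx).copied()
    }
}

fn demo_reads() -> [Option<i16>; 3] {
    let mut cache = TileCache { tiles: HashMap::new(), cached: None };
    let coord = TileCoord { lat: 46, lon: 7 };
    cache.tiles.insert(coord, vec![100, 250, 3600]);
    // First read fills the cache; the second hits the pointer fast path.
    [cache.elevation(coord, 1), cache.elevation(coord, 2), cache.elevation(coord, 5)]
}

fn main() {
    assert_eq!(demo_reads(), [Some(250), Some(3600), None]);
}
```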

### Bulk memcpy + vectorized byte-swap in `Tile::from_bytes`

The per-element loop doing `i16::from_be_bytes` was not vectorized by
LLVM. Replaced with `copy_nonoverlapping` followed by an in-place
`i16::from_be` loop, which LLVM vectorizes into SIMD byte-swap
instructions.

- Baseline: 4.61s
- After: 3.61s
- **Improvement: 22%**
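The decode can be sketched as follows; `decode_be_samples` is a hypothetical name, not the real `Tile::from_bytes` signature:

```rust
// Bulk-copy raw big-endian SRTM bytes into an i16 buffer, then byte-swap in
// place. LLVM vectorizes the simple from_be loop, unlike the original
// per-element i16::from_be_bytes loop.
fn decode_be_samples(bytes: &[u8]) -> Vec<i16> {
    assert!(bytes.len() % 2 == 0, "expected whole 16-bit samples");
    let n = bytes.len() / 2;
    let mut samples = vec![0i16; n];
    // SAFETY: the destination holds exactly n i16s = bytes.len() bytes, and
    // the source and destination buffers do not overlap.
    unsafe {
        std::ptr::copy_nonoverlapping(
            bytes.as_ptr(),
            samples.as_mut_ptr() as *mut u8,
            bytes.len(),
        );
    }
    // On little-endian targets this byte-swaps each element; on big-endian
    // targets from_be is a no-op and the loop disappears.
    for s in &mut samples {
        *s = i16::from_be(*s);
    }
    samples
}

fn main() {
    // 0x0102 = 258; 0xFFFE = -2 as i16.
    assert_eq!(decode_be_samples(&[0x01, 0x02, 0xFF, 0xFE]), vec![258, -2]);
}
```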

### Reorder `ring_cells` for spatial coherence

The iteration interleaved top and bottom ring edges, alternating between
opposite ends on every cell. This defeated the `CachedTile` optimization,
forcing a hash lookup on nearly every cell. Reordering to iterate each
edge contiguously keeps consecutive cells in the same tile.

- Baseline: 3.61s
- After: 3.30s
- **Improvement: 9%**
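An edge-contiguous ordering might look like this; the coordinate layout and the `ring_cells` signature are assumptions, not taken from the real code:

```rust
// Cells at Chebyshev distance r from (cr, cc), walked one edge at a time so
// consecutive cells stay spatially adjacent (and usually in the same tile),
// instead of alternating between the top and bottom rows.
fn ring_cells(cr: i32, cc: i32, r: i32) -> Vec<(i32, i32)> {
    if r == 0 {
        return vec![(cr, cc)];
    }
    let mut cells = Vec::with_capacity((8 * r) as usize);
    for dc in -r..=r {
        cells.push((cr - r, cc + dc)); // top edge, left to right
    }
    for dr in -r + 1..r {
        cells.push((cr + dr, cc - r)); // left edge, top to bottom
    }
    for dr in -r + 1..r {
        cells.push((cr + dr, cc + r)); // right edge, top to bottom
    }
    for dc in -r..=r {
        cells.push((cr + r, cc + dc)); // bottom edge, left to right
    }
    cells
}

fn main() {
    assert_eq!(ring_cells(0, 0, 0), vec![(0, 0)]);
    let ring = ring_cells(10, 10, 2);
    assert_eq!(ring.len(), 16); // 8r cells per ring, no duplicates
    assert!(ring.iter().all(|&(r, c)| (r - 10).abs().max((c - 10).abs()) == 2));
}
```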

### Skip tiles below `min_elevation` threshold

Added `max_elevation` field to `Tile`. When the search enters a new
tile, if `max_elevation ≤ min_elevation`, all cells in that tile are
skipped without reading elevation data, avoiding cache misses entirely.

- Benchmark (35km searches): no significant change (tiles nearby have
  qualifying peaks)
- Bag 55 (207km search, min 2550m): **~5s → 2.7s (46% faster)** — most
  tiles in the search area are below 2550m and get skipped
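A sketch of the check, assuming a `Tile` that precomputes its maximum sample at load time (`from_samples` and `tile_can_contain_peak` are hypothetical names):

```rust
struct Tile {
    samples: Vec<i16>,
    // Computed once at load time so the search can reject a whole tile
    // without ever touching `samples` (and its cache lines).
    max_elevation: i16,
}

impl Tile {
    fn from_samples(samples: Vec<i16>) -> Self {
        let max_elevation = samples.iter().copied().max().unwrap_or(i16::MIN);
        Tile { samples, max_elevation }
    }
}

fn tile_can_contain_peak(tile: &Tile, min_elevation: i16) -> bool {
    // If max_elevation <= min_elevation, no cell can qualify and the tile
    // is skipped entirely.
    tile.max_elevation > min_elevation
}

fn main() {
    let low = Tile::from_samples(vec![1800, 2400, 2100]);
    let high = Tile::from_samples(vec![1800, 2600, 2100]);
    assert!(!tile_can_contain_peak(&low, 2550));
    assert!(tile_can_contain_peak(&high, 2550));
}
```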

### Fast path in `resolve_global` for same-tile cells

`div_euclid` / `rem_euclid` compile to `idiv` instructions (~30
cycles each). Added a fast path: if the global row and column are both in
[0, stride), the cell is in the center tile and no division is needed. An
unsigned-comparison trick checks both bounds of each coordinate with a
single instruction.

- Baseline: 3.35s
- After: 3.27s
- **Improvement: 2%**
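The trick relies on negative `i32` values wrapping to huge `u32` values when cast, so one comparison covers both bounds. A sketch with assumed signatures (`stride` being the tile width in cells):

```rust
// `(v as u32) < stride as u32` is true exactly when 0 <= v < stride,
// because any negative i32 casts to a u32 larger than any sane stride.
#[inline]
fn in_center_tile(row: i32, col: i32, stride: i32) -> bool {
    (row as u32) < stride as u32 && (col as u32) < stride as u32
}

// Hypothetical shape of resolve_global: returns (tile_row, tile_col,
// local_row, local_col) relative to the center tile.
fn resolve_global(row: i32, col: i32, stride: i32) -> (i32, i32, i32, i32) {
    if in_center_tile(row, col, stride) {
        // Fast path: cell is in the center tile, no idiv needed.
        (0, 0, row, col)
    } else {
        // Slow path: Euclidean division into tile offset + local cell.
        (
            row.div_euclid(stride),
            col.div_euclid(stride),
            row.rem_euclid(stride),
            col.rem_euclid(stride),
        )
    }
}

fn main() {
    assert_eq!(resolve_global(5, 7, 3600), (0, 0, 5, 7));
    assert_eq!(resolve_global(-1, 5, 3600), (-1, 0, 3599, 5));
    assert_eq!(resolve_global(3600, 0, 3600), (1, 0, 0, 0));
}
```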

### Reuse `ring_cells` Vec across rings (reverted)

Attempted to reuse a single `Vec` across ring iterations instead of
allocating a new one per ring. This caused an 8% regression, likely
because `Vec::with_capacity` with the exact ring size produces
better-fitting allocations than reusing one ever-growing buffer.