# time-filter optimization

`brokkr time-filter --dataset <X>`: filter a sorted PBF to a snapshot
at a cutoff timestamp. Designed for history PBFs (pick the latest
version per (kind, id) with `timestamp <= cutoff`, drop if
`visible=false`); degrades to a per-element timestamp filter when the
input is a snapshot (all our current datasets).

Active code: [`src/commands/time_filter.rs`](../src/commands/time_filter.rs).

## Architecture

Dispatched on `header.has_historical_information()`:

- **History input** (`time_filter_history`): sequential
  pending-group state machine on the 3-stage pipelined reader
  (`for_each_pipelined`). Parallel decode, sequential
  state-machine + BlockBuilder on the callback thread. Version
  selection needs cross-element peek, and versions of one
  (kind, id) can straddle blob boundaries, so trivial per-block
  parallelism is unsafe.

- **Snapshot input** (`time_filter_snapshot`): parallel per-block.
  Each (kind, id) appears exactly once in a sorted snapshot PBF,
  so blocks are independent. Same shape as tags-filter
  single-pass: `for_each_primitive_block_batch` on
  `into_blocks_pipelined` + `batch.par_iter().map_init(
  BlockBuilder::new, ...)`. Workers iterate elements by
  reference, drop elements with `timestamp > cutoff` or
  `visible=false`, write survivors into a local BlockBuilder via
  `ensure_*_capacity_local` / `flush_local`. Consumer drains
  `Vec<OwnedBlock>` in batch order; the writer's own rayon pool
  handles compression (sketch below).
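
A minimal sketch of that snapshot-path shape (toy element/block
types; only the `par_iter().map_init(BlockBuilder::new, ...)` call
and the order-preserving collect mirror the real code):

```rust
use rayon::prelude::*;

// Toy stand-ins for the real decoded-block and builder types.
struct Element { timestamp: i64, visible: bool }
struct Block { elements: Vec<Element> }
struct OwnedBlock { bytes: Vec<u8> }

struct BlockBuilder { buf: Vec<u8> }
impl BlockBuilder {
    fn new() -> Self { Self { buf: Vec::new() } }
    fn add(&mut self, _e: &Element) { self.buf.push(0) } // placeholder encode
    fn take_owned(&mut self) -> OwnedBlock {
        OwnedBlock { bytes: std::mem::take(&mut self.buf) }
    }
}

/// Filter one decoded batch in parallel. Each rayon task gets a local
/// BlockBuilder; results collect in batch order, so the consumer can
/// forward them straight to the writer's compression pool.
fn process_snapshot_batch(batch: &[Block], cutoff: i64) -> Vec<OwnedBlock> {
    batch
        .par_iter()
        .map_init(BlockBuilder::new, |bb, block| {
            for e in &block.elements {
                // Snapshot predicate: drop post-cutoff or deleted elements.
                if e.timestamp <= cutoff && e.visible {
                    bb.add(e);
                }
            }
            bb.take_owned()
        })
        .collect()
}
```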

## Baselines

| Commit | Dataset | Mode | Wall | UUID | Notes |
|---|---|---|---|---|---|
| `1e00c3d` | japan 2.4 GB | `--bench 1` | **44.1 s** | `444823d8` | sequential `for_each`, pre-optimization |
| `3035115` | japan 2.4 GB | `--bench 1` | **37.0 s** | `6e767a67` | iter 1: `for_each_pipelined`, avg cores 2.4 |
| `f45189e` | japan 2.4 GB | `--bench 1 --force` | **7.1 s** | dirty | iter 2: parallel per-block, avg cores 20.2 |
| `f45189e` | europe 35 GB | `--bench 1` | **95.1 s** | `a5d77c9a` | iter 2 at scale, peak anon 20 GB |
| iter 3 | europe 35 GB | `--bench 1` | **94.7 s** | `8b676229` | budgeted batch 128 MB, peak anon **18.1 GB** (-10 %) |
| iter 4 (pool landed) | europe 35 GB | `--bench 1 --force` | **94.3 s** | dirty | Vec pool + pre-grow 512 KB, peak anon **18.3 GB** (no-op vs iter 3) |
| iter 5 (pool works) | europe 35 GB | `--bench 1` | **92.6 s** | `6683cb05` | thread_local BlockBuilder; peak anon **16.9 GB** (-15.5 % vs iter 2); alloc churn -87 % |
| iter 6 (reverted) | europe 35 GB | `--bench 1 --force` | 112.2 s (+21 %) | dirty | raw-frame passthrough + pread workers - **regressed**, reverted |

Throughput at iter 2: ~370 MB/s input. `writer_reorder_high_water`
jumped 4 → 64 (compression pool saturated).

## Instrumentation

- End-of-run counters (cheap, always-on):
  `timefilter_versions_seen`, `timefilter_versions_before_cutoff`,
  `timefilter_elements_written`, `timefilter_dropped_deleted`,
  `timefilter_dropped_no_snapshot_version`,
  `timefilter_is_history_path`.
- Phase markers: `TIMEFILTER_HISTORY_START/END`,
  `TIMEFILTER_SNAPSHOT_START/END`.
- `#[hotpath::measure]` on `time_filter`, `time_filter_history`,
  `time_filter_snapshot`, `process_snapshot_batch`,
  `filter_block_snapshot`, `flush_group`, `write_owned_element`,
  `clone_owned_element`.
- **Do not add per-element `Instant::now()` timers in the callback**:
  tried it and the time-source alone doubled Japan wall from 37 s to
  73 s (344 M elements). The committed counters + hotpath attributes
  are the right shape; per-function breakdown comes from
  `brokkr time-filter --hotpath`.
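
Shape of the always-on counters, for reference (a sketch; the real
counters may sit behind different plumbing, but relaxed atomics are
the cheap per-element form):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One relaxed fetch_add per element costs nanoseconds; a per-element
// Instant::now() pair reads the time source twice per element and is
// what doubled Japan wall in the experiment above.
static TIMEFILTER_VERSIONS_SEEN: AtomicU64 = AtomicU64::new(0);
static TIMEFILTER_ELEMENTS_WRITTEN: AtomicU64 = AtomicU64::new(0);

fn record(written: bool) {
    TIMEFILTER_VERSIONS_SEEN.fetch_add(1, Ordering::Relaxed);
    if written {
        TIMEFILTER_ELEMENTS_WRITTEN.fetch_add(1, Ordering::Relaxed);
    }
}

// End-of-run dump, matching the counter names above.
fn report_counters() {
    println!(
        "timefilter_versions_seen={} timefilter_elements_written={}",
        TIMEFILTER_VERSIONS_SEEN.load(Ordering::Relaxed),
        TIMEFILTER_ELEMENTS_WRITTEN.load(Ordering::Relaxed),
    );
}
```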

## Alloc profile (japan iter 2, UUID `fed75758`)

48.7 GB allocated, 54.3 GB deallocated across the run. Exclusive alloc
bytes by function:

| Function | Calls | Avg | Total | % |
|---|---:|---:|---:|---:|
| `block_builder::take_owned` | 37,775 | 506.8 KB | **18.3 GB** | **75.5 %** |
| `parse_and_inline_with_scratch` | 37,858 | 114.9 KB | 4.1 GB | 17.2 % |
| `writer::frame_blob_into` | 30,640 | 43.2 KB | 1.3 GB | 5.2 % |
| `block_builder::add_node` | 233 M | 1 B | 250 MB | 1.0 % |
| `block_builder::add_way` | 795 K | 322 B | 245 MB | 1.0 % |
| `filter_block_snapshot` | 37,775 | 703 B | 25 MB | 0.1 % |

`take_owned` dominates alloc: every BlockBuilder finalization produces
a fresh `Vec<u8>` for the serialized block + indexdata, ~500 KB each,
37 K per Japan run. These Vecs are the same ones whose high-water
retention drives the 20 GB anon peak at Europe.

## Iter 3 notes (2026-04-19)

- **Budgeted batch landed.** `for_each_primitive_block_batch_budgeted`
  with a 128 MB cap on decoded bytes per batch. Europe peak anon 20 GB
  -> 18.1 GB (-10 %), wall unchanged at 95.1 s. Tightening to 32 MB
  *regressed* anon (back to ~20 GB) because the pipelined reader's
  decode-ahead expanded to compensate for slower batch consumption.
  128 MB is the sweet spot for this code shape; below that the
  allocator / decode-ahead takes up the slack. (Batching shape
  sketched after this list.)

- **`mallopt(M_ARENA_MAX, 2)` does NOT work here.** renumber_external
  uses it to drop planet peak anon from ~26 GB to <1 GB. Tried the
  same one-liner in time_filter. Measured regression on Europe
  `--bench 1`: wall 95.1 s -> 160.4 s (+69 %), peak anon 20 GB ->
  24.8 GB (+24 %), avg cores 20.4 -> 14.1. The reason is the command
  class: renumber workers do low-alloc wire-format splice, time-filter
  workers do allocation-heavy full BlockBuilder re-encode. With 2
  arenas the malloc lock contention dominates the fragmentation win.
  **Do not re-attempt** - the pin comment in `time_filter.rs` carries
  the full measurement.

- **`parse_and_inline_with_scratch` audit resolved.** Explore agent
  traced the scratch lifecycle through `src/read/pipeline.rs:178-195`
  (thread-local ST_SCRATCH / GR_SCRATCH Vecs per decode task,
  `.clear()` + capacity retention between blobs). The 4.1 GB reported
  by alloc mode is per-worker capacity held across the run, not
  per-call churn. Previous "opportunity #3" in this doc was based on
  a misreading - **strike it**; the reduction from 829 MB -> 48 MB
  claimed in TODO.md *is* wired into the snapshot path.
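
The budgeted-batch shape from the first bullet, sketched against a
generic block iterator (`for_each_primitive_block_batch_budgeted` is
the real entry point; everything else here is illustrative):

```rust
/// Accumulate decoded blocks until their decoded size crosses the
/// budget, then hand the whole batch to the callback. Bounds how many
/// decoded bytes one batch can pin at a time.
const BATCH_BUDGET_BYTES: usize = 128 * 1024 * 1024; // measured sweet spot

fn for_each_batch_budgeted<B>(
    blocks: impl Iterator<Item = (B, usize)>, // (decoded block, decoded size)
    mut on_batch: impl FnMut(Vec<B>),
) {
    let mut batch = Vec::new();
    let mut batch_bytes = 0usize;
    for (block, size) in blocks {
        batch.push(block);
        batch_bytes += size;
        if batch_bytes >= BATCH_BUDGET_BYTES {
            on_batch(std::mem::take(&mut batch));
            batch_bytes = 0;
        }
    }
    if !batch.is_empty() {
        on_batch(batch); // flush the final partial batch
    }
}
```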

## Iter 4 notes (2026-04-19): Vec pool lands, doesn't pay off

Landed the full pool infrastructure over two commits:
[`src/write/buf_pool.rs`](../src/write/buf_pool.rs) with a bounded
`Mutex<Vec<Vec<u8>>>` + RAII-adjacent get/put API (instrumented with
hit/miss, put/capacity, len counters),
[`BlockBuilder::take_owned_swap`](../src/write/block_builder.rs) as
a sibling to `take_owned` that swaps in a caller-provided `Vec<u8>`
(via `std::mem::replace`) for the next encode cycle, and
[`PbfWriter::write_primitive_block_owned_pooled`](../src/write/writer.rs)
that returns `block_bytes` to the pool inside the rayon compression
closure's tail. Wired end-to-end through
`time_filter_snapshot -> process_snapshot_batch -> filter_block_snapshot`
with local `flush_local_pooled` + `ensure_*_capacity_pooled` helpers.
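
The pool's shape, as a sketch (the real `buf_pool.rs` adds more
counters and an `Arc`-shared handle; this is the bounded
`Mutex<Vec<Vec<u8>>>` core):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

pub struct BlockBufPool {
    bufs: Mutex<Vec<Vec<u8>>>,
    max_retained: usize, // puts beyond this are dropped, bounding RSS
    gets_total: AtomicU64,
    gets_hit: AtomicU64,
}

impl BlockBufPool {
    pub fn new(max_retained: usize) -> Self {
        Self {
            bufs: Mutex::new(Vec::new()),
            max_retained,
            gets_total: AtomicU64::new(0),
            gets_hit: AtomicU64::new(0),
        }
    }

    /// Pop a pre-grown buffer if one is available, else a fresh Vec (a miss).
    pub fn get(&self) -> Vec<u8> {
        self.gets_total.fetch_add(1, Ordering::Relaxed);
        match self.bufs.lock().unwrap().pop() {
            Some(buf) => {
                self.gets_hit.fetch_add(1, Ordering::Relaxed);
                buf
            }
            None => Vec::new(),
        }
    }

    /// Return a buffer: cleared (capacity kept), dropped if the pool is full.
    pub fn put(&self, mut buf: Vec<u8>) {
        buf.clear();
        let mut bufs = self.bufs.lock().unwrap();
        if bufs.len() < self.max_retained {
            bufs.push(buf);
        }
    }
}
```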

**Measurement:** Europe wall 95.1 s -> 94.3 s (within noise); peak
anon 20.0 GB (iter 2) / 18.1 GB (iter 3 budgeted batch alone) ->
18.3 GB (iter 4 pool + budgeted batch). The pool is doing its job
mechanically (Europe: 522 K gets, 87 % hit rate, 0 puts dropped) but
**does not move the Europe RSS needle** over iter 3's budgeted batch.

**Root cause, diagnosed via pool counters:** `par_iter().map_init(
BlockBuilder::new, ...)` creates a fresh `BlockBuilder` **per rayon
task, not per thread**. Each `BlockBuilder` processes roughly one
block, so its first `encode_block` always allocates `encode_buf` from
`cap=0` - the pool-sized `swap` installed after that encode is for
the *next* call that never comes (the task ends). Average put
capacity matches block size (~140 KB) rather than the pre-grown
target, confirming the diagnosis: pool Vecs get to BlockBuilder, but
BlockBuilder discards them unused because the first and only encode
already finished.

**The pool stays landed** - it's correct, tested (three unit tests
plus per-run counters), and unblocks iter 5. Cost when unused by
longer-lived callers is zero (`Arc` clones and bounded mutex touches
only fire in the snapshot path). The next iteration is the lever
that actually pays the pool off: make `BlockBuilder` persistent
across rayon tasks.

## Iter 5 notes (2026-04-19): pool pays off via thread_local BlockBuilder

Replaced `batch.par_iter().map_init(BlockBuilder::new, ...)` with
`batch.par_iter().map(|block| SNAPSHOT_BB.with_borrow_mut(...))`
where `SNAPSHOT_BB` is a module-scope `thread_local!` holding a
`RefCell<BlockBuilder>`. Rayon reuses a fixed pool of
worker threads across successive `par_iter()` calls, so a
thread-local persists the same `BlockBuilder` across all batches
processed by that thread. `take_owned_swap` now installs a pool-sized
swap whose capacity survives to the next encode on the same
BlockBuilder instead of dying with the per-task one.
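
The winning pattern, as a sketch (stand-in builder; the thread-local
and `with_borrow_mut` are the real shape, the encode body is not):

```rust
use rayon::prelude::*;
use std::cell::RefCell;

#[derive(Default)]
struct BlockBuilder { encode_buf: Vec<u8> } // stand-in for the real builder

thread_local! {
    // One builder per rayon worker thread. Pool threads are long-lived,
    // so capacity swapped in after an encode survives to the next block
    // this thread processes - unlike the per-task map_init builder.
    static SNAPSHOT_BB: RefCell<BlockBuilder> = RefCell::new(BlockBuilder::default());
}

fn process_snapshot_batch(batch: &[Vec<u8>]) -> Vec<Vec<u8>> {
    batch
        .par_iter()
        .map(|block| {
            SNAPSHOT_BB.with_borrow_mut(|bb| {
                bb.encode_buf.clear();
                bb.encode_buf.extend_from_slice(block); // placeholder "encode"
                // take_owned_swap shape: hand out the encoded bytes and
                // install a replacement (pool-sized in the real code)
                // that the *next* block on this thread encodes into.
                let replacement = Vec::with_capacity(bb.encode_buf.capacity());
                std::mem::replace(&mut bb.encode_buf, replacement)
            })
        })
        .collect() // batch order preserved
}
```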

**Pool counter change (Japan):**

|                         | iter 4  | iter 5  |
|-------------------------|---------|---------|
| gets_total              | 43,035  | 43,035  |
| gets_hit                | 34,748  | 34,748  |
| avg get capacity        | 136 KB  | 576 KB  |
| avg put capacity        | 136 KB  | 576 KB  |
| avg put len (block size)| 136 KB  | 140 KB  |

**Alloc profile change (Japan, UUID `fed75758` -> `86db6ef6`):**

| Function                              | iter 2 (no pool) | iter 5 (pool + TLS) |
|---------------------------------------|------------------|---------------------|
| `take_owned` / `take_owned_swap`      | 18.3 GB (75 %)   | **109 MB (1.7 %)**  |
| `parse_and_inline_with_scratch`       | 4.1 GB (17 %)    | 4.4 GB (70 %)       |
| `frame_blob_into`                     | 1.3 GB (5 %)     | 1.5 GB (24 %)       |
| Total allocated                       | 48.7 GB          | **~6.3 GB**         |

**Europe wall / RSS (UUID `6683cb05`):**

- Wall 94.7 s -> **92.6 s** (-2.2 % vs iter 3).
- Peak anon 18.1 GB -> **16.9 GB** (-15.5 % vs iter 2; -6.6 % vs
  iter 3).
- Avg cores 20.3 -> 20.5.

Planet extrapolation (naive linear): Europe's 16.9 GB anon at 35 GB
input scales to ~45 GB anon at the 92 GB planet input. Still over
the 27 GB host ceiling, but comfortably within striking distance of
opportunity #3 (raw blob passthrough for all-survive blocks) or
finer in-flight-bytes tuning (#6).

## Iter 6 notes (2026-04-19): raw-frame passthrough *reverted*

Attempted raw-frame passthrough for all-survive blobs following the
`extract --strategy simple` pattern in
[`src/commands/extract/common.rs`](../src/commands/extract/common.rs):
header-only schedule scan + pread-from-workers +
scan-first-then-dispatch, with workers emitting
`WorkerOutcome::Passthrough` (frame
offset + size) when every element has `timestamp <= cutoff &&
visible`. Consumer preads the raw frame for passthrough items and
writes via `PbfWriter::write_raw_owned`, skipping BlockBuilder
re-encode and frame_blob_into compression entirely.

**Result on the benched workload: net regression, reverted.**

|                        | japan          | europe          |
|------------------------|----------------|-----------------|
| wall vs iter 5         | 6.8 s -> 8.1 s | 92.6 s -> 112.2 s |
| all-survive blobs      | 309 / 43,035   | 3,291 / 522,168  |
| passthrough rate       | **0.72 %**     | **0.63 %**      |

**Two independent reasons it lost:**

1. **Passthrough rate is a floor-level fraction at this cutoff**
   (2024-01-01 on a 2026-02 snapshot). Elements within a blob share
   nearby IDs, i.e. similar creation eras, but their *last-edit*
   timestamps are independent enough that any single post-cutoff
   edit disqualifies the whole blob. Break-even was ~14 %
   passthrough (scan cost vs savings); we got 0.7 %. Mathematically:
   with a 77 % element-survive rate and 8,000 elements/blob, P(all
   survive) is effectively zero unless time-of-edit is strongly
   correlated across nearby IDs - and for OSM snapshots with active
   recent editing across the graph, it is not (worked out after this
   list).
2. **The I/O architecture change itself lost wall.** iter 5 uses
   `ElementReader::into_blocks_pipelined` (3-stage pipeline: IO ->
   rayon decode -> reorder, with async I/O overlapping decode).
   iter 6's pread-from-workers pattern does synchronous pread per
   blob per worker. Even with passthrough savings at 0 %, the I/O
   swap alone cost wall on Europe. The pipelined reader is a
   well-tuned piece we should not give up without a compensating
   win.
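
Putting numbers on reason 1 (back-of-envelope, under independence,
using the 77 % survive rate and 8,000 elements/blob from above):

$$
P(\text{all survive}) = 0.77^{8000} = e^{8000 \ln 0.77} \approx e^{-2091} \approx 0
$$

Inverting: clearing the ~14 % break-even bar would need a
per-element survive rate of $0.14^{1/8000} \approx 99.98\,\%$,
i.e. essentially no post-cutoff edits - exactly the
very-recent-cutoff regime described next.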

**When passthrough *would* pay:** very recent cutoffs (e.g.,
"filter out edits from the last week" on a multi-year-old
snapshot) where many low-ID blobs contain exclusively pre-cutoff
edits. At that regime passthrough rate could push into the tens of
percent and clear the break-even bar. Not the workload anyone
benches us on, and not something the current CLI configures, so
there's no justification to carry the code.

**Lesson:** both lessons from `notes/raw-group-passthrough.md`
apply in reverse here. (a) The shadow-counter methodology
would have caught this before the refactor - we could have
instrumented iter 5 to count all-survive blobs without switching
I/O paths and seen the 0.7 % fraction directly. (b) Even if
passthrough had paid, landing it on a different I/O architecture
than the winning one was a hidden cost. Next time: measure first
with instrumentation on the existing architecture before rewriting
the I/O path.

Reverted in-place; no commit landed iter 6.

## Remaining opportunities (ranked)

### 1. Pool `take_owned` output Vecs - LANDED iter 4

Was the biggest alloc target (75 % at Japan = 18.3 GB of churn;
extrapolating Europe at ~4× element count gives ~70 GB of churn and
explains the 18 GB peak anon after iter 3). Lifecycle:

- Worker: `BlockBuilder::take_owned()` allocates `Vec<u8>` for
  serialized block bytes + indexdata `Vec<u8>` + optional tagdata.
- Consumer: writes via
  `PbfWriter::write_primitive_block_owned(bytes, index, tagdata)`
  at `src/write/writer.rs:616`. Writer captures into a rayon closure
  (line 640), calls `frame_blob_into()` (line 642) which **clones**
  the bytes into a fresh `FramedBlobParts` (`lines 1160-1164`). The
  original Vecs are then dropped when the closure scope ends at
  `line 666` (pipelined) or `line 674` (sync). That drop point is
  where a pool would intercept them.

No existing pool for this lifecycle (Explore agent Q1, 2026-04-19).
Closest adjacent pattern is `DecompressPool` + `PooledBuffer` at
`src/read/blob.rs:46-93`, which uses RAII-return-on-drop. The
writer-side receiving end is also unhooked today (Q3) - any pool
design needs to add either a completion callback on
`OutputSink::write_chunk()` or a drop-to-channel wrapper on the Vecs
entering the closure.
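
The drop-to-channel option, sketched (illustrative names; `frame`
stands in for `frame_blob_into` + compression):

```rust
use std::sync::mpsc::Sender;

fn frame(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec() // placeholder for frame_blob_into + compression
}

/// Intercept the drop point: instead of letting the rayon closure drop
/// the block bytes after framing, send them back to a recycler.
fn write_block_recycled(
    recycle: Sender<Vec<u8>>,
    block_bytes: Vec<u8>,
    emit: impl FnOnce(Vec<u8>) + Send + 'static,
) {
    rayon::spawn(move || {
        let framed = frame(&block_bytes);  // the clone described above
        let _ = recycle.send(block_bytes); // old drop point, now a pool return
        emit(framed);
    });
}
```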

Primary motivation: **anon RSS for the planet-scale run, not wall
time**. Cuts the 18 GB Europe peak further and unblocks planet on
the 27 GB host. Wall impact is secondary (allocator fast-paths
handle ~500 KB blocks well already). Landing shape: modest
multi-commit arc - pool primitive, BlockBuilder emit-into-pool hook,
writer drop-point hook, plumb through `time_filter_snapshot`.

### 2. Persistent BlockBuilder across rayon tasks - LANDED iter 5

The lever that made #1 actually pay off; see the iter-5 notes above
for the thread_local mechanics and measurements.

### 3. Raw blob passthrough for all-survive blocks - ATTEMPTED + REVERTED iter 6

See the iter-6 post-mortem above. **Measured 0.63 % passthrough on
Europe at cutoff 2024-01-01 - far below the ~14 % break-even bar**.
The theoretical correlation between adjacent IDs (nearby elements
share editing eras) does not survive contact with an active graph:
any single post-cutoff edit in a blob disqualifies the whole blob,
and at 8,000 elements/blob and realistic edit activity, P(all
survive) collapses. Passthrough is viable only at very recent
cutoffs (e.g. "filter out edits from the last week"); not a common
workload, no justification to carry the code.

### 4. Parallel history-input path

Sequential state machine, avg cores 2.4 at iter 1. No history PBFs
currently in the dataset inventory so the wall doesn't show up in
benches - keeping this deferred until a real workload lands.

Shape: workers decode + run per-block version selection emitting
`(prefix_complete_blocks, head_partial_group, tail_partial_group)`.
Consumer stitches blob N+1's head with blob N's tail when they match
(kind, id); writes the stitched winner as its own group.
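
A type-level sketch of that shape (all names illustrative; the real
version selection is elided):

```rust
/// One (kind, id) group of versions, as seen within a single blob.
struct Group { kind: u8, id: i64, versions: Vec<u64> }

/// Worker output for one blob: groups fully contained in the blob,
/// plus possibly-incomplete groups at each edge.
struct BlobResult {
    prefix_complete_blocks: Vec<Group>,
    head_partial_group: Option<Group>, // may continue blob N-1's tail
    tail_partial_group: Option<Group>, // may continue into blob N+1
}

/// Consumer-side stitch of blob N's tail with blob N+1's head.
fn stitch(tail: Option<Group>, head: Option<Group>) -> Vec<Group> {
    match (tail, head) {
        (Some(mut t), Some(h)) if (t.kind, t.id) == (h.kind, h.id) => {
            // Same element straddles the blob boundary: pool the versions
            // and re-run version selection on the merged group (elided).
            t.versions.extend(h.versions);
            vec![t]
        }
        (t, h) => t.into_iter().chain(h).collect(),
    }
}
```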

### 5. Blob-level timestamp range index

Blob index v1 carries `kind/min_id/max_id/count/bbox` - no timestamp
range. With a timestamp range per blob, the scheduler could:

- Drop blobs entirely above the cutoff without decompressing them.
- Mark blobs entirely below the cutoff (and confirmed all-visible) as
  raw-passthrough candidates without per-element scanning.

Format bump; coordinates with other commands that might use
timestamp metadata (merge-changes? extract history slices?). Big
surface-area change - only worth it if time-filter becomes a hot
command, and probably only after the shadow-counter measurement
(#8), which needs no format change.

### 6. Tune in-flight depth + mallopt variants

Smallest-scope post-iter-5 move. Two knobs sit above time-filter's
code and hold peak anon above what the pool + thread_local already
saved:

- **3-stage pipelined reader's decode-ahead buffer depth** and
  **writer's reorder / compression pool depth**. Both use defaults
  today. Shrinking either trades wall parallelism for peak RSS. On
  Europe iter 5 the writer's `reorder_high_water` reached 32; cut
  that cap to 8 and the writer holds far fewer in-flight compressed
  chunks. Similar thing on the reader side. Knob locations:
  [`src/read/pipeline.rs`](../src/read/pipeline.rs) for the decode
  pipeline capacity constants,
  [`src/write/writer.rs`](../src/write/writer.rs) for the reorder /
  permit pool.
- **`mallopt(M_ARENA_MAX, K)` with K > 2.** Iter 3 tried K=2 and
  regressed hard (wall +69 %, anon +24 %) because the malloc lock
  became the bottleneck. K=4 or K=8 might sit in the middle: fewer
  arenas than default (which the glibc heuristic chooses at runtime
  based on core count, typically ~16) but more than 2 so lock
  contention is manageable. One-line experiment, same single-commit
  shape as the failed K=2 attempt; bench Europe `--bench 1` each.

Expected win: low-single-digit GB of anon. Cheap to try. Do this
before any planet attempt regardless of whether we continue the arc.
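
The K-arena experiment from the second bullet is a one-liner at
process start (a sketch assuming the `libc` crate on a glibc
target; `K` is the value under bench):

```rust
fn main() {
    // Cap glibc's malloc arena count before any threads spawn.
    // K=2 is the measured-and-reverted iter-3 attempt; K=4 / K=8 are
    // the open middle ground between lock contention and fragmentation.
    const K: i32 = 4;
    unsafe { libc::mallopt(libc::M_ARENA_MAX, K) };
    // ... normal time-filter command body runs here ...
}
```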

### 7. Reader-side scratch cap

Targets the largest remaining alloc contributor after iter 5:
per-worker thread-local scratch Vecs in
[`src/read/pipeline.rs:178-195`](../src/read/pipeline.rs) that grow
to the max-block-size seen on each decode thread and never shrink.
Iter 5 alloc profile: `parse_and_inline_with_scratch` = 4.4 GB, 70 %
of total remaining alloc. The Explore agent's iter-4 Q2 confirmed
this is **retained capacity across the run**, not per-call churn -
so the hot-path cost is already minimal; the lever here is anon RSS
for planet, not wall.

Shapes to consider:

- **Smaller retention cap per thread.** Right now each worker's
  scratch grows to whatever the biggest blob it touches needs and
  stays there. A `Vec::shrink_to` with a modest ceiling (e.g. 256 KB)
  after each blob would bound retention at N_threads × 256 KB but
  re-alloc on bigger blobs.
- **Shared bounded pool instead of per-thread.** Same
  `BlockBufPool`-style pattern as iter 4, applied to the reader's
  scratch. Workers pull pre-grown scratch at the top of each decode,
  return after. Bounds aggregate retention to `POOL_CAPACITY *
  scratch_size` regardless of thread count. More refactor than the
  shrink-on-loop path; roughly 2-3 commits.

Do the shrink-on-loop version first; measure. If it lands close to
its target, the pool shape is overkill. If it regresses wall (extra
allocs on big blobs), fall back to the pool shape.
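
The shrink-on-loop shape, sketched (`ST_SCRATCH` is the real name
from the audit above; the decode body and the 256 KB ceiling are
illustrative):

```rust
use std::cell::RefCell;

const SCRATCH_RETAIN_CAP: usize = 256 * 1024; // ceiling to bench

thread_local! {
    static ST_SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn decode_blob(blob: &[u8]) {
    ST_SCRATCH.with_borrow_mut(|scratch| {
        scratch.clear();
        scratch.extend_from_slice(blob); // placeholder for the real decode
        // ... parse out of `scratch` ...
        // Bound retention: big blobs still decode fine, but the thread
        // keeps at most SCRATCH_RETAIN_CAP between blobs (re-allocs on
        // the next big blob - the trade-off to measure).
        scratch.shrink_to(SCRATCH_RETAIN_CAP);
    });
}
```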

### 8. Adaptive passthrough via shadow counter (iter 5 path)

The iter-6 lesson in post-mortem form: measure the all-survive blob
fraction **on the existing I/O architecture** before investing in a
new one. Concrete shape:

- Add a cheap scan inside iter 5's
  [`filter_block_snapshot`](../src/commands/time_filter.rs) that
  runs the `ts <= cutoff && visible` predicate as a separate pass
  and returns `(all_survive: bool, total)`. A counter at the end of
  the phase reports
  `timefilter_all_survive_blobs / timefilter_total_blobs` (sketch
  below).
- If reported fraction on a real workload is > ~20 %, re-investigate
  passthrough - but from the pipelined-reader path (via
  `raw_group_bytes` / `frame_raw_block` scaffolding in
  `src/write/raw_passthrough.rs`), not a pread swap. The iter-6
  I/O change was as big a loss as the low passthrough rate; any
  future attempt needs to stay on the winning I/O architecture.
- Narrow-cutoff workloads ("last week of edits on a 1-year
  snapshot") are where the fraction clears the break-even bar. We
  don't bench those today. This is a latent feature unlocked when a
  user lands one.

Cheap to land the counter (~20 lines). Measure first; decide
second.
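
The counter, sketched (names match the bullet above; the
element-storage shape is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static TIMEFILTER_ALL_SURVIVE_BLOBS: AtomicU64 = AtomicU64::new(0);
static TIMEFILTER_TOTAL_BLOBS: AtomicU64 = AtomicU64::new(0);

/// Separate predicate pass over one block, before the real filter:
/// records whether the whole blob would survive untouched.
fn record_all_survive(timestamps: &[i64], visible: &[bool], cutoff: i64) {
    TIMEFILTER_TOTAL_BLOBS.fetch_add(1, Ordering::Relaxed);
    let all_survive = timestamps
        .iter()
        .zip(visible)
        .all(|(&ts, &v)| ts <= cutoff && v);
    if all_survive {
        TIMEFILTER_ALL_SURVIVE_BLOBS.fetch_add(1, Ordering::Relaxed);
    }
}
```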

## Relationship to other documents

- Hot-path & alloc methodology template: see the renumber_external
  arc (TODO.md "Active optimization plans" and
  [`src/commands/renumber_external/mod.rs`](../src/commands/renumber_external/mod.rs)).
  Same winning pattern as here - parallel decode, worker-parallel
  block work, consumer forwards OwnedBlocks to a writer with its own
  rayon compression pool.
- Per-block parallel pattern template: `tags_filter.rs` single-pass
  (`for_each_primitive_block_batch` + `par_iter().map_init(
  BlockBuilder::new, ...)`). The snapshot path here is a direct
  adaptation.
- The "don't chase raw passthrough without measuring" rule from
  `src/commands/tags_filter.rs` pass-2 worker applies to opportunity
  #4 above: measure the all-survive fraction before building the
  passthrough path.

## Planet

**Not running planet yet.** Opportunity #1 landed (iters 4-5), but
post-iter-5 Europe peak anon is still 16.9 GB at 35 GB input → naive
linear extrapolation to the 92 GB planet is ~45 GB anon, which OOMs
the 27 GB bench host. The in-flight-depth and scratch-cap levers
(#6, #7) are the cheap remaining follow-ups before a planet attempt.