# Reverse Geocoding Index: Implementation Specification

Parent document: [reverse-geocoding.md](reverse-geocoding.md) (problem statement and
design direction).

**Note: implementation deviated from this spec.** Node coordinates use a compact
rank-indexed array (via `IdSetDense` rank) instead of the spec's `DenseMmapIndex`.
Pass 4 (S2 cell assignment) uses 256 temp-file buckets instead of external merge
sort. Actual peak anon RSS is **29.5 GB** (planet, commit `7e9c2e9`, 2026-04-17) vs
the spec's ~14.5 GB budget - the peak lands in `GEOCODE_PASS1_5` (referenced-node
collect), not in the `DenseMmapIndex` mmap. The earlier 17.8 GB figure
under-reported the peak because brokkr's sidecar previously hid short-emitting
phase markers. See [geocode-build-opportunities.md](geocode-build-opportunities.md)
for the breakdown.

## 1. Scope

This spec covers two deliverables:

1. **`build-geocode-index`** - a new pbfhogg CLI command that reads an OSM PBF and
   writes a set of binary index files optimized for reverse geocoding queries.
   Gated behind the `commands` feature (same as other CLI commands).
2. **`pbfhogg::geocode_index`** - a library module with two parts:
   - **Reader** (`Reader::query`, `Reader::candidates`): gated behind the
     `geocode-reader` feature. This is what nidhogg links against - it does not
     pull in `roaring`, `serde_json`, or builder code.
   - **Builder** (the build pipeline): gated behind `commands`, which implies
     `geocode-reader`.

The spec does not cover nidhogg's HTTP endpoint, response formatting, or
authentication. Those are serving concerns that consume the reader API.

## 2. CLI interface

```
pbfhogg build-geocode-index <INPUT.osm.pbf> --output-dir <DIR> [OPTIONS]

Options:
  --output-dir <DIR>       Directory for index files (created if missing)
  --force                  Proceed without indexdata (slower fallback)
  --street-level <N>       S2 cell level for streets/addresses [default: 17]
  --coarse-level <N>       Fallback cell level for rural areas [default: 14]
  --admin-level <N>        S2 cell level for admin boundaries [default: 10]
  --max-admin-vertices <N> Douglas-Peucker vertex cap per polygon [default: 500]
  --search-radius <M>      Fine-level max distance in meters [default: 75]
  --coarse-search-radius <M> Coarse-level max distance in meters [default: 1000]
```

The command requires an input PBF with blob-level indexdata (generated by
`pbfhogg cat`). The `--force` flag allows raw PBFs at the cost of slower scanning.

Output: a directory containing the index files described in section 4. The command
prints a summary on completion: element counts, file sizes, build time, peak memory.

Exit codes: 0 on success, 1 on error. Partial output is removed on failure.

## 3. Build pipeline

The build has four sequential passes over the PBF, plus a finalization step. Each
pass uses `ElementReader::for_each_pipelined` with blob-level filtering to skip
irrelevant blobs.

### 3.0. Header and validation

Read the PBF header. Verify indexdata is present (or `--force`). Record the
`osmosis_replication_sequence_number` and `osmosis_replication_timestamp` if
available - these are written into the index header for staleness detection.

### 3.1. Pass 1: Nodes

**Input:** All node blobs.
**Blob filter:** `BlobFilter::only_nodes()`.

**Collect:**
- **Address points:** Nodes with both `addr:housenumber` and `addr:street` tags.
  Store `(node_id, lat_e7, lon_e7, housenumber, street, postcode)` in a temporary
  file, sorted by node ID. The housenumber, street, and postcode (from
  `addr:postcode`, if present) are interned into the global string pool
  (section 4.8). Postcode is stored as string offset 0 (empty) when the tag is
  absent.
- **Node coordinate index:** All node coordinates for way resolution in pass 2.
  Uses the same strategy as `add-locations-to-ways`: a `DenseMmapIndex`
  (file-backed mmap, `index[node_id] = (lat_e7, lon_e7)`) for the default path.
  This is the dominant memory cost - ~16 GB touched for planet after blob
  filtering.

**Memory budget:** ~16 GB virtual for the dense node index (same as ALTW). The
address point temporary file is append-only and streamed to disk. String pool
grows in memory but street/housenumber cardinality is bounded (~50M unique
strings for planet, ~1.5 GB at 30 bytes average).

### 3.2. Pass 2: Ways

**Input:** All way blobs.
**Blob filter:** `BlobFilter::only_ways()`.

**Collect three categories in parallel (per-block, folded):**

**Streets:** Ways with a `highway` tag and a `name` tag, excluding highway values:
`footway`, `path`, `track`, `steps`, `cycleway`, `service`, `pedestrian`,
`bridleway`, `construction`. For each qualifying way:
- Resolve node coordinates from the dense index.
- Store `(way_id, name_string_id, [(lat_e7, lon_e7), ...])` in a temporary file.
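
The filter above can be sketched as a small predicate. This is illustrative, not
the builder's actual code: tags are modeled as a plain slice of key/value pairs,
whereas the real decoder exposes its own tag iterator.

```rust
/// `highway` values excluded from the street index (the spec's list).
const EXCLUDED_HIGHWAY: &[&str] = &[
    "footway", "path", "track", "steps", "cycleway", "service",
    "pedestrian", "bridleway", "construction",
];

/// Returns the street name if the way qualifies: a `highway` tag outside
/// the excluded set plus a `name` tag.
fn street_name<'a>(tags: &'a [(&'a str, &'a str)]) -> Option<&'a str> {
    let highway = tags.iter().find(|(k, _)| *k == "highway")?.1;
    if EXCLUDED_HIGHWAY.contains(&highway) {
        return None;
    }
    tags.iter().find(|(k, _)| *k == "name").map(|(_, v)| *v)
}
```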

**Address points from buildings:** Ways with `building` tag and both
`addr:housenumber` and `addr:street`. Compute the centroid (arithmetic mean of
node coordinates). Append to the same address point temporary file from pass 1.
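
The centroid computation is simple but worth pinning down: accumulating i32
decimicrodegrees in i64 avoids overflow. A sketch (assumes the caller passes
each node once, i.e. without the closing duplicate of the first node):

```rust
/// Arithmetic-mean centroid of a way's nodes, in decimicrodegrees.
/// i64 accumulation so summing i32 coordinates cannot overflow.
fn centroid(nodes: &[(i32, i32)]) -> Option<(i32, i32)> {
    if nodes.is_empty() {
        return None;
    }
    let n = nodes.len() as i64;
    let (sum_lat, sum_lon) = nodes.iter().fold((0i64, 0i64), |(a, b), &(lat, lon)| {
        (a + lat as i64, b + lon as i64)
    });
    Some(((sum_lat / n) as i32, (sum_lon / n) as i32))
}
```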

**Interpolation ways:** Ways with `addr:interpolation` tag and `addr:street`.
Store `(way_id, street_string_id, interpolation_type, [(lat_e7, lon_e7), ...])`.
The start/end house numbers are resolved in finalization (section 3.5) by matching
interpolation endpoints against nearby address points.

**Memory budget:** Dense node index remains mapped (read-only). Way data is streamed
to temporary files. Per-block scratch is bounded by the ALTW batch budget pattern
(128 MB decode budget).

### 3.3. Pass 3: Relations (admin boundaries)

**Input:** All relation blobs, then a second scan for way member geometry.
**Blob filter:** `BlobFilter::only_relations()` for first scan.

This pass assembles administrative boundary polygons. It follows a two-scan approach:

**Scan 3a - Collect boundary relations:**
Stream relations. For each relation with `boundary=administrative` (admin_level
2-10) or `boundary=postal_code` (treated as level 11):
- Record the relation ID, admin_level, name, and country_code (ISO 3166-1:alpha2
  from `ISO3166-1:alpha2` tag, for admin_level=2 only).
- Collect all way member IDs into an `IdSetDense`, recording the role (`outer`,
  `inner`, or empty) per member per relation. Both outer and inner members are
  needed - inner rings define holes (enclaves, exclusion zones) that must be
  subtracted during point-in-polygon testing. Without inner rings, queries inside
  enclaves (e.g., Vatican City within Rome) would incorrectly match the enclosing
  boundary.

**Scan 3b - Resolve way geometries:**
Re-scan way blobs (blob filter: `BlobFilter::only_ways()`). For each way whose
ID is in the collected set, resolve node coordinates from the dense index and
store the way geometry.

**Ring assembly:**
After scan 3b, assemble ways into closed rings per relation using endpoint
matching, grouped by role. This reuses the ring assembly algorithm from nidhogg's
`geocode.rs` (`assemble_rings`), adapted to work with pbfhogg's coordinate types:
- Group way geometries by role: outer (including empty role) and inner separately.
- For each group, build an endpoint map: `HashMap<(i32, i32), Vec<usize>>`
  mapping quantized endpoints to segment indices.
- Trace rings by following endpoint chains.
- Classify outer rings as exterior and inner rings as holes by role assignment
  (not signed area alone - the role tag is authoritative per OSM conventions).
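
The tracing loop can be sketched as follows. This is a simplified stand-in for
nidhogg's `assemble_rings`, not a copy of it: role grouping happens outside,
segments are assumed non-empty, and unclosable chains are silently dropped
where the real builder would warn.

```rust
use std::collections::HashMap;

/// Assemble closed rings from unordered way segments by endpoint matching.
/// Each segment is a polyline of quantized (lat_e7, lon_e7) points.
fn assemble_rings(segments: Vec<Vec<(i32, i32)>>) -> Vec<Vec<(i32, i32)>> {
    // Endpoint -> indices of segments that start or end there.
    let mut by_endpoint: HashMap<(i32, i32), Vec<usize>> = HashMap::new();
    for (i, seg) in segments.iter().enumerate() {
        by_endpoint.entry(*seg.first().unwrap()).or_default().push(i);
        by_endpoint.entry(*seg.last().unwrap()).or_default().push(i);
    }
    let mut used = vec![false; segments.len()];
    let mut rings = Vec::new();
    for start in 0..segments.len() {
        if used[start] {
            continue;
        }
        used[start] = true;
        let mut ring = segments[start].clone();
        // Follow the endpoint chain until the ring closes on its first point.
        while ring.first() != ring.last() {
            let tail = *ring.last().unwrap();
            let Some(&next) = by_endpoint
                .get(&tail)
                .and_then(|cands| cands.iter().find(|&&i| !used[i]))
            else {
                break; // open chain: dropped (the builder would log a warning)
            };
            used[next] = true;
            let mut seg = segments[next].clone();
            if *seg.last().unwrap() == tail {
                seg.reverse(); // orient the segment to continue from `tail`
            }
            ring.extend_from_slice(&seg[1..]);
        }
        if ring.len() > 3 && ring.first() == ring.last() {
            rings.push(ring);
        }
    }
    rings
}
```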

**Polygon simplification:**
Apply Douglas-Peucker to each assembled ring with a vertex cap
(`--max-admin-vertices`, default 500). This dramatically reduces storage and
point-in-polygon cost for complex boundaries like Norway's coastline.
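
A minimal sketch of the classic recursive Douglas-Peucker, with a tolerance in
coordinate units. Enforcing the vertex *cap* (rather than a fixed tolerance)
would wrap this in a retry loop with increasing tolerance; that loop is omitted
here.

```rust
/// Recursive Douglas-Peucker: keep endpoints, recurse on the point farthest
/// from the chord while its deviation exceeds `tol`.
fn douglas_peucker(points: &[(i32, i32)], tol: f64) -> Vec<(i32, i32)> {
    if points.len() <= 2 {
        return points.to_vec();
    }
    let (a, b) = (points[0], points[points.len() - 1]);
    let (mut max_d, mut max_i) = (0.0f64, 0usize);
    for (i, &p) in points.iter().enumerate().skip(1).take(points.len() - 2) {
        let d = perp_distance(p, a, b);
        if d > max_d {
            max_d = d;
            max_i = i;
        }
    }
    if max_d <= tol {
        return vec![a, b];
    }
    // Keep the farthest point; recurse on both halves around it.
    let mut left = douglas_peucker(&points[..=max_i], tol);
    let right = douglas_peucker(&points[max_i..], tol);
    left.pop(); // drop the duplicated split point
    left.extend(right);
    left
}

/// Perpendicular distance from `p` to the line through `a` and `b`.
fn perp_distance(p: (i32, i32), a: (i32, i32), b: (i32, i32)) -> f64 {
    let (px, py) = (p.0 as f64, p.1 as f64);
    let (ax, ay) = (a.0 as f64, a.1 as f64);
    let (dx, dy) = (b.0 as f64 - ax, b.1 as f64 - ay);
    let len = (dx * dx + dy * dy).sqrt();
    if len == 0.0 {
        return ((px - ax).powi(2) + (py - ay).powi(2)).sqrt();
    }
    (dy * (px - ax) - dx * (py - ay)).abs() / len
}
```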

**Memory budget:** The `IdSetDense` for way member IDs is small (~100K ways for
planet boundaries). The assembled polygons for all admin boundaries fit in
memory - there are ~300K admin boundary relations for planet, but after
simplification the total vertex count is bounded (~150M vertices worst case
before simplification, ~50M after). The dense node index can be unmapped after
this pass.

### 3.4. Pass 4: S2 cell assignment and index construction

No PBF reading - this pass operates on the temporary files from passes 1-3.

**For each address point:**
Compute cell IDs at both `street_level` and `coarse_level`. Record
`(cell_id, addr_point_index)` in both the fine and coarse cell maps.

**For each street way:**
For each consecutive node pair (segment), compute the S2 cell covering at both
`street_level` and `coarse_level`. Record `(cell_id, way_index, segment_index)`
for each covered cell in both maps. Each segment maps to ~1 cell at level 17,
so there is no need for per-way deduplication - each (cell, way, segment) tuple
is naturally unique.

**For each interpolation way:**
Same as streets - cover each segment at both levels, recording segment refs.

**For each admin polygon:**
Compute edge and interior cell sets at `admin_level`:

**Edge cells:** For each segment of each ring (outer and inner), compute the S2
cells at `admin_level` that the segment crosses. This uses the same segment-level
covering as streets: compute the cell ID for each endpoint and, if they differ,
walk the cell boundary crossings between them. All cells touched by any ring
segment are edge cells.

**Interior cells:** Flood-fill from a seed point known to be inside the polygon
(e.g., the centroid, verified by point-in-polygon test). Starting from the seed
cell, BFS-expand to all neighbors at `admin_level` that pass the point-in-polygon
test (testing the cell center). Stop expanding when the cell center falls outside
the polygon. Edge cells (already identified) are excluded from the interior set.

This is an approximation, not a geometric proof. A cell whose center is inside
the polygon could still straddle the boundary near narrow features, coastlines,
or holes.
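
The BFS itself is independent of S2 specifics, so it can be sketched over an
abstract cell type, with the neighbor function and center point-in-polygon test
passed in as closures (the concrete versions come from the s2 crate and the
assembled polygons):

```rust
use std::collections::{HashSet, VecDeque};

/// Interior flood-fill: BFS from a seed cell, expanding to neighbors whose
/// center passes the point-in-polygon test. Edge cells (precomputed) are
/// never entered, so the fill cannot leak across the boundary ring.
fn flood_fill_interior<C, N, I>(
    seed: C,
    edge_cells: &HashSet<C>,
    mut neighbors: N,
    mut center_inside: I,
) -> HashSet<C>
where
    C: Copy + Eq + std::hash::Hash,
    N: FnMut(C) -> Vec<C>,
    I: FnMut(C) -> bool,
{
    let mut interior = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back(seed);
    while let Some(cell) = queue.pop_front() {
        if interior.contains(&cell) || edge_cells.contains(&cell) || !center_inside(cell) {
            continue;
        }
        interior.insert(cell);
        queue.extend(neighbors(cell));
    }
    interior
}
```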

Interior cells are marked with the high-bit flag (0x8000_0000) in the entry list.

**Interior cell semantics:** The high bit is a **hint**, not proof of
containment. The reader **still performs a point-in-polygon test** on
interior-flagged cells - the same test as edge cells. The difference is
priority: interior cells are tested first, and the reader can skip remaining
candidates at the same admin level once a match is confirmed (since interior
cells are far from boundaries, the first match is almost always correct). This
avoids the correctness problem of skipping ray-casting entirely while still
getting most of the performance benefit - interior cells are the common case,
and PIP on a simplified polygon is cheap.

The worst case for a false interior-flag is wasted work (testing a polygon
that doesn't contain the point), not a wrong answer.

**Sort and merge:**
- Sort street/address/interpolation cell entries by cell ID.
- Build the merged geo-cell index: for each unique cell ID, record offsets into
  the three entry lists.
- Sort admin cell entries by cell ID.
- Build the admin cell index.

**Write all index files** (section 4).

**Memory budget:** The sort is the bottleneck. With dual-level indexing, each
geometry produces two cell entries (fine + coarse). For planet: ~500M address
points × 2 levels × 8 bytes = ~8 GB for addresses; ~1.5B street segments ×
2 levels × 14 bytes (cell_id + way_index + segment_index) = ~42 GB for streets.
**External merge sort is the default path at planet scale**, not a fallback:
write sorted chunks to temp files, k-way merge. On a 30 GB host, chunk size
of ~10 GB with 3-4 merge passes.

The S2 `RegionCoverer` is used for line segment covering. The `s2` Rust crate
(`s2 = "0.0.13"` on crates.io) provides `CellID`, `LatLng`, `RegionCoverer`,
and `CellID::all_neighbors`. For admin polygon covering, see
the edge-walk + flood-fill approach described above.

### 3.5. Finalization: interpolation endpoint resolution

For each interpolation way, resolve the house numbers at each endpoint:

1. Look up nearby address points using the cell index (already built at this point).
2. **Filter by street name:** Only consider address points whose `addr:street`
   matches the interpolation way's `addr:street`. This prevents assigning numbers
   from a different street that happens to be nearby.
3. **Prefer coordinate-coincident points:** If an address point has the same
   coordinates (within 1 decimicrodegree) as the interpolation endpoint, prefer
   it. In OSM, interpolation way endpoints are typically placed on the address
   nodes they reference. This is a heuristic - coordinate coincidence is not
   node identity (address records don't retain source node IDs), but it works
   well in practice.
4. **Fall back to nearest:** Among same-street candidates within the search radius,
   pick the nearest.

If both endpoints resolve to address points with parseable numeric house numbers,
store the start/end numbers. Otherwise, the interpolation way is written with
start=0, end=0 (unusable for interpolation but still provides a street name match
for nearest-street queries).

### 3.6. Cleanup

Remove temporary files (dense node index, unsorted intermediate files). Print
summary statistics.

## 4. On-disk index format

All files use **little-endian** byte order. All coordinates are **i32
decimicrodegrees** (10^-7 degrees), matching nidhogg's disk format convention.
String references are **u32 byte offsets** into `strings.bin`.

**Serialization approach:** Records are read and written via manual byte-level
helpers (`read_u32_le`, `write_i32_le`, etc.), not `#[repr(C)]` transmutation.
This avoids alignment padding issues (several records mix u32/u16/u8 fields that
would get padding under `repr(C)`) and is safe on all architectures. The format
structs in Rust use `#[repr(Rust)]` with explicit `from_bytes` / `to_bytes`
methods.
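
The helpers amount to a few lines each; a sketch of the shape (no unsafe, no
alignment assumptions):

```rust
/// Fixed-width little-endian reads from a raw byte buffer.
fn read_u32_le(buf: &[u8], offset: usize) -> u32 {
    u32::from_le_bytes(buf[offset..offset + 4].try_into().unwrap())
}

fn read_i32_le(buf: &[u8], offset: usize) -> i32 {
    i32::from_le_bytes(buf[offset..offset + 4].try_into().unwrap())
}

/// Append-style writes used when serializing records.
fn write_u32_le(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

fn write_i32_le(out: &mut Vec<u8>, v: i32) {
    out.extend_from_slice(&v.to_le_bytes());
}
```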

**Offset widths:** Files exceeding 4 GB at planet scale use **u64** offsets:
`street_nodes.bin` (~24 GB) and `street_entries.bin` (~9 GB, due to segment-level
indexing). All other offset fields are **u32**.

### 4.0. `geocode_header.bin` - Index header

Fixed-size, 128 bytes (64 bytes used, 64 bytes reserved for future fields
without a format version bump):

| Offset | Type | Field |
|--------|------|-------|
| 0 | `[u8; 4]` | Magic: `GIDX` |
| 4 | `u32` | Format version (1) |
| 8 | `u8` | Street cell level (default 17) |
| 9 | `u8` | Coarse cell level (default 14) |
| 10 | `u8` | Admin cell level (default 10) |
| 11 | `u8` | Reserved (zero) |
| 12 | `u16` | Max admin vertices (default 500) |
| 14 | `u16` | Reserved (zero) |
| 16 | `f32` | Fine search radius in meters (default 75.0) |
| 20 | `f32` | Coarse search radius in meters (default 1000.0) |
| 24 | `u32` | Replication sequence number (0 if unknown) |
| 28 | `u64` | Replication timestamp (unix seconds, 0 if unknown) |
| 36 | `u32` | Address point count |
| 40 | `u32` | Street way count |
| 44 | `u32` | Interpolation way count |
| 48 | `u32` | Admin polygon count |
| 52 | `u32` | Geo cell count (street-level) |
| 56 | `u32` | Coarse cell count |
| 60 | `u32` | Admin cell count |

The reader checks magic + version on load and rejects mismatches. The cell levels
and search radius are stored so the reader doesn't need to assume defaults.
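
A sketch of the load-time check (struct and function names are illustrative;
only a few of the table's fields are shown):

```rust
struct GeocodeHeader {
    version: u32,
    street_level: u8,
    coarse_level: u8,
    admin_level: u8,
    addr_point_count: u32,
}

/// Validate magic + version, then decode fields at their table offsets.
fn parse_header(buf: &[u8]) -> Result<GeocodeHeader, String> {
    if buf.len() < 128 {
        return Err("header too short".into());
    }
    if &buf[0..4] != b"GIDX" {
        return Err("bad magic".into());
    }
    let version = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    if version != 1 {
        return Err(format!("unsupported format version {version}"));
    }
    Ok(GeocodeHeader {
        version,
        street_level: buf[8],
        coarse_level: buf[9],
        admin_level: buf[10],
        addr_point_count: u32::from_le_bytes(buf[36..40].try_into().unwrap()),
    })
}
```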

### 4.1. `geo_cells.bin` - Merged street-level cell index

Sorted by cell_id. One entry per unique cell. Binary-searchable.

| Offset | Type | Field |
|--------|------|-------|
| 0 | `u64` | S2 cell ID at street level |
| 8 | `u64` | Byte offset into `street_entries.bin` (u64::MAX = no streets) |
| 16 | `u32` | Byte offset into `addr_entries.bin` (0xFFFF_FFFF = no addresses) |
| 20 | `u32` | Byte offset into `interp_entries.bin` (0xFFFF_FFFF = no interpolation) |

Record size: **24 bytes**.

Street entry offset is u64: segment-level indexing produces ~1.5B entries at
planet scale (6 bytes each = ~9 GB), exceeding the u32 4 GB limit. Address
and interpolation entry offsets remain u32 (addr_entries ~2 GB,
interp_entries ~400 MB).
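
The binary search works directly on the mmap'd bytes, stepping in 24-byte
record strides; a sketch:

```rust
const GEO_CELL_RECORD: usize = 24;

/// Binary search `geo_cells.bin` contents for a cell ID. Returns the byte
/// offset of the matching record, if present.
fn find_geo_cell(cells: &[u8], cell_id: u64) -> Option<usize> {
    let n = cells.len() / GEO_CELL_RECORD;
    let (mut lo, mut hi) = (0usize, n);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let off = mid * GEO_CELL_RECORD;
        let id = u64::from_le_bytes(cells[off..off + 8].try_into().unwrap());
        match id.cmp(&cell_id) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => return Some(off),
        }
    }
    None
}
```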

### 4.2. `street_entries.bin` - Street segment refs per cell

Variable-length records, packed sequentially. Each record referenced by a byte
offset from `geo_cells.bin`. Entries point to **segments** (pairs of consecutive
nodes), not whole ways:

| Type | Field |
|------|-------|
| `u16` | Entry count (N) |
| `[SegmentRef; N]` | Street segment references |

Each `SegmentRef` is 6 bytes:

| Type | Field |
|------|-------|
| `u32` | Way index (into `street_ways.bin`) |
| `u16` | Segment index (node pair starting at this index within the way) |

At query time, the reader reads exactly 2 nodes (16 bytes) from
`street_nodes.bin` per candidate: `way.node_offset + segment_index * 8` and
`way.node_offset + (segment_index + 1) * 8`. No full-polyline iteration, no
per-way dedup hash table. The way header is only read for the street name.
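
The offset arithmetic above can be sketched as (illustrative function name;
`nodes` is the raw contents of `street_nodes.bin`):

```rust
/// Read the two endpoint coordinates of one street segment: 2 nodes,
/// 16 bytes, at `node_offset + segment_index * 8`.
fn segment_endpoints(
    nodes: &[u8],
    node_offset: u64,
    segment_index: u16,
) -> ((i32, i32), (i32, i32)) {
    let read = |byte_off: usize| {
        let lat = i32::from_le_bytes(nodes[byte_off..byte_off + 4].try_into().unwrap());
        let lon = i32::from_le_bytes(nodes[byte_off + 4..byte_off + 8].try_into().unwrap());
        (lat, lon)
    };
    let base = node_offset as usize + segment_index as usize * 8;
    (read(base), read(base + 8))
}
```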

### 4.3. `addr_entries.bin` - Address point IDs per cell

| Type | Field |
|------|-------|
| `u16` | Entry count (N) |
| `[u32; N]` | Address point indices (into `addr_points.bin`) |

Address points are single points - no segment indexing needed.

### 4.4. `interp_entries.bin` - Interpolation segment refs per cell

Same segment-level layout as street entries:

| Type | Field |
|------|-------|
| `u16` | Entry count (N) |
| `[SegmentRef; N]` | Interpolation segment references (6 bytes each) |

For interpolation, computing the house number requires way-level context
(total length, start/end house numbers). The interpolation path:
1. Read 2 nodes for the matched segment (distance check) - same as streets.
2. For the nearest interpolation hit only (in `query()`) or on demand (via
   `Reader::interpolate()` for `candidates()` users), read the way header
   from `interp_ways.bin` for start/end numbers and total node count.
3. Compute total way length and accumulated distance to the snap point by
   reading **all** nodes in the way. This is O(node_count), not
   O(segment_index), because total_way_length requires the full polyline.
   Interpolation ways are typically short (5-20 nodes), so this is
   acceptable.
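
Once the distance ratio is known, the number itself is simple arithmetic. A
sketch (the caller is assumed to have computed `t` = accumulated distance to
the snap point divided by total way length):

```rust
/// Interpolated house number at a snap point. `interp_type` follows
/// `interp_ways.bin`: 0 = all, 1 = even, 2 = odd.
fn interpolated_number(start: u32, end: u32, t: f64, interp_type: u8) -> u32 {
    let raw = start as f64 + (end as f64 - start as f64) * t;
    let mut n = raw.round() as u32;
    // Nudge onto the required parity for even/odd interpolation.
    match interp_type {
        1 if n % 2 == 1 => n += 1,
        2 if n % 2 == 0 => n += 1,
        _ => {}
    }
    n.clamp(start.min(end), start.max(end))
}
```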

**Entry count overflow (sections 4.2-4.4):** The u16 entry count limits each cell
to 65535 entries. At level 17 (~77m cells) with segment-level indexing, each
segment maps to ~1 cell, so the count per cell is bounded by the number of
segments intersecting a 77m area - well under 65535 even in dense urban areas.
The builder emits a debug assertion if any cell exceeds this, and truncates with
a warning if triggered.

### 4.4a. Coarse-level fallback index

In rural areas (e.g., northern Scandinavia), the nearest named street can be
500m+ from a query point - beyond the ~230m reach of the 9-cell level-17
neighborhood. A coarse fallback index at level 14 (~620m cells, 9-cell
neighborhood covers ~1.8 km) handles these cases.

**`coarse_geo_cells.bin`** - same layout as `geo_cells.bin` (24 bytes/record)
but indexed at `coarse_level` (default 14):

| Offset | Type | Field |
|--------|------|-------|
| 0 | `u64` | S2 cell ID at coarse level |
| 8 | `u64` | Byte offset into `coarse_street_entries.bin` |
| 16 | `u32` | Byte offset into `coarse_addr_entries.bin` |
| 20 | `u32` | Byte offset into `coarse_interp_entries.bin` |

**`coarse_street_entries.bin`**, **`coarse_addr_entries.bin`**,
**`coarse_interp_entries.bin`** - same variable-length layout as their
fine-level counterparts (segment refs for streets/interp, u32 indices for
addresses). They reference the same underlying data files.

The builder generates both fine and coarse cell indices in pass 4 - each
geometry is assigned cells at both levels. The coarse cell index file itself is
small (~800 MB vs ~7 GB for fine, since level 14 has ~8x fewer cells). However,
the coarse **entry files** are roughly the same size as the fine entry files
because every segment appears once at each level - the entries are just grouped
into fewer, larger cells. This is a deliberate space-for-coverage tradeoff:
the coarse index roughly doubles the entry file storage (~12 GB at planet) to
cover rural areas where the fine level's 9-cell neighborhood is too narrow.

### 4.5. `street_ways.bin` - Street way geometry

Fixed-size header per way, followed by variable node data in `street_nodes.bin`:

| Offset | Type | Field |
|--------|------|-------|
| 0 | `u64` | Node offset (byte index into `street_nodes.bin`) |
| 8 | `u32` | Name string offset (into `strings.bin`) |
| 12 | `u16` | Node count |

Record size: **14 bytes**.

Node offset is u64 because `street_nodes.bin` reaches ~24 GB at planet scale.
u16 for node count (max 65535) instead of traccar-geocoder's u8 (max 255) -
some OSM ways have >255 nodes.

### 4.6. `street_nodes.bin` - Street way node coordinates

All street way nodes, concatenated. Each way's nodes are contiguous:

| Type | Field |
|------|-------|
| `i32` | Latitude (decimicrodegrees) |
| `i32` | Longitude (decimicrodegrees) |

Record size: **8 bytes** per node.

### 4.7. `addr_points.bin` - Address points

Fixed-size records:

| Offset | Type | Field |
|--------|------|-------|
| 0 | `i32` | Latitude (decimicrodegrees) |
| 4 | `i32` | Longitude (decimicrodegrees) |
| 8 | `u32` | House number string offset |
| 12 | `u32` | Street name string offset |
| 16 | `u32` | Postcode string offset (0 = no postcode on element) |

Record size: **20 bytes**.

The postcode field stores `addr:postcode` from the source element when present.
Many countries (US, UK, much of Asia) have sparse or no postal code boundary
data in OSM; individual elements often carry `addr:postcode` directly. At query
time, if no `boundary=postal_code` polygon contains the query point, the reader
falls back to the postcode of the nearest address point.

### 4.8. `interp_ways.bin` - Interpolation way metadata

| Offset | Type | Field |
|--------|------|-------|
| 0 | `u64` | Node offset (byte index into `interp_nodes.bin`) |
| 8 | `u32` | Street name string offset |
| 12 | `u32` | Start house number (0 = unresolved) |
| 16 | `u32` | End house number (0 = unresolved) |
| 20 | `u16` | Node count |
| 22 | `u8` | Interpolation type: 0=all, 1=even, 2=odd |

Record size: **23 bytes**.

Node offset is u64 for consistency with street_ways (interp_nodes.bin is small,
but using u32 here would be a format inconsistency that invites bugs).

### 4.9. `interp_nodes.bin` - Interpolation way node coordinates

Same layout as `street_nodes.bin`: `(i32 lat, i32 lon)` pairs, **8 bytes** each.

### 4.10. `admin_cells.bin` - Admin boundary cell index

Sorted by cell_id. Binary-searchable:

| Offset | Type | Field |
|--------|------|-------|
| 0 | `u64` | S2 cell ID at admin level |
| 8 | `u32` | Byte offset into `admin_entries.bin` |

Record size: **12 bytes**.

### 4.11. `admin_entries.bin` - Admin polygon IDs per cell

Variable-length records:

| Type | Field |
|------|-------|
| `u16` | Entry count (N) |
| `[u32; N]` | Polygon indices (into `admin_polygons.bin`) |

High bit (0x8000_0000) on a polygon index marks an **interior cell hint**: the
polygon is likely to contain query points in this cell. The reader still
performs point-in-polygon on these entries but tests them first and can skip
remaining same-level candidates once confirmed. See section 3.4 for semantics.

### 4.12. `admin_polygons.bin` - Admin boundary metadata

| Offset | Type | Field |
|--------|------|-------|
| 0 | `f32` | Approximate area (square degrees, for smallest-polygon selection) |
| 4 | `u32` | Vertex offset (byte index into `admin_vertices.bin`) |
| 8 | `u32` | Vertex count |
| 12 | `u32` | Name string offset |
| 16 | `u16` | Country code (ISO 3166-1 alpha2, packed: `(c0 << 8) | c1`) |
| 18 | `u8` | Admin level (2-11) |
| 19 | `u8` | Reserved (zero) |

Record size: **20 bytes**.

Fields are ordered largest-first to avoid alignment padding. Vertex count is u32
to allow bypassing simplification (`--max-admin-vertices 0`); with the default cap
of 500, u16 would suffice, but u32 keeps the format general.
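
The country code packing is trivial but easy to get backwards; a sketch
(assumes a two-byte ASCII code, as ISO 3166-1 alpha2 guarantees):

```rust
/// Pack an alpha2 country code into the u16 field: `(c0 << 8) | c1`.
fn pack_country(code: &str) -> u16 {
    let b = code.as_bytes();
    ((b[0] as u16) << 8) | b[1] as u16
}

/// Recover the two ASCII bytes from the packed field.
fn unpack_country(packed: u16) -> [u8; 2] {
    [(packed >> 8) as u8, (packed & 0xFF) as u8]
}
```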

### 4.13. `admin_vertices.bin` - Admin polygon vertices

All polygon vertices, concatenated. Each polygon's vertices are contiguous.
Exterior ring first, then holes (if any), separated by a sentinel
`(i32::MIN, i32::MIN)`:

| Type | Field |
|------|-------|
| `i32` | Latitude (decimicrodegrees) |
| `i32` | Longitude (decimicrodegrees) |

Record size: **8 bytes** per vertex.
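
An even-odd ray-cast over this layout handles holes for free: crossings of the
hole rings cancel crossings of the exterior ring. A sketch (decoded vertices,
not raw bytes; function name illustrative):

```rust
/// Even-odd point-in-polygon over sentinel-separated rings. `verts` is one
/// polygon's (lat, lon) pairs with rings separated by (i32::MIN, i32::MIN).
fn point_in_polygon(verts: &[(i32, i32)], lat: i32, lon: i32) -> bool {
    let mut inside = false;
    for ring in verts.split(|&v| v == (i32::MIN, i32::MIN)) {
        let n = ring.len();
        if n < 3 {
            continue;
        }
        // Cast a ray toward -lon; flip parity on each crossing edge.
        let mut j = n - 1;
        for i in 0..n {
            let (yi, xi) = (ring[i].0 as i64, ring[i].1 as i64);
            let (yj, xj) = (ring[j].0 as i64, ring[j].1 as i64);
            if (yi > lat as i64) != (yj > lat as i64) {
                // Edge longitude at the query latitude, in f64 to avoid
                // integer-division truncation.
                let x = xi as f64
                    + (lat as i64 - yi) as f64 / (yj - yi) as f64 * (xj - xi) as f64;
                if (lon as f64) < x {
                    inside = !inside;
                }
            }
            j = i;
        }
    }
    inside
}
```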

### 4.14. `strings.bin` - Global string pool

Null-terminated UTF-8 strings, concatenated. Every string reference in the index
is a u32 byte offset into this file. Offset 0 is always an empty string (`\0`).

Strings are globally deduplicated during the build via an `FxHashMap<String, u32>`
mapping string content to its byte offset.
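
A sketch of the pool (std `HashMap` stands in for `FxHashMap`; the behavior is
the same):

```rust
use std::collections::HashMap;

/// Append-only string pool with global dedup. Offset 0 is the empty
/// string, matching the format.
struct StringPool {
    bytes: Vec<u8>,
    offsets: HashMap<String, u32>,
}

impl StringPool {
    fn new() -> Self {
        let mut pool = StringPool { bytes: vec![0], offsets: HashMap::new() };
        pool.offsets.insert(String::new(), 0);
        pool
    }

    /// Return the byte offset of `s`, appending it null-terminated on
    /// first sight.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&off) = self.offsets.get(s) {
            return off;
        }
        let off = self.bytes.len() as u32;
        self.bytes.extend_from_slice(s.as_bytes());
        self.bytes.push(0);
        self.offsets.insert(s.to_owned(), off);
        off
    }
}
```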

### File count and estimated sizes

| | Denmark (465 MB) | Planet (~87 GB) |
|---|---|---|
| geo_cells.bin | ~35 MB | ~7 GB |
| street_entries.bin | ~30 MB | ~9 GB |
| addr_entries.bin | ~10 MB | ~2 GB |
| interp_entries.bin | ~2 MB | ~400 MB |
| coarse_geo_cells.bin | ~4 MB | ~800 MB |
| coarse_*_entries.bin | ~35 MB | ~12 GB |
| street_ways.bin | ~15 MB | ~3 GB |
| street_nodes.bin | ~120 MB | ~24 GB |
| addr_points.bin | ~60 MB | ~10 GB |
| interp_ways.bin | ~2 MB | ~300 MB |
| interp_nodes.bin | ~5 MB | ~1 GB |
| admin_cells.bin | ~1 MB | ~200 MB |
| admin_entries.bin | ~500 KB | ~100 MB |
| admin_polygons.bin | ~100 KB | ~6 MB |
| admin_vertices.bin | ~5 MB | ~500 MB |
| strings.bin | ~20 MB | ~2 GB |
| **Total** | **~360 MB** | **~71 GB** |

These are rough estimates. The planet estimate is ~4x larger than
traccar-geocoder's 18 GB. The main cost drivers: segment-level street entries
at both fine and coarse levels (~9 GB each = ~18 GB for street entries alone),
postcode per address point (20 vs 16 bytes), and wider record formats (u16
node counts, u64 node offsets). The coarse entry files are roughly the same
size as the fine entry files because every segment appears once at each level -
the coarse level has fewer cells but more entries per cell.

The tradeoff is accepted: segment indexing eliminates per-way polyline iteration
on the query path, the coarse index covers rural areas with a wider search
radius, and 71 GB fits comfortably on a 128 GB SSD alongside the source PBF.
Actual sizes will be validated on Denmark during implementation.

## 5. Reader API

The reader lives in `pbfhogg::geocode_index` (library crate, gated behind the
`geocode-reader` feature). It memory-maps all index files and provides two
query levels: raw candidates and ranked results.

### 5.1. Public types

```rust
pub mod geocode_index {
    /// A memory-mapped reverse geocoding index.
    pub struct Reader { /* private */ }

    /// Result of a reverse geocoding query. Borrows strings from the
    /// Reader's mmap'd string pool - zero allocation on the query path.
    pub struct ReverseResult<'a> {
        /// Nearest address point, if any within search radius.
        pub address: Option<AddressMatch<'a>>,
        /// Nearest named street, if any within search radius.
        pub street: Option<StreetMatch<'a>>,
        /// Interpolated address, if any.
        pub interpolation: Option<InterpolationMatch<'a>>,
        /// Administrative hierarchy (one entry per admin level found).
        /// Admin levels are 2-11, so max 10 entries. ArrayVec avoids heap
        /// allocation on the query hot path.
        pub admin: ArrayVec<AdminMatch<'a>, 10>,
    }

    pub struct AddressMatch<'a> {
        pub lat_e7: i32,
        pub lon_e7: i32,
        pub house_number: &'a str,
        pub street: &'a str,
        /// addr:postcode from the source element, if present.
        pub postcode: Option<&'a str>,
        pub distance_m: f64,
    }

    pub struct StreetMatch<'a> {
        pub name: &'a str,
        /// Closest point on the street segment.
        pub snap_lat_e7: i32,
        pub snap_lon_e7: i32,
        pub distance_m: f64,
    }

    pub struct InterpolationMatch<'a> {
        pub street: &'a str,
        pub house_number: u32,
        pub distance_m: f64,
    }

    pub struct AdminMatch<'a> {
        pub admin_level: u8,
        pub name: &'a str,
        /// ISO 3166-1 alpha-2, only populated for admin_level=2.
        pub country_code: Option<&'a str>,
    }
}
```

All string fields are `&'a str` referencing the mmap'd `strings.bin` via the
Reader's lifetime. This makes `query()` allocation-free - the only work is
binary search, distance math, and point-in-polygon. Callers that need owned
strings (e.g., for JSON serialization) copy at the point of use.

### 5.2. Two-layer API

The reader exposes two levels: a low-level `candidates()` that returns raw
spatial matches with distances, and a high-level `query()` that applies default
ranking. This lets nidhogg (or future consumers) implement custom ranking
without forking the reader.

```rust
impl Reader {
    /// Open an index directory. Memory-maps all files.
    /// Returns an error if any file is missing or the header is invalid.
    pub fn open(dir: &Path) -> Result<Self>;

    // --- Low-level: raw candidates ---

    /// Return all street/address/interpolation candidates within the search
    /// radius, and all admin polygons containing the point. No ranking.
    /// Tries fine level first, falls back to coarse if no street/address.
    pub fn candidates(&self, lat: f64, lon: f64) -> Candidates<'_>;

    // --- High-level: ranked result ---

    /// Query with default ranking: nearest address, nearest street, nearest
    /// interpolation, smallest-area admin per level, postcode fallback.
    /// Allocation-free fast path - does not materialize all candidates.
    /// Equivalent in result to `self.candidates(lat, lon).into_result(self)`.
    pub fn query(&self, lat: f64, lon: f64) -> ReverseResult<'_>;

    /// Header metadata.
    pub fn replication_sequence(&self) -> u32;
    pub fn replication_timestamp(&self) -> u64;
    pub fn street_cell_level(&self) -> u8;
    pub fn coarse_cell_level(&self) -> u8;
    pub fn admin_cell_level(&self) -> u8;
}

/// Raw candidates from a spatial lookup. Unranked, untruncated.
/// Uses Vec - callers choosing the low-level API accept the allocation.
pub struct Candidates<'a> {
    /// All address points within search radius, with distances.
    pub addresses: Vec<AddressMatch<'a>>,
    /// All street segments within search radius, with distances.
    /// Name is resolved for every candidate (way header read per hit).
    pub streets: Vec<StreetMatch<'a>>,
    /// All interpolation segment hits within search radius.
    /// House numbers are NOT computed - use Reader::interpolate() to
    /// resolve a specific candidate into a house number.
    pub interpolations: Vec<InterpolationCandidate<'a>>,
    /// All admin polygons containing the query point - may include
    /// multiple polygons at the same admin level (e.g., overlapping
    /// municipal boundaries). Not collapsed to one-per-level.
    pub admin: Vec<AdminMatch<'a>>,
}

/// A raw interpolation hit. The house number is not yet computed because
/// that requires reading the full way geometry (O(node_count)).
pub struct InterpolationCandidate<'a> {
    pub street: &'a str,
    pub way_index: u32,
    pub segment_index: u16,
    pub distance_m: f64,
    /// Closest point on the segment.
    pub snap_lat_e7: i32,
    pub snap_lon_e7: i32,
}

impl<'a> Candidates<'a> {
    /// Apply default ranking: nearest of each type, smallest admin per level,
    /// postcode fallback from nearest address if no postal boundary.
    /// Resolves interpolation for the nearest interpolation candidate.
    pub fn into_result(self, reader: &'a Reader) -> ReverseResult<'a>;
}

impl Reader {
    // ... (open, candidates, query as before)

    /// Compute the interpolated house number for a candidate.
    /// Reads the full way geometry to compute accumulated distance.
    /// Returns None if the way has unresolved start/end numbers.
    pub fn interpolate(&self, candidate: &InterpolationCandidate<'_>) -> Option<u32>;
}
```

`candidates()` returns all matches within the applicable search radius (fine
or coarse). All four fields use `Vec` - callers choosing the low-level API
accept the allocation cost. Note that `ReverseResult.admin` can use
`ArrayVec<_, 10>` because it is collapsed to one entry per admin level (2-11),
whereas `Candidates.admin` keeps every containing polygon and is therefore
unbounded.

`query()` is allocation-free internally: it tracks only the nearest candidate
of each type during iteration, never materializing the full candidate set.
It calls `interpolate()` only for the single nearest interpolation hit.

`candidates()` reads the way header (for street name) for every retained
street/interpolation hit. This is one 14-byte read per candidate from the
mmap'd file - cheap compared to the distance math.

nidhogg can call `candidates()` directly to implement custom ranking (e.g.,
prefer a specific admin level, apply language-specific street name preferences,
or combine with forward-geocoding signals).
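
As a sketch of what consumer-side ranking over `candidates()` could look like - using a simplified, owned stand-in for the borrowed `AdminMatch` type and a hypothetical preference for municipality-level names:

```rust
/// Simplified, owned stand-in for AdminMatch - the real type borrows
/// its name from the Reader's mmap'd string pool.
pub struct AdminMatch {
    pub admin_level: u8,
    pub name: String,
}

/// A hypothetical consumer-side policy: prefer the municipality
/// (admin_level 8) if present, otherwise fall back to the deepest
/// admin level that contained the query point.
pub fn pick_locality(admin: &[AdminMatch]) -> Option<&AdminMatch> {
    admin
        .iter()
        .find(|a| a.admin_level == 8)
        .or_else(|| admin.iter().max_by_key(|a| a.admin_level))
}
```

The point of the two-layer split is that policies like this live entirely in the consumer; the reader only guarantees an unranked, untruncated candidate set.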

### 5.3. Query algorithm

`Reader::query(lat, lon)`:

**Step 1 - Fine street-level lookup:**
```
cell = S2CellID::from(LatLng::from_degrees(lat, lon)).parent(street_level)
cells = [cell] + cell.all_neighbors(street_level)   // 9 cells
```

**Step 2 - For each cell, binary search `geo_cells.bin`:**
If found, read offsets for street_entries, addr_entries, interp_entries.
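
A sketch of the per-cell lookup, assuming `geo_cells.bin` maps to a sorted slice of fixed-size records (the field names here are illustrative, not the final on-disk layout):

```rust
/// A fixed-size record in geo_cells.bin (sketch; field layout illustrative).
#[derive(Clone, Copy)]
struct CellRecord {
    cell_id: u64,
    street_off: u32,
    addr_off: u32,
    interp_off: u32,
}

/// Binary search the sorted cell table for one S2 cell ID.
/// Returns None for cells with no indexed content.
fn find_cell(cells: &[CellRecord], cell_id: u64) -> Option<&CellRecord> {
    cells
        .binary_search_by_key(&cell_id, |r| r.cell_id)
        .ok()
        .map(|i| &cells[i])
}
```

The full step 2 is nine such searches (the query cell plus its eight neighbors), each O(log n) over the cell table.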

**Step 3 - Score candidates (fine level):**

Address points: for each entry, compute approximate distance using the cos
projection formula:
```
dlat = (point.lat_e7 - lat_e7) as f64 * 1e-7 * PI / 180
dlng = (point.lon_e7 - lon_e7) as f64 * 1e-7 * PI / 180
dist_sq = dlat * dlat + dlng * dlng * cos_lat * cos_lat
```
Track the nearest point. `cos_lat` is computed once per query.
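
The same formula in Rust, as a sketch (the function name and e7-parameter convention are illustrative):

```rust
const EARTH_RADIUS_M: f64 = 6_371_000.0;

/// Squared distance (in radians^2) between a stored e7 point and the
/// query point, using the equirectangular (cos) projection. Multiply
/// the square root by EARTH_RADIUS_M to convert to meters.
fn approx_distance_sq_e7(
    lat_e7: i32, lon_e7: i32,
    q_lat_e7: i32, q_lon_e7: i32,
    cos_lat: f64,
) -> f64 {
    let dlat = (lat_e7 - q_lat_e7) as f64 * 1e-7 * std::f64::consts::PI / 180.0;
    let dlng = (lon_e7 - q_lon_e7) as f64 * 1e-7 * std::f64::consts::PI / 180.0;
    dlat * dlat + dlng * dlng * cos_lat * cos_lat
}
```

Because the comparison threshold is also squared (see step 5), the query never takes a square root on the hot path.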

Streets: for each segment ref, read exactly 2 nodes (16 bytes) from
`street_nodes.bin` at `way.node_offset + segment_index * 8`. Compute
point-to-segment distance (project query point onto segment, clamp parameter
to [0, 1]). Track nearest segment + snap point. Read the way header (name)
only for the nearest match.

No deduplication hash table is needed - segment refs are unique per cell.
A segment spanning a cell boundary appears in both cells' entry lists, but
each entry is a distinct (way, segment) pair. The same segment may be
evaluated twice across neighboring cells, but this is at most 2x work per
boundary segment, which is cheaper than maintaining a dedup structure.

Interpolation: for each segment ref, same 2-node distance check as streets.
For the nearest interpolation hit, read the way header from `interp_ways.bin`
to get start/end house numbers and total node count, then interpolate:
```
t = accumulated_distance_to_snap_point / total_way_length
// For type == all:
number = start + round(t * (end - start))
// For type == even or odd:
number = start + 2 * round(t * (end - start) / 2)
```
The even/odd formula preserves the parity of `start` naturally - OSM even
interpolation ways have even start/end values, and odd ways have odd values.
The step size of 2 with rounding to the nearest multiple keeps the output on
the correct parity without an additional +1 offset.
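
A sketch of the interpolation step, assuming the builder normalizes ways to ascending start/end numbers (names are illustrative):

```rust
#[derive(Clone, Copy)]
enum InterpolationType {
    All,
    Even,
    Odd,
}

/// Interpolate a house number at parameter t in [0, 1] along the way.
/// Assumes start <= end (an assumption in this sketch - descending ways
/// would need normalization first).
fn interpolate_number(start: u32, end: u32, t: f64, ty: InterpolationType) -> u32 {
    let span = (end - start) as f64;
    match ty {
        InterpolationType::All => start + (t * span).round() as u32,
        // Step of 2 with rounding to the nearest multiple keeps the
        // parity of `start` without an extra +1 offset.
        InterpolationType::Even | InterpolationType::Odd => {
            start + 2 * (t * span / 2.0).round() as u32
        }
    }
}
```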

**Step 3a - Coarse fallback (if no street or address found):**
If the fine-level search found no street and no address, repeat steps 1-3
against the coarse index (`coarse_geo_cells.bin` and its entry files) at
`coarse_level` (default 14), using the **coarse search radius** (default
1000m) instead of the fine radius (75m). The 9-cell neighborhood at level 14
covers ~1.8 km, and the wider radius allows matches up to 1 km away.

This two-level approach keeps the common case fast (urban queries hit level 17
only, 75m radius) while covering rural areas where the nearest named street
may be 500m+ away. The coarse search radius is stored separately in the header
(`coarse_search_radius`) and can be tuned independently.

**Step 4 - Admin lookup:**
```
cell = S2CellID::from(LatLng::from_degrees(lat, lon)).parent(admin_level)
cells = [cell] + cell.all_neighbors(admin_level)
```

For each cell, binary search `admin_cells.bin`. For each polygon entry:
- If high bit is set (interior hint): test this polygon first (likely match).
- Load vertices from `admin_vertices.bin`, run point-in-polygon.
- Once a match is confirmed at a given admin level, skip remaining candidates
  at that level (interior hints make this early-exit effective).

For each admin level (2-11), keep the smallest-area polygon that contains the
query point (handles nested boundaries like city within state).
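
The per-level tracking can be sketched as a fixed array indexed by admin level (names illustrative; the polygon handle here is just an index into the admin polygon table):

```rust
/// Tracks the smallest-area containing polygon for each admin level
/// 2..=11 during the admin scan. Slot 0 is level 2, slot 9 is level 11.
struct AdminBest {
    pub best: [Option<(f64 /* area */, u32 /* polygon index */)>; 10],
}

impl AdminBest {
    fn new() -> Self {
        AdminBest { best: [None; 10] }
    }

    /// Offer a confirmed containing polygon; keep it only if it is the
    /// smallest seen so far at its level.
    fn offer(&mut self, admin_level: u8, area: f64, polygon: u32) {
        if !(2..=11).contains(&admin_level) {
            return;
        }
        let slot = &mut self.best[(admin_level - 2) as usize];
        match slot {
            Some((a, _)) if *a <= area => {} // existing match is smaller
            _ => *slot = Some((area, polygon)),
        }
    }
}
```

This keeps the scan allocation-free and makes the "smallest per level" rule a single comparison per confirmed polygon.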

**Step 5 - Assemble `ReverseResult<'_>`.**

If no admin match at level 11 (postal code boundary) was found but the nearest
address point has a non-empty `postcode` field, use that as the postcode in the
result. This covers countries with sparse `boundary=postal_code` data.

Distance comparisons use `fine_max_distance_sq` (derived from
`fine_search_radius`) for the fine-level search, and `coarse_max_distance_sq`
(derived from `coarse_search_radius`) for the coarse fallback. Both thresholds
are precomputed from the header at `Reader::open` time. The cos projection
formula avoids all trig except one `cos()` call per query.
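
The threshold precomputation itself is trivial - one division and one multiply per radius, done once at open time (sketch):

```rust
const EARTH_RADIUS_M: f64 = 6_371_000.0;

/// Convert a search radius in meters into the squared-radians threshold
/// that approx-distance results are compared against. Computed once in
/// Reader::open for both the fine and coarse radii.
fn max_distance_sq(radius_m: f64) -> f64 {
    let r_rad = radius_m / EARTH_RADIUS_M;
    r_rad * r_rad
}
```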

### 5.4. Thread safety

`Reader` is `Send + Sync`. All state is in the mmap'd files (read-only). No
interior mutability. Multiple threads can call `query()` concurrently.
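
A sketch of the intended usage pattern, with a trivial stand-in for the mmap-backed Reader (the real type holds read-only memory maps rather than a `Vec`):

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for the real Reader: all state is read-only after open,
/// so the type is Send + Sync with no locking.
struct Reader {
    data: Vec<u8>,
}

impl Reader {
    fn query(&self, i: usize) -> u8 {
        self.data[i % self.data.len()]
    }
}

/// Fan queries out across threads through a shared Arc<Reader>.
fn concurrent_queries(reader: Arc<Reader>, n: usize) -> Vec<u8> {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let r = Arc::clone(&reader);
            thread::spawn(move || r.query(i))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Since `query()` takes `&self` and the index is immutable, a server can hold one `Arc<Reader>` and serve all request threads from it.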

## 6. Geometry module extraction

### 6.1. New module: `src/geo.rs`

Extract from `src/commands/extract.rs` into a shared geometry module:

```rust
pub mod geo {
    /// Ray-casting point-in-polygon test.
    /// Coordinates are (x, y) = (longitude, latitude) in degrees.
    pub fn point_in_ring(px: f64, py: f64, ring: &[(f64, f64)]) -> bool;

    /// Point-in-polygon with antimeridian wrapping.
    pub fn point_in_ring_with_antimeridian(px: f64, py: f64, ring: &[(f64, f64)]) -> bool;

    /// Test if a point is inside a polygon with holes.
    /// `exterior` is the outer ring, `holes` are inner rings.
    pub fn point_in_polygon(
        px: f64, py: f64,
        exterior: &[(f64, f64)],
        holes: &[&[(f64, f64)]],
    ) -> bool;

    /// Douglas-Peucker polyline simplification with a vertex cap.
    pub fn simplify_ring(ring: &[(f64, f64)], max_vertices: usize) -> Vec<(f64, f64)>;

    /// Approximate distance squared between two points (cos projection).
    /// Returns radians squared. Multiply by EARTH_RADIUS^2 for meters squared.
    pub fn approx_distance_sq(
        lat1_rad: f64, lon1_rad: f64,
        lat2_rad: f64, lon2_rad: f64,
        cos_lat: f64,
    ) -> f64;

    /// Closest point on a line segment to a query point.
    /// Returns (parameter t in [0,1], distance_sq).
    pub fn point_to_segment_distance_sq(
        px: f64, py: f64,
        ax: f64, ay: f64,
        bx: f64, by: f64,
        cos_lat: f64,
    ) -> (f64, f64);

    /// Assemble closed rings from a set of way segments.
    /// Each segment is a slice of (lat_e7, lon_e7) coordinate pairs.
    /// Returns closed rings as Vec of coordinate vectors.
    pub fn assemble_rings(segments: &[&[(i32, i32)]]) -> Vec<Vec<(i32, i32)>>;

    /// Signed area of a ring: positive for a CCW exterior ring,
    /// negative for a CW hole. The sign classifies the ring.
    pub fn signed_area(ring: &[(i32, i32)]) -> f64;
}
```
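
A possible implementation of `point_to_segment_distance_sq` (a sketch - the module listing above only fixes the signature):

```rust
/// Closest point on segment A-B to query point P, in the cos-projected
/// plane where (x, y) = (longitude, latitude). Returns (t in [0, 1],
/// squared distance in the same angular units as the inputs).
fn point_to_segment_distance_sq(
    px: f64, py: f64,
    ax: f64, ay: f64,
    bx: f64, by: f64,
    cos_lat: f64,
) -> (f64, f64) {
    // Scale x so one unit of longitude spans the same ground distance
    // as one unit of latitude at this latitude.
    let (px, ax, bx) = (px * cos_lat, ax * cos_lat, bx * cos_lat);
    let (dx, dy) = (bx - ax, by - ay);
    let len_sq = dx * dx + dy * dy;
    // Degenerate segment (both endpoints coincide): snap to A.
    let t = if len_sq == 0.0 {
        0.0
    } else {
        (((px - ax) * dx + (py - ay) * dy) / len_sq).clamp(0.0, 1.0)
    };
    let (cx, cy) = (ax + t * dx, ay + t * dy);
    let (ex, ey) = (px - cx, py - cy);
    (t, ex * ex + ey * ey)
}
```

The clamp on `t` is what makes this a segment distance rather than a line distance: queries past either endpoint snap to that endpoint.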

### 6.2. Refactoring extract.rs

`extract.rs` changes:
- `point_in_ring`, `point_in_ring_with_antimeridian`, `polygon_contains` become
  thin wrappers calling `geo::*`. The existing tests move to `geo.rs`.
- `PolygonRings` struct stays in extract.rs (it's specific to GeoJSON parsing).
- `BboxInt` stays in extract.rs (it's specific to the extract spatial filter).

The existing tests in extract.rs (lines 1958-2027) for `point_in_ring` move to
`src/geo.rs` and are extended with additional cases for `simplify_ring` and
`assemble_rings`.

## 7. S2 dependency

Add `s2 = "0.0.13"` to `Cargo.toml` under `[dependencies]`, gated behind the
`commands` feature (same gate as `roaring` and `serde_json`):

```toml
[dependencies]
s2 = { version = "0.0.13", optional = true }

[features]
geocode-reader = ["s2"]                              # reader only (what nidhogg links)
commands = ["roaring", "serde_json", "geocode-reader"] # full CLI including builder
```

The `geocode-reader` feature gates `pbfhogg::geocode_index::Reader` and its
dependencies (`s2` for cell ID computation at query time). nidhogg depends on
`pbfhogg` with `features = ["geocode-reader"]` - it does not need `roaring`,
`serde_json`, or the builder. The builder is gated behind `commands` as usual.

The `s2` crate provides:
- `CellID::from(LatLng).parent(level)` - cell ID computation
- `CellID::all_neighbors(level)` - 8 neighbors for query expansion
- `RegionCoverer` - line segment covering (for street/interpolation cell assignment)

The `s2` crate at v0.0.13 is the only maintained Rust S2 implementation. Its core
APIs (CellID, LatLng, neighbors) are stable and used in production by
traccar-geocoder. If `RegionCoverer` proves insufficient for segment covering, the
fallback is to compute cell IDs for both segment endpoints and, if they differ, walk
intermediate points along the segment to find all crossed cells - this is ~50 lines
of code and only needed during the build, not at query time.
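
The fallback can be sketched with the cell-ID computation abstracted behind a closure (in the real builder it would be s2's `CellID::from(...).parent(level)`; the fixed step count here is illustrative - the real walk would derive it from segment length versus cell size):

```rust
use std::collections::BTreeSet;

/// Collect the cells crossed by a segment by sampling intermediate
/// points. `cell_of` maps (lat, lon) to a cell ID at the target level.
fn segment_cells(
    a: (f64, f64),
    b: (f64, f64),
    cell_of: impl Fn(f64, f64) -> u64,
) -> BTreeSet<u64> {
    let mut cells = BTreeSet::new();
    let (ca, cb) = (cell_of(a.0, a.1), cell_of(b.0, b.1));
    cells.insert(ca);
    cells.insert(cb);
    if ca != cb {
        // Sample densely enough that no crossed cell is skipped.
        let steps = 16;
        for i in 1..steps {
            let t = i as f64 / steps as f64;
            let lat = a.0 + t * (b.0 - a.0);
            let lon = a.1 + t * (b.1 - a.1);
            cells.insert(cell_of(lat, lon));
        }
    }
    cells
}
```

The common case (both endpoints in the same cell) does no sampling at all, which matters because most street segments are much shorter than a level-17 cell.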

If the crate becomes unmaintained, vendoring the ~200 lines for cell ID computation
+ parent + neighbors is straightforward. The reader only needs `CellID::from` and
`all_neighbors`; `RegionCoverer` is builder-only.

For admin polygon covering, the `s2` crate's polygon/polyline support is
incomplete. We use a two-phase approach instead: (1) walk ring segments to find
edge cells, (2) flood-fill from a verified interior seed point to find interior
cells. See section 3.4 for details.

## 8. Failure modes and resumability

### Build failures

| Failure | Behavior |
|---------|----------|
| Input PBF unreadable / corrupt | Error on header read, exit 1 |
| No indexdata and no `--force` | Error message with hint, exit 1 |
| Out of disk space | Error during write, partial output removed |
| Out of memory (dense index) | OOM killed by kernel |
| Interrupted (SIGINT) | Partial output directory removed (cleanup handler) |

The build is **not resumable**. A failed build requires re-running from scratch.
This is acceptable because the build runs inside a pipeline where the upstream
steps (cat, ALTW) are also non-resumable. The weekly pipeline already handles
restarts.

### Reader failures

| Failure | Behavior |
|---------|----------|
| Missing index file | `Reader::open` returns `Err` |
| Header magic/version mismatch | `Reader::open` returns `Err` |
| Corrupt index (truncated file) | Undefined - mmap reads may return garbage. A post-build validation step (section 9) catches this before serving. |

## 9. Validation strategy

### Build-time validation

After writing all files, the builder re-opens them with `Reader::open` and runs
a smoke test:
- Query a known point (the centroid of the first address point written).
- Verify the result contains a non-empty street or house number.
- Verify at least one admin level resolves.

If the smoke test fails, the build exits 1 and removes the output directory.

### Test corpus

**Unit tests (in `src/geo.rs`):**
- `point_in_ring`: existing extract.rs tests (square, triangle, L-shape, degenerate,
  antimeridian) plus new cases for complex multipolygons.
- `simplify_ring`: verify vertex count ≤ cap, verify simplified ring still contains
  original centroid.
- `assemble_rings`: single ring, multiple segments needing reversal, unclosed input.
- `point_to_segment_distance_sq`: colinear, perpendicular, endpoint cases.

**Unit tests (in `src/geocode_index/mod.rs`):**
- Round-trip: build a small index from synthetic data (10 streets, 50 addresses,
  2 admin polygons), query known points, verify results.
- Binary search edge cases: query cell with no entries, query cell at array boundary.
- Interior cell hint: verify interior-flagged cells still produce correct PIP results.
- String pool: verify deduplication and offset correctness.
- **API equivalence:** For every test query, assert
  `reader.query(lat, lon) == reader.candidates(lat, lon).into_result(&reader)`.
  The two code paths (allocation-free tracking vs full materialization + ranking)
  must produce identical results.

**Integration test (in `tests/geocode_index.rs`, `#[ignore]`):**
- Build index from Denmark PBF.
- Query known addresses (Copenhagen city hall: 55.6761, 12.5683 → Rådhuspladsen).
- Query known admin hierarchy (55.6761, 12.5683 → København, Hovedstaden, Danmark).
- Query rural point with no nearby address (verify empty address, but admin resolves).
- Verify index file sizes are within expected ranges.

**Cross-validation with traccar-geocoder (manual, not automated):**
Build both indices from the same Denmark PBF. Query 100 random points. Compare
street/address/admin results. Document discrepancies (expected: differences in
interpolation and edge-case polygon containment).

## 10. Implementation order

1. **`src/geo.rs`** - Extract geometry primitives from extract.rs. Add
   `simplify_ring`, `assemble_rings`, distance functions. Unit tests.
2. **`src/geocode_index/format.rs`** - Define the on-disk format structs, header,
   magic, version. Manual byte-level `from_bytes` / `to_bytes` methods (not
   `#[repr(C)]` transmutation - see section 4 serialization approach).
3. **`src/geocode_index/reader.rs`** - `Reader::open`, `Reader::query`. Unit tests
   with synthetic index.
4. **`src/geocode_index/builder.rs`** - The four-pass build pipeline. String pool.
   Cell assignment. Index writing.
5. **`cli/src/main.rs`** - `BuildGeocodeIndex` command variant. Wire up CLI args.
6. **Integration tests** - Denmark round-trip.
7. **Benchmarking** - Add `brokkr bench commands build-geocode-index` support.
   Measure Denmark build time and query latency.

Steps 1-3 can proceed without the S2 dependency (use stub cell IDs for unit tests).
Step 4 introduces the S2 crate. Step 5 is mechanical CLI wiring. Steps 6-7 require
a Denmark dataset.