Geograph — Design Document
1. Purpose
geograph is a self-contained Rust crate for representing and querying planar
maps: complete subdivisions of a geographic region into non-overlapping units.
It exposes a geometry- and topology-only API (no attributes, no optimization) and
is designed as the geometric foundation for redistricting computations in
openmander.
2. Planar Map Model
A Region is a finite set of Units whose interiors are pairwise disjoint and whose union equals the region's support polygon.
Both the region and individual units are MultiPolygons:
- A unit may be multiply connected (has interior holes).
- A unit may consist of multiple exclaves (disjoint pieces / multipolygon).
- The region itself may be a multipolygon (e.g. a state with islands).
The region is assumed to have no gaps — every point in the region's support belongs to exactly one unit.
2.1 Boundary topology
Two units share a Rook adjacency if they share a 1-D boundary segment (positive-length shared edge). They share a Queen adjacency if they share at least one point (including corners).
Interior holes (surrounded units not in a query subset) are topologically distinct from the outer boundary of the region.
3. Internal Representation
3.1 DCEL (Doubly Connected Edge List)
The DCEL is the primary internal data structure. It encodes the full planar embedding: which faces are adjacent, what the boundary of each face looks like, and which vertices are shared.
Mapping to the planar map:
| DCEL element | Planar map meaning |
|---|---|
| Bounded face | One polygon ring of a unit (including the exterior unit) |
Unbounded face (FaceId(0)) |
One face of the exterior unit |
Half-edge h |
One directed side of a shared or exterior boundary segment |
h.twin |
The same segment seen from the other side |
| Vertex | A corner point shared by ≥ 2 polygon rings |
Every DCEL face belongs to exactly one unit. A unit may own multiple faces (exclaves map to multiple bounded faces; holes in a unit's polygon are faces belonging to whatever unit fills them). This applies to the exterior unit too:
Exterior unit (UnitId::EXTERIOR)
├── Unbounded face (FaceId(0)) → the region outside the state boundary
├── Hole face H1 → an interior gap with no unit assigned
└── Hole face H2 → another gap (e.g. a lake not in the data)
Regular unit U (exclave with hole)
├── Outer ring of exclave 1 → DCEL face F1
├── Outer ring of exclave 2 → DCEL face F2
└── Hole in exclave 1 → DCEL face F3, owned by whatever unit fills it
(or UnitId::EXTERIOR if the hole is empty)
A face_to_unit: Vec<UnitId> table maps every DCEL face (including the
unbounded face) to its unit. UnitId::EXTERIOR is a reserved sentinel; all
other UnitIds refer to input units. UnitId::EXTERIOR is never a valid
district assignment.
3.2 Adjacency tables
Two separate CSR adjacency matrices are built eagerly during construction and
stored on Region:
Rook adjacency (shared edge — positive-length boundary):
- Walk every half-edge
hin the DCEL. - If
face_to_unit[h.face] != face_to_unit[h.twin.face], emit the triple(face_to_unit[h.face], face_to_unit[h.twin.face], edge_length[h / 2]). - Triples involving
UnitId::EXTERIORare filtered out — the adjacency matrices only contain real unit-to-unit edges. Whether a unit borders the exterior is tracked separately via theis_exteriorflag (see §3.3). - Duplicate
(row, col)triples (from multi-edge boundaries) have their weights summed, so each CSR entry represents the total shared boundary length. - Collect, sort, merge, and build CSR with aligned weight array.
Queen adjacency (shared point — corner touch counts):
- Walk every DCEL vertex
v. For each pair of distinct units(a, b)that appear in the vertex star ofv(viavertex_star()), emit(a, b)and(b, a). - Collect, sort, deduplicate, and build CSR (no weights).
- Because every shared edge is also a shared point, Rook ⊆ Queen holds automatically.
The two matrices are stored as separate fields:
adjacent: AdjacencyMatrix // Rook (with per-edge weights)
touching: AdjacencyMatrix // Queen (no weights)
AdjacencyMatrix is a CSR structure: offsets: Vec<u32> (length
num_units + 1), neighbors: Vec<UnitId> (sorted within each row for binary
search), and an optional weights: Option<Vec<f64>> aligned to neighbors.
Row u gives the sorted list of units adjacent to u under the chosen mode.
The Rook matrix always has weights; the Queen matrix does not.
3.3 Per-unit cache
During construction the following scalar quantities are computed once and
stored in Vecs indexed by UnitId:
area: Vec<f64>— per-edge weighted shoelace formula, in m². Each shoelace term is scaled bycos(φ_mid)whereφ_midis the midpoint latitude of that edge in radians (see §8, question 4). Holes are subtracted.perimeter: Vec<f64>— sum of edge lengths in metres, where each edge length is√(Δlat² + (Δlon·cos(φ_mid))²) × 111_320. Holes included.exterior_boundary_length: Vec<f64>— sum ofedge_lengthvalues for half-edges whose twin belongs toUnitId::EXTERIOR. Zero for interior units. In metres.centroid: Vec<Coord<f64>>— approximate centroid of each unit in lon/lat, computed as the unweighted average of all half-edge origin vertices belonging to the unit. This is a vertex-average, not the true area-weighted centroid; it is fast to compute and sufficient for angular ordering and display.bounds: Vec<Rect<f64>>— axis-aligned bounding box of each unit in lon/lat.is_exterior: Vec<bool>— true if the unit has any half-edge whose twin belongs toUnitId::EXTERIOR(i.e. the unit touches the region boundary).bounds_all: Rect<f64>— axis-aligned bounding box of the entire region (union of all per-unit bounding boxes).
An R-tree spatial index (rstar::RTree) over per-unit bounding boxes is also
built during construction for fast point-location and envelope queries (see
§5.14).
Shared-edge lengths are stored on the half-edge pairs:
edge_length: Vec<f64> indexed by HalfEdgeId / 2 (one entry per undirected
edge), in metres using the same per-edge cos(φ_mid) correction.
3.4 Edge matching
During construction, shared polygon edges are detected using a hash map keyed
on directed vertex-pair (VertexId, VertexId). The key is packed as a single
u64 (origin << 32 | dest) and stored in an AHashMap<u64, HalfEdgeId> for
non-cryptographic O(1) lookups. After vertex snapping (§4) brings
near-coincident vertices to exact canonical positions, two directed edges from
different units that share the same (origin, dest) vertex pair — in forward or
reverse direction — are identified as the same undirected boundary segment and
twinned in the DCEL. Edges with no matching reverse edge are twinned with a
half-edge on the outer face. The map is discarded after the DCEL is built.
4. Construction Pipeline
Input: a Vec<MultiPolygon<f64>> (one per unit, in the order that determines
UnitId assignment) and an optional snap tolerance Option<f64>.
1. Ring extraction and winding normalisation
- Extract all polygon rings (outer rings + holes) from each input unit.
- Normalise winding order so that outer rings are CCW and hole rings are CW
(GeoJSON / geo-crate convention).
- ESRI Shapefile and TIGER/Line data uses the opposite convention (CW outer
rings); normalising here ensures the DCEL rotation rule produces correct
face cycles regardless of input source.
2. Vertex snapping (optional)
- If snap_tol is Some(tol), snap nearby vertices (within tol) to a canonical
position using the snap module. Only vertices connected by a polygon edge
in at least one input unit are candidates for snapping.
- If snap_tol is None, snapping is skipped entirely. Pass None for
topologically clean data such as TIGER/Line GeoParquet where shared-boundary
vertices are already exactly equal.
3. DCEL construction
- Allocate one DCEL vertex per unique (snapped) coordinate.
- For each polygon ring of each unit, insert its edges as half-edge pairs,
recording directed-edge → half-edge mappings in an AHashMap<u64, HalfEdgeId>
keyed on packed (origin, dest) pairs.
- Shared edges are identified and twinned using this map; exterior edges
(no matching reverse) are twinned with a half-edge on the outer face.
- next/prev links are set by sorting outgoing half-edges at each vertex in
CCW angular order (a second AHashMap<u64, Vec<HalfEdgeId>> groups them by
vertex for sorting).
- face.half_edge is set to an outer-ring half-edge for each face, so that
face-cycle traversal yields the outer boundary rather than a hole cycle.
4. Gap detection
- Walk all cycles reachable from OUTER_FACE. The cycle with the most
negative signed area (largest CW polygon) is the true outer boundary; all
others are interior gap faces assigned to UnitId::EXTERIOR.
5. Cache pre-computation
- Compute area, perimeter, edge_length, centroid, bounds, bounds_all,
exterior_boundary_length, and is_exterior for all units and edges.
- Build Rook and Queen adjacency matrices (§3.2).
- Build the R-tree spatial index.
6. Validation (optional, cfg(debug_assertions))
- All half-edges have valid twins (twin of twin is self).
- All half-edge next/prev links are consistent.
- Every face has at least one half-edge.
- Unit areas are non-negative.
Also available as Region::validate() for explicit checks after
deserialisation.
Snapping: the snap module is separate from DCEL construction and has a
single entry point (see §8, note 1). The tolerance is passed as
Option<f64> — None skips snapping entirely for data whose shared vertices
are already bitwise-identical. Typical values: 1e-7 degrees (~1 cm) for
TIGER/Line; 1e-4 degrees (~10 m) for PMTiles-quantised data.
5. Algorithm Sketches
5.1 are_adjacent(a, b)
Binary search in a's row of the Rook CSR matrix for b.
O(log deg(a)), or O(1) with a hash set per row (trade memory for speed).
5.2 neighbors(unit)
Read unit's row from the Rook CSR matrix — a sorted slice of UnitIds. O(deg).
5.3 area(units) and perimeter(units)
All values are in metres / metres² via the per-edge cos(φ_mid) correction
baked into the cache at construction time (see §3.3).
area: units.iter().map(|u| cached_area[u]).sum() O(k)
perimeter: Build a set S of units.
For each unit u in S:
For each DCEL face f of u:
For each half-edge h of f:
if face_to_unit[h.twin.face] ∉ S:
perimeter += edge_length[h] O(k · avg_degree)
The perimeter walk naturally excludes shared internal edges; edge_length
values are already in metres so no further scaling is needed at query time.
5.4 boundary(units)
Same walk as perimeter but collect the half-edges whose twins are outside S.
Group them into cycles by following next links within the boundary.
Return as MultiLineString. O(boundary length in edges).
5.5 shared_boundary_length(a, b)
Walk half-edges of unit a; sum edge_length for edges whose twin belongs to
b. O(edges of a).
5.6 is_contiguous(units)
BFS on the Rook adjacency graph restricted to units (treat as a
subgraph). O(k + edges within subgraph).
5.7 connected_components(units)
Same as is_contiguous but collect all components via repeated BFS from
unvisited seeds. O(k + internal edges).
5.8 has_holes(units) / enclaves(units)
1. Let complement = all units NOT in the query set.
2. Find connected components of complement using Rook adjacency.
3. A component is a hole iff it has no unit adjacent to the outer face
(i.e., no unit on the region's exterior boundary).
Exterior units are flagged at construction time: a unit is exterior if any of its DCEL faces has a half-edge whose twin is in the outer face.
O(n) where n = total number of units.
5.9 compactness(units)
4π · area(units) / perimeter(units)²
O(k · avg_deg) — dominated by perimeter_of which does a boundary walk.
5.10 centroid(unit) / bounds(unit) / is_exterior(unit) / exterior_boundary_length(unit) / bounds_all()
O(1) — all pre-cached at construction time (see §3.3).
5.11 bounds_of(units)
Expand a Rect over the cached per-unit bounding boxes. O(k).
5.12 exterior_boundary_length_of(units)
Sum exterior_boundary_length over units in the subset. O(k).
5.13 union_of(units)
Walk the DCEL boundary of the subset (same as boundary_of), then classify
each cycle by signed area: positive (CCW) = outer ring, negative (CW) = hole.
Match holes to their enclosing outer ring using a point-in-ring test. Returns
a MultiPolygon<f64> representing the merged shape. O(boundary edges).
No external polygon boolean operations are needed — the DCEL already encodes the planar subdivision, so the boundary walk extracts the merged shape directly.
5.14 unit_at(point) / units_in_envelope(envelope)
An R-tree over per-unit bounding boxes provides fast spatial queries:
unit_at(point): query the R-tree for candidate units whose bounding box contains the point (O(log n)), then perform exact point-in-polygon tests on candidates. Returns the first match orNone.units_in_envelope(envelope): query the R-tree for all units whose bounding box intersects the envelope (O(log n + k)). This is a coarse filter — returned units may not actually intersect geometrically.
5.15 convex_hull(unit) / convex_hull_of(units)
convex_hull(unit): delegates togeo::ConvexHullon the unit's storedMultiPolygon. O(v log v) where v is the number of vertices.convex_hull_of(units): collects all polygons from the input units and computes the convex hull of the combined geometry. O(V log V) where V is the total vertex count across all input units. This is a naive implementation — a future version should merge per-unit hulls incrementally (O(k · h) where h is the output hull size).
6. Public API Summary
// Construction
new // not yet implemented
from_shapefile // not yet implemented
// Unit access
region.unit_ids .num_units .geometry // Adjacency — Rook (shared edge) is the default for all graph queries.
// Raw matrix access exposes both modes for callers that need Queen.
region.are_adjacent // Rook (shared edge)
region.neighbors // Rook, sorted
region.adjacency // Rook CSR
region.touching // Queen CSR
// Single-unit geometry (O(1), pre-cached unless noted)
region.area .perimeter .exterior_boundary_length .centroid .bounds .is_exterior .boundary .convex_hull // O(v log v)
// Region-wide geometry (O(1), pre-cached)
region.bounds_all // Subset geometry (O(k) unless noted)
region.area_of .perimeter_of .exterior_boundary_length_of .bounds_of .boundary_of .compactness_of .union_of // O(boundary edges)
region.convex_hull_of // O(V log V)
// Spatial queries
region.unit_at // O(log n)
region.units_in_envelope // O(log n + k)
// Validation
region.validate // Edge metrics
region.shared_boundary_length .boundary_length_with .edge_weight_at // Rook CSR index → shared length in m
// Topology — Rook adjacency throughout
region.is_contiguous .connected_components .has_holes .enclaves
Supporting types:
/// A read-only CSR adjacency matrix, optionally with per-edge weights.
The Rook adjacency matrix always has weights; the Queen matrix does not.
offset(unit) provides the base index into the flat neighbor/weight arrays,
so a caller can zip neighbors(unit) with weights_of(unit) or use raw
flat indices (offset(unit) + i) to address individual directed edges.
units parameters accept &[UnitId] or any IntoIterator<Item = UnitId>.
7. Performance Summary
| Operation | Complexity | Notes |
|---|---|---|
are_adjacent |
O(log deg) | Rook; O(1) with hash adjacency |
neighbors |
O(deg) | Rook; slice into CSR row |
area(unit) |
O(1) | cached |
perimeter(unit) |
O(1) | cached |
exterior_boundary_length(unit) |
O(1) | cached |
centroid(unit) |
O(1) | cached |
bounds(unit) |
O(1) | cached |
is_exterior(unit) |
O(1) | cached |
bounds_all() |
O(1) | cached |
area_of(units) |
O(k) | sum of cached values |
perimeter_of(units) |
O(k · avg_deg) | boundary walk |
exterior_boundary_length_of(units) |
O(k) | sum of cached values |
bounds_of(units) |
O(k) | expand cached bounding boxes |
boundary_of(units) |
O(boundary edges) | half-edge walk |
union_of(units) |
O(boundary edges) | DCEL boundary walk |
compactness_of(units) |
O(k · avg_deg) | via perimeter_of |
convex_hull(unit) |
O(v log v) | v = vertices of unit |
convex_hull_of(units) |
O(V log V) | V = total vertices; naive impl |
unit_at(point) |
O(log n) | R-tree + point-in-polygon |
units_in_envelope(envelope) |
O(log n + k) | R-tree bounding-box filter |
shared_boundary_length |
O(edges of a) | half-edge walk |
is_contiguous |
O(k) | BFS on subgraph |
connected_components |
O(k) | BFS on subgraph |
has_holes / enclaves |
O(n) | complement BFS |
| Construction | O(n log n) | dominated by CCW sort |
| Adjacency matrix build | O(n · avg_deg) | one half-edge pass |
k = size of query subset, n = total number of units.
8. Notes & Open Questions
-
Snapping strategy. (Resolved) Use the conservative approach: only snap pairs of vertices that are already connected by a polygon edge in at least one input unit (i.e. snap along shared boundaries only, never across open space). This avoids merging legitimately distinct vertices in dense areas.
The snapping and vertex-repair logic lives in a dedicated
snapmodule (separate from DCEL construction) with a single entry-point function:// snap.rs (pub(crate) — internal to the geograph crate) pubSnapping is optional:
Region::newacceptssnap_tol: Option<f64>, and passingNoneskips the snap pass entirely. For TIGER/Line GeoParquet data the shared-boundary vertices are bitwise-identical, so snapping is a no-op; skipping it saves the O(E) overhead.Keeping the snap module isolated makes it straightforward to swap the algorithm (e.g. switch tolerance heuristic or add a topology-aware repair pass) without touching the DCEL builder.
-
Multipolygon hole ownership. (Resolved) The exterior of the region is treated as a first-class unit (
UnitId::EXTERIOR). It owns the unbounded DCEL face and any interior gaps (holes in the region with no assigned unit). Since a regular unit can already own multiple faces (exclaves), no new machinery is needed —UnitId::EXTERIORsimply participates inface_to_unitlike any other unit. It is excluded from district assignment and does not appear in the adjacency matrices (filtered out during construction). Whether a unit borders the exterior is tracked separately via theis_exteriorflag. -
Queen adjacency at T-junctions. (Resolved) When building the touching (Queen) matrix, emit all pairs
(a, b)for every distinct combination of units appearing in each vertex star — not just consecutive pairs. This correctly captures diagonal corner touches (e.g. two units that meet at a single point but share no edge) without any special-casing.
3a. Winding order normalisation. (Resolved) The DCEL rotation rule
(sort outgoing half-edges CCW at each vertex, then set twin(h_i).next = h_{(i−1) mod k}) assumes CCW outer rings. ESRI Shapefiles and TIGER/Line
data use the opposite convention: CW exterior rings, CCW holes. The
construction pipeline normalises winding before DCEL construction using the
shoelace signed-area formula: outer rings with negative area are reversed to
CCW; hole rings with positive area are reversed to CW. Without this step,
the rotation rule produces face cycles that trace the outer boundary of the
region instead of the interior face, causing every unit to appear as a shape
that spans the whole region.
-
Coordinate system. (Resolved) Input coordinates are unprojected lon/lat (degrees). Area and length results are returned in m² and m respectively via a per-edge
cos(φ_mid)correction applied at construction time:- Edge length:
√(Δlat² + (Δlon·cos(φ_mid))²) × 111_320 m/° - Area (shoelace term):
(lon_i·lat_{i+1} − lon_{i+1}·lat_i) · cos(φ_mid) × 111_320² m²/°²
where
φ_mid = (lat_start + lat_end) / 2in radians for each edge. Evaluatingcosat the midpoint of every short polygon edge is equivalent to a first-order numerical integration of the spherical area elementR² cos(φ) dφ dλ, and keeps error well under 1% for all real-world inputs. - Edge length:
-
Parallelism. (Out of scope) The construction pipeline is single-threaded. Parallelism would add complexity that isn't warranted at this stage.
-
Serialisation. (Resolved) A custom binary format with no external dependencies. The file is a fixed header followed by flat, fixed-width sections — trivial to read or write in any language with a seek and a bulk copy.
Compaction techniques:
- Coordinates stored as
i32(scaled by 10^7, i.e. 1e-7° precision) rather thanf64— lossless at the snap tolerance, 50% smaller. twinfield omitted from half-edge records; derived asid ^ 1since half-edges are always stored in consecutive pairs.- No external compression (avoids WASM-incompatible native libraries).
File layout:
[Header] magic: 4 bytes ("OMRP") version: 1 byte reserved: 3 bytes num_vertices: u32 num_half_edges: u32 (always even) num_faces: u32 num_units: u32 [Vertices] num_vertices × (i32 lon, i32 lat) 8 B each [HalfEdges] num_half_edges × (u32 origin, u32 next, u32 prev, u32 face) 16 B each [Faces] num_faces × u32 half_edge 4 B each (0xFFFFFFFF = no boundary / outer face) [FaceToUnit] num_faces × u32 4 B each (0xFFFFFFFF = UnitId::EXTERIOR) [UnitCache] num_units × (f64 area_m2, f64 perimeter_m) 16 B each [EdgeLengths] (num_half_edges/2) × f64 8 B each [RookAdj offsets] (num_units+1) × u32 4 B each [RookAdj neighbors] num_rook_edges × u32 4 B each [TouchingAdj offsets] (num_units+1) × u32 4 B each [TouchingAdj neighbors] num_touching_edges × u32 4 B eachPersisted vs. derived fields: Only
areaandperimeterare stored in the UnitCache section. The remaining cache fields (exterior_boundary_length,centroid,bounds,bounds_all,is_exterior) and thegeometriesvector are re-derived from the DCEL on deserialisation.Known limitation: Geometry reconstruction on deserialisation creates one
Polygonper DCEL face. Units with holes will have the hole boundary as a separate outer ring rather than as an interior ring of the enclosing polygon. This only affectsregion.geometry(unit)after a round-trip; all other queries (area, perimeter, boundary, adjacency, etc.) use the DCEL directly and are unaffected.Potential future improvement: store vertex coordinates as deltas relative to a per-section reference origin to reduce magnitude and improve compressibility if compression is added later.
- Coordinates stored as
-
Convex hull queries. (Resolved)
convex_hull(unit)delegates togeo::ConvexHullon the storedMultiPolygon.convex_hull_of(units)collects all polygons and computes the hull of the combined geometry. The subset version is O(V log V) where V is the total vertex count — a future improvement should merge per-unit hulls incrementally.