# Architecture
## 1. Split of responsibilities
```
┌─────────────────────────── node A (LEADER / primary) ──────────────────────────┐
│ Postgres │
│ ├─ pg_replica extension (.so) │
│ │ ├─ background worker: Raft node ◄────raft RPC────┐ │
│ │ │ • election, membership, cluster state │ │
│ │ │ • health checks, failover state machine │ │
│ │ │ • executes promote / rewind / reconfigure │ │
│ │ └─ SQL API: pg_replica.status() / .members() / .failover() / .add_node() │
│ └─ WAL ──────────────physical streaming replication──────────────┐ │
└──────────────────────────────────────────────────────────────────── │ ──────────┘
▲ raft RPC │ WAL
│ ▼
┌──────────── node B (follower / standby) ──┐ ┌──────── node C (follower / standby) ──┐
│ Postgres + pg_replica bgworker (Raft) │ │ Postgres + pg_replica bgworker (Raft) │
│ recovery mode, primary_conninfo → A │ │ recovery mode, primary_conninfo → A │
└───────────────────────────────────────────┘ └───────────────────────────────────────┘
supervisor (systemd / docker restart=always) keeps each Postgres process alive; pg_replica decides its ROLE.
clients reach the primary via libpq multi-host (target_session_attrs=read-write) or an HAProxy that reads pg_replica state.
```
**Two planes, deliberately separate:**
- **Control plane (Raft, tiny):** elects the leader, tracks membership and the
authoritative "current primary + replication topology," and drives failover.
The Raft log only ever carries small control entries (membership changes,
leader term, failover decisions, fencing tokens). Megabytes, not gigabytes.
- **Data plane (Postgres WAL, large):** the actual rows/indexes/roles/DDL move
over Postgres's own physical streaming replication. We configure it; we do not
reimplement it. This is what makes "replicate *everything* including roles and
DDL" true for free.
See [DECISIONS.md D1](DECISIONS.md#d1-raft-replicates-control-state-not-data) for
why Raft must **not** carry the data itself.
## 2. Components
### 2.1 `pg_replica` extension (Rust, via `pgrx`)
Built with [`pgrx`](https://github.com/pgcentralfoundation/pgrx) so it stays in
the same toolchain as ParadeDB-style extensions. Two parts:
1. **Background worker `pg_replica supervisor`** — registered via
`RegisterBackgroundWorker` at `shared_preload_libraries`. Hosts:
- the **Raft node** ([`openraft`](https://github.com/databendlabs/openraft), an
async Raft) on a small embedded tokio runtime, with a peer transport
(TCP, length-prefixed JSON RPC) that also multiplexes LSN gossip;
- a durable **Raft log + snapshot store** in `$PGDATA/pg_replica/` (separate
from Postgres WAL);
- the **health monitor** (heartbeats + libpq `SELECT 1` probes of peers);
- the **failover state machine** (detect → elect → choose → fence → promote →
reconfigure → rejoin);
- executors that call `pg_promote()`, edit `postgresql.auto.conf`
(`primary_conninfo`), manage replication slots, and shell out to
`pg_basebackup` / `pg_rewind`.
2. **SQL surface** (functions + views):
- `pg_replica.status()` → this node's role, term, leader, LSNs, lag.
- `pg_replica.members()` → cluster membership + health.
- `pg_replica.failover([target])` → operator-initiated switchover.
- `pg_replica.add_node(dsn)` / `pg_replica.remove_node(id)` → membership.
- GUCs: `pg_replica.peers`, `pg_replica.raft_port`, `pg_replica.node_id`,
`pg_replica.synchronous`, `pg_replica.failover_timeout_ms`,
`pg_replica.bootstrap`.
### 2.2 The supervisor you already have
`pg_replica` does **not** start/stop the Postgres *process* — a chicken/egg an
in-Postgres worker can't solve. That job stays with **systemd** or the **Docker
restart policy** (`restart: always`). pg_replica decides each running node's
*role*; the OS supervisor guarantees the process keeps trying to run. This is the
honest reason it can be "just an extension" and still do failover (see
[DECISIONS.md D3](DECISIONS.md#d3-extension-bgworker--not-a-standalone-daemon)).
### 2.3 Client routing (not built, recommended)
pg_replica only **publishes** who the primary is. Pick one:
- **libpq multi-host:** `host=A,B,C ... target_session_attrs=read-write` — driver finds the writable node. Zero infra.
- **HAProxy** with an httpchk against `pg_replica`'s tiny health endpoint (`/primary` → 200 only on the leader).
- **Floating/virtual IP** moved by the leader on promotion.
## 3. Failover flow (primary dies)
1. **Detect.** Followers' health monitors miss `pg_replica.failover_timeout_ms`
of heartbeats from the leader. Raft leadership lease expires.
2. **Elect (control plane).** Surviving nodes hold a Raft election; only the
**majority partition** can elect — a minority (incl. an isolated old primary)
cannot, which is what prevents split-brain.
3. **Self-fence the loser.** A primary that loses quorum **demotes itself**
(drops to read-only / shuts down) via a loss-of-quorum watchdog, *before* a new
one is promoted. A monotonic **fencing token** (Raft term) guards against a
paused old primary resuming.
4. **Choose the most-advanced replica.** The new Raft leader queries each
survivor's `pg_last_wal_replay_lsn()` / `pg_last_wal_receive_lsn()` and picks
the highest, minimizing data loss. Decision is committed to the Raft log.
5. **Promote.** `SELECT pg_promote()` on the chosen standby.
6. **Reconfigure the rest.** Remaining standbys get `primary_conninfo` repointed
at the new primary; slots recreated; reload.
7. **Rejoin the loser.** When the old primary returns, `pg_rewind` it against the
new primary to discard diverged WAL, then start it as a standby.
8. **Publish.** New primary recorded in Raft state; routing layer (HAProxy/VIP) follows.
## 4. Data-loss posture (sync vs async)
Configurable via `pg_replica.synchronous`:
- **async (default):** lowest latency; failover may lose the last unreplicated
commits (bounded by replica lag).
- **quorum-sync:** sets `synchronous_standby_names = 'ANY 1 (nodeB,nodeC)'` so a
commit is acked by ≥1 standby → failover loses **nothing**. Cost: write latency,
and writes stall if too few sync standbys are up. pg_replica keeps
`synchronous_standby_names` in step with live membership so a single failure
never wedges writes.
## 5. The hard problems (and our stance)
| **Split-brain** | Raft majority quorum + self-fence on quorum loss + fencing token (term). Never two writable primaries. |
| **Promoting a stale replica** | Always pick highest LSN among survivors; `pg_rewind` the others. |
| **Old primary resurrecting** | Loss-of-quorum watchdog demotes it; term-based fencing token rejects its stale writes/role. |
| **Quorum math** | Odd cluster sizes (3 → tolerate 1, 5 → tolerate 2). 2-node HA is impossible safely; document it. |
| **Raft node dies with Postgres** | Fine: a down node doesn't need to vote; survivors keep quorum. OS supervisor restarts Postgres → bgworker rejoins. |
| **Hung (not dead) Postgres** | Watchdog timer; if the bgworker can't make progress, it stops accepting the leader role. |
| **Bootstrap / new replica** | `pg_basebackup` from the leader, register in Raft, stream. |
| **Network partition flapping** | Election timeouts + leader leases + openraft's pre-vote to avoid term churn. |
## 6. On-disk / network footprint
- Raft state: `$PGDATA/pg_replica/{raft-log, snapshot, meta}` — small.
- Peer transport: one TCP port per node (`pg_replica.raft_port`), mTLS optional.
- No external store, no JVM, no Go control plane. Memory target: a few MB of
resident bgworker overhead beyond Postgres itself.