pg_replica 0.3.0

Consensus-driven failover for PostgreSQL (Raft control plane)
# Design decisions (ADR-style)

Each entry: the decision, the rationale, and the alternative we rejected.

---

## D1 — Raft replicates control state, NOT data

**Decision.** The Raft log carries only small cluster-control entries (membership,
leader term, failover decisions, fencing tokens). The actual database content is
replicated by **Postgres physical (WAL) streaming replication**.

**Why.** "Use Raft to replicate the full database" means putting every write
through a Raft log and rebuilding storage on top of it — that is what CockroachDB
and TiKV do, and it is a *multi-year* engineering effort that throws away
Postgres's WAL, MVCC, and on-disk format. It would also make us *not Postgres*.
Physical WAL replication already copies **everything we want** — heap, indexes,
and the shared catalog `pg_authid` (roles/users + SCRAM verifiers) — correctly and
fast. So Raft's job is consensus on *who leads*, not moving bytes.

**Rejected.** Data-over-Raft (reimplementing the storage engine). Too big, and it
discards the entire reason to stay on Postgres.

**Consequence.** "Replicates roles and DDL" is satisfied by physical replication,
not by us. We must never claim otherwise.
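
To make the scope of the Raft log concrete, here is a minimal sketch of what the control entries from the decision might look like. Every type, variant, and field name below (`ControlEntry`, `ControlState`, `promote_target`, and so on) is a hypothetical illustration, not the extension's actual schema. The point is only that no variant carries table data.

```rust
/// Hypothetical control-plane log entry. The four variants mirror the kinds
/// of state named in the decision; every name here is illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ControlEntry {
    /// Replace the voter set.
    Membership { voters: Vec<u64> },
    /// Record which node leads in a given term.
    LeaderTerm { term: u64, leader: u64 },
    /// Record a failover decision: which standby to promote.
    Failover { promote: u64 },
    /// Issue a new fencing token.
    FencingToken { token: u64 },
}

/// Control state derived by applying committed entries in log order.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ControlState {
    pub voters: Vec<u64>,
    pub term: u64,
    pub leader: Option<u64>,
    pub promote_target: Option<u64>,
    pub fencing_token: u64,
}

impl ControlState {
    /// Apply one committed entry. No variant carries table data, so the
    /// state stays small no matter how large the database is.
    pub fn apply(&mut self, entry: &ControlEntry) {
        match entry {
            ControlEntry::Membership { voters } => self.voters = voters.clone(),
            ControlEntry::LeaderTerm { term, leader } => {
                self.term = *term;
                self.leader = Some(*leader);
            }
            ControlEntry::Failover { promote } => self.promote_target = Some(*promote),
            ControlEntry::FencingToken { token } => self.fencing_token = *token,
        }
    }
}
```

Heap pages, indexes, and catalog rows never appear in this type; they travel over the WAL stream.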

---

## D2 — Physical replication, not logical

**Decision.** Use streaming **physical** replication as the data plane.

**Why.** Only physical replication copies **global objects** — roles live in the
cluster-wide `pg_authid`, which logical replication and pgactive explicitly do
**not** carry. Since "replicate roles + DDL + everything" is the whole point,
physical is the only fit. Replicas are read-only; that's acceptable (single-writer
HA, like a MongoDB replica set).

**Rejected.** Logical replication (no roles/DDL/globals) and active-active
(pgactive: no global objects, conflict hell). Both fail the core requirement.

---

## D3 — Extension + bgworker, not a standalone daemon

**Decision.** Ship as a Postgres **extension** whose **background worker** hosts
the Raft node and orchestration. Rely on the existing OS supervisor
(systemd / Docker `restart: always`) for the Postgres *process* lifecycle.

**Why.** The user wants "a Postgres plugin," and most of the work *can* live in a
bgworker. A node whose Postgres is down doesn't need to vote (survivors hold
quorum). Standbys being promoted are *up*, so their bgworker can call
`pg_promote()` locally. A deposed primary that's still up runs its own bgworker
and self-demotes on quorum loss. The one thing a bgworker genuinely cannot do is
**start a Postgres that is down** (a chicken-and-egg problem), so we delegate
*only that* to systemd/Docker, which every deployment already has.

**Rejected.** A separate Go/Rust daemon à la Patroni/Stolon. It would work, but
it's heavier and contradicts the "plugin" goal. We accept one honest limitation
(process lifecycle is the supervisor's job) to keep the plugin form factor.

**Risk.** A *hung but not dead* Postgres can wedge its bgworker. Mitigation: a
watchdog timer that makes a stuck node refuse/relinquish leadership.
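
One way to picture the watchdog mitigation is a deadline that the worker's main loop must keep renewing. The sketch below is an assumption about how it could be structured, not the extension's implementation: the `Watchdog` type and the `pet` and `may_lead` methods are hypothetical names. The check also has to run somewhere that is not itself stuck (a separate thread, for example), which this sketch leaves open.

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog: the worker's main loop calls `pet()` after each
/// healthy iteration, and leadership is only valid while the last pet is
/// more recent than `timeout`.
pub struct Watchdog {
    timeout: Duration,
    last_pet: Instant,
}

impl Watchdog {
    /// Start the watchdog as if it had just been petted at `now`.
    pub fn new(timeout: Duration, now: Instant) -> Self {
        Self { timeout, last_pet: now }
    }

    /// Record a healthy loop iteration observed at `now`.
    pub fn pet(&mut self, now: Instant) {
        self.last_pet = now;
    }

    /// True while the node may hold or seek leadership. Once the main loop
    /// has been silent for `timeout` or longer, the node should refuse or
    /// relinquish leadership.
    pub fn may_lead(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_pet) < self.timeout
    }
}
```

Passing `now` in explicitly keeps the logic testable without sleeping; a real worker would pass `Instant::now()`.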

---

## D4 — Embedded Raft, no external DCS

**Decision.** Embed Raft (`openraft`) inside the extension. No external distributed
configuration store (DCS): no etcd, Consul, or Kubernetes API.

**Why.** The stated goal is fewer moving parts than CloudNativePG/Patroni-etcd.
An embedded quorum removes an entire external system to deploy, secure, and
operate. Patroni's `raft` (pysyncobj) mode shows that an embedded quorum can work;
we do it natively, with a lighter footprint.

**Rejected.** External DCS (operational weight) and single-monitor designs like
pg_auto_failover (the monitor is itself a SPOF and not a quorum).

---

## D5 — Rust + pgrx + openraft

**Decision.** Implement in Rust: the extension via **pgrx**, consensus via
**openraft** (async, event-driven Raft) hosted on a small embedded tokio runtime
inside the background worker.

**Why.** "Light on resources" rules out the JVM and argues against a Go control
plane; Rust gives a small, self-contained `.so` with no GC pauses in the failover path.
pgrx keeps us in the same toolchain as the ParadeDB-style stack already in use.
openraft leaves storage and transport to us (a single versioned `Decision` value
plus a tiny TCP/JSON RPC — both trivial here).
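
As a rough illustration of why the storage side can stay small, the sketch below models a single versioned value that only ever moves forward. Only the type name `Decision` comes from this document; the fields `version` and `primary`, the `DecisionStore` wrapper, and the `install` method are assumptions made for the example, and the wiring into openraft's storage traits is deliberately left out.

```rust
/// Hypothetical shape of the single versioned value. The field names are
/// assumptions; only the type name `Decision` appears in the design.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Decision {
    /// Monotonically increasing version of the decision.
    pub version: u64,
    /// Node id of the primary chosen by this decision.
    pub primary: u64,
}

/// In-memory holder for the latest decision.
#[derive(Debug, Default)]
pub struct DecisionStore {
    current: Option<Decision>,
}

impl DecisionStore {
    /// Install `next` only if it is strictly newer than the stored value.
    /// Returns true when the stored value changed.
    pub fn install(&mut self, next: Decision) -> bool {
        let newer = match &self.current {
            Some(cur) => next.version > cur.version,
            None => true,
        };
        if newer {
            self.current = Some(next);
        }
        newer
    }

    /// The latest installed decision, if any.
    pub fn current(&self) -> Option<&Decision> {
        self.current.as_ref()
    }
}
```

Rejecting stale versions means a delayed or replayed message cannot roll the cluster back to an older primary.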

---

## D6 — Quorum-only, odd node counts

**Decision.** Support 3 and 5 nodes; refuse to pretend 2-node is safe.

**Why.** Raft needs a majority, and with 2 nodes the majority is 2: lose either
node (or the link between them) and neither side can proceed. Letting a lone node
continue anyway is exactly how split-brain happens. 3 nodes tolerate 1 failure; 5
tolerate 2. We document this loudly rather than offering a footgun.
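
The arithmetic behind these numbers is short enough to write out. The majority formula is standard; the function names and the `validate_cluster_size` startup check are hypothetical, shown only as one way the "3 and 5 only" rule could be enforced.

```rust
/// Smallest number of nodes that forms a majority of `n`.
pub fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Number of nodes that can fail while a majority still survives.
pub fn tolerated_failures(n: usize) -> usize {
    n.saturating_sub(quorum(n))
}

/// Hypothetical startup check that accepts only the supported sizes.
pub fn validate_cluster_size(n: usize) -> Result<(), String> {
    match n {
        3 | 5 => Ok(()),
        _ => Err(format!(
            "unsupported cluster size {}: quorum would be {} and only {} failure(s) tolerated",
            n,
            quorum(n),
            tolerated_failures(n)
        )),
    }
}
```

For 2 nodes the quorum is 2 and the tolerated failure count is 0. A 4-node cluster needs 3 votes and still tolerates only 1 failure, the same as 3 nodes, which is why even sizes buy nothing.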

---

## D7 — Safety over availability by default

**Decision.** Default to **never two writable primaries**, even at the cost of a
brief write outage during failover. Synchronous (zero-loss) mode is opt-in.

**Why.** A search index or a cache can be rebuilt; a system of record that
double-writes is corrupted. Fencing + quorum + most-advanced-replica selection
prioritize correctness. Operators who want zero data loss enable quorum-sync and
accept the added commit latency.
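
"Most-advanced-replica selection" can be sketched as picking the reachable standby with the highest WAL position. The example below is an illustration under assumptions: the `Standby` struct, the choice of replay LSN as the measure of "most advanced", and the lowest-node-id tie-break are all hypothetical. The only PostgreSQL fact it relies on is the textual `pg_lsn` format, two hexadecimal numbers separated by a slash.

```rust
use std::cmp::Reverse;

/// Parse PostgreSQL's textual `pg_lsn` form ("hi/lo", both hexadecimal)
/// into a 64-bit WAL position. Returns None on malformed input.
pub fn parse_lsn(text: &str) -> Option<u64> {
    let (hi, lo) = text.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Hypothetical view of a standby as seen by the control plane.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Standby {
    pub node_id: u64,
    /// WAL position the standby has replayed (an assumed choice of metric).
    pub replay_lsn: u64,
    pub reachable: bool,
}

/// Pick the reachable standby with the highest replay LSN. Ties go to the
/// lowest node id so every node computes the same answer.
pub fn pick_promotion_target(standbys: &[Standby]) -> Option<u64> {
    standbys
        .iter()
        .filter(|s| s.reachable)
        .max_by_key(|s| (s.replay_lsn, Reverse(s.node_id)))
        .map(|s| s.node_id)
}
```

A deterministic tie-break matters because the selection feeds a consensus decision: two nodes looking at the same inputs should never propose different targets.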