pg_replica 0.2.0

Consensus-driven failover for PostgreSQL (Raft control plane)
# pg_replica

A PostgreSQL extension that gives a small cluster of **vanilla Postgres** nodes
**automatic, consensus-driven failover** — full-cluster replication (tables,
roles, DDL, *everything*) with a **built-in Raft group** and **no external
dependencies**: no etcd, no Consul, no Kubernetes.

Replaces Patroni, CloudNativePG, pgActive and many more.

One job: keep a single leader elected and the data byte-identical across N nodes,
and fail over automatically when the leader dies. Do that one job well.

---

## The one idea

**Don't reinvent replication.** Postgres already ships **physical (WAL) streaming
replication**, which copies *everything* byte-for-byte — heap, indexes, and the
**shared catalog `pg_authid`** (i.e. roles/users and their SCRAM verifiers).
That is exactly why physical replicas have the same roles as the primary, while
logical replication / active-active (pgactive) do not.

`pg_replica` adds the *only* piece Postgres itself lacks: **automatic leader
election and failover**, using an **embedded Raft quorum** instead of an external
DCS.

| Plane | Who does it |
|-------|-------------|
| **Data** (tables, indexes, roles, DDL) | Postgres streaming replication — untouched, battle-tested |
| **Leadership & failover** | Embedded Raft, running inside a Postgres background worker |
| **Process lifecycle** (start/restart Postgres) | Your existing supervisor — systemd or Docker restart policy |

Result: roles, DDL, and data stay consistent on every node, and a dead primary
is replaced in seconds — with no human, no etcd, no Kubernetes.

---

## Performance (highly subjective numbers for 0.1.0)

pgbench, 8 clients / 4 jobs, single-row INSERT on the primary, 10 s, async mode

Memory

- 8 MB RSS
- 2 MB PSS
- 0.8 MB private

CPU

- 0.38% idle

Throughput

- 2494 tps
- 3.2 ms avg latency
- 17.6k rows

Failover latency

- 5 s to writable new primary

## Test

```bash
# Build dependencies

sudo apt-get update
sudo apt-get install -y \
  build-essential pkg-config libssl-dev \
  libreadline-dev zlib1g-dev flex bison \
  libxml2-dev libxslt-dev libxml2-utils xsltproc \
  ccache libclang-dev clang \
  iptables libfaketime
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
rustup default stable
rustc --version

# test dependencies
cargo install --locked cargo-pgrx --version 0.18.1
cargo pgrx init
cd packages/pg_replica
cargo pgrx run pg18

CREATE EXTENSION pg_replica;
```

### Run the whole test suite

One command builds the extension + test clients and runs every test (3-node
clusters in `/tmp`, torn down between each):

```bash
bash scripts/run-all-tests.sh            # build, then run all tests + summary
bash scripts/run-all-tests.sh --no-build # skip the rebuild
```

Coverage (`scripts/test-*.sh`, each spins a real 3-node cluster):

| Test | Proves |
|------|--------|
| `test-model` | **formal model check** (stateright, exhaustive, N=3 and N=5): **(1) durability** — the sync-quorum math (`ack = majority`) loses **no acked transaction** across every reachable crash/failover interleaving; **(2) split-brain** — a **resurrected stale-term primary** can never commit a conflicting write (a commit needs a majority still at the leader's term). Both have negative controls that produce counterexamples |
| `test-m3-lsn` | failover promotes the **highest-LSN** survivor (no data loss) |
| `test-m4-fence` | minority primary **self-demotes** read-only (no split-brain) |
| `test-m4-watchdog` | a **hung** control plane is fenced by the deadman watchdog |
| `test-m4-partition` | a **network-partitioned** (but running) primary self-demotes |
| `test-m5-rejoin` | deposed primary `pg_rewind`-rejoins as a standby |
| `test-m5-walgone` | WAL gone → automatic **`pg_basebackup` re-clone** |
| `test-compaction` | Raft log stays **bounded** via snapshotting |
| `test-m6-routing` | a multi-host client **follows the failover** with only a reconnect |
| `test-m7-sync` | quorum-sync = **zero committed-transaction loss** on failover |
| `test-quorum-consistency` | the Postgres **sync quorum** (`synchronous_standby_names`) and the **Raft consensus quorum** name the same nodes, and a write confirmed by one standby is **promoted onto that standby** — no two-quorum drift on failover |
| `test-perf` | supervisor **memory** (single-digit MB private overhead), **idle CPU**, write **throughput** (pgbench), and **failover latency** |
| `test-chaos` | Jepsen-style: continuous writer under partitions / SIGSTOP / kill / clock-skew / slow-disk / rolling-restart — **0 split-brain**, converges, zero-loss for clean failovers |

---

## Goals / non-goals

**Goals**
- Automatic failover on a 3- or 5-node cluster (Raft majority quorum).
- Full-cluster fidelity: roles, GRANTs, DDL, extensions, data — all replicated
  (free, via physical replication).
- Zero external coordination services (Raft is embedded).
- Lightweight: one `.so` extension + Postgres; no JVM, no Go control plane, no k8s.
- Safe by default: quorum-gated, fences the old primary, picks the most-advanced
  replica, rejoins the loser with `pg_rewind`.
- Operable from SQL: `SELECT pg_replica.status();`, `pg_replica.failover()`, etc.

**Non-goals**
- Sharding / horizontal write scale-out → that is Citus, a different axis.
- Connection pooling / proxy → recommend HAProxy or libpq multi-host
  (`target_session_attrs=read-write`). pg_replica only *publishes* who the primary is.
- Backups / PITR → recommend pgBackRest or wal-g.
- Logical / multi-master replication → out of scope by design.

---

## Positioning (honest prior art)

| Tool | Consensus | External deps | Form factor |
|------|-----------|---------------|-------------|
| **CloudNativePG** | k8s control plane | **Kubernetes required** | operator |
| **Patroni** | etcd/Consul/k8s — *or* `pysyncobj` Raft mode | DCS (or its raft lib) | Python agent |
| **pg_auto_failover** | single **monitor** node (not quorum) | a monitor Postgres | C ext + agent |
| **Stolon** | external store (etcd/consul) | DCS | Go agents |
| **repmgr** | none (manual/assisted) || C ext + CLI |
| **pg_replica** (this) | **embedded Raft quorum** | **none** | **Postgres extension** |

The niche: embedded quorum (unlike pg_auto_failover's single monitor), **no
external DCS** (unlike Patroni-etcd / Stolon), **no Kubernetes** (unlike CNPG),
shipped as a plain Postgres extension. Closest existing thing is Patroni's
`raft` DCS mode — pg_replica aims to be that idea, but native, in-process, and Rust-light.