pg_replica 0.4.0

# HyperionDb

[![HyperionDb](https://hyperiondb.eu/header.jpg)

A PostgreSQL extension that gives a small cluster of **vanilla Postgres** nodes
**automatic, consensus-driven failover** — full-cluster replication (tables,
roles, DDL, *everything*) with a **built-in Raft group** and **no external
dependencies**: no etcd, no Consul, no Kubernetes.

Replaces Patroni, CloudNativePG, pgActive and many more.

One job: keep a single leader elected and the data byte-identical across N nodes,
and fail over automatically when the leader dies. Do that one job well.

Status: **in production**

---

## The one idea

**Don't reinvent replication.** Postgres already ships **physical (WAL) streaming
replication**, which copies *everything* byte-for-byte — heap, indexes, and the
**shared catalog `pg_authid`** (i.e. roles/users and their SCRAM verifiers).
That is exactly why physical replicas have the same roles as the primary, while
logical replication / active-active (pgactive) do not.

`pg_replica` adds the *only* piece Postgres itself lacks: **automatic leader
election and failover**, using an **embedded Raft quorum** instead of an external
DCS.

| Plane | Who does it |
|-------|-------------|
| **Data** (tables, indexes, roles, DDL) | Postgres streaming replication — untouched, battle-tested |
| **Leadership & failover** | Embedded Raft, running inside a Postgres background worker |
| **Process lifecycle** (start/restart Postgres) | Your existing supervisor — systemd or Docker restart policy |

Result: roles, DDL, and data stay consistent on every node, and a dead primary
is replaced in seconds — with no human, no etcd, no Kubernetes.

---

## Features

- **Automatic, consensus-driven failover** — an embedded Raft quorum keeps a single leader elected and promotes a new primary in seconds when it dies. No etcd, no Consul, no Kubernetes, no external DCS.
- **Full-cluster fidelity, for free** — tables, indexes, roles + SCRAM verifiers, GRANTs, DDL, and extensions all replicate byte-for-byte over Postgres physical (WAL) streaming replication. Standbys are exact copies, not logical subsets.
- **Safe by default** — quorum-gated promotion fences the old primary, picks the **highest-LSN** survivor (no acked-write loss), and self-demotes a minority or network-partitioned primary read-only (no split-brain).
- **Deadman watchdog** — a control plane that is alive but hung is fenced, not trusted.
- **Automatic re-join** — a deposed primary `pg_rewind`-rejoins as a standby; if its WAL is already gone, it falls back to a full `pg_basebackup` re-clone.
- **Zero committed-transaction loss** (opt-in) — quorum-sync ties the Postgres sync quorum (`synchronous_standby_names`) to the Raft consensus quorum, so an acked write is always present on the node that gets promoted.
- **Crash-safe consensus storage** — the Raft log, vote, and state machine are persisted with atomic write + `fsync` of both the file data **and** the containing directory, so an entry acknowledged as durable survives power loss.
- **Bounded on-disk footprint** — the Raft log is compacted via snapshotting; it does not grow without limit.
- **Operable from SQL** — `SELECT pg_replica.status();`, `pg_replica.failover()`, and friends. No sidecar agent or CLI required.
- **Client follows the failover** — a multi-host libpq / Node client (`target_session_attrs=read-write`) re-resolves the new primary on reconnect; the extension only *publishes* who the primary is.
- **Lightweight** — one `.so` plus Postgres: single-digit-MB private memory, sub-percent idle CPU, ~5 s to a writable new primary.

---

## End-user install

```bash
curl -fsSL https://hyperiondb.github.io/hyperiondb/install.sh | sudo bash
sudo apt-get install -y postgresql-18-pg-replica      # or -17, -16, -15, -14

# enable the extension
sudo sed -i "s/^#\?shared_preload_libraries.*/shared_preload_libraries = 'pg_replica'/" \
  /etc/postgresql/18/main/postgresql.conf
sudo systemctl restart postgresql
```

```sql
CREATE EXTENSION IF NOT EXISTS pg_replica;
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '<pass>'; // *same* password used for main user
```

---

## Performance (highly subjective numbers for 0.1.0)

pgbench, 8 clients / 4 jobs, single-row INSERT on the primary, 10 s, async mode

Memory

- 8 MB RSS
- 2 MB PSS
- 0.8 MB private

CPU

- 0.38% idle

Throughput

- 2494 tps
- 3.2 ms avg latency
- 17.6k rows

Failover latency

- 5 s to writable new primary

## Test

```bash
# Build dependencies

sudo apt-get update
sudo apt-get install -y \
  build-essential pkg-config libssl-dev \
  libreadline-dev zlib1g-dev flex bison \
  libxml2-dev libxslt-dev libxml2-utils xsltproc \
  ccache libclang-dev clang \
  iptables libfaketime
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
rustup default stable
rustc --version

# test dependencies
cargo install --locked cargo-pgrx --version 0.18.1
cargo pgrx init
cargo pgrx run pg18

CREATE EXTENSION pg_replica;
```

### Run the whole test suite

One command builds the extension + test clients and runs every test (3-node
clusters in `/tmp`, torn down between each):

```bash
bash scripts/run-all-tests.sh            # build, then run all tests + summary
bash scripts/run-all-tests.sh --no-build # skip the rebuild
```

Coverage (`scripts/test-*.sh`, each spins a real 3-node cluster):

| Test | Proves |
|------|--------|
| `test-model` | **formal model check** (stateright, exhaustive, N=3 and N=5): **(1) durability** — the sync-quorum math (`ack = majority`) loses **no acked transaction** across every reachable crash/failover interleaving; **(2) split-brain** — a **resurrected stale-term primary** can never commit a conflicting write (a commit needs a majority still at the leader's term). Both have negative controls that produce counterexamples |
| `test-m3-lsn` | failover promotes the **highest-LSN** survivor (no data loss) |
| `test-m4-fence` | minority primary **self-demotes** read-only (no split-brain) |
| `test-m4-watchdog` | a **hung** control plane is fenced by the deadman watchdog |
| `test-m4-partition` | a **network-partitioned** (but running) primary self-demotes |
| `test-m5-rejoin` | deposed primary `pg_rewind`-rejoins as a standby |
| `test-m5-walgone` | WAL gone → automatic **`pg_basebackup` re-clone** |
| `test-compaction` | Raft log stays **bounded** via snapshotting |
| `test-m6-routing` | a multi-host client **follows the failover** with only a reconnect |
| `test-m7-sync` | quorum-sync = **zero committed-transaction loss** on failover |
| `test-quorum-consistency` | the Postgres **sync quorum** (`synchronous_standby_names`) and the **Raft consensus quorum** name the same nodes, and a write confirmed by one standby is **promoted onto that standby** — no two-quorum drift on failover |
| `test-perf` | supervisor **memory** (single-digit MB private overhead), **idle CPU**, write **throughput** (pgbench), and **failover latency** |
| `test-chaos` | Jepsen-style: continuous writer under partitions / SIGSTOP / kill / clock-skew / slow-disk / rolling-restart — **0 split-brain**, converges, zero-loss for clean failovers |

---

## Goals / non-goals

**Goals**
- Automatic failover on a 3- or 5-node cluster (Raft majority quorum).
- Full-cluster fidelity: roles, GRANTs, DDL, extensions, data — all replicated
  (free, via physical replication).
- Zero external coordination services (Raft is embedded).
- Lightweight: one `.so` extension + Postgres; no JVM, no Go control plane, no k8s.
- Safe by default: quorum-gated, fences the old primary, picks the most-advanced
  replica, rejoins the loser with `pg_rewind`.
- Operable from SQL: `SELECT pg_replica.status();`, `pg_replica.failover()`, etc.

**Non-goals**
- Sharding / horizontal write scale-out → that is Citus, a different axis.
- Connection pooling / proxy → recommend HAProxy or libpq multi-host
  (`target_session_attrs=read-write`). pg_replica only *publishes* who the primary is.
- Backups / PITR → recommend pgBackRest or wal-g.
- Logical / multi-master replication → out of scope by design.

---

## Positioning (honest prior art)

| Tool | Consensus | External deps | Form factor |
|------|-----------|---------------|-------------|
| **CloudNativePG** | k8s control plane | **Kubernetes required** | operator |
| **Patroni** | etcd/Consul/k8s — *or* `pysyncobj` Raft mode | DCS (or its raft lib) | Python agent |
| **pg_auto_failover** | single **monitor** node (not quorum) | a monitor Postgres | C ext + agent |
| **Stolon** | external store (etcd/consul) | DCS | Go agents |
| **repmgr** | none (manual/assisted) | — | C ext + CLI |
| **pg_replica** (this) | **embedded Raft quorum** | **none** | **Postgres extension** |

The niche: embedded quorum (unlike pg_auto_failover's single monitor), **no
external DCS** (unlike Patroni-etcd / Stolon), **no Kubernetes** (unlike CNPG),
shipped as a plain Postgres extension. Closest existing thing is Patroni's
`raft` DCS mode — pg_replica aims to be that idea, but native, in-process, and Rust-light.

---

## Gotchas (sic!)

- If a standby ever runs a different glibc than the primary that built the btree indexes, the standby's text/varchar indexes are silently mis-ordered. The instant extension promotes that standby, index-using queries return missing/duplicate rows with nothing in the logs. Extensions's entire job is safe promotion; a glibc skew turns a clean failover into silent corruption
- Extension rusn 100 ms tick, 250 ms heartbeats, 1000–2000 ms election window. A bad multi-hundred-ms (or >1 s) stall of THP on the primary's host can delay heartbeats/gossip enough that standbys mark it unhealthy and start an election - a spurious failover of a healthy primary.

---

## Docs

- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) — components, failover flow, fencing, the hard problems.
- [docs/DECISIONS.md](docs/DECISIONS.md) — the load-bearing design choices and why.
- [docs/DURABILITY.md](docs/DURABILITY.md) — zero-loss configuration, failure modes, what backups still cover.
- [docs/TODO.md](docs/TODO.md) — phased milestones
- [docs/CLIENT_TODO.md](docs/CLIENT_TODO.md) — phased milestones for nodejs addon
- [docs/DOCKER_CONFIG.md](docs/DOCKER_CONFIG.md) - Docker config (ParadeDb as example)

---

## Listed in

- [https://crates.io/crates/pg_replica](crates.io)