pg_replica 0.3.0

# Durability & zero-loss guide

How close `pg_replica` (+ the `hyperiondb-client` pool) can get to "no lost writes or reads,"
what configuration that requires, and the failure modes that no configuration removes.

"Zero loss" is really three separate guarantees, with three different mechanisms:

| Guarantee | Means | Achievable? |
|-----------|-------|-------------|
| **Durability** | every *acked* (committed) write survives a failover | **Yes**, within the cluster's fault budget — with synchronous quorum commit |
| **Availability** | no *failed* requests during a failover | **Not automatically** — there is a ~5 s window; needs app-level retry + idempotency |
| **Read correctness** | no stale / split-brain reads | **Yes** from the primary; standby reads are eventually-consistent |

---

## A. Durability — zero acked-write loss

A failover can only be lossless if every acked transaction is already on a node that can be
promoted. That requires **synchronous quorum commit**.

### Configuration

Set one GUC; `pg_replica` does the rest:

```ini
# postgresql.conf  (pg_replica.synchronous is a Postmaster GUC — needs a restart)
pg_replica.synchronous = on
```

When on, the primary continuously maintains

```
synchronous_standby_names = 'ANY <majority-1> (peer1, peer2, ...)'
```

so a `COMMIT` is not acked until the WAL is flushed on a **majority of nodes**. Because the
sync quorum (`ANY majority-1` standbys + the primary = a majority) is the *same* majority Raft
elects a leader from, every acked write is guaranteed to be on whichever node wins the next
election. `test-quorum-consistency` exists to prove these two quorums never drift apart.

| Cluster | `synchronous_standby_names` set by pg_replica | Tolerates (zero-loss) |
|---------|------------------------------------------------|-----------------------|
| 3 nodes | `ANY 1 (n2, n3)` | **1** node/leader failure |
| 5 nodes | `ANY 2 (n2, n3, n4, n5)` | **2** simultaneous failures |

Also required (these are PostgreSQL defaults — do **not** weaken them):

- **`synchronous_commit = on`** (the default). Never `local`, `off`, or `remote_write`
  (`remote_write` acks before the standby fsyncs, so a standby crash can still lose it).
  Use `remote_apply` if you additionally want a committed row to be *visible* on the
  confirming standbys before the ack (helps read-your-writes — see section C).
- **`data_checksums = on`** at `initdb` time, so silent disk corruption is detected, not
  replicated as truth.

### What synchronous mode costs

- **Latency:** a commit waits for a standby flush (one extra network round-trip).
- **Availability:** if fewer than `majority-1` standbys are caught up, writes **block** until
  one returns. That is the correct trade — blocking beats acking a write that could be lost.
  (With `pg_replica.synchronous = off` you get async replication: lower latency, but a primary
  can ack a commit and crash before replicating it → that write is lost on failover.)

### The residual durability risk

Even fully synchronous, a 3-node `ANY 1` cluster can lose an acked write **only** if a fault
impairs the sync-confirming standby *at the same time* as the primary, mid-failover (the
"sync-ANY-k edge" noted in `test-chaos`). Clean single failovers are zero-loss; to survive two
overlapping faults, run **5 nodes / `ANY 2`**.

---

## B. Availability — failed requests during a failover

A failover takes **~5 s** to a writable new primary. During that window there is **no
writable primary**, and the client cannot hide that for you:

- `hyperiondb-client` retries the *connection checkout* with backoff up to `acquireTimeoutMs`
  (default 5000), then throws `no writable primary available after <ms>ms`.
- **It does not replay your SQL** — only the connection acquisition. A transaction in flight
  when the primary dies **fails**, and your application must retry it.
- **Ambiguous COMMIT:** if the connection drops *after* `COMMIT` was sent but before the ack,
  the transaction may have committed and replicated. The client correctly reports an error,
  but a blind retry would **double-apply**. This is the in-doubt-transaction problem and
  cannot be solved at the pool layer.

To make failovers invisible to end users, the application needs all three:

1. **Retry** on the typed `no writable primary…` error and on connection-drop errors.
2. **`acquireTimeoutMs` ≥ your failover time** (raise it above ~5 s so a single failover
   surfaces as a delay, not an error).
3. **Idempotent writes**, so retrying an ambiguous commit is safe — e.g. a client-generated
   UUID with `INSERT … ON CONFLICT DO NOTHING`, or an outbox/dedup key. Reads are inherently
   safe to retry.

---

## C. Reads

Reads never get "lost" (they don't mutate), but they can be **stale** or, without fencing,
**wrong**:

- A `mode: 'read-only'` / `prefer-standby` pool reads from standbys, which lag the primary.
  For **read-your-writes** consistency, read from the **read-write pool (primary)**, or use
  `synchronous_commit = remote_apply` (makes a commit visible on the *confirming* standbys
  before it acks — still not on *all* standbys).
- **Split-brain reads are prevented:** a demoted or minority-partitioned primary fences itself
  read-only (`default_transaction_read_only = on`), and the client's checkout validation
  (`SHOW transaction_read_only`) evicts those connections — so you never read from a stale
  ex-primary as if it were current (`test-m4-fence`, `test-m4-partition`).

---

## Failure modes at a glance

| Layer | Failure | Effect | Mitigation |
|-------|---------|--------|------------|
| Postgres repl | async / weak `synchronous_commit` | acked write lost on failover | `pg_replica.synchronous = on`, keep `synchronous_commit = on` |
| pg_replica | quorum lost (≥2 of 3 down) | no leader → writes **block** | correct (safety > availability); 5 nodes to tolerate 2 |
| pg_replica | minority-partitioned primary | that side can't write | self-fences → no split-brain |
| pg_replica | primary + sync standby fault together, mid-failover | rare acked-write loss | 5-node `ANY 2`; backups |
| client | failover window (~5 s) | queries fail | retry + raise `acquireTimeoutMs` |
| client | ambiguous COMMIT | unknown outcome | **idempotent writes** |
| client | read-only pool | stale reads | read the primary for strong reads |
| infra | supervisor doesn't restart PG | capacity shrinks → quorum risk | reliable systemd / Docker restart policy |
| infra | single DC / disk corruption / bad SQL (`DELETE` w/o `WHERE`) | total or logical loss | multi-AZ, `data_checksums`, **backups + PITR** |

---

## What this does *not* replace

Raft + synchronous replication protect against **node and leader failure inside one cluster**.
They do nothing for:

- **Logical errors / bad migrations** → you still need **backups + PITR** (pgBackRest, wal-g).
- **Correlated loss** (one rack / AZ / DC / region) → spread nodes across AZs, or add a remote
  (optionally synchronous) standby, at a latency cost.
- **Bugs / operator error** → backups, again.

Backups and PITR are an explicit non-goal of `pg_replica` (see the README) — they remain
**mandatory** for real durability.

---

## Verdict

- **Zero loss of acknowledged writes:** achievable and *tested* for the failures pg_replica is
  built for (single node/leader failure, clean failover) **iff** `pg_replica.synchronous = on`
  with `synchronous_commit = on`. Not absolute: overlapping faults beyond the budget,
  correlated/DC loss, corruption, and logical errors still need 5-node clusters, multi-AZ,
  checksums, and backups.
- **Zero failed requests:** only with **app-level retry + idempotency**; the ~5 s failover
  window is real and the pool will not replay statements for you.
- **Zero stale reads:** read from the primary (or `remote_apply`); standby reads are
  eventually-consistent by design.

"Zero loss" end-to-end is a property of **cluster config + application retry/idempotency +
backups together** — not of `pg_replica` or the client alone.

## Validated by

| Property | Test (`scripts/`) |
|----------|-------------------|
| Quorum-sync = zero committed-transaction loss on failover | `test-m7-sync` |
| Sync quorum and Raft quorum name the same nodes | `test-quorum-consistency` |
| Continuous writer, faults injected, 0 split-brain + zero-loss for clean failovers | `test-chaos` |
| Highest-LSN survivor is promoted (no data loss) | `test-m3-lsn` |
| Minority primary self-fences read-only | `test-m4-fence`, `test-m4-partition` |
| Client follows the failover with only a reconnect | `test-m6-routing` |

See also [ARCHITECTURE.md](ARCHITECTURE.md), [DECISIONS.md](DECISIONS.md), and the client
guide [CLIENT.md](CLIENT.md).