# walrust HA Architecture
SQLite high availability via S3-based leader election and WAL streaming.
## Problem
A single SQLite process is a SPOF: if the machine dies, or while a deploy is in flight, the database is unavailable. We need a 2-node hot standby with zero-downtime deploys, fast failover, and zero data loss.
## Architecture
```
┌──────────────┐ ┌──────────────┐
│ Node 1 │ WAL stream │ Node 2 │
│ LEADER │───────────────>│ FOLLOWER │
│ read+write │ (via S3) │ read replica │
└──────┬───────┘ └──────┬───────┘
│ │
sync_wal (1s) pull_incremental (1s)
│ │
▼ ▼
┌──────────────────────────────────────────────┐
│ S3 (Tigris) │
│ {prefix}{db}/0001/*.ltx (snapshots) │
│ {prefix}{db}/0000/*.ltx (incrementals) │
│ {prefix}leader.json (lease) │
│ {prefix}nodes/{id}.json (registrations) │
└──────────────────────────────────────────────┘
▲ ▲
discover_leader() discover_replicas()
│ │
┌─────┴─────┐ ┌─────┴─────┐
│ Writer │ │ Reader │
│ (client) │ │ (client) │
└───────────┘ └───────────┘
```
S3 is the only shared state. No gossip protocol, no Raft, no external coordinator.
## Components
### Leader Election (`leader.rs`)
Leader election uses S3 conditional PUTs as a compare-and-swap (CAS) primitive. The lease is a JSON object at `{prefix}leader.json`:
```json
{
"instance_id": "node1",
"address": "http://node1:9001",
"claimed_at": 1711065600,
"ttl_secs": 5,
"session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
- **Claim**: `PutObject` with `If-None-Match: *` (create-if-not-exists) or `If-Match: <etag>` (CAS on expired lease)
- **Renew**: `PutObject` with `If-Match: <etag>` every 2s. Same `session_id` across renewals.
- **Abdicate**: `DeleteObject` on graceful shutdown (enables ~2s failover vs ~15s on crash)
- **Split-brain prevention**: Tigris enforces `If-Match` (ETag) on conditional PUTs in strong-consistent regions (e.g., `iad`). Two concurrent claims: exactly one gets 200, the other gets 412. As a safety net for cross-region use (where consistency is eventual), we also do a post-claim read-back with random jitter to confirm the winner.
- **`session_id`**: UUID generated per leadership claim. Changes on every promotion, even if the same instance reclaims. Consumers (writers, proxies) poll for a new `session_id` to detect failover — not `instance_id`, since the same node can crash and come back.
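The claim/renew decision above can be reduced to a pure function over the current lease state. A minimal sketch, assuming a `Lease` struct that mirrors `leader.json` and a hypothetical `claim_mode` helper (not the actual `leader.rs` API):

```rust
// Illustrative sketch of the lease-expiry check behind claim/renew.
// `Lease`, `ClaimMode`, and `claim_mode` are hypothetical names.
struct Lease {
    instance_id: String,
    claimed_at: u64, // unix seconds
    ttl_secs: u64,
    session_id: String,
}

impl Lease {
    fn is_expired(&self, now: u64) -> bool {
        now > self.claimed_at + self.ttl_secs
    }
}

/// Which conditional PUT to issue when trying to claim leadership.
enum ClaimMode {
    CreateIfAbsent,    // If-None-Match: * — no lease object exists
    CasOnEtag(String), // If-Match: <etag> — lease exists but has expired
    Blocked,           // lease exists and is still live
}

fn claim_mode(current: Option<(&Lease, &str)>, now: u64) -> ClaimMode {
    match current {
        None => ClaimMode::CreateIfAbsent,
        Some((lease, etag)) if lease.is_expired(now) => ClaimMode::CasOnEtag(etag.to_string()),
        Some(_) => ClaimMode::Blocked,
    }
}

fn main() {
    let lease = Lease {
        instance_id: "node1".into(),
        claimed_at: 1711065600,
        ttl_secs: 5,
        session_id: "sess-1".into(),
    };
    // 6s after the claim, the 5s TTL has lapsed: a CAS claim is allowed.
    match claim_mode(Some((&lease, "etag-abc")), lease.claimed_at + 6) {
        ClaimMode::CasOnEtag(etag) => println!("claim via If-Match: {etag}"),
        _ => println!("cannot claim"),
    }
}
```

Exactly one of two concurrent `CasOnEtag` PUTs can succeed, because the loser's ETag is invalidated by the winner's write.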
### WAL Streaming (Leader → S3)
The leader runs `sync_wal()` every 1s:
1. Read new WAL frames since last sync
2. Deduplicate pages (WAL may have multiple writes to the same page)
3. Encode as LTX with checksum chaining (pre_apply → post_apply)
4. Upload to `{prefix}{db}/0000/{min_txid}-{max_txid}.ltx`
5. Track a `wal_page_overlay` for correct checksums between checkpoints
LTX files are self-describing — filenames encode the TXID range. No manifest needed for discovery.
Periodic snapshots (full DB as LTX) go to `{prefix}{db}/0001/` for restore base.
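The self-describing filename scheme works because zero-padded hex makes S3's lexicographic listing equal TXID order. A sketch, assuming a hypothetical `ltx_name` helper and 16-hex-digit padding (the real `ltx.rs` encoding may differ):

```rust
// Illustrative: zero-padded hex filenames sort lexicographically
// in the same order as their numeric TXIDs.
fn ltx_name(min_txid: u64, max_txid: u64) -> String {
    format!("{:016x}-{:016x}.ltx", min_txid, max_txid)
}

fn main() {
    let mut names = vec![
        ltx_name(0x10, 0x1f),
        ltx_name(0x01, 0x0f),
        ltx_name(0x20, 0x2f),
    ];
    names.sort(); // plain string sort, exactly what S3 ListObjects returns
    // Because every TXID is padded to the same width, lexicographic
    // order equals numeric TXID order:
    assert_eq!(names[0], "0000000000000001-000000000000000f.ltx");
    // `start_after` can therefore seek straight past already-applied files:
    let current = ltx_name(0x0f, 0x0f);
    let newer: Vec<&String> = names.iter().filter(|n| n.as_str() > current.as_str()).collect();
    println!("{} newer files", newer.len());
}
```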
### Follower Pull (`sync.rs`)
The follower runs `pull_incremental()` every 1s:
1. **Efficient listing**: `list_objects_after()` with S3 `start_after` parameter. Only lists incrementals newer than current TXID — no scanning of old keys. For a DB with 1M incrementals at TXID 999,990, this lists ~10 keys instead of 1M.
2. **Parallel download**: `FuturesUnordered` queue+worker model. Seeds 8 workers, each completes and immediately grabs the next file. No idle workers waiting for the slowest in a batch.
3. **Sequential application**: Downloads are parallel but application is sequential — the LTX checksum chain is serial (each file's `pre_apply` must match the previous file's `post_apply`).
4. **Stale handling**: If an incremental's checksum doesn't match (from a previous leader's lineage), it's skipped with `continue`, not `break` — later files from the new lineage may still apply.
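Steps 3-4 — sequential application with skip-on-mismatch — can be sketched as follows, using illustrative structs rather than the actual `sync.rs` types:

```rust
// Minimal sketch of sequential LTX application with checksum chaining.
// `pre_apply`/`post_apply` mirror the LTX header fields; `LtxFile` and
// `apply_chain` are hypothetical names.
struct LtxFile {
    name: &'static str,
    pre_apply: u64,  // checksum the DB must have before applying
    post_apply: u64, // checksum the DB has after applying
}

/// Apply files in listing order; skip (don't abort on) files whose
/// pre_apply doesn't chain — they belong to a stale leader lineage.
fn apply_chain(mut db_checksum: u64, files: &[LtxFile]) -> (u64, Vec<&'static str>) {
    let mut applied = Vec::new();
    for f in files {
        if f.pre_apply != db_checksum {
            // `continue`, not `break`: later files from the new
            // lineage may still chain onto our current state.
            continue;
        }
        db_checksum = f.post_apply;
        applied.push(f.name);
    }
    (db_checksum, applied)
}

fn main() {
    let files = [
        LtxFile { name: "a", pre_apply: 0, post_apply: 1 },
        LtxFile { name: "stale", pre_apply: 99, post_apply: 100 }, // old lineage
        LtxFile { name: "b", pre_apply: 1, post_apply: 2 },
    ];
    let (checksum, applied) = apply_chain(0, &files);
    println!("checksum={checksum} applied={applied:?}"); // "stale" is skipped
}
```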
### Node Registration (`leader.rs`)
Every node registers itself in S3 at `{prefix}nodes/{instance_id}.json`:
```json
{
"instance_id": "node2",
"address": "http://node2:9002",
"role": "follower",
"leader_session_id": "550e8400-e29b-41d4-a716-446655440000",
"last_seen": 1711065600
}
```
- **Register**: Follower writes its registration on startup and heartbeats it periodically (every 5s, piggybacks on election loop)
- **`leader_session_id`**: Must match the current leader's `session_id`. When a new leader is elected, followers re-register with the new `session_id`. Clients use this to filter stale registrations.
- **Promotion**: When a follower becomes leader, it deletes its `nodes/{instance_id}.json` (it's now discoverable via `leader.json`)
- **Cleanup**: Stale registrations (`last_seen + ttl < now`) are ignored by clients. Nodes delete their own file on graceful shutdown.
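The client-side filtering rules can be sketched in a few lines. The struct mirrors the registration JSON; `valid_replicas` and `REG_TTL_SECS` are illustrative assumptions, not the actual API:

```rust
// Sketch of how a reader client filters `nodes/` registrations.
const REG_TTL_SECS: u64 = 15; // assumed registration TTL

struct Registration {
    instance_id: String,
    address: String,
    leader_session_id: String,
    last_seen: u64, // unix seconds
}

fn valid_replicas<'a>(
    regs: &'a [Registration],
    current_session: &str,
    now: u64,
) -> Vec<&'a Registration> {
    regs.iter()
        .filter(|r| r.last_seen + REG_TTL_SECS > now)       // not stale
        .filter(|r| r.leader_session_id == current_session) // current lineage
        .collect()
}

fn main() {
    let regs = vec![
        Registration { instance_id: "node2".into(), address: "http://node2:9002".into(),
                       leader_session_id: "sess-2".into(), last_seen: 1000 },
        Registration { instance_id: "node3".into(), address: "http://node3:9003".into(),
                       leader_session_id: "sess-1".into(), last_seen: 1000 }, // old session
    ];
    let ok = valid_replicas(&regs, "sess-2", 1005);
    println!("{} valid replica(s): {}", ok.len(), ok[0].address);
}
```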
### Read/Write Routing (HTTP middleware)
The follower serves reads locally from its own database (kept current via `pull_incremental`, up to ~1s stale). Only writes are forwarded to the leader.
- **Reads**: Served locally on whichever node the client is connected to (leader or follower)
- **Writes**: Served locally on leader, reverse-proxied to leader on follower
Clients can connect to any node. If connected to the leader, everything is local. If connected to a follower, reads are local (fast, ~1s stale) and writes are transparently forwarded.
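The routing split boils down to one decision per request. A minimal sketch, assuming reads map to GET/HEAD (the actual middleware's read/write classification may differ):

```rust
// Illustrative read/write routing decision; `Route` and
// `route_request` are hypothetical names.
#[derive(Debug, PartialEq)]
enum Route {
    Local,         // serve from this node's SQLite
    ProxyToLeader, // reverse-proxy to the current leader
}

fn route_request(method: &str, is_leader: bool) -> Route {
    match method {
        // Reads are always local — on a follower they may be ~1s stale.
        "GET" | "HEAD" => Route::Local,
        // Writes must hit the leader's database.
        _ if is_leader => Route::Local,
        _ => Route::ProxyToLeader,
    }
}

fn main() {
    assert_eq!(route_request("GET", false), Route::Local);
    assert_eq!(route_request("POST", false), Route::ProxyToLeader);
    assert_eq!(route_request("POST", true), Route::Local);
    println!("routing table ok");
}
```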
### Service Discovery
Clients only need S3 credentials and the bucket/prefix. Zero node addresses in config.
**Write discovery** (via `leader.json`):
1. Read `{prefix}leader.json` to get leader address + `session_id`
2. Send writes to that address
3. On 3+ consecutive failures, poll `leader.json` for a new `session_id`
4. When `session_id` changes, switch to the new leader's address and resume
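The write-discovery loop above can be sketched as a small state machine. The types are illustrative; the 3-failure threshold comes from the steps above:

```rust
// Sketch of the writer's failover loop: after repeated failures, poll
// leader.json and switch only when session_id changes.
const FAILURE_THRESHOLD: u32 = 3;

struct LeaderInfo {
    address: String,
    session_id: String,
}

struct Writer {
    current: LeaderInfo,
    consecutive_failures: u32,
}

impl Writer {
    /// Record a failed write; returns true when it's time to re-poll leader.json.
    fn on_failure(&mut self) -> bool {
        self.consecutive_failures += 1;
        self.consecutive_failures >= FAILURE_THRESHOLD
    }

    /// Adopt a freshly-read lease only if it represents a *new* session —
    /// the same address with the same session_id is not a failover.
    fn maybe_switch(&mut self, latest: LeaderInfo) -> bool {
        if latest.session_id != self.current.session_id {
            self.current = latest;
            self.consecutive_failures = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut w = Writer {
        current: LeaderInfo { address: "http://node1:9001".into(), session_id: "sess-1".into() },
        consecutive_failures: 0,
    };
    w.on_failure();
    w.on_failure();
    if w.on_failure() {
        // Third failure: poll leader.json; node2 has claimed with a new session.
        let latest = LeaderInfo { address: "http://node2:9002".into(), session_id: "sess-2".into() };
        assert!(w.maybe_switch(latest));
    }
    println!("now writing to {}", w.current.address);
}
```

Keying the switch on `session_id` rather than address is what makes same-node recovery (below) work: a reclaimed lease on the same instance still produces a new session.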
**Read discovery** (via `nodes/`):
1. List `{prefix}nodes/` to find all registered replicas
2. Filter: `last_seen + ttl > now` and `leader_session_id` matches current leader's `session_id`
3. Connect to any valid node (prefer followers to keep load off leader, fall back to leader)
4. If the connected node disappears (connection errors), rediscover from `nodes/`
## Failover Scenarios
### Graceful (deploy, SIGTERM)
1. Leader receives SIGTERM
2. Close HTTP connections (reject new writes)
3. Final `sync_wal()` — flush last WAL frames to S3
4. `abdicate()` — delete `leader.json`
5. Follower detects missing lease, claims immediately (new `session_id`)
6. Follower deletes its `nodes/{instance_id}.json` (now leader, discoverable via `leader.json`)
7. Follower runs `pull_incremental()` to catch up to final TXID
8. Opens SQLite, starts WAL streaming as new leader
9. Old leader's node registration (if any) goes stale — ignored by TTL
**Downtime: ~2-3s. Data loss: zero.**
### Crash (kill -9, machine death)
1. Leader stops renewing lease
2. Follower reads expired lease for `REQUIRED_EXPIRED_READS` consecutive checks (currently 1)
3. Follower claims via CAS `If-Match: <expired_etag>`
4. Follower restores from S3 (snapshot + incrementals)
5. Opens SQLite, starts WAL streaming
**Downtime: ~TTL + REQUIRED_EXPIRED_READS × RENEW_INTERVAL — with the 5s TTL and 2s renew interval above, ≈7s plus restore time (~7-10s). Data loss: up to one sync interval (1s of writes).**
### Same-node recovery
The crashed node restarts, reads `leader.json`, sees a different `session_id` → becomes follower, registers in `nodes/`. If no other node claimed, it claims with a new `session_id` → writers detect the new session and reconnect. Readers connected to the crashed node rediscover via `nodes/` listing.
## Experiment Results
Tested locally with 2 nodes, 100ms write interval:
| Metric | Result |
|---|---|
| Graceful failover time | ~2s |
| Data loss (graceful) | 0-3 rows (sub-sync-interval) |
| Writer reconnection | Automatic via S3 discovery |
| Follower proxy | Transparent to clients |
## Key Design Decisions
1. **S3 as the only coordination bus** — no Raft, no gossip, no external service. S3 is already a dependency for WAL storage, so leader election through it adds zero new infrastructure.
2. **No manifest for incremental discovery** — LTX filenames encode TXID ranges as zero-padded hex. S3 lexicographic listing = TXID order. `start_after` skips directly to the right position.
3. **`session_id` over `instance_id`** for failover detection — the same node can crash and reclaim. What matters is that a new leadership session exists, not which instance holds it.
4. **Read/write split over full proxy** — follower serves reads locally (~1s stale), only forwards writes. Clients get read scaling without knowing which node is the leader.
5. **Node registration with `leader_session_id`** — replicas are only valid for the current leadership session. New election invalidates all registrations, forcing re-registration. Clients can validate a replica by checking its `leader_session_id` against `leader.json`.
6. **`FuturesUnordered` over batch `join_all`** for parallel downloads — queue+worker model keeps all workers saturated. One slow S3 response doesn't stall the other 7 workers.
7. **Checksum chain for integrity** — every LTX file chains to the previous via `pre_apply`/`post_apply` checksums. Stale files from old leadership lineages are detected and skipped automatically.
## File Map
```
walrust/crates/walrust-core/
├── src/
│ ├── leader.rs # S3 leader election (LeaseData, LeaderElection, discover_leader)
│ ├── replicator.rs # High-level replication manager (wraps sync loop)
│ ├── sync.rs # Core: sync_wal, pull_incremental, restore, download_parallel
│ ├── storage.rs # StorageBackend trait + S3Backend (list_objects_after)
│ ├── s3.rs # Raw S3 operations (upload, download, list, list_after)
│ ├── ltx.rs # LTX encode/decode with checksum chaining
│ ├── wal.rs # SQLite WAL parsing
│ └── bin/
│ ├── ha_experiment.rs # 2-node HA experiment (leader election + proxy + hooks)
│ └── ha_writer.rs # S3-discovery writer (discovers leader via leader.json)
```