# walrust HA Architecture
SQLite high availability via S3-based leader election and WAL streaming.
## Problem
A single SQLite process is a SPOF: if the machine dies, or while a deploy is in flight, the database is unavailable. We need a 2-node hot standby with zero-downtime deploys, fast failover, and zero data loss.
## Architecture
```
┌──────────────┐ ┌──────────────┐
│ Node 1 │ WAL stream │ Node 2 │
│ LEADER │───────────────>│ FOLLOWER │
│ read+write │ (via S3) │ read replica │
└──────┬───────┘ └──────┬───────┘
│ │
sync_wal (1s) pull_incremental (1s)
│ │
▼ ▼
┌──────────────────────────────────────────────┐
│ S3 (Tigris) │
│ {prefix}{db}/0001/*.ltx (snapshots) │
│ {prefix}{db}/0000/*.ltx (incrementals) │
│ {prefix}leader.json (lease) │
│ {prefix}nodes/{id}.json (registrations) │
└──────────────────────────────────────────────┘
▲ ▲
discover_leader() discover_replicas()
│ │
┌─────┴─────┐ ┌─────┴─────┐
│ Writer │ │ Reader │
│ (client) │ │ (client) │
└───────────┘ └───────────┘
```
S3 is the only shared state. No gossip protocol, no Raft, no external coordinator.
## Components
### Leader Election (`leader.rs`)
Leader election uses S3 conditional PUTs as a compare-and-swap (CAS) primitive. The lease is a JSON object at `{prefix}leader.json`:
```json
{
"instance_id": "node1",
"address": "http://node1:9001",
"claimed_at": 1711065600,
"ttl_secs": 5,
"session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
- **Claim**: `PutObject` with `If-None-Match: *` (create-if-not-exists) or `If-Match: <etag>` (CAS on expired lease)
- **Renew**: `PutObject` with `If-Match: <etag>` every 2s. Same `session_id` across renewals.
- **Abdicate**: `DeleteObject` on graceful shutdown (enables ~2s failover vs ~15s on crash)
- **Split-brain prevention**: Tigris enforces `If-Match` (ETag) on conditional PUTs in strong-consistent regions (e.g., `iad`). Two concurrent claims: exactly one gets 200, the other gets 412. As a safety net for cross-region use (where consistency is eventual), we also do a post-claim read-back with random jitter to confirm the winner.
- **`session_id`**: UUID generated per leadership claim. Changes on every promotion, even if the same instance reclaims. Consumers (writers, proxies) poll for a new `session_id` to detect failover — not `instance_id`, since the same node can crash and come back.
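The claim/renew decision above can be reduced to a pure function over the current lease state. A minimal sketch, assuming a `Lease` struct that mirrors `leader.json` and a hypothetical `claim_mode` helper (not the actual `leader.rs` API):

```rust
// Illustrative sketch of the lease-expiry check behind claim/renew.
// `Lease`, `ClaimMode`, and `claim_mode` are hypothetical names.
struct Lease {
    instance_id: String,
    claimed_at: u64, // unix seconds
    ttl_secs: u64,
    session_id: String,
}

impl Lease {
    fn is_expired(&self, now: u64) -> bool {
        now > self.claimed_at + self.ttl_secs
    }
}

/// Which conditional PUT to issue when trying to claim leadership.
enum ClaimMode {
    CreateIfAbsent,    // If-None-Match: * — no lease object exists
    CasOnEtag(String), // If-Match: <etag> — lease exists but has expired
    Blocked,           // lease exists and is still live
}

fn claim_mode(current: Option<(&Lease, &str)>, now: u64) -> ClaimMode {
    match current {
        None => ClaimMode::CreateIfAbsent,
        Some((lease, etag)) if lease.is_expired(now) => ClaimMode::CasOnEtag(etag.to_string()),
        Some(_) => ClaimMode::Blocked,
    }
}

fn main() {
    let lease = Lease {
        instance_id: "node1".into(),
        claimed_at: 1711065600,
        ttl_secs: 5,
        session_id: "sess-1".into(),
    };
    // 6s after the claim, the 5s TTL has lapsed: a CAS claim is allowed.
    match claim_mode(Some((&lease, "etag-abc")), lease.claimed_at + 6) {
        ClaimMode::CasOnEtag(etag) => println!("claim via If-Match: {etag}"),
        _ => println!("cannot claim"),
    }
}
```

Exactly one of two concurrent `CasOnEtag` PUTs can succeed, because the loser's ETag is invalidated by the winner's write.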
### WAL Streaming (Leader → S3)
The leader runs `sync_wal()` every 1s:
1. Read new WAL frames since last sync
2. Deduplicate pages (WAL may have multiple writes to the same page)
3. Encode as LTX with checksum chaining (pre_apply → post_apply)
4. Upload to `{prefix}{db}/0000/{min_txid}-{max_txid}.ltx`
5. Track a `wal_page_overlay` for correct checksums between checkpoints
LTX files are self-describing — filenames encode the TXID range. No manifest needed for discovery.
Periodic snapshots (full DB as LTX) go to `{prefix}{db}/0001/` for restore base.
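The self-describing filename scheme works because zero-padded hex makes S3's lexicographic listing equal TXID order. A sketch, assuming a hypothetical `ltx_name` helper and 16-hex-digit padding (the real `ltx.rs` encoding may differ):

```rust
// Illustrative: zero-padded hex filenames sort lexicographically
// in the same order as their numeric TXIDs.
fn ltx_name(min_txid: u64, max_txid: u64) -> String {
    format!("{:016x}-{:016x}.ltx", min_txid, max_txid)
}

fn main() {
    let mut names = vec![
        ltx_name(0x10, 0x1f),
        ltx_name(0x01, 0x0f),
        ltx_name(0x20, 0x2f),
    ];
    names.sort(); // plain string sort, exactly what S3 ListObjects returns
    // Because every TXID is padded to the same width, lexicographic
    // order equals numeric TXID order:
    assert_eq!(names[0], "0000000000000001-000000000000000f.ltx");
    // `start_after` can therefore seek straight past already-applied files:
    let current = ltx_name(0x0f, 0x0f);
    let newer: Vec<&String> = names.iter().filter(|n| n.as_str() > current.as_str()).collect();
    println!("{} newer files", newer.len());
}
```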
### Follower Pull (`sync.rs`)
The follower runs `pull_incremental()` every 1s:
1. **Efficient listing**: `list_objects_after()` with S3 `start_after` parameter. Only lists incrementals newer than current TXID — no scanning of old keys. For a DB with 1M incrementals at TXID 999,990, this lists ~10 keys instead of 1M.
2. **Parallel download**: `FuturesUnordered` queue+worker model. Seeds 8 workers, each completes and immediately grabs the next file. No idle workers waiting for the slowest in a batch.
3. **Sequential application**: Downloads are parallel but application is sequential — the LTX checksum chain is serial (each file's `pre_apply` must match the previous file's `post_apply`).
4. **Stale handling**: If an incremental's checksum doesn't match (from a previous leader's lineage), it's skipped with `continue`, not `break` — later files from the new lineage may still apply.
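Steps 3-4 — sequential application with skip-on-mismatch — can be sketched as follows, using illustrative structs rather than the actual `sync.rs` types:

```rust
// Minimal sketch of sequential LTX application with checksum chaining.
// `pre_apply`/`post_apply` mirror the LTX header fields; `LtxFile` and
// `apply_chain` are hypothetical names.
struct LtxFile {
    name: &'static str,
    pre_apply: u64,  // checksum the DB must have before applying
    post_apply: u64, // checksum the DB has after applying
}

/// Apply files in listing order; skip (don't abort on) files whose
/// pre_apply doesn't chain — they belong to a stale leader lineage.
fn apply_chain(mut db_checksum: u64, files: &[LtxFile]) -> (u64, Vec<&'static str>) {
    let mut applied = Vec::new();
    for f in files {
        if f.pre_apply != db_checksum {
            // `continue`, not `break`: later files from the new
            // lineage may still chain onto our current state.
            continue;
        }
        db_checksum = f.post_apply;
        applied.push(f.name);
    }
    (db_checksum, applied)
}

fn main() {
    let files = [
        LtxFile { name: "a", pre_apply: 0, post_apply: 1 },
        LtxFile { name: "stale", pre_apply: 99, post_apply: 100 }, // old lineage
        LtxFile { name: "b", pre_apply: 1, post_apply: 2 },
    ];
    let (checksum, applied) = apply_chain(0, &files);
    println!("checksum={checksum} applied={applied:?}"); // "stale" is skipped
}
```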
### Node Registration (`leader.rs`)
Every node registers itself in S3 at `{prefix}nodes/{instance_id}.json`:
```json
{
"instance_id": "node2",
"address": "http://node2:9002",
"role": "follower",
"leader_session_id": "550e8400-e29b-41d4-a716-446655440000",
"last_seen": 1711065600
}
```
- **Register**: Follower writes its registration on startup and heartbeats it periodically (every 5s, piggybacks on election loop)
- **`leader_session_id`**: Must match the current leader's `session_id`. When a new leader is elected, followers re-register with the new `session_id`. Clients use this to filter stale registrations.
- **Promotion**: When a follower becomes leader, it deletes its `nodes/{instance_id}.json` (it's now discoverable via `leader.json`)
- **Cleanup**: Stale registrations (`last_seen + ttl < now`) are ignored by clients. Nodes delete their own file on graceful shutdown.
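The client-side filtering rules can be sketched in a few lines. The struct mirrors the registration JSON; `valid_replicas` and `REG_TTL_SECS` are illustrative assumptions, not the actual API:

```rust
// Sketch of how a reader client filters `nodes/` registrations.
const REG_TTL_SECS: u64 = 15; // assumed registration TTL

struct Registration {
    instance_id: String,
    address: String,
    leader_session_id: String,
    last_seen: u64, // unix seconds
}

fn valid_replicas<'a>(
    regs: &'a [Registration],
    current_session: &str,
    now: u64,
) -> Vec<&'a Registration> {
    regs.iter()
        .filter(|r| r.last_seen + REG_TTL_SECS > now)       // not stale
        .filter(|r| r.leader_session_id == current_session) // current lineage
        .collect()
}

fn main() {
    let regs = vec![
        Registration { instance_id: "node2".into(), address: "http://node2:9002".into(),
                       leader_session_id: "sess-2".into(), last_seen: 1000 },
        Registration { instance_id: "node3".into(), address: "http://node3:9003".into(),
                       leader_session_id: "sess-1".into(), last_seen: 1000 }, // old session
    ];
    let ok = valid_replicas(&regs, "sess-2", 1005);
    println!("{} valid replica(s): {}", ok.len(), ok[0].address);
}
```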
### Read/Write Routing (HTTP middleware)
The follower serves reads locally from its own database (kept current via `pull_incremental`, up to ~1s stale). Only writes are forwarded to the leader.
- **Reads**: Served locally on whichever node the client is connected to (leader or follower)
- **Writes**: Served locally on leader, reverse-proxied to leader on follower
Clients can connect to any node. If connected to the leader, everything is local. If connected to a follower, reads are local (fast, ~1s stale) and writes are transparently forwarded.
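The routing split boils down to one decision per request. A minimal sketch, assuming reads map to GET/HEAD (the actual middleware's read/write classification may differ):

```rust
// Illustrative read/write routing decision; `Route` and
// `route_request` are hypothetical names.
#[derive(Debug, PartialEq)]
enum Route {
    Local,         // serve from this node's SQLite
    ProxyToLeader, // reverse-proxy to the current leader
}

fn route_request(method: &str, is_leader: bool) -> Route {
    match method {
        // Reads are always local — on a follower they may be ~1s stale.
        "GET" | "HEAD" => Route::Local,
        // Writes must hit the leader's database.
        _ if is_leader => Route::Local,
        _ => Route::ProxyToLeader,
    }
}

fn main() {
    assert_eq!(route_request("GET", false), Route::Local);
    assert_eq!(route_request("POST", false), Route::ProxyToLeader);
    assert_eq!(route_request("POST", true), Route::Local);
    println!("routing table ok");
}
```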
### Service Discovery
Clients only need S3 credentials and the bucket/prefix. Zero node addresses in config.
**Write discovery** (via `leader.json`):
1. Read `{prefix}leader.json` to get leader address + `session_id`
2. Send writes to that address
3. On 3+ consecutive failures, poll `leader.json` for a new `session_id`
4. When `session_id` changes, switch to the new leader's address and resume
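The write-discovery loop above can be sketched as a small state machine. The types are illustrative; the 3-failure threshold comes from the steps above:

```rust
// Sketch of the writer's failover loop: after repeated failures, poll
// leader.json and switch only when session_id changes.
const FAILURE_THRESHOLD: u32 = 3;

struct LeaderInfo {
    address: String,
    session_id: String,
}

struct Writer {
    current: LeaderInfo,
    consecutive_failures: u32,
}

impl Writer {
    /// Record a failed write; returns true when it's time to re-poll leader.json.
    fn on_failure(&mut self) -> bool {
        self.consecutive_failures += 1;
        self.consecutive_failures >= FAILURE_THRESHOLD
    }

    /// Adopt a freshly-read lease only if it represents a *new* session —
    /// the same address with the same session_id is not a failover.
    fn maybe_switch(&mut self, latest: LeaderInfo) -> bool {
        if latest.session_id != self.current.session_id {
            self.current = latest;
            self.consecutive_failures = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut w = Writer {
        current: LeaderInfo { address: "http://node1:9001".into(), session_id: "sess-1".into() },
        consecutive_failures: 0,
    };
    w.on_failure();
    w.on_failure();
    if w.on_failure() {
        // Third failure: poll leader.json; node2 has claimed with a new session.
        let latest = LeaderInfo { address: "http://node2:9002".into(), session_id: "sess-2".into() };
        assert!(w.maybe_switch(latest));
    }
    println!("now writing to {}", w.current.address);
}
```

Keying the switch on `session_id` rather than address is what makes same-node recovery (below) work: a reclaimed lease on the same instance still produces a new session.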
**Read discovery** (via `nodes/`):
1. List `{prefix}nodes/` to find all registered replicas
2. Filter: `last_seen + ttl > now` and `leader_session_id` matches current leader's `session_id`
3. Connect to any valid node (prefer followers to keep load off leader, fall back to leader)
4. If the connected node disappears (connection errors), rediscover from `nodes/`
## Failover Scenarios
### Graceful (deploy, SIGTERM)
1. Leader receives SIGTERM
2. Close HTTP connections (reject new writes)
3. Final `sync_wal()` — flush last WAL frames to S3
4. `abdicate()` — delete `leader.json`
5. Follower detects missing lease, claims immediately (new `session_id`)
6. Follower deletes its `nodes/{instance_id}.json` (now leader, discoverable via `leader.json`)
7. Follower runs `pull_incremental()` to catch up to final TXID
8. Opens SQLite, starts WAL streaming as new leader
9. Old leader's node registration (if any) goes stale — ignored by TTL
**Downtime: ~2-3s. Data loss: zero.**
### Crash (kill -9, machine death)
1. Leader stops renewing lease
2. Follower reads expired lease for `REQUIRED_EXPIRED_READS` consecutive checks (currently 1)
3. Follower claims via CAS `If-Match: <expired_etag>`
4. Follower restores from S3 (snapshot + incrementals)
5. Opens SQLite, starts WAL streaming
**Downtime: ~TTL + REQUIRED_EXPIRED_READS × RENEW_INTERVAL — with the 5s TTL and 2s renew interval above, ≈7s plus restore time (~7-10s). Data loss: up to one sync interval (1s of writes).**
### Same-node recovery
The crashed node restarts, reads `leader.json`, sees a different `session_id` → becomes follower, registers in `nodes/`. If no other node claimed, it claims with a new `session_id` → writers detect the new session and reconnect. Readers connected to the crashed node rediscover via `nodes/` listing.
## Experiment Results
Tested locally with 2 nodes, 100ms write interval:
| Metric | Result |
|---|---|
| Graceful failover time | ~2s |
| Data loss (graceful) | 0-3 rows (sub-sync-interval) |
| Writer reconnection | Automatic via S3 discovery |
| Follower proxy | Transparent to clients |
## Key Design Decisions
1. **S3 as the only coordination bus** — no Raft, no gossip, no external service. S3 is already a dependency for WAL storage, so leader election through it adds zero new infrastructure.
2. **No manifest for incremental discovery** — LTX filenames encode TXID ranges as zero-padded hex. S3 lexicographic listing = TXID order. `start_after` skips directly to the right position.
3. **`session_id` over `instance_id`** for failover detection — the same node can crash and reclaim. What matters is that a new leadership session exists, not which instance holds it.
4. **Read/write split over full proxy** — follower serves reads locally (~1s stale), only forwards writes. Clients get read scaling without knowing which node is the leader.
5. **Node registration with `leader_session_id`** — replicas are only valid for the current leadership session. New election invalidates all registrations, forcing re-registration. Clients can validate a replica by checking its `leader_session_id` against `leader.json`.
6. **`FuturesUnordered` over batch `join_all`** for parallel downloads — queue+worker model keeps all workers saturated. One slow S3 response doesn't stall the other 7 workers.
7. **Checksum chain for integrity** — every LTX file chains to the previous via `pre_apply`/`post_apply` checksums. Stale files from old leadership lineages are detected and skipped automatically.
## File Map
```
walrust/crates/walrust-core/
├── src/
│ ├── leader.rs # S3 leader election (LeaseData, LeaderElection, discover_leader)
│ ├── replicator.rs # High-level replication manager (wraps sync loop)
│ ├── sync.rs # Core: sync_wal, pull_incremental, restore, download_parallel
│ ├── storage.rs # StorageBackend trait + S3Backend (list_objects_after)
│ ├── s3.rs # Raw S3 operations (upload, download, list, list_after)
│ ├── ltx.rs # LTX encode/decode with checksum chaining
│ ├── wal.rs # SQLite WAL parsing
│ └── bin/
│ ├── ha_experiment.rs # 2-node HA experiment (leader election + proxy + hooks)
│ └── ha_writer.rs # S3-discovery writer (discovers leader via leader.json)
```