gruezi 0.1.1

Service Discovery & Distributed Key-Value Store
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
# gruezi

Service Discovery & Distributed Key-Value Store

## Roadmap

### HA

- [x] HA mode over unicast UDP at L4
- [x] IPv6-first API listener with IPv4 fallback
- [x] CLI for peer management and status
- [x] Live HA status watch mode and packet troubleshooting workflow
- [ ] DNS-based service discovery
- [x] HA packet format and authentication
- [x] HA state machine (`INIT`, `BACKUP`, `MASTER`)
- [x] HA management API on `9376/tcp`
- [x] HA transition hooks (`on_promote`, `on_demote`, `on_backup`)
- [x] HA fault hook (`on_fault`) for address-action and runtime failure paths
- [x] Graceful VIP cleanup on shutdown (`SIGINT`, `SIGTERM`)
- [x] Ansible-based HA lab deployment workflow
- [ ] Split-brain prevention and conservative failover rules
- [ ] Performance tuning for heartbeat, timers, and failover latency

### KV

- [ ] KV mode using Raft consensus
- [ ] `raft-engine` for the dedicated Raft log
- [ ] RocksDB for applied key-value state
- [ ] Snapshot creation and install flow
- [ ] Aggressive Raft log truncation after snapshot
- [ ] Snapshot lifecycle management
- [ ] Membership change and bootstrap rules
- [ ] Witness/arbiter support
- [ ] Discovery bootstrap by URL

### Operations

- [ ] Backpressure before disk exhaustion
- [ ] Quotas and reserved free space
- [ ] Clear write-stall behavior under pressure
- [ ] Security model (mTLS and auth)
- [ ] Observability (metrics and tracing)

## DRAFT: Configuration Model

Configuration should use YAML.

Default config lookup order:

* `--config /path/to/gruezi.yaml`
* `GRUEZI_CONFIG=/path/to/gruezi.yaml`
* `/etc/gruezi/gruezi.yaml`

Example configs for local testing live in:

* `examples/ha.yaml`
* `examples/kv.yaml`
* `ansible-playbook -i ansible/inventory/lab.yml ansible/deploy-ha-lab.yml` after creating local files from `ansible/inventory/lab.yml.example` and `ansible/group_vars/gruezi_ha_lab.yml.example`
* `ansible/README.md` for the HA lab deployment workflow
* `contrib/README.md` for package-oriented `.deb` and `.rpm` builds
* `CONTRIBUTING.md` for the development and pull request workflow

Draft default ports:

* `9375/udp` for `gruezi-ha`
* `9376/tcp` for `gruezi-api`
* `9377/tcp` for `gruezi-peer`

For v1, `gruezi` should expose a single top-level selector:

```yaml
mode: ha
```

or:

```yaml
mode: kv
```

Current meaning:

* `mode: ha`: keepalived-like high availability for a minimum 2-node deployment
* `mode: kv`: etcd-like distributed key-value store with quorum semantics

Deployment guidance:

* `mode: ha` for 2-node deployments
* `mode: kv` for 3+ quorum participants

In other words, when only 2 nodes are available, HA is the viable mode. Once a cluster has enough participants for Raft, consensus should become the source of truth for leadership and failover decisions.

The `mode` field is the user-facing operational choice. The underlying protocol or algorithm used to implement that mode is an internal detail.

Internally, the configuration can still be normalized into dedicated sections, but the user-facing interface should remain simple:

```yaml
mode: ha

node:
  id: node-a

ha:
  bind: 0.0.0.0:9375
  interface: eth0
  group_id: cluster-ha
  addresses:
    - 10.0.0.10/24
  peer: 10.0.0.2:7000
  protocol_version: 1
  priority: 100
  preempt: true
  advert_interval_ms: 1000
  dead_factor: 3
  hold_down_ms: 3000
  jitter_ms: 100
  auth:
    mode: none
```

```yaml
mode: kv

kv:
  role: voter
  listen_client: 0.0.0.0:9376
  listen_peer: 0.0.0.0:9377
  data_dir: /var/lib/gruezi
  initial_cluster:
    - node-a=http://10.0.0.1:2380
    - node-b=http://10.0.0.2:2380
    - witness=http://10.0.0.3:2380
```

This keeps the external configuration explicit and simple while leaving room for richer internal validation and future expansion.

## DRAFT: Protocol Direction

### HA mode

`mode: ha` should use a high-availability protocol over unicast UDP at L4.

Current implementation overview:

```mermaid
flowchart TD
    A["Node A<br/>gruezi start --config ..."] --> B["Bind HA socket<br/>9375/udp"]
    C["Node B<br/>gruezi start --config ..."] --> D["Bind HA socket<br/>9375/udp"]

    B --> E["Periodic HA advertisement<br/>protocol_version, state, priority,<br/>dead_factor, advert_interval, sequence,<br/>node_id, group_id, auth_tag"]
    D --> F["Periodic HA advertisement<br/>same packet shape"]

    E --> G["Peer receives UDP packet"]
    F --> H["Peer receives UDP packet"]

    G --> I["Validate packet<br/>version, group_id, auth_tag,<br/>not looped local node"]
    H --> J["Validate packet<br/>version, group_id, auth_tag,<br/>not looped local node"]

    I --> K["Update peer observation<br/>peer node_id, state, priority,<br/>last seen timestamp"]
    J --> L["Update peer observation<br/>peer node_id, state, priority,<br/>last seen timestamp"]

    K --> M["Recompute local state<br/>INIT | BACKUP | MASTER"]
    L --> N["Recompute local state<br/>INIT | BACKUP | MASTER"]

    M --> O{"State changed?"}
    N --> P{"State changed?"}

    O -- "to MASTER" --> Q["Add VIP addresses to interface<br/>run promote hook"]
    O -- "to BACKUP" --> R["Remove VIP addresses from interface<br/>run backup/demote hook"]
    P -- "to MASTER" --> S["Add VIP addresses to interface<br/>run promote hook"]
    P -- "to BACKUP" --> T["Remove VIP addresses from interface<br/>run backup/demote hook"]

    M --> U["Publish HA status snapshot<br/>9376/tcp API"]
    N --> V["Publish HA status snapshot<br/>9376/tcp API"]
```

In practice, each node continuously:

* sends HA advertisements to exactly one configured peer over `9375/udp`
* tracks the peer's last observed state, priority, and liveness deadline
* chooses `MASTER` or `BACKUP` based on peer health, priority, node ID tiebreak, and `preempt`
* adds or removes the configured VIP addresses on state transition
* exposes the current snapshot through the management API on `9376/tcp`

Current implementation walkthrough:

1. Startup:
   the node loads the HA runtime config, binds the UDP socket on `ha.bind`, parses the single configured peer, initializes the local state as `INIT`, and publishes an initial status snapshot.
2. Advertisement loop:
   each iteration recomputes local state, waits until the next advertisement deadline, then either sends one HA packet to the configured peer or handles one received packet from that peer.
3. Packet validation:
   received packets are accepted only when the magic bytes, protocol version, `group_id`, and auth tag match, and when the packet is not looped back from the local node ID.
4. Peer observation:
   after a valid packet, the node stores the peer node ID, peer state, peer priority, and the timestamp of when that packet was observed.
5. State choice:
   if the peer is considered alive, the node compares peer state, peer priority, local priority, local node ID, and `preempt` to decide between `MASTER` and `BACKUP`.
   if the peer is not alive, the node promotes itself after the startup follower deadline or keeps `MASTER` if it already held it.
6. VIP handling:
   transitions to `MASTER` add the configured VIP addresses to the interface and run `on_promote`.
   transitions to `BACKUP` remove the configured VIP addresses and run either `on_backup` or `on_demote`.
7. Fault handling:
   address add/remove failures trigger `on_fault`.
   fatal runtime send/receive failures also trigger shutdown cleanup and `on_fault`.
8. Shutdown:
   on graceful shutdown, including `SIGINT` and `SIGTERM`, the node removes its configured VIP addresses before the runtime exits and publishes a final `INIT` snapshot.

The goal is to preserve the operational model of VRRP/CARP best practices while avoiding a dependency on L2 multicast, gratuitous ARP, or other mechanisms commonly blocked by cloud providers.

This means:

* leader election and liveness detection happen over UDP
* the state machine should remain close to active/backup failover behavior
* priority, advertisement interval, preemption, and authentication should be first-class concepts

Operational scope for HA mode:

* `mode: ha` is an internal infrastructure component, similar in intent to keepalived
* HA advertisements on `9375/udp` are expected to stay on a trusted private network
* the HA peer channel should not be exposed to the public Internet
* firewall rules should restrict HA traffic to the expected peer nodes

This is intentionally **VRRP/CARP-like**, not wire-compatible VRRP or CARP. The project should not claim protocol compatibility unless it implements the actual protocol semantics and packet format.

The HA advertisement format should be versioned and minimal. At a minimum, each packet should carry:

* protocol version
* node ID
* group or instance identifier
* current state
* priority
* advertisement interval
* sequence number
* authentication tag

Performance and reliability should be first-class requirements for HA mode:

* low-overhead UDP heartbeats
* deterministic state transitions
* conservative failover under packet loss or partitions
* explicit authentication on HA advertisements
* predictable recovery after peer restart or transient network loss
* first-class IPv6 support with IPv4 fallback when dual-stack binding is unavailable

HA observability should also be a first-class design goal:

* every promotion, demotion, backup transition, and VIP move should be explainable after the fact
* operators should be able to tell whether the cause was peer timeout, priority/preempt logic, node ID tiebreak, shutdown cleanup, or an explicit fault path
* `gruezi status`, logs, hooks, and future metrics should make the decision path visible instead of only showing the final state
* debugging HA should answer "why did this node take or drop the VIP?" without requiring packet capture as the primary source of truth

### KV mode

`mode: kv` should use Raft for consensus.

The KV subsystem is intended to provide etcd-like semantics:

* quorum-based writes
* leader election through Raft
* replicated log and durable state
* membership-aware cluster status

Operationally, `mode: kv` requires at least 3 quorum participants, or 2 nodes plus a witness/arbiter.

### API and management port

`9376/tcp` should be the common API and management port.

That means:

* in `mode: kv`, it is the client-facing API port
* in `mode: ha`, it is the live management/status port
* CLI commands such as `gruezi status` already target this API instead of talking directly to the HA or Raft peer ports

The port split should remain:

* `9375/udp`: HA peer advertisements
* `9376/tcp`: API, management, and client access
* `9377/tcp`: KV peer and Raft traffic

### Separation of concerns

HA and KV solve different problems and should remain conceptually separate:

* HA decides which node should be active in a 2-node deployment
* KV uses Raft to decide leadership and maintain authoritative replicated state in a 3+ participant deployment

For v1, the user-facing configuration should stay explicit and simple with `mode: ha` or `mode: kv`. For 2 nodes, use HA. For 3 or more quorum participants, KV is the preferred mode because Raft already provides leadership and failover behavior through consensus.

## DRAFT: KV Validation Strategy

`mode: kv` should be validated with [Maelstrom](https://github.com/jepsen-io/maelstrom), the Jepsen-based distributed systems workbench used by the Fly.io distributed systems challenges.

This is useful because Maelstrom provides:

* a simulated network with latency, loss, and partitions
* workload-specific correctness checks
* visualization and history analysis for distributed failures

The Fly.io challenges should be treated as a staged validation path for KV work, not as a literal promise to implement every challenge unchanged.

Initial KV validation milestones:

* basic RPC/message handling
* node identity and request correlation
* inter-node message propagation
* replicated log behavior
* Raft leader election and quorum behavior

As `mode: kv` evolves, tests should move from local unit coverage to Maelstrom-driven fault-injection and correctness checks.

## DRAFT: Storage Direction

For `mode: kv`, storage should be split by workload:

* `raft-engine` as the dedicated Raft log engine for the consensus journal
* RocksDB for the applied key-value state
* snapshots managed separately from the live Raft log and KV state

The Raft log and the applied KV state are different things:

* the Raft log stores ordered replicated commands
* the KV state stores the result after committed commands are applied

This separation should make it easier to optimize for performance and resilience under disk pressure.

The intended operational goals are:

* fast sequential Raft appends and replay
* efficient log truncation after snapshotting
* durable and performant KV reads/writes through RocksDB
* better disk-pressure handling than a single shared backend

For v1, `gruezi` should use a single Raft group. `raft-engine` is still the preferred direction for the Raft log even though it is designed to support Multi-Raft, because the log engine characteristics are a better fit for consensus journals than a general-purpose KV backend.

To support that, the system should explicitly implement:

* aggressive Raft log truncation after snapshot
* snapshot lifecycle management
* backpressure before disk exhaustion
* quotas and reserved free space
* clear write-stall behavior under pressure

## DRAFT: Snapshot Model

Snapshots are required to keep the Raft log from growing without bound.

The expected model is:

* committed log entries are applied to RocksDB
* a snapshot captures the applied KV state at a specific Raft index and term
* once the snapshot is durable, older Raft log segments can be truncated

Each snapshot should include:

* last included Raft index
* last included Raft term
* cluster membership metadata
* snapshot format version
* checksum

Snapshots should be used for:

* faster node restart and recovery
* catching up lagging or newly joined followers
* bounding disk usage for the Raft log

Snapshot lifecycle should define:

* when snapshots are triggered
* how snapshots are transferred to other nodes
* when old snapshots can be deleted
* when Raft log truncation is allowed after snapshot persistence

## DRAFT: On-Disk Layout

`mode: kv` should separate persistent data by purpose:

* `raft/` for the Raft log engine
* `kv/` for RocksDB applied state
* `snapshots/` for snapshot files and metadata

This layout should allow independent quotas, cleanup policies, and recovery behavior.

## DRAFT: Membership And Bootstrap

Cluster lifecycle needs explicit rules.

Items that must be defined:

* first cluster bootstrap
* adding a new node
* replacing a failed node
* removing a node safely
* restart behavior after crash or partial disk loss
* witness or arbiter behavior, if supported

For v1, membership changes should be conservative and explicit. Unsafe ad-hoc joins should be avoided.

## DRAFT: Failure And Disk-Pressure Behavior

Disk pressure should be treated as a first-class failure mode.

Behavior to define:

* reserved free-space threshold
* warning threshold and critical threshold
* when writes are throttled
* when writes are rejected
* when compaction, truncation, or snapshot cleanup is triggered
* when the node reports degraded or read-only state

The goal is to fail predictably before the disk is fully exhausted.

## DRAFT: HA Failure Semantics

For `mode: ha`, split-brain prevention must be documented explicitly.

Items to define:

* promotion rules
* preemption rules
* peer loss timeout
* behavior under network partition
* fencing or external safety checks, if required

For a 2-node deployment, HA should prefer deterministic and conservative failover behavior over aggressive promotion.

## DRAFT: HA Implementation Priorities

The HA path should be built first as the smallest end-to-end feature.

Suggested order:

* YAML config schema and validation for `mode: ha`
* node identity, peer identity, and interface/address configuration
* UDP packet format with versioning and authentication fields
* heartbeat sender/receiver loop with bounded timers
* HA state machine with `INIT`, `BACKUP`, and `MASTER`
* promotion, preemption, and failover rules
* CLI status output and metrics

HA v1 should optimize for:

* strong reliability before fast failover
* bounded CPU and memory overhead
* clear behavior during packet loss, delay, or temporary partitions
* simple and observable state transitions for debugging

Recommended HA timer defaults:

* `advert_interval_ms: 1000`
* `dead_factor: 3`
* `hold_down_ms: 3000`
* `jitter_ms: 100`

Recommended HA auth shape:

```yaml
ha:
  group_id: cluster-ha
  auth:
    mode: shared_key
    key: change-me
```

Meaning of the HA fields:

* `group_id`: logical HA domain. Only nodes in the same group should accept each other's advertisements.
* `auth.mode: none`: disable packet authentication. This is only suitable for local development or isolated lab testing.
* `auth.mode: shared_key`: every HA packet carries an authentication tag derived from a shared secret and the packet contents.
* `auth.key`: the shared secret used by all nodes in the same HA group. It should be treated like any other cluster secret and distributed securely.

`shared_key` in HA mode is not transport encryption. It exists to answer a narrower question:

* is this UDP advertisement from a node that knows the group secret?
* was the packet likely modified in transit?

This is a better fit for HA v1 because the HA control plane is unicast UDP. `mTLS` is a strong option for TCP-based APIs and Raft peer links, but it does not apply directly to raw UDP advertisements. The comparable UDP-level option would be `DTLS` or a more advanced per-packet cryptographic scheme, which adds more complexity than is needed for the initial HA protocol.

Recommended direction:

* HA over UDP: start with explicit packet authentication using `shared_key`
* API and KV peer traffic over TCP: use TLS/mTLS
* future HA hardening: consider DTLS or stronger keyed message authentication if the simpler HA packet auth is not sufficient

Threat-model note:

* `shared_key` is a practical first step for a private HA network, not a full Internet-facing security model
* it should be combined with network isolation, peer allow-listing, and standard infrastructure firewalling
* if HA traffic ever needs to cross less-trusted networks, the design should be revisited with stronger transport or packet-level protections

Operational guidance:

* use `auth.mode: none` only for tests and local experiments
* use a different `auth.key` per HA group/environment
* rotate the key carefully, because all HA peers in the same group must agree on it
* do not treat `shared_key` as a substitute for TLS on the management API

### Current HA API

The current HA management API is read-only and listens on `9376/tcp`.

Available endpoints:

* `GET /status`
* `GET /ha/status`
* `GET /healthz`

The current `gruezi status` command queries this API.

For a live view during failover testing:

```bash
gruezi status --watch --interval-ms 1000 --node 192.0.2.5:9376
```

### HA Packet Troubleshooting

For HA packet troubleshooting, use `gruezi status --watch` and `tcpdump` together:

```bash
gruezi status --watch --interval-ms 1000 --node 192.0.2.10:9376
```

```bash
sudo tcpdump -ni eth0 'udp port 9375 and host 192.0.2.11' -tttt -vvv -X -s0
```

Recommended `tcpdump` flags:

* `-n`: disable DNS lookups
* `-tttt`: print readable timestamps for correlation with `status --watch`
* `-vvv`: increase protocol detail
* `-X`: show packet payload in hex and ASCII
* `-s0`: capture the full packet instead of truncating it

This lets you correlate:

* `sent` and `recv` counters from `gruezi status --watch`
* peer liveness and `last_peer_seen`
* raw UDP payload bytes on `9375/udp`

### Current HA Hooks

The current HA implementation supports transition hooks in YAML:

```yaml
ha:
  hooks:
    on_promote: /etc/gruezi/hooks/promote.sh
    on_demote: /etc/gruezi/hooks/demote.sh
    on_backup: /etc/gruezi/hooks/backup.sh
    on_fault: /etc/gruezi/hooks/fault.sh
    timeout_ms: 5000
```

Implemented today:

* `on_promote`
* `on_demote`
* `on_backup`
* `on_fault` for explicit HA address-action and runtime failure paths

Hook scripts currently receive runtime context through environment variables:

* `GRUEZI_EVENT`
* `GRUEZI_NODE_ID`
* `GRUEZI_GROUP_ID`
* `GRUEZI_INTERFACE`
* `GRUEZI_STATE`
* `GRUEZI_PREVIOUS_STATE`
* `GRUEZI_PEER_ID`
* `GRUEZI_PEER_STATE`

## DRAFT: API And Service Discovery

The external API surface for `mode: kv` still needs to be defined.

Questions to settle:

* etcd-compatible API or custom API
* gRPC, HTTP, or both
* key-space layout and prefix conventions
* watch/stream semantics
* lease/session behavior

API listeners must not be IPv4-only by default. The preferred behavior is:

* explicit `listen` IP binds exactly that IP family
* if no listen IP is provided, try dual-stack IPv6 first
* if dual-stack IPv6 is unavailable, fall back to IPv4

If an HTTP API is added, `axum` is a reasonable choice on top of a pre-bound `TcpListener`.

Service discovery also needs clear rules:

* how records are stored in KV
* how DNS responses are generated
* TTL and expiration behavior
* health integration and stale record cleanup

## DRAFT: Security And Observability

Both modes should plan for production-grade safety and debugging.

Minimum areas to define:

* mTLS between nodes
* client authentication and authorization
* certificate rotation
* metrics for leadership, replication lag, snapshot size, disk usage, and write stalls
* tracing for elections, failover, and storage operations