gruezi
Service Discovery & Distributed Key-Value Store
Roadmap
HA
- HA mode over unicast UDP at L4
- IPv6-first API listener with IPv4 fallback
- CLI for peer management and status
- Live HA status watch mode and packet troubleshooting workflow
- DNS-based service discovery
- HA packet format and authentication
- HA state machine (INIT, BACKUP, MASTER)
- HA management API on 9376/tcp
- HA transition hooks (on_promote, on_demote, on_backup)
- HA fault hook (on_fault) for address-action and runtime failure paths
- Graceful VIP cleanup on shutdown (SIGINT, SIGTERM)
- Ansible-based HA lab deployment workflow
- Split-brain prevention and conservative failover rules
- Performance tuning for heartbeat, timers, and failover latency
KV
- KV mode using Raft consensus
- raft-engine for the dedicated Raft log
- RocksDB for applied key-value state
- Snapshot creation and install flow
- Aggressive Raft log truncation after snapshot
- Snapshot lifecycle management
- Membership change and bootstrap rules
- Witness/arbiter support
- Discovery bootstrap by URL
Operations
- Backpressure before disk exhaustion
- Quotas and reserved free space
- Clear write-stall behavior under pressure
- Security model (mTLS and auth)
- Observability (metrics and tracing)
DRAFT: Configuration Model
Configuration should use YAML.
Default config lookup order:
- --config /path/to/gruezi.yaml
- GRUEZI_CONFIG=/path/to/gruezi.yaml
- /etc/gruezi/gruezi.yaml
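The lookup order above can be sketched as a small resolver. This is a minimal illustration in Python, assuming the list is in precedence order (flag first, then environment variable, then the system default); the function name resolve_config_path is hypothetical and not part of gruezi.

```python
import os
from typing import Mapping, Optional

DEFAULT_CONFIG_PATH = "/etc/gruezi/gruezi.yaml"


def resolve_config_path(
    cli_path: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the config path using flag > environment > default precedence."""
    if cli_path:
        # An explicit --config value always wins.
        return cli_path
    env_path = env.get("GRUEZI_CONFIG")
    if env_path:
        return env_path
    return DEFAULT_CONFIG_PATH
```

Passing the environment as a parameter keeps the resolver easy to test without touching the real process environment.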
Example configs for local testing live in:
- examples/ha.yaml
- examples/kv.yaml
- ansible-playbook -i ansible/inventory/lab.yml ansible/deploy-ha-lab.yml after creating local files from ansible/inventory/lab.yml.example and ansible/group_vars/gruezi_ha_lab.yml.example
- ansible/README.md for the HA lab deployment workflow
- contrib/README.md for package-oriented .deb and .rpm builds
- CONTRIBUTING.md for the development and pull request workflow
Draft default ports:
- 9375/udp for gruezi-ha
- 9376/tcp for gruezi-api
- 9377/tcp for gruezi-peer
For v1, gruezi should expose a single top-level selector:
mode: ha
or:
mode: kv
Current meaning:
- mode: ha: keepalived-like high availability for a minimum 2-node deployment
- mode: kv: etcd-like distributed key-value store with quorum semantics
Deployment guidance:
- mode: ha for 2-node deployments
- mode: kv for 3+ quorum participants
In other words, when only 2 nodes are available, HA is the viable mode. Once a cluster has enough participants for Raft, consensus should become the source of truth for leadership and failover decisions.
The mode field is the user-facing operational choice. The underlying protocol or algorithm used to implement that mode is an internal detail.
Internally, the configuration can still be normalized into dedicated sections, but the user-facing interface should remain simple:
    mode: ha
    node:
      id: node-a
    ha:
      bind: 0.0.0.0:9375
      interface: eth0
      group_id: cluster-ha
      addresses:
        - 10.0.0.10/24
      peer: 10.0.0.2:7000
      protocol_version: 1
      priority: 100
      preempt: true
      advert_interval_ms: 1000
      dead_factor: 3
      hold_down_ms: 3000
      jitter_ms: 100
      auth:
        mode: none

    mode: kv
    kv:
      role: voter
      listen_client: 0.0.0.0:9376
      listen_peer: 0.0.0.0:9377
      data_dir: /var/lib/gruezi
      initial_cluster:
        - node-a=http://10.0.0.1:2380
        - node-b=http://10.0.0.2:2380
        - witness=http://10.0.0.3:2380
This keeps the external configuration explicit and simple while leaving room for richer internal validation and future expansion.
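As an illustration of the "simple selector, richer internal validation" idea, the check below works on an already-parsed config mapping. It is a sketch under assumptions: the rule that the section matching the selected mode must be present is not stated in this document, and validate_mode is a hypothetical name.

```python
from typing import Any, Mapping

VALID_MODES = ("ha", "kv")


def validate_mode(config: Mapping[str, Any]) -> str:
    """Return the selected mode, or raise ValueError on an invalid config."""
    mode = config.get("mode")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # Assumption: the section named after the mode must exist and be a mapping.
    section = config.get(mode)
    if not isinstance(section, Mapping):
        raise ValueError(f"mode: {mode} requires a '{mode}:' section")
    return mode
```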
DRAFT: Protocol Direction
HA mode
mode: ha should use a high-availability protocol over unicast UDP at L4.
Current implementation overview:
flowchart TD
A["Node A<br/>gruezi start --config ..."] --> B["Bind HA socket<br/>9375/udp"]
C["Node B<br/>gruezi start --config ..."] --> D["Bind HA socket<br/>9375/udp"]
B --> E["Periodic HA advertisement<br/>protocol_version, state, priority,<br/>dead_factor, advert_interval, sequence,<br/>node_id, group_id, auth_tag"]
D --> F["Periodic HA advertisement<br/>same packet shape"]
E --> G["Peer receives UDP packet"]
F --> H["Peer receives UDP packet"]
G --> I["Validate packet<br/>version, group_id, auth_tag,<br/>not looped local node"]
H --> J["Validate packet<br/>version, group_id, auth_tag,<br/>not looped local node"]
I --> K["Update peer observation<br/>peer node_id, state, priority,<br/>last seen timestamp"]
J --> L["Update peer observation<br/>peer node_id, state, priority,<br/>last seen timestamp"]
K --> M["Recompute local state<br/>INIT | BACKUP | MASTER"]
L --> N["Recompute local state<br/>INIT | BACKUP | MASTER"]
M --> O{"State changed?"}
N --> P{"State changed?"}
O -- "to MASTER" --> Q["Add VIP addresses to interface<br/>run promote hook"]
O -- "to BACKUP" --> R["Remove VIP addresses from interface<br/>run backup/demote hook"]
P -- "to MASTER" --> S["Add VIP addresses to interface<br/>run promote hook"]
P -- "to BACKUP" --> T["Remove VIP addresses from interface<br/>run backup/demote hook"]
M --> U["Publish HA status snapshot<br/>9376/tcp API"]
N --> V["Publish HA status snapshot<br/>9376/tcp API"]
In practice, each node continuously:
- sends HA advertisements to exactly one configured peer over 9375/udp
- tracks the peer's last observed state, priority, and liveness deadline
- chooses MASTER or BACKUP based on peer health, priority, node ID tiebreak, and preempt
- adds or removes the configured VIP addresses on state transition
- exposes the current snapshot through the management API on 9376/tcp
Current implementation walkthrough:
- Startup: the node loads the HA runtime config, binds the UDP socket on ha.bind, parses the single configured peer, initializes the local state as INIT, and publishes an initial status snapshot.
- Advertisement loop: each iteration recomputes local state, waits until the next advertisement deadline, then either sends one HA packet to the configured peer or handles one received packet from that peer.
- Packet validation: received packets are accepted only when the magic bytes, protocol version, group_id, and auth tag match, and when the packet is not looped back from the local node ID.
- Peer observation: after a valid packet, the node stores the peer node ID, peer state, peer priority, and the timestamp of when that packet was observed.
- State choice: if the peer is considered alive, the node compares peer state, peer priority, local priority, local node ID, and preempt to decide between MASTER and BACKUP. If the peer is not alive, the node promotes itself after the startup follower deadline or keeps MASTER if it already held it.
- VIP handling: transitions to MASTER add the configured VIP addresses to the interface and run on_promote. Transitions to BACKUP remove the configured VIP addresses and run either on_backup or on_demote.
- Fault handling: address add/remove failures trigger on_fault. Fatal runtime send/receive failures also trigger shutdown cleanup and on_fault.
- Shutdown: on graceful shutdown, including SIGINT and SIGTERM, the node removes its configured VIP addresses before the runtime exits and publishes a final INIT snapshot.
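The "State choice" step can be sketched as a pure function. This is not the gruezi implementation: the exact comparison rules are not spelled out in this document, so the sketch assumes that higher priority wins, that the lexicographically smaller node ID wins a priority tie, and that a node which is already MASTER is only displaced by a peer that also claims MASTER. PeerView and choose_state are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeerView:
    node_id: str
    state: str  # "INIT", "BACKUP", or "MASTER"
    priority: int


def choose_state(
    current: str,
    local_id: str,
    local_priority: int,
    preempt: bool,
    peer: Optional[PeerView],  # None means the peer is not considered alive
    startup_deadline_passed: bool,
) -> str:
    if peer is None:
        # No live peer: keep MASTER, or promote once the startup deadline passed.
        if current == "MASTER" or startup_deadline_passed:
            return "MASTER"
        return current
    # Assumed ranking: priority first, then smaller node ID as the tiebreak.
    outranks = local_priority > peer.priority or (
        local_priority == peer.priority and local_id < peer.node_id
    )
    if peer.state == "MASTER":
        if current == "MASTER":
            # Both nodes claim MASTER: resolve the conflict by rank.
            return "MASTER" if outranks else "BACKUP"
        # Only take over from a healthy MASTER when preemption is enabled.
        return "MASTER" if preempt and outranks else "BACKUP"
    if current == "MASTER":
        return "MASTER"
    return "MASTER" if outranks else "BACKUP"
```

Keeping the decision free of I/O makes every transition reproducible from the inputs alone, which fits the goal of explaining after the fact why a node took or dropped the VIP.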
The goal is to preserve the operational model of VRRP/CARP best practices while avoiding a dependency on L2 multicast, gratuitous ARP, or other mechanisms commonly blocked by cloud providers.
This means:
- leader election and liveness detection happen over UDP
- the state machine should remain close to active/backup failover behavior
- priority, advertisement interval, preemption, and authentication should be first-class concepts
Operational scope for HA mode:
- mode: ha is an internal infrastructure component, similar in intent to keepalived
- HA advertisements on 9375/udp are expected to stay on a trusted private network
- the HA peer channel should not be exposed to the public Internet
- firewall rules should restrict HA traffic to the expected peer nodes
This is intentionally VRRP/CARP-like, not wire-compatible VRRP or CARP. The project should not claim protocol compatibility unless it implements the actual protocol semantics and packet format.
The HA advertisement format should be versioned and minimal. At a minimum, each packet should carry:
- protocol version
- node ID
- group or instance identifier
- current state
- priority
- advertisement interval
- sequence number
- authentication tag
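To make the field list concrete, here is a sketch of one possible binary layout using Python's struct module. The layout, the field widths, the MAGIC value, and the names Advertisement, encode, and decode are all assumptions for illustration; the actual gruezi wire format is not defined in this document. The authentication tag is left out here and would be computed over these bytes and appended.

```python
import struct
from dataclasses import dataclass

MAGIC = b"GRZI"  # hypothetical magic bytes
STATES = ("INIT", "BACKUP", "MASTER")
# magic, version, state, priority, dead_factor, advert_interval_ms,
# sequence, node_id length, group_id length (network byte order, no padding)
HEADER = struct.Struct("!4sBBBBIQBB")


@dataclass(frozen=True)
class Advertisement:
    version: int
    state: str
    priority: int
    dead_factor: int
    advert_interval_ms: int
    sequence: int
    node_id: str
    group_id: str


def encode(ad: Advertisement) -> bytes:
    node = ad.node_id.encode("utf-8")
    group = ad.group_id.encode("utf-8")
    header = HEADER.pack(
        MAGIC, ad.version, STATES.index(ad.state), ad.priority, ad.dead_factor,
        ad.advert_interval_ms, ad.sequence, len(node), len(group),
    )
    return header + node + group


def decode(data: bytes) -> Advertisement:
    if len(data) < HEADER.size:
        raise ValueError("packet too short")
    (magic, version, state, priority, dead_factor,
     interval, sequence, node_len, group_len) = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    if state >= len(STATES):
        raise ValueError("unknown state")
    end = HEADER.size + node_len + group_len
    if len(data) != end:
        raise ValueError("length mismatch")
    node = data[HEADER.size:HEADER.size + node_len].decode("utf-8")
    group = data[HEADER.size + node_len:end].decode("utf-8")
    return Advertisement(version, STATES[state], priority, dead_factor,
                         interval, sequence, node, group)
```

A fixed header followed by length-prefixed identifiers keeps each advertisement small and lets the receiver reject malformed packets before any further processing.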
Performance and reliability should be first-class requirements for HA mode:
- low-overhead UDP heartbeats
- deterministic state transitions
- conservative failover under packet loss or partitions
- explicit authentication on HA advertisements
- predictable recovery after peer restart or transient network loss
- first-class IPv6 support with IPv4 fallback when dual-stack binding is unavailable
HA observability should also be a first-class design goal:
- every promotion, demotion, backup transition, and VIP move should be explainable after the fact
- operators should be able to tell whether the cause was peer timeout, priority/preempt logic, node ID tiebreak, shutdown cleanup, or an explicit fault path
- gruezi status, logs, hooks, and future metrics should make the decision path visible instead of only showing the final state
- debugging HA should answer "why did this node take or drop the VIP?" without requiring packet capture as the primary source of truth
KV mode
mode: kv should use Raft for consensus.
The KV subsystem is intended to provide etcd-like semantics:
- quorum-based writes
- leader election through Raft
- replicated log and durable state
- membership-aware cluster status
Operationally, mode: kv requires at least 3 quorum participants, or 2 nodes plus a witness/arbiter.
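The participant requirement follows from majority quorum arithmetic, which the short sketch below works through. The function names are illustrative only.

```python
def quorum_size(participants: int) -> int:
    """Smallest majority of a cluster with the given number of voters."""
    if participants < 1:
        raise ValueError("a cluster needs at least one participant")
    return participants // 2 + 1


def tolerated_failures(participants: int) -> int:
    """How many participants can be lost while a majority remains."""
    return participants - quorum_size(participants)
```

With 2 participants the quorum is 2, so losing either node stops writes. With 3 participants (or 2 nodes plus a witness) the quorum is still 2, so one failure is tolerated.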
API and management port
9376/tcp should be the common API and management port.
That means:
- in mode: kv, it is the client-facing API port
- in mode: ha, it is the live management/status port
- CLI commands such as gruezi status already target this API instead of talking directly to the HA or Raft peer ports
The port split should remain:
- 9375/udp: HA peer advertisements
- 9376/tcp: API, management, and client access
- 9377/tcp: KV peer and Raft traffic
Separation of concerns
HA and KV solve different problems and should remain conceptually separate:
- HA decides which node should be active in a 2-node deployment
- KV uses Raft to decide leadership and maintain authoritative replicated state in a 3+ participant deployment
For v1, the user-facing configuration should stay explicit and simple with mode: ha or mode: kv. For 2 nodes, use HA. For 3 or more quorum participants, KV is the preferred mode because Raft already provides leadership and failover behavior through consensus.
DRAFT: KV Validation Strategy
mode: kv should be validated with Maelstrom, the Jepsen-based distributed systems workbench used by the Fly.io distributed systems challenges.
This is useful because Maelstrom provides:
- a simulated network with latency, loss, and partitions
- workload-specific correctness checks
- visualization and history analysis for distributed failures
The Fly.io challenges should be treated as a staged validation path for KV work, not as a literal promise to implement every challenge unchanged.
Initial KV validation milestones:
- basic RPC/message handling
- node identity and request correlation
- inter-node message propagation
- replicated log behavior
- Raft leader election and quorum behavior
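The first two milestones (message handling, node identity, request correlation) map onto Maelstrom's JSON message protocol, where each message has src, dest, and a body with type, msg_id, and in_reply_to. The sketch below handles the init and echo message types in Python purely as an illustration of that protocol; it says nothing about how gruezi itself will be wired into Maelstrom, and the Node class is hypothetical.

```python
import json
from typing import Any, Dict, Optional, TextIO


class Node:
    def __init__(self) -> None:
        self.node_id: Optional[str] = None
        self.next_msg_id = 0

    def handle(self, msg: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Build the reply for one incoming message, or None to stay silent."""
        body = msg["body"]
        if body["type"] == "init":
            # Maelstrom assigns the node identity in the init message.
            self.node_id = body["node_id"]
            reply: Dict[str, Any] = {"type": "init_ok"}
        elif body["type"] == "echo":
            reply = {"type": "echo_ok", "echo": body["echo"]}
        else:
            return None
        self.next_msg_id += 1
        reply["msg_id"] = self.next_msg_id
        # in_reply_to lets the sender correlate the reply with its request.
        reply["in_reply_to"] = body["msg_id"]
        return {"src": self.node_id, "dest": msg["src"], "body": reply}


def serve(node: Node, stdin: TextIO, stdout: TextIO) -> None:
    """Read one JSON message per line and write replies, one per line."""
    for line in stdin:
        reply = node.handle(json.loads(line))
        if reply is not None:
            stdout.write(json.dumps(reply) + "\n")
            stdout.flush()
```

A Maelstrom run would start such a program as a subprocess and call serve(Node(), sys.stdin, sys.stdout).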
As mode: kv evolves, tests should move from local unit coverage to Maelstrom-driven fault-injection and correctness checks.
DRAFT: Storage Direction
For mode: kv, storage should be split by workload:
- raft-engine as the dedicated Raft log engine for the consensus journal
- RocksDB for the applied key-value state
- snapshots managed separately from the live Raft log and KV state
The Raft log and the applied KV state are different things:
- the Raft log stores ordered replicated commands
- the KV state stores the result after committed commands are applied
This separation should make it easier to optimize for performance and resilience under disk pressure.
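The difference between the two stores can be shown with in-memory stand-ins: a list for the Raft log and a dict for the applied state. This is a conceptual sketch, not the raft-engine or RocksDB API; the command shape and the apply_committed name are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# (operation, key, value); value is None for deletes.
Command = Tuple[str, str, Optional[str]]


def apply_committed(
    log: List[Command],
    state: Dict[str, Optional[str]],
    last_applied: int,
    commit_index: int,
) -> int:
    """Apply log entries in (last_applied, commit_index] and return the new last_applied.

    Indexes are 1-based, as in Raft, so entry i lives at log[i - 1].
    """
    for index in range(last_applied + 1, commit_index + 1):
        op, key, value = log[index - 1]
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation {op!r} at index {index}")
    return max(last_applied, commit_index)
```

The log keeps every command in order, while the state only holds the outcome, which is why the log can be truncated once a snapshot of the state is durable.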
The intended operational goals are:
- fast sequential Raft appends and replay
- efficient log truncation after snapshotting
- durable and performant KV reads/writes through RocksDB
- better disk-pressure handling than a single shared backend
For v1, gruezi should use a single Raft group. raft-engine is still the preferred direction for the Raft log even though it is designed to support Multi-Raft, because the log engine characteristics are a better fit for consensus journals than a general-purpose KV backend.
To support that, the system should explicitly implement:
- aggressive Raft log truncation after snapshot
- snapshot lifecycle management
- backpressure before disk exhaustion
- quotas and reserved free space
- clear write-stall behavior under pressure
DRAFT: Snapshot Model
Snapshots are required to keep the Raft log from growing without bound.
The expected model is:
- committed log entries are applied to RocksDB
- a snapshot captures the applied KV state at a specific Raft index and term
- once the snapshot is durable, older Raft log segments can be truncated
Each snapshot should include:
- last included Raft index
- last included Raft term
- cluster membership metadata
- snapshot format version
- checksum
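The metadata fields listed above could be modeled as follows. This is a sketch under assumptions: SHA-256 as the checksum, a list of node IDs as the membership metadata, and the names SnapshotMeta, build_meta, verify, and truncate_through are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SnapshotMeta:
    last_included_index: int
    last_included_term: int
    members: List[str]
    format_version: int
    checksum: str  # hex SHA-256 of the snapshot payload (assumed algorithm)


def build_meta(payload: bytes, index: int, term: int,
               members: List[str], format_version: int = 1) -> SnapshotMeta:
    digest = hashlib.sha256(payload).hexdigest()
    return SnapshotMeta(index, term, list(members), format_version, digest)


def verify(meta: SnapshotMeta, payload: bytes) -> bool:
    """Check the payload against the recorded checksum before installing it."""
    return hashlib.sha256(payload).hexdigest() == meta.checksum


def truncate_through(meta: SnapshotMeta, durable: bool) -> int:
    """Highest Raft log index that may be dropped; 0 means keep everything."""
    return meta.last_included_index if durable else 0
```

Tying truncation to a durability flag reflects the rule that older log segments are only removed once the snapshot has been persisted.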
Snapshots should be used for:
- faster node restart and recovery
- catching up lagging or newly joined followers
- bounding disk usage for the Raft log
Snapshot lifecycle should define:
- when snapshots are triggered
- how snapshots are transferred to other nodes
- when old snapshots can be deleted
- when Raft log truncation is allowed after snapshot persistence
DRAFT: On-Disk Layout
mode: kv should separate persistent data by purpose:
- raft/ for the Raft log engine
- kv/ for RocksDB applied state
- snapshots/ for snapshot files and metadata
This layout should allow independent quotas, cleanup policies, and recovery behavior.
DRAFT: Membership And Bootstrap
Cluster lifecycle needs explicit rules.
Items that must be defined:
- first cluster bootstrap
- adding a new node
- replacing a failed node
- removing a node safely
- restart behavior after crash or partial disk loss
- witness or arbiter behavior, if supported
For v1, membership changes should be conservative and explicit. Unsafe ad-hoc joins should be avoided.
DRAFT: Failure And Disk-Pressure Behavior
Disk pressure should be treated as a first-class failure mode.
Behavior to define:
- reserved free-space threshold
- warning threshold and critical threshold
- when writes are throttled
- when writes are rejected
- when compaction, truncation, or snapshot cleanup is triggered
- when the node reports degraded or read-only state
The goal is to fail predictably before the disk is fully exhausted.
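One way to express these thresholds is a pure classifier over free and total bytes. The threshold values, the state names, and the mapping from state to behavior below are placeholders chosen for illustration; none of them are decided in this document.

```python
from enum import Enum


class DiskState(Enum):
    OK = "ok"
    WARNING = "warning"      # e.g. start cleanup and emit alerts
    CRITICAL = "critical"    # e.g. throttle writes
    READ_ONLY = "read_only"  # e.g. reject writes


def classify_disk(
    free_bytes: int,
    total_bytes: int,
    warning_ratio: float = 0.20,    # hypothetical default
    critical_ratio: float = 0.10,   # hypothetical default
    reserved_bytes: int = 1 << 30,  # hypothetical 1 GiB reserve
) -> DiskState:
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if free_bytes <= reserved_bytes:
        # The reserve is never handed to writes, so recovery stays possible.
        return DiskState.READ_ONLY
    ratio = free_bytes / total_bytes
    if ratio <= critical_ratio:
        return DiskState.CRITICAL
    if ratio <= warning_ratio:
        return DiskState.WARNING
    return DiskState.OK
```

In a running node the inputs could come from shutil.disk_usage(data_dir), which reports total and free bytes for the filesystem holding the data directory.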
DRAFT: HA Failure Semantics
For mode: ha, split-brain prevention must be documented explicitly.
Items to define:
- promotion rules
- preemption rules
- peer loss timeout
- behavior under network partition
- fencing or external safety checks, if required
For a 2-node deployment, HA should prefer deterministic and conservative failover behavior over aggressive promotion.
DRAFT: HA Implementation Priorities
The HA path should be built first as the smallest end-to-end feature.
Suggested order:
- YAML config schema and validation for mode: ha
- node identity, peer identity, and interface/address configuration
- UDP packet format with versioning and authentication fields
- heartbeat sender/receiver loop with bounded timers
- HA state machine with INIT, BACKUP, and MASTER
- promotion, preemption, and failover rules
- CLI status output and metrics
HA v1 should optimize for:
- strong reliability before fast failover
- bounded CPU and memory overhead
- clear behavior during packet loss, delay, or temporary partitions
- simple and observable state transitions for debugging
Recommended HA timer defaults:
- advert_interval_ms: 1000
- dead_factor: 3
- hold_down_ms: 3000
- jitter_ms: 100
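How these defaults might interact can be sketched with a little arithmetic. The sketch assumes that a peer is declared dead after advert_interval_ms × dead_factor without a valid advertisement, and that jitter adds a random delay of up to jitter_ms to each send; both are assumptions, and hold_down_ms is not modeled because its exact semantics are not defined here.

```python
import random
from typing import Optional


def dead_interval_ms(advert_interval_ms: int = 1000, dead_factor: int = 3) -> int:
    """Assumed peer timeout: this many ms of silence marks the peer as dead."""
    return advert_interval_ms * dead_factor


def peer_is_alive(now_ms: int, last_seen_ms: int,
                  advert_interval_ms: int = 1000, dead_factor: int = 3) -> bool:
    return now_ms - last_seen_ms < dead_interval_ms(advert_interval_ms, dead_factor)


def next_advert_delay_ms(advert_interval_ms: int = 1000, jitter_ms: int = 100,
                         rng: Optional[random.Random] = None) -> int:
    """Interval plus a random jitter so two peers do not send in lockstep."""
    rng = rng or random.Random()
    return advert_interval_ms + rng.randint(0, jitter_ms)
```

Under these assumptions the defaults give a 3000 ms peer timeout, so up to two consecutive lost advertisements do not trigger a failover.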
Recommended HA auth shape:
    ha:
      group_id: cluster-ha
      auth:
        mode: shared_key
        key: change-me
Meaning of the HA fields:
- group_id: logical HA domain. Only nodes in the same group should accept each other's advertisements.
- auth.mode: none: disable packet authentication. This is only suitable for local development or isolated lab testing.
- auth.mode: shared_key: every HA packet carries an authentication tag derived from a shared secret and the packet contents.
- auth.key: the shared secret used by all nodes in the same HA group. It should be treated like any other cluster secret and distributed securely.
shared_key in HA mode is not transport encryption. It exists to answer a narrower question:
- is this UDP advertisement from a node that knows the group secret?
- was the packet likely modified in transit?
This is a better fit for HA v1 because the HA control plane is unicast UDP. mTLS is a strong option for TCP-based APIs and Raft peer links, but it does not apply directly to raw UDP advertisements. The comparable UDP-level option would be DTLS or a more advanced per-packet cryptographic scheme, which adds more complexity than is needed for the initial HA protocol.
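A tag "derived from a shared secret and the packet contents" can be illustrated with an HMAC. The document does not name the algorithm, so HMAC-SHA256 and the trailing-tag layout below are assumptions, and sign and verify are hypothetical names.

```python
import hashlib
import hmac
from typing import Optional

TAG_LEN = 32  # size of an HMAC-SHA256 digest


def sign(key: bytes, packet: bytes) -> bytes:
    """Append an authentication tag computed over the packet bytes."""
    return packet + hmac.new(key, packet, hashlib.sha256).digest()


def verify(key: bytes, data: bytes) -> Optional[bytes]:
    """Return the packet if the trailing tag is valid, otherwise None."""
    if len(data) < TAG_LEN:
        return None
    packet, tag = data[:-TAG_LEN], data[-TAG_LEN:]
    expected = hmac.new(key, packet, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    return packet if hmac.compare_digest(tag, expected) else None
```

A tag like this covers origin and integrity only. It does not hide the payload, and it does not stop replay of old packets by itself, which is one reason the advertisement carries a sequence number.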
Recommended direction:
- HA over UDP: start with explicit packet authentication using shared_key
- API and KV peer traffic over TCP: use TLS/mTLS
- future HA hardening: consider DTLS or stronger keyed message authentication if the simpler HA packet auth is not sufficient
Threat-model note:
- shared_key is a practical first step for a private HA network, not a full Internet-facing security model
- it should be combined with network isolation, peer allow-listing, and standard infrastructure firewalling
- if HA traffic ever needs to cross less-trusted networks, the design should be revisited with stronger transport or packet-level protections
Operational guidance:
- use auth.mode: none only for tests and local experiments
- use a different auth.key per HA group/environment
- rotate the key carefully, because all HA peers in the same group must agree on it
- do not treat shared_key as a substitute for TLS on the management API
Current HA API
The current HA management API is read-only and listens on 9376/tcp.
Available endpoints:
- GET /status
- GET /ha/status
- GET /healthz
The current gruezi status command queries this API.
For a live view during failover testing, use gruezi status --watch.
HA Packet Troubleshooting
For HA packet troubleshooting, run gruezi status --watch and tcpdump side by side, capturing the UDP traffic on the HA port (9375/udp by default).
Recommended tcpdump flags:
- -n: disable DNS lookups
- -tttt: print readable timestamps for correlation with status --watch
- -vvv: increase protocol detail
- -X: show packet payload in hex and ASCII
- -s0: capture the full packet instead of truncating it
This lets you correlate:
- sent and recv counters from gruezi status --watch
- peer liveness and last_peer_seen
- raw UDP payload bytes on 9375/udp
Current HA Hooks
The current HA implementation supports transition hooks in YAML:
    ha:
      hooks:
        on_promote: /etc/gruezi/hooks/promote.sh
        on_demote: /etc/gruezi/hooks/demote.sh
        on_backup: /etc/gruezi/hooks/backup.sh
        on_fault: /etc/gruezi/hooks/fault.sh
        timeout_ms: 5000
Implemented today:
- on_promote
- on_demote
- on_backup
- on_fault for explicit HA address-action and runtime failure paths
Hook scripts currently receive runtime context through environment variables:
- GRUEZI_EVENT
- GRUEZI_NODE_ID
- GRUEZI_GROUP_ID
- GRUEZI_INTERFACE
- GRUEZI_STATE
- GRUEZI_PREVIOUS_STATE
- GRUEZI_PEER_ID
- GRUEZI_PEER_STATE
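The hook contract (a script path, context in environment variables, a timeout) can be sketched as below. This is an illustration, not the gruezi runtime: the function names are hypothetical, and passing an empty string when no peer has been observed yet is an assumption.

```python
import os
import subprocess
from typing import Dict, Optional


def build_hook_env(event: str, node_id: str, group_id: str, interface: str,
                   state: str, previous_state: str,
                   peer_id: Optional[str] = None,
                   peer_state: Optional[str] = None) -> Dict[str, str]:
    """Copy the current environment and add the GRUEZI_* context variables."""
    env = dict(os.environ)
    env.update({
        "GRUEZI_EVENT": event,
        "GRUEZI_NODE_ID": node_id,
        "GRUEZI_GROUP_ID": group_id,
        "GRUEZI_INTERFACE": interface,
        "GRUEZI_STATE": state,
        "GRUEZI_PREVIOUS_STATE": previous_state,
        # Assumption: empty string when no peer has been observed yet.
        "GRUEZI_PEER_ID": peer_id or "",
        "GRUEZI_PEER_STATE": peer_state or "",
    })
    return env


def run_hook(path: str, env: Dict[str, str], timeout_ms: int = 5000) -> Optional[int]:
    """Run one hook script; return its exit code, or None if it timed out."""
    try:
        result = subprocess.run(
            [path], env=env, stdin=subprocess.DEVNULL,
            timeout=timeout_ms / 1000, check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

Bounding each hook with a timeout keeps a hung script from blocking the HA loop.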
DRAFT: API And Service Discovery
The external API surface for mode: kv still needs to be defined.
Questions to settle:
- etcd-compatible API or custom API
- gRPC, HTTP, or both
- key-space layout and prefix conventions
- watch/stream semantics
- lease/session behavior
API listeners must not be IPv4-only by default. The preferred behavior is:
- explicit listen IP binds exactly that IP family
- if no listen IP is provided, try dual-stack IPv6 first
- if dual-stack IPv6 is unavailable, fall back to IPv4
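The listener rules above can be sketched with plain sockets. The example uses Python's socket module to show the idea; gruezi's own listener code is not shown in this document, and bind_listener is a hypothetical name.

```python
import socket
from typing import Optional


def _open(family: int, host: str, port: int,
          v6only: Optional[bool] = None) -> socket.socket:
    sock = socket.socket(family, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if v6only is not None:
            sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, int(v6only))
        sock.bind((host, port))
        sock.listen(128)
    except OSError:
        sock.close()
        raise
    return sock


def bind_listener(port: int, host: Optional[str] = None) -> socket.socket:
    if host is not None:
        # An explicit listen IP binds exactly that IP family.
        if ":" in host:
            return _open(socket.AF_INET6, host, port, v6only=True)
        return _open(socket.AF_INET, host, port)
    try:
        # No listen IP: try a dual-stack IPv6 socket first.
        return _open(socket.AF_INET6, "::", port, v6only=False)
    except OSError:
        # Dual-stack IPv6 is unavailable: fall back to IPv4.
        return _open(socket.AF_INET, "0.0.0.0", port)
```

Returning an already-bound socket matches the idea of handing a pre-bound listener to the HTTP layer, so the bind policy lives in one place.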
If an HTTP API is added, axum is a reasonable choice on top of a pre-bound TcpListener.
Service discovery also needs clear rules:
- how records are stored in KV
- how DNS responses are generated
- TTL and expiration behavior
- health integration and stale record cleanup
DRAFT: Security And Observability
Both modes should plan for production-grade safety and debugging.
Minimum areas to define:
- mTLS between nodes
- client authentication and authorization
- certificate rotation
- metrics for leadership, replication lag, snapshot size, disk usage, and write stalls
- tracing for elections, failover, and storage operations