# Cluster Setup
This guide covers running tsink as a replicated multi-node cluster. The same engine binary that runs single-node also runs in cluster mode — clustering is enabled with a flag.
---
## Overview
A tsink cluster is a group of nodes that share data via consistent-hash-ring sharding and configurable replication. The cluster handles:
- **Shard routing** — series are hashed to a logical shard, and each shard is owned by one or more replica nodes.
- **Replication** — writes fan out to all replica owners; consistency guarantees are tunable.
- **Hinted handoff** — writes destined for a temporarily unavailable peer are durably queued and replayed when it recovers.
- **Anti-entropy repair** — periodic digest exchange detects and repairs diverged shards.
- **Online rebalance** — shard ownership migrates automatically when nodes join or leave.
- **Control-plane consensus** — membership and ring state are managed by an internal Raft-like log replicated across all storage-capable nodes.
---
## Quick start: three-node cluster
Start three nodes, each binding to its own data path and internal RPC endpoint:
```bash
# Node 1
tsink-server \
--listen 0.0.0.0:9201 \
--data-path ./var/node1 \
--cluster-enabled \
--cluster-node-id node-1 \
--cluster-bind 0.0.0.0:9211 \
--cluster-seeds node-2:9212,node-3:9213 \
--cluster-replication-factor 3
# Node 2
tsink-server \
--listen 0.0.0.0:9202 \
--data-path ./var/node2 \
--cluster-enabled \
--cluster-node-id node-2 \
--cluster-bind 0.0.0.0:9212 \
--cluster-seeds node-1:9211,node-3:9213 \
--cluster-replication-factor 3
# Node 3
tsink-server \
--listen 0.0.0.0:9203 \
--data-path ./var/node3 \
--cluster-enabled \
--cluster-node-id node-3 \
--cluster-bind 0.0.0.0:9213 \
--cluster-seeds node-1:9211,node-2:9212 \
--cluster-replication-factor 3
```
Each node bootstraps by contacting the seed list until the control plane accepts its join. Once all three nodes are active, shard ownership is distributed across them.
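As a quick sanity check that the admin API is up on every node, you can hit one of the cluster admin endpoints documented later in this guide from any host. This is only a sketch; it assumes a local run with the quick-start ports above:
```bash
# Smoke-test the admin API on each node (this endpoint is documented in the
# rebalance section below); ports match the quick-start example.
for port in 9201 9202 9203; do
  curl -s "http://localhost:${port}/api/v1/admin/cluster/rebalance/status"
  echo
done
```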
---
## CLI flags reference
All other cluster flags take effect only when `--cluster-enabled` is set.

| Flag | Default | Description |
|------|---------|-------------|
| `--cluster-enabled` | `false` | Enable cluster mode. |
| `--cluster-node-id <ID>` | — | **Required.** Stable, unique identifier for this node (e.g. `node-1`). Must not be `"unknown"`. |
| `--cluster-bind <HOST:PORT>` | — | **Required.** Internal RPC listen/advertise address. Peers connect here. |
| `--cluster-node-role <ROLE>` | `hybrid` | Node role: `storage`, `query`, or `hybrid`. See [node roles](#node-roles). |
| `--cluster-seeds <HOST:PORT,...>` | — | Comma-separated list of peer endpoints used for initial join. Format: `host:port` or `node-id@host:port`. |
| `--cluster-shards <N>` | `128` | Number of logical shards. Must be > 0. Set this once at cluster creation and never change it. |
| `--cluster-replication-factor <N>` | `1` | Number of replicas per shard. Must be > 0 and ≤ number of storage-capable nodes. |
| `--cluster-write-consistency <MODE>` | `quorum` | Write consistency level: `one`, `quorum`, or `all`. |
| `--cluster-read-consistency <MODE>` | `eventual` | Read consistency level: `eventual`, `quorum`, or `strict`. |
| `--cluster-read-partial-response <MODE>` | `allow` | Whether to allow partial results when some shards are unavailable: `allow` or `deny`. |
| `--cluster-internal-auth-token <TOKEN>` | — | Shared-secret token sent on all internal RPC calls. Mutually exclusive with `--cluster-internal-auth-token-file`. |
| `--cluster-internal-auth-token-file <PATH>` | — | Path to a file containing the shared-secret token. Mutually exclusive with `--cluster-internal-auth-token`. |
| `--cluster-internal-mtls-enabled` | `false` | Enable mTLS for internal RPC. When set, token auth is disabled. Requires the three flags below. |
| `--cluster-internal-mtls-ca-cert <PATH>` | — | PEM CA bundle used to verify peer certificates. |
| `--cluster-internal-mtls-cert <PATH>` | — | PEM client certificate presented on outbound RPC. |
| `--cluster-internal-mtls-key <PATH>` | — | PEM private key for `--cluster-internal-mtls-cert`. |
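The table above lists two accepted seed formats. As a hedged sketch, the `node-id@host:port` form looks like this (the node ID, hostnames, and ports here are illustrative):
```bash
# Seeds given in node-id@host:port form (illustrative hostnames).
tsink-server \
--cluster-enabled \
--cluster-node-id node-4 \
--cluster-bind 0.0.0.0:9214 \
--cluster-seeds node-1@node-1.internal:9211,node-2@node-2.internal:9212 \
...
```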
---
## Node roles
Each node has a role that determines whether it owns data shards and whether it participates in query fanout.

| Role | Owns data shards | Participates in query fanout |
|------|------------------|------------------------------|
| `hybrid` (default) | Yes | Yes |
| `storage` | Yes | Via RPC from query nodes |
| `query` | No | Yes (routes to storage/hybrid) |
**`hybrid`** is appropriate for homogeneous clusters where every node is equal. **`storage` + `query`** separation is useful when you want dedicated query nodes with more memory for merging results, while storage nodes focus on ingest and retention.
A `query`-role node requires at least one `storage` or `hybrid` node in its seed list:
```bash
tsink-server \
--cluster-node-role query \
--cluster-seeds storage-1:9211,storage-2:9212 \
...
```
---
## Sharding
Series are mapped to shards, and shards are mapped to replica owners, using a consistent hash ring:
1. **Series → shard**: `shard_index = series_id % shard_count`
2. **Shard → owners**: a virtual-node consistent hash ring (xxHash64, seed 0). Each physical node places 128 virtual nodes on the ring; walking clockwise from the shard's position, the first `replication_factor` distinct physical nodes form the shard's replica set.
The `--cluster-shards` value determines the total number of logical shards and **cannot be changed after the cluster is created**. The default of `128` is sufficient for most deployments. Use a higher value (e.g. `512` or `1024`) if you plan to grow to many nodes and want fine-grained shard migration.
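As a minimal illustration of step 1 only (step 2 depends on the ring state), the series-to-shard mapping is plain modular arithmetic; the series ID below is made up:
```bash
# Series -> shard is modular arithmetic over the configured shard count.
series_id=9876543210      # illustrative value; real IDs are assigned by the engine
shard_count=128           # must match --cluster-shards
echo $(( series_id % shard_count ))   # prints the owning shard index (0..127)
```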
---
## Consistency levels
### Write consistency
Controls how many replica acks are required before a write is confirmed.

| Mode | Acks required | Formula |
|------|---------------|---------|
| `one` | 1 | — |
| `quorum` (default) | majority | `⌊replication_factor / 2⌋ + 1` |
| `all` | all replicas | `replication_factor` |
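For example, with the quick start's `--cluster-replication-factor 3`, `quorum` requires two acks:
```bash
# ⌊replication_factor / 2⌋ + 1 with replication_factor = 3
echo $(( 3 / 2 + 1 ))   # -> 2
```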
The write consistency can be overridden per-request using the HTTP header:
```
x-tsink-write-consistency: one
```
Per-request overrides may only **weaken** (not strengthen) the server-configured level.
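For example, to relax a single write from the server default down to `one`, send the header with the write request. Only the header name comes from this guide; the endpoint path and payload below are placeholders, so substitute your deployment's actual write API:
```bash
# The endpoint path and payload are placeholders; only the
# x-tsink-write-consistency header is documented here.
curl -X POST "http://node-1:9201/<your-write-endpoint>" \
  -H "x-tsink-write-consistency: one" \
  --data-binary @points.ndjson
```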
### Read consistency
Controls how many replica responses are required and whether all replicas are queried.

| Mode | Replicas queried | Responses required |
|------|------------------|--------------------|
| `eventual` (default) | Primary only | 1 |
| `quorum` | All replicas | `⌊replicas / 2⌋ + 1` |
| `strict` | All replicas | All replicas |
### Partial response
When `--cluster-read-partial-response allow` (default), a query can succeed and return data even if some shards are temporarily unavailable. The response will include a `"partial_response": true` field in the metadata.
Set `--cluster-read-partial-response deny` to return an error instead when any shard is unavailable.
---
## Hinted handoff
When a write cannot be delivered to a replica because the peer is unreachable, tsink durably queues the data in a per-peer **hinted handoff outbox**. When the peer recovers, the queued data is replayed automatically.
Handoff behavior is tunable via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_OUTBOX_MAX_ENTRIES` | `100000` | Maximum number of queued entries across all peers. |
| `TSINK_CLUSTER_OUTBOX_MAX_BYTES` | `536870912` (512 MiB) | Total size cap for the handoff queue. |
| `TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES` | `268435456` (256 MiB) | Per-peer size cap. |
| `TSINK_CLUSTER_OUTBOX_MAX_LOG_BYTES` | `2147483648` (2 GiB) | Maximum WAL size for the outbox log. |
| `TSINK_CLUSTER_OUTBOX_REPLAY_INTERVAL_SECS` | `2` | How often to attempt replay to recovered peers (seconds). |
| `TSINK_CLUSTER_OUTBOX_REPLAY_BATCH_SIZE` | `256` | Rows per replay batch. |
| `TSINK_CLUSTER_OUTBOX_MAX_BACKOFF_SECS` | `30` | Maximum backoff between replay attempts. |
| `TSINK_CLUSTER_OUTBOX_MAX_RECORD_BYTES` | `2097152` (2 MiB) | Maximum size of a single outbox record. |
| `TSINK_CLUSTER_OUTBOX_CLEANUP_INTERVAL_SECS` | `30` | How often stale outbox records are cleaned up. |
| `TSINK_CLUSTER_OUTBOX_CLEANUP_MIN_STALE_RECORDS` | `1024` | Minimum stale records before cleanup runs. |
| `TSINK_CLUSTER_OUTBOX_STALLED_PEER_AGE_SECS` | `300` | Age threshold (seconds) at which a peer's outbox queue is considered stalled. |
| `TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_ENTRIES` | `1` | Minimum entries for a peer queue to be considered stalled. |
| `TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_BYTES` | `1` | Minimum bytes for a peer queue to be considered stalled. |
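These are ordinary process environment variables for the `tsink-server` process; a minimal sketch of overriding two of them when launching a node (the values are illustrative, not recommendations):
```bash
# Raise the per-peer handoff cap and the replay batch size (illustrative values).
export TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES=$((512 * 1024 * 1024))   # 512 MiB
export TSINK_CLUSTER_OUTBOX_REPLAY_BATCH_SIZE=512
tsink-server --cluster-enabled ...
```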
---
## Anti-entropy repair
Repair uses a **digest exchange**: each node periodically computes per-shard fingerprints and compares them with peers. Diverged shards trigger a targeted data backfill.

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_DIGEST_INTERVAL_SECS` | `30` | How often digest exchange runs per node. |
| `TSINK_CLUSTER_DIGEST_WINDOW_SECS` | `300` | Lookback window included in each digest (seconds). |
| `TSINK_CLUSTER_DIGEST_MAX_SHARDS_PER_TICK` | `64` | Max shards compared per tick. |
| `TSINK_CLUSTER_DIGEST_MAX_MISMATCH_REPORTS` | `128` | Max mismatch records tracked in memory. |
| `TSINK_CLUSTER_DIGEST_MAX_BYTES_PER_TICK` | `262144` (256 KiB) | Max digest payload size per tick. |
| `TSINK_CLUSTER_REPAIR_MAX_MISMATCHES_PER_TICK` | `2` | Max diverged shards repaired per tick. |
| `TSINK_CLUSTER_REPAIR_MAX_SERIES_PER_TICK` | `256` | Max series included in a repair round. |
| `TSINK_CLUSTER_REPAIR_MAX_ROWS_PER_TICK` | `16384` | Max rows transferred per repair tick. |
| `TSINK_CLUSTER_REPAIR_MAX_RUNTIME_MS_PER_TICK` | `100` | Max wall time per repair tick (ms). |
| `TSINK_CLUSTER_REPAIR_FAILURE_BACKOFF_SECS` | `30` | Backoff after a failed repair attempt. |
Repair can be paused, resumed, cancelled, or triggered on-demand via the admin API:
```bash
# Pause repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/pause
# Resume repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/resume
# Run repair immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/run
# Check repair status
curl http://node-1:9201/api/v1/admin/cluster/repair/status
```
---
## Rebalance
When shard ownership changes (a node joins, a node leaves, or the ring is updated), data migrates to the new owners. Rebalance runs automatically in the background and is rate-limited to avoid impacting production traffic.

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_REBALANCE_INTERVAL_SECS` | `5` | How often the rebalance loop ticks. |
| `TSINK_CLUSTER_REBALANCE_MAX_ROWS_PER_TICK` | `10000` | Max rows migrated per rebalance tick. |
| `TSINK_CLUSTER_REBALANCE_MAX_SHARDS_PER_TICK` | `4` | Max shards processed per rebalance tick. |
Rebalance management endpoints:
```bash
# Pause rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/pause
# Resume rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/resume
# Run rebalance immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/run
# Check rebalance status
curl http://node-1:9201/api/v1/admin/cluster/rebalance/status
```
---
## Adding and removing nodes
### Adding a node
1. Start the new node with `--cluster-enabled`, a unique `--cluster-node-id`, and `--cluster-seeds` pointing at existing nodes.
2. The new node contacts the seed list and sends a `join` request to the control plane.
3. Once accepted, the control plane updates the ring and rebalance begins migrating shards to the new node.
You can also trigger the join manually:
```bash
curl -X POST http://new-node:9201/api/v1/admin/cluster/join
```
### Removing a node
Signal the node to drain its shards before stopping:
```bash
curl -X POST http://node-to-remove:9201/api/v1/admin/cluster/leave
```
This initiates a graceful handoff: the node transfers its shard data to the remaining owners before marking itself as removed. Check handoff status:
```bash
curl http://node-to-remove:9201/api/v1/admin/cluster/handoff/status
```
Once the handoff is complete, the node can be stopped safely.
### Shard handoff phases
When a shard is migrated, it goes through the following phases:

| Phase | Description |
|-------|-------------|
| `Warmup` | New owner receives data; old owner remains primary. |
| `Cutover` | Traffic switches to the new owner. |
| `FinalSync` | Final data sync before the old owner relinquishes the shard. |
| `Completed` | Migration done. |
| `Failed` | An error occurred; the handoff can be resumed from `Warmup`. |
---
## Control-plane consensus
Cluster membership and ring state are managed by an internal Raft-like consensus log replicated across all storage-capable nodes. The leader coordinates ring updates, join/leave decisions, and snapshots.

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_CONTROL_TICK_INTERVAL_SECS` | `2` | Consensus tick interval. |
| `TSINK_CLUSTER_CONTROL_MAX_APPEND_ENTRIES` | `64` | Max log entries per append round. |
| `TSINK_CLUSTER_CONTROL_SNAPSHOT_INTERVAL_ENTRIES` | `128` | Log entries between automatic snapshots. |
| `TSINK_CLUSTER_CONTROL_SUSPECT_TIMEOUT_SECS` | `6` | Seconds before a silent peer is marked suspect. |
| `TSINK_CLUSTER_CONTROL_DEAD_TIMEOUT_SECS` | `20` | Seconds before a suspect peer is declared dead. |
| `TSINK_CLUSTER_CONTROL_LEADER_LEASE_SECS` | `6` | Leader lease duration. |
Peer liveness transitions: `unknown → healthy → suspect → dead`.
### Snapshots
Snapshot and restore the control-plane state:
```bash
# Snapshot this node's control plane
curl -X POST http://node-1:9201/api/v1/admin/cluster/control/snapshot
# Cluster-wide coordinated snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/snapshot
# Restore from a cluster-wide snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/restore
```
---
## Internal security
All inter-node communication runs on the internal RPC port (`--cluster-bind`) under `/internal/v1/*` endpoints. Two mutually exclusive authentication mechanisms are available.
### Shared-secret token (simpler)
Provide a token at startup. All nodes in the cluster must use the same token.
```bash
tsink-server \
--cluster-internal-auth-token "s3cr3t-tok3n" \
...
```
Or load it from a file (useful for secret management):
```bash
tsink-server \
--cluster-internal-auth-token-file /run/secrets/cluster-token \
...
```
The token is sent on every internal RPC call via the `x-tsink-internal-auth` header. Tokens can be rotated at runtime without restart — see the [secret rotation guide](docs/secret-rotation.md).
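The token is an opaque shared secret; one way to generate a token file, as a sketch assuming `openssl` is available and reusing the path from the example above:
```bash
# Generate a random shared secret; distribute the same file to every node,
# then start each node with --cluster-internal-auth-token-file pointing at it.
openssl rand -hex 32 > /run/secrets/cluster-token
chmod 600 /run/secrets/cluster-token
```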
### mTLS (recommended for production)
Enable mTLS to authenticate and encrypt all peer-to-peer traffic using certificates. When mTLS is enabled, shared-secret token auth is disabled.
All three paths are required when `--cluster-internal-mtls-enabled` is set:
```bash
tsink-server \
--cluster-internal-mtls-enabled \
--cluster-internal-mtls-ca-cert /etc/tsink/ca.pem \
--cluster-internal-mtls-cert /etc/tsink/node.crt \
--cluster-internal-mtls-key /etc/tsink/node.key \
...
```

| Flag | Description |
|------|-------------|
| `--cluster-internal-mtls-ca-cert` | PEM CA bundle. Used to verify incoming peer connections and outbound TLS. |
| `--cluster-internal-mtls-cert` | PEM certificate presented by this node on all outbound RPC calls. |
| `--cluster-internal-mtls-key` | PEM private key for the certificate above. |
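For a throwaway lab setup you can mint a test CA and per-node certificates with `openssl`. This is a minimal sketch only; real deployments need proper PKI and `subjectAltName` entries matching how peers address each other:
```bash
# Throwaway CA for lab testing only.
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -subj "/CN=tsink-test-ca" -keyout ca.key -out ca.pem

# Per-node key and certificate signed by that CA (repeat for each node).
openssl req -newkey rsa:4096 -nodes \
  -subj "/CN=node-1" -keyout node.key -out node.csr
openssl x509 -req -in node.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
  -days 365 -out node.crt
```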
mTLS certificates can be rotated at runtime without restart. See the [secret rotation guide](docs/secret-rotation.md).
---
## RPC tuning
The following environment variables control internal RPC behavior and resource limits:

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_FANOUT_CONCURRENCY` | `16` | Maximum concurrent fanout RPCs per request. |
| `TSINK_CLUSTER_RPC_TIMEOUT_MS` | `2000` | Timeout per individual RPC call (ms). |
| `TSINK_CLUSTER_RPC_MAX_RETRIES` | `2` | Maximum retries before falling back to hinted handoff (writes) or error (reads). |
| `TSINK_CLUSTER_WRITE_MAX_BATCH_ROWS` | `1024` | Maximum rows per write batch sent to a peer. |
| `TSINK_CLUSTER_WRITE_MAX_INFLIGHT_BATCHES` | `32` | Maximum concurrent write batches in flight to all peers. |
| `TSINK_CLUSTER_READ_MAX_MERGED_SERIES` | `250000` | Maximum number of series that can be merged in a single distributed read. |
| `TSINK_CLUSTER_READ_MAX_MERGED_POINTS_PER_SERIES` | `1000000` | Maximum data points per series in a distributed read result. |
| `TSINK_CLUSTER_READ_MAX_MERGED_POINTS_TOTAL` | `5000000` | Maximum total data points in a distributed read result. |
| `TSINK_CLUSTER_READ_MAX_INFLIGHT_QUERIES` | `64` | Maximum concurrent distributed read queries. |
| `TSINK_CLUSTER_READ_MAX_INFLIGHT_MERGED_POINTS` | `20000000` | Maximum total in-flight merged points across all concurrent reads. |
| `TSINK_CLUSTER_READ_RESOURCE_ACQUIRE_TIMEOUT_MS` | `25` | Timeout acquiring read concurrency slots (ms). |
---
## Write deduplication
To prevent double-writes during retry storms, tsink tracks idempotency keys per node in a sliding time window. Duplicate writes within the window are silently dropped on the receiving node.

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_DEDUPE_WINDOW_SECS` | `900` | Deduplication window (15 minutes). |
| `TSINK_CLUSTER_DEDUPE_MAX_ENTRIES` | `250000` | Maximum tracked idempotency keys. |
| `TSINK_CLUSTER_DEDUPE_MAX_LOG_BYTES` | `67108864` (64 MiB) | Maximum WAL size for the dedupe log. |
| `TSINK_CLUSTER_DEDUPE_CLEANUP_INTERVAL_SECS` | `30` | Cleanup interval for expired entries. |
---
## Cluster audit log
All control-plane operations (joins, leaves, ring changes, handoffs) are written to an audit log queryable via the admin API.

| Variable | Default | Description |
|----------|---------|-------------|
| `TSINK_CLUSTER_AUDIT_RETENTION_SECS` | `2592000` (30 days) | How long audit records are retained. |
| `TSINK_CLUSTER_AUDIT_MAX_LOG_BYTES` | `134217728` (128 MiB) | Maximum size of the audit log. |
| `TSINK_CLUSTER_AUDIT_MAX_QUERY_LIMIT` | `1000` | Maximum records returned per audit query. |
```bash
# Query recent audit records
curl http://node-1:9201/api/v1/admin/cluster/audit
# Export full audit log as NDJSON
curl http://node-1:9201/api/v1/admin/cluster/audit/export
```
---
## Configuration checklist
Before starting a production cluster:
- [ ] Every node has a unique, stable `--cluster-node-id`.
- [ ] `--cluster-bind` is reachable by all peers on the network.
- [ ] `--cluster-shards` is consistent across all nodes in the cluster.
- [ ] `--cluster-replication-factor` ≤ number of storage-capable nodes.
- [ ] Internal auth is configured: either a shared token (`--cluster-internal-auth-token` or `--cluster-internal-auth-token-file`) or the `--cluster-internal-mtls-*` flags.
- [ ] `--cluster-seeds` lists enough existing nodes for bootstrap (any subset works).
- [ ] Firewalls allow traffic on both the public `--listen` port and the internal `--cluster-bind` port.