# Cluster Setup

This guide covers running tsink as a replicated multi-node cluster. The same engine binary that runs single-node also runs in cluster mode — clustering is enabled with a flag.

---

## Overview

A tsink cluster is a group of nodes that share data via consistent-hash-ring sharding and configurable replication. The cluster handles:

- **Shard routing** — series are hashed to a logical shard, and each shard is owned by one or more replica nodes.
- **Replication** — writes fan out to all replica owners; consistency guarantees are tunable.
- **Hinted handoff** — writes destined for a temporarily unavailable peer are durably queued and replayed when it recovers.
- **Anti-entropy repair** — periodic digest exchange detects and repairs diverged shards.
- **Online rebalance** — shard ownership migrates automatically when nodes join or leave.
- **Control-plane consensus** — membership and ring state are managed by an internal Raft-like log replicated across all storage-capable nodes.

---

## Quick start: three-node cluster

Start three nodes, each with its own data path and internal RPC endpoint:

```bash
# Node 1
tsink-server \
  --listen 0.0.0.0:9201 \
  --data-path ./var/node1 \
  --cluster-enabled \
  --cluster-node-id node-1 \
  --cluster-bind 0.0.0.0:9211 \
  --cluster-seeds node-2:9212,node-3:9213 \
  --cluster-replication-factor 3

# Node 2
tsink-server \
  --listen 0.0.0.0:9202 \
  --data-path ./var/node2 \
  --cluster-enabled \
  --cluster-node-id node-2 \
  --cluster-bind 0.0.0.0:9212 \
  --cluster-seeds node-1:9211,node-3:9213 \
  --cluster-replication-factor 3

# Node 3
tsink-server \
  --listen 0.0.0.0:9203 \
  --data-path ./var/node3 \
  --cluster-enabled \
  --cluster-node-id node-3 \
  --cluster-bind 0.0.0.0:9213 \
  --cluster-seeds node-1:9211,node-2:9212 \
  --cluster-replication-factor 3
```

Each node bootstraps by contacting the seed list until the control plane accepts its join. Once all three nodes are active, shard ownership is distributed across them.

---

## CLI flags reference

All cluster flags are only meaningful when `--cluster-enabled` is set.

| Flag | Default | Description |
|---|---|---|
| `--cluster-enabled` | `false` | Enable cluster mode. |
| `--cluster-node-id <ID>` || **Required.** Stable, unique identifier for this node (e.g. `node-1`). Must not be `"unknown"`. |
| `--cluster-bind <HOST:PORT>` || **Required.** Internal RPC listen/advertise address. Peers connect here. |
| `--cluster-node-role <ROLE>` | `hybrid` | Node role: `storage`, `query`, or `hybrid`. See [node roles](#node-roles). |
| `--cluster-seeds <HOST:PORT,...>` || Comma-separated list of peer endpoints used for initial join. Format: `host:port` or `node-id@host:port`. |
| `--cluster-shards <N>` | `128` | Number of logical shards. Must be > 0. Set this once at cluster creation and never change it. |
| `--cluster-replication-factor <N>` | `1` | Number of replicas per shard. Must be > 0 and ≤ number of storage-capable nodes. |
| `--cluster-write-consistency <MODE>` | `quorum` | Write consistency level: `one`, `quorum`, or `all`. |
| `--cluster-read-consistency <MODE>` | `eventual` | Read consistency level: `eventual`, `quorum`, or `strict`. |
| `--cluster-read-partial-response <MODE>` | `allow` | Whether to allow partial results when some shards are unavailable: `allow` or `deny`. |
| `--cluster-internal-auth-token <TOKEN>` || Shared-secret token sent on all internal RPC calls. Mutually exclusive with `--cluster-internal-auth-token-file`. |
| `--cluster-internal-auth-token-file <PATH>` || Path to a file containing the shared-secret token. Mutually exclusive with `--cluster-internal-auth-token`. |
| `--cluster-internal-mtls-enabled` | `false` | Enable mTLS for internal RPC. When set, token auth is disabled. Requires the three flags below. |
| `--cluster-internal-mtls-ca-cert <PATH>` || PEM CA bundle used to verify peer certificates. |
| `--cluster-internal-mtls-cert <PATH>` || PEM client certificate presented on outbound RPC. |
| `--cluster-internal-mtls-key <PATH>` || PEM private key for `--cluster-internal-mtls-cert`. |
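
The `--cluster-seeds` flag accepts either bare endpoints or endpoints tagged with the expected node ID, per the table above. Both forms, with illustrative hosts:

```bash
# Bare endpoints
--cluster-seeds 10.0.0.1:9211,10.0.0.2:9211

# Endpoints tagged with the expected node ID
--cluster-seeds node-1@10.0.0.1:9211,node-2@10.0.0.2:9211
```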

---

## Node roles

Each node has a role that determines whether it owns data shards and whether it participates in query fanout.

| Role | Owns shards | Handles queries |
|---|---|---|
| `hybrid` (default) | Yes | Yes |
| `storage` | Yes | Via RPC from query nodes |
| `query` | No | Yes (routes to storage/hybrid) |

**`hybrid`** is appropriate for homogeneous clusters where every node is equal. **`storage` + `query`** separation is useful when you want dedicated query nodes with more memory for merging results, while storage nodes focus on ingest and retention.

A `query`-role node requires at least one `storage` or `hybrid` node in its seed list:

```bash
tsink-server \
  --cluster-node-role query \
  --cluster-seeds storage-1:9211,storage-2:9212 \
  ...
```
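
A split deployment might look like the sketch below: one `storage`-role node and one `query`-role node (a second storage node, `storage-2`, is assumed to be started the same way; IDs, ports, and paths are illustrative, and all flags are the ones documented above):

```bash
# Storage node: owns shards, serves data to query nodes over internal RPC
tsink-server \
  --listen 0.0.0.0:9201 --data-path ./var/storage1 \
  --cluster-enabled --cluster-node-id storage-1 \
  --cluster-node-role storage \
  --cluster-bind 0.0.0.0:9211 \
  --cluster-seeds storage-2:9212 \
  --cluster-replication-factor 2

# Query node: owns no shards, fans queries out to storage/hybrid nodes
tsink-server \
  --listen 0.0.0.0:9203 --data-path ./var/query1 \
  --cluster-enabled --cluster-node-id query-1 \
  --cluster-node-role query \
  --cluster-bind 0.0.0.0:9213 \
  --cluster-seeds storage-1:9211,storage-2:9212
```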

---

## Sharding

Series are mapped to shards, and shards are mapped to replica owners, using a consistent hash ring:

1. **Series → shard**: `shard_index = series_id % shard_count`
2. **Shard → owners**: a virtual-node consistent hash ring (xxHash64, seed 0). Each physical node places 128 virtual nodes on the ring; the first `replication_factor` distinct physical nodes found walking clockwise from the shard's position form its replica set.
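
Step 1 is plain modular arithmetic. A worked example with the default shard count (the series ID below is made up; tsink derives real series IDs internally):

```bash
# Illustrative series -> shard mapping with the default 128 shards.
series_id=4823719   # made-up value for illustration
shard_count=128
echo $(( series_id % shard_count ))   # -> 39
```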

The `--cluster-shards` value determines the total number of logical shards and **cannot be changed after the cluster is created**. The default of `128` is sufficient for most deployments. Use a higher value (e.g. `512` or `1024`) if you plan to grow to many nodes and want fine-grained shard migration.

---

## Consistency levels

### Write consistency

Controls how many replica acks are required before a write is confirmed.

| Level | Acks required | Formula |
|---|---|---|
| `one` | 1 ||
| `quorum` (default) | majority | `⌊replication_factor / 2⌋ + 1` |
| `all` | all replicas | `replication_factor` |
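
As a worked example, with `--cluster-replication-factor 3` a `quorum` write needs 2 acks and an `all` write needs 3:

```bash
# Quorum formula from the table above, evaluated for replication_factor=3.
rf=3
echo $(( rf / 2 + 1 ))   # quorum -> 2 acks
echo $(( rf ))           # all    -> 3 acks
```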

The write consistency can be overridden per-request using the HTTP header:

```
x-tsink-write-consistency: one
```

Per-request overrides may only **weaken** (not strengthen) the server-configured level.
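
A sketch of a per-request override (the write path and payload below are placeholders, not the real API shape; use the write endpoint and body format from the HTTP API documentation):

```bash
# <write-endpoint> and the body are placeholders; only the header is documented here.
curl -X POST "http://node-1:9201/<write-endpoint>" \
  -H 'x-tsink-write-consistency: one' \
  -d '<write payload>'
```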

### Read consistency

Controls how many replica responses are required and whether all replicas are queried.

| Level | Replicas queried | Acks required |
|---|---|---|
| `eventual` (default) | Primary only | 1 |
| `quorum` | All replicas | `⌊replicas / 2⌋ + 1` |
| `strict` | All replicas | All replicas |

### Partial response

When `--cluster-read-partial-response allow` (default), a query can succeed and return data even if some shards are temporarily unavailable. The response will include a `"partial_response": true` field in the metadata.

Set `--cluster-read-partial-response deny` to return an error instead when any shard is unavailable.
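
One way to surface degraded reads in scripts is to check that flag in the response. Sketch only: `<query-endpoint>` is a placeholder, and the field's exact position in the JSON depends on the response format:

```bash
# The field name comes from this guide; the endpoint and the field's
# position in the JSON are assumptions.
curl -s "http://node-1:9201/<query-endpoint>" -o response.json
jq '.partial_response // false' response.json
```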

---

## Hinted handoff

When a write cannot be delivered to a replica because the peer is unreachable, tsink durably queues the data in a per-peer **hinted handoff outbox**. When the peer recovers, the queued data is replayed automatically.

Handoff behavior is tunable via environment variables:

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_OUTBOX_MAX_ENTRIES` | `100000` | Maximum number of queued entries across all peers. |
| `TSINK_CLUSTER_OUTBOX_MAX_BYTES` | `536870912` (512 MiB) | Total size cap for the handoff queue. |
| `TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES` | `268435456` (256 MiB) | Per-peer size cap. |
| `TSINK_CLUSTER_OUTBOX_MAX_LOG_BYTES` | `2147483648` (2 GiB) | Maximum WAL size for the outbox log. |
| `TSINK_CLUSTER_OUTBOX_REPLAY_INTERVAL_SECS` | `2` | How often to attempt replay to recovered peers (seconds). |
| `TSINK_CLUSTER_OUTBOX_REPLAY_BATCH_SIZE` | `256` | Rows per replay batch. |
| `TSINK_CLUSTER_OUTBOX_MAX_BACKOFF_SECS` | `30` | Maximum backoff between replay attempts. |
| `TSINK_CLUSTER_OUTBOX_MAX_RECORD_BYTES` | `2097152` (2 MiB) | Maximum size of a single outbox record. |
| `TSINK_CLUSTER_OUTBOX_CLEANUP_INTERVAL_SECS` | `30` | How often stale outbox records are cleaned up. |
| `TSINK_CLUSTER_OUTBOX_CLEANUP_MIN_STALE_RECORDS` | `1024` | Minimum stale records before cleanup runs. |
| `TSINK_CLUSTER_OUTBOX_STALLED_PEER_AGE_SECS` | `300` | Age threshold (seconds) at which a peer's outbox queue is considered stalled. |
| `TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_ENTRIES` | `1` | Minimum entries for a peer queue to be considered stalled. |
| `TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_BYTES` | `1` | Minimum bytes for a peer queue to be considered stalled. |
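
These are set in the server's environment before startup. For example, to give each node a larger handoff buffer for long peer outages (values are illustrative):

```bash
# Larger hinted-handoff outbox: 1 GiB total, 512 MiB per peer, 500k entries.
export TSINK_CLUSTER_OUTBOX_MAX_BYTES=$((1024 * 1024 * 1024))
export TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES=$((512 * 1024 * 1024))
export TSINK_CLUSTER_OUTBOX_MAX_ENTRIES=500000
tsink-server --cluster-enabled ...   # remaining flags as in the quick start
```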

---

## Anti-entropy repair

Repair uses a **digest exchange**: each node periodically computes per-shard fingerprints and compares them with peers. Diverged shards trigger a targeted data backfill.

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_DIGEST_INTERVAL_SECS` | `30` | How often digest exchange runs per node. |
| `TSINK_CLUSTER_DIGEST_WINDOW_SECS` | `300` | Lookback window included in each digest (seconds). |
| `TSINK_CLUSTER_DIGEST_MAX_SHARDS_PER_TICK` | `64` | Max shards compared per tick. |
| `TSINK_CLUSTER_DIGEST_MAX_MISMATCH_REPORTS` | `128` | Max mismatch records tracked in memory. |
| `TSINK_CLUSTER_DIGEST_MAX_BYTES_PER_TICK` | `262144` (256 KiB) | Max digest payload size per tick. |
| `TSINK_CLUSTER_REPAIR_MAX_MISMATCHES_PER_TICK` | `2` | Max diverged shards repaired per tick. |
| `TSINK_CLUSTER_REPAIR_MAX_SERIES_PER_TICK` | `256` | Max series included in a repair round. |
| `TSINK_CLUSTER_REPAIR_MAX_ROWS_PER_TICK` | `16384` | Max rows transferred per repair tick. |
| `TSINK_CLUSTER_REPAIR_MAX_RUNTIME_MS_PER_TICK` | `100` | Max wall time per repair tick (ms). |
| `TSINK_CLUSTER_REPAIR_FAILURE_BACKOFF_SECS` | `30` | Backoff after a failed repair attempt. |

Repair can be paused, resumed, cancelled, or triggered on-demand via the admin API:

```bash
# Pause repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/pause

# Resume repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/resume

# Run repair immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/run

# Check repair status
curl http://node-1:9201/api/v1/admin/cluster/repair/status
```

---

## Rebalance

When shard ownership changes (a node joining or leaving, or a ring update), data migrates to new owners. Rebalance runs automatically in the background and is rate-limited to avoid impacting production traffic.

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_REBALANCE_INTERVAL_SECS` | `5` | How often the rebalance loop ticks. |
| `TSINK_CLUSTER_REBALANCE_MAX_ROWS_PER_TICK` | `10000` | Max rows migrated per rebalance tick. |
| `TSINK_CLUSTER_REBALANCE_MAX_SHARDS_PER_TICK` | `4` | Max shards processed per rebalance tick. |

Rebalance management endpoints:

```bash
# Pause rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/pause

# Resume rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/resume

# Run rebalance immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/run

# Check rebalance status
curl http://node-1:9201/api/v1/admin/cluster/rebalance/status
```

---

## Adding and removing nodes

### Adding a node

1. Start the new node with `--cluster-enabled`, a unique `--cluster-node-id`, and `--cluster-seeds` pointing at existing nodes.
2. The new node contacts the seed list and sends a `join` request to the control plane.
3. Once accepted, the control plane updates the ring and rebalance begins migrating shards to the new node.
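
For example, a fourth node joining the three-node quick-start cluster could be started like this (ID, ports, and paths are illustrative; it keeps the cluster's shard count and replication factor):

```bash
# New node: shard count (default 128) and replication factor match the
# existing cluster, as required by the configuration checklist below.
tsink-server \
  --listen 0.0.0.0:9204 \
  --data-path ./var/node4 \
  --cluster-enabled \
  --cluster-node-id node-4 \
  --cluster-bind 0.0.0.0:9214 \
  --cluster-seeds node-1:9211,node-2:9212,node-3:9213 \
  --cluster-replication-factor 3
```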

You can also trigger the join manually:

```bash
curl -X POST http://new-node:9201/api/v1/admin/cluster/join
```

### Removing a node

Signal the node to drain its shards before stopping:

```bash
curl -X POST http://node-to-remove:9201/api/v1/admin/cluster/leave
```

This initiates a graceful handoff: the node transfers its shard data to the remaining owners before marking itself as removed. Check handoff status:

```bash
curl http://node-to-remove:9201/api/v1/admin/cluster/handoff/status
```

Once the handoff is complete, the node can be stopped safely.
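
In automation, the status endpoint can be polled until the drain finishes. Sketch only: the completion check below assumes a JSON status payload; adapt the `jq` filter to the actual response shape:

```bash
# Poll the drain until it reports completion (response shape assumed).
until curl -fsS http://node-to-remove:9201/api/v1/admin/cluster/handoff/status \
      | jq -e '.completed == true' > /dev/null; do
  echo "handoff still in progress..."
  sleep 10
done
```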

### Shard handoff phases

When a shard is migrated, it goes through the following phases:

| Phase | Description |
|---|---|
| `Warmup` | New owner receives data; old owner remains primary. |
| `Cutover` | Traffic switches to the new owner. |
| `FinalSync` | Final data sync before the old owner relinquishes the shard. |
| `Completed` | Migration done. |
| `Failed` | An error occurred; the handoff can be resumed from `Warmup`. |

---

## Control-plane consensus

Cluster membership and ring state are managed by an internal Raft-like consensus log replicated across all storage-capable nodes. The leader coordinates ring updates, join/leave decisions, and snapshots.

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_CONTROL_TICK_INTERVAL_SECS` | `2` | Consensus tick interval. |
| `TSINK_CLUSTER_CONTROL_MAX_APPEND_ENTRIES` | `64` | Max log entries per append round. |
| `TSINK_CLUSTER_CONTROL_SNAPSHOT_INTERVAL_ENTRIES` | `128` | Log entries between automatic snapshots. |
| `TSINK_CLUSTER_CONTROL_SUSPECT_TIMEOUT_SECS` | `6` | Seconds before a silent peer is marked suspect. |
| `TSINK_CLUSTER_CONTROL_DEAD_TIMEOUT_SECS` | `20` | Seconds before a suspect peer is declared dead. |
| `TSINK_CLUSTER_CONTROL_LEADER_LEASE_SECS` | `6` | Leader lease duration. |

Peer liveness transitions: `unknown → healthy → suspect → dead`.

### Snapshots

Snapshot and restore the control-plane state:

```bash
# Snapshot this node's control plane
curl -X POST http://node-1:9201/api/v1/admin/cluster/control/snapshot

# Cluster-wide coordinated snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/snapshot

# Restore from a cluster-wide snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/restore
```

---

## Internal security

All inter-node communication runs on the internal RPC port (`--cluster-bind`) under `/internal/v1/*` endpoints. Two mutually exclusive authentication mechanisms are available.

### Shared-secret token (simpler)

Provide a token at startup. All nodes in the cluster must use the same token.

```bash
tsink-server \
  --cluster-internal-auth-token "s3cr3t-tok3n" \
  ...
```

Or load it from a file (useful for secret management):

```bash
tsink-server \
  --cluster-internal-auth-token-file /run/secrets/cluster-token \
  ...
```

The token is sent on every internal RPC call via the `x-tsink-internal-auth` header. Tokens can be rotated at runtime without restart — see the [secret rotation guide](docs/secret-rotation.md).
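
A common way to produce the shared secret is to generate a random value once and distribute the same file to every node (sketch; any secret-management tool works just as well):

```bash
# Generate a 256-bit random token and distribute the identical file to all nodes.
openssl rand -hex 32 > /run/secrets/cluster-token
chmod 600 /run/secrets/cluster-token
```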

### mTLS (recommended for production)

Enable mTLS to authenticate and encrypt all peer-to-peer traffic using certificates. When mTLS is enabled, shared-secret token auth is disabled.

All three paths are required when `--cluster-internal-mtls-enabled` is set:

```bash
tsink-server \
  --cluster-internal-mtls-enabled \
  --cluster-internal-mtls-ca-cert /etc/tsink/ca.pem \
  --cluster-internal-mtls-cert    /etc/tsink/node.crt \
  --cluster-internal-mtls-key     /etc/tsink/node.key \
  ...
```

| Flag | Description |
|---|---|
| `--cluster-internal-mtls-ca-cert` | PEM CA bundle. Used to verify peer certificates on incoming connections and on outbound TLS connections. |
| `--cluster-internal-mtls-cert` | PEM certificate presented by this node on all outbound RPC calls. |
| `--cluster-internal-mtls-key` | PEM private key for the certificate above. |

mTLS certificates can be rotated at runtime without restart. See the [secret rotation guide](docs/secret-rotation.md).
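
For testing, a throwaway CA and a per-node certificate can be produced with `openssl` (sketch only; production clusters should use your organization's PKI, and your certificates may need SANs matching the peer hostnames):

```bash
# Throwaway CA (testing only)
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -subj "/CN=tsink-cluster-ca" -keyout ca.key -out ca.pem

# Per-node key + CSR, signed by the CA above
openssl req -newkey rsa:4096 -nodes \
  -subj "/CN=node-1" -keyout node.key -out node.csr
openssl x509 -req -in node.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
  -days 365 -out node.crt
```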

---

## RPC tuning

The following environment variables control internal RPC behavior and resource limits:

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_FANOUT_CONCURRENCY` | `16` | Maximum concurrent fanout RPCs per request. |
| `TSINK_CLUSTER_RPC_TIMEOUT_MS` | `2000` | Timeout per individual RPC call (ms). |
| `TSINK_CLUSTER_RPC_MAX_RETRIES` | `2` | Maximum retries before falling back to hinted handoff (writes) or error (reads). |
| `TSINK_CLUSTER_WRITE_MAX_BATCH_ROWS` | `1024` | Maximum rows per write batch sent to a peer. |
| `TSINK_CLUSTER_WRITE_MAX_INFLIGHT_BATCHES` | `32` | Maximum concurrent write batches in flight to all peers. |
| `TSINK_CLUSTER_READ_MAX_MERGED_SERIES` | `250000` | Maximum number of series that can be merged in a single distributed read. |
| `TSINK_CLUSTER_READ_MAX_MERGED_POINTS_PER_SERIES` | `1000000` | Maximum data points per series in a distributed read result. |
| `TSINK_CLUSTER_READ_MAX_MERGED_POINTS_TOTAL` | `5000000` | Maximum total data points in a distributed read result. |
| `TSINK_CLUSTER_READ_MAX_INFLIGHT_QUERIES` | `64` | Maximum concurrent distributed read queries. |
| `TSINK_CLUSTER_READ_MAX_INFLIGHT_MERGED_POINTS` | `20000000` | Maximum total in-flight merged points across all concurrent reads. |
| `TSINK_CLUSTER_READ_RESOURCE_ACQUIRE_TIMEOUT_MS` | `25` | Timeout acquiring read concurrency slots (ms). |

---

## Write deduplication

To prevent double-writes during retry storms, tsink tracks idempotency keys per node in a sliding time window. Duplicate writes within the window are silently dropped on the receiving node.

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_DEDUPE_WINDOW_SECS` | `900` | Deduplication window (15 minutes). |
| `TSINK_CLUSTER_DEDUPE_MAX_ENTRIES` | `250000` | Maximum tracked idempotency keys. |
| `TSINK_CLUSTER_DEDUPE_MAX_LOG_BYTES` | `67108864` (64 MiB) | Maximum WAL size for the dedupe log. |
| `TSINK_CLUSTER_DEDUPE_CLEANUP_INTERVAL_SECS` | `30` | Cleanup interval for expired entries. |

---

## Cluster audit log

All control-plane operations (joins, leaves, ring changes, handoffs) are written to an audit log queryable via the admin API.

| Environment variable | Default | Description |
|---|---|---|
| `TSINK_CLUSTER_AUDIT_RETENTION_SECS` | `2592000` (30 days) | How long audit records are retained. |
| `TSINK_CLUSTER_AUDIT_MAX_LOG_BYTES` | `134217728` (128 MiB) | Maximum size of the audit log. |
| `TSINK_CLUSTER_AUDIT_MAX_QUERY_LIMIT` | `1000` | Maximum records returned per audit query. |

```bash
# Query recent audit records
curl http://node-1:9201/api/v1/admin/cluster/audit

# Export full audit log as NDJSON
curl http://node-1:9201/api/v1/admin/cluster/audit/export
```

---

## Configuration checklist

Before starting a production cluster:

- [ ] Every node has a unique, stable `--cluster-node-id`.
- [ ] `--cluster-bind` is reachable by all peers on the network.
- [ ] `--cluster-shards` is consistent across all nodes in the cluster.
- [ ] `--cluster-replication-factor` ≤ number of storage-capable nodes.
- [ ] Internal auth is configured: either `--cluster-internal-auth-token` / `--cluster-internal-auth-token-file` or the `--cluster-internal-mtls-*` flags.
- [ ] `--cluster-seeds` lists enough existing nodes for bootstrap (any subset works).
- [ ] Firewalls allow traffic on both the public `--listen` port and the internal `--cluster-bind` port.