tsink 0.10.2

A lightweight embedded time-series database with a straightforward API
# Monitoring & Observability

tsink exposes three built-in observability surfaces: health probes for Kubernetes liveness/readiness checks, a Prometheus-format self-instrumentation endpoint, and support bundles for ad-hoc diagnostics. All three are available without any extra configuration.

---

## Health probes

| Endpoint | Purpose |
|---|---|
| `GET /healthz` | Liveness probe — returns `ok` with HTTP 200 if the server process is running |
| `GET /ready` | Readiness probe — returns `ready` with HTTP 200 when the server is ready to serve traffic |

Both endpoints bypass authentication, respond with `Content-Type: text/plain`, and are safe to poll from infrastructure tools without bearer tokens.

**Kubernetes example**

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 9201
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 9201
  initialDelaySeconds: 5
  periodSeconds: 10
```
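Outside Kubernetes, the same two checks can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `matches` and `probe` are illustrative and not part of tsink, and it assumes the bodies are exactly `ok` and `ready` apart from surrounding whitespace.

```python
import urllib.error
import urllib.request

def matches(status: int, body: bytes, expected: str) -> bool:
    """True when the response is HTTP 200 and the trimmed body equals `expected`."""
    return status == 200 and body.decode("utf-8", errors="replace").strip() == expected

def probe(base_url: str, path: str, expected: str, timeout: float = 2.0) -> bool:
    """GET base_url + path; connection errors and non-2xx statuses count as unhealthy."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return matches(resp.status, resp.read(), expected)
    except (urllib.error.URLError, OSError):
        # Refused connection, timeout, or HTTP error status.
        return False
```

With the address used elsewhere in this document, `probe("http://127.0.0.1:9201", "/healthz", "ok")` would check liveness and `probe("http://127.0.0.1:9201", "/ready", "ready")` would check readiness.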

---

## Self-instrumentation endpoint

```
GET /metrics
```

Returns all internal metrics in [Prometheus text exposition format 0.0.4](https://prometheus.io/docs/instrumenting/exposition_formats/). The response is suitable for direct Prometheus scraping.

```bash
curl http://127.0.0.1:9201/metrics
```

When RBAC is enabled, this endpoint requires a token with the `metrics:read` permission. In unauthenticated deployments it is open.
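For ad-hoc scripts that read the endpoint without a Prometheus server, the text format is simple enough to parse by hand. This is a rough sketch rather than a complete parser: `parse_metrics` is a hypothetical helper, and escaped quotes inside label values are not handled.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_metrics(text: str) -> dict:
    """Map (metric_name, frozenset of (label, value) pairs) -> float sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        m = _SAMPLE.match(line)
        if not m:
            continue  # ignore anything that does not look like a sample
        name, raw_labels, value = m.groups()
        labels = frozenset(_LABEL.findall(raw_labels or ""))
        samples[(name, labels)] = float(value)
    return samples
```

Unlabeled metrics are then available as `samples[("tsink_uptime_seconds", frozenset())]`. For anything beyond quick diagnostics, an established Prometheus client library is likely the better choice.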

### Scraping with Prometheus

```yaml
scrape_configs:
  - job_name: tsink
    static_configs:
      - targets: ['127.0.0.1:9201']
    # If RBAC is enabled:
    # authorization:
    #   credentials: <service-account-token>
```

---

## Metric reference

All metrics use the `tsink_` prefix. The sections below enumerate every metric group emitted by the server.

### General

| Metric | Type | Description |
|---|---|---|
| `tsink_uptime_seconds` | gauge | Server uptime in seconds |
| `tsink_series_total` | gauge | Number of known metric series |

### Memory

| Metric | Type | Description |
|---|---|---|
| `tsink_memory_used_bytes` | gauge | Bytes counted against the configured memory budget |
| `tsink_memory_budget_bytes` | gauge | Configured memory budget |
| `tsink_memory_excluded_bytes` | gauge | Memory intentionally excluded from the budget |
| `tsink_memory_registry_bytes` | gauge | Budget bytes used by the in-memory series registry |
| `tsink_memory_metadata_cache_bytes` | gauge | Budget bytes used by metadata caches and indexes |
| `tsink_memory_persisted_index_bytes` | gauge | Budget bytes used by persisted chunk refs and timestamp indexes |
| `tsink_memory_persisted_mmap_bytes` | gauge | Budget bytes used by mmap-backed segment payloads |
| `tsink_memory_tombstone_bytes` | gauge | Budget bytes used by tombstone state |

### Write-Ahead Log (WAL)

| Metric | Type | Description |
|---|---|---|
| `tsink_wal_enabled` | gauge | WAL enabled (1) or disabled (0) |
| `tsink_wal_size_bytes` | gauge | WAL size on disk |
| `tsink_wal_segments` | gauge | WAL segment files present |
| `tsink_wal_active_segment` | gauge | Current WAL segment id |
| `tsink_wal_acknowledged_writes_durable` | gauge | Whether acknowledged writes are fsync-durable (1) or append-only (0) |
| `tsink_wal_highwater_segment` | gauge | Last appended WAL highwater segment |
| `tsink_wal_highwater_frame` | gauge | Last appended WAL highwater frame |
| `tsink_wal_durable_highwater_segment` | gauge | Last durable WAL highwater segment |
| `tsink_wal_durable_highwater_frame` | gauge | Last durable WAL highwater frame |
| `tsink_wal_replay_runs_total` | counter | WAL replay runs |
| `tsink_wal_replay_frames_total` | counter | WAL replayed frames |
| `tsink_wal_replay_series_definitions_total` | counter | WAL replayed series definitions |
| `tsink_wal_replay_sample_batches_total` | counter | WAL replayed sample batches |
| `tsink_wal_replay_points_total` | counter | WAL replayed points |
| `tsink_wal_replay_errors_total` | counter | WAL replay errors |
| `tsink_wal_replay_duration_nanoseconds_total` | counter | WAL replay runtime |
| `tsink_wal_append_series_definitions_total` | counter | WAL appended series definitions |
| `tsink_wal_append_sample_batches_total` | counter | WAL appended sample batches |
| `tsink_wal_append_points_total` | counter | WAL appended points |
| `tsink_wal_append_bytes_total` | counter | WAL appended bytes |
| `tsink_wal_append_errors_total` | counter | WAL append errors |
| `tsink_wal_resets_total` | counter | WAL resets |
| `tsink_wal_reset_errors_total` | counter | WAL reset errors |

### Flush pipeline

The flush pipeline moves active (in-memory) chunks into persisted segments and manages tier lifecycle.

| Metric | Type | Description |
|---|---|---|
| `tsink_flush_pipeline_runs_total` | counter | Flush pipeline runs |
| `tsink_flush_pipeline_success_total` | counter | Successful flush pipeline runs |
| `tsink_flush_pipeline_timeout_total` | counter | Flush pipeline write-timeout skips |
| `tsink_flush_pipeline_errors_total` | counter | Flush pipeline errors |
| `tsink_flush_pipeline_duration_nanoseconds_total` | counter | Flush pipeline runtime |
| `tsink_flush_active_runs_total` | counter | Active chunk flush runs |
| `tsink_flush_active_errors_total` | counter | Active chunk flush errors |
| `tsink_flush_active_series_total` | counter | Active series flushed into sealed chunks |
| `tsink_flush_active_chunks_total` | counter | Active chunks flushed |
| `tsink_flush_active_points_total` | counter | Active points flushed |
| `tsink_flush_persist_runs_total` | counter | Persist attempts |
| `tsink_flush_persist_success_total` | counter | Successful persist runs |
| `tsink_flush_persist_noop_total` | counter | Persist runs with no new chunks |
| `tsink_flush_persist_errors_total` | counter | Persist errors |
| `tsink_flush_persisted_series_total` | counter | Series persisted |
| `tsink_flush_persisted_chunks_total` | counter | Chunks persisted |
| `tsink_flush_persisted_points_total` | counter | Points persisted |
| `tsink_flush_persisted_segments_total` | counter | Segments emitted by persist |
| `tsink_flush_persist_duration_nanoseconds_total` | counter | Persist runtime |
| `tsink_flush_evicted_sealed_chunks_total` | counter | Sealed chunks evicted after persistence |
| `tsink_flush_tier_moves_total` | counter | Persisted segments moved across tiers |
| `tsink_flush_tier_move_errors_total` | counter | Tier-move errors |
| `tsink_flush_expired_segments_total` | counter | Segments expired by retention |
| `tsink_flush_hot_segments_visible` | gauge | Hot-tier persisted segments visible to queries |
| `tsink_flush_warm_segments_visible` | gauge | Warm-tier persisted segments visible to queries |
| `tsink_flush_cold_segments_visible` | gauge | Cold-tier persisted segments visible to queries |
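The `*_duration_nanoseconds_total` counters are cumulative, so a per-run average has to be derived from the difference between two scrapes. The sketch below assumes that the duration counter accumulates over the same runs that the matching `*_runs_total` counter counts, and it treats a counter going backwards as a process restart. The helper name and the sample values are made up for illustration.

```python
def mean_duration_ms(prev: dict, curr: dict, duration_key: str, runs_key: str):
    """Average milliseconds per run between two scrapes, or None if undefined."""
    d_runs = curr[runs_key] - prev[runs_key]
    d_nanos = curr[duration_key] - prev[duration_key]
    if d_runs <= 0 or d_nanos < 0:
        # No new runs, or a counter went backwards (likely a restart).
        return None
    return d_nanos / d_runs / 1e6

RUNS = "tsink_flush_pipeline_runs_total"
NANOS = "tsink_flush_pipeline_duration_nanoseconds_total"

# Two illustrative scrapes: 4 new runs that took 40 ms in total.
prev = {RUNS: 10, NANOS: 50_000_000}
curr = {RUNS: 14, NANOS: 90_000_000}
print(mean_duration_ms(prev, curr, NANOS, RUNS))  # 10.0
```

The same function applies to the other runtime counters in this reference, for example the compaction and persist pairs, under the same assumption.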

### Compaction

| Metric | Type | Description |
|---|---|---|
| `tsink_compaction_runs_total` | counter | Compaction invocations |
| `tsink_compaction_success_total` | counter | Compaction runs that rewrote segments |
| `tsink_compaction_noop_total` | counter | Compaction runs with no rewrite |
| `tsink_compaction_errors_total` | counter | Compaction errors |
| `tsink_compaction_source_segments_total` | counter | Source segments considered |
| `tsink_compaction_output_segments_total` | counter | Output segments emitted |
| `tsink_compaction_source_chunks_total` | counter | Source chunks considered |
| `tsink_compaction_output_chunks_total` | counter | Output chunks emitted |
| `tsink_compaction_source_points_total` | counter | Source points considered |
| `tsink_compaction_output_points_total` | counter | Output points emitted |
| `tsink_compaction_duration_nanoseconds_total` | counter | Compaction runtime |

### Query

| Metric | Type | Description |
|---|---|---|
| `tsink_query_select_calls_total` | counter | `select` calls |
| `tsink_query_select_errors_total` | counter | `select` errors |
| `tsink_query_select_duration_nanoseconds_total` | counter | `select` runtime |
| `tsink_query_select_points_returned_total` | counter | Points returned by `select` |
| `tsink_query_select_with_options_calls_total` | counter | `select_with_options` calls |
| `tsink_query_select_with_options_errors_total` | counter | `select_with_options` errors |
| `tsink_query_select_with_options_duration_nanoseconds_total` | counter | `select_with_options` runtime |
| `tsink_query_select_with_options_points_returned_total` | counter | Points returned by `select_with_options` |
| `tsink_query_select_all_calls_total` | counter | `select_all` calls |
| `tsink_query_select_all_errors_total` | counter | `select_all` errors |
| `tsink_query_select_all_duration_nanoseconds_total` | counter | `select_all` runtime |
| `tsink_query_select_all_series_returned_total` | counter | Series returned by `select_all` |
| `tsink_query_select_all_points_returned_total` | counter | Points returned by `select_all` |
| `tsink_query_select_series_calls_total` | counter | `select_series` calls |
| `tsink_query_select_series_errors_total` | counter | `select_series` errors |
| `tsink_query_select_series_duration_nanoseconds_total` | counter | `select_series` runtime |
| `tsink_query_select_series_returned_total` | counter | Series returned by `select_series` |
| `tsink_query_merge_path_queries_total` | counter | Series collections using merge path |
| `tsink_query_merge_path_shard_snapshots_total` | counter | Merge-path shard snapshots taken |
| `tsink_query_merge_path_shard_snapshot_wait_nanoseconds_total` | counter | Merge-path time waiting for shard read locks |
| `tsink_query_merge_path_shard_snapshot_hold_nanoseconds_total` | counter | Merge-path time holding shard read locks |
| `tsink_query_append_sort_path_queries_total` | counter | Series collections using append/sort path |
| `tsink_query_hot_only_plans_total` | counter | Query plans satisfied from the hot tier only |
| `tsink_query_warm_tier_plans_total` | counter | Query plans that include the warm tier |
| `tsink_query_cold_tier_plans_total` | counter | Query plans that include the cold tier |
| `tsink_query_hot_tier_persisted_chunks_read_total` | counter | Hot-tier persisted chunks decoded |
| `tsink_query_warm_tier_persisted_chunks_read_total` | counter | Warm-tier persisted chunks decoded |
| `tsink_query_cold_tier_persisted_chunks_read_total` | counter | Cold-tier persisted chunks decoded |
| `tsink_query_warm_tier_fetch_duration_nanoseconds_total` | counter | Warm-tier chunk fetch and decode time |
| `tsink_query_cold_tier_fetch_duration_nanoseconds_total` | counter | Cold-tier chunk fetch and decode time |
| `tsink_query_rollup_plans_total` | counter | Queries that used persisted rollup artifacts |
| `tsink_query_partial_rollup_plans_total` | counter | Queries that mixed rollups with raw tail reads |
| `tsink_query_rollup_points_read_total` | counter | Persisted rollup points read |

### Remote (object-store) storage

| Metric | Type | Description |
|---|---|---|
| `tsink_remote_storage_accessible` | gauge | `1` when object-store access is healthy |
| `tsink_remote_storage_compute_only` | gauge | `1` when running in compute-only mode |
| `tsink_remote_storage_mirror_hot_segments` | gauge | `1` when hot segments are mirrored to object store |
| `tsink_remote_storage_catalog_refreshes_total` | counter | Remote catalog refreshes attempted |
| `tsink_remote_storage_catalog_refresh_errors_total` | counter | Remote catalog refresh errors |
| `tsink_remote_storage_catalog_refresh_consecutive_failures` | gauge | Consecutive catalog refresh failures |
| `tsink_remote_storage_catalog_refresh_backoff_active` | gauge | `1` when retry backoff is active |

### Rollups

| Metric | Type | Description |
|---|---|---|
| `tsink_rollup_worker_runs_total` | counter | Rollup maintenance passes attempted |
| `tsink_rollup_worker_success_total` | counter | Successful maintenance passes |
| `tsink_rollup_worker_errors_total` | counter | Maintenance passes that errored |
| `tsink_rollup_policy_runs_total` | counter | Individual rollup policy evaluations |
| `tsink_rollup_buckets_materialized_total` | counter | Rollup buckets materialized |
| `tsink_rollup_points_materialized_total` | counter | Rollup points materialized |
| `tsink_rollup_last_run_duration_nanoseconds` | gauge | Duration of the most recent maintenance pass |
| `tsink_rollup_policy_status{policy,metric,aggregation,kind}` | gauge | Per-policy coverage, lag, and timing |

`tsink_rollup_policy_status` is emitted once per configured rollup policy with a `kind` label for each dimension:

| `kind` value | Meaning |
|---|---|
| `matched_series` | Series matched by the policy selector |
| `materialized_series` | Series with persisted rollup artifacts |
| `interval` | Configured rollup interval in milliseconds |
| `materialized_through` | Latest materialized timestamp (unix ms) |
| `lag` | Materialization lag in milliseconds |
| `last_run_duration_nanos` | Duration of the last policy run |
| `last_run_started_at_ms` | Start time of the last policy run |
| `last_run_completed_at_ms` | Completion time of the last policy run |
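Because each policy is spread across several samples that differ only in `kind`, a consumer usually wants to regroup them per policy first. The following sketch does that and then lists policies whose `lag` exceeds a chosen threshold. The helper names, policy names, and metric names are illustrative, and the input is assumed to be already parsed into `(labels, value)` pairs.

```python
from collections import defaultdict

def group_policy_status(samples):
    """Group tsink_rollup_policy_status samples into {policy key: {kind: value}}."""
    by_policy = defaultdict(dict)
    for labels, value in samples:
        key = (labels.get("policy", ""), labels.get("metric", ""), labels.get("aggregation", ""))
        by_policy[key][labels["kind"]] = value
    return dict(by_policy)

def lagging_policies(by_policy, max_lag_ms):
    """Policy keys whose `lag` value exceeds max_lag_ms, sorted for stable output."""
    return sorted(key for key, kinds in by_policy.items() if kinds.get("lag", 0) > max_lag_ms)

# Illustrative samples; the policy and metric names are made up.
samples = [
    ({"policy": "cpu_5m", "metric": "cpu_usage", "aggregation": "avg", "kind": "interval"}, 300_000),
    ({"policy": "cpu_5m", "metric": "cpu_usage", "aggregation": "avg", "kind": "lag"}, 900_000),
    ({"policy": "mem_1h", "metric": "mem_used", "aggregation": "max", "kind": "lag"}, 60_000),
]
print(lagging_policies(group_policy_status(samples), max_lag_ms=600_000))
# [('cpu_5m', 'cpu_usage', 'avg')]
```

A threshold expressed as a multiple of each policy's own `interval` value may be more useful than a fixed number when policies run at very different cadences.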

### Rules engine

| Metric | Type | Description |
|---|---|---|
| `tsink_rules_scheduler_runs_total` | counter | Rules scheduler ticks attempted |
| `tsink_rules_scheduler_skipped_not_leader_total` | counter | Ticks skipped — not cluster leader |
| `tsink_rules_scheduler_skipped_inflight_total` | counter | Ticks skipped — previous run still in flight |
| `tsink_rules_evaluated_total` | counter | Rules evaluated |
| `tsink_rules_evaluation_failures_total` | counter | Rules evaluations that errored |
| `tsink_rules_recording_rows_written_total` | counter | Samples written by recording rules |
| `tsink_rules_scheduler_active` | gauge | `1` when this node is the active rules scheduler |
| `tsink_rules_configured{kind}` | gauge | Configured groups, rules, pending alerts, firing alerts |
| `tsink_rules_runtime_limits{kind}` | gauge | Scheduler tick interval and per-evaluation limits |

### Exemplars

| Metric | Type | Description |
|---|---|---|
| `tsink_exemplars_accepted_total` | counter | Exemplars accepted |
| `tsink_exemplars_rejected_total` | counter | Exemplars rejected |
| `tsink_exemplars_dropped_total` | counter | Exemplars dropped due to retention guardrails |
| `tsink_exemplars_query_requests_total` | counter | Exemplar query requests served |
| `tsink_exemplars_query_series_total` | counter | Exemplar series returned by queries |
| `tsink_exemplars_query_results_total` | counter | Exemplars returned by queries |
| `tsink_exemplars_stored{kind}` | gauge | Currently stored series and exemplars |
| `tsink_exemplar_limits{kind}` | gauge | Configured exemplar quotas and guardrails |

### Ingest protocols

#### Prometheus remote write

| Metric | Type | Description |
|---|---|---|
| `tsink_prometheus_payload_feature_enabled{payload}` | gauge | Feature flag per payload kind (metadata, exemplar, histogram) |
| `tsink_prometheus_payload_accepted_total{payload}` | counter | Payloads accepted per kind |
| `tsink_prometheus_payload_rejected_total{payload}` | counter | Payloads rejected per kind |

#### OTLP

| Metric | Type | Description |
|---|---|---|
| `tsink_otlp_metrics_enabled` | gauge | OTLP metrics ingest enabled (1) or not (0) |
| `tsink_otlp_requests_total{outcome}` | counter | OTLP `/v1/metrics` requests, labeled `accepted` or `rejected` |
| `tsink_otlp_data_points_total{kind,outcome}` | counter | OTLP data points by metric kind and outcome |
| `tsink_otlp_exemplars_total{outcome}` | counter | OTLP exemplars by outcome |
| `tsink_otlp_supported_shape{shape}` | gauge | `1` for each supported OTLP metric shape |

#### Legacy ingest (StatsD, Graphite, InfluxDB)

| Metric | Type | Description |
|---|---|---|
| `tsink_legacy_ingest_enabled{adapter}` | gauge | `1` for each enabled legacy adapter |

### Admission control

Write and read admission are tracked independently.

#### Write admission

| Metric | Type | Description |
|---|---|---|
| `tsink_write_admission_rejections_total` | counter | Total public write admission rejections |
| `tsink_write_admission_request_slot_rejections_total` | counter | Rejections due to concurrency saturation |
| `tsink_write_admission_row_budget_rejections_total` | counter | Rejections due to in-flight row saturation |
| `tsink_write_admission_oversize_rows_rejections_total` | counter | Rejections for requests exceeding the row budget |
| `tsink_write_admission_acquire_wait_nanoseconds_total` | counter | Wait time acquiring admission permits |
| `tsink_write_admission_active_requests` | gauge | Active requests holding admission slots |
| `tsink_write_admission_active_rows` | gauge | Active rows reserved against admission budget |

#### Read admission

| Metric | Type | Description |
|---|---|---|
| `tsink_read_admission_rejections_total` | counter | Total public read admission rejections |
| `tsink_read_admission_request_slot_rejections_total` | counter | Rejections due to concurrency saturation |
| `tsink_read_admission_query_budget_rejections_total` | counter | Rejections due to in-flight query saturation |
| `tsink_read_admission_oversize_queries_rejections_total` | counter | Rejections for requests exceeding the query budget |
| `tsink_read_admission_acquire_wait_nanoseconds_total` | counter | Wait time acquiring admission permits |
| `tsink_read_admission_active_requests` | gauge | Active requests holding admission slots |

#### Per-tenant admission

Tenant admission metrics carry a `tenant` label when multi-tenancy is enabled.

### Edge sync

Edge sync ships writes queued on edge/source nodes upstream. Metrics are emitted per role.

| Metric | Type | Description |
|---|---|---|
| `tsink_edge_sync_enabled{role}` | gauge | Source and accept mode enablement |
| `tsink_edge_sync_queue{kind}` | gauge | Backlog entries, bytes, log size, oldest age, and retention window |
| `tsink_edge_sync_events_total{event}` | counter | Enqueue, replay, and retention-drop events |
| `tsink_edge_sync_replayed_rows_total` | counter | Rows replayed upstream |
| `tsink_edge_sync_accept_dedupe{...}` | gauge/counter | Accept-side idempotency window state |

### Cluster — write routing

| Metric | Type | Description |
|---|---|---|
| `tsink_cluster_write_requests_total` | counter | Write requests routed through the coordinator |
| `tsink_cluster_write_local_rows_total` | counter | Rows inserted locally |
| `tsink_cluster_write_routed_rows_total` | counter | Rows forwarded to remote owners |
| `tsink_cluster_write_routed_batches_total` | counter | Remote write batches sent |
| `tsink_cluster_write_failures_total` | counter | Write routing failures |
| `tsink_cluster_write_shard_rows_total{shard}` | counter | Rows routed per shard |
| `tsink_cluster_write_peer_routed_rows_total{node_id}` | counter | Rows routed per peer |
| `tsink_cluster_write_peer_routed_batches_total{node_id}` | counter | Batches routed per peer |
| `tsink_cluster_write_remote_requests_total{node_id}` | counter | Remote write RPC requests per peer |
| `tsink_cluster_write_remote_failures_total{node_id}` | counter | Remote write RPC failures per peer |
| `tsink_cluster_write_remote_request_duration_seconds{node_id,le}` | histogram | Remote write RPC latency per peer |
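Prometheus histograms are exposed as cumulative `_bucket` series keyed by the `le` upper bound, so latency percentiles have to be estimated from bucket counts. In PromQL this is what `histogram_quantile` does over `rate(..._bucket[5m])`. The sketch below shows the same linear-interpolation idea in plain Python for a single peer; the bucket bounds and counts are invented for illustration and say nothing about the buckets tsink actually uses.

```python
import math

def histogram_quantile(q: float, buckets) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    The input must include the +Inf bucket, as Prometheus histograms always do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Illustrative cumulative counts for one node_id.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(round(histogram_quantile(0.9, buckets), 3))  # 0.5
```

The estimate is only as precise as the bucket layout: every observation is assumed to be spread evenly inside its bucket.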

### Cluster — deduplication

| Metric | Type | Description |
|---|---|---|
| `tsink_cluster_dedupe_requests_total` | counter | Idempotency key checks |
| `tsink_cluster_dedupe_accepted_total` | counter | Requests accepted for dedupe tracking |
| `tsink_cluster_dedupe_duplicates_total` | counter | Requests deduplicated |
| `tsink_cluster_dedupe_inflight_rejections_total` | counter | Conflicts while key is in-flight |
| `tsink_cluster_dedupe_commits_total` | counter | Dedupe marker commits |
| `tsink_cluster_dedupe_aborts_total` | counter | Dedupe marker aborts |
| `tsink_cluster_dedupe_cleanup_runs_total` | counter | Cleanup runs |
| `tsink_cluster_dedupe_expired_keys_total` | counter | Keys expired by TTL |
| `tsink_cluster_dedupe_evicted_keys_total` | counter | Keys evicted by size bound |
| `tsink_cluster_dedupe_persistence_failures_total` | counter | Dedupe marker persistence failures |
| `tsink_cluster_dedupe_active_keys` | gauge | Active dedupe keys in window |
| `tsink_cluster_dedupe_inflight_keys` | gauge | In-flight dedupe keys |
| `tsink_cluster_dedupe_log_bytes` | gauge | Durable dedupe marker log size on disk |

### Cluster — read fanout

| Metric | Type | Description |
|---|---|---|
| `tsink_cluster_fanout_requests_total` | counter | Read fanout requests |
| `tsink_cluster_fanout_failures_total` | counter | Read fanout failures |
| `tsink_cluster_fanout_duration_nanoseconds_total` | counter | Fanout execution time |
| `tsink_cluster_fanout_remote_requests_total` | counter | Remote RPC requests |
| `tsink_cluster_fanout_remote_failures_total` | counter | Remote RPC failures |
| `tsink_cluster_fanout_resource_rejections_total` | counter | Guardrail rejections |
| `tsink_cluster_fanout_resource_acquire_wait_nanoseconds_total` | counter | Wait time for query permits |
| `tsink_cluster_fanout_resource_active_queries` | gauge | Active distributed reads holding permits |
| `tsink_cluster_fanout_resource_active_merged_points` | gauge | Merged-point budget in use |
| `tsink_cluster_fanout_operation_requests_total{operation}` | counter | Fanout requests per operation |
| `tsink_cluster_fanout_operation_failures_total{operation}` | counter | Fanout failures per operation |
| `tsink_cluster_fanout_remote_requests_by_peer_total{node_id,operation}` | counter | RPC requests per peer and operation |
| `tsink_cluster_fanout_remote_failures_by_peer_total{node_id,operation}` | counter | RPC failures per peer and operation |
| `tsink_cluster_fanout_remote_request_duration_seconds{node_id,operation,le}` | histogram | RPC latency per peer and operation |

### Cluster — read planner

| Metric | Type | Description |
|---|---|---|
| `tsink_cluster_read_planner_requests_total` | counter | Read planner requests |
| `tsink_cluster_read_planner_candidate_shards_total` | counter | Candidate shards evaluated |
| `tsink_cluster_read_planner_pruned_shards_total` | counter | Shards pruned |
| `tsink_cluster_read_planner_local_shards_total` | counter | Local shards selected |
| `tsink_cluster_read_planner_remote_targets_total` | counter | Remote peer targets selected |
| `tsink_cluster_read_planner_remote_shards_total` | counter | Remote shard assignments |
| `tsink_cluster_read_planner_operation_requests_total{operation}` | counter | Planner requests per operation |
| `tsink_cluster_read_planner_operation_candidate_shards_total{operation}` | counter | Candidate shards per operation |
| `tsink_cluster_read_planner_operation_pruned_shards_total{operation}` | counter | Pruned shards per operation |
| `tsink_cluster_read_planner_operation_remote_targets_total{operation}` | counter | Remote targets per operation |

### Cluster — hinted handoff (outbox)

| Metric | Type | Description |
|---|---|---|
| `tsink_cluster_outbox_enqueued_total` | counter | Replica batches enqueued |
| `tsink_cluster_outbox_enqueue_rejected_total` | counter | Enqueue rejections due to quota limits |
| `tsink_cluster_outbox_persistence_failures_total` | counter | Outbox persistence failures |
| `tsink_cluster_outbox_replay_attempts_total` | counter | Outbox replay attempts |
| `tsink_cluster_outbox_replay_success_total` | counter | Successful replays |
| `tsink_cluster_outbox_replay_failures_total` | counter | Failed replays |
| `tsink_cluster_outbox_queued_entries` | gauge | Pending outbox entries |
| `tsink_cluster_outbox_queued_bytes` | gauge | Pending outbox bytes |
| `tsink_cluster_outbox_log_bytes` | gauge | Outbox log file size |
| `tsink_cluster_outbox_stale_records` | gauge | Stale log records pending cleanup |
| `tsink_cluster_outbox_stalled_peers` | gauge | Peers currently stalled |
| `tsink_cluster_outbox_stalled_oldest_age_milliseconds` | gauge | Oldest stalled peer backlog age |
| `tsink_cluster_outbox_cleanup_runs_total` | counter | Cleanup worker iterations |
| `tsink_cluster_outbox_cleanup_compactions_total` | counter | Cleanup-triggered compactions |
| `tsink_cluster_outbox_cleanup_reclaimed_bytes_total` | counter | Bytes reclaimed by cleanup |
| `tsink_cluster_outbox_cleanup_failures_total` | counter | Cleanup failures |
| `tsink_cluster_outbox_stalled_alerts_total` | counter | Stalled-peer alert transitions |
| `tsink_cluster_outbox_peer_queued_entries{node_id}` | gauge | Pending entries per peer |
| `tsink_cluster_outbox_peer_queued_bytes{node_id}` | gauge | Pending bytes per peer |
| `tsink_cluster_outbox_peer_stalled{node_id}` | gauge | `1` when a peer is stalled |

### Cluster — control plane

| Metric | Type | Description |
|---|---|---|
| `tsink_cluster_control_current_term` | gauge | Current consensus term |
| `tsink_cluster_control_commit_index` | gauge | Current committed control-log index |
| `tsink_cluster_control_leader_stale` | gauge | `1` if the current leader is considered stale |
| `tsink_cluster_control_leader_contact_age_ms` | gauge | Milliseconds since last leader heartbeat |
| `tsink_cluster_control_suspect_peers` | gauge | Peers currently marked suspect |
| `tsink_cluster_control_dead_peers` | gauge | Peers currently marked dead |
| `tsink_cluster_control_peer_status{node_id,status}` | gauge | One-hot liveness status per peer (`unknown`, `healthy`, `suspect`, `dead`) |
| `tsink_cluster_control_peer_last_success_unix_ms{node_id}` | gauge | Last successful heartbeat per peer |
| `tsink_cluster_control_peer_last_failure_unix_ms{node_id}` | gauge | Last failed attempt per peer |
| `tsink_cluster_control_peer_consecutive_failures{node_id}` | gauge | Consecutive failures per peer |
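Since `tsink_cluster_control_peer_status` is one-hot, each peer has one sample per status and only the current status carries the value `1`. A small sketch of collapsing that into a single status per peer follows; the helper name and node ids are illustrative, and the input is assumed to be already parsed into `(labels, value)` pairs.

```python
def peer_statuses(samples):
    """Collapse one-hot tsink_cluster_control_peer_status samples to {node_id: status}."""
    statuses = {}
    for labels, value in samples:
        if value == 1:  # exactly one status per node is expected to be 1
            statuses[labels["node_id"]] = labels["status"]
    return statuses

# Illustrative samples for two peers.
samples = [
    ({"node_id": "node-a", "status": "healthy"}, 1),
    ({"node_id": "node-a", "status": "suspect"}, 0),
    ({"node_id": "node-a", "status": "dead"}, 0),
    ({"node_id": "node-b", "status": "healthy"}, 0),
    ({"node_id": "node-b", "status": "dead"}, 1),
]
statuses = peer_statuses(samples)
print(statuses)  # {'node-a': 'healthy', 'node-b': 'dead'}
print([node for node, status in statuses.items() if status != "healthy"])  # ['node-b']
```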

### Security & RBAC

| Metric | Type | Description |
|---|---|---|
| `tsink_secret_rotation_generation{target}` | gauge | Current rotation generation per secret target |
| `tsink_secret_rotation_reload_total{target}` | counter | Secret reload operations per target |
| `tsink_secret_rotation_total{target}` | counter | Secret rotation operations per target |
| `tsink_secret_rotation_failures_total{target}` | counter | Failures per target |
| `tsink_secret_rotation_last_success_unix_ms{target}` | gauge | Last successful reload or rotation (unix ms) |
| `tsink_secret_rotation_last_failure_unix_ms{target}` | gauge | Last failure (unix ms) |
| `tsink_secret_rotation_previous_credential_active{target}` | gauge | `1` during overlap grace window |
| `tsink_rbac_service_accounts_total` | gauge | Configured RBAC service accounts |
| `tsink_rbac_service_accounts_disabled` | gauge | Disabled RBAC service accounts |
| `tsink_rbac_service_accounts_last_rotated_unix_ms` | gauge | Latest service-account rotation timestamp |

### Usage accounting

| Metric | Type | Description |
|---|---|---|
| `tsink_usage_ledger_records_total` | gauge | Durable or in-memory tenant usage ledger records |
| `tsink_usage_ledger_tenants_total` | gauge | Distinct tenants in the usage ledger |
| `tsink_usage_ledger_storage_reconciliations_total` | counter | Storage reconciliation snapshots recorded |
| `tsink_usage_ledger_durable` | gauge | `1` when the ledger is backed by a durable on-disk store |

---

## Alert recommendations

The following metrics are good starting points for alerts:

| Concern | Metric | Threshold guidance |
|---|---|---|
| WAL append errors | `tsink_wal_append_errors_total` | Rate > 0 for 1 minute |
| WAL replay errors | `tsink_wal_replay_errors_total` | Any increase |
| Flush errors | `tsink_flush_pipeline_errors_total` | Rate > 0 for 2 minutes |
| Persist errors | `tsink_flush_persist_errors_total` | Rate > 0 |
| Compaction errors | `tsink_compaction_errors_total` | Rate > 0 |
| Memory pressure | `tsink_memory_used_bytes / tsink_memory_budget_bytes` | > 0.90 |
| Write admission rejections | `tsink_write_admission_rejections_total` | Sustained rate > 0 |
| Read admission rejections | `tsink_read_admission_rejections_total` | Sustained rate > 0 |
| Object-store inaccessible | `tsink_remote_storage_accessible` | == 0 for 2 minutes |
| Remote catalog backoff | `tsink_remote_storage_catalog_refresh_consecutive_failures` | > 3 |
| Dead cluster peers | `tsink_cluster_control_dead_peers` | > 0 |
| Leader stale | `tsink_cluster_control_leader_stale` | == 1 for 1 minute |
| Hinted handoff stalled | `tsink_cluster_outbox_stalled_peers` | > 0 |
| Secret rotation failure | `tsink_secret_rotation_failures_total` | Any increase |
| Rollup lag | `tsink_rollup_policy_status{kind="lag"}` | > acceptable lag threshold |
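The gauge-based rows of this table can be checked against a single scrape, which is handy for smoke tests before real alerting rules exist. The sketch below covers only those rows and ignores the "for N minutes" durations, which need several scrapes. The function name and alert labels are hypothetical, and the input is assumed to be a plain `{metric name: value}` dictionary of unlabeled metrics.

```python
def gauge_alerts(m: dict) -> list:
    """Return the gauge-based concerns from the table that trip on one scrape."""
    fired = []
    budget = m.get("tsink_memory_budget_bytes", 0)
    if budget > 0 and m.get("tsink_memory_used_bytes", 0) / budget > 0.90:
        fired.append("memory_pressure")
    if m.get("tsink_remote_storage_accessible", 1) == 0:
        fired.append("object_store_inaccessible")
    if m.get("tsink_remote_storage_catalog_refresh_consecutive_failures", 0) > 3:
        fired.append("remote_catalog_backoff")
    if m.get("tsink_cluster_control_dead_peers", 0) > 0:
        fired.append("dead_cluster_peers")
    if m.get("tsink_cluster_control_leader_stale", 0) == 1:
        fired.append("leader_stale")
    if m.get("tsink_cluster_outbox_stalled_peers", 0) > 0:
        fired.append("hinted_handoff_stalled")
    return fired

# Illustrative snapshot: 95% of the memory budget in use and one dead peer.
snapshot = {
    "tsink_memory_used_bytes": 950,
    "tsink_memory_budget_bytes": 1000,
    "tsink_cluster_control_dead_peers": 1,
}
print(gauge_alerts(snapshot))  # ['memory_pressure', 'dead_cluster_peers']
```

Missing metrics are treated as healthy here, which suits single-node deployments where the cluster and remote-storage groups may not be reported; a stricter check could flag absent metrics instead.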

---

## Support bundles

```
GET /api/v1/admin/support_bundle?tenant=<id>
```

Downloads a bounded JSON diagnostic snapshot for a single tenant. Requires the `admin:read` RBAC permission. The response is returned as a downloadable `.json` file with a `Content-Disposition: attachment` header.

The bundle includes:

| Section | Contents |
|---|---|
| `statusTsdb` | TSDB status endpoint snapshot |
| `usage` | Tenant usage accounting summary |
| `rbacState` | Live RBAC roles, service accounts, and OIDC mappings |
| `rbacAudit` | Last 50 RBAC decision and reload audit entries |
| `securityState` | Secret rotation and TLS state |
| `clusterAudit` | Last 50 cluster admin mutation log entries |
| `clusterHandoff` | Cluster handoff progress |
| `clusterRepair` | Cluster repair progress |
| `clusterRebalance` | Shard rebalance progress |
| `rules` | Rules engine status |
| `rollups` | Rollup policy freshness and coverage |

```bash
curl -H "Authorization: Bearer $TOKEN" \
  'http://127.0.0.1:9201/api/v1/admin/support_bundle?tenant=default' \
  -o tsink-support-bundle.json
```
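Once downloaded, a bundle can be sanity-checked before it is attached to a ticket. The sketch below assumes that the section names in the table appear as top-level keys of the JSON document, which this page does not state explicitly; the helper names are illustrative.

```python
import json

# Section names from the table above. Treating them as top-level JSON keys
# is an assumption of this sketch.
EXPECTED_SECTIONS = [
    "statusTsdb", "usage", "rbacState", "rbacAudit", "securityState",
    "clusterAudit", "clusterHandoff", "clusterRepair", "clusterRebalance",
    "rules", "rollups",
]

def load_bundle(path: str) -> dict:
    """Read a support bundle saved by the curl command above."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def missing_sections(bundle: dict) -> list:
    """Names of expected sections that are absent from the bundle."""
    return [name for name in EXPECTED_SECTIONS if name not in bundle]
```

For the file written by the `curl` example, `missing_sections(load_bundle("tsink-support-bundle.json"))` would return an empty list when every section is present.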