# Architecture Overview
This document describes the high-level design of tsink, the component interactions, and the end-to-end data flow for writes and reads.
---
## Table of Contents
1. [Deployment modes](#1-deployment-modes)
2. [Repository layout](#2-repository-layout)
3. [Public API surface](#3-public-api-surface)
4. [Storage engine overview](#4-storage-engine-overview)
5. [Write path](#5-write-path)
6. [Read path](#6-read-path)
7. [Write-Ahead Log (WAL)](#7-write-ahead-log-wal)
8. [Chunk encoding](#8-chunk-encoding)
9. [Segments and on-disk index](#9-segments-and-on-disk-index)
10. [Series registry](#10-series-registry)
11. [Compaction](#11-compaction)
12. [Tiered storage and lifecycle](#12-tiered-storage-and-lifecycle)
13. [Visibility and MVCC fence](#13-visibility-and-mvcc-fence)
14. [PromQL engine](#14-promql-engine)
15. [Rollups and downsampling](#15-rollups-and-downsampling)
16. [Server mode](#16-server-mode)
17. [Clustering and replication](#17-clustering-and-replication)
18. [Security model](#18-security-model)
19. [Observability and background runtime](#19-observability-and-background-runtime)
20. [Configuration reference summary](#20-configuration-reference-summary)
---
## 1. Deployment modes
tsink ships as three interconnected Rust crates in a single workspace.
| Crate | Role |
|---|---|
| `tsink` (root `src/`) | Embeddable library — the complete storage engine with no async runtime dependency. |
| `tsink-server` (`crates/tsink-server/`) | Standalone HTTP server binary with protocol ingest, clustering, and RBAC. |
| `tsink-uniffi` (`crates/tsink-uniffi/`) | UniFFI-generated Python bindings that wrap the library crate. |
All three share the same engine code. The server and Python bindings call into the same `Storage` trait that application code uses directly.
---
## 2. Repository layout
```text
src/
lib.rs — public re-exports
storage.rs — Storage trait, StorageBuilder, public DTOs
async.rs — runtime-agnostic async façade (AsyncStorage)
value.rs — Value enum, NativeHistogram
label.rs — Label, canonical identity, stable hash
wal.rs — WalSyncMode, WalReplayMode
mmap.rs — PlatformMmap (memmap2 wrapper)
concurrency.rs — Semaphore (write gate)
cgroup.rs — container-aware CPU/memory detection
validation.rs — metric/label boundary checks
query_aggregation.rs — Aggregation enum, downsampling
query_matcher.rs — CompiledSeriesMatcher (regex → postings)
query_selection.rs — PreparedSeriesSelection execution
error.rs — StorageError
engine/ — all internal engine modules (see §4)
promql/ — PromQL lexer, parser, AST, evaluator
crates/
tsink-server/src/ — HTTP server, protocol adapters, cluster logic
tsink-uniffi/src/ — UniFFI bindings
```
---
## 3. Public API surface
### Core interface — `Storage` trait (`src/storage.rs`)
All interaction with the engine goes through the `Storage` trait. Key methods:
| Method | Purpose |
|---|---|
| `insert_rows(&[Row])` | Synchronous batch write. |
| `insert_rows_with_result` | Batch write returning per-row `WriteResult`. |
| `select(metric, labels, start, end)` | Range read for a single series. |
| `select_with_options` | Range read with `QueryOptions` (aggregation, downsampling, pagination). |
| `select_all` | Bulk range read across all series matching a selector. |
| `list_metrics` | Enumerate stored metric names. |
| `select_series` | Enumerate series matching a label selector with time-range filtering. |
| `scan_series_rows` / `scan_metric_rows` | Paginated scan for large result sets. |
| `delete_series` | Tombstone-based deletion with optional time range. |
| `snapshot` | Atomic directory snapshot. |
| `add_rollup_policy` / `run_rollup_pass` | Manage persistent downsampling policies. |
| `observability_snapshot` | Current internal counters and health state. |
### Key data types
- **`Row`** — `(metric: String, labels: Vec<Label>, data_point: DataPoint)`.
- **`DataPoint`** — `(timestamp: i64, value: Value)`.
- **`Value`** — `F64(f64) | I64(i64) | U64(u64) | Bool(bool) | Bytes(Vec<u8>) | Histogram(Box<NativeHistogram>)`.
- **`Label`** — `(name: String, value: String)`.
- **`TimestampPrecision`** — `Nanoseconds | Microseconds | Milliseconds | Seconds`.
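Putting the core types together, a hypothetical write might look like the sketch below. The struct field names are assumptions based on the type summaries above, and the `storage` handle is assumed to come from `StorageBuilder`:

```rust
use tsink::{DataPoint, Label, Row, Storage, StorageError, Value};

// Sketch only: field names follow the type summaries above, not a
// verified API; adjust to the actual definitions in src/storage.rs.
fn record_request(storage: &dyn Storage) -> Result<(), StorageError> {
    let row = Row {
        metric: "http_requests_total".to_string(),
        labels: vec![Label {
            name: "method".to_string(),
            value: "GET".to_string(),
        }],
        data_point: DataPoint {
            timestamp: 1_700_000_000_000_000_000, // ns, per the default precision
            value: Value::F64(1.0),
        },
    };
    storage.insert_rows(&[row])
}
```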
### Async façade — `AsyncStorage` (`src/async.rs`)
`AsyncStorage` wraps `Storage` without requiring Tokio. It spawns OS threads internally:
- One dedicated write worker receives `WriteCommand` messages over an `async-channel`.
- A pool of read workers (sized by `cgroup::default_workers_limit()`) handles `ReadCommand` messages concurrently.
- Uses `parking_lot::Mutex` and `std::thread` — suitable for embedding in any async runtime.
---
## 4. Storage engine overview
The engine lives in `src/engine/` and is structured in three layers.
### Layer 1 — Foundation libraries
Pure data-structure and codec modules with no cross-cutting concerns:
| Module | Role |
|---|---|
| `chunk` | In-memory chunk builder and sealed `EncodedChunk`. |
| `encoder` | Timestamp and value codec selection and execution. |
| `segment` | On-disk segment format, reader, writer, index postings. |
| `series` | Series identity, interned string dictionaries, Roaring Bitmap postings. |
| `wal` | WAL frame format, segmented log files, replay stream. |
| `index` | In-memory `ChunkIndex` and `PostingsIndex`. |
| `query` | Core query primitives: `QueryPlan`, `ChunkSeriesCursor`, chunk decoding. |
| `compactor` | L0/L1/L2 compaction planning and execution. |
### Layer 2 — Shared shell
State containers and cross-cutting services that the owner modules operate on:
| Module | Role |
|---|---|
| `engine.rs` | `ChunkStorage` — the top-level struct with five state buckets. |
| `state` | All internal state type definitions. |
| `visibility` | MVCC fence and per-series visibility summaries. |
| `runtime` | Background thread management and supervision. |
| `metrics` / `observability` | Lockless counters and snapshot generation. |
| `shard_routing` | Maps series IDs to one of 64 shards; deadlock-safe lock ordering. |
| `metadata_lookup` | Distributed-shard series resolution. |
| `process_lock` | Exclusive-open file lock. |
### Layer 3 — Owner modules
Higher-level operations that drive the shell:
| Module | Role |
|---|---|
| `bootstrap` | Six-phase startup: discovery → recovery → WAL open → hydrate → replay → finalize. |
| `ingest` + `ingest_pipeline` | Write pipeline: resolve → prepare → apply → commit. |
| `write_buffer` | Active-head flush policies. |
| `lifecycle` | Flush, persist, replay, and shutdown orchestration. |
| `query_exec` + `query_read` | High-level query execution and low-level read path. |
| `deletion` | Tombstone-based series deletion. |
| `rollups` | Persistent downsampling materialization. |
| `maintenance` + `tiering` + `registry_catalog` | Tier lifecycle, catalog validation. |
### State buckets in `ChunkStorage`
To contain lock contention, `ChunkStorage` organises mutable state into five separate structs:
| Bucket | Contents |
|---|---|
| `CatalogState` | `SeriesRegistry`, 64 `write_txn_shards` (`Mutex`), pending series IDs, persistence lock. |
| `ChunkBufferState` | 64-shard active builders, 64-shard sealed chunks, persisted-chunk watermarks, monotone chunk sequence counter. |
| `VisibilityState` | Tombstones, materialized series, per-series visibility summaries, flush visibility RwLock. |
| `PersistedStorageState` | `PersistedIndexState`, WAL handle, compactors, tiered-storage config, pending segment diff. |
| `RuntimeConfigState` | Retention/partition/skew windows, memory budget, cardinality limit, WAL size limit, concurrency bounds. |
---
## 5. Write path
A single `insert_rows` call traverses six steps before returning to the caller.
```text
insert_rows(&[Row])
│
▼
[1] Semaphore acquire (max_writers permits, write_timeout deadline)
│
▼
[2] Shard lock acquisition (minimal set of write_txn_shards, ascending order)
│
▼
[3] WriteResolver — validate labels, resolve or mint SeriesId,
│ append WAL SeriesDefinitionFrame for new series
▼
[4] WritePreparer — normalise timestamps, reject future skew,
│ reject below retention floor,
│ admission control: memory budget, cardinality limit, WAL size
▼
[5] WriteApplier — append ChunkPoint to ActivePartitionHead (ChunkBuilder),
│ append WAL SamplesBatchFrame
▼
[6] WriteCommitter — seal full chunks, publish visibility,
update tombstone/rollup cross-references,
notify background flush/compaction workers
```
Each `Row` carries a `metric`, `Vec<Label>`, and a `DataPoint`. The 64-shard split means that independent series in the same batch can be admitted concurrently.
**Admission control** in the prepare phase enforces:
- `memory_budget_bytes` — total in-memory chunk bytes.
- `cardinality_limit` — maximum unique series count.
- WAL size limit — maximum outstanding un-flushed WAL bytes.
Admission failures surface as per-row `WriteResult` values from `insert_rows_with_result`; the plain `insert_rows` silently drops rejected rows.
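These rules compose as simple budget checks against current engine usage. A minimal sketch with locally defined types (illustrative names, not tsink's internals):

```rust
// Illustrative admission check mirroring the three documented limits.
struct AdmissionLimits {
    memory_budget_bytes: u64,
    cardinality_limit: usize,
    wal_size_limit_bytes: u64,
}

struct EngineUsage {
    in_memory_chunk_bytes: u64,
    unique_series: usize,
    unflushed_wal_bytes: u64,
}

fn admit(
    limits: &AdmissionLimits,
    usage: &EngineUsage,
    batch_bytes: u64,
    new_series: usize,
) -> Result<(), &'static str> {
    if usage.in_memory_chunk_bytes + batch_bytes > limits.memory_budget_bytes {
        return Err("memory budget exceeded");
    }
    if usage.unique_series + new_series > limits.cardinality_limit {
        return Err("cardinality limit exceeded");
    }
    if usage.unflushed_wal_bytes + batch_bytes > limits.wal_size_limit_bytes {
        return Err("WAL size limit exceeded");
    }
    Ok(())
}
```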
---
## 6. Read path
```text
select / select_all / select_with_options
│
▼
[1] candidate_planner — resolve metric + label matchers → Vec<SeriesId>
│ via PostingsIndex (postings bitmaps, regex → finite literal set)
▼
[2] TieredQueryPlan — compute which tiers to include (hot-only if within hot window,
│ else hot + warm + cold)
▼
[3] QuerySnapshotContext — take visibility-fenced snapshot:
│ acquire flush_visibility_lock (read)
│ snapshot active chunks, sealed chunks, persisted index
│ release lock
▼
[4] Per-series read — for each SeriesId:
│ a. Scan active ChunkBuilders (in-memory, unsorted tail)
│ b. Scan sealed EncodedChunks (flushed from active, pending segment write)
│ c. Scan PersistedIndexState (mmap'd segment files on disk, hot tier)
│ d. Optionally scan warm/cold tiers (object store or additional paths)
▼
[5] Merge and decode — ChunkSeriesCursor binary-searches each chunk's
│ TimestampSearchIndex, decodes only the needed range,
│ merge-sorts streams from all sources
▼
[6] Aggregation / downsampling — optional bucket aggregation via query_aggregation
│
▼
[7] Return SeriesPoints
```
The `flush_visibility_lock` is the consistency fence: writers hold a write lock during flush publication; readers hold a shared read lock for the duration of snapshot capture, guaranteeing a consistent view.
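The merge in step 5 is a standard k-way merge over per-source sorted streams. A simplified stand-in using a binary heap, operating on already-decoded points (the real `ChunkSeriesCursor` merges lazily from encoded chunks and the compaction path also deduplicates overlapping samples):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge k already-sorted (timestamp, value) streams into one ascending
/// stream: a simplified stand-in for merging active, sealed, and
/// persisted chunk sources.
fn merge_sorted(sources: &[Vec<(i64, f64)>]) -> Vec<(i64, f64)> {
    // Min-heap of (next timestamp, source index, position within source).
    let mut heap: BinaryHeap<Reverse<(i64, usize, usize)>> = BinaryHeap::new();
    for (src, points) in sources.iter().enumerate() {
        if let Some(&(ts, _)) = points.first() {
            heap.push(Reverse((ts, src, 0)));
        }
    }
    let mut out = Vec::with_capacity(sources.iter().map(Vec::len).sum());
    while let Some(Reverse((ts, src, idx))) = heap.pop() {
        out.push((ts, sources[src][idx].1));
        if let Some(&(next_ts, _)) = sources[src].get(idx + 1) {
            heap.push(Reverse((next_ts, src, idx + 1)));
        }
    }
    out
}
```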
---
## 7. Write-Ahead Log (WAL)
The WAL lives under `<data_path>/wal/` and is managed by the `FramedWal` struct.
### Frame wire format
```text
Offset Size Field
0 4 Magic = b"TSFR"
4 1 Frame type (1 = SeriesDef, 2 = Samples)
5 4 CRC-32 of payload
9 4 Payload length (max 64 MiB)
13 11 Sequence number (monotone u64 + padding)
24 N Payload
```
**`SeriesDefinitionFrame`** — written once per new series: `(series_id, metric, labels)`.
**`SamplesBatchFrame`** — written per chunk flush: `(series_id, lane, timestamp_codec_id, value_codec_id, point_count, base_timestamp, encoded_timestamps, encoded_values)`.
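A sketch of header encoding against the wire layout above. Offsets and sizes come from the table; little-endian byte order is an assumption (the table fixes layout, not endianness), and `crc32fast::hash` stands in for whichever CRC-32 routine the crate uses:

```rust
/// Encode the 24-byte frame header described above (sketch only).
fn encode_frame_header(frame_type: u8, payload: &[u8], sequence: u64) -> [u8; 24] {
    assert!(payload.len() <= 64 * 1024 * 1024, "payload exceeds 64 MiB cap");
    let mut header = [0u8; 24];
    header[0..4].copy_from_slice(b"TSFR"); // magic
    header[4] = frame_type; // 1 = SeriesDef, 2 = Samples
    header[5..9].copy_from_slice(&crc32fast::hash(payload).to_le_bytes());
    header[9..13].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    // u64 sequence; bytes 21..24 stay zero (padding), so payload starts at 24.
    header[13..21].copy_from_slice(&sequence.to_le_bytes());
    header
}
```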
### Durability modes
| Mode | Behaviour |
|---|---|
| `WalSyncMode::PerAppend` | `fsync` on every frame append. Crash-safe; throughput-limited. |
| `WalSyncMode::Periodic(d)` | Background `fsync` every duration `d`. Higher throughput; up to `d` of data can be lost on hard crash. |
### Replay modes
| Mode | Behaviour |
|---|---|
| `WalReplayMode::Strict` | Abort if any WAL segment is corrupted or unreadable. |
| `WalReplayMode::Salvage` | Skip damaged segments and continue replay. |
### High-water mark
`WalHighWatermark { segment: u64, frame: u64 }` tracks how far replay has been committed. Each persisted segment stores the WAL high-water mark at the time it was written, so recovery knows which WAL frames to replay and which to skip.
The published high-water file (`wal.published`) uses magic `b"TSHW"`.
---
## 8. Chunk encoding
### Chunk lifecycle
```text
ChunkBuilder (active)
│ accumulates ChunkPoints in 64-point frozen blocks + mutable tail
│ sealed when point count reaches DEFAULT_CHUNK_POINTS (2048)
▼
EncodedChunk (sealed)
│ raw points cleared; only encoded payload survives
│ held in sealed_chunks until background flush
▼
SegmentWriter → persisted segment file
```
### Timestamp codecs
| Codec | Best for |
|---|---|
| `FixedStepRle` | Constant-interval series (e.g. scrape-interval metrics). |
| `DeltaVarint` | Variable-interval series with small deltas. |
| `DeltaOfDeltaBitpack` | Gorilla-style double-delta with bit-packing for irregular series. |
The encoder auto-selects the best codec via `choose_best_timestamp_codec` after inspecting the actual delta distribution.
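A simplified stand-in for that selection logic; the real heuristics and thresholds in `choose_best_timestamp_codec` are not shown in this document:

```rust
/// Illustrative codec choice from the delta distribution (sketch).
#[derive(Debug, PartialEq)]
enum TimestampCodec {
    FixedStepRle,
    DeltaVarint,
    DeltaOfDeltaBitpack,
}

fn choose_timestamp_codec(timestamps: &[i64]) -> TimestampCodec {
    let deltas: Vec<i64> = timestamps.windows(2).map(|w| w[1] - w[0]).collect();
    if deltas.is_empty() || deltas.iter().all(|&d| d == deltas[0]) {
        // Constant interval, e.g. scrape-interval metrics.
        return TimestampCodec::FixedStepRle;
    }
    if deltas.iter().all(|&d| (0..=u32::MAX as i64).contains(&d)) {
        // Small, always-forward deltas compress well as varints.
        return TimestampCodec::DeltaVarint;
    }
    // Irregular series: double-delta with bit-packing.
    TimestampCodec::DeltaOfDeltaBitpack
}
```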
### Value codecs
| Codec | Encoding |
|---|---|
| `GorillaXorF64` | XOR floating-point compression (Gorilla paper). |
| `ZigZagDeltaBitpackI64` | Signed integer delta + bit-packing. |
| `DeltaBitpackU64` | Unsigned integer delta + bit-packing. |
| `ConstantRle` | Run-length encoding for constant-value series. |
| `BoolBitpack` | 1-bit packing for boolean streams. |
| `BytesDeltaBlock` | Block delta encoding for byte-string payloads. |
### Timestamp search index
Each sealed chunk builds a `TimestampSearchIndex` — an array of anchor entries every 64 points. Queries binary-search this index to seek directly into the encoded payload without full decompression, giving O(log N) range access.
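The seek itself is a binary search over the anchor array. A minimal sketch, assuming one anchor per 64-point block:

```rust
/// One anchor per 64-point block: the timestamp of the block's first
/// point plus the decode offset to start from (sketch of the idea).
struct Anchor {
    first_timestamp: i64,
    point_offset: usize,
}

/// Find where decoding must begin for a range query starting at `query_start`.
fn seek_start(anchors: &[Anchor], query_start: i64) -> usize {
    if anchors.is_empty() {
        return 0;
    }
    // First anchor strictly past the target, then back up one block so
    // decoding begins at or before `query_start`.
    let idx = anchors.partition_point(|a| a.first_timestamp <= query_start);
    anchors[idx.saturating_sub(1)].point_offset
}
```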
---
## 9. Segments and on-disk index
### Directory structure
```text
<data_path>/
lane_numeric/ — float/int/bool/histogram chunks
l0/ — freshly flushed segments
l1/ — first compaction level
l2/ — final compaction level
lane_blob/ — bytes-value chunks (same level layout)
wal/ — WAL segment files
series_index.bin — binary RIDX v2 series registry snapshot
series_index.delta.d/ — incremental series registry delta checkpoints
series_index.catalog.json — fingerprint catalog for fast registry reload
tombstones.json.store/ — sharded binary tombstone store (256 shards)
.rollups/ — rollup state JSON (policies.json, state.json)
.process.lock — exclusive-open process lock
```
### Segment manifest
Each segment directory contains:
- Encoded chunk files (one per series × value lane).
- A `SegmentManifest` (`segment_id, level, chunk/point/series counts, min/max timestamps, wal_highwater`).
- A `SegmentPostingsIndex` for label/metric lookups within the segment.
- Per-file checksums; corrupted segments are quarantined automatically.
### In-memory index
`ChunkIndex` — a flat sorted `Vec<ChunkIndexEntry>`, binary-searched for time-range queries. Each entry stores `(series_id, min_ts, max_ts, chunk_offset, chunk_len, point_count, lane, codec IDs, level)`.
`PostingsIndex` — a `BTreeMap<label_name → BTreeMap<label_value → BTreeSet<SeriesId>>>` used to resolve label-matcher queries to candidate series lists.
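Matcher resolution is then set intersection over those nested maps. A sketch for equality matchers only (the real path first expands regex matchers to finite literal sets, per the read path in §6):

```rust
use std::collections::{BTreeMap, BTreeSet};

type SeriesId = u64;
type Postings = BTreeMap<String, BTreeMap<String, BTreeSet<SeriesId>>>;

/// Intersect the postings sets of all equality matchers (sketch).
fn resolve(postings: &Postings, matchers: &[(&str, &str)]) -> BTreeSet<SeriesId> {
    let mut result: Option<BTreeSet<SeriesId>> = None;
    for (name, value) in matchers {
        let set = postings
            .get(*name)
            .and_then(|values| values.get(*value))
            .cloned()
            .unwrap_or_default();
        result = Some(match result {
            Some(acc) => acc.intersection(&set).cloned().collect(),
            None => set,
        });
    }
    result.unwrap_or_default()
}
```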
---
## 10. Series registry
The registry maps `(metric, labels)` pairs to stable numeric `SeriesId` values and maintains postings for fast label-based lookup.
### Internals
- **64-shard split** — `series_shards`, `metric_postings_shards`, `label_postings_shards`, `all_series_shards`; each shard is protected by an independent `RwLock`, so operations that land on different shards never contend.
- **Interned string dictionaries** — `metric_dict`, `label_name_dict`, `label_value_dict` store unique strings once; `DictionaryId (u32)` is used internally. Reduces memory for high-cardinality label sets.
- **Roaring Bitmaps** — postings lists are `RoaringTreemap` — compact, fast union/intersection for label-matcher resolution.
- **`SeriesValueFamily`** — inferred at first write (`F64 | I64 | U64 | Bool | Blob | Histogram`); used to select the appropriate codec path for all subsequent writes.
### Series identity
`canonical_series_identity(metric, labels)` produces a deterministic binary key: labels are sorted lexicographically, then length-prefixed and concatenated with the metric name. `stable_series_identity_hash` applies xxh64 to this key for O(1) shard dispatch.
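A sketch of that construction. The length-prefix width, byte order, and exact field order are assumptions, and std's `DefaultHasher` stands in for xxh64:

```rust
use std::hash::{Hash, Hasher};

/// Sketch: sort labels, length-prefix each component, append the metric.
fn canonical_series_identity(metric: &str, labels: &[(String, String)]) -> Vec<u8> {
    let mut sorted: Vec<(String, String)> = labels.to_vec();
    sorted.sort(); // lexicographic by (name, value)
    let mut key = Vec::new();
    for (name, value) in &sorted {
        key.extend_from_slice(&(name.len() as u32).to_le_bytes());
        key.extend_from_slice(name.as_bytes());
        key.extend_from_slice(&(value.len() as u32).to_le_bytes());
        key.extend_from_slice(value.as_bytes());
    }
    key.extend_from_slice(&(metric.len() as u32).to_le_bytes());
    key.extend_from_slice(metric.as_bytes());
    key
}

/// Hash the key for shard dispatch (shard = hash % 64).
fn identity_hash(key: &[u8]) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}
```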
### Persistence
The registry is persisted as a binary RIDX v2 file (`series_index.bin`) with incremental delta checkpoints in `series_index.delta.d/`. On startup, `registry_catalog.rs` validates per-segment xxh64 fingerprints against the catalog; a mismatch triggers a full registry rebuild from segment postings files.
---
## 11. Compaction
tsink uses a three-level LSM-inspired compaction scheme.
```text
Active chunks
│ background flush (250 ms default)
▼
L0 segments ← trigger: 4 L0 segments → compact → L1
▼
L1 segments ← trigger: 4 L1 segments → compact → L2
▼
L2 segments (cold-compacted, largest, longest time span)
```
Each compaction pass:
1. **Planning** (`compactor/planning.rs`) — selects a window of up to 8 source segments at the triggering level.
2. **Execution** (`compactor/execution.rs`) — merge-sorts all chunks, deduplicates overlapping samples, re-encodes with optimal codecs, writes new target-level segment.
3. **Atomic replacement** — the replacement is staged in `.compaction-replacements/` and renamed into place. If the process crashes mid-replacement, `finalize_pending_compaction_replacements` cleans up the staging area at next startup.
Compaction is tombstone-aware: tombstoned time ranges are excluded from the output segment, physically removing deleted data from disk.
---
## 12. Tiered storage and lifecycle
### Tiers
| Tier | Backing store | Read behaviour |
|---|---|---|
| Hot | Local filesystem (`lane_numeric/`, `lane_blob/`) | Direct mmap reads. |
| Warm | Configurable local path or object store | Loaded on demand. |
| Cold | Object store | Loaded on demand; full fetch before query. |
### Tier lifecycle
A background maintenance pass computes a `PostFlushMaintenancePolicyPlan` from `RetentionTierPolicy`:
- **`expired_actions`** — delete segments that fall outside the retention window.
- **`rewrite_actions`** — trim segment boundaries at time-split points.
- **`move_actions`** — promote segments from Hot → Warm or Warm → Cold when they cross the configured retention cutoffs.
### Query tier selection
`TieredQueryPlan { start, end, include_warm, include_cold }` is computed from the query time range vs. tier cutoffs. A query entirely within the hot window skips all remote tier I/O.
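A sketch of that cutoff test, using the documented field names; the cutoff arithmetic here is illustrative:

```rust
/// Which tiers a query must touch: a range entirely inside the hot
/// window needs no warm or cold I/O (field names from the text above).
struct TieredQueryPlan {
    start: i64,
    end: i64,
    include_warm: bool,
    include_cold: bool,
}

fn plan_tiers(start: i64, end: i64, hot_cutoff: i64, warm_cutoff: i64) -> TieredQueryPlan {
    TieredQueryPlan {
        start,
        end,
        include_warm: start < hot_cutoff,  // range reaches past the hot tier
        include_cold: start < warm_cutoff, // range reaches past the warm tier
    }
}
```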
---
## 13. Visibility and MVCC fence
### The problem
A flush publishes many new segment entries and visibility summaries at once. Without a fence, a reader could see a partially-published flush: some new segments visible, others not.
### The solution
`flush_visibility_lock` is a `RwLock<()>`:
- **Writers** acquire the exclusive `write` lock for the entire duration of flush publication.
- **Readers** acquire a shared `read` lock during snapshot capture (step 3 of the read path).
This is a brief, bounded mutual-exclusion window — not a per-operation lock. Normal write ingest does not touch this lock.
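The discipline in miniature, as a sketch with placeholder types rather than tsink's own:

```rust
use std::sync::RwLock;

/// Flush publication holds the write lock; snapshot capture holds a
/// shared read lock. Neither is held during normal ingest or decoding.
struct Fence {
    flush_visibility_lock: RwLock<()>,
}

impl Fence {
    fn publish_flush(&self, publish: impl FnOnce()) {
        let _guard = self.flush_visibility_lock.write().unwrap();
        publish(); // all new segments become visible atomically
    }

    fn capture_snapshot<T>(&self, snapshot: impl FnOnce() -> T) -> T {
        let _guard = self.flush_visibility_lock.read().unwrap();
        snapshot() // sees either all of a flush or none of it
    }
}
```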
### Per-series visibility summaries
`SeriesVisibilitySummary` tracks up to `SERIES_VISIBILITY_SUMMARY_MAX_RANGES = 32` disjoint time ranges per series. This handles series with gaps (e.g. intermittently reporting nodes) efficiently: readers can skip time ranges that are known to be empty without scanning segment files.
Fields per summary:
- `latest_visible_timestamp` — highest timestamp ingested for this series.
- `latest_bounded_visible_timestamp` — highest timestamp whose containing chunk has been sealed and persisted.
- `exhaustive_floor_inclusive` — data below this timestamp is complete (no more writes expected).
- `truncated_before_floor` — all data before this point has been tombstoned.
- `ranges` — list of `(min, max)` tuples defining continuous data windows.
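A sketch of the resulting skip check, using only the `ranges` field from the summary above:

```rust
/// A query window that intersects none of the known data ranges can be
/// answered empty without touching segment files (illustrative subset).
struct VisibilitySummary {
    ranges: Vec<(i64, i64)>, // disjoint (min, max) windows
}

fn overlaps_any(summary: &VisibilitySummary, start: i64, end: i64) -> bool {
    summary
        .ranges
        .iter()
        .any(|&(min, max)| start <= max && end >= min)
}
```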
---
## 14. PromQL engine
tsink includes a full in-process PromQL evaluator in `src/promql/`.
### Components
| Component | Role |
|---|---|
| `lexer.rs` | Hand-written lexer; produces `Vec<Token>`. |
| `parser.rs` | Recursive-descent Pratt parser; produces `ast::Expr`. |
| `ast.rs` | Complete AST: `VectorSelector`, `MatrixSelector`, `SubqueryExpr`, `BinaryExpr`, `AggregationExpr`, `CallExpr`, `AtModifier`, all operators. |
| `eval/` | Evaluator; implements all PromQL functions and aggregations. |
### Evaluator
`Engine { storage, default_lookback_delta, timestamp_units_per_second }`:
- **`instant_query(query, time)`** — evaluates at a single timestamp.
- **`range_query(query, start, end, step)`** — vectorized evaluation over a step range.
A `PrefetchCache` is keyed by metric name and populated via `select_all` before evaluation, amortising storage round-trips across all eval timestamps in a range query.
Supported features: `rate`, `irate`, `increase`, `delta`, `histogram_quantile`, all standard aggregations (`sum`, `avg`, `count`, `min`, `max`, `topk`, `bottomk`, `quantile`), binary operators with `on`/`ignoring`/`group_left`/`group_right`, subqueries, `@` modifier.
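Hypothetical usage of the two documented entry points; parameter types, timestamp units, and error handling below are assumptions, not a verified API:

```rust
use tsink::promql::Engine;

// Sketch only: method names come from the summary above.
fn run_queries(engine: &Engine) -> Result<(), Box<dyn std::error::Error>> {
    let now: i64 = 1_700_000_000_000_000_000; // ns since epoch

    // Single evaluation timestamp.
    let _instant = engine.instant_query("sum(rate(http_requests_total[5m]))", now)?;

    // Vectorized evaluation: one-hour window, 60 s step (both in ns).
    let _range = engine.range_query(
        "histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))",
        now - 3_600_000_000_000,
        now,
        60_000_000_000,
    )?;
    Ok(())
}
```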
---
## 15. Rollups and downsampling
Rollup policies define persistent, automated downsampling of raw data.
### Policy structure
`RollupPolicy`:
- `policy_id` — stable identifier.
- Source selector — metric name + optional label matchers.
- `aggregation` — `Sum | Min | Max | Avg | Count | ...`.
- `resolution` — output bucket width.
- Retention rules — when to expire rolled-up data.
### Materialization
Materialized series are stored under synthetic metric names of the form `__tsink_rollup__:<policy_id>:<original_metric>`. A rollup worker runs every 5 seconds, reads raw data for each pending policy/source pair, computes aggregated buckets, and writes the result back via a normal `insert_rows` call.
Rollup state is checkpointed as JSON files in `.rollups/` with per-policy per-source progress cursors. When a tombstone is written to a source series, any rolled-up data covering the deleted range is invalidated and scheduled for re-materialization.
---
## 16. Server mode
The `tsink-server` crate wraps the library engine in a full HTTP server.
### Protocol support
| Protocol | Endpoint |
|---|---|
| Prometheus Remote Write | `POST /api/v1/write` (Snappy-framed protobuf) |
| Prometheus Remote Read | `POST /api/v1/read` |
| Prometheus Text Exposition | `POST /api/v1/import/prometheus` |
| InfluxDB Line Protocol v1/v2 | `POST /write`, `POST /api/v2/write` |
| OTLP HTTP | `POST /v1/metrics` (protobuf; gauges, sums, histograms, summaries) |
| StatsD | UDP (counter, gauge, timer, set) |
| Graphite | TCP (plaintext) |
| PromQL instant query | `GET /api/v1/query` |
| PromQL range query | `GET /api/v1/query_range` |
### Connection handling
- `MAX_CONNECTIONS = 1024` simultaneous TCP connections.
- `KEEP_ALIVE_TIMEOUT = 30 s`, `HANDSHAKE_TIMEOUT = 10 s`, `SHUTDOWN_GRACE_PERIOD = 10 s`.
- TLS via `tokio-rustls`; HTTP/1.1 framing.
### Multi-tenancy
Tenant identity is propagated via the `x-tsink-tenant` HTTP header and injected as a `__tsink_tenant__` synthetic label on all writes and reads. Per-tenant admission semaphores govern ingest, query, metadata, and retention concurrency independently.
---
## 17. Clustering and replication
Clustering is an opt-in layer over the single-node server. All cluster logic lives in `crates/tsink-server/src/cluster/`.
### Topology
```text
Client
│ writes / reads
▼
Any cluster node (HTTP)
│
├── WriteRouter — splits batch by consistent hash ring → local + remote shards
│ │
│ ├── local write — directly into local ChunkStorage
│ └── remote write — parallel RPC fanout to replica owners
│
└── ReadFanoutExecutor — fans query to owner shards, merges responses
```
### Consistent hash ring (`cluster/ring.rs`)
`ShardRing { shard_count, replication_factor, virtual_nodes_per_node = 128 }`:
- Each node contributes 128 virtual tokens distributed by xxh64 hash around the ring.
- Each shard is pre-assigned to `replication_factor` owner nodes.
- `stable_series_identity_hash(metric, labels)` → deterministic shard selection — the same series always lands on the same shard regardless of which node receives the write.
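A minimal ring lookup under those rules, with std's `DefaultHasher` standing in for xxh64:

```rust
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash64<T: Hash>(value: &T) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

/// Each node contributes `virtual_nodes_per_node` tokens on the ring.
fn build_ring(nodes: &[&str], virtual_nodes_per_node: u32) -> BTreeMap<u64, String> {
    let mut ring = BTreeMap::new();
    for node in nodes {
        for i in 0..virtual_nodes_per_node {
            ring.insert(hash64(&(node, i)), node.to_string());
        }
    }
    ring
}

/// A key is owned by the first token at or after its hash, wrapping around.
fn owner(ring: &BTreeMap<u64, String>, series_hash: u64) -> &str {
    ring.range(series_hash..)
        .next()
        .or_else(|| ring.iter().next()) // wrap around the ring
        .map(|(_, node)| node.as_str())
        .expect("ring is non-empty")
}
```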
### Write routing (`cluster/replication.rs`)
`WriteRouter` decomposes an incoming batch into a `WritePlan { local_rows, remote_batches }`. Remote batches are sent in parallel via `tokio::task::JoinSet` (max 32 in-flight batches, max 1024 rows per batch). Failed remote writes are enqueued to `HintedHandoffOutbox` for retry.
**Consistency levels** — `ClusterWriteConsistency { One | Quorum | All }` — control how many replicas must acknowledge before the write is considered successful.
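The conventional reading of those three levels, as a sketch (the document does not show tsink's exact ack arithmetic):

```rust
enum ClusterWriteConsistency {
    One,    // any single replica ack suffices
    Quorum, // majority of replicas
    All,    // every replica
}

fn required_acks(level: &ClusterWriteConsistency, replication_factor: usize) -> usize {
    match level {
        ClusterWriteConsistency::One => 1,
        ClusterWriteConsistency::Quorum => replication_factor / 2 + 1,
        ClusterWriteConsistency::All => replication_factor,
    }
}
```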
### Control plane (`cluster/consensus.rs`)
A custom Raft-like log replication layer manages cluster membership and control state:
- Tick interval: 2 s; max append entries: 64; snapshot every 128 entries.
- Leader election with suspect timeout: 6 s; dead timeout: 20 s; leader lease: 6 s.
- Control log persisted as NDJSON to `tsink-control-log`.
### Anti-entropy repair (`cluster/repair.rs`)
`DigestExchangeRuntime` runs background anti-entropy:
- Every 30 s, each node exchanges window digest hashes with its peers for up to 64 shards per tick.
- A digest mismatch triggers an `InternalRepairBackfillRequest` to re-replicate missing data.
- Rebalance moves shard ownership between nodes at 5 s intervals with configurable rows-per-tick limits.
### Deduplication
`DedupeWindowStore` provides idempotency-key deduplication to prevent re-ingestion of retried writes during hinted handoff re-delivery.
---
## 18. Security model
### TLS
All external HTTP and peer-to-peer traffic can be protected by TLS. Certificates are loaded via `tokio-rustls` and support hot rotation without server restart via `SecurityManager`.
Internal cluster traffic uses mTLS with a dedicated cluster CA to authenticate peer nodes.
### Authentication
Two methods are supported:
1. **Bearer tokens** — loaded from a file or via an exec command; verified on every request.
2. **OIDC JWT** — RS256, ES256, or HS256 tokens issued by a configured identity provider; JWKS fetched at startup with 60 s clock skew tolerance.
### RBAC (`src/rbac.rs`)
`RbacRegistry` assigns roles to principals. Actions are `Read | Write`; resource kinds are `Tenant | Admin | System`. Service accounts carry 32-byte random tokens (base64url-encoded, HMAC-backed) with configurable rotation schedules.
An in-memory ring buffer retains the last 256 RBAC decisions for audit review.
### Secret rotation
`SecurityManager` supports runtime rotation of:
- Public and admin auth tokens.
- Cluster internal auth token.
- TLS listener certificate/key.
- Cluster mTLS certificate/key.
A 300 s overlap grace period (configurable) allows in-flight requests to complete with the old credential before it is invalidated.
---
## 19. Observability and background runtime
### Self-instrumentation
`StorageObservabilityCounters` provides lockless `AtomicU64` counters grouped by subsystem:
| Counter group | Tracks |
|---|---|
| `WalObservabilityCounters` | Replay runs, frames, points, errors, durations. |
| `FlushObservabilityCounters` | Flush runs, backpressure events, active chunk stats. |
| `CompactionObservabilityCounters` | Runs, source/output segment/chunk/point counts. |
| `QueryObservabilityCounters` | Hot/warm/cold tier plans, chunks read, fetch durations. |
| `RollupObservabilityCounters` | Materialization runs, points produced. |
| `RetentionObservabilityCounters` | Expired series, segments, points. |
| `HealthObservabilityState` | `fail_fast_triggered`, last error strings. |
`observability_snapshot()` converts raw counters + runtime state into the public `StorageObservabilitySnapshot` DTO.
The server exposes `/metrics` in Prometheus exposition format and `/healthz` / `/ready` Kubernetes-compatible probes.
### Background threads
The runtime (`src/engine/runtime.rs`) manages three background workers:
| Worker | Interval | Role |
|---|---|---|
| Flush | 250 ms | Flushes `BackgroundEligible` or `BackgroundBounded` active chunks to the sealed queue. |
| Compaction | 5 s (+ triggered) | Runs `Compactor::compact_once` for numeric and blob lanes. |
| Rollup | 5 s | Executes pending rollup materializations. |
Workers check three lifecycle states (`STORAGE_OPEN / CLOSING / CLOSED`) and stop themselves when the engine begins shutdown. If `background_fail_fast = true` (default), a background worker panic sets `fail_fast_triggered` and causes subsequent writes to return an error.
---
## 20. Configuration reference summary
Key engine knobs and their defaults:
| Option | Default | Notes |
|---|---|---|
| `timestamp_precision` | Nanoseconds | One of `Nanoseconds`, `Microseconds`, `Milliseconds`, `Seconds`. |
| `retention_window` | 14 days | Data older than this is eligible for expiry. |
| `future_skew_window` | 15 min | Writes with future timestamps beyond this are rejected. |
| `partition_window` | 1 hour | Time-bucket width for active partition heads. |
| `max_active_partition_heads_per_series` | 8 | Maximum concurrent open partitions per series. |
| `max_writers` | cgroup CPU count | Write parallelism gate (`Semaphore` permits). |
| `write_timeout` | 30 s | Maximum wait time for a write permit. |
| `memory_budget_bytes` | `u64::MAX` (no limit) | Total in-memory chunk budget. |
| `cardinality_limit` | `usize::MAX` (no limit) | Maximum unique series count. |
| `chunk_points` | 2048 | Points per sealed chunk. |
| `compaction_interval` | 5 s | Background compaction frequency. |
| `flush_interval` | 250 ms | Background flush frequency. |
| `background_fail_fast` | `true` | Worker panic triggers engine failure mode. |
Container-aware defaults: `cgroup.rs` reads `/sys/fs/cgroup/cpu.max` and `/sys/fs/cgroup/memory.max` to detect CPU and memory limits. The `TSINK_MAX_CPUS` environment variable overrides the detected CPU count.
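A minimal sketch of the CPU side of that detection, assuming the cgroup v2 `cpu.max` format of `<quota> <period>` or `max <period>` (the real `cgroup.rs` may handle additional cases):

```rust
use std::fs;

/// Effective CPU count from cgroup v2, with the documented env override.
fn detect_cpu_limit() -> Option<usize> {
    if let Ok(n) = std::env::var("TSINK_MAX_CPUS") {
        return n.parse().ok(); // explicit override wins
    }
    let raw = fs::read_to_string("/sys/fs/cgroup/cpu.max").ok()?;
    let mut parts = raw.split_whitespace();
    let quota = parts.next()?;
    let period: f64 = parts.next()?.parse().ok()?;
    if quota == "max" {
        return None; // unlimited: fall back to the host CPU count
    }
    let quota: f64 = quota.parse().ok()?;
    Some((quota / period).ceil() as usize)
}
```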