# Amaters Server — Operations Manual
> **Scope:** This manual covers the internals that operators need for day-to-day production operations: lifecycle phases, configuration fields, health probes, metrics, cluster management, snapshots, certificate rotation, and rolling upgrades.
>
> For signal handling (SIGHUP reload, SIGTERM), log rotation, and CLI commands see [operations-guide.md](operations-guide.md).
> For build instructions, filesystem layout, TLS preparation, systemd units, Docker, and Kubernetes see [deployment-guide.md](deployment-guide.md).
> For the full configuration field listing see [configuration-reference.md](configuration-reference.md).
---
## 1. Server Lifecycle
### Core Types
`Server` (defined in `src/server.rs`) holds:
| Field | Type | Description |
| --- | --- | --- |
| `config` | `Arc<ServerConfig>` | Shared, atomically reloadable config |
| `storage` | `Option<Arc<Storage>>` | Optional persistent or in-memory store |
| `network` | `Option<NetworkService>` | Optional network listener |
| `raft` | `Option<Arc<RaftNode>>` | Cluster consensus node (cluster feature) |
| `shutdown` | `ShutdownCoordinator` | Phase-aware shutdown sequencer |
| `health` | `HealthChecker` | Probe aggregator |
| `metrics` | `MetricsCollector` | Atomic counter/histogram store |
| `active_queries` | `Arc<AtomicUsize>` | In-flight query counter |
`Storage` is an enum with two variants:
- `Memory(MemoryStorage)` — ephemeral, zero-config, for testing
- `Lsm(LsmTreeStorage)` — durable, WAL-backed, default production engine
### Shutdown Sequence
The server drains in five ordered steps:
1. Stop accepting new connections
2. Stop the network service
3. Drain in-flight connections (5 s maximum, 100 ms poll interval)
4. Flush storage
5. Close storage
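Step 3 above is a bounded polling loop. The following is a minimal synchronous sketch of that idea, not the server's actual code: the function name `drain_connections` is hypothetical, and the real implementation is likely async.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `active` until it reaches zero or `max_wait`
/// elapses. Returns `true` when every connection drained in time.
pub fn drain_connections(active: &AtomicUsize, max_wait: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + max_wait;
    loop {
        if active.load(Ordering::Acquire) == 0 {
            return true;
        }
        let now = Instant::now();
        if now >= deadline {
            return false;
        }
        // Sleep for one poll interval, but never past the deadline.
        thread::sleep(poll.min(deadline - now));
    }
}
```

With the values listed in step 3, the call would be `drain_connections(&counter, Duration::from_secs(5), Duration::from_millis(100))`. A `false` result means shutdown proceeds to the flush step with connections still open.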
### ShutdownCoordinator
Defined in `src/shutdown.rs`. Tracks lifecycle through `ShutdownPhase`:
```
Running → Draining → FlushingState → Terminated
```
`DrainConfig` defaults: `drain_timeout` 30 s, `check_interval` 1 s, `flush_timeout` 30 s.
Built-in hooks run in order during shutdown:
| Hook | Action |
| --- | --- |
| `WalFlushHook` | Flushes the write-ahead log |
| `MemtableFlushHook` | Flushes the active memtable |
| `ConnectionDrainHook` | Polls `AtomicUsize` every 100 ms until connections reach zero |
| `MetricsSnapshotHook` | Captures a final metrics snapshot before exit |
Custom hooks can be added with `register_shutdown_hook()` by implementing the `ShutdownHook` trait.
**In-flight tracking** — call `request_start()` when a request begins and `request_end()` when it ends. The `ConnectionDrainHook` reads `AtomicUsize` directly; RAII wrappers are recommended.
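One way to build the recommended RAII wrapper is sketched below. `InFlightGuard` is a hypothetical name, and the sketch touches the shared counter directly rather than calling the real `request_start()` / `request_end()` methods, whose signatures are not shown in this manual.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical RAII wrapper: increments the in-flight counter on creation
/// and decrements it on drop, including during early returns and unwinding.
pub struct InFlightGuard {
    counter: Arc<AtomicUsize>,
}

impl InFlightGuard {
    pub fn new(counter: Arc<AtomicUsize>) -> Self {
        counter.fetch_add(1, Ordering::AcqRel);
        Self { counter }
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::AcqRel);
    }
}
```

Because the decrement lives in `Drop`, a handler that returns early or panics still releases its slot, so the drain hook is not left waiting on a count that can never reach zero.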
**`ShutdownGuard`** — triggers shutdown on drop unless `disarm()` is called first. Useful for ensuring shutdown runs even on panic paths.
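The drop-unless-disarmed behavior can be illustrated with a small stand-in. This is a sketch only: `ShutdownGuardSketch` is hypothetical, it sets a flag instead of driving the real `ShutdownCoordinator`, and whether the real `disarm()` takes `self` or `&mut self` is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for `ShutdownGuard`: sets `flag` when dropped
/// unless `disarm()` was called first.
pub struct ShutdownGuardSketch {
    flag: Arc<AtomicBool>,
    armed: bool,
}

impl ShutdownGuardSketch {
    pub fn new(flag: Arc<AtomicBool>) -> Self {
        Self { flag, armed: true }
    }

    /// Consumes the guard; the drop that follows sees `armed == false`.
    pub fn disarm(mut self) {
        self.armed = false;
    }
}

impl Drop for ShutdownGuardSketch {
    fn drop(&mut self) {
        if self.armed {
            // Stands in for "trigger shutdown".
            self.flag.store(true, Ordering::Release);
        }
    }
}
```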
**Signal handlers** — `setup_signal_handlers()` registers SIGTERM + SIGINT (Unix) or Ctrl+C (non-Unix). `setup_sighup_handler()` (Unix only) calls `config.reload_from_stored_path()`. See [operations-guide.md](operations-guide.md) for the reload pipeline.
### Query Backpressure
`try_acquire_query()` returns a `QueryGuard` (which decrements the counter on drop) or `ServerError::ResourceExhausted` when `active_queries` has reached the configured limit (`max_active_queries` in `resource_limits`). Surface `ResourceExhausted` to the client as a retriable error: HTTP 429 or gRPC `RESOURCE_EXHAUSTED`.
Other `ServerError` variants: `AlreadyRunning`, `ShutdownTimeout`, `Migration`.
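The acquire-or-reject behavior of the backpressure check can be sketched with an atomic compare-and-update loop. The names `try_acquire`, `QueryGuardSketch`, and `AcquireError` are hypothetical stand-ins; the real `try_acquire_query()` may be implemented differently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub enum AcquireError {
    ResourceExhausted,
}

/// Hypothetical guard: releases one query slot when dropped.
#[derive(Debug)]
pub struct QueryGuardSketch {
    active: Arc<AtomicUsize>,
}

impl Drop for QueryGuardSketch {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::AcqRel);
    }
}

/// Increments `active` only while it is below `limit`, so concurrent callers
/// can never push the counter past the limit.
pub fn try_acquire(active: &Arc<AtomicUsize>, limit: usize) -> Result<QueryGuardSketch, AcquireError> {
    active
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
            if n < limit { Some(n + 1) } else { None }
        })
        .map(|_| QueryGuardSketch { active: Arc::clone(active) })
        .map_err(|_| AcquireError::ResourceExhausted)
}
```

A plain `load` followed by `fetch_add` would let two racing callers both pass the check; `fetch_update` retries the comparison until the increment is applied atomically or the limit is hit.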
### PID File
`is_running()` checks for an existing PID file and process liveness. `write_pid_file()` and `remove_pid_file()` manage the file. `stop_server()` sends SIGTERM; if the process does not exit it escalates to SIGKILL. Default path: `/var/run/amaters-server.pid` (override via `server.pid_file`).
---
## 2. Configuration Quick Reference
Full field listing: [configuration-reference.md](configuration-reference.md).
This section lists the fields most commonly tuned in production.
### Top-level sections
`ServerConfig` contains: `server`, `storage`, `network`, `cluster` (optional), `logging`, `metrics`, `auth`, `authz`, `resource_limits`, `circuit_cache`, `timeouts`.
### ServerSettings
| Field | Default | Notes |
| --- | --- | --- |
| `bind_address` | — | `SocketAddr` |
| `data_dir` | — | Path to data directory |
| `pid_file` | `/var/run/amaters-server.pid` | |
| `max_connections` | 1000 | Hard cap |
| `shutdown_timeout_secs` | 30 | |
### StorageSettings
| Field | Default | Notes |
| --- | --- | --- |
| `engine` | `"lsm"` | Also `"memory"` |
| `memtable_size_mb` | 64 | In-memory write buffer |
| `block_cache_size_mb` | 256 | Read cache |
| `wal.enabled` | — | |
| `wal.dir` | `"wal"` | Relative to `data_dir` |
| `wal.segment_size_mb` | 64 | |
| `wal.sync_mode` | `"interval"` | |
| `compaction.strategy` | `"leveled"` | Also `"tiered"` |
| `compaction.num_levels` | 7 | |
| `compaction.level_multiplier` | 10 | |
| `compaction.max_concurrent` | 4 | |
### NetworkSettings
| Field | Default |
| --- | --- |
| `tls_enabled` | — |
| `tls_cert` / `tls_key` / `tls_ca` | — |
| `require_client_cert` | — |
| `connection_timeout_secs` | 30 |
| `keepalive_interval_secs` | 60 |
### ResourceLimits
| Field | Default |
| --- | --- |
| `max_connections_per_client` | 10 |
| `max_requests_per_second_global` | 10,000 |
| `max_active_queries` | 1,000 |
### TimeoutConfig
| Field | Default |
| --- | --- |
| `request_timeout_ms` | 30,000 |
| `idle_connection_timeout_ms` | 60,000 |
| `graceful_shutdown_timeout_ms` | 5,000 |
| `keep_alive_interval_ms` | 15,000 |
### Hot-Reload vs Restart-Required
`ReloadableSection` variants (SIGHUP safe): `Logging`, `Metrics`, `Compaction`, `RateLimit`.
`NonReloadableSection` variants (require full restart): `BindAddress`, `Port`, `TlsCertPath`, `TlsKeyPath`, `StorageEngine`, `DataDir`, `ClusterNodeId`.
### Environment Overrides
`AMATERS_BIND_ADDRESS`, `AMATERS_DATA_DIR`, `AMATERS_LOG_LEVEL`, `AMATERS_TLS_ENABLED`.
---
## 3. Health Monitoring
Defined in `src/health.rs`.
### Status Enums
`HealthStatus`:
| Variant | Meaning |
| --- | --- |
| `Starting` | Initializing — not yet ready for traffic |
| `Healthy` | All probes pass — accept traffic |
| `Degraded` | Some probes warn (e.g. low disk) — still operational |
| `Unhealthy` | Critical probe failed — route traffic away |
| `ShuttingDown` | Graceful shutdown in progress |
`ProbeStatus` (`Healthy`, `Degraded`, `Unhealthy`) is aggregated across probes using `worse()` — the worst result wins.
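The worst-result-wins rule can be sketched by ordering the variants from best to worst and taking the maximum. The `aggregate` helper and the choice to report `Healthy` for an empty probe list are assumptions made for this illustration.

```rust
/// Variants are declared from best to worst so the derived `Ord` ranks
/// `Unhealthy` highest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ProbeStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

impl ProbeStatus {
    /// Returns whichever of the two statuses is worse.
    pub fn worse(self, other: Self) -> Self {
        self.max(other)
    }
}

/// Hypothetical helper: folds all probe results into one status.
/// An empty slice yields `Healthy` (an assumption, not documented behavior).
pub fn aggregate(results: &[ProbeStatus]) -> ProbeStatus {
    results.iter().copied().fold(ProbeStatus::Healthy, ProbeStatus::worse)
}
```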
### HealthChecker
- `is_alive()` — returns `true` when state is `Healthy`, `Starting`, or `Degraded`
- `is_ready()` — returns `true` when `Healthy` or `Degraded` **and** `storage_healthy` **and** `network_healthy`
- `HealthHistory` — ring buffer of health snapshots (default capacity 10); `uptime_percent()` returns the percentage of retained snapshots in which `alive` is true
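The three rules above can be expressed compactly. This is a sketch under assumptions: the free functions and the `History` type are hypothetical simplifications of `HealthChecker` and `HealthHistory`, and returning 0.0 for an empty history is a choice made here, not documented behavior.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Starting,
    Healthy,
    Degraded,
    Unhealthy,
    ShuttingDown,
}

pub fn is_alive(status: HealthStatus) -> bool {
    matches!(status, HealthStatus::Healthy | HealthStatus::Starting | HealthStatus::Degraded)
}

pub fn is_ready(status: HealthStatus, storage_healthy: bool, network_healthy: bool) -> bool {
    matches!(status, HealthStatus::Healthy | HealthStatus::Degraded)
        && storage_healthy
        && network_healthy
}

/// Fixed-capacity ring buffer of `alive` flags; the oldest entry is evicted first.
pub struct History {
    capacity: usize,
    alive: VecDeque<bool>,
}

impl History {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, alive: VecDeque::with_capacity(capacity) }
    }

    pub fn record(&mut self, alive: bool) {
        if self.alive.len() == self.capacity {
            self.alive.pop_front();
        }
        self.alive.push_back(alive);
    }

    /// Percentage of retained snapshots where `alive` was true.
    pub fn uptime_percent(&self) -> f64 {
        if self.alive.is_empty() {
            return 0.0; // assumption: no data reports as 0%
        }
        let up = self.alive.iter().filter(|a| **a).count();
        100.0 * up as f64 / self.alive.len() as f64
    }
}
```

Note that `Starting` counts as alive but not ready, which is why a liveness probe passes during startup while a readiness probe does not.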
### Built-in Probes
| Probe | Check |
| --- | --- |
| `StorageProbe` | Write / read / delete a `.health_probe_test` key |
| `WalProbe` | Open `.wal_health_probe` with append flag |
| `DiskSpaceProbe` | `< min_free_bytes/4` → Unhealthy; `< min_free_bytes` → Degraded |
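The `DiskSpaceProbe` thresholds amount to a two-step comparison. The sketch below assumes integer division for the quarter threshold and uses a hypothetical function name; how the probe obtains the free-byte count is not covered.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Hypothetical classifier mirroring the documented thresholds.
pub fn classify_free_space(free_bytes: u64, min_free_bytes: u64) -> ProbeStatus {
    if free_bytes < min_free_bytes / 4 {
        ProbeStatus::Unhealthy
    } else if free_bytes < min_free_bytes {
        ProbeStatus::Degraded
    } else {
        ProbeStatus::Healthy
    }
}
```

For example, with `min_free_bytes` set to 400, a volume with 99 bytes free is `Unhealthy`, 100 to 399 bytes is `Degraded`, and 400 or more is `Healthy`.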
### HTTP Health Endpoints
A lightweight TCP server (no framework) exposes:
| Endpoint | Response |
| --- | --- |
| `GET /health` | Full `HealthCheckResponse` JSON; 200 or 503 |
| `GET /healthz` | Alive bool |
| `GET /readyz` | Ready bool |
| `GET /livez` | `LivenessResponse` JSON |
| `GET /metrics` | History + uptime JSON |
`HealthCheckResponse` fields: `status`, `version`, `uptime_seconds`, `components`, `dependencies`, `probes` (HashMap), `uptime_percent`, `timestamp`.
Use `/readyz` for Kubernetes `readinessProbe` and `/livez` for `livenessProbe`. See [deployment-guide.md](deployment-guide.md) for example probe configuration.
---
## 4. Metrics
Defined in `src/metrics.rs`. All counters are lock-free atomics.
### Counters and Gauges
| Metric | Kind |
| --- | --- |
| `requests_total`, `success`, `failed` | Counters |
| `bytes_read`, `bytes_written` | Counters |
| `active_connections` | Gauge |
| `queries_total`, `query_time_us` | Counters |
| `memtable_size_bytes`, `sstable_count` | Storage gauges |
| `compaction_count`, `compaction_bytes_written` | Storage counters |
| `wal_size_bytes` | Storage gauge |
| `block_cache_hits`, `block_cache_misses` | Cache counters |
### Request Latency Histogram
`DEFAULT_BUCKETS` (12 buckets, seconds):
```
0.001 0.005 0.01 0.025 0.05 0.1 0.25 0.5 1.0 2.5 5.0 10.0
```
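To show how an observation maps onto these buckets, here is a single-threaded sketch. The real collector is described as lock-free, so `LatencyHistogram` and its `&mut self` API are simplifications assumed for this example; the bucket bounds are the ones listed above.

```rust
/// Bucket upper bounds in seconds, as listed in `DEFAULT_BUCKETS`.
pub const DEFAULT_BUCKETS: [f64; 12] =
    [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0];

/// Hypothetical, non-atomic histogram for illustration only.
pub struct LatencyHistogram {
    per_bucket: [u64; 12],
    sum_seconds: f64,
    count: u64,
}

impl LatencyHistogram {
    pub fn new() -> Self {
        Self { per_bucket: [0; 12], sum_seconds: 0.0, count: 0 }
    }

    /// Records one latency sample in the first bucket whose bound is >= the value.
    /// Samples above 10 s only show up in `count` (the implicit +Inf bucket).
    pub fn observe(&mut self, seconds: f64) {
        if let Some(i) = DEFAULT_BUCKETS.iter().position(|&le| seconds <= le) {
            self.per_bucket[i] += 1;
        }
        self.sum_seconds += seconds;
        self.count += 1;
    }

    /// Cumulative counts per bound, the shape Prometheus `_bucket{le=...}` lines use.
    pub fn cumulative(&self) -> Vec<(f64, u64)> {
        let mut running = 0;
        DEFAULT_BUCKETS
            .iter()
            .zip(self.per_bucket.iter())
            .map(|(&le, &n)| {
                running += n;
                (le, running)
            })
            .collect()
    }

    pub fn count(&self) -> u64 {
        self.count
    }

    pub fn sum_seconds(&self) -> f64 {
        self.sum_seconds
    }
}
```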
### Per-Operation Metrics
`OperationType` variants: `Get`, `Put`, `Delete`, `Range`, `Batch`, `Stream`.
Each operation type has its own `OperationMetrics` (count, errors, latency). Prometheus output labels these with `op="get"`, `op="put"`, etc.
### Prometheus Exposition
`MetricsCollector::to_prometheus()` emits `amaters_*`-prefixed metrics in standard Prometheus text format with `# HELP` / `# TYPE` comments, `_bucket{le=...}` / `_sum` / `_count` for histograms, and per-operation labels.
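The text format described above can be illustrated with a small renderer for a labeled counter. This is not `to_prometheus()` itself: `render_counter` and the metric name used in the example are hypothetical, and only the `amaters_` prefix and `op` label come from this manual.

```rust
use std::fmt::Write;

/// Hypothetical renderer: emits HELP/TYPE comments followed by one sample
/// line per operation label.
pub fn render_counter(name: &str, help: &str, samples: &[(&str, u64)]) -> String {
    let mut out = String::new();
    // Writing to a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP amaters_{name} {help}");
    let _ = writeln!(out, "# TYPE amaters_{name} counter");
    for (op, value) in samples {
        let _ = writeln!(out, "amaters_{name}{{op=\"{op}\"}} {value}");
    }
    out
}
```

Calling `render_counter("operations_total", "Operations by type.", &[("get", 3)])` produces a `# HELP` line, a `# TYPE` line, and the sample `amaters_operations_total{op="get"} 3`.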
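Scrape endpoint: `GET /metrics` on `metrics.bind_address` (default `127.0.0.1:9090`). This Prometheus endpoint is separate from the `GET /metrics` route on the health server (Section 3), which returns history and uptime as JSON.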
`MetricsSnapshot` helpers: `success_rate()`, `avg_query_time_us()`, `format_human()`.
---
## 5. Cluster Operations
### Peer Configuration
Each entry in the `peers` Vec of `ClusterSettings` has the form `"node_id:address"`, for example `"1:127.0.0.1:7879"`. The node ID is everything before the first colon; the remainder is the peer address.
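Because the address part contains its own colon, a peer entry has to be split at the first colon only. The sketch below assumes the node ID is a `u64` and the address parses as a `SocketAddr`; `parse_peer` is a hypothetical helper, and the real parser may also accept hostnames.

```rust
use std::net::SocketAddr;

/// Hypothetical parser for one `peers` entry such as "1:127.0.0.1:7879".
pub fn parse_peer(entry: &str) -> Result<(u64, SocketAddr), String> {
    // Split at the first ':' only; the rest is the address, port included.
    let (id, addr) = entry
        .split_once(':')
        .ok_or_else(|| format!("missing ':' in peer entry {entry:?}"))?;
    let node_id: u64 = id
        .parse()
        .map_err(|e| format!("invalid node id {id:?}: {e}"))?;
    let address: SocketAddr = addr
        .parse()
        .map_err(|e| format!("invalid address {addr:?}: {e}"))?;
    Ok((node_id, address))
}
```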
Key timing fields: `heartbeat_interval_ms` 100, `election_timeout_ms` 300 (±33% jitter applied at runtime to stagger elections).
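As a worked example of the jitter range: with the default of 300 ms, ±33% gives effective election timeouts between 201 ms and 399 ms, comfortably above the 100 ms heartbeat interval. The sketch below uses integer arithmetic and takes the random input as a parameter; both function names are hypothetical, and how the server actually draws its jitter is not specified here.

```rust
/// Returns the (min, max) election timeout for a ±33% jitter.
pub fn jitter_bounds_ms(base_ms: u64) -> (u64, u64) {
    let spread = base_ms * 33 / 100;
    (base_ms - spread, base_ms + spread)
}

/// Maps an arbitrary random number onto the jitter range (inclusive).
/// The caller supplies `random`, e.g. from its RNG of choice.
pub fn jittered_timeout_ms(base_ms: u64, random: u64) -> u64 {
    let (lo, hi) = jitter_bounds_ms(base_ms);
    lo + random % (hi - lo + 1)
}
```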
### RaftNode API
| Method | Returns | Notes |
| --- | --- | --- |
| `is_leader()` | `bool` | |
| `commit_index()` | `LogIndex` | |
| `last_log_index()` | `LogIndex` | |
| `propose(command)` | `RaftResult<LogIndex>` | `RaftError::NotLeader { leader_id }` when not leader |
Clients that receive `NotLeader` should redirect to the returned `leader_id`.
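A client-side redirect loop might look like the following sketch. It is written against assumptions: `RaftErrorSketch` is a stand-in for the real error type, `leader_id` is modeled as `Option<u64>`, and the `propose` closure abstracts over however the client actually reaches a node.

```rust
#[derive(Debug, PartialEq)]
pub enum RaftErrorSketch {
    /// Stand-in for `RaftError::NotLeader`; the field type is an assumption.
    NotLeader { leader_id: Option<u64> },
}

/// Follows `NotLeader` hints for at most `max_hops` redirects.
/// Returns `(node that accepted, log index)` on success.
pub fn propose_with_redirect<F>(
    start_node: u64,
    max_hops: usize,
    mut propose: F,
) -> Result<(u64, u64), RaftErrorSketch>
where
    F: FnMut(u64) -> Result<u64, RaftErrorSketch>,
{
    let mut target = start_node;
    let mut hops = 0;
    loop {
        match propose(target) {
            Ok(index) => return Ok((target, index)),
            Err(RaftErrorSketch::NotLeader { leader_id: Some(next) }) if hops < max_hops => {
                target = next;
                hops += 1;
            }
            // No leader hint, or too many redirects: give up and report.
            Err(e) => return Err(e),
        }
    }
}
```

Bounding the number of hops keeps a client from chasing stale hints indefinitely while an election is in progress.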
### Admin API
`AdminApi` (defined in `src/admin.rs`):
- `get_cluster_status()` returns `ClusterStatusResponse` with fields: `node_id`, `is_leader`, `num_shards`, `num_nodes`
- Also exposes: `ShardSummary`, `ShardListResponse`, `ShardOpResponse`
### Alerting
`AlertEvent` variants (from `crates/amaters-cluster/src/failover.rs`):
| Variant | Fields |
| --- | --- |
| `NodeFailed` | |
| `NodeRecovered` | |
| `LeaderChanged` | `old_leader: Option<NodeId>`, `new_leader` |
| `QuorumLost` | `cluster_size`, `reachable` |
| `SlowReplication` | `follower`, `lag_entries` |
`AlertSeverity` (from `crates/amaters-cluster/src/alert_rules.rs`): `Info`, `Warning`, `Critical`.
Default rules from `default_rules(slow_repl_threshold)`:
| Rule | Severity | Trigger |
| --- | --- | --- |
| `node_failed` | Critical | `NodeFailed` event |
| `quorum_lost` | Critical | `QuorumLost` event |
| `leader_changed` | Warning | `LeaderChanged` event |
| `slow_replication` | Warning | `lag_entries >= threshold` |
The `RuleEngine` deduplicates alerts using a `Mutex<HashMap<(rule_name, dedup_key), Instant>>`. Dedup keys follow the pattern `"node_failed:42"`, `"quorum_lost"`, `"slow_replication:5"`.
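The deduplication map described above can be sketched as follows. The `Deduper` type, its `should_fire` method, and the suppression window are hypothetical; this manual states the map's shape but not how long an alert stays suppressed.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical deduplicator keyed by (rule name, dedup key).
pub struct Deduper {
    window: Duration,
    last_fired: Mutex<HashMap<(String, String), Instant>>,
}

impl Deduper {
    pub fn new(window: Duration) -> Self {
        Self { window, last_fired: Mutex::new(HashMap::new()) }
    }

    /// Returns `true` (and records `now`) unless the same key fired within `window`.
    pub fn should_fire(&self, rule: &str, dedup_key: &str, now: Instant) -> bool {
        // Recover the map even if another thread panicked while holding the lock.
        let mut map = self.last_fired.lock().unwrap_or_else(|e| e.into_inner());
        let key = (rule.to_string(), dedup_key.to_string());
        let suppressed = map
            .get(&key)
            .map_or(false, |prev| now.saturating_duration_since(*prev) < self.window);
        if !suppressed {
            map.insert(key, now);
        }
        !suppressed
    }
}
```

Including the node or follower ID in the dedup key, as in `"node_failed:42"`, means a second failing node still raises its own alert while repeats for the first are suppressed.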
### Failover Coordinator
`FailoverCoordinator::tick()` calls `detector.check_timeouts()` and schedules an election when `leader_failure_count >= max_consecutive_failures` (default 3). `should_redirect(my_id)` returns `true` when the known leader is not this node.
---
## 6. Snapshot Management
Defined in `src/snapshot.rs`.
### File Format
Each snapshot file (`snapshot-{id:020}.bin`) begins with a 40-byte header:
| Bytes | Contents |
| --- | --- |
| 0–7 | Magic `AMSNAP\x01\x00` |
| 8–15 | Snapshot ID (u64 LE) |
| 16–23 | Timestamp ms (u64 LE) |
| 24–31 | Original uncompressed size (u64 LE) |
| 32–39 | FNV-64 checksum over compressed payload (u64 LE) |
Payload: LZ4-compressed via `oxiarc_lz4`. On-disk metadata uses `meta.bin` / `manifest.bin` encoded with oxicode.
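The header layout and file naming can be checked with a short encode/decode sketch. Two assumptions are made: the checksum is taken to be the FNV-1a variant (this manual says only "FNV-64"), and the type and function names are hypothetical. LZ4 compression is omitted.

```rust
pub const MAGIC: [u8; 8] = *b"AMSNAP\x01\x00";
pub const HEADER_LEN: usize = 40;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SnapshotHeader {
    pub id: u64,
    pub timestamp_ms: u64,
    pub original_size: u64,
    pub checksum: u64,
}

/// 64-bit FNV-1a (assumed variant) over the compressed payload.
pub fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

impl SnapshotHeader {
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0..8].copy_from_slice(&MAGIC);
        buf[8..16].copy_from_slice(&self.id.to_le_bytes());
        buf[16..24].copy_from_slice(&self.timestamp_ms.to_le_bytes());
        buf[24..32].copy_from_slice(&self.original_size.to_le_bytes());
        buf[32..40].copy_from_slice(&self.checksum.to_le_bytes());
        buf
    }

    /// Returns `None` when the buffer is too short or the magic does not match.
    pub fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() < HEADER_LEN || buf[0..8] != MAGIC {
            return None;
        }
        let field = |start: usize| {
            let mut raw = [0u8; 8];
            raw.copy_from_slice(&buf[start..start + 8]);
            u64::from_le_bytes(raw)
        };
        Some(Self {
            id: field(8),
            timestamp_ms: field(16),
            original_size: field(24),
            checksum: field(32),
        })
    }
}

/// File name with the ID zero-padded to 20 digits.
pub fn file_name(id: u64) -> String {
    format!("snapshot-{:020}.bin", id)
}
```

Zero-padding to 20 digits covers the full `u64` range, so lexicographic order of file names matches numeric order of IDs.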
### Operations
| Operation | Behavior |
| --- | --- |
| `write_snapshot(id, data)` | Compress → FNV-64 → write header + payload → `sync_all()` |
| `read_snapshot(id)` | Validate magic → id match → checksum → decompress |
| `list_snapshots()` | Reads headers only (no decompression), returns sorted descending by id |
### Remote Snapshots
Requires a `SnapshotUploader` implementation to be set:
| Method | Behavior |
| --- | --- |
| `upload(id)` | Delegates to `uploader.upload_snapshot` |
| `restore_from_remote(uri, local_id)` | Delegates to `uploader.download_snapshot` |
`LocalSnapshotUploader` stores files as `remote-snapshot-{id:020}.bin` and uses URIs of the form `local://absolute_path`.
`SnapshotUploader` trait requires: `upload_snapshot`, `download_snapshot`, `list_remote_snapshots`.
### Creating Snapshots via Admin API
`AdminApi::create_snapshot()` encodes current cluster status as JSON, uses nanosecond wall time as the snapshot ID, and delegates to `SnapshotManager::write_snapshot`.
---
## 7. Certificate Rotation
TLS credentials are stored behind an `ArcSwap<TlsCreds>`, enabling atomic pointer swap with no lock contention.
**Zero-downtime rotation procedure:**
1. Replace the certificate and key files on disk at the paths set in `network.tls_cert` and `network.tls_key`.
2. Send SIGHUP to the server process (see [operations-guide.md](operations-guide.md) for signal handling details).
3. `setup_sighup_handler()` calls `config.reload_from_stored_path()`, which loads the new credentials and swaps the `ArcSwap`.
4. New TLS connections immediately use the new certificate. In-flight TLS sessions are not interrupted.
**Constraints:**
- `TlsCertPath` and `TlsKeyPath` are `NonReloadableSection` — the *paths* themselves cannot change without a full restart. Only the files at those paths are reloaded.
- `require_client_cert` controls mTLS enforcement and is set at startup.
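The swap semantics in the rotation procedure can be illustrated with the standard library alone. The sketch below substitutes `RwLock<Arc<_>>` for `ArcSwap`, so unlike the real mechanism it is not lock-free; `CredStore` and `TlsCredsSketch` are hypothetical names. What it shows is the property that matters for rotation: holders of the old credentials keep them, and new loads see the replacement.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, PartialEq)]
pub struct TlsCredsSketch {
    pub cert_pem: String,
    pub key_pem: String,
}

/// Stand-in for the `ArcSwap<TlsCreds>` holder (uses a lock, unlike ArcSwap).
pub struct CredStore {
    current: RwLock<Arc<TlsCredsSketch>>,
}

impl CredStore {
    pub fn new(creds: TlsCredsSketch) -> Self {
        Self { current: RwLock::new(Arc::new(creds)) }
    }

    /// Called when a new TLS connection starts: clones the current pointer.
    pub fn load(&self) -> Arc<TlsCredsSketch> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Called from the reload path: replaces the pointer for future loads.
    pub fn swap(&self, creds: TlsCredsSketch) {
        *self.current.write().unwrap() = Arc::new(creds);
    }
}
```

The old `Arc` is freed only when the last session holding it finishes, which is why in-flight sessions are unaffected by a rotation.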
---
## 8. Rolling Upgrades
### Compatibility
`VersionHandshake` is used for peer compatibility checks during cluster upgrades. Nodes with incompatible versions will not form a quorum, preventing split-brain during partial upgrades.
The ±33% jitter on `election_timeout_ms` (default 300 ms) allows staggered leader elections during rolling restarts, reducing the chance of simultaneous election timeouts.
`NonReloadableSection` fields — `ClusterNodeId`, `StorageEngine`, `DataDir`, `BindAddress` — require a full process restart to take effect.
### Procedure
1. **Upgrade one follower at a time.** After each restart, verify:
   - `is_leader()` returns `false` on the upgraded node
   - `commit_index()` is advancing (node is replicating)
2. **Step down the leader.** Trigger a leader step-down or wait for the natural election after the last follower upgrade.
3. **Upgrade the former leader node** following the same restart procedure.
4. **Verify cluster health.** Confirm `is_leader()` returns `true` on exactly one node and `commit_index()` is advancing on all nodes.
Do not upgrade more than one node simultaneously. The cluster requires quorum to process writes; taking two nodes offline in a three-node cluster loses quorum.
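The quorum arithmetic behind this rule is a simple majority, which is the standard Raft requirement; that Amaters uses exactly this formula is an assumption. The helper names below are hypothetical, and `exactly_one_leader` mirrors the final verification step, taking each node's `is_leader()` result as input.

```rust
/// Smallest number of nodes that forms a majority.
pub fn quorum_size(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// True when the nodes still online can form a majority.
pub fn has_quorum(cluster_size: usize, offline: usize) -> bool {
    cluster_size.saturating_sub(offline) >= quorum_size(cluster_size)
}

/// True when exactly one node reports itself as leader.
pub fn exactly_one_leader(is_leader_flags: &[bool]) -> bool {
    is_leader_flags.iter().filter(|flag| **flag).count() == 1
}
```

In a three-node cluster the quorum size is 2, so one node can be offline for its upgrade while writes continue; a second simultaneous restart drops the cluster to one reachable node and stalls writes.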
For systemd and container restart procedures see [deployment-guide.md](deployment-guide.md).