# Troubleshooting Guide
Common failure modes for `amaters-server`, how to read the logs and metrics, and how to recover from data-level issues.
## First-line diagnostics
```bash
amaters-server status # Is the process up?
journalctl -u amaters -n 200 # Last 200 log lines
amaters-cli admin LOGS 100 false # In-memory ring buffer
```
When in doubt, set `AMATERS_LOG_LEVEL=debug` and restart — most issues become visible at debug level without noise from FHE internals.
## Common errors
### `Configuration validation failed: Invalid bind address`
Cause: `server.bind_address` does not parse as a `SocketAddr` (missing port, invalid IPv6 form).
Fix: confirm the value is a literal `SocketAddr`, not a hostname:
```toml
[server]
bind_address = "0.0.0.0:7878" # not "amaters.local"
```
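A quick pre-flight check can catch a hostname before the server rejects it. The sketch below only approximates what the server's `SocketAddr` parser accepts: the function name `looks_like_socket_addr` is hypothetical, and it does not validate IPv4 octet ranges or IPv6 grouping.

```bash
# Rough shape check: IPv4 "a.b.c.d:port" or bracketed IPv6 "[...]:port".
# Approximation only; the server's own parser remains authoritative.
looks_like_socket_addr() {
  local addr="$1" host port
  port="${addr##*:}"   # text after the last colon
  host="${addr%:*}"    # text before the last colon
  # Port must be numeric, at most 5 digits, and no larger than 65535.
  case "$port" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#port}" -le 5 ] && [ "$port" -le 65535 ] || return 1
  # Host must be a dotted quad or a bracketed IPv6 literal, never a name.
  printf '%s\n' "$host" |
    grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$|^\[[0-9A-Fa-f:.]+\]$'
}
```

`looks_like_socket_addr "0.0.0.0:7878"` succeeds, while `looks_like_socket_addr "amaters.local"` fails because the value has no numeric port.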
### `Configuration validation failed: TLS enabled but no certificate file specified`
Cause: `network.tls_enabled = true` without `tls_cert` or `tls_key`.
Fix: provide both PEM paths, or set `tls_enabled = false`. See [Deployment Guide](deployment-guide.md#tls--mtls-preparation).
### `Server is already running`
Cause: a prior process held `server.pid_file` and exited without cleanup.
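Fix: confirm that no `amaters-server` process is still running, remove the stale file at `server.pid_file`, then start the server again: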
```bash
amaters-server start --config /etc/amaters/config.toml
```
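Before removing a PID file it is worth confirming that the recorded process is really gone. A minimal sketch, assuming the file contains a single numeric PID; `pid_file_is_stale` is a hypothetical helper, not part of `amaters-server`:

```bash
# Exit status: 0 = stale (no such process), 1 = process alive,
#              2 = PID file missing or unreadable.
pid_file_is_stale() {
  local pid_file="$1" pid
  [ -r "$pid_file" ] || return 2
  pid="$(tr -d '[:space:]' < "$pid_file")"
  case "$pid" in
    ''|*[!0-9]*) return 0 ;;   # empty or non-numeric content: treat as stale
  esac
  # kill -0 delivers no signal; it only tests whether the PID can be signalled.
  if kill -0 "$pid" 2>/dev/null; then
    return 1
  fi
  return 0
}
```

For example, `pid_file_is_stale /var/run/amaters-server.pid && rm /var/run/amaters-server.pid` removes the file only when nothing is running under that PID; substitute the path configured in `server.pid_file`. Run the check as root or as the service user, because `kill -0` also fails for a live process owned by someone else. A recycled PID will make the file look live.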
### gRPC `UNAUTHENTICATED` or 401 from auth middleware
Cause: `auth.enabled = true`, request had no credential, and `reject_unauthenticated = true`.
Diagnose: check the audit log, which records the auth method tried and the failure reason (missing header, expired JWT, unknown API key, mTLS CN mismatch).
Fix: ensure the client sends the credential the server is configured to accept. Cross-check `auth.methods`, `auth.jwt.algorithm`, and `auth.api_key.header_name` with the client.
### gRPC `UNAUTHENTICATED` due to JWT validation
Common JWT-side causes:
| Error | Likely cause |
|---|---|
| `Token expired` | `expiration_secs` mismatch or clock skew between issuer and server. |
| `Invalid signature` | Wrong `secret` (HS\*) or `public_key_path` (RS\*); algorithm mismatch. |
| `Invalid issuer` | `auth.jwt.issuer` set on server but token's `iss` differs. |
| `Invalid audience` | `auth.jwt.audience` set on server but token's `aud` differs. |
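
To tell these cases apart it helps to look at the claims the token actually carries. The sketch below decodes the payload segment of a JWT without verifying the signature, so it is for inspection only. `jwt_payload` is a hypothetical helper, and `base64 -d` assumes GNU coreutils or a compatible `base64`.

```bash
# Print the decoded payload (claims) of a JWT. Does NOT verify the signature.
jwt_payload() {
  local payload pad
  # Take the middle segment and map the base64url alphabet back to base64.
  payload="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # base64url drops padding; restore '=' until the length is a multiple of 4.
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    payload="${payload}="
    pad=$((pad - 1))
  done
  printf '%s' "$payload" | base64 -d
}
```

Compare the printed `exp`, `iss`, and `aud` claims with the server's `auth.jwt` settings. `exp` is a Unix timestamp, so running `date -u +%s` on the server shows whether clock skew is a plausible cause.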
### TLS handshake failure on client
Run `openssl s_client` against the listener:
```bash
openssl s_client -connect amaters.example.com:7878 -alpn h2 -showcerts
```
Inspect:
- `verify return code: 19 (self signed certificate in certificate chain)` — client doesn't trust the server CA. Add it to the client trust store, or use `accept_invalid_certs` for testing only.
- `unsupported protocol` — ALPN negotiated something other than `h2`. tonic requires HTTP/2.
- `tlsv1 alert unknown ca` (with mTLS) — server cannot validate the client cert; check `network.tls_ca` and the cert's signing chain.
### `429 Too Many Requests` / rate-limit rejections
Cause: incoming RPS exceeded the configured rate-limit token bucket. (When the adaptive rate limiter is enabled, refill rate also drops under high CPU/memory load.)
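Diagnose with the metrics described under Metrics interpretation, for example `amaters_net_requests_total` and `amaters_net_active_requests`.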
Fix: increase `server.max_connections` (hot-reloadable via `SIGHUP`), or thin the workload. If load-induced throttling is the cause, adding capacity is the only fix.
### Transient storage errors during write bursts
Symptom: gRPC `ABORTED` with messages like `WAL flush stalled`, `memtable full`.
Likely causes and fixes:
- `storage.memtable_size_mb` too small for write rate: bump to 128–256 MB.
- `storage.wal.sync_mode = "always"` saturates fsync; switch to `interval` if your durability tier permits.
- `storage.compaction.max_concurrent` too low for the write rate; increase to match available cores.
The compaction setting is hot-reloadable; memtable and WAL changes require a restart.
## Log inspection
### journalctl
```bash
# Tail
journalctl -u amaters -f
# Last hour, only warnings and above
journalctl -u amaters --since "1 hour ago" -p warning
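# Audit subset (assumes log lines include the amaters_net::audit target)
journalctl -u amaters | grep "amaters_net::audit"
```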
### File logs
When `logging.file_enabled = true`:
```bash
tail -F /var/log/amaters/amaters.log
# Search rotated files too
zgrep "ERROR" /var/log/amaters/amaters.log*
```
### In-memory ring buffer
The 256-entry `recent_log` buffer (`crates/amaters-net/src/server_admin.rs`) survives even when file logging is disabled:
```bash
amaters-cli admin LOGS 100 false
```
Each entry is `{message, timestamp}`, where `timestamp` is in Unix-epoch seconds. The buffer is also useful for capturing context after a transient failure: it holds `LOG_RING_CAPACITY = 256` entries and evicts the oldest first (FIFO).
## Metrics interpretation
The Prometheus endpoint exposes lock-free counters from `crates/amaters-net/src/metrics_layer.rs`. Key signals:
| Metric | Interpretation |
|---|---|
| `amaters_net_active_requests` | In-flight gauge. Sustained high values relative to `max_connections` indicate saturation. |
| `amaters_net_requests_total` / `amaters_net_errors_total` | Error rate. Compute `rate(amaters_net_errors_total[5m]) / rate(amaters_net_requests_total[5m])`. |
| `amaters_net_rtt_bucket{le="..."}` | Request latency. p50 is the first bucket whose cumulative count reaches 50% of the total; p99 is the same at 99%. |
| `amaters_net_method_requests_total{method=...}` | Per-RPC volume. Use to spot a single hot path. |
| `amaters_net_bytes_sent_total` / `amaters_net_bytes_received_total` | I/O volume. Sustained imbalance between the two often signals a runaway range query. |
Latency buckets are fixed at `[1, 5, 10, 50, 100, 500, 1000]` ms plus `+Inf`. Anything past `le="500"` is unusual for non-FHE workloads.
A useful Grafana panel: `histogram_quantile(0.99, sum by (le) (rate(amaters_net_rtt_bucket[5m])))`.
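
When Prometheus is not available, the same percentile rule can be applied by hand to the cumulative bucket counts. The sketch below assumes the counts have already been extracted into `<le> <cumulative_count>` lines in ascending order, ending with `+Inf`. That two-column format and the name `quantile_bucket` are illustrative, not something the server emits.

```bash
# Read "<le> <cumulative_count>" lines on stdin and print the upper bound
# of the first bucket whose cumulative count reaches q * total.
quantile_bucket() {
  awk -v q="$1" '
    { le[NR] = $1; count[NR] = $2 }
    END {
      total = count[NR]            # the last (+Inf) bucket holds the total
      if (total == 0) exit 1       # no observations: nothing to report
      for (i = 1; i <= NR; i++)
        if (count[i] >= q * total) { print le[i]; exit }
    }'
}
```

Fed the cumulative counts 100, 400, 700, 900, 970, 995, 1000, 1000 for the fixed buckets, `quantile_bucket 0.5` prints `10` and `quantile_bucket 0.99` prints `500`. The result is only as precise as the bucket boundaries, which is also true of `histogram_quantile`.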
## Snapshot recovery
To recover from a corrupted data directory:
```bash
# Stop the server
amaters-server stop
# Rename the corrupt data dir (don't delete until recovery succeeds)
mv /var/lib/amaters/data /var/lib/amaters/data.bad
# Create a fresh data dir
mkdir -p /var/lib/amaters/data
chown amaters:amaters /var/lib/amaters/data
# Start the server with the fresh data dir
amaters-server start --config /etc/amaters/config.toml &
sleep 3
# Restore from the most recent backup
amaters-cli admin RESTORE /var/lib/amaters/snapshots/2026-05-08
```
`RESTORE` rejects mismatched `schema_version` values; check the JSON response. If the schema has rolled forward since the backup was taken, either restore with a server build at or below the schema's high-water mark, or migrate the snapshot first.
For the snapshot pipeline that ships in 0.2.1 (`SnapshotManager`), additional CRC32 verification is performed on `meta.bin` and `manifest.bin`. Decode failures point at exactly the corrupt file.
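
The fixed `sleep 3` in the recovery steps can be too short on a slow disk. One alternative is to poll until the server reports that it is up. A minimal sketch: `wait_until` is a hypothetical helper, and it assumes `amaters-server status` exits with status 0 once the process is ready, which this guide does not confirm.

```bash
# Retry a command once per second until it succeeds or the timeout
# (in seconds) elapses. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" waited=0
  shift
  until "$@" >/dev/null 2>&1; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage (assumption: `status` exits 0 when the server is ready):
#   wait_until 30 amaters-server status &&
#     amaters-cli admin RESTORE /var/lib/amaters/snapshots/2026-05-08
```

Chaining `RESTORE` with `&&` means the restore is skipped, rather than run against a server that never came up, when the timeout expires.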
## mTLS debugging
Confirm the server presents the cert you expect:
```bash
openssl s_client -connect amaters.example.com:7878 -alpn h2 -showcerts < /dev/null \
| openssl x509 -noout -subject -issuer -dates
```
When `network.require_client_cert = true`, also verify the client cert is acceptable:
```bash
# As the client
openssl s_client -connect amaters.example.com:7878 \
-cert client.pem -key client.key \
-CAfile /etc/amaters/ca.crt -alpn h2 < /dev/null
```
If the handshake fails with `tlsv1 alert bad certificate`, the server rejected the client cert. Most likely causes:
- Cert not signed by a CA in `network.tls_ca` / `auth.mtls.ca_certs_dir`.
- Cert revoked according to `auth.mtls.crl_path`.
- Cert CN does not match the expected user identity (when `auth.mtls.verify_cn = true`).
- Cert's `O=` not in `auth.mtls.allowed_organizations` (when set).
The audit log (`amaters_net::audit` target) records the specific reason at `warn!` level.
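
The first cause can be checked offline, without a handshake, using `openssl verify`. The sketch below is a hypothetical helper (`check_client_cert`) that validates the chain against a CA file and then prints the subject and expiry so the CN and `O=` values can be compared by eye. It does not consult the CRL or the server's `auth.mtls` settings.

```bash
# Verify a client certificate against a CA bundle and print the fields
# that the server-side checks look at. Usage: check_client_cert CA CERT
check_client_cert() {
  local ca="$1" cert="$2"
  if ! openssl verify -CAfile "$ca" "$cert" >/dev/null 2>&1; then
    echo "chain: FAILED (not signed by a CA in $ca, or outside validity)"
    return 1
  fi
  echo "chain: OK"
  # Subject carries CN and O; enddate shows when the cert expires.
  openssl x509 -in "$cert" -noout -subject -enddate
}
```

For example, `check_client_cert /etc/amaters/ca.crt client.pem` covers the signing-chain case; revocation and the CN and organization rules still have to be confirmed from the audit log.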
## Hot-reload didn't apply my changes
Inspect the reload report in the logs. Possible cases:
- `Config reload: no changes detected` — your edit didn't change any reloadable section. Compare with the [hot-reloadability list](configuration-reference.md#hot-reloadability).
- `requires restart - skipping` — the changed section is non-reloadable (e.g., `bind_address`, `storage.engine`). Restart the process.
- `Config reload failed — keeping old config: Validation failed: ...` — TOML was syntactically valid but failed validation. Fix the file and resend SIGHUP.
- No reload message at all — SIGHUP wasn't delivered. Check that you signalled the right PID:
```bash
kill -HUP $(cat /var/run/amaters-server.pid)
```