
# Service Configuration Defaults

When creating a new service in zinit using the `service.set` RPC endpoint, fields that are not explicitly specified will use the default values documented below.

## Service Definition Defaults

The `[service]` section defines the core service properties.

### Required Fields

These fields **must** be specified when creating a service:

| Field  | Type   | Description                                                |
| ------ | ------ | ---------------------------------------------------------- |
| `name` | string | Unique service identifier                                  |
| `exec` | string | Command to run (absolute path or name found via `$PATH`)   |

### Optional Fields with Defaults

| Field      | Type    | Default      | Description                                                                                |
| ---------- | ------- | ------------ | ------------------------------------------------------------------------------------------ |
| `dir`      | string  | `null`       | Working directory for the service. If not set, inherits parent's working directory         |
| `oneshot`  | boolean | `false`      | If true, service runs once and exits. Supervisor does not keep it running                  |
| `env`      | object  | `{}` (empty) | Environment variables to pass to the service as key-value pairs                            |
| `status`   | string  | `"start"`    | Desired state. Options: `"start"`, `"stop"`, `"ignore"`                                    |
| `class`    | string  | `"user"`     | Service classification. Options: `"user"`, `"system"`                                      |
| `critical` | boolean | `false`      | If true (PID1 only), service failure triggers emergency shell. Normal services ignore this |
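
As an illustration of how the defaults in this table combine with a partial definition, here is a minimal Python sketch. The helper `apply_service_defaults` and the constant names are hypothetical and not part of zinit; only the field names and default values come from the tables above.

```python
from copy import deepcopy

# Defaults copied from the "Optional Fields with Defaults" table.
SERVICE_DEFAULTS = {
    "dir": None,
    "oneshot": False,
    "env": {},
    "status": "start",
    "class": "user",
    "critical": False,
}
REQUIRED_FIELDS = ("name", "exec")


def apply_service_defaults(partial: dict) -> dict:
    """Return a full [service] table: defaults overlaid with the given fields."""
    missing = [f for f in REQUIRED_FIELDS if f not in partial]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # deepcopy so the shared default env dict is never mutated by callers
    merged = deepcopy(SERVICE_DEFAULTS)
    merged.update(partial)
    return merged


full = apply_service_defaults({"name": "my-app", "exec": "/usr/bin/my-app"})
print(full["status"], full["class"])  # start user
```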

### Status Field

Controls how the supervisor manages the service:

- **`start`** (default) - Supervisor ensures service is running. If it crashes, it will be restarted according to restart policy
- **`stop`** - Supervisor ensures service is stopped and keeps it stopped
- **`ignore`** - Supervisor doesn't manage this service's state. No auto-restart or auto-stop. Useful for manual control via CLI

### Class Field

Controls service protection and visibility:

- **`user`** (default) - Normal service. Affected by bulk operations (e.g., `zinit stop-all` will stop this service)
- **`system`** - Protected service. Bulk operations skip system services. Only stopped explicitly or if dependencies require it
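
The effect of `class` on bulk operations can be sketched as a simple filter. This is an illustration only: `bulk_stop_targets` is a hypothetical helper, and the real selection logic may take dependencies into account as well.

```python
def bulk_stop_targets(services: list) -> list:
    """Names a bulk stop would act on: everything except class == "system"."""
    return [s["name"] for s in services if s.get("class", "user") != "system"]


services = [
    {"name": "web", "class": "user"},
    {"name": "sshd", "class": "system"},
    {"name": "worker"},  # class omitted, so the "user" default applies
]
print(bulk_stop_targets(services))  # ['web', 'worker']
```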

---

## Dependency Defaults

The `[dependencies]` section declares relationships with other services.

All dependency fields are **optional** and default to empty lists if not specified:

| Field       | Type             | Default | Description                                                                      |
| ----------- | ---------------- | ------- | -------------------------------------------------------------------------------- |
| `after`     | array of strings | `[]`    | Services that must start BEFORE this service starts                              |
| `requires`  | array of strings | `[]`    | Hard dependencies. If any fail, this service cannot start                        |
| `wants`     | array of strings | `[]`    | Soft dependencies. Missing dependencies are ignored                              |
| `conflicts` | array of strings | `[]`    | Services that cannot run at the same time. If any are running, this cannot start |

### Dependency Semantics

- **`after`**: Purely ordering. The listed services are started before this one, but this service does not fail if one of them crashes
- **`requires`**: Hard dependency. If dependency fails, this service is blocked. If dependency is removed, this service is cascade-removed
- **`wants`**: Soft dependency. If dependency is missing or fails, this service starts anyway
- **`conflicts`**: Mutual exclusion. Services cannot run simultaneously. If conflicting service is running, this service is blocked
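
The semantics above can be read as a start gate. The sketch below is one possible interpretation, not zinit's implementation: `start_decision` and the state strings are hypothetical, `requires` is taken to mean "must currently be running", and `after` is left out because it only affects ordering.

```python
def start_decision(deps: dict, states: dict) -> str:
    """Return "start" or "blocked" for one service.

    deps   -- the service's [dependencies] table
    states -- hypothetical map of service name -> "running" | "failed" | "inactive"
    """
    # conflicts: any running conflicting service blocks the start
    for name in deps.get("conflicts", []):
        if states.get(name) == "running":
            return "blocked"
    # requires: a missing, failed, or stopped hard dependency blocks the start
    for name in deps.get("requires", []):
        if states.get(name) != "running":
            return "blocked"
    # wants: soft dependencies never block, so they are not inspected here
    return "start"


deps = {"requires": ["db"], "wants": ["metrics"], "conflicts": ["old-web"]}
print(start_decision(deps, {"db": "running", "old-web": "inactive"}))  # start
print(start_decision(deps, {"db": "failed"}))                          # blocked
```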

---

## Lifecycle Defaults

The `[lifecycle]` section controls restart behavior, timeouts, and signals.

| Field                  | Type    | Default        | Description                                                           |
| ---------------------- | ------- | -------------- | --------------------------------------------------------------------- |
| `restart`              | string  | `"on_failure"` | When to restart. Options: `"always"`, `"on_failure"`, `"never"`       |
| `restart_delay_ms`     | integer | `1000`         | Initial delay before first restart (milliseconds)                     |
| `restart_delay_max_ms` | integer | `300000`       | Maximum delay cap for exponential backoff (5 minutes)                 |
| `max_restarts`         | integer | `10`           | Max restart attempts. 0 = unlimited                                   |
| `stability_period_ms`  | integer | `30000`        | Service must run this long before backoff counter resets (30 seconds) |
| `start_timeout_ms`     | integer | `30000`        | Maximum time allowed for service startup before timeout (30 seconds)  |
| `stop_timeout_ms`      | integer | `10000`        | Maximum time for graceful shutdown before SIGKILL (10 seconds)        |
| `stop_signal`          | string  | `"SIGTERM"`    | Signal sent during graceful shutdown                                  |

### Restart Policy

- **`always`** - Service always restarts when it exits, regardless of exit code
- **`on_failure`** (default) - Service only restarts if it exits with non-zero exit code
- **`never`** - Service never restarts. Manual restart required via CLI
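
The three policies reduce to a small decision on the exit code. A minimal sketch, assuming a hypothetical `should_restart` helper; it does not cover `max_restarts` or backoff:

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Decide whether an exited service is restarted under the given policy."""
    if policy == "always":
        return True                 # restart regardless of exit code
    if policy == "on_failure":
        return exit_code != 0       # restart only on non-zero exit
    if policy == "never":
        return False                # manual restart required
    raise ValueError(f"unknown restart policy: {policy!r}")


print(should_restart("on_failure", 0))  # False
print(should_restart("on_failure", 1))  # True
```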

### Restart Backoff Algorithm

1. Service crashes → wait `restart_delay_ms`
2. If service crashes again → wait `restart_delay_ms * 2`
3. Double delay continues up to `restart_delay_max_ms`
4. After `stability_period_ms` of stable running → reset delay to `restart_delay_ms`

**Example**: With defaults (1000ms initial, 300000ms max):

- 1st restart: 1 second
- 2nd restart: 2 seconds
- 3rd restart: 4 seconds
- 4th restart: 8 seconds
- ... continues doubling ...
- Cap at 300 seconds (5 minutes)
- If the service runs stably for 30 seconds (`stability_period_ms`), the delay resets to 1 second
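
The doubling schedule can be written as a one-line formula. The function below is a sketch of the arithmetic described above, with a hypothetical name; it is not taken from the zinit source.

```python
def restart_delay_ms(attempt: int, initial_ms: int = 1000, max_ms: int = 300_000) -> int:
    """Delay before the Nth restart (attempt counts from 1), doubling up to max_ms."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(initial_ms * 2 ** (attempt - 1), max_ms)


print([restart_delay_ms(n) for n in range(1, 11)])
# [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000, 256000, 300000]
```

With the default values, the cap is first reached on the 10th restart, which is also the default `max_restarts` limit.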

### Shutdown Flow

When supervisor stops a service:

1. Send `stop_signal` (default: SIGTERM) to process group
2. Wait up to `stop_timeout_ms` (default: 10 seconds)
3. If still running → send SIGKILL
4. Wait for process to exit
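
The same four steps can be reproduced with the Python standard library on a POSIX system. This is a sketch of the flow, not zinit's code: `stop_service` is a hypothetical helper, and the child here is a throwaway Python process that sleeps.

```python
import os
import signal
import subprocess
import sys


def stop_service(proc, stop_signal=signal.SIGTERM, stop_timeout_ms=10_000) -> int:
    """Stop a child that runs in its own process group; return its exit status."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, stop_signal)                          # 1. signal the process group
    try:
        return proc.wait(timeout=stop_timeout_ms / 1000)  # 2. wait up to stop_timeout_ms
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)                   # 3. still running: SIGKILL
        return proc.wait()                                # 4. wait for the process to exit


# start_new_session=True places the child in a new session and process group,
# so signalling the group does not hit the parent.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
status = stop_service(child, stop_timeout_ms=2_000)
print(status)  # on POSIX, a negative value means "terminated by that signal"
```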

---

## Health Check Defaults

The `[health]` section is **optional**. If omitted, no health checks are performed.

### Health Check Types

Health checks are a tagged union: the `type` field selects the check (`"http"`, `"tcp"`, or `"exec"`), and `target` is interpreted according to that type:

#### HTTP Health Check

```toml
[health]
type = "http"
target = "http://localhost:8080/health"
expect_status = 200
interval_ms = 10000
timeout_ms = 5000
retries = 3
start_period_ms = 0
```

#### TCP Health Check

```toml
[health]
type = "tcp"
target = "localhost:5432"
interval_ms = 10000
timeout_ms = 5000
retries = 3
start_period_ms = 0
```

#### Command Execution Health Check

```toml
[health]
type = "exec"
target = "/usr/bin/healthcheck.sh"
interval_ms = 10000
timeout_ms = 5000
retries = 3
start_period_ms = 0
```

### Health Check Common Fields

All health check types share these common fields:

| Field             | Type    | Default | Description                                             |
| ----------------- | ------- | ------- | ------------------------------------------------------- |
| `interval_ms`     | integer | `10000` | How often to run the check (10 seconds)                 |
| `timeout_ms`      | integer | `5000`  | Maximum time for check to complete (5 seconds)          |
| `retries`         | integer | `3`     | Consecutive failures before marking unhealthy           |
| `start_period_ms` | integer | `0`     | Grace period before first check (0 = check immediately) |

### HTTP-Specific Fields

| Field           | Type    | Default | Description                          |
| --------------- | ------- | ------- | ------------------------------------ |
| `expect_status` | integer | `200`   | Expected HTTP status code for health |

### Health Check Behavior

- **Grace period**: First check doesn't run until `start_period_ms` has elapsed
- **Healthy → Unhealthy**: Requires `retries` consecutive failures
- **Unhealthy effect**: Service marked unhealthy but supervisor doesn't auto-restart (informational only)
- **Missing field**: If a required field for the check type is missing, health check is ignored
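
The `retries` rule amounts to counting consecutive failures. A minimal sketch follows; the `HealthTracker` class is hypothetical, and the assumption that a single passing check restores the healthy state is not stated in this document.

```python
class HealthTracker:
    """Tracks consecutive failures and flips to unhealthy after `retries` of them."""

    def __init__(self, retries: int = 3):
        self.retries = retries
        self.consecutive_failures = 0
        self.healthy = True

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the current health status."""
        if check_passed:
            # Assumption: one success resets the counter and restores health.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.retries:
                self.healthy = False
        return self.healthy


tracker = HealthTracker(retries=3)
results = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [True, True, True, True, True, False]
```

The passing check in the third position resets the counter, so the service only becomes unhealthy after the three consecutive failures at the end.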

---

## Logging Defaults

The `[logging]` section controls in-memory buffering and optional log persistence.

| Field          | Type    | Default | Description                                                  |
| -------------- | ------- | ------- | ------------------------------------------------------------ |
| `buffer_lines` | integer | `1000`  | Number of recent log lines kept in memory                    |
| `file`         | string  | `null`  | Optional file path to write logs to                          |
| `forward`      | string  | `null`  | Optional destination to forward logs to (syslog, HTTP, etc.) |

### Logging Behavior

- **In-memory buffer**: Always kept, contains last N lines. Can be retrieved via `service.logs` RPC
- **File logging**: If specified, all output is also written to this file
- **Log forwarding**: If specified, logs are forwarded to external system (future feature)
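
A bounded in-memory buffer of the last N lines is commonly built on a ring buffer. The sketch below uses `collections.deque` to show the behavior of `buffer_lines`; the `LogBuffer` class is hypothetical and says nothing about how zinit stores its buffer internally.

```python
from collections import deque


class LogBuffer:
    """Keeps only the most recent `buffer_lines` lines."""

    def __init__(self, buffer_lines: int = 1000):
        # A deque with maxlen silently drops the oldest entry when full.
        self._lines = deque(maxlen=buffer_lines)

    def append(self, line: str) -> None:
        self._lines.append(line)

    def lines(self) -> list:
        """Return the buffered lines, oldest first."""
        return list(self._lines)


buf = LogBuffer(buffer_lines=3)
for i in range(5):
    buf.append(f"line {i}")
print(buf.lines())  # ['line 2', 'line 3', 'line 4']
```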

---

## Metrics Collection

Service metrics are collected on-demand via the `service.stats` RPC endpoint.

### Available Metrics

| Metric         | Type | When Available  | Description                                                       |
| -------------- | ---- | --------------- | ----------------------------------------------------------------- |
| `pid`          | u32  | Always          | Process ID of running service (0 if not running)                  |
| `memory_bytes` | u64  | Only if running | Memory usage in bytes from `/proc` stats                          |
| `cpu_percent`  | f32  | Only if running | CPU usage percentage (currently returns 0.0, reserved for future) |

**Note**: Metrics are collected on-demand when requested, not continuously tracked.
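
For illustration, an on-demand memory reading on Linux could look like the sketch below. Which `/proc` file zinit actually reads is not specified here; this example assumes `/proc/<pid>/statm`, whose second field is the resident set size in pages. Both function names are hypothetical.

```python
import os


def rss_bytes_from_statm(statm_text: str, page_size: int) -> int:
    """Convert the resident-pages field of /proc/<pid>/statm to bytes."""
    resident_pages = int(statm_text.split()[1])
    return resident_pages * page_size


def service_stats(pid: int) -> dict:
    """Collect stats at call time; memory and CPU only when the process exists."""
    stats = {"pid": pid}
    statm_path = f"/proc/{pid}/statm"
    if pid != 0 and os.path.exists(statm_path):
        with open(statm_path) as f:
            stats["memory_bytes"] = rss_bytes_from_statm(
                f.read(), os.sysconf("SC_PAGE_SIZE")
            )
        stats["cpu_percent"] = 0.0  # placeholder, matching the table above
    return stats


print(service_stats(0))  # {'pid': 0}
```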

---

## Minimal Service Example

The absolute minimum required to create a service:

```toml
[service]
name = "my-app"
exec = "/usr/bin/my-app"
```

This uses **all defaults** for everything else:

- `dir`: null (inherit working directory)
- `oneshot`: false
- `env`: {} (no custom env vars)
- `status`: "start" (supervisor keeps it running)
- `class`: "user"
- `critical`: false
- **Lifecycle defaults**: `on_failure` restart, backoff from 1 second up to 300 seconds, at most 10 restarts, 30s startup timeout, 10s stop timeout, SIGTERM stop signal
- **No health checks**
- **Logging**: 1000 line buffer, no file persistence

---

## Full Service Example

A service with all fields explicitly specified:

```toml
[service]
name = "web-server"
exec = "/usr/bin/nginx -g 'daemon off;'"
dir = "/var/www"
oneshot = false
status = "start"
class = "user"
critical = false

[service.env]
PORT = "8080"
DEBUG = "false"
LOG_LEVEL = "info"

[dependencies]
after = ["network-ready", "filesystem"]
requires = ["kernel"]
wants = ["metrics", "syslog"]
conflicts = ["old-nginx", "apache"]

[lifecycle]
restart = "on_failure"
restart_delay_ms = 2000
restart_delay_max_ms = 300000
max_restarts = 10
stability_period_ms = 30000
start_timeout_ms = 60000
stop_timeout_ms = 30000
stop_signal = "SIGQUIT"

[health]
type = "http"
target = "http://localhost:8080/health"
expect_status = 200
interval_ms = 5000
timeout_ms = 2000
retries = 3
start_period_ms = 10000

[logging]
buffer_lines = 500
file = "/var/log/nginx.log"
```

---

## Service Creation Flow

When a service is created via `service.set` RPC endpoint:

### Validation Steps

1. **Parse configuration** - Convert JSON/TOML to a `ServiceConfig` struct
2. **Validate fields** - Check required fields present, types valid
3. **Check executable** - Verify `exec` path exists and is executable
4. **Check dependencies** - Verify all referenced services exist (unless they're soft `wants`)
5. **Check for conflicts** - Ensure no conflicting services are currently running

### Creation Steps

1. If service with same name exists:
   - Stop the existing service (hard stop, no graceful shutdown)
   - Remove it from supervisor graph
2. Persist to disk:
   - Write service config as TOML file to disk
   - Location: Determined by `ZINIT_CONFIG_DIR` env var (default: `~/.config/zinit/services/`)
   - Filename: `{service_name}.toml`
3. Register in memory:
   - Add service to supervisor's in-memory service graph
   - Initialize service state based on `status` field
4. Auto-start if needed:
   - If `status` is `"start"` → immediately transition service to Starting state
   - If `status` is `"stop"` → keep service in Inactive state
   - If `status` is `"ignore"` → keep service in Inactive state
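
The persistence location and the initial state can be sketched as two small functions. The names `service_config_path` and `initial_state` are hypothetical; the environment variable, default directory, filename pattern, and state names are the ones listed above.

```python
import os
from pathlib import Path


def service_config_path(name: str, environ=os.environ) -> Path:
    """Resolve where {service_name}.toml is written."""
    base = environ.get("ZINIT_CONFIG_DIR")
    if base:
        config_dir = Path(base)
    else:
        # Default location: ~/.config/zinit/services/
        config_dir = Path.home() / ".config" / "zinit" / "services"
    return config_dir / f"{name}.toml"


def initial_state(status: str) -> str:
    """Map the desired status to the state a new service is placed in."""
    return "Starting" if status == "start" else "Inactive"


print(service_config_path("web", {"ZINIT_CONFIG_DIR": "/etc/zinit"}))  # /etc/zinit/web.toml
print(initial_state("ignore"))  # Inactive
```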

### Persistence

- **All services are persisted to disk** when created via `service.set`
- Services survive supervisor restart (reload from disk on startup)
- Configuration files are human-readable TOML

### Error Handling

If any validation fails, the entire operation is aborted:

- Service is not created
- Nothing is written to disk
- Error is returned to client with description

---

## Service State Machine

Once created, services follow this state machine:

```
Inactive
  ↓
Starting → Running → Stopping → Exited
  ↓           ↓         ↓          ↓
Failed ← ← ← Blocked ← ← ← ← ← ← ← ↓
```

- **Inactive**: Service not currently running
- **Blocked**: Dependencies not met or conflicts present
- **Starting**: Transitioning to Running, within `start_timeout_ms`
- **Running**: Service is executing
- **Stopping**: Graceful shutdown in progress, within `stop_timeout_ms`
- **Exited**: Service exited with success (exit code 0)
- **Failed**: Service exited with failure (non-zero exit code)
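
The state names can be modeled as an enumeration, with the exit code deciding between the two terminal states. This is a sketch only: the `ServiceState` type and `state_after_exit` helper are hypothetical, and the remaining transitions in the diagram are not encoded here.

```python
from enum import Enum


class ServiceState(Enum):
    INACTIVE = "Inactive"
    BLOCKED = "Blocked"
    STARTING = "Starting"
    RUNNING = "Running"
    STOPPING = "Stopping"
    EXITED = "Exited"
    FAILED = "Failed"


def state_after_exit(exit_code: int) -> ServiceState:
    """Exit code 0 maps to Exited; any non-zero code maps to Failed."""
    return ServiceState.EXITED if exit_code == 0 else ServiceState.FAILED


print(state_after_exit(0).value)  # Exited
print(state_after_exit(1).value)  # Failed
```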

---

## Validation Rules

Services are validated according to these rules:

| Rule                     | Behavior                                                                                      |
| ------------------------ | --------------------------------------------------------------------------------------------- |
| Duplicate name           | Rejected - service names must be unique                                                       |
| Missing required fields  | Rejected - name and exec required                                                             |
| Nonexistent executable   | Rejected - exec path must be valid and executable                                             |
| Circular dependencies    | Rejected - dependency graphs must be acyclic                                                  |
| Nonexistent dependencies | Partially allowed - `wants` can reference missing services, but `after` and `requires` cannot |
| Self-reference           | Rejected - service cannot depend on itself                                                    |
| Unknown fields           | Ignored - unknown TOML fields are skipped without error                                       |
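
The dependency-related rules (self-reference, nonexistent dependencies, cycles) can be checked with a small graph walk. The sketch below is an assumption-laden illustration: `validate_dependencies` is a hypothetical helper, and it treats `after`, `requires`, and `wants` as graph edges for cycle detection, which this document does not specify.

```python
HARD_FIELDS = ("after", "requires")          # must reference existing services
EDGE_FIELDS = ("after", "requires", "wants")  # assumed to form the dependency graph


def _edges(deps: dict) -> set:
    return {d for field in EDGE_FIELDS for d in deps.get(field, [])}


def validate_dependencies(name: str, deps: dict, existing: dict) -> list:
    """Return a list of error strings; an empty list means the service is valid.

    existing -- map of registered service name -> its [dependencies] table
    """
    errors = []
    if name in _edges(deps) or name in deps.get("conflicts", []):
        errors.append("self-reference")
    for field in HARD_FIELDS:
        for dep in deps.get(field, []):
            if dep != name and dep not in existing:
                errors.append(f"nonexistent dependency in {field}: {dep}")
    # Cycle check: depth-first walk from the new service's dependencies;
    # reaching `name` again means the graph would no longer be acyclic.
    graph = {n: _edges(d) for n, d in existing.items()}
    graph[name] = _edges(deps) - {name}
    stack, seen = list(graph[name]), set()
    while stack:
        node = stack.pop()
        if node == name:
            errors.append("circular dependency")
            break
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return errors


existing = {"a": {"requires": ["b"]}, "b": {"wants": ["c"]}}
print(validate_dependencies("c", {"after": ["a"]}, existing))  # ['circular dependency']
```

In the usage line, `b` already lists `c` as a soft dependency while `c` does not exist yet, so adding `c` with `after = ["a"]` closes the loop `c -> a -> b -> c`.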

---

## Practical Examples

### Simple Web Server (HTTP)

```toml
[service]
name = "web"
exec = "/usr/bin/python3 -m http.server 8000"
status = "start"

[health]
type = "http"
target = "http://localhost:8000/"
```

Uses all other defaults. Web server starts immediately, restarts on failure with exponential backoff.

### Database Service (TCP Check)

```toml
[service]
name = "postgres"
exec = "/usr/lib/postgresql/bin/postgres -D /var/lib/postgresql/data"
status = "start"

[lifecycle]
start_timeout_ms = 60000
stop_timeout_ms = 30000

[health]
type = "tcp"
target = "localhost:5432"
start_period_ms = 5000
```

Longer startup timeout, TCP health check with 5-second grace period.

### One-Shot Initialization Task

```toml
[service]
name = "setup-db"
exec = "/usr/bin/db-migrate.sh"
oneshot = true
status = "start"

[lifecycle]
restart = "never"
start_timeout_ms = 300000
```

Runs once, no auto-restart, 5-minute timeout for long-running migration.

### System Service (Protected)

```toml
[service]
name = "sshd"
exec = "/usr/sbin/sshd -D"
class = "system"
status = "start"

[lifecycle]
max_restarts = 0
restart = "always"
```

System service won't be stopped by bulk operations, and always restarts when it exits, with no restart limit (`max_restarts = 0`).