zinit 0.3.9

Process supervisor with dependency management
# Service Configuration Defaults

When creating a new service in zinit using the `service.set` RPC endpoint, fields that are not explicitly specified will use the default values documented below.

## Service Definition Defaults

The `[service]` section defines the core service properties.

### Required Fields

These fields **must** be specified when creating a service:

| Field  | Type   | Description                                                |
| ------ | ------ | ---------------------------------------------------------- |
| `name` | string | Unique service identifier                                  |
| `exec` | string | Executable command to run (full path or relative to $PATH) |

### Optional Fields with Defaults

| Field      | Type    | Default      | Description                                                                                |
| ---------- | ------- | ------------ | ------------------------------------------------------------------------------------------ |
| `dir`      | string  | `null`       | Working directory for the service. If not set, inherits parent's working directory         |
| `oneshot`  | boolean | `false`      | If true, service runs once and exits. Supervisor does not keep it running                  |
| `env`      | object  | `{}` (empty) | Environment variables to pass to the service as key-value pairs                            |
| `status`   | string  | `"start"`    | Desired state. Options: `"start"`, `"stop"`, `"ignore"`                                    |
| `class`    | string  | `"user"`     | Service classification. Options: `"user"`, `"system"`                                      |
| `critical` | boolean | `false`      | If true (PID1 only), service failure triggers emergency shell. Normal services ignore this |
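
The default-filling described above can be pictured as a simple merge. The following is a minimal sketch, not zinit's actual implementation: the `SERVICE_DEFAULTS` dict and the `with_service_defaults` helper are hypothetical names, and only the default values themselves come from the table.

```python
from copy import deepcopy

# Default values copied from the table above; the dict layout is an assumption.
SERVICE_DEFAULTS = {
    "dir": None,
    "oneshot": False,
    "env": {},
    "status": "start",
    "class": "user",
    "critical": False,
}


def with_service_defaults(service: dict) -> dict:
    """Return a copy of `service` with unspecified optional fields filled in."""
    for required in ("name", "exec"):
        if required not in service:
            raise ValueError(f"missing required field: {required}")
    merged = deepcopy(SERVICE_DEFAULTS)  # deep copy so the shared `env` dict is never mutated
    merged.update(service)               # explicitly specified fields win over defaults
    return merged
```

Calling `with_service_defaults({"name": "my-app", "exec": "/usr/bin/my-app"})` yields a config whose `status` is `"start"` and whose `class` is `"user"`.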

### Status Field

Controls how the supervisor manages the service:

- **`start`** (default) - Supervisor ensures service is running. If it crashes, it will be restarted according to restart policy
- **`stop`** - Supervisor ensures service is stopped and keeps it stopped
- **`ignore`** - Supervisor doesn't manage this service's state. No auto-restart or auto-stop. Useful for manual control via CLI

### Class Field

Controls service protection and visibility:

- **`user`** (default) - Normal service. Affected by bulk operations (e.g., `zinit stop-all` will stop this service)
- **`system`** - Protected service. Bulk operations skip system services. Only stopped explicitly or if dependencies require it
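
As an illustration of how a bulk operation might honor `class`, here is a small sketch. The `bulk_stop_targets` helper and the list-of-dicts representation are assumptions made for this example, not part of zinit.

```python
def bulk_stop_targets(services: list) -> list:
    """Return the names a bulk stop would act on, skipping protected services."""
    return [
        svc["name"]
        for svc in services
        # A missing `class` falls back to the documented default, "user".
        if svc.get("class", "user") != "system"
    ]
```

With services `web` (no class), `sshd` (`system`) and `worker` (`user`), the helper returns `["web", "worker"]`.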

---

## Dependency Defaults

The `[dependencies]` section declares relationships with other services.

All dependency fields are **optional** and default to empty lists if not specified:

| Field       | Type             | Default | Description                                                                      |
| ----------- | ---------------- | ------- | -------------------------------------------------------------------------------- |
| `after`     | array of strings | `[]`    | Services that must start BEFORE this service starts                              |
| `requires`  | array of strings | `[]`    | Hard dependencies. If any fail, this service cannot start                        |
| `wants`     | array of strings | `[]`    | Soft dependencies. Missing dependencies are ignored                              |
| `conflicts` | array of strings | `[]`    | Services that cannot run at the same time. If any are running, this cannot start |

### Dependency Semantics

- **`after`**: Purely ordering. Services listed here are started before this service, but this service does not fail if one of them later crashes
- **`requires`**: Hard dependency. If dependency fails, this service is blocked. If dependency is removed, this service is cascade-removed
- **`wants`**: Soft dependency. If dependency is missing or fails, this service starts anyway
- **`conflicts`**: Mutual exclusion. Services cannot run simultaneously. If conflicting service is running, this service is blocked
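
The blocking rules can be sketched as a single check. This is a simplified model under stated assumptions: `start_blocker` is a hypothetical helper, `states` maps service names to the state names used in this document, a required dependency blocks only when it is missing or `Failed`, and `after` ordering is not modeled.

```python
def start_blocker(deps: dict, states: dict):
    """Return a reason string if the service is blocked, or None if it may start."""
    for name in deps.get("requires", []):
        # Hard dependency: a missing or failed service blocks startup.
        if states.get(name) in (None, "Failed"):
            return f"requires {name}"
    for name in deps.get("conflicts", []):
        # Mutual exclusion: a running conflicting service blocks startup.
        if states.get(name) == "Running":
            return f"conflicts with {name}"
    # `wants` entries never block; `after` only affects ordering.
    return None
```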

---

## Lifecycle Defaults

The `[lifecycle]` section controls restart behavior, timeouts, and signals.

| Field                  | Type    | Default        | Description                                                           |
| ---------------------- | ------- | -------------- | --------------------------------------------------------------------- |
| `restart`              | string  | `"on_failure"` | When to restart. Options: `"always"`, `"on_failure"`, `"never"`       |
| `restart_delay_ms`     | integer | `1000`         | Initial delay before first restart (milliseconds)                     |
| `restart_delay_max_ms` | integer | `300000`       | Maximum delay cap for exponential backoff (5 minutes)                 |
| `max_restarts`         | integer | `10`           | Max restart attempts. 0 = unlimited                                   |
| `stability_period_ms`  | integer | `30000`        | Service must run this long before backoff counter resets (30 seconds) |
| `start_timeout_ms`     | integer | `30000`        | Maximum time allowed for service startup before timeout (30 seconds)  |
| `stop_timeout_ms`      | integer | `10000`        | Maximum time for graceful shutdown before SIGKILL (10 seconds)        |
| `stop_signal`          | string  | `"SIGTERM"`    | Signal sent during graceful shutdown                                  |

### Restart Policy

- **`always`** - Service always restarts when it exits, regardless of exit code
- **`on_failure`** (default) - Service only restarts if it exits with non-zero exit code
- **`never`** - Service never restarts. Manual restart required via CLI
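
The three policies reduce to a small decision function. The sketch below uses a hypothetical `should_restart` helper; how `max_restarts` combines with the policy (checked first, with 0 meaning unlimited) is an assumption based on the field descriptions, not confirmed behavior.

```python
def should_restart(policy: str, exit_code: int,
                   restarts_so_far: int = 0, max_restarts: int = 10) -> bool:
    """Decide whether an exited service should be restarted."""
    # Assumption: the restart budget applies to every policy; 0 means unlimited.
    if max_restarts != 0 and restarts_so_far >= max_restarts:
        return False
    if policy == "always":
        return True
    if policy == "on_failure":
        return exit_code != 0
    if policy == "never":
        return False
    raise ValueError(f"unknown restart policy: {policy!r}")
```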

### Restart Backoff Algorithm

1. Service crashes → wait `restart_delay_ms`
2. If service crashes again → wait `restart_delay_ms * 2`
3. Double delay continues up to `restart_delay_max_ms`
4. After `stability_period_ms` of stable running → reset delay to `restart_delay_ms`

**Example**: With defaults (1000ms initial, 300000ms max):

- 1st restart: 1 second
- 2nd restart: 2 seconds
- 3rd restart: 4 seconds
- 4th restart: 8 seconds
- ... continues doubling ...
- Cap at 300 seconds (5 minutes)
- If the service runs stably for 30 seconds (`stability_period_ms`), the delay resets to 1 second
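
The doubling-with-cap rule can be written as one expression. This is a sketch of the arithmetic only; the `restart_delay_ms` function name and its 1-based `attempt` parameter are illustrative choices, and the reset after `stability_period_ms` is left to the caller (restart counting from attempt 1).

```python
def restart_delay_ms(attempt: int, initial_ms: int = 1000, max_ms: int = 300_000) -> int:
    """Delay before the `attempt`-th consecutive restart (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the initial delay for each prior restart, then apply the cap.
    return min(initial_ms * 2 ** (attempt - 1), max_ms)
```

With the defaults, attempts 1 through 4 give 1000, 2000, 4000 and 8000 ms; attempt 9 gives 256000 ms, and from attempt 10 onward the delay stays at the 300000 ms cap.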

### Shutdown Flow

When supervisor stops a service:

1. Send `stop_signal` (default: SIGTERM) to process group
2. Wait up to `stop_timeout_ms` (default: 10 seconds)
3. If still running → send SIGKILL
4. Wait for process to exit
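
The same four steps can be sketched with Python's standard library on a POSIX system. This is an illustration of the flow, not zinit's code: `stop_process_group` is a hypothetical helper, and it assumes the child was started in its own process group (for example with `start_new_session=True`) and has not already been reaped.

```python
import os
import signal
import subprocess


def stop_process_group(proc: subprocess.Popen,
                       stop_signal: int = signal.SIGTERM,
                       stop_timeout_ms: int = 10_000) -> int:
    """Stop `proc` and its process group; return the child's exit status."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, stop_signal)                        # 1. send stop_signal to the group
    try:
        return proc.wait(timeout=stop_timeout_ms / 1000)  # 2. wait up to stop_timeout_ms
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)                 # 3. still running: force kill
        return proc.wait()                              # 4. wait for the process to exit
```

A child started with `subprocess.Popen(["sleep", "30"], start_new_session=True)` exits on the first signal, and `Popen.wait` reports a negative return code equal to the signal number.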

---

## Health Check Defaults

The `[health]` section is **optional**. If omitted, no health checks are performed.

### Health Check Types

Health checks are tagged unions - the `type` field selects the variant (`"http"`, `"tcp"`, or `"exec"`), and the remaining fields configure it:

#### HTTP Health Check

```toml
[health]
type = "http"
target = "http://localhost:8080/health"
expect_status = 200
interval_ms = 10000
timeout_ms = 5000
retries = 3
start_period_ms = 0
```

#### TCP Health Check

```toml
[health]
type = "tcp"
target = "localhost:5432"
interval_ms = 10000
timeout_ms = 5000
retries = 3
start_period_ms = 0
```

#### Command Execution Health Check

```toml
[health]
type = "exec"
target = "/usr/bin/healthcheck.sh"
interval_ms = 10000
timeout_ms = 5000
retries = 3
start_period_ms = 0
```

### Health Check Common Fields

All health check types share these common fields:

| Field             | Type    | Default | Description                                             |
| ----------------- | ------- | ------- | ------------------------------------------------------- |
| `interval_ms`     | integer | `10000` | How often to run the check (10 seconds)                 |
| `timeout_ms`      | integer | `5000`  | Maximum time for check to complete (5 seconds)          |
| `retries`         | integer | `3`     | Consecutive failures before marking unhealthy           |
| `start_period_ms` | integer | `0`     | Grace period before first check (0 = check immediately) |

### HTTP-Specific Fields

| Field           | Type    | Default | Description                          |
| --------------- | ------- | ------- | ------------------------------------ |
| `expect_status` | integer | `200`   | Expected HTTP status code for health |

### Health Check Behavior

- **Grace period**: First check doesn't run until `start_period_ms` has elapsed
- **Healthy → Unhealthy**: Requires `retries` consecutive failures
- **Unhealthy effect**: Service marked unhealthy but supervisor doesn't auto-restart (informational only)
- **Missing field**: If a required field for the check type is missing, health check is ignored
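
The `retries` rule amounts to counting consecutive failures. Below is a minimal sketch; the `HealthTracker` class is hypothetical, and treating a single passing check as enough to return to healthy is an assumption, since the recovery rule is not stated above.

```python
class HealthTracker:
    """Track health status from a stream of check results."""

    def __init__(self, retries: int = 3):
        self.retries = retries
        self.consecutive_failures = 0
        self.healthy = True

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the current health status."""
        if check_passed:
            # Assumption: one success clears the failure streak.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.retries:
                self.healthy = False
        return self.healthy
```

Two failures followed by a success leave the service healthy; only three failures in a row flip it to unhealthy.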

---

## Logging Defaults

The `[logging]` section controls in-memory buffering and optional log persistence.

| Field          | Type    | Default | Description                                                  |
| -------------- | ------- | ------- | ------------------------------------------------------------ |
| `buffer_lines` | integer | `1000`  | Number of recent log lines kept in memory                    |
| `file`         | string  | `null`  | Optional file path to write logs to                          |
| `forward`      | string  | `null`  | Optional destination to forward logs to (syslog, HTTP, etc.) |

### Logging Behavior

- **In-memory buffer**: Always kept, contains last N lines. Can be retrieved via `service.logs` RPC
- **File logging**: If specified, all output is also written to this file
- **Log forwarding**: If specified, logs are forwarded to external system (future feature)
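
A bounded in-memory buffer of this kind is commonly built on a ring buffer. The sketch below uses Python's `collections.deque` to show the "last N lines" behavior; the `LogBuffer` class and its `tail` method are illustrative names, not zinit's API.

```python
from collections import deque


class LogBuffer:
    """Keep only the most recent `buffer_lines` log lines in memory."""

    def __init__(self, buffer_lines: int = 1000):
        # A deque with maxlen silently drops the oldest entry when full.
        self._lines = deque(maxlen=buffer_lines)

    def append(self, line: str) -> None:
        self._lines.append(line)

    def tail(self, n=None) -> list:
        """Return the last `n` lines, or every buffered line if `n` is None."""
        lines = list(self._lines)
        if n is None:
            return lines
        return lines[-n:] if n > 0 else []
```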

---

## Metrics Collection

Service metrics are collected on-demand via the `service.stats` RPC endpoint.

### Available Metrics

| Metric         | Type | When Available  | Description                                                       |
| -------------- | ---- | --------------- | ----------------------------------------------------------------- |
| `pid`          | u32  | Always          | Process ID of running service (0 if not running)                  |
| `memory_bytes` | u64  | Only if running | Memory usage in bytes from `/proc` stats                          |
| `cpu_percent`  | f32  | Only if running | CPU usage percentage (currently returns 0.0, reserved for future) |

**Note**: Metrics are collected on-demand when requested, not continuously tracked.

---

## Minimal Service Example

The absolute minimum required to create a service:

```toml
[service]
name = "my-app"
exec = "/usr/bin/my-app"
```

This uses **all defaults** for everything else:

- `dir`: null (inherit working directory)
- `oneshot`: false
- `env`: {} (no custom env vars)
- `status`: "start" (supervisor keeps it running)
- `class`: "user"
- `critical`: false
- **Lifecycle defaults**: on_failure restart, exponential backoff from 1 second up to a 5-minute cap, at most 10 restarts, 30s startup timeout, 10s stop timeout, SIGTERM stop
- **No health checks**
- **Logging**: 1000 line buffer, no file persistence

---

## Full Service Example

A service with nearly every field explicitly specified (only `forward` under `[logging]` is omitted):

```toml
[service]
name = "web-server"
exec = "/usr/bin/nginx -g 'daemon off;'"
dir = "/var/www"
oneshot = false
status = "start"
class = "user"
critical = false

[service.env]
PORT = "8080"
DEBUG = "false"
LOG_LEVEL = "info"

[dependencies]
after = ["network-ready", "filesystem"]
requires = ["kernel"]
wants = ["metrics", "syslog"]
conflicts = ["old-nginx", "apache"]

[lifecycle]
restart = "on_failure"
restart_delay_ms = 2000
restart_delay_max_ms = 300000
max_restarts = 10
stability_period_ms = 30000
start_timeout_ms = 60000
stop_timeout_ms = 30000
stop_signal = "SIGQUIT"

[health]
type = "http"
target = "http://localhost:8080/health"
expect_status = 200
interval_ms = 5000
timeout_ms = 2000
retries = 3
start_period_ms = 10000

[logging]
buffer_lines = 500
file = "/var/log/nginx.log"
```

---

## Service Creation Flow

When a service is created via `service.set` RPC endpoint:

### Validation Steps

1. **Parse configuration** - Convert JSON/TOML to ServiceConfig struct
2. **Validate fields** - Check required fields present, types valid
3. **Check executable** - Verify `exec` path exists and is executable
4. **Check dependencies** - Verify all referenced services exist (unless they're soft `wants`)
5. **Check for conflicts** - Ensure no conflicting services are currently running

### Creation Steps

1. If service with same name exists:
   - Stop the existing service (hard stop, no graceful shutdown)
   - Remove it from supervisor graph
2. Persist to disk:
   - Write service config as TOML file to disk
   - Location: Determined by `ZINIT_CONFIG_DIR` env var (default: `~/.config/zinit/services/`)
   - Filename: `{service_name}.toml`
3. Register in memory:
   - Add service to supervisor's in-memory service graph
   - Initialize service state based on `status` field
4. Auto-start if needed:
   - If `status` is `"start"` → immediately transition service to Starting state
   - If `status` is `"stop"` → keep service in Inactive state
   - If `status` is `"ignore"` → keep service in Inactive state
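
Two pieces of this flow are easy to show in isolation: where the TOML file lands, and which state a new service starts in. The helpers below (`service_config_path`, `initial_state`) are hypothetical names for a sketch of the rules listed above, not functions exposed by zinit.

```python
import os
from pathlib import Path

DEFAULT_CONFIG_DIR = "~/.config/zinit/services/"


def service_config_path(name: str, environ=None) -> Path:
    """Resolve the on-disk location of a service's TOML file."""
    environ = os.environ if environ is None else environ
    config_dir = environ.get("ZINIT_CONFIG_DIR", DEFAULT_CONFIG_DIR)
    return Path(config_dir).expanduser() / f"{name}.toml"


def initial_state(status: str) -> str:
    """Map the desired `status` to the state a newly created service enters."""
    if status == "start":
        return "Starting"
    if status in ("stop", "ignore"):
        return "Inactive"
    raise ValueError(f"unknown status: {status!r}")
```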

### Persistence

- **All services are persisted to disk** when created via `service.set`
- Services survive supervisor restart (reload from disk on startup)
- Configuration files are human-readable TOML

### Error Handling

If any validation fails, the entire operation is aborted:

- Service is not created
- Nothing is written to disk
- Error is returned to client with description

---

## Service State Machine

Once created, services follow this state machine:

```
Inactive
  ↓
Starting → Running → Stopping → Exited
  ↓           ↓         ↓          ↓
Failed ← ← ← Blocked ← ← ← ← ← ← ← ↓
```

- **Inactive**: Service not currently running
- **Blocked**: Dependencies not met or conflicts present
- **Starting**: Transitioning to Running, within `start_timeout_ms`
- **Running**: Service is executing
- **Stopping**: Graceful shutdown in progress, within `stop_timeout_ms`
- **Exited**: Service exited with success (exit code 0)
- **Failed**: Service exited with failure (non-zero exit code)

---

## Validation Rules

Services are validated according to these rules:

| Rule                     | Behavior                                                                                      |
| ------------------------ | --------------------------------------------------------------------------------------------- |
| Duplicate name           | Rejected - service names must be unique                                                       |
| Missing required fields  | Rejected - name and exec required                                                             |
| Nonexistent executable   | Rejected - exec path must be valid and executable                                             |
| Circular dependencies    | Rejected - dependency graphs must be acyclic                                                  |
| Nonexistent dependencies | Partially allowed - `wants` can reference missing services, but `after` and `requires` cannot |
| Self-reference           | Rejected - service cannot depend on itself                                                    |
| Unknown fields           | Ignored - unknown TOML fields are skipped without error                                       |
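
The acyclicity rule can be checked with a depth-first search. The sketch below is one common way to do it and makes no claim about how zinit implements the check; `has_cycle` is a hypothetical helper, and `graph` is assumed to map each service name to the names it depends on.

```python
def has_cycle(graph: dict) -> bool:
    """Return True if the dependency graph contains a cycle (including self-reference)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = {}

    def visit(node) -> bool:
        color[node] = GRAY
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                return True  # back edge: dep is already on the current path
            if state == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node) for node in graph)
```

A service that lists itself is reported as a cycle of length one, so the self-reference rule falls out of the same check.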

---

## Practical Examples

### Simple Web Server (HTTP)

```toml
[service]
name = "web"
exec = "/usr/bin/python3 -m http.server 8000"
status = "start"

[health]
type = "http"
target = "http://localhost:8000/"
```

Uses all other defaults. Web server starts immediately, restarts on failure with exponential backoff.

### Database Service (TCP Check)

```toml
[service]
name = "postgres"
exec = "/usr/lib/postgresql/bin/postgres -D /var/lib/postgresql/data"
status = "start"

[lifecycle]
start_timeout_ms = 60000
stop_timeout_ms = 30000

[health]
type = "tcp"
target = "localhost:5432"
start_period_ms = 5000
```

Longer startup timeout, TCP health check with 5-second grace period.

### One-Shot Initialization Task

```toml
[service]
name = "setup-db"
exec = "/usr/bin/db-migrate.sh"
oneshot = true
status = "start"

[lifecycle]
restart = "never"
start_timeout_ms = 300000
```

Runs once, no auto-restart, 5-minute timeout for long-running migration.

### System Service (Protected)

```toml
[service]
name = "sshd"
exec = "/usr/sbin/sshd -D"
class = "system"
status = "start"

[lifecycle]
max_restarts = 0
restart = "always"
```

System service won't be stopped by bulk operations, always restarts if it crashes.