zinit 0.3.9

Process supervisor with dependency management
Documentation
# ADR-006: Binary Upgrade Flow for zinit-pid1

## Status

Draft

## Context

### Current State

- `SIGUSR1` to zinit-pid1 triggers `soft_restart_server()`:
  1. Calls `prepare_restart` RPC to save state
  2. Waits for zinit-server to exit
  3. Spawns new server (same binary path)

- A `SIGUSR2` handler exists as a placeholder but is not implemented

### Reference: zinit.vibed approach

In `zinit.vibed`, the updater IS pid1 (renamed). It:
- Reads config with update URL
- Periodically checks Forgejo API for new commits
- Downloads via `curl | bash` install script
- Sends SIGUSR1, waits, respawns

### Key Insight: `mv` works on running binaries

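`ETXTBSY` ("Text file busy") is raised when a running executable is opened for writing, not when it is renamed or replaced. The `rename()` syscall only swaps the directory entry, so the running process keeps using the old inode until it exits. Note that `rename()` requires source and destination to be on the same filesystem; across mount points it fails with `EXDEV`.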
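
The behaviour can be checked with a small std-only sketch. An open file handle stands in for the running process here, and the file names (`server`, `server.new`) and the helper `swap_while_open` are illustrative, not part of zinit:

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Replace `server` with `server.new` via rename() while a handle to the
/// original file is still open. Returns (contents read through the old
/// handle, contents read through the path after the swap).
fn swap_while_open(dir: &Path) -> io::Result<(String, String)> {
    let target = dir.join("server");
    let staged = dir.join("server.new");
    fs::write(&target, "old build")?;
    fs::write(&staged, "new build")?;

    // Stand-in for the running process: it keeps the old inode open.
    let mut old_handle = File::open(&target)?;

    // Both paths are in the same directory, so this is a plain rename().
    fs::rename(&staged, &target)?;

    let mut via_handle = String::new();
    old_handle.read_to_string(&mut via_handle)?;
    let via_path = fs::read_to_string(&target)?;
    Ok((via_handle, via_path))
}
```

On Linux the old handle still reads the original contents, while a fresh open of the same path sees the replacement.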

## Decision

### Enhance zinit-pid1 with update capabilities

pid1 gains:
1. Config file reading (`/etc/zinit/update.toml`)
2. Periodic update checking (configurable interval)
3. Binary download from URL
4. Swap + restart on SIGUSR1

### Config File: `/etc/zinit/update.toml`

```toml
# Update configuration
[update]
# URL to check for new binary (direct download link)
url = "https://releases.example.com/zinit/zinit-server"

# Optional: URL for checksum file
checksum_url = "https://releases.example.com/zinit/zinit-server.sha256"

# Check interval in seconds (0 = disable periodic checks)
interval = 300

# Staging directory
staging_dir = "/var/cache/system"
```

### Upgrade Flow

**On SIGUSR1:**
1. Read config file
2. If `url` configured: download new binary to staging
3. Verify checksum (if `checksum_url` configured)
4. Backup: `mv /usr/bin/zinit-server /usr/bin/zinit-server.old`
5. Install: `mv <staging_dir>/zinit-server /usr/bin/zinit-server`
6. `prepare_restart` RPC to save state
7. Wait for server exit
8. Spawn new server

**On failure at any step:**
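- Restore the `.old` backup if the backup step already ran in this attempt (an older `.old` from a previous upgrade must not be reinstated)
- Log the error
- Restart the existing binary so the system is never left without a supervisor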
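
A minimal sketch of the backup, install, and restore steps, using only `std`. The helper name `install_with_backup` and its three path parameters are assumptions for illustration; the point is that the backup is restored only when this call created it:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Swap `staged` into place at `installed`, keeping the previous file at
/// `backup`. The backup is restored only if this call created it, so a
/// failure never reinstates a stale backup from an earlier upgrade.
fn install_with_backup(staged: &Path, installed: &Path, backup: &Path) -> io::Result<()> {
    let backed_up = if installed.exists() {
        fs::rename(installed, backup)?;
        true
    } else {
        false
    };

    if let Err(err) = fs::rename(staged, installed) {
        if backed_up {
            // Best effort: put the previous binary back before reporting.
            let _ = fs::rename(backup, installed);
        }
        return Err(err);
    }
    Ok(())
}
```

Between the two renames there is a short window in which `installed` does not exist, so the caller should not respawn the server until this function has returned.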

**Crash detection:**
- Track crashes since last upgrade
- If the new binary crashes more than 3 times within 60 seconds of the upgrade, roll back to `.old`

### Periodic Checks (optional)

If `interval > 0`:
- Every N seconds, fetch checksum URL
- Compare with installed binary checksum
- If different → trigger upgrade (same as SIGUSR1)
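
The comparison step could look roughly like the sketch below. It assumes the checksum file holds either a bare hex digest or `sha256sum`-style output, and that the digest of the installed binary has already been computed. `parse_checksum` and `update_needed` are hypothetical helpers:

```rust
/// Extract the hex digest from a checksum file body, which may be either
/// "<hex>" or "<hex>  <filename>" (sha256sum output).
fn parse_checksum(body: &str) -> Option<&str> {
    let digest = body.split_whitespace().next()?;
    let is_sha256_hex = digest.len() == 64 && digest.bytes().all(|b| b.is_ascii_hexdigit());
    if is_sha256_hex {
        Some(digest)
    } else {
        None
    }
}

/// True when the published digest is well-formed and differs from the
/// digest of the installed binary. Malformed input never triggers an upgrade.
fn update_needed(remote_body: &str, installed_hex: &str) -> bool {
    match parse_checksum(remote_body) {
        Some(remote) => !remote.eq_ignore_ascii_case(installed_hex),
        None => false,
    }
}
```

Rejecting anything that is not a 64-character hex string means an HTML error page served at the checksum URL does not start an upgrade.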

### pid1 self-update: Reboot

Updating pid1 itself requires a reboot; replacing it on a live system is too risky.

## Implementation

### File: `zinit-pid1/Cargo.toml`

Add dependencies:
```toml
[dependencies]
ureq = { version = "2.12", features = ["tls"] }  # HTTP client
sha2 = "0.10"                                     # Checksum
toml = "0.8"                                      # Config parsing
serde = { version = "1.0", features = ["derive"] }
```

### File: `zinit-pid1/src/main.rs`

#### 1. Add config types

```rust
use serde::Deserialize;

const UPDATE_CONFIG_PATH: &str = "/etc/zinit/update.toml";
const SERVER_INSTALL_PATH: &str = "/usr/bin/zinit-server";

#[derive(Debug, Default, Deserialize)]
struct UpdateConfig {
    #[serde(default)]
    update: UpdateSettings,
}

#[derive(Debug, Deserialize)]
struct UpdateSettings {
    /// URL to download zinit-server binary
    url: Option<String>,
    /// URL to download checksum file
    checksum_url: Option<String>,
    /// Check interval in seconds (0 = disable)
    #[serde(default = "default_interval")]
    interval: u64,
    /// Staging directory
    #[serde(default = "default_staging_dir")]
    staging_dir: String,
}

fn default_interval() -> u64 { 0 }  // Disabled by default
fn default_staging_dir() -> String { "/var/cache/system".to_string() }

impl Default for UpdateSettings {
    fn default() -> Self {
        Self {
            url: None,
            checksum_url: None,
            interval: 0,
            staging_dir: default_staging_dir(),
        }
    }
}
```

#### 2. Add crash tracking for rollback

```rust
use std::time::Instant;

struct CrashTracker {
    /// Time of last upgrade
    last_upgrade: Option<Instant>,
    /// Crash count since upgrade
    crashes_since_upgrade: u32,
}

impl CrashTracker {
    fn new() -> Self {
        Self { last_upgrade: None, crashes_since_upgrade: 0 }
    }

    fn record_upgrade(&mut self) {
        self.last_upgrade = Some(Instant::now());
        self.crashes_since_upgrade = 0;
    }

    fn record_crash(&mut self) -> bool {
        self.crashes_since_upgrade += 1;
        // If >3 crashes within 60s of upgrade, trigger rollback
        if let Some(t) = self.last_upgrade {
            if t.elapsed().as_secs() < 60 && self.crashes_since_upgrade > 3 {
                return true; // Should rollback
            }
        }
        false
    }
}
```
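
To make the threshold concrete, here is a variant of the same rule that takes the time since the upgrade as a parameter, which keeps it easy to test. The function `should_rollback` and both constants are illustrative names, not existing zinit code:

```rust
use std::time::Duration;

const ROLLBACK_WINDOW: Duration = Duration::from_secs(60);
const MAX_CRASHES: u32 = 3;

/// Same rule as `record_crash`, with the time since the upgrade passed in.
fn should_rollback(since_upgrade: Option<Duration>, crashes: u32) -> bool {
    match since_upgrade {
        Some(elapsed) => elapsed < ROLLBACK_WINDOW && crashes > MAX_CRASHES,
        None => false, // no upgrade recorded, nothing to roll back to
    }
}
```

Under this rule the fourth crash inside the window triggers the rollback. With the 1-second respawn delay used in the main loop, four crashes fit well within 60 seconds.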

#### 3. Add download and upgrade functions

```rust
use std::path::Path;

fn load_update_config() -> UpdateConfig {
    std::fs::read_to_string(UPDATE_CONFIG_PATH)
        .ok()
        .and_then(|s| toml::from_str(&s).ok())
        .unwrap_or_default()
}

fn download_file(url: &str, dest: &Path) -> Result<(), String> {
    tracing::info!(url = url, dest = %dest.display(), "downloading");

    let response = ureq::get(url)
        .call()
        .map_err(|e| format!("HTTP request failed: {}", e))?;

    let mut reader = response.into_reader();
    let mut file = std::fs::File::create(dest)
        .map_err(|e| format!("failed to create file: {}", e))?;

    std::io::copy(&mut reader, &mut file)
        .map_err(|e| format!("failed to write file: {}", e))?;

    // Make executable
    use std::os::unix::fs::PermissionsExt;
    let mut perms = file.metadata().map_err(|e| e.to_string())?.permissions();
    perms.set_mode(0o755);
    std::fs::set_permissions(dest, perms).map_err(|e| e.to_string())?;

    Ok(())
}

fn verify_checksum(binary: &Path, expected: &str) -> Result<(), String> {
    use sha2::{Sha256, Digest};
    use std::io::Read;

    let mut file = std::fs::File::open(binary)
        .map_err(|e| format!("failed to open binary: {}", e))?;
    let mut hasher = Sha256::new();
    let mut buffer = [0u8; 8192];
    loop {
        let n = file.read(&mut buffer).map_err(|e| e.to_string())?;
        if n == 0 { break; }
        hasher.update(&buffer[..n]);
    }
    let actual = format!("{:x}", hasher.finalize());

    // Expected format: "abc123... filename" or just "abc123..."
    let expected = expected.split_whitespace().next().unwrap_or("");

    // Hex digests may be published in either case
    if !actual.eq_ignore_ascii_case(expected) {
        return Err(format!("checksum mismatch: expected {}, got {}", expected, actual));
    }
    Ok(())
}

/// Download, verify, and install new binary. Returns true if upgraded.
fn try_upgrade(config: &UpdateSettings) -> Result<bool, String> {
    let url = match &config.url {
        Some(u) => u,
        None => return Ok(false), // No URL configured
    };

    let staging_dir = Path::new(&config.staging_dir);
    let staged_binary = staging_dir.join("zinit-server");
    let installed = Path::new(SERVER_INSTALL_PATH);
    let backup = installed.with_extension("old");

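    // Ensure the staging directory exists, then download the new binary
    std::fs::create_dir_all(staging_dir)
        .map_err(|e| format!("failed to create staging dir: {}", e))?;
    download_file(url, &staged_binary)?;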

    // Verify checksum if configured
    if let Some(checksum_url) = &config.checksum_url {
        let checksum_response = ureq::get(checksum_url)
            .call()
            .map_err(|e| format!("failed to fetch checksum: {}", e))?;
        let expected = checksum_response.into_string()
            .map_err(|e| format!("failed to read checksum: {}", e))?;

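        if let Err(e) = verify_checksum(&staged_binary, &expected) {
            // Do not leave an unverified binary in staging
            let _ = std::fs::remove_file(&staged_binary);
            return Err(e);
        }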
        tracing::info!("checksum verified");
    }

    // Backup current binary
    if installed.exists() {
        std::fs::rename(installed, &backup)
            .map_err(|e| format!("failed to backup: {}", e))?;
        tracing::info!(backup = %backup.display(), "backed up current binary");
    }

    // Install new binary
    match std::fs::rename(&staged_binary, installed) {
        Ok(()) => {
            tracing::info!("installed new zinit-server binary");
            Ok(true)
        }
        Err(e) => {
            // Restore backup
            if backup.exists() {
                let _ = std::fs::rename(&backup, installed);
            }
            Err(format!("failed to install: {}", e))
        }
    }
}

/// Rollback to .old binary
fn rollback_binary() -> bool {
    let installed = Path::new(SERVER_INSTALL_PATH);
    let backup = installed.with_extension("old");

    if backup.exists() {
        tracing::warn!("rolling back to previous binary");
        if std::fs::rename(&backup, installed).is_ok() {
            return true;
        }
    }
    false
}
```

#### 4. Modify soft_restart_server

```rust
fn soft_restart_server(
    server_pid: &mut Option<Pid>,
    pid1_mode: bool,
    crash_tracker: &mut CrashTracker,
) {
    if let Some(pid) = *server_pid {
        tracing::info!(pid = pid.as_raw(), "Initiating soft restart of zinit-server");

        // Load config and try upgrade
        let config = load_update_config();
        match try_upgrade(&config.update) {
            Ok(true) => {
                tracing::info!("new binary installed, proceeding with restart");
                crash_tracker.record_upgrade();
            }
            Ok(false) => {
                tracing::debug!("no upgrade available or configured");
            }
            Err(e) => {
                tracing::error!(error = %e, "upgrade failed, restarting existing binary");
            }
        }

        // Call prepare_restart RPC to save state
        // ... (existing code)
    }

    // Spawn new server
    *server_pid = spawn_server(pid1_mode);
}
```

#### 5. Modify main loop for crash detection

```rust
// In reap_zombies or main loop, when server dies:
if server_pid.is_none() {
    if crash_tracker.record_crash() {
        // Too many crashes after upgrade, rollback
        if rollback_binary() {
            tracing::warn!("rolled back after repeated crashes");
            crash_tracker.crashes_since_upgrade = 0;
        }
    }

    tracing::error!("zinit-server died, respawning in 1s...");
    thread::sleep(Duration::from_secs(1));
    server_pid = spawn_server(pid1_server_mode);
}
```

#### 6. Optional: Periodic update checks

```rust
// In main loop:
let config = load_update_config();
let mut last_update_check = Instant::now();

loop {
    // ... existing signal handling ...

    // Periodic update check
    if config.update.interval > 0
        && last_update_check.elapsed().as_secs() >= config.update.interval
    {
        // Check if update available (compare checksums)
        if let Some(ref url) = config.update.checksum_url {
            // ... fetch and compare, trigger SIGUSR1 if different
        }
        last_update_check = Instant::now();
    }
}
```
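
The timing part of the loop can be factored into a small helper, sketched below. `UpdateTimer` is a hypothetical type, and passing `now` in explicitly is a choice made here so the logic can be exercised without sleeping:

```rust
use std::time::{Duration, Instant};

/// Tracks when the next periodic update check is due.
struct UpdateTimer {
    interval: Duration,
    last_check: Instant,
}

impl UpdateTimer {
    fn new(interval_secs: u64, now: Instant) -> Self {
        Self {
            interval: Duration::from_secs(interval_secs),
            last_check: now,
        }
    }

    /// Returns true at most once per interval; an interval of 0 disables checks.
    fn due(&mut self, now: Instant) -> bool {
        if self.interval.is_zero() {
            return false;
        }
        if now.duration_since(self.last_check) >= self.interval {
            self.last_check = now;
            return true;
        }
        false
    }
}
```

In the main loop this would be called as `timer.due(Instant::now())` on each iteration. As in the fragment above, the interval is read once at startup, so a changed `interval` in the config file only takes effect after pid1 restarts unless the config is reloaded.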

## Protected Services (restart_all)

Services that should NOT be affected by bulk restart operations:

| Service or selector | Reason |
|---------|--------|
| `class = "system"` | Already protected by class system |
| `oneshot = true` | Already completed, no point restarting |
| udevd | Kernel device management |
| network/dhcpcd | Network connectivity |
| myfs/fuse mounts | Remote filesystem access |
| sysvol | Local storage |

**Already implemented**: `start_all`, `stop_all`, `delete_all` skip `class = "system"` services.

**If adding `restart_all`**: follow the same pattern, skipping `class = "system"` services and oneshots.
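
A rough sketch of that filter follows. The `Service` struct is a simplified stand-in for zinit's real service definition, and the `"user"` class used in the usage note is an assumed example value:

```rust
/// Hypothetical, simplified view of a service definition.
struct Service {
    name: String,
    class: String,
    oneshot: bool,
}

/// Names of services a bulk restart may touch: everything except
/// system-class services and oneshots.
fn restart_candidates(services: &[Service]) -> Vec<&str> {
    services
        .iter()
        .filter(|s| s.class != "system" && !s.oneshot)
        .map(|s| s.name.as_str())
        .collect()
}
```

Given `udevd` (class `system`), a oneshot `migrate`, and a long-running `web` service of class `user`, only `web` is returned.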

## Files to Modify

| File | Change |
|------|--------|
| `zinit-pid1/Cargo.toml` | Add `ureq`, `sha2`, `toml`, `serde` |
| `zinit-pid1/src/main.rs` | Add config, download, upgrade, rollback, crash tracking |

## What We Might Be Forgetting

1. **State file format changes** - New binary might not read old state file correctly
   - Mitigation: Version field in the state file, with backward compatibility for older versions

2. **Filesystem not mounted** - `/var/cache/system` might not exist if sysvol hasn't run
   - Mitigation: Check mount before upgrade, fallback to `/tmp`

3. **Network not up** - Can't download if network isn't ready
   - Mitigation: Retry logic, or only upgrade after network service is running

4. **Disk full** - Download fails partway through
   - Mitigation: Download to temp file, rename atomically

5. **Signature verification** - SHA256 only verifies integrity, not authenticity
   - Future: Add GPG/minisign signature verification
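
For item 4, the temp-file-then-rename mitigation might be sketched as follows. The `.partial` suffix and the helper `write_atomically` are assumptions; a real download would stream from the HTTP response rather than take a byte slice:

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to a temporary sibling of `dest`, flush it to disk, then
/// rename it into place so `dest` is never observed half-written.
fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".partial");
    let tmp = PathBuf::from(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // surfaces write errors such as a full disk before the rename
        fs::rename(&tmp, dest)
    })();

    if result.is_err() {
        let _ = fs::remove_file(&tmp); // do not leave a partial file behind
    }
    result
}
```

Because the temporary file sits next to `dest`, the final rename stays within one filesystem.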

## Consequences

### Positive
- Self-updating system from URL
- Automatic rollback on crash
- Config-driven, flexible
- No manual intervention needed

### Negative
- Adds HTTP client to pid1 (~1MB increase)
- More complexity in pid1
- Network dependency for updates

### Mitigations
- Updates are optional (disabled by default)
- All failures gracefully fall back to existing binary
- Crash detection prevents boot loops