# 08 - Live Binary Updates

## Status

**Proposed** - To be implemented after core functionality is stable.

## Context

zinit consists of two processes:
- **zinit-pid1**: The init process (PID 1), minimal, spawns and monitors zinit-server
- **zinit-server**: The process supervisor, handles service lifecycle, dependency graph, socket activation

Both need to support in-place updates without losing track of running services or dropping connections.

### Constraints

- PID 1 cannot exit on VM/bare-metal (kernel panic)
- Running services must continue uninterrupted
- Open sockets (RPC, socket-activated) should not be dropped
- State consistency: no "lost" processes after update

## Decision

### zinit-pid1: Minimal state via argv

pid1 has almost no state. On self-update:

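```rust
use std::{convert::Infallible, ffi::CString, os::unix::ffi::OsStrExt, path::Path};
use nix::{errno::Errno, unistd::{execv, Pid}};

/// Replace the running pid1 image with `new_binary`, keeping PID 1.
/// Returns only on failure: a panic in PID 1 would take the system down,
/// so errors are handed back to the caller instead of unwrapped.
fn exec_new_pid1(new_binary: &Path, server_pid: Pid) -> nix::Result<Infallible> {
    // Build the path from raw bytes so a non-UTF-8 path cannot panic
    let path = CString::new(new_binary.as_os_str().as_bytes()).map_err(|_| Errno::EINVAL)?;
    // Pass server PID via argv
    let args = [
        path.clone(),
        CString::new("--adopt-server").unwrap(), // literal, contains no NUL
        CString::new(server_pid.to_string()).unwrap(), // digits only, no NUL
    ];

    // exec replaces process image, keeps PID 1; never returns on success
    execv(&path, &args)
}
```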

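The new pid1 starts with `--adopt-server <pid>` and monitors that PID instead of spawning a fresh server. Because `execv()` keeps the process ID, zinit-server remains a child of PID 1 and can still be reaped with `waitpid()`.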

**Trigger:** `SIGUSR2` to pid1

**Sequence:**
1. pid1 receives SIGUSR2
2. Verify new binary exists and is executable
3. `execv()` into new binary with `--adopt-server <server_pid>`
4. New pid1 resumes monitoring
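
On the receiving side, the new pid1 has to pull the PID back out of its arguments. The following is a minimal sketch using only the standard library; the helper name `parse_adopt_server` and the choice to treat a missing or malformed value as "no adoption" are assumptions, not part of the design above.

```rust
/// Hypothetical helper: extract the PID passed via `--adopt-server <pid>`.
/// Returns `None` when the flag is absent or the value is not a usable PID,
/// in which case pid1 would fall back to spawning a fresh server.
fn parse_adopt_server<I, S>(args: I) -> Option<i32>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        if arg.as_ref() == "--adopt-server" {
            // The PID is the argument immediately after the flag.
            let pid: i32 = iter.next()?.as_ref().parse().ok()?;
            // PID 1 is pid1 itself, and PIDs are never zero or negative.
            return (pid > 1).then_some(pid);
        }
    }
    None
}
```

In pid1's startup path this could be called as `parse_adopt_server(std::env::args().skip(1))`.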

### zinit-server: Serialize + FD passing

Server has significant state. On update:

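```rust
use std::collections::HashMap;
use serde::{Deserialize, Serialize};

/// Everything the server writes out before a restart.
#[derive(Serialize, Deserialize)]
struct PersistentState {
    services: HashMap<String, ServiceSnapshot>,
    boot_time: u64,
    // FD numbers are not stored here; they travel separately via `ZINIT_FDS`
}

/// Per-service state needed to resume supervision.
#[derive(Serialize, Deserialize)]
struct ServiceSnapshot {
    name: String,
    state: ServiceState, // must itself derive Serialize + Deserialize
    pid: Option<i32>,    // None when the service has no running process
    restart_count: u32,
    current_restart_delay_ms: u64,
    last_exit_code: Option<i32>,
}
```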

**Trigger:** `SIGUSR1` to pid1 (which signals server to prepare, then restarts it)

**Sequence:**
1. pid1 receives SIGUSR1
2. pid1 sends `PrepareRestart` RPC to server
3. Server serializes state to `/run/zinit/state.json`
4. Server prepares FDs (clears CLOEXEC on sockets to keep)
5. Server encodes FD map to env: `ZINIT_FDS={"rpc":5,"svc_foo_stdout":7}`
6. Server exits cleanly
7. pid1 spawns new server binary
8. New server detects `/run/zinit/state.json`, restores state
9. New server restores FDs from `ZINIT_FDS` env
10. New server verifies PIDs still exist, adjusts state if needed
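11. Cleanup: remove state file

Steps 4, 5, and 9 only work if the FDs outlive the old server. Descriptors and environment variables carry over across `exec()` in the same process, or from parent to child at spawn time, but not from a process that has exited to one that pid1 spawns afterwards. For the sequence to hold, either the old server `exec()`s the new binary itself (which also keeps its PID and its child services), or it hands the FDs to pid1 before exiting in step 6, for example over a Unix socket with `SCM_RIGHTS`.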

## Approaches Considered

### 1. Serialize to file/memfd → exec → reload

Dump state to JSON/bincode before exec, reload after.

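```rust
use std::fs;

const STATE_PATH: &str = "/run/zinit/state.json";

// Before exec: write to a temp file, then rename, so a reader never sees a partial file
fn save_state(state: &SupervisorState) -> std::io::Result<()> {
    let snapshot = build_snapshot(state);
    let tmp = format!("{STATE_PATH}.tmp");
    fs::write(&tmp, serde_json::to_vec(&snapshot)?)?;
    fs::rename(&tmp, STATE_PATH)
}

// After exec: returns None if there is no state file or it does not parse
fn try_restore_state() -> Option<SupervisorState> {
    let json = fs::read_to_string(STATE_PATH).ok()?;
    // Remove first so a corrupt snapshot is not retried on every start
    fs::remove_file(STATE_PATH).ok();
    serde_json::from_str(&json).ok()
}
```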

**Pros:** Simple, debuggable, works for any serializable state.

**Cons:** Race window between serialize and exec. Non-serializable state (FDs) lost.

### 2. FD passing (keep sockets open across exec)

File descriptors survive `exec()` unless marked `CLOEXEC`.

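```rust
use std::{collections::HashMap, os::unix::io::RawFd};
use nix::fcntl::{fcntl, FcntlArg, FdFlag};

// Before exec: clear CLOEXEC on FDs to keep
// (assumes a nix version whose `fcntl` accepts a `RawFd`)
fn prepare_fds_for_exec(fds: &[(&str, RawFd)]) -> nix::Result<()> {
    for (_name, fd) in fds {
        // F_GETFD returns the raw flag bits as an integer
        let mut flags = FdFlag::from_bits_truncate(fcntl(*fd, FcntlArg::F_GETFD)?);
        flags.remove(FdFlag::FD_CLOEXEC);
        fcntl(*fd, FcntlArg::F_SETFD(flags))?;
    }

    let fd_map: HashMap<&str, RawFd> = fds.iter().map(|(n, f)| (*n, *f)).collect();
    // `set_var` is not thread-safe (and is `unsafe` in the 2024 edition);
    // call this only when no other thread touches the environment
    std::env::set_var("ZINIT_FDS", serde_json::to_string(&fd_map).unwrap());
    Ok(())
}

// After exec: restore FDs
fn restore_fds() -> Option<HashMap<String, RawFd>> {
    let fd_json = std::env::var("ZINIT_FDS").ok()?;
    serde_json::from_str(&fd_json).ok()
}
```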

**Pros:** No reconnection needed, no data loss in pipes.

**Cons:** Only works for FDs, not arbitrary state.

### 3. Shared memory (mmap + memfd)

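Create a memfd-backed `mmap` region. The memfd survives exec as long as `CLOEXEC` is cleared; the mapping itself does not, so the new binary has to `mmap` the FD named in `ZINIT_SHM_FD` again.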

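```rust
use std::{error::Error, mem::size_of, os::unix::io::{IntoRawFd, RawFd}, sync::atomic::AtomicI32};

#[repr(C)]
struct SharedState {
    magic: u64,
    version: u64,
    server_pid: AtomicI32,
    services: [SharedServiceState; 256],
}

fn init_shared_state() -> Result<(*mut SharedState, RawFd), Box<dyn Error>> {
    let size = size_of::<SharedState>();
    let memfd = memfd::MemfdOptions::new()
        .close_on_exec(false)
        .create("zinit-shared-state")?;
    memfd.as_file().set_len(size as u64)?;

    // Leak the FD on purpose so it stays open across exec
    let fd = memfd.into_file().into_raw_fd();
    // Raw libc call, since the nix `mmap` signature differs between versions
    let addr = unsafe {
        libc::mmap(std::ptr::null_mut(), size, libc::PROT_READ | libc::PROT_WRITE, libc::MAP_SHARED, fd, 0)
    };
    if addr == libc::MAP_FAILED {
        return Err(std::io::Error::last_os_error().into());
    }

    std::env::set_var("ZINIT_SHM_FD", fd.to_string());
    Ok((addr.cast(), fd))
}
```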

**Pros:** Zero serialization overhead, lock-free with atomics.

**Cons:** Fixed-size only (no Vec/String/HashMap), `repr(C)` constraints, version migration painful.
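
The version-migration concern can be made concrete with a header check that a new binary might run before trusting the region. This is a sketch only: the constants `SHM_MAGIC` and `SHM_VERSION`, the `ShmError` type, and the function name are all hypothetical, and the layout assumed is just the first two `u64` fields of `SharedState` above.

```rust
use std::convert::TryInto;

/// Hypothetical constants; real values would be chosen by the implementation.
const SHM_MAGIC: u64 = 0x5A49_4E49_5453_484D;
const SHM_VERSION: u64 = 1;

#[derive(Debug, PartialEq)]
enum ShmError {
    TooShort,
    BadMagic(u64),
    VersionMismatch { found: u64, expected: u64 },
}

/// Check the first 16 bytes of the region: `magic`, then `version`, both
/// native-endian `u64`, matching the start of the `repr(C)` struct.
fn check_header(region: &[u8]) -> Result<(), ShmError> {
    let magic_bytes = region.get(0..8).ok_or(ShmError::TooShort)?;
    let version_bytes = region.get(8..16).ok_or(ShmError::TooShort)?;
    // The slices are exactly 8 bytes long, so the conversions cannot fail.
    let magic = u64::from_ne_bytes(magic_bytes.try_into().unwrap());
    let version = u64::from_ne_bytes(version_bytes.try_into().unwrap());

    if magic != SHM_MAGIC {
        return Err(ShmError::BadMagic(magic));
    }
    if version != SHM_VERSION {
        // A layout change between binaries lands here; the caller would
        // discard the region (or migrate it) rather than reinterpret it.
        return Err(ShmError::VersionMismatch { found: version, expected: SHM_VERSION });
    }
    Ok(())
}
```

Any change to `SharedState` would need a version bump, which is the main reason this approach makes upgrades awkward compared with a self-describing format such as JSON.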

### 4. How systemd does it

Combines approaches:
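1. Serializes manager and unit state to a memfd or temporary file and hands that FD to the new binary
2. Keeps the FDs it holds open across exec by clearing `CLOEXEC`; the serialized state refers to them by number. `LISTEN_FDS` (`SD_LISTEN_FDS_START` = fd 3) is the related convention it uses to pass sockets to activated services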
3. `exec()` into new binary
4. New binary restores from files, re-adopts processes by PID

Key insight: systemd doesn't preserve in-flight operations. Mid-restart services get re-evaluated from persisted state.
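
For comparison with `ZINIT_FDS`, the receiving side of the `LISTEN_FDS` convention is small: the process checks that `LISTEN_PID` names itself, then treats `LISTEN_FDS` as a count of descriptors starting at fd 3. The sketch below takes the variable values as parameters so it can be exercised without touching the process environment; the function name `inherited_fds` is hypothetical.

```rust
use std::ops::Range;

/// First inherited descriptor in the `LISTEN_FDS` convention (SD_LISTEN_FDS_START).
const LISTEN_FDS_START: i32 = 3;

/// Hypothetical helper: compute the range of inherited FDs from the values of
/// `LISTEN_PID` and `LISTEN_FDS`. Returns an empty range when either variable
/// is missing or malformed, or when they are addressed to another process.
fn inherited_fds(listen_pid: Option<&str>, listen_fds: Option<&str>, my_pid: u32) -> Range<i32> {
    let empty = LISTEN_FDS_START..LISTEN_FDS_START;
    let (Some(pid), Some(count)) = (listen_pid, listen_fds) else {
        return empty;
    };
    // LISTEN_PID guards against the variables leaking into an unrelated child.
    if pid.parse::<u32>().ok() != Some(my_pid) {
        return empty;
    }
    match count.parse::<i32>() {
        Ok(n) if n > 0 => LISTEN_FDS_START..LISTEN_FDS_START.saturating_add(n),
        _ => empty,
    }
}
```

A caller would pass `std::env::var("LISTEN_PID").ok().as_deref()`, the same for `LISTEN_FDS`, and `std::process::id()`. Unlike the named map in `ZINIT_FDS`, this convention identifies sockets only by position.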

## Implementation

### Phase 1: Server restart (no state preservation)

Simple version for development:
- pid1 receives SIGUSR1
- pid1 sends shutdown to server
- pid1 spawns new server
- Server reloads config, rediscovers running processes via `/proc`

### Phase 2: Server restart with state

- Add serialization before shutdown
- Add restoration on startup
- Verify PID validity after restore

### Phase 3: FD preservation

- Track which FDs to preserve (RPC socket, log pipes)
- Clear CLOEXEC, encode to env
- Restore and verify after restart
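
The "verify" step could be sketched as below: before trusting a restored FD map, check that each number really refers to an open descriptor. This version relies on Linux's `/proc/self/fd` and the standard library only; the helper names are hypothetical, and a fuller check would also confirm the descriptor type (for example with `fstat`).

```rust
use std::collections::HashMap;
use std::fs;
use std::os::unix::io::RawFd;

/// Hypothetical helper: true if `fd` is open in the current process.
/// `/proc/self/fd` contains one entry per open descriptor on Linux.
fn fd_is_open(fd: RawFd) -> bool {
    fd >= 0 && fs::symlink_metadata(format!("/proc/self/fd/{fd}")).is_ok()
}

/// Drop entries whose descriptor is not open and return their names, so the
/// caller can recreate those sockets or pipes instead of using a dead FD.
fn verify_restored_fds(map: &mut HashMap<String, RawFd>) -> Vec<String> {
    let stale: Vec<String> = map
        .iter()
        .filter(|(_, fd)| !fd_is_open(**fd))
        .map(|(name, _)| name.clone())
        .collect();
    for name in &stale {
        map.remove(name);
    }
    stale
}
```

Returning the stale names rather than failing outright lets the server degrade gracefully: a missing log pipe can be reopened, while a missing RPC socket means rebinding.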

### Phase 4: pid1 self-update

- SIGUSR2 triggers self-exec
- Pass server PID via argv
- New pid1 adopts server

## Detecting stale state

After restore, verify PIDs are still valid:

```rust
use nix::{sys::signal::kill, unistd::Pid};

fn process_exists(pid: i32) -> bool {
    // kill with signal 0 (`None`) checks existence without sending a signal.
    // Caveat: it cannot tell whether the PID was reused by an unrelated process.
    kill(Pid::from_raw(pid), None).is_ok()
}

fn validate_restored_state(state: &mut SupervisorState) {
    for (name, svc) in &mut state.services {
        if let Some(pid) = svc.pid {
            if !process_exists(pid) {
                eprintln!("Service {} pid {} gone, marking as failed", name, pid);
                svc.pid = None;
                svc.state = ServiceState::Failed;
            }
        }
    }
}
```
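
An existence check alone cannot detect PID reuse: if a service died during the restart and the kernel handed its PID to another process, `process_exists` still returns `true`. One common mitigation, sketched here as an assumption rather than part of the design, is to record each process's start time (field 22 of `/proc/<pid>/stat`) in the snapshot and compare it after restore. The helper names and the idea of a recorded start-time field are hypothetical.

```rust
use std::fs;

/// Extract `starttime` (field 22 of `/proc/<pid>/stat`, in clock ticks since boot).
/// The `comm` field may contain spaces and parentheses, so parsing starts
/// after the last ')'.
fn parse_start_time(stat: &str) -> Option<u64> {
    let after_comm = &stat[stat.rfind(')')? + 1..];
    // `after_comm` begins at field 3 (state), so field 22 is at index 19.
    after_comm.split_whitespace().nth(19)?.parse().ok()
}

/// Hypothetical helper: start time of a live process, `None` if it is gone.
fn process_start_time(pid: i32) -> Option<u64> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
    parse_start_time(&stat)
}

/// True only if `pid` exists and still has the start time recorded before
/// the restart, i.e. it is the same process and not a recycled PID.
fn is_same_process(pid: i32, recorded_start: u64) -> bool {
    process_start_time(pid) == Some(recorded_start)
}
```

With this in place, `validate_restored_state` could call `is_same_process` instead of `process_exists`, at the cost of one extra field per service in the snapshot.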

## Open questions

1. **Timeout for graceful server shutdown?** Suggest 30s, then SIGKILL.

2. **What if the new binary crashes immediately?** pid1 could keep the old binary as a fallback, but that adds complexity. Initial version: just restart the new binary with backoff.

3. **Config changes during update?** New server re-reads config. Services added/removed get started/stopped as normal.

4. **Socket activation FDs?** These are the most important to preserve. Clients connected to RPC socket shouldn't notice the restart.
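
For question 2, "restart with backoff" could be as simple as a capped exponential delay. The sketch below is illustrative only: the bounds `INITIAL_DELAY` and `MAX_DELAY` are placeholder values, since this document does not specify them, and the function name is hypothetical.

```rust
use std::time::Duration;

/// Placeholder bounds; the document does not specify actual values.
const INITIAL_DELAY: Duration = Duration::from_millis(500);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay before restart attempt number `attempt` (0-based): doubles on each
/// attempt and is capped at `MAX_DELAY`.
fn restart_delay(attempt: u32) -> Duration {
    // 2^attempt, saturating once the shift would overflow a u32.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    INITIAL_DELAY
        .checked_mul(factor)
        .unwrap_or(MAX_DELAY)
        .min(MAX_DELAY)
}
```

pid1 would reset the attempt counter once the new server has stayed up for some grace period, so that an occasional crash much later does not inherit a long delay.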

## References

- systemd daemon-reexec: https://www.freedesktop.org/software/systemd/man/systemd.html
- memfd_create(2): https://man7.org/linux/man-pages/man2/memfd_create.2.html
- execve(2) and file descriptors: https://man7.org/linux/man-pages/man2/execve.2.html