# ADR-006: Binary Upgrade Flow for zinit-pid1
## Status
Draft
## Context
### Current State
- `SIGUSR1` to zinit-pid1 triggers `soft_restart_server()`:
  1. Calls the `prepare_restart` RPC to save state
  2. Waits for zinit-server to exit
  3. Spawns a new server (same binary path)
- A `SIGUSR2` handler exists as a placeholder but is not implemented
### Reference: zinit.vibed approach
In `zinit.vibed`, the updater IS pid1 (renamed). It:
- Reads config with update URL
- Periodically checks Forgejo API for new commits
- Downloads via `curl | bash` install script
- Sends SIGUSR1, waits, respawns
### Key Insight: `mv` works on running binaries
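"Text file busy" (`ETXTBSY`) is returned when a running executable is opened for writing with `open()`. The `rename()` syscall is not affected: it swaps the directory entry atomically, and the running process keeps using the old inode until it exits. This only holds when source and destination are on the same filesystem; across filesystems `rename()` fails with `EXDEV`.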
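
The behaviour described above can be observed with the standard library alone. In this minimal sketch an open file handle stands in for the running process (an assumption made for the demo; a real process holds its inode through the mapped executable rather than a read handle). The function name `swap_while_open` and the file names are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Replace a file via rename() while a handle to the old file is open.
/// Returns (contents seen through the old handle, contents at the path).
fn swap_while_open(dir: &Path) -> io::Result<(String, String)> {
    let installed = dir.join("demo-binary");
    let staged = dir.join("demo-binary.new");
    fs::write(&installed, "old version")?;
    fs::write(&staged, "new version")?;

    // Stand-in for the running process: it keeps the old inode open.
    let mut running = File::open(&installed)?;

    // Atomic replacement; both paths are in the same directory, so they
    // are guaranteed to share a filesystem.
    fs::rename(&staged, &installed)?;

    let mut via_old_handle = String::new();
    running.read_to_string(&mut via_old_handle)?;
    let via_path = fs::read_to_string(&installed)?;
    Ok((via_old_handle, via_path))
}
```

The old handle still reads the old contents, while anything that opens the path after the rename gets the new file. This is the property the upgrade flow relies on.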
## Decision
### Enhance zinit-pid1 with update capabilities
pid1 gains:
1. Config file reading (`/etc/zinit/update.toml`)
2. Periodic update checking (configurable interval)
3. Binary download from URL
4. Swap + restart on SIGUSR1
### Config File: `/etc/zinit/update.toml`
```toml
# Update configuration
[update]
# URL to check for new binary (direct download link)
url = "https://releases.example.com/zinit/zinit-server"
# Optional: URL for checksum file
checksum_url = "https://releases.example.com/zinit/zinit-server.sha256"
# Check interval in seconds (0 = disable periodic checks)
interval = 300
# Staging directory
staging_dir = "/var/cache/system"
```
### Upgrade Flow
**On SIGUSR1:**
1. Read config file
2. If `url` configured: download new binary to staging
3. Verify checksum (if `checksum_url` configured)
4. Backup: `mv /usr/bin/zinit-server /usr/bin/zinit-server.old`
5. Install: `mv <staging_dir>/zinit-server /usr/bin/zinit-server`
6. `prepare_restart` RPC to save state
7. Wait for server exit
8. Spawn new server
**On failure at any step:**
- Restore `.old` backup if it exists
- Log error
- Restart existing binary (don't leave system without supervisor)
**Crash detection:**
- Track crashes since last upgrade
- If new binary crashes > 3 times in 60 seconds → rollback to `.old`
### Periodic Checks (optional)
If `interval > 0`:
- Every N seconds, fetch checksum URL
- Compare with installed binary checksum
- If different → trigger upgrade (same as SIGUSR1)
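
The comparison step can be sketched without any HTTP or hashing code by treating both digests as strings. The sketch below assumes the checksum file uses the `sha256sum` layout (`<hex>  <filename>`) or a bare hex digest. The names `parse_sha256` and `update_available` are hypothetical, and treating a malformed checksum body as "no update" is a design assumption, not something this ADR has decided.

```rust
/// Extract the hex digest from a checksum file body.
/// Accepts "<hex>  <filename>" (sha256sum output) or a bare "<hex>".
fn parse_sha256(body: &str) -> Option<String> {
    let token = body.split_whitespace().next()?;
    // A SHA-256 digest is exactly 64 hex characters.
    let is_sha256 = token.len() == 64 && token.chars().all(|c| c.is_ascii_hexdigit());
    if is_sha256 {
        Some(token.to_ascii_lowercase())
    } else {
        None
    }
}

/// Decide whether an upgrade should be triggered.
/// Returns false when the remote body is malformed, so an error page or a
/// truncated response never triggers a download (assumed policy).
fn update_available(remote_body: &str, installed_hex: &str) -> bool {
    match parse_sha256(remote_body) {
        Some(remote) => !remote.eq_ignore_ascii_case(installed_hex.trim()),
        None => false,
    }
}
```

Validating the digest shape before comparing means an HTML error page served at `checksum_url` is not mistaken for a new release.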
### pid1 self-update: Reboot
Updating pid1 itself requires a reboot; replacing PID 1 on a live system is too risky.
## Implementation
### File: `zinit-pid1/Cargo.toml`
Add dependencies:
```toml
[dependencies]
ureq = { version = "2.12", features = ["tls"] } # HTTP client
sha2 = "0.10" # Checksum
toml = "0.8" # Config parsing
serde = { version = "1.0", features = ["derive"] }
```
### File: `zinit-pid1/src/main.rs`
#### 1. Add config types
```rust
use serde::Deserialize;

const UPDATE_CONFIG_PATH: &str = "/etc/zinit/update.toml";
const SERVER_INSTALL_PATH: &str = "/usr/bin/zinit-server";

#[derive(Debug, Default, Deserialize)]
struct UpdateConfig {
    #[serde(default)]
    update: UpdateSettings,
}

#[derive(Debug, Deserialize)]
struct UpdateSettings {
    /// URL to download zinit-server binary
    url: Option<String>,
    /// URL to download checksum file
    checksum_url: Option<String>,
    /// Check interval in seconds (0 = disable)
    #[serde(default = "default_interval")]
    interval: u64,
    /// Staging directory
    #[serde(default = "default_staging_dir")]
    staging_dir: String,
}

fn default_interval() -> u64 { 0 } // Disabled by default
fn default_staging_dir() -> String { "/var/cache/system".to_string() }

impl Default for UpdateSettings {
    fn default() -> Self {
        Self {
            url: None,
            checksum_url: None,
            interval: default_interval(),
            staging_dir: default_staging_dir(),
        }
    }
}
```
#### 2. Add crash tracking for rollback
```rust
use std::time::Instant;

struct CrashTracker {
    /// Time of last upgrade
    last_upgrade: Option<Instant>,
    /// Crash count since upgrade
    crashes_since_upgrade: u32,
}

impl CrashTracker {
    fn new() -> Self {
        Self { last_upgrade: None, crashes_since_upgrade: 0 }
    }

    fn record_upgrade(&mut self) {
        self.last_upgrade = Some(Instant::now());
        self.crashes_since_upgrade = 0;
    }

    /// Record a crash; returns true if a rollback should be attempted.
    fn record_crash(&mut self) -> bool {
        self.crashes_since_upgrade += 1;
        // If >3 crashes within 60s of upgrade, trigger rollback
        if let Some(t) = self.last_upgrade {
            if t.elapsed().as_secs() < 60 && self.crashes_since_upgrade > 3 {
                return true; // Should rollback
            }
        }
        false
    }
}
```
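
Because the tracker above reads the clock internally, the 60-second threshold is awkward to unit test. One possible variant, sketched here under the assumption that callers can pass the current time in, makes the rule testable without sleeping. `CrashWindow`, `ROLLBACK_WINDOW` and `MAX_CRASHES` are hypothetical names; the rollback decision is intended to match `CrashTracker`.

```rust
use std::time::{Duration, Instant};

const ROLLBACK_WINDOW: Duration = Duration::from_secs(60);
const MAX_CRASHES: u32 = 3;

/// Hypothetical variant of CrashTracker that takes `now` as a parameter.
struct CrashWindow {
    last_upgrade: Option<Instant>,
    crashes: u32,
}

impl CrashWindow {
    fn new() -> Self {
        Self { last_upgrade: None, crashes: 0 }
    }

    fn record_upgrade(&mut self, now: Instant) {
        self.last_upgrade = Some(now);
        self.crashes = 0;
    }

    /// Returns true when a rollback should be attempted: more than
    /// MAX_CRASHES crashes within ROLLBACK_WINDOW of the last upgrade.
    fn record_crash(&mut self, now: Instant) -> bool {
        let Some(upgraded_at) = self.last_upgrade else {
            return false; // never upgraded, nothing to roll back to
        };
        if now.duration_since(upgraded_at) >= ROLLBACK_WINDOW {
            return false; // crash is too late to blame the upgrade
        }
        self.crashes += 1;
        self.crashes > MAX_CRASHES
    }
}
```

With this shape, the fourth crash inside the window returns `true`, while a crash 61 seconds after the upgrade does not.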
#### 3. Add download and upgrade functions
```rust
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Load the update config; a missing or invalid file disables updates.
fn load_update_config() -> UpdateConfig {
    let text = match std::fs::read_to_string(UPDATE_CONFIG_PATH) {
        Ok(text) => text,
        Err(_) => return UpdateConfig::default(),
    };
    toml::from_str(&text).unwrap_or_else(|e| {
        tracing::warn!(error = %e, "invalid update config, updates disabled");
        UpdateConfig::default()
    })
}

/// Download `url` to `dest` and mark the file executable.
fn download_file(url: &str, dest: &Path) -> Result<(), String> {
    tracing::info!(url = url, dest = %dest.display(), "downloading");
    let response = ureq::get(url)
        .call()
        .map_err(|e| format!("HTTP request failed: {}", e))?;
    let mut reader = response.into_reader();
    let mut file = std::fs::File::create(dest)
        .map_err(|e| format!("failed to create file: {}", e))?;
    std::io::copy(&mut reader, &mut file)
        .map_err(|e| format!("failed to write file: {}", e))?;
    // Flush to disk before the file is renamed into place
    file.sync_all().map_err(|e| format!("failed to sync file: {}", e))?;
    // Make executable
    let mut perms = file.metadata().map_err(|e| e.to_string())?.permissions();
    perms.set_mode(0o755);
    std::fs::set_permissions(dest, perms).map_err(|e| e.to_string())?;
    Ok(())
}

/// Compare the SHA-256 of `binary` against the contents of a checksum file.
fn verify_checksum(binary: &Path, expected: &str) -> Result<(), String> {
    use sha2::{Digest, Sha256};
    use std::io::Read;

    let mut file = std::fs::File::open(binary)
        .map_err(|e| format!("failed to open binary: {}", e))?;
    let mut hasher = Sha256::new();
    let mut buffer = [0u8; 8192];
    loop {
        let n = file.read(&mut buffer).map_err(|e| e.to_string())?;
        if n == 0 { break; }
        hasher.update(&buffer[..n]);
    }
    let actual = format!("{:x}", hasher.finalize());
    // Expected format: "abc123...  filename" or just "abc123..."
    let expected = expected.split_whitespace().next().unwrap_or("");
    // Hex digests may be published in either case
    if !actual.eq_ignore_ascii_case(expected) {
        return Err(format!("checksum mismatch: expected {}, got {}", expected, actual));
    }
    Ok(())
}

/// Download, verify, and install new binary. Returns true if upgraded.
fn try_upgrade(config: &UpdateSettings) -> Result<bool, String> {
    let url = match &config.url {
        Some(u) => u,
        None => return Ok(false), // No URL configured
    };
    let staging_dir = Path::new(&config.staging_dir);
    let staged_binary = staging_dir.join("zinit-server");
    let installed = Path::new(SERVER_INSTALL_PATH);
    let backup = installed.with_extension("old");

    // Download new binary
    download_file(url, &staged_binary)?;

    // Verify checksum if configured
    if let Some(checksum_url) = &config.checksum_url {
        let checksum_response = ureq::get(checksum_url)
            .call()
            .map_err(|e| format!("failed to fetch checksum: {}", e))?;
        let expected = checksum_response.into_string()
            .map_err(|e| format!("failed to read checksum: {}", e))?;
        if let Err(e) = verify_checksum(&staged_binary, &expected) {
            // Don't leave an unverified binary in staging
            let _ = std::fs::remove_file(&staged_binary);
            return Err(e);
        }
        tracing::info!("checksum verified");
    }

    // Backup current binary
    if installed.exists() {
        std::fs::rename(installed, &backup)
            .map_err(|e| format!("failed to backup: {}", e))?;
        tracing::info!(backup = %backup.display(), "backed up current binary");
    }

    // Install new binary (rename fails if staging is on another filesystem)
    match std::fs::rename(&staged_binary, installed) {
        Ok(()) => {
            tracing::info!("installed new zinit-server binary");
            Ok(true)
        }
        Err(e) => {
            // Restore backup
            if backup.exists() {
                let _ = std::fs::rename(&backup, installed);
            }
            Err(format!("failed to install: {}", e))
        }
    }
}

/// Rollback to .old binary
fn rollback_binary() -> bool {
    let installed = Path::new(SERVER_INSTALL_PATH);
    let backup = installed.with_extension("old");
    if backup.exists() {
        tracing::warn!("rolling back to previous binary");
        if std::fs::rename(&backup, installed).is_ok() {
            return true;
        }
    }
    false
}
```
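
One way to keep a failed or interrupted download from ever appearing at the staged path is to write to a temporary sibling file and rename it into place only after the data has been flushed. The sketch below uses only the standard library and takes any `Read` source, so it is independent of the HTTP client. The names `write_atomically` and `copy_to_temp_then_rename` and the `.partial` suffix are assumptions made for illustration.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Write everything from `reader` to `tmp`, flush it, then publish it at `dest`.
fn copy_to_temp_then_rename(
    reader: &mut impl Read,
    tmp: &Path,
    dest: &Path,
) -> io::Result<u64> {
    let mut file = File::create(tmp)?;
    let written = io::copy(reader, &mut file)?;
    file.sync_all()?; // data must be on disk before the rename publishes it
    fs::rename(tmp, dest)?;
    Ok(written)
}

/// Copy `reader` into `dest` without exposing a partial file at `dest`.
/// Returns the number of bytes written.
fn write_atomically(mut reader: impl Read, dest: &Path) -> io::Result<u64> {
    // Hypothetical naming scheme: "<dest>.partial" in the same directory,
    // so the final rename never crosses a filesystem boundary.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".partial");
    let tmp = PathBuf::from(tmp_name);

    let result = copy_to_temp_then_rename(&mut reader, &tmp, dest);
    if result.is_err() {
        let _ = fs::remove_file(&tmp); // best-effort cleanup
    }
    result
}
```

If the copy fails partway (for example because the disk is full), `dest` is left untouched and the temporary file is removed.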
#### 4. Modify soft_restart_server
```rust
fn soft_restart_server(
    server_pid: &mut Option<Pid>,
    pid1_mode: bool,
    crash_tracker: &mut CrashTracker,
) {
    if let Some(pid) = *server_pid {
        tracing::info!(pid = pid.as_raw(), "Initiating soft restart of zinit-server");

        // Load config and try upgrade
        let config = load_update_config();
        match try_upgrade(&config.update) {
            Ok(true) => {
                tracing::info!("new binary installed, proceeding with restart");
                crash_tracker.record_upgrade();
            }
            Ok(false) => {
                tracing::debug!("no upgrade available or configured");
            }
            Err(e) => {
                tracing::error!(error = %e, "upgrade failed, restarting existing binary");
            }
        }

        // Call prepare_restart RPC to save state
        // ... (existing code)
    }

    // Spawn new server
    *server_pid = spawn_server(pid1_mode);
}
```
#### 5. Modify main loop for crash detection
```rust
// In reap_zombies or main loop, when server dies:
if server_pid.is_none() {
    if crash_tracker.record_crash() {
        // Too many crashes after upgrade, rollback
        if rollback_binary() {
            tracing::warn!("rolled back after repeated crashes");
            crash_tracker.crashes_since_upgrade = 0;
        }
    }
    tracing::error!("zinit-server died, respawning in 1s...");
    thread::sleep(Duration::from_secs(1));
    server_pid = spawn_server(pid1_server_mode);
}
```
#### 6. Optional: Periodic update checks
```rust
// In main loop:
let config = load_update_config();
let mut last_update_check = Instant::now();

loop {
    // ... existing signal handling ...

    // Periodic update check
    if config.update.interval > 0
        && last_update_check.elapsed().as_secs() >= config.update.interval
    {
        // Check if update available (compare checksums)
        if let Some(ref url) = config.update.checksum_url {
            // ... fetch and compare, trigger SIGUSR1 if different
        }
        last_update_check = Instant::now();
    }
}
```
## Protected Services (restart_all)
Services that should NOT be affected by bulk restart operations:
| Service / criterion | Reason |
|---|---|
| `class = "system"` | Already protected by class system |
| `oneshot = true` | Already completed, no point restarting |
| udevd | Kernel device management |
| network/dhcpcd | Network connectivity |
| myfs/fuse mounts | Remote filesystem access |
| sysvol | Local storage |
**Already implemented**: `start_all`, `stop_all`, `delete_all` skip `class = "system"` services.
**If adding `restart_all`**: Same pattern - skip system class + skip oneshots.
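
A minimal sketch of that filter, assuming a simplified service record. `ServiceInfo`, its fields, and both function names are hypothetical stand-ins; the real zinit-server types are not shown in this ADR and will differ.

```rust
/// Hypothetical, simplified view of a service definition.
#[derive(Debug, Clone, PartialEq)]
struct ServiceInfo {
    name: String,
    /// Value of the `class` setting, if any.
    class: Option<String>,
    /// Value of the `oneshot` setting.
    oneshot: bool,
}

/// A service is eligible for bulk restart only if it is neither in the
/// system class nor a oneshot.
fn is_bulk_restartable(svc: &ServiceInfo) -> bool {
    let is_system = svc.class.as_deref() == Some("system");
    !is_system && !svc.oneshot
}

/// Names of the services a `restart_all` operation would touch.
fn restart_all_targets(services: &[ServiceInfo]) -> Vec<&str> {
    services
        .iter()
        .filter(|s| is_bulk_restartable(s))
        .map(|s| s.name.as_str())
        .collect()
}
```

Keeping the predicate in one function would let `start_all`, `stop_all`, `delete_all` and a future `restart_all` share the same protection rule, if the existing code is structured that way.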
## Files to Modify
| File | Change |
|---|---|
| `zinit-pid1/Cargo.toml` | Add `ureq`, `sha2`, `toml`, `serde` |
| `zinit-pid1/src/main.rs` | Add config, download, upgrade, rollback, crash tracking |
## What We Might Be Forgetting
1. **State file format changes** - New binary might not read old state file correctly
   - Mitigation: Version field in state file, backwards compat
2. **Filesystem not mounted** - `/var/cache/system` might not exist if sysvol hasn't run
   - Mitigation: Check mount before upgrade, fallback to `/tmp`
   - Caveat: `rename()` fails with `EXDEV` when the staging directory and `/usr/bin` are on different filesystems, so a fallback location may need copy-then-rename instead of a plain `mv`
3. **Network not up** - Can't download if network isn't ready
   - Mitigation: Retry logic, or only upgrade after network service is running
4. **Disk full** - Download fails partway through
   - Mitigation: Download to temp file, rename atomically
5. **Signature verification** - SHA256 only verifies integrity, not authenticity
   - Future: Add GPG/minisign signature verification
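
For the staging-directory concern, a cheap pre-check is to compare the device IDs of the staging and install directories before attempting the swap. This is a sketch under stated assumptions: `same_filesystem` and `pick_staging_dir` are hypothetical helpers, and matching device IDs are a necessary condition only, since `rename()` can still fail with `EXDEV` across bind mounts. The error path in the install step would still be needed.

```rust
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// True when both paths report the same device ID (st_dev).
/// Errors if either path does not exist or cannot be inspected.
fn same_filesystem(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(std::fs::metadata(a)?.dev() == std::fs::metadata(b)?.dev())
}

/// Pick the first candidate staging directory that exists and shares a
/// device ID with the install directory. Candidates that are missing or
/// unreadable are skipped.
fn pick_staging_dir<'a>(candidates: &[&'a Path], install_dir: &Path) -> Option<&'a Path> {
    candidates
        .iter()
        .copied()
        .find(|dir| same_filesystem(dir, install_dir).unwrap_or(false))
}
```

A caller could pass the configured `staging_dir` first and a fallback second, and refuse to upgrade when the result is `None`.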
## Consequences
### Positive
- Self-updating system from URL
- Automatic rollback on crash
- Config-driven, flexible
- No manual intervention needed
### Negative
- Adds HTTP client to pid1 (~1MB increase)
- More complexity in pid1
- Network dependency for updates
### Mitigations
- Updates are optional (disabled by default)
- All failures gracefully fall back to existing binary
- Crash detection prevents boot loops