zinit 0.3.6

Process supervisor with dependency management
# zinit-server - State Machine

Service state machine for zinit-server. Uses types from `zinit-common`.

## State Diagram

```
                              +----------------------------------+
                              |                                  |
                              v                                  |
    +----------+    +---------+    +----------+    +---------+  |
    | Inactive |───>| Blocked |───>| Starting |───>| Running |──+
    +----------+    +---------+    +----------+    +---------+  |
         ^               |              |              |         |
         |               |              v              v         |
         |               |         +---------+   +----------+   |
         |               +────────>| Failed  |<──| Stopping |   |
         |                         +---------+   +----------+   |
         |                              |              |         |
         |                              v              v         |
         |                         +-------------------------+  |
         +─────────────────────────|        Exited          |<─+
                                   +-------------------------+
```

## States (Server-Side)

The server uses richer state types than `zinit-common`: variants from `Starting` onward also record a timestamp, and `Blocked` tracks dependencies by `ServiceId`:

```rust
use std::time::Instant;
use petgraph::graph::NodeIndex;

pub type ServiceId = NodeIndex;

#[derive(Debug, Clone, PartialEq)]
pub enum ServiceState {
    /// Configured but never started
    Inactive,

    /// Waiting on dependencies
    Blocked {
        waiting_on: Vec<ServiceId>,
        conflicts_with: Vec<ServiceId>,
    },

    /// Process spawned, waiting for ready signal
    Starting {
        pid: u32,
        started_at: Instant,
    },

    /// Process running and healthy
    Running {
        pid: u32,
        ready_at: Instant,
    },

    /// SIGTERM sent, waiting for exit
    Stopping {
        pid: u32,
        signal_sent_at: Instant,
    },

    /// Process exited (clean or via stop request)
    Exited {
        exit_code: Option<i32>,
        exited_at: Instant,
    },

    /// Process failed
    Failed {
        reason: FailureReason,
        failed_at: Instant,
    },
}

#[derive(Debug, Clone, PartialEq)]
pub enum FailureReason {
    ExitCode(i32),
    Signal(i32),
    StartTimeout,
    StopTimeout,
    HealthCheckFailed { attempts: u32 },
    DependencyFailed { service: String },
    SpawnError { message: String },
}
```

## State Queries

```rust
impl ServiceState {
    pub fn name(&self) -> &'static str {
        match self {
            Self::Inactive => "inactive",
            Self::Blocked { .. } => "blocked",
            Self::Starting { .. } => "starting",
            Self::Running { .. } => "running",
            Self::Stopping { .. } => "stopping",
            Self::Exited { .. } => "exited",
            Self::Failed { .. } => "failed",
        }
    }

    pub fn symbol(&self) -> &'static str {
        match self {
            Self::Inactive => "[-]",
            Self::Blocked { .. } => "[?]",
            Self::Starting { .. } => "[>]",
            Self::Running { .. } => "[+]",
            Self::Stopping { .. } => "[!]",
            Self::Exited { .. } => "[.]",
            Self::Failed { .. } => "[X]",
        }
    }

    /// Can this state satisfy a "requires" dependency?
    pub fn is_satisfied(&self) -> bool {
        matches!(self, Self::Running { .. })
    }

    /// Is a process currently running?
    pub fn is_active(&self) -> bool {
        matches!(self, Self::Starting { .. } | Self::Running { .. } | Self::Stopping { .. })
    }

    /// Can we attempt to start from this state?
    pub fn can_attempt_start(&self) -> bool {
        matches!(self, Self::Inactive | Self::Exited { .. } | Self::Failed { .. })
    }

    /// Get the PID if a process exists (Starting, Running, or Stopping)
    pub fn pid(&self) -> Option<u32> {
        match self {
            Self::Starting { pid, .. } => Some(*pid),
            Self::Running { pid, .. } => Some(*pid),
            Self::Stopping { pid, .. } => Some(*pid),
            _ => None,
        }
    }
}
```
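
The queries above split the states into three groups. A minimal sketch, using a trimmed stand-in enum (`State`, with payload fields dropped) and a hypothetical helper `neither_active_nor_startable`, makes the split explicit: no state is both active and startable, and `Blocked` is the only state that is neither.

```rust
// Trimmed stand-in for the server-side `ServiceState`; payload fields are
// omitted because these queries only look at the variant.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Inactive,
    Blocked,
    Starting,
    Running,
    Stopping,
    Exited,
    Failed,
}

impl State {
    const ALL: [State; 7] = [
        State::Inactive,
        State::Blocked,
        State::Starting,
        State::Running,
        State::Stopping,
        State::Exited,
        State::Failed,
    ];

    // Same variant sets as `is_active` in the block above.
    fn is_active(self) -> bool {
        matches!(self, State::Starting | State::Running | State::Stopping)
    }

    // Same variant sets as `can_attempt_start` in the block above.
    fn can_attempt_start(self) -> bool {
        matches!(self, State::Inactive | State::Exited | State::Failed)
    }
}

/// Hypothetical helper: states that are neither active nor startable.
fn neither_active_nor_startable() -> Vec<State> {
    State::ALL
        .iter()
        .copied()
        .filter(|s| !s.is_active() && !s.can_attempt_start())
        .collect()
}
```

A start request for a `Blocked` service is therefore neither a fresh start nor a no-op on a live process; the transition table has no `Blocked` + `StartRequested` row, so how that case is handled is left open here.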

## Conversion to zinit-common Types

```rust
impl From<&ServiceState> for zinit_common::ServiceState {
    fn from(state: &ServiceState) -> Self {
        match state {
            ServiceState::Inactive => Self::Inactive,
            ServiceState::Blocked { .. } => Self::Blocked {
                // IDs cannot be mapped to names here; the caller fills this in.
                waiting_on: vec![],
            },
            ServiceState::Starting { pid, .. } => Self::Starting { pid: *pid },
            ServiceState::Running { pid, .. } => Self::Running { pid: *pid },
            ServiceState::Stopping { pid, .. } => Self::Stopping { pid: *pid },
            ServiceState::Exited { exit_code, .. } => Self::Exited { exit_code: *exit_code },
            ServiceState::Failed { reason, .. } => Self::Failed {
                reason: reason.into(),
            },
        }
    }
}
```
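
The conversion leaves `waiting_on` empty because a `ServiceId` carries no name. One way the caller might fill it in is sketched below. The helper `resolve_waiting_on`, the `usize` stand-in for `ServiceId`, and the `<unknown:N>` fallback are all assumptions for illustration, not part of zinit.

```rust
use std::collections::HashMap;

// Stand-in for petgraph's `NodeIndex` so the sketch needs no external crate.
type ServiceId = usize;

/// Hypothetical helper: map the IDs a blocked service waits on to names.
/// IDs missing from `names` are rendered as "<unknown:N>" rather than dropped.
fn resolve_waiting_on(
    waiting_on: &[ServiceId],
    names: &HashMap<ServiceId, String>,
) -> Vec<String> {
    waiting_on
        .iter()
        .map(|id| {
            names
                .get(id)
                .cloned()
                .unwrap_or_else(|| format!("<unknown:{id}>"))
        })
        .collect()
}
```

The caller would run this on the server-side `waiting_on` list and write the result into the converted `Blocked` value before sending it to a client.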

## Targets (Virtual Services)

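Targets have no process; they serve as dependency anchors. The same `Service` struct represents both targets and regular services, and also tracks restart backoff: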

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone)]
pub struct Service {
    pub name: String,
    pub config: ServiceConfig,
    pub state: ServiceState,
    pub is_target: bool,
    
    // Restart tracking (exponential backoff)
    pub restart_count: u32,
    pub current_restart_delay_ms: u64,
}

impl Service {
    pub fn new(name: String, config: ServiceConfig, is_target: bool) -> Self {
        let initial_delay = config.lifecycle.restart_delay_ms;
        Self {
            name,
            config,
            state: ServiceState::Inactive,
            is_target,
            restart_count: 0,
            current_restart_delay_ms: initial_delay,
        }
    }

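    /// Targets transition directly to Running when deps satisfied.
    /// `id` is this service's node in the graph; `Service` stores no ID of
    /// its own, so the caller passes it in (an assumption of this sketch).
    pub fn target_check_ready(&mut self, id: ServiceId, graph: &ServiceGraph) {
        if !self.is_target {
            return;
        }

        if graph.all_requires_satisfied(id) {
            self.state = ServiceState::Running {
                pid: 0, // placeholder: a target has no process, never signal this PID
                ready_at: Instant::now(),
            };
        }
    }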
    
    /// Get next restart delay with exponential backoff.
    /// Returns None if max_restarts exceeded or policy says no restart.
    pub fn next_restart_delay(&mut self) -> Option<Duration> {
        if !self.should_restart() {
            return None;
        }
        
        let delay = self.current_restart_delay_ms;
        
        // Exponential backoff: double for next time, capped at max.
        // saturating_mul avoids overflow on very large configured delays.
        self.current_restart_delay_ms = self.current_restart_delay_ms
            .saturating_mul(2)
            .min(self.config.lifecycle.restart_delay_max_ms);
        
        self.restart_count += 1;
        
        Some(Duration::from_millis(delay))
    }
    
    /// Reset backoff when service becomes healthy (reaches Running state)
    pub fn reset_backoff(&mut self) {
        self.restart_count = 0;
        self.current_restart_delay_ms = self.config.lifecycle.restart_delay_ms;
    }
}
```
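
`ServiceGraph::all_requires_satisfied` is not shown in this document. A minimal sketch of what such a check could look like follows, assuming a plain map from ID to state in place of the real graph. The `usize` IDs and the trimmed `State` enum are stand-ins.

```rust
use std::collections::HashMap;

// Trimmed stand-in for `ServiceState`; only the variant matters here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Inactive,
    Running,
    Failed,
}

/// Hypothetical stand-in for `ServiceGraph::all_requires_satisfied`:
/// true when every required service is known and Running, which matches
/// `is_satisfied()` from the State Queries section.
fn all_requires_satisfied(requires: &[usize], states: &HashMap<usize, State>) -> bool {
    requires
        .iter()
        .all(|id| states.get(id) == Some(&State::Running))
}
```

Under this definition a target with no `requires` entries is ready immediately, since `all` over an empty list is true. Whether zinit treats empty targets that way is not stated in the source.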

## State Transitions

| From | Event | To | Condition |
|------|-------|----|-----------| 
| Inactive | StartRequested | Blocked | Dependencies not satisfied |
| Inactive | StartRequested | Starting | Dependencies satisfied |
| Blocked | DependencySatisfied | Starting | All deps now satisfied |
| Blocked | DependencyFailed | Failed | Hard dep failed |
| Starting | ProcessSpawned | Starting | Update with PID |
| Starting | HealthCheckPassed | Running | Ready |
| Starting | ProcessExited(0) | Exited | Oneshot success |
| Starting | ProcessExited(n) | Failed | Non-zero exit |
| Starting | Timeout | Failed | Start timeout |
| Running | StopRequested | Stopping | SIGTERM sent |
| Running | ProcessExited(0) | Exited | Clean exit |
| Running | ProcessExited(n) | Failed | Crash |
| Running | HealthCheckFailed | Failed | After retries |
| Stopping | ProcessExited | Exited | Any exit |
| Stopping | Timeout | Exited | SIGKILL sent, force exit |
| Exited | StartRequested | Starting/Blocked | Restart |
| Failed | StartRequested | Starting/Blocked | Manual restart |
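
The table can be read as a pure function from (state, event) to next state. The sketch below encodes each row with trimmed stand-in enums. The `deps_satisfied` flag on `StartRequested` and the `Option` return for unlisted pairs are assumptions made to keep the example self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Inactive,
    Blocked,
    Starting,
    Running,
    Stopping,
    Exited,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    // `deps_satisfied` stands in for a graph lookup (assumption).
    StartRequested { deps_satisfied: bool },
    StopRequested,
    ProcessSpawned,
    ProcessExited { code: i32 },
    HealthCheckPassed,
    HealthCheckFailed, // delivered only after retries are exhausted
    DependencySatisfied, // delivered once all deps are satisfied
    DependencyFailed,
    Timeout,
}

/// Next state for a (state, event) pair, or None if the table has no row.
fn transition(from: State, event: Event) -> Option<State> {
    use Event::*;
    use State::*;
    Some(match (from, event) {
        (Inactive | Exited | Failed, StartRequested { deps_satisfied: true }) => Starting,
        (Inactive | Exited | Failed, StartRequested { deps_satisfied: false }) => Blocked,
        (Blocked, DependencySatisfied) => Starting,
        (Blocked, DependencyFailed) => Failed,
        (Starting, ProcessSpawned) => Starting,
        (Starting, HealthCheckPassed) => Running,
        (Starting | Running, ProcessExited { code: 0 }) => Exited,
        (Starting | Running, ProcessExited { .. }) => Failed,
        (Starting, Timeout) => Failed,
        (Running, StopRequested) => Stopping,
        (Running, HealthCheckFailed) => Failed,
        // While stopping, any exit (and a stop timeout) counts as Exited.
        (Stopping, ProcessExited { .. }) => Exited,
        (Stopping, Timeout) => Exited,
        _ => return None,
    })
}
```

Returning `None` for unlisted pairs lets a caller log and ignore stray events, for example a `StopRequested` that arrives for an `Inactive` service.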

## Events

```rust
#[derive(Debug)]
pub enum ServiceEvent {
    StartRequested,
    StopRequested,
    ProcessSpawned { pid: u32 },
    ProcessExited { exit_code: Option<i32>, signal: Option<i32> },
    HealthCheckPassed,
    HealthCheckFailed { attempt: u32 },
    DependencySatisfied { service: ServiceId },
    DependencyFailed { service: ServiceId },
    Timeout { kind: TimeoutKind },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TimeoutKind {
    Start,
    Stop,
    HealthCheck,
    RestartDelay,
}
```
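
`ProcessExited` carries both an exit code and a signal because a signal-killed process has no exit code; on Unix, `std::process::ExitStatus::code()` returns `None` in that case. A rough sketch of how the two fields might map onto `Exited` or `Failed` is shown below. The `Outcome` type, the `stop_requested` flag, and the handling of the (`None`, `None`) case are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum FailureReason {
    ExitCode(i32),
    Signal(i32),
}

// Hypothetical result type standing in for the two terminal states.
#[derive(Debug, PartialEq)]
enum Outcome {
    Exited { exit_code: Option<i32> },
    Failed(FailureReason),
}

/// Classify the payload of a `ProcessExited` event.
/// `stop_requested` is true when the service was in `Stopping`.
fn classify_exit(exit_code: Option<i32>, signal: Option<i32>, stop_requested: bool) -> Outcome {
    // Per the transition table, any exit while stopping counts as Exited.
    if stop_requested {
        return Outcome::Exited { exit_code };
    }
    match (exit_code, signal) {
        (_, Some(sig)) => Outcome::Failed(FailureReason::Signal(sig)),
        (Some(0), None) => Outcome::Exited { exit_code: Some(0) },
        (Some(code), None) => Outcome::Failed(FailureReason::ExitCode(code)),
        // Neither code nor signal: treated as a plain exit here (assumption).
        (None, None) => Outcome::Exited { exit_code: None },
    }
}
```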

## Restart Policy

```rust
impl Service {
    fn should_restart(&self) -> bool {
        // Check policy first. Arm order matters: a clean exit (code 0) under
        // OnFailure must be ruled out before the general Exited arm below.
        let policy_allows = match (&self.state, self.config.lifecycle.restart) {
            (ServiceState::Exited { exit_code: Some(0), .. }, RestartPolicy::OnFailure) => false,
            (_, RestartPolicy::Never) => false,
            (ServiceState::Exited { .. }, RestartPolicy::Always) => true,
            (ServiceState::Exited { .. }, RestartPolicy::OnFailure) => true,
            (ServiceState::Failed { .. }, RestartPolicy::Always) => true,
            (ServiceState::Failed { .. }, RestartPolicy::OnFailure) => true,
            _ => false,
        };

        if !policy_allows {
            return false;
        }
        
        // Check max restarts (0 = unlimited)
        let max = self.config.lifecycle.max_restarts;
        if max > 0 && self.restart_count >= max {
            return false;
        }
        
        true
    }
}
```

**Exponential backoff behavior:**

With the defaults `restart_delay_ms: 1000`, `restart_delay_max_ms: 300000`, and `max_restarts: 10`:

```
Crash #1  → wait 1s
Crash #2  → wait 2s
Crash #3  → wait 4s
Crash #4  → wait 8s
Crash #5  → wait 16s
Crash #6  → wait 32s
Crash #7  → wait 64s
Crash #8  → wait 128s (~2 min)
Crash #9  → wait 256s (~4 min)
Crash #10 → wait 300s (capped at 5 min)
Crash #11 → give up, stay Failed
```

Total delay before giving up: 811 s (about 13.5 minutes), not counting the time each attempt runs before it fails.
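
The schedule can be reproduced with a few lines that mirror the arithmetic in `next_restart_delay()`. The free function `backoff_schedule` is a hypothetical helper written for this check, not part of zinit.

```rust
/// Delays in milliseconds produced by starting at `initial_ms`, doubling
/// after each attempt, and capping at `max_ms`, for `max_restarts` attempts.
fn backoff_schedule(initial_ms: u64, max_ms: u64, max_restarts: u32) -> Vec<u64> {
    let mut delay = initial_ms;
    let mut schedule = Vec::new();
    for _ in 0..max_restarts {
        // The current delay is used first, then doubled for the next round.
        schedule.push(delay);
        delay = delay.saturating_mul(2).min(max_ms);
    }
    schedule
}
```

With the defaults, `backoff_schedule(1000, 300_000, 10)` yields ten delays ending in 256000 and 300000, and they sum to 811000 ms.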

**Reset behavior:**
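- When a service reaches the `Running` state, `reset_backoff()` is called
- The restart count goes to 0 and the delay resets to its initial value
- The service gets a fresh set of retry attempts, so the delays only keep growing across consecutive failures that never reach `Running`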

**State transitions with restart:**

| Condition | Result |
|-----------|--------|
| Exit, should_restart=true | Schedule restart after `next_restart_delay()` |
| Exit, count >= max_restarts | Stay in Failed, log "max restarts exceeded" |
| Exit, should_restart=false | Stay in Exited/Failed |
| Reaches Running | Call `reset_backoff()` |