# zinit-server - State Machine
Service state machine for zinit-server. Uses types from `zinit-common`.
## State Diagram
```
                     +--------------------------------------+
                     |                                      |
                     v                                      |
+----------+    +---------+    +----------+    +---------+  |
| Inactive |───>| Blocked |───>| Starting |───>| Running |──+
+----------+    +---------+    +----------+    +---------+  |
     ^               |              |               |       |
     |               |              v               v       |
     |               |         +---------+    +----------+  |
     |               +────────>| Failed  |<──| Stopping |   |
     |                         +---------+   +----------+   |
     |                              |               |       |
     |                              v               v       |
     |                         +-------------------------+  |
     +─────────────────────────|         Exited          |<─+
                               +-------------------------+
```
## States (Server-Side)
The server's state types are richer than those in `zinit-common`: each active or terminal state also carries a timestamp.
```rust
use std::time::Instant;

use petgraph::graph::NodeIndex;

pub type ServiceId = NodeIndex;

#[derive(Debug, Clone, PartialEq)]
pub enum ServiceState {
    /// Configured but never started
    Inactive,
    /// Waiting on dependencies
    Blocked {
        waiting_on: Vec<ServiceId>,
        conflicts_with: Vec<ServiceId>,
    },
    /// Process spawned, waiting for ready signal
    Starting {
        pid: u32,
        started_at: Instant,
    },
    /// Process running and healthy
    Running {
        pid: u32,
        ready_at: Instant,
    },
    /// SIGTERM sent, waiting for exit
    Stopping {
        pid: u32,
        signal_sent_at: Instant,
    },
    /// Process exited (clean or via stop request)
    Exited {
        exit_code: Option<i32>,
        exited_at: Instant,
    },
    /// Process failed
    Failed {
        reason: FailureReason,
        failed_at: Instant,
    },
}

#[derive(Debug, Clone, PartialEq)]
pub enum FailureReason {
    ExitCode(i32),
    Signal(i32),
    StartTimeout,
    StopTimeout,
    HealthCheckFailed { attempts: u32 },
    DependencyFailed { service: String },
    SpawnError { message: String },
}
```
## State Queries
```rust
impl ServiceState {
    pub fn name(&self) -> &'static str {
        match self {
            Self::Inactive => "inactive",
            Self::Blocked { .. } => "blocked",
            Self::Starting { .. } => "starting",
            Self::Running { .. } => "running",
            Self::Stopping { .. } => "stopping",
            Self::Exited { .. } => "exited",
            Self::Failed { .. } => "failed",
        }
    }

    pub fn symbol(&self) -> &'static str {
        match self {
            Self::Inactive => "[-]",
            Self::Blocked { .. } => "[?]",
            Self::Starting { .. } => "[>]",
            Self::Running { .. } => "[+]",
            Self::Stopping { .. } => "[!]",
            Self::Exited { .. } => "[.]",
            Self::Failed { .. } => "[X]",
        }
    }

    /// Can this state satisfy a "requires" dependency?
    pub fn is_satisfied(&self) -> bool {
        matches!(self, Self::Running { .. })
    }

    /// Is a process currently running?
    pub fn is_active(&self) -> bool {
        matches!(
            self,
            Self::Starting { .. } | Self::Running { .. } | Self::Stopping { .. }
        )
    }

    /// Can we attempt to start from this state?
    pub fn can_attempt_start(&self) -> bool {
        matches!(self, Self::Inactive | Self::Exited { .. } | Self::Failed { .. })
    }

    /// Get PID if process is running
    pub fn pid(&self) -> Option<u32> {
        match self {
            Self::Starting { pid, .. }
            | Self::Running { pid, .. }
            | Self::Stopping { pid, .. } => Some(*pid),
            _ => None,
        }
    }
}
```
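As a usage illustration, the query helpers could drive a one-line status display. The sketch below is self-contained, so it redeclares a trimmed-down `ServiceState` (three variants, timestamps omitted); `status_line` is a hypothetical helper and not part of zinit-server.

```rust
/// Trimmed-down stand-in for the server-side `ServiceState` (timestamps omitted).
#[derive(Debug, Clone, PartialEq)]
pub enum ServiceState {
    Inactive,
    Running { pid: u32 },
    Exited { exit_code: Option<i32> },
}

impl ServiceState {
    pub fn name(&self) -> &'static str {
        match self {
            Self::Inactive => "inactive",
            Self::Running { .. } => "running",
            Self::Exited { .. } => "exited",
        }
    }

    pub fn symbol(&self) -> &'static str {
        match self {
            Self::Inactive => "[-]",
            Self::Running { .. } => "[+]",
            Self::Exited { .. } => "[.]",
        }
    }

    pub fn pid(&self) -> Option<u32> {
        match self {
            Self::Running { pid } => Some(*pid),
            _ => None,
        }
    }
}

/// Hypothetical helper: render one status line for a service.
/// The PID is appended only when the state actually has one.
pub fn status_line(service: &str, state: &ServiceState) -> String {
    match state.pid() {
        Some(pid) => format!("{} {} {} (pid {})", state.symbol(), service, state.name(), pid),
        None => format!("{} {} {}", state.symbol(), service, state.name()),
    }
}
```

Keeping `symbol()` and `name()` on the state type means a display layer like this never has to match on variants itself.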
## Conversion to zinit-common Types
```rust
impl From<&ServiceState> for zinit_common::ServiceState {
    fn from(state: &ServiceState) -> Self {
        match state {
            ServiceState::Inactive => Self::Inactive,
            // The ids in `waiting_on` cannot be turned into names here (no graph
            // access), so the list is left empty and filled in by the caller.
            ServiceState::Blocked { .. } => Self::Blocked {
                waiting_on: vec![],
            },
            ServiceState::Starting { pid, .. } => Self::Starting { pid: *pid },
            ServiceState::Running { pid, .. } => Self::Running { pid: *pid },
            ServiceState::Stopping { pid, .. } => Self::Stopping { pid: *pid },
            ServiceState::Exited { exit_code, .. } => Self::Exited { exit_code: *exit_code },
            // Relies on a `From<&FailureReason>` impl for the zinit-common type
            ServiceState::Failed { reason, .. } => Self::Failed {
                reason: reason.into(),
            },
        }
    }
}
```
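The conversion leaves `waiting_on` empty because it has no access to the graph. One way the caller might fill it in is sketched below. This is an assumption-laden illustration: `resolve_waiting_on` and the id-to-name `HashMap` are hypothetical, and `ServiceId` is replaced by `usize` so the snippet compiles without petgraph.

```rust
use std::collections::HashMap;

/// Stand-in for `ServiceId` (the real type is petgraph's `NodeIndex`).
pub type ServiceId = usize;

/// Hypothetical helper: map the ids a service is blocked on to their names.
/// Ids missing from the table are skipped rather than causing a panic.
pub fn resolve_waiting_on(
    waiting_on: &[ServiceId],
    names: &HashMap<ServiceId, String>,
) -> Vec<String> {
    waiting_on
        .iter()
        .filter_map(|id| names.get(id).cloned())
        .collect()
}
```

The caller would run this after the `From` conversion and overwrite the empty `waiting_on` list in the `Blocked` variant with the result.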
## Targets (Virtual Services)
Targets have no process; they act purely as dependency anchors:
```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone)]
pub struct Service {
    /// Node id of this service in the dependency graph
    /// (needed by `target_check_ready`)
    pub id: ServiceId,
    pub name: String,
    pub config: ServiceConfig,
    pub state: ServiceState,
    pub is_target: bool,
    // Restart tracking (exponential backoff)
    pub restart_count: u32,
    pub current_restart_delay_ms: u64,
}

impl Service {
    pub fn new(id: ServiceId, name: String, config: ServiceConfig, is_target: bool) -> Self {
        let initial_delay = config.lifecycle.restart_delay_ms;
        Self {
            id,
            name,
            config,
            state: ServiceState::Inactive,
            is_target,
            restart_count: 0,
            current_restart_delay_ms: initial_delay,
        }
    }

    /// Targets transition directly to Running when deps satisfied
    pub fn target_check_ready(&mut self, graph: &ServiceGraph) {
        if !self.is_target {
            return;
        }
        if graph.all_requires_satisfied(self.id) {
            self.state = ServiceState::Running {
                pid: 0, // placeholder: no actual process, never signal this pid
                ready_at: Instant::now(),
            };
        }
    }

    /// Get next restart delay with exponential backoff.
    /// Returns None if max_restarts exceeded or policy says no restart.
    pub fn next_restart_delay(&mut self) -> Option<Duration> {
        if !self.should_restart() {
            return None;
        }
        let delay = self.current_restart_delay_ms;
        // Exponential backoff: double for next time, capped at max.
        // saturating_mul avoids overflow for very large configured delays.
        self.current_restart_delay_ms = self
            .current_restart_delay_ms
            .saturating_mul(2)
            .min(self.config.lifecycle.restart_delay_max_ms);
        self.restart_count += 1;
        Some(Duration::from_millis(delay))
    }

    /// Reset backoff when service becomes healthy (reaches Running state)
    pub fn reset_backoff(&mut self) {
        self.restart_count = 0;
        self.current_restart_delay_ms = self.config.lifecycle.restart_delay_ms;
    }
}
```
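`target_check_ready` depends on `graph.all_requires_satisfied(...)`, which is not shown in this document. The following is a minimal sketch of what such a check could look like, under loud assumptions: the `ServiceGraph` here is a hypothetical adjacency list over `usize` ids (the real one is built on petgraph), and `DepState` reduces each service to "Running or not", mirroring `is_satisfied()`.

```rust
/// Stand-in for `ServiceId`; the real server uses petgraph's `NodeIndex`.
pub type ServiceId = usize;

/// Only the distinction the check needs: is the dependency Running or not?
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DepState {
    Running,
    NotRunning,
}

/// Hypothetical, simplified graph: `requires[i]` lists the services that
/// service `i` requires, and `states[i]` is the current state of service `i`.
pub struct ServiceGraph {
    pub requires: Vec<Vec<ServiceId>>,
    pub states: Vec<DepState>,
}

impl ServiceGraph {
    /// True when every required service is Running.
    /// A service with no requirements is trivially satisfied.
    /// Panics if `id` (or a listed dependency) is out of range.
    pub fn all_requires_satisfied(&self, id: ServiceId) -> bool {
        self.requires[id]
            .iter()
            .all(|dep| self.states[*dep] == DepState::Running)
    }
}
```

With this shape, a target that requires services 0 and 1 becomes ready only once both report `Running`, which matches the "dependency anchor" role described above.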
## State Transitions
| From | Event | To | Notes |
|------|-------|----|-------|
| Inactive | StartRequested | Blocked | Dependencies not satisfied |
| Inactive | StartRequested | Starting | Dependencies satisfied |
| Blocked | DependencySatisfied | Starting | All deps now satisfied |
| Blocked | DependencyFailed | Failed | Hard dep failed |
| Starting | ProcessSpawned | Starting | Update with PID |
| Starting | HealthCheckPassed | Running | Ready |
| Starting | ProcessExited(0) | Exited | Oneshot success |
| Starting | ProcessExited(n) | Failed | Non-zero exit |
| Starting | Timeout | Failed | Start timeout |
| Running | StopRequested | Stopping | SIGTERM sent |
| Running | ProcessExited(0) | Exited | Clean exit |
| Running | ProcessExited(n) | Failed | Crash |
| Running | HealthCheckFailed | Failed | After retries |
| Stopping | ProcessExited | Exited | Any exit |
| Stopping | Timeout | Exited | SIGKILL sent, force exit |
| Exited | StartRequested | Starting/Blocked | Restart |
| Failed | StartRequested | Starting/Blocked | Manual restart |
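
The transition table can be encoded as a pure function, which makes illegal transitions explicit and easy to test. The sketch below is a simplification, not the server's implementation: `State` and `Event` are payload-free stand-ins for `ServiceState` and `ServiceEvent`, the `deps_satisfied` flag on `StartRequested` is a hypothetical way to pick between `Starting` and `Blocked`, and `DependencySatisfied` is assumed to fire only once all dependencies are satisfied.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Inactive,
    Blocked,
    Starting,
    Running,
    Stopping,
    Exited,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    StartRequested { deps_satisfied: bool },
    StopRequested,
    ProcessSpawned,
    ProcessExited { code: i32 },
    HealthCheckPassed,
    HealthCheckFailed,
    DependencySatisfied,
    DependencyFailed,
    Timeout,
}

/// Returns the next state, or None when the table has no row for (state, event).
pub fn transition(state: State, event: Event) -> Option<State> {
    let next = match (state, event) {
        (
            State::Inactive | State::Exited | State::Failed,
            Event::StartRequested { deps_satisfied },
        ) => {
            if deps_satisfied {
                State::Starting
            } else {
                State::Blocked
            }
        }
        (State::Blocked, Event::DependencySatisfied) => State::Starting,
        (State::Blocked, Event::DependencyFailed) => State::Failed,
        (State::Starting, Event::ProcessSpawned) => State::Starting,
        (State::Starting, Event::HealthCheckPassed) => State::Running,
        // Exit code 0 is a clean exit; anything else is a failure.
        (State::Starting | State::Running, Event::ProcessExited { code: 0 }) => State::Exited,
        (State::Starting | State::Running, Event::ProcessExited { .. }) => State::Failed,
        (State::Starting, Event::Timeout) => State::Failed,
        (State::Running, Event::StopRequested) => State::Stopping,
        (State::Running, Event::HealthCheckFailed) => State::Failed,
        // While stopping, any exit (or a forced kill after timeout) ends in Exited.
        (State::Stopping, Event::ProcessExited { .. }) => State::Exited,
        (State::Stopping, Event::Timeout) => State::Exited,
        _ => return None,
    };
    Some(next)
}
```

Returning `None` for unlisted pairs (for example `StopRequested` on an `Inactive` service) lets the caller decide whether to ignore or log the event instead of silently changing state.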
## Events
```rust
#[derive(Debug)]
pub enum ServiceEvent {
    StartRequested,
    StopRequested,
    ProcessSpawned { pid: u32 },
    ProcessExited { exit_code: Option<i32>, signal: Option<i32> },
    HealthCheckPassed,
    HealthCheckFailed { attempt: u32 },
    DependencySatisfied { service: ServiceId },
    DependencyFailed { service: ServiceId },
    Timeout { kind: TimeoutKind },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TimeoutKind {
    Start,
    Stop,
    HealthCheck,
    RestartDelay,
}
```
## Restart Policy
```rust
impl Service {
    fn should_restart(&self) -> bool {
        // Check policy first
        let policy_allows = match (&self.state, self.config.lifecycle.restart) {
            (_, RestartPolicy::Never) => false,
            // A clean exit (code 0) is not a failure
            (ServiceState::Exited { exit_code: Some(0), .. }, RestartPolicy::OnFailure) => false,
            (
                ServiceState::Exited { .. } | ServiceState::Failed { .. },
                RestartPolicy::Always | RestartPolicy::OnFailure,
            ) => true,
            // Any other state is not a terminal one, so there is nothing to restart
            _ => false,
        };
        if !policy_allows {
            return false;
        }

        // Check max restarts (0 = unlimited)
        let max = self.config.lifecycle.max_restarts;
        if max > 0 && self.restart_count >= max {
            return false;
        }
        true
    }
}
```
**Exponential backoff behavior:**
With defaults: `restart_delay_ms: 1000`, `restart_delay_max_ms: 300000`, `max_restarts: 10`
```
Crash #1 → wait 1s
Crash #2 → wait 2s
Crash #3 → wait 4s
Crash #4 → wait 8s
Crash #5 → wait 16s
Crash #6 → wait 32s
Crash #7 → wait 64s
Crash #8 → wait 128s (~2 min)
Crash #9 → wait 256s (~4 min)
Crash #10 → wait 300s (capped at 5 min)
Crash #11 → give up, stay Failed
```
Total delay before giving up: 811 s (about 13.5 minutes), not counting the time each attempt runs before it crashes.
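The schedule above can be reproduced with a small simulation of the doubling-and-capping rule. This is a standalone sketch: `Backoff` is a hypothetical stand-in for the restart-tracking fields of `Service`, and the restart-policy check is left out so that only the delay arithmetic is exercised.

```rust
use std::time::Duration;

/// Minimal stand-in for the restart-tracking fields of `Service`.
pub struct Backoff {
    pub initial_ms: u64,
    pub max_ms: u64,
    pub max_restarts: u32, // 0 = unlimited
    pub restart_count: u32,
    pub current_ms: u64,
}

impl Backoff {
    pub fn new(initial_ms: u64, max_ms: u64, max_restarts: u32) -> Self {
        Self {
            initial_ms,
            max_ms,
            max_restarts,
            restart_count: 0,
            current_ms: initial_ms,
        }
    }

    /// Returns the delay for this restart, then doubles the stored delay
    /// (capped at `max_ms`). Returns None once the restart budget is spent.
    pub fn next_delay(&mut self) -> Option<Duration> {
        if self.max_restarts > 0 && self.restart_count >= self.max_restarts {
            return None;
        }
        let delay = self.current_ms;
        self.current_ms = self.current_ms.saturating_mul(2).min(self.max_ms);
        self.restart_count += 1;
        Some(Duration::from_millis(delay))
    }

    /// Mirrors `reset_backoff()`: back to a fresh budget and the initial delay.
    pub fn reset(&mut self) {
        self.restart_count = 0;
        self.current_ms = self.initial_ms;
    }
}

/// Collects every delay (in whole seconds) until the budget is exhausted.
/// Do not call this with `max_restarts == 0`: the schedule would never end.
pub fn schedule_secs(mut backoff: Backoff) -> Vec<u64> {
    std::iter::from_fn(|| backoff.next_delay())
        .map(|d| d.as_secs())
        .collect()
}
```

Calling `schedule_secs(Backoff::new(1000, 300_000, 10))` yields the ten delays listed above: nine doublings from 1 s to 256 s, then one capped 300 s wait, for a cumulative 811 s.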
**Reset behavior:**
- When the service reaches the `Running` state, `reset_backoff()` is called
- The restart count goes back to 0 and the delay resets to its initial value
- The service gets a fresh set of retry attempts
**State transitions with restart:**
| Condition | Action |
|-----------|--------|
| Exit, should_restart=true | Schedule restart after `next_restart_delay()` |
| Exit, count >= max_restarts | Stay in Failed, log "max restarts exceeded" |
| Exit, should_restart=false | Stay in Exited/Failed |
| Reaches Running | Call `reset_backoff()` |