zinit 0.3.6 - Docs.rs

# zinit Internals: Graph and Supervisor Integration

This document explains how the dependency graph integrates with the supervisor,
the event flow for service startup, and the design rationale behind key decisions.

For the graph API specification, see [spec/03-dependency-graph.md](../spec/03-dependency-graph.md).

## Core Data Structures

### Graph Structure

```rust
pub struct ServiceGraph {
    graph: DiGraph<Service, DepType>,    // petgraph directed graph
    by_name: HashMap<String, ServiceId>, // name → node index
}
```

- **Nodes**: `Service` structs containing state, config, restart info
- **Edges**: `DepType` enum (After, Requires, Wants, Conflicts)
- **Direction**: dependency → dependent ("A must complete before B")

### Dependency Types

| Type       | Blocks Start? | Satisfaction Predicate      | Notes                          |
|------------|---------------|----------------------------|--------------------------------|
| `Requires` | Yes (hard)    | `is_satisfied()` - Running OR Exited(0) | Allows oneshot services |
| `After`    | Soft          | Not Inactive/Blocked (attempted) | Ordering only            |
| `Wants`    | Never         | N/A                        | Soft preference                |
| `Conflicts`| If active     | `!is_active()`             | Bidirectional edges            |

### State Predicates

```rust
// Can satisfy a "requires" dependency
fn is_satisfied(&self) -> bool {
    matches!(self, Running { .. } | Exited { exit_code: Some(0) })
}

// Currently has a process (Starting, Running, or Stopping)
fn is_active(&self) -> bool {
    matches!(self, Starting { .. } | Running { .. } | Stopping { .. })
}
```

## The `can_start()` Gate

The supervisor calls `can_start()` before spawning any service:

```rust
pub fn can_start(&self, id: ServiceId) -> Result<(), BlockedReason> {
    let mut waiting_on = Vec::new();
    let mut conflicts_with = Vec::new();

    // Check incoming edges (our dependencies)
    for edge in self.graph.edges_directed(id, Direction::Incoming) {
        let dep = &self.graph[edge.source()];

        match edge.weight() {
            DepType::Requires => {
                if !dep.state.is_satisfied() {
                    waiting_on.push(dep.name.clone());
                }
            }
            DepType::After => {
                // Only blocks if dependency hasn't been attempted yet
                if matches!(dep.state, Inactive | Blocked { .. }) {
                    waiting_on.push(dep.name.clone());
                }
            }
            DepType::Wants => { /* never blocks */ }
            DepType::Conflicts => { /* checked separately */ }
        }
    }

    // Check outgoing conflict edges
    for conflict_id in self.conflicts(id) {
        let conflict = &self.graph[conflict_id];
        if conflict.state.is_active() {
            conflicts_with.push(conflict.name.clone());
        }
    }

    // Return combined result
    match (waiting_on.is_empty(), conflicts_with.is_empty()) {
        (true, true) => Ok(()),
        (false, true) => Err(BlockedReason::WaitingOn(waiting_on)),
        (true, false) => Err(BlockedReason::ConflictsWith(conflicts_with)),
        (false, false) => Err(BlockedReason::Both { waiting_on, conflicts_with }),
    }
}
```

## Supervisor Event Flow

### 1. Initial Startup

```rust
async fn start_all(&mut self) {
    // Get topologically sorted order (dependencies before dependents)
    let order = self.graph.read().await.start_order();

    for id in order {
        self.try_start_service(id).await;
    }
}
```

### 2. Try Start Service

```rust
async fn try_start_service(&mut self, id: ServiceId) {
    let action = {
        let mut graph = self.graph.write().await;
        let service = graph.get(id)?;

        // Guard 1: Check if autostart enabled
        if !service.should_autostart() { return; }

        // Guard 2: Check state allows start attempts
        if !service.state.can_attempt_start() { return; }

        // Guard 3: Check dependencies via can_start()
        match graph.can_start(id) {
            Ok(()) => {
                if service.is_target() {
                    // Targets: instant transition to Running (pid=0)
                    graph.set_state(id, ServiceState::Running { pid: 0 });
                    StartAction::TargetReady { dependents: graph.dependents(id) }
                } else {
                    // Real services: spawn process
                    graph.set_state(id, ServiceState::Starting { pid: 0 });
                    StartAction::SpawnProcess { config }
                }
            }
            Err(reason) => {
                // Dependencies not met → mark as Blocked
                graph.set_state(id, ServiceState::Blocked {
                    waiting_on: reason.waiting_on()
                });
                StartAction::Blocked { reason }
            }
        }
    }; // Lock released

    // Execute action outside lock
    match action {
        StartAction::TargetReady { dependents, .. } => {
            self.queue_reevaluate(dependents).await;
        }
        StartAction::SpawnProcess { config } => {
            self.spawn_service(id, config).await;
        }
        StartAction::Blocked { .. } => { /* logged */ }
    }
}
```

### 3. Re-evaluation Loop

When a service transitions state, its dependents are queued for re-evaluation:

```rust
async fn handle_reevaluate(&mut self, id: ServiceId) {
    let action = {
        let graph = self.graph.read().await;
        let service = graph.get(id)?;

        match &service.state {
            // Blocked or Inactive → try to start
            ServiceState::Blocked { .. } | ServiceState::Inactive => {
                ReevaluateAction::TryStart
            }

            // Running target → verify deps still satisfied
            ServiceState::Running { .. } if service.is_target() => {
                match graph.can_start(id) {
                    Ok(()) => ReevaluateAction::None,
                    Err(reason) => ReevaluateAction::DowngradeTarget {
                        waiting_on: reason.waiting_on()
                    },
                }
            }
            _ => ReevaluateAction::None,
        }
    };

    match action {
        ReevaluateAction::TryStart => {
            self.try_start_service(id).await;
        }
        ReevaluateAction::DowngradeTarget { waiting_on, .. } => {
            // Target's deps failed → downgrade to Blocked, cascade
            let dependents = {
                let mut graph = self.graph.write().await;
                graph.set_state(id, ServiceState::Blocked { waiting_on });
                graph.dependents(id)
            };
            self.queue_reevaluate(dependents).await;
        }
        ReevaluateAction::None => {}
    }
}
```

## Event Flow Diagram

```
                    ┌──────────────────┐
                    │    start_all()   │
                    │(topological order)│
                    └────────┬─────────┘
                             │
              ┌──────────────▼──────────────┐
              │     try_start_service()     │
              │  ┌─────────────────────┐    │
              │  │   can_start(id)?    │    │
              │  └──────────┬──────────┘    │
              │      Ok     │    Err        │
              │      ▼      │     ▼         │
              │   Spawn/    │  Blocked      │
              │   Target    │               │
              └──────┬──────┴───────────────┘
                     │
                     │ Service becomes Running/Exited(0)
                     ▼
              ┌─────────────────────────────┐
              │ queue_reevaluate(dependents)│
              └──────────────┬──────────────┘
                             │
              ┌──────────────▼──────────────┐
              │    handle_reevaluate()      │
              │  ┌─────────────────────┐    │
              │  │ Blocked? → TryStart │    │
              │  │ Target?  → Recheck  │    │
              │  └─────────────────────┘    │
              └──────────────┬──────────────┘
                             │
                             ▼
                   (back to try_start_service)
```

## Ordering Algorithms

### Start Order (Topological Sort)

```rust
pub fn start_order(&self) -> Vec<ServiceId> {
    // Filter out conflict edges (bidirectional, don't affect ordering)
    let filtered = self.graph.filter_map(
        |_, _| Some(()),
        |_, weight| {
            if *weight != DepType::Conflicts { Some(()) } else { None }
        },
    );

    toposort(&filtered, None).unwrap_or_default()
}
```

### Shutdown Order

Simply reverse topological order - dependents stop before dependencies:

```rust
pub fn shutdown_order(&self) -> Vec<ServiceId> {
    let mut order = self.start_order();
    order.reverse();
    order
}
```

### Transitive Dependents (for cascading stops)

DFS post-order traversal returns most-dependent-first:

```rust
pub fn all_dependents_ordered(&self, id: ServiceId) -> Vec<ServiceId> {
    let mut result = Vec::new();
    let mut visited = HashSet::new();
    self.collect_dependents_recursive(id, &mut result, &mut visited);
    result
}

fn collect_dependents_recursive(&self, id: ServiceId, result: &mut Vec<ServiceId>, visited: &mut HashSet<ServiceId>) {
    for dependent_id in self.dependents(id) {
        if visited.insert(dependent_id) {
            // Recurse first (post-order)
            self.collect_dependents_recursive(dependent_id, result, visited);
            result.push(dependent_id);
        }
    }
}
```

## Cycle Detection

```rust
pub fn validate(&self) -> GraphResult<()> {
    // Filter out conflict edges - bidirectional edges would
    // create false-positive cycles
    let filtered = self.graph.filter_map(
        |_, _| Some(()),
        |_, weight| {
            if *weight != DepType::Conflicts { Some(()) } else { None }
        },
    );

    if is_cyclic_directed(&filtered) {
        let cycle = self.find_cycle(&filtered);
        return Err(GraphError::CyclicDependency(cycle));
    }
    Ok(())
}
```

## Design Rationale

### Why separate `Requires` vs `After`?

- **Requires**: "I need this running to function" (database → app)
- **After**: "Just order me, don't hard-fail if it's stuck" (logging → app)

The `After` type allows services to proceed even if a soft dependency is stuck
in Failed state - it just needs to have been *attempted*.

### Why does `Exited(0)` satisfy `Requires`?

This supports **oneshot services** - initialization tasks that run once and exit:

```
mount-filesystems (exits 0) → app (can now start)
```

Without this, you'd need artificial "ready" signals for setup tasks.

### Why bidirectional conflict edges?

Symmetry: if A conflicts with B, B conflicts with A. The conflict relationship
is inherently bidirectional. Creating edges in both directions ensures:
- `graph.conflicts(A)` returns B
- `graph.conflicts(B)` returns A

### Why filter conflicts for cycle detection?

A↔B bidirectional conflict edges form a "cycle" in graph terms, but it's not
a dependency cycle - it just means "don't run together". Filtering them out
prevents false-positive cycle detection.

### Why preserve state on reload?

Hot-reload allows config changes without restarting services. State preservation
copies runtime state (Running, restart_count, timestamps) from the old graph
to the new one for unchanged services.

## Key Implementation Files

| File | Purpose |
|------|---------|
| `zinit-server/src/graph.rs` | Dependency graph, `can_start()`, ordering |
| `zinit-server/src/supervisor/mod.rs` | Main supervisor struct and event loop |
| `zinit-server/src/supervisor/spawning.rs` | `try_start_service()`, process spawning |
| `zinit-server/src/supervisor/handlers.rs` | Re-evaluation, state transitions |
| `zinit-common/src/state.rs` | `ServiceState` enum, `is_satisfied()`, `is_active()` |
| `zinit-common/src/responses.rs` | `DepType` enum definition |