Terraphim Agent Supervisor

OTP-inspired supervision trees for fault-tolerant AI agent management.

Overview

This crate provides Erlang/OTP-style supervision patterns for managing AI agents, including automatic restart strategies, fault isolation, and hierarchical supervision. It implements the "let it crash" philosophy with fast failure detection and supervisor recovery.

Core Concepts

Supervision Trees

Hierarchical fault tolerance with automatic restart strategies:

OneForOne: Restart only the failed agent
OneForAll: Restart all agents if one fails
RestForOne: Restart the failed agent and all agents started after it

Agent Lifecycle

Complete agent lifecycle management:

Spawn: Create and initialize new agents
Monitor: Health checks and status monitoring
Restart: Automatic restart on failure with configurable policies
Terminate: Graceful shutdown and cleanup

Fault Tolerance

Built-in resilience patterns:

Fast failure detection with supervisor recovery
Configurable restart intensity limits
Circuit breaker patterns for cascading failure prevention
Comprehensive error categorization and recovery strategies

Quick Start

use std::sync::Arc;
use terraphim_agent_supervisor::{
    AgentSupervisor, SupervisorConfig, AgentSpec, TestAgentFactory,
    RestartStrategy, RestartPolicy
};
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create supervisor configuration
    let mut config = SupervisorConfig::default();
    config.restart_policy.strategy = RestartStrategy::OneForOne;

    // Create agent factory
    let factory = Arc::new(TestAgentFactory);

    // Create and start supervisor
    let mut supervisor = AgentSupervisor::new(config, factory);
    supervisor.start().await?;

    // Spawn an agent
    let spec = AgentSpec::new("test".to_string(), json!({}))
        .with_name("my-agent".to_string());
    let agent_id = supervisor.spawn_agent(spec).await?;

    println!("Agent {} spawned successfully", agent_id);

    // Simulate agent failure and restart
    supervisor.handle_agent_exit(
        agent_id,
        terraphim_agent_supervisor::ExitReason::Error("test failure".to_string())
    ).await?;

    // Stop supervisor
    supervisor.stop().await?;

    Ok(())
}

Restart Strategies

OneForOne

Restart only the failed agent. Best for independent agents.

let mut config = SupervisorConfig::default();
config.restart_policy.strategy = RestartStrategy::OneForOne;

OneForAll

Restart all agents if one fails. Best for tightly coupled agents.

let mut config = SupervisorConfig::default();
config.restart_policy.strategy = RestartStrategy::OneForAll;

RestForOne

Restart the failed agent and all agents started after it. Best for pipeline-style workflows.

let mut config = SupervisorConfig::default();
config.restart_policy.strategy = RestartStrategy::RestForOne;

Restart Intensity

Control how aggressively agents are restarted:

use terraphim_agent_supervisor::{RestartPolicy, RestartIntensity};
use std::time::Duration;

// Lenient policy: 10 restarts in 2 minutes
let lenient = RestartPolicy::new(
    RestartStrategy::OneForOne,
    RestartIntensity::new(10, Duration::from_secs(120))
);

// Strict policy: 3 restarts in 30 seconds
let strict = RestartPolicy::new(
    RestartStrategy::OneForOne,
    RestartIntensity::new(3, Duration::from_secs(30))
);

// Never restart
let never = RestartPolicy::never_restart();

Custom Agents

Implement the SupervisedAgent trait for custom agent types:

use async_trait::async_trait;
use terraphim_agent_supervisor::{
    SupervisedAgent, AgentPid, SupervisorId, AgentStatus,
    TerminateReason, SystemMessage, InitArgs, SupervisionResult
};

struct MyAgent {
    pid: AgentPid,
    supervisor_id: SupervisorId,
    status: AgentStatus,
}

#[async_trait]
impl SupervisedAgent for MyAgent {
    async fn init(&mut self, args: InitArgs) -> SupervisionResult<()> {
        self.pid = args.agent_id;
        self.supervisor_id = args.supervisor_id;
        self.status = AgentStatus::Starting;
        Ok(())
    }

    async fn start(&mut self) -> SupervisionResult<()> {
        self.status = AgentStatus::Running;
        // Start your agent logic here
        Ok(())
    }

    async fn stop(&mut self) -> SupervisionResult<()> {
        self.status = AgentStatus::Stopping;
        // Cleanup logic here
        self.status = AgentStatus::Stopped;
        Ok(())
    }

    // Implement other required methods...
}

Monitoring and Observability

Get supervisor and agent status:

// Get supervisor status
let status = supervisor.status();
println!("Supervisor status: {:?}", status);

// Get all child agents
let children = supervisor.get_children().await;
for (pid, info) in children {
    println!("Agent {}: {:?} (restarts: {})",
        pid, info.status, info.restart_count);
}

// Get restart history
let history = supervisor.get_restart_history().await;
for entry in history {
    println!("Agent {} restarted at {} due to {:?}",
        entry.agent_id, entry.timestamp, entry.reason);
}

Error Handling

The supervision system provides comprehensive error categorization:

use terraphim_agent_supervisor::{SupervisionError, ErrorCategory};

match supervisor.spawn_agent(spec).await {
    Ok(agent_id) => println!("Agent spawned: {}", agent_id),
    Err(e) => {
        println!("Error: {}", e);
        println!("Category: {:?}", e.category());
        println!("Recoverable: {}", e.is_recoverable());
    }
}

Features

Fault Tolerance: Automatic restart with configurable strategies
Health Monitoring: Built-in health checks and status tracking
Resource Management: Configurable limits and timeouts
Observability: Comprehensive monitoring and restart history
Extensibility: Custom agent types and factories
Performance: Efficient async implementation with minimal overhead

Integration

This crate integrates with the broader Terraphim ecosystem:

terraphim_persistence: Agent state persistence
terraphim_types: Common type definitions
Future: Integration with knowledge graph-based agent coordination

License

Licensed under the Apache License, Version 2.0.

terraphim_agent_supervisor 1.19.2