Terraphim Agent Supervisor
OTP-inspired supervision trees for fault-tolerant AI agent management.
Overview
This crate provides Erlang/OTP-style supervision patterns for managing AI agents, including automatic restart strategies, fault isolation, and hierarchical supervision. It implements the "let it crash" philosophy with fast failure detection and supervisor recovery.
Core Concepts
Supervision Trees
Hierarchical fault tolerance with automatic restart strategies:
- OneForOne: Restart only the failed agent
- OneForAll: Restart all agents if one fails
- RestForOne: Restart the failed agent and all agents started after it
Agent Lifecycle
Complete agent lifecycle management:
- Spawn: Create and initialize new agents
- Monitor: Health checks and status monitoring
- Restart: Automatic restart on failure with configurable policies
- Terminate: Graceful shutdown and cleanup
Fault Tolerance
Built-in resilience patterns:
- Fast failure detection with supervisor recovery
- Configurable restart intensity limits
- Circuit breaker patterns for cascading failure prevention
- Comprehensive error categorization and recovery strategies
Quick Start
use Arc;
use ;
use json;
async
Restart Strategies
OneForOne
Restart only the failed agent. Best for independent agents.
let mut config = default;
config.restart_policy.strategy = OneForOne;
OneForAll
Restart all agents if one fails. Best for tightly coupled agents.
let mut config = default;
config.restart_policy.strategy = OneForAll;
RestForOne
Restart the failed agent and all agents started after it. Best for pipeline-style workflows.
let mut config = default;
config.restart_policy.strategy = RestForOne;
Restart Intensity
Control how aggressively agents are restarted:
use ;
use Duration;
// Lenient policy: 10 restarts in 2 minutes
let lenient = new;
// Strict policy: 3 restarts in 30 seconds
let strict = new;
// Never restart
let never = never_restart;
Custom Agents
Implement the SupervisedAgent trait for custom agent types:
use async_trait;
use ;
Monitoring and Observability
Get supervisor and agent status:
// Get supervisor status
let status = supervisor.status;
println!;
// Get all child agents
let children = supervisor.get_children.await;
for in children
// Get restart history
let history = supervisor.get_restart_history.await;
for entry in history
Error Handling
The supervision system provides comprehensive error categorization:
use ;
match supervisor.spawn_agent.await
Features
- Fault Tolerance: Automatic restart with configurable strategies
- Health Monitoring: Built-in health checks and status tracking
- Resource Management: Configurable limits and timeouts
- Observability: Comprehensive monitoring and restart history
- Extensibility: Custom agent types and factories
- Performance: Efficient async implementation with minimal overhead
Integration
This crate integrates with the broader Terraphim ecosystem:
- terraphim_persistence: Agent state persistence
- terraphim_types: Common type definitions
- Future: Integration with knowledge graph-based agent coordination
License
Licensed under the Apache License, Version 2.0.