# Tutorial: AI-Assisted Failure Recovery
> 🧪 **EXPERIMENTAL**: This feature is new and not yet well-tested. The API and behavior may change
> based on user feedback.
This tutorial shows how to use AI agents (Claude Code, GitHub Copilot, or custom MCP clients) to
intelligently classify and recover from workflow failures that can't be handled by rule-based
mechanisms.
## Learning Objectives
By the end of this tutorial, you will:
- Understand the `pending_failed` job status
- Configure workflows for AI-assisted recovery
- Use the torc MCP server with AI agents
- Classify transient vs permanent errors
## Prerequisites
- Torc installed with the client feature
- A running Torc server
- An MCP-compatible AI agent (Claude Code, GitHub Copilot, or custom)
- `torc-mcp-server` installed
## The Problem
Traditional recovery mechanisms have limitations:
| Mechanism | Limitation |
|---|---|
| **Failure handlers** | Require predefined exit codes |
| **`torc watch --recover`** | Only handles OOM and timeout patterns |
| **`--retry-unknown`** | Blindly retries everything |
Some failures require **intelligent classification**:
- **Transient errors**: Network timeouts, node failures, service outages - should retry
- **Permanent errors**: Code bugs, missing files, invalid inputs - should fail
AI agents can analyze error logs and make these distinctions.
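For example, two illustrative stderr excerpts (both from jobs that exited non-zero) call for opposite actions:
```
# Transient: the storage endpoint was temporarily unreachable -> retry
ConnectionError: Connection refused to storage.example.com:443

# Permanent: the code itself is broken -> fail
SyntaxError: invalid syntax
```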
## How It Works
```mermaid
flowchart TD
JOB["Job exits with<br/>non-zero code"]
HANDLER{"Failure handler<br/>matches?"}
RETRY["Retry via<br/>failure handler"]
PENDING["Status: pending_failed<br/>Awaiting classification"]
WATCH["torc watch/recover<br/>+ AI agent"]
CLASSIFY["AI analyzes stderr"]
TRANSIENT["Transient error<br/>→ Retry"]
PERMANENT["Permanent error<br/>→ Fail"]
JOB --> HANDLER
HANDLER -->|Yes| RETRY
HANDLER -->|No| PENDING
PENDING --> WATCH
WATCH --> CLASSIFY
CLASSIFY --> TRANSIENT
CLASSIFY --> PERMANENT
style JOB fill:#dc3545,color:#fff
style PENDING fill:#ffc107,color:#000
style CLASSIFY fill:#4a9eff,color:#fff
style TRANSIENT fill:#28a745,color:#fff
style PERMANENT fill:#6c757d,color:#fff
```
When a job fails without a matching failure handler rule, it enters the `pending_failed` status
instead of `failed`. This prevents immediate downstream job cancellation and gives the AI agent time
to classify the error.
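You can list the jobs held in this state with the same command used later in this tutorial:
```bash
# Jobs awaiting AI classification (held instead of being failed outright)
torc jobs list $WORKFLOW_ID --status pending_failed
```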
## Quick Start
### Option A: Automatic AI Agent Invocation (Recommended)
Use the `--ai-recovery` flag to have torc automatically invoke an AI agent CLI (Claude Code by default) for classification:
```bash
# One-shot recovery with AI classification
torc recover 123 --ai-recovery
# Continuous monitoring with AI classification
torc watch 123 --ai-recovery
# Specify a different AI agent
torc recover 123 --ai-recovery --ai-agent claude # Default
torc recover 123 --ai-recovery --ai-agent copilot # GitHub Copilot
```
When `--ai-recovery` is enabled:
1. Torc detects jobs in `pending_failed` status
2. It invokes the AI agent CLI, connected to the torc MCP server
3. The agent analyzes each job's stderr and classifies it as transient (retry) or permanent (fail)
4. The agent applies its classifications via MCP tools
5. Recovery continues with the newly classified jobs
**Requirements** (see the quick check after this list):
- **Claude**: Claude Code CLI installed (`claude` command available)
- **GitHub Copilot**: GitHub CLI with Copilot installed (`gh copilot` command available)
- Torc MCP server configured in your AI agent's MCP settings
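Before relying on `--ai-recovery`, you can quickly verify these prerequisites (a sketch; the exact Copilot check depends on how the extension was installed):
```bash
# Verify the Claude Code CLI is on PATH (default agent)
command -v claude || echo "claude CLI not found"

# Verify the GitHub CLI Copilot extension (only needed for --ai-agent copilot)
gh extension list | grep -i copilot || echo "gh copilot extension not found"
```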
### Option B: Manual AI Agent Invocation
If you prefer manual control, configure your AI agent and invoke it yourself.
#### 1. Start the MCP Server
```bash
torc-mcp-server --url http://localhost:8080/torc-service/v1
```
#### 2. Configure Your AI Agent
Add the torc MCP server to your agent's configuration:
**Claude Code (`~/.claude/mcp_servers.json`):**
```json
{
"mcpServers": {
"torc": {
"command": "torc-mcp-server",
"args": ["--url", "http://localhost:8080/torc-service/v1"]
}
}
}
```
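Alternatively, if your Claude Code version manages MCP servers from the CLI, the same registration may work with `claude mcp add` (command support is an assumption; check `claude mcp --help`):
```bash
# Register the torc MCP server with Claude Code (assumes `claude mcp add` exists)
claude mcp add torc -- torc-mcp-server --url http://localhost:8080/torc-service/v1
```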
**GitHub Copilot (`.github/copilot/mcp-config.json` or global config):**
```json
{
"mcpServers": {
"torc": {
"command": "torc-mcp-server",
"args": ["--url", "http://localhost:8080/torc-service/v1"]
}
}
}
```
#### 3. Run a Workflow
```bash
torc run my_workflow.yaml
```
#### 4. Monitor with AI Recovery
When jobs fail, use your AI agent to:
1. List pending failures:
```
Agent: Use list_pending_failed_jobs with workflow_id=123
```
2. Analyze the errors:
```
Agent: The stderr shows "Connection refused to storage.example.com:443"
This is a transient network error - the storage server was temporarily down.
```
3. Classify and resolve:
```
Agent: Use classify_and_resolve_failures to retry these jobs
```
## MCP Tools
The torc MCP server provides these tools for AI-assisted recovery:
### list_pending_failed_jobs
Lists jobs with `pending_failed` status, including their stderr output.
**Input:**
```json
{
"workflow_id": 123
}
```
**Output:**
```json
{
"workflow_id": 123,
"pending_failed_count": 2,
"pending_failed_jobs": [
{
"job_id": 456,
"name": "process_data",
"return_code": 1,
"stderr_tail": "ConnectionError: Connection refused..."
}
],
"guidance": "Analyze the stderr output to classify each failure..."
}
```
### classify_and_resolve_failures
Applies classifications to jobs in `pending_failed` status.
**Input:**
```json
{
"workflow_id": 123,
"classifications": [
{
"job_id": 456,
"action": "retry",
"reason": "Transient network error - storage server was down"
},
{
"job_id": 789,
"action": "fail",
"reason": "SyntaxError in user code - requires fix"
}
],
"dry_run": true
}
```
**Actions:**
- `retry`: Reset the job to `ready` status with an incremented `attempt_id`
- `fail`: Set the job to `failed` status (triggers downstream job cancellation)
**Optional resource adjustments:**
```json
{
"job_id": 456,
"action": "retry",
"memory": "16g",
"runtime": "PT4H",
"reason": "OOM detected in stderr, increasing memory"
}
```
## Error Classification Guide
### Transient Errors (Should Retry)
| Pattern | Category |
|---|---|
| `Connection refused`, `Connection timed out` | Network |
| `NCCL timeout`, `GPU communication error` | GPU/HPC |
| `EIO`, `Input/output error` | Hardware |
| `Slurm: node failure`, `PREEMPTED` | HPC scheduling |
| `Service Unavailable`, `503` | External service |
### Permanent Errors (Should Fail)
| Pattern | Category |
|---|---|
| `SyntaxError`, `IndentationError` | Code bug |
| `ModuleNotFoundError`, `ImportError` | Missing dependency |
| `FileNotFoundError` (for input files) | Missing data |
| `IndexError`, `KeyError` | Logic error |
| `PermissionDenied` (consistent) | Configuration issue |
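These are the same signals an AI agent reasons over. As a rough illustration (a hypothetical helper, not part of torc), a pattern-based triage of a job's stderr might look like this; anything matching neither list is exactly what AI classification is for:
```bash
# Hypothetical triage sketch built from the pattern tables above
classify_stderr() {
  local stderr_file="$1"
  if grep -Eq 'Connection (refused|timed out)|NCCL timeout|Input/output error|PREEMPTED|Service Unavailable|503' "$stderr_file"; then
    echo "transient"   # safe to retry
  elif grep -Eq 'SyntaxError|IndentationError|ModuleNotFoundError|ImportError|FileNotFoundError|IndexError|KeyError' "$stderr_file"; then
    echo "permanent"   # retrying cannot help
  else
    echo "unknown"     # needs AI classification
  fi
}
```
For instance, `classify_stderr job_456.stderr` would print `transient` for the connection error shown earlier.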
## Integration with Existing Recovery
AI-assisted recovery works alongside other mechanisms:
```yaml
failure_handlers:
- name: known_errors
rules:
# Known recoverable exit codes handled immediately
- exit_codes: [10, 11]
recovery_script: ./recover.sh
max_retries: 3
# Unknown errors go to pending_failed for AI classification
```
When a job fails with an exit code not covered by the failure handler, it becomes `pending_failed`
instead of `failed`.
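In practice you can combine both mechanisms: the failure handler retries known exit codes immediately, while the watcher picks up everything else with AI classification:
```bash
# Exit codes 10 and 11 are retried by the failure handler; any other
# failure becomes pending_failed and is classified by the AI agent
torc run my_workflow.yaml
torc watch $WORKFLOW_ID --ai-recovery
```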
## Using with torc recover
The `torc recover` command can work with AI agents:
```bash
# Check for pending_failed jobs
torc jobs list $WORKFLOW_ID --status pending_failed
# Use AI agent to classify and resolve
# (via MCP tools list_pending_failed_jobs and classify_and_resolve_failures)
# After AI classification, reinitialize and continue
torc workflows reinit $WORKFLOW_ID
torc run $WORKFLOW_ID
```
## Custom MCP Servers
You can create domain-specific MCP servers that the AI agent uses alongside torc:
```mermaid
flowchart LR
subgraph agent["AI Agent"]
LLM["Claude/Copilot"]
end
subgraph mcp["MCP Servers"]
TORC["torc-mcp-server"]
CUSTOM["custom-mcp-server"]
end
subgraph systems["Systems"]
API["Torc API"]
DOMAIN["Domain APIs"]
end
LLM --> TORC
LLM --> CUSTOM
TORC --> API
CUSTOM --> DOMAIN
style LLM fill:#4a9eff,color:#fff
style TORC fill:#28a745,color:#fff
style CUSTOM fill:#ffc107,color:#000
```
For example, a custom MCP server could (see the configuration sketch after this list):
- Check external service status (was the storage server down?)
- Look up known error patterns in a database
- Query infrastructure monitoring systems
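Registering a custom server alongside torc follows the same configuration pattern shown earlier (the `custom-mcp-server` command and its `--endpoint` argument are placeholders for your own server):
```json
{
  "mcpServers": {
    "torc": {
      "command": "torc-mcp-server",
      "args": ["--url", "http://localhost:8080/torc-service/v1"]
    },
    "service-status": {
      "command": "custom-mcp-server",
      "args": ["--endpoint", "https://status.example.internal"]
    }
  }
}
```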
## Workflow Reset
If AI classification doesn't happen (e.g., no agent is running), jobs stay in `pending_failed`
status indefinitely. To reset them:
```bash
# Reset all failed/pending_failed jobs
torc workflows reset-status $WORKFLOW_ID --failed-only
```
## Example Session
Here's an example conversation with an AI agent:
```
User: My workflow 123 has some failures. Can you help?
Agent: Let me check what's happening.
[Uses get_workflow_status tool]
The workflow has 3 jobs in pending_failed status.
Agent: [Uses list_pending_failed_jobs tool]
I found 3 jobs awaiting classification:
Job 456 (process_batch_1): stderr shows
"ConnectionError: Connection refused to storage.internal:443"
Job 457 (process_batch_2): stderr shows the same connection error
Job 458 (validate_data): stderr shows
"ValueError: Column 'timestamp' has invalid format"
Agent: Analysis:
- Jobs 456, 457: Transient network error. The storage server was
temporarily unreachable. These should be retried.
- Job 458: Permanent error. There's a data format issue that
requires fixing the input data.
Agent: [Uses classify_and_resolve_failures with dry_run=true]
Here's my proposed classification:
- Job 456: retry (transient network error)
- Job 457: retry (transient network error)
- Job 458: fail (invalid input data)
Should I apply these classifications?
User: Yes, go ahead.
Agent: [Uses classify_and_resolve_failures with dry_run=false]
Done! Jobs 456 and 457 are now ready to retry.
Job 458 has been marked as failed.
Note: You'll need to fix the timestamp format in your input data
before rerunning job 458.
```
## Summary
AI-assisted recovery provides intelligent failure classification:
- **`pending_failed` status**: Defers classification for AI analysis
- **MCP tools**: `list_pending_failed_jobs`, `classify_and_resolve_failures`
- **Error patterns**: Transient vs permanent classification
- **Integration**: Works with failure handlers and `torc recover`