torc 0.23.0

Workflow management system
# Tutorial: AI-Assisted Failure Recovery

> 🧪 **EXPERIMENTAL**: This feature is new and not yet well-tested. The API and behavior may change
> based on user feedback.

This tutorial shows how to use AI agents (Claude Code, GitHub Copilot, or custom MCP clients) to
intelligently classify and recover from workflow failures that can't be handled by rule-based
mechanisms.

## Learning Objectives

By the end of this tutorial, you will:

- Understand the `pending_failed` job status
- Configure workflows for AI-assisted recovery
- Use the torc MCP server with AI agents
- Classify transient vs permanent errors

## Prerequisites

- Torc installed with the client feature
- A running Torc server
- An MCP-compatible AI agent (Claude Code, GitHub Copilot, or custom)
- `torc-mcp-server` installed

## The Problem

Traditional recovery mechanisms have limitations:

| Mechanism                  | Limitation                            |
| -------------------------- | ------------------------------------- |
| **Failure handlers**       | Require predefined exit codes         |
| **`torc watch --recover`** | Only handles OOM and timeout patterns |
| **`--retry-unknown`**      | Blindly retries everything            |

Some failures require **intelligent classification**:

- **Transient errors**: Network timeouts, node failures, service outages - should retry
- **Permanent errors**: Code bugs, missing files, invalid inputs - should fail

AI agents can analyze error logs and make these distinctions.

## How It Works

```mermaid
flowchart TD
    JOB["Job exits with<br/>non-zero code"]
    HANDLER{"Failure handler<br/>matches?"}
    RETRY["Retry via<br/>failure handler"]
    PENDING["Status: pending_failed<br/>Awaiting classification"]
    WATCH["torc watch/recover<br/>+ AI agent"]
    CLASSIFY["AI analyzes stderr"]
    TRANSIENT["Transient error<br/>→ Retry"]
    PERMANENT["Permanent error<br/>→ Fail"]

    JOB --> HANDLER
    HANDLER -->|Yes| RETRY
    HANDLER -->|No| PENDING
    PENDING --> WATCH
    WATCH --> CLASSIFY
    CLASSIFY --> TRANSIENT
    CLASSIFY --> PERMANENT

    style JOB fill:#dc3545,color:#fff
    style PENDING fill:#ffc107,color:#000
    style CLASSIFY fill:#4a9eff,color:#fff
    style TRANSIENT fill:#28a745,color:#fff
    style PERMANENT fill:#6c757d,color:#fff
```

When a job fails without a matching failure handler rule, it enters the `pending_failed` status
instead of `failed`. This prevents immediate downstream job cancellation and gives the AI agent time
to classify the error.

## Quick Start

### Option A: Automatic AI Agent Invocation (Recommended)

Use the `--ai-recovery` flag to have torc invoke an AI agent CLI (Claude by default) for classification:

```bash
# One-shot recovery with AI classification
torc recover 123 --ai-recovery

# Continuous monitoring with AI classification
torc watch 123 --ai-recovery

# Specify a different AI agent
torc recover 123 --ai-recovery --ai-agent claude     # Default
torc recover 123 --ai-recovery --ai-agent copilot    # GitHub Copilot
```

When `--ai-recovery` is enabled:

1. Torc detects jobs in `pending_failed` status.
2. Torc invokes the AI agent CLI with the torc MCP server.
3. The AI agent analyzes stderr and classifies each job as transient (retry) or permanent (fail).
4. The classifications are applied via MCP tools.
5. Recovery continues with the newly classified jobs.

**Requirements:**

- **Claude**: Claude Code CLI installed (`claude` command available)
- **GitHub Copilot**: GitHub CLI with Copilot installed (`gh copilot` command available)
- Torc MCP server configured in your AI agent's MCP settings

### Option B: Manual AI Agent Invocation

If you prefer manual control, configure your AI agent and invoke it yourself.

#### 1. Start the MCP Server

```bash
torc-mcp-server --url http://localhost:8080/torc-service/v1
```

#### 2. Configure Your AI Agent

Add the torc MCP server to your agent's configuration:

**Claude Code (`~/.claude/mcp_servers.json`):**

```json
{
  "mcpServers": {
    "torc": {
      "command": "torc-mcp-server",
      "args": ["--url", "http://localhost:8080/torc-service/v1"]
    }
  }
}
```

**GitHub Copilot (`.github/copilot/mcp-config.json` or global config):**

```json
{
  "mcpServers": {
    "torc": {
      "command": "torc-mcp-server",
      "args": ["--url", "http://localhost:8080/torc-service/v1"]
    }
  }
}
```

#### 3. Run a Workflow

```bash
torc run my_workflow.yaml
```

#### 4. Monitor with AI Recovery

When jobs fail, use your AI agent to:

1. List pending failures:
   ```
   Agent: Use list_pending_failed_jobs with workflow_id=123
   ```

2. Analyze the errors:
   ```
   Agent: The stderr shows "Connection refused to storage.example.com:443"
   This is a transient network error - the storage server was temporarily down.
   ```

3. Classify and resolve:
   ```
   Agent: Use classify_and_resolve_failures to retry these jobs
   ```

## MCP Tools

The torc MCP server provides these tools for AI-assisted recovery:

### list_pending_failed_jobs

Lists jobs with `pending_failed` status, including their stderr output.

**Input:**

```json
{
  "workflow_id": 123
}
```

**Output:**

```json
{
  "workflow_id": 123,
  "pending_failed_count": 2,
  "pending_failed_jobs": [
    {
      "job_id": 456,
      "name": "process_data",
      "return_code": 1,
      "stderr_tail": "ConnectionError: Connection refused..."
    }
  ],
  "guidance": "Analyze the stderr output to classify each failure..."
}
```

### classify_and_resolve_failures

Applies classifications to `pending_failed` jobs.

**Input:**

```json
{
  "workflow_id": 123,
  "classifications": [
    {
      "job_id": 456,
      "action": "retry",
      "reason": "Transient network error - storage server was down"
    },
    {
      "job_id": 789,
      "action": "fail",
      "reason": "SyntaxError in user code - requires fix"
    }
  ],
  "dry_run": true
}
```

**Actions:**

- `retry`: Reset to `ready` status with bumped `attempt_id`
- `fail`: Set to `failed` status (triggers downstream cancellation)

**Optional resource adjustments:**

```json
{
  "job_id": 456,
  "action": "retry",
  "memory": "16g",
  "runtime": "PT4H",
  "reason": "OOM detected in stderr, increasing memory"
}
```

## Error Classification Guide

### Transient Errors (Should Retry)

| Error Pattern                                | Category         |
| -------------------------------------------- | ---------------- |
| `Connection refused`, `Connection timed out` | Network          |
| `NCCL timeout`, `GPU communication error`    | GPU/HPC          |
| `EIO`, `Input/output error`                  | Hardware         |
| `Slurm: node failure`, `PREEMPTED`           | HPC scheduling   |
| `Service Unavailable`, `503`                 | External service |

### Permanent Errors (Should Fail)

| Error Pattern                         | Category            |
| ------------------------------------- | ------------------- |
| `SyntaxError`, `IndentationError`     | Code bug            |
| `ModuleNotFoundError`, `ImportError`  | Missing dependency  |
| `FileNotFoundError` (for input files) | Missing data        |
| `IndexError`, `KeyError`              | Logic error         |
| `PermissionDenied` (consistent)       | Configuration issue |
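
The two tables can be read as a simple pattern lookup. The following is a minimal sketch of such a lookup, useful as a pre-filter or as a sanity check on an agent's answer; it is not how torc or the AI agent classifies errors, and `suggest_action` is a hypothetical helper. The regular expressions are a loose transcription of the tables above.

```python
import re

# Patterns transcribed from the "Transient Errors" table.
TRANSIENT_PATTERNS = {
    "Network": r"Connection refused|Connection timed out",
    "GPU/HPC": r"NCCL timeout|GPU communication error",
    "Hardware": r"\bEIO\b|Input/output error",
    "HPC scheduling": r"node failure|PREEMPTED",
    "External service": r"Service Unavailable|\b503\b",
}

# Patterns transcribed from the "Permanent Errors" table.
PERMANENT_PATTERNS = {
    "Code bug": r"SyntaxError|IndentationError",
    "Missing dependency": r"ModuleNotFoundError|ImportError",
    "Missing data": r"FileNotFoundError",
    "Logic error": r"IndexError|KeyError",
    "Configuration issue": r"PermissionDenied",
}


def suggest_action(stderr: str):
    """Return (action, category), or (None, None) when no pattern matches."""
    # Design choice of this sketch: check permanent patterns first so that a
    # log containing both kinds of message is not retried blindly.
    for category, pattern in PERMANENT_PATTERNS.items():
        if re.search(pattern, stderr):
            return "fail", category
    for category, pattern in TRANSIENT_PATTERNS.items():
        if re.search(pattern, stderr):
            return "retry", category
    return None, None


print(suggest_action("ConnectionError: Connection refused to storage.internal:443"))
print(suggest_action("ModuleNotFoundError: No module named 'numpy'"))
print(suggest_action("ValueError: Column 'timestamp' has invalid format"))
```

The last call matches nothing in either table, which is the kind of case where an agent reading the full context is more useful than a fixed pattern list.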

## Integration with Existing Recovery

AI-assisted recovery works alongside other mechanisms:

```yaml
failure_handlers:
  - name: known_errors
    rules:
      # Known recoverable exit codes handled immediately
      - exit_codes: [10, 11]
        recovery_script: ./recover.sh
        max_retries: 3
      # Unknown errors go to pending_failed for AI classification
```

When a job fails with an exit code not covered by the failure handler, it becomes `pending_failed`
instead of `failed`.
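
The routing described here can be sketched as a small function. This is a toy model under stated assumptions: `route_failure` is a hypothetical name, the rule dicts mirror the YAML above, and the return values are labels chosen for this example rather than torc status names (apart from `pending_failed`). What happens once `max_retries` is exhausted is not covered in this tutorial, so the sketch leaves it out.

```python
def route_failure(exit_code: int, rules: list[dict]) -> str:
    """Decide where a failed job goes, based only on its exit code (toy model)."""
    for rule in rules:
        if exit_code in rule["exit_codes"]:
            # A rule matches: the failure handler takes over.
            return "failure_handler"
    # No rule matches: defer to AI classification.
    return "pending_failed"


# Mirrors the `known_errors` handler from the YAML example.
rules = [
    {"exit_codes": [10, 11], "recovery_script": "./recover.sh", "max_retries": 3},
]

print(route_failure(10, rules))
print(route_failure(1, rules))
```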

## Using with torc recover

The `torc recover` command can work with AI agents:

```bash
# Check for pending_failed jobs
torc jobs list $WORKFLOW_ID --status pending_failed

# Use AI agent to classify and resolve
# (via MCP tools list_pending_failed_jobs and classify_and_resolve_failures)

# After AI classification, reinitialize and continue
torc workflows reinit $WORKFLOW_ID
torc run $WORKFLOW_ID
```

## Custom MCP Servers

You can create domain-specific MCP servers that the AI agent uses alongside torc:

```mermaid
flowchart LR
    subgraph agent["AI Agent"]
        LLM["Claude/Copilot"]
    end

    subgraph mcp["MCP Servers"]
        TORC["torc-mcp-server"]
        CUSTOM["custom-mcp-server"]
    end

    subgraph systems["Systems"]
        API["Torc API"]
        DOMAIN["Domain APIs"]
    end

    LLM --> TORC
    LLM --> CUSTOM
    TORC --> API
    CUSTOM --> DOMAIN

    style LLM fill:#4a9eff,color:#fff
    style TORC fill:#28a745,color:#fff
    style CUSTOM fill:#ffc107,color:#000
```

For example, a custom MCP server could:

- Check external service status (was the storage server down?)
- Look up known error patterns in a database
- Query infrastructure monitoring systems

## Workflow Reset

If no agent classifies the jobs (for example, because none is running), they stay in
`pending_failed` status indefinitely. To reset them:

```bash
# Reset all failed/pending_failed jobs
torc workflows reset-status $WORKFLOW_ID --failed-only
```

## Example Session

Here's an example conversation with an AI agent:

```
User: My workflow 123 has some failures. Can you help?

Agent: Let me check what's happening.
       [Uses get_workflow_status tool]
       The workflow has 3 jobs in pending_failed status.

Agent: [Uses list_pending_failed_jobs tool]
       I found 3 jobs awaiting classification:

       Job 456 (process_batch_1): stderr shows
       "ConnectionError: Connection refused to storage.internal:443"

       Job 457 (process_batch_2): stderr shows the same connection error

       Job 458 (validate_data): stderr shows
       "ValueError: Column 'timestamp' has invalid format"

Agent: Analysis:
       - Jobs 456, 457: Transient network error. The storage server was
         temporarily unreachable. These should be retried.
       - Job 458: Permanent error. There's a data format issue that
         requires fixing the input data.

Agent: [Uses classify_and_resolve_failures with dry_run=true]
       Here's my proposed classification:
       - Job 456: retry (transient network error)
       - Job 457: retry (transient network error)
       - Job 458: fail (invalid input data)

       Should I apply these classifications?

User: Yes, go ahead.

Agent: [Uses classify_and_resolve_failures with dry_run=false]
       Done! Jobs 456 and 457 are now ready to retry.
       Job 458 has been marked as failed.

       Note: You'll need to fix the timestamp format in your input data
       before rerunning job 458.
```

## Summary

AI-assisted recovery provides intelligent failure classification:

- **`pending_failed` status**: Defers classification for AI analysis
- **MCP tools**: `list_pending_failed_jobs`, `classify_and_resolve_failures`
- **Error patterns**: Transient vs permanent classification
- **Integration**: Works with failure handlers and `torc recover`