torc 0.21.0

Workflow management system
# Automatic Failure Recovery

This guide explains how to use Torc's automatic recovery features to handle workflow failures
without manual intervention.

## Overview

Torc provides **automatic failure recovery** through two commands:

- **`torc recover`** - One-shot recovery for Slurm workflows
- **`torc watch --recover`** - Continuous monitoring with automatic recovery

When jobs fail, the system:

1. Diagnoses the failure cause (OOM, timeout, or unknown)
2. Applies heuristics to adjust resource requirements
3. Resets failed jobs and submits new Slurm allocations
4. (watch only) Resumes monitoring until completion or max retries

This deterministic approach handles the majority of HPC failures without human intervention.

## Why Deterministic Recovery?

Most HPC job failures fall into predictable categories:

| Failure Type     | Frequency | Solution                   |
| ---------------- | --------- | -------------------------- |
| Out of Memory    | ~60%      | Increase memory allocation |
| Timeout          | ~25%      | Increase runtime limit     |
| Transient errors | ~10%      | Simple retry               |
| Code bugs        | ~5%       | Manual intervention        |

For 85-90% of failures, the solution is mechanical: increase resources and retry. This doesn't
require AI judgment; simple heuristics work well.

## Recovery Architecture

```mermaid
flowchart LR
    A[torc watch<br/>polling] --> B{Workflow<br/>complete?}
    B -->|No| A
    B -->|Yes, with failures| C[Diagnose failures<br/>check resources]
    C --> D[Apply heuristics<br/>adjust resources]
    D --> E[Submit new<br/>allocations]
    E --> A
    B -->|Yes, success| F[Exit 0]

    style A fill:#4a9eff,color:#fff
    style B fill:#6c757d,color:#fff
    style C fill:#ffc107,color:#000
    style D fill:#ffc107,color:#000
    style E fill:#28a745,color:#fff
    style F fill:#28a745,color:#fff
```

### Failure Detection

Torc tracks resource usage during job execution:

- Memory usage (RSS and peak)
- CPU utilization
- Execution time

This data is analyzed to determine failure causes:

**OOM Detection:**

- Peak memory exceeds specified limit
- Exit code 137 (SIGKILL from OOM killer)
- Flag: `likely_oom: true`

**Timeout Detection:**

- Execution time within 10% of runtime limit
- Job was killed (not graceful exit)
- Flag: `likely_timeout: true`
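
The detection rules above can be sketched as a small shell function. This is an illustration, not Torc's implementation: the name `classify_failure`, the argument order, and the order in which the checks run are assumptions, and "job was killed" is approximated by a non-zero exit code.

```bash
# Hypothetical helper, not Torc's implementation. Memory values are in bytes,
# times in seconds; all arguments are plain integers.
classify_failure() {
    exit_code=$1; peak_mem=$2; mem_limit=$3; elapsed=$4; runtime_limit=$5

    # Strongest OOM signal: measured peak memory above the requested limit.
    if [ "$peak_mem" -gt "$mem_limit" ]; then
        echo "likely_oom"; return 0
    fi

    # Timeout: non-zero exit with elapsed time within 10% of the limit.
    # elapsed >= 0.9 * limit, kept in integers as elapsed * 10 >= limit * 9.
    if [ "$exit_code" -ne 0 ] && [ $((elapsed * 10)) -ge $((runtime_limit * 9)) ]; then
        echo "likely_timeout"; return 0
    fi

    # Exit code 137 is 128 + 9 (SIGKILL), which the OOM killer sends.
    if [ "$exit_code" -eq 137 ]; then
        echo "likely_oom"; return 0
    fi

    echo "unknown"
}
```

For example, `classify_failure 137 9000 8000 100 3600` prints `likely_oom`, and `classify_failure 1 100 8000 3500 3600` prints `likely_timeout`. Checking the time limit before the bare exit code keeps a job that was SIGKILLed at its walltime from being counted as OOM, at the cost of possibly mislabeling an OOM that happens late in the run.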

### Recovery Heuristics

| Failure Type  | Detection                          | Default Action           |
| ------------- | ---------------------------------- | ------------------------ |
| Out of Memory | Peak memory > limit, exit code 137 | Increase memory by 1.5x  |
| Timeout       | Execution time near limit          | Increase runtime by 1.4x |
| Unknown       | Other exit codes                   | **Skip** (likely bug)    |

> **Note:** By default, jobs with unknown failure causes are **not** retried, since they likely have
> script or data bugs that won't be fixed by retrying. Use `--retry-unknown` to also retry these
> jobs (e.g., to handle transient errors like network issues).
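
As a rough illustration of the arithmetic behind these actions, the sketch below scales a memory string and a runtime by a multiplier. The helper names, the rounding rules (memory rounds up to a whole unit, runtime rounds to the nearest second), and the use of plain seconds for runtime are assumptions made for the example; Torc's own rounding and formats may differ.

```bash
# Hypothetical helper: multiply a memory string such as "8g" by a factor,
# rounding up to a whole number of the same unit.
scale_memory() {
    awk -v mem="$1" -v factor="$2" 'BEGIN {
        value = mem + 0                          # numeric prefix, e.g. 8 from "8g"
        unit = mem; sub(/^[0-9.]+/, "", unit)    # remaining suffix, e.g. "g"
        scaled = value * factor
        rounded = int(scaled); if (rounded < scaled) rounded++
        printf "%d%s\n", rounded, unit
    }'
}

# Hypothetical helper: multiply a runtime in seconds, rounding to the nearest second.
scale_seconds() {
    awk -v secs="$1" -v factor="$2" 'BEGIN { printf "%d\n", int(secs * factor + 0.5) }'
}
```

With the default memory multiplier, `scale_memory 8g 1.5` prints `12g`, which matches the `8g -> 12g` adjustment shown in the example output in this guide.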

## The `torc recover` Command

For one-shot recovery when a workflow has failed:

```bash
# Preview what would be done (recommended first step)
torc recover 42 --dry-run

# Execute the recovery
torc recover 42
```

This command:

1. Detects and cleans up orphaned jobs from terminated Slurm allocations
2. Checks that the workflow is complete and no workers are active
3. Diagnoses failure causes (OOM, timeout, etc.)
4. Adjusts resource requirements based on heuristics
5. Runs optional recovery hook for custom logic
6. Resets failed jobs and regenerates Slurm schedulers
7. Submits new allocations

> **Note:** Step 1 (orphan cleanup) handles the case where Slurm terminated an allocation
> unexpectedly, leaving jobs stuck in "running" status. This is done automatically before checking
> preconditions.
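
If you want the dry run to always come first, a small wrapper can enforce it. This is a sketch under assumptions: `recover_with_preview` is a hypothetical name, and it assumes `torc recover --dry-run` exits non-zero when the preview itself fails.

```bash
# Hypothetical wrapper: show the dry-run plan, then recover only after confirmation.
recover_with_preview() {
    workflow_id=$1

    # Preview first; stop if the dry run itself fails.
    torc recover "$workflow_id" --dry-run || return 1

    printf 'Apply this recovery plan to workflow %s? [y/N] ' "$workflow_id"
    read -r answer
    case $answer in
        y|Y) torc recover "$workflow_id" ;;
        *)   echo "Recovery skipped"; return 0 ;;
    esac
}
```

Calling `recover_with_preview 42` prints the plan and waits for `y` before making any changes.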

### Options

```bash
torc recover <workflow_id> \
  --memory-multiplier 1.5 \     # Memory increase factor for OOM (default: 1.5)
  --runtime-multiplier 1.4 \    # Runtime increase factor for timeout (default: 1.4)
  --retry-unknown \             # Also retry jobs with unknown failure causes
  --recovery-hook "bash fix.sh" \  # Custom script for unknown failures
  --dry-run                     # Preview without making changes
```

### Example Output

```
Diagnosing failures...
Applying recovery heuristics...
  Job 107 (train_model): OOM detected, increasing memory 8g -> 12g
  Applied fixes: 1 OOM, 0 timeout
Resetting 1 job(s) for retry...
  Reset 1 job(s)
Reinitializing workflow...
Regenerating Slurm schedulers...
  Submitted Slurm allocation with 1 job

Recovery complete for workflow 42
  - 1 job(s) had memory increased
Reset 1 job(s). Slurm schedulers regenerated and submitted.
```

## The `torc watch --recover` Command

The `torc watch` command can automatically recover from common failures:

```bash
torc watch 42 --recover
```

This will:

1. Poll the workflow until completion
2. On failure, diagnose the cause (OOM, timeout, etc.)
3. Adjust resource requirements based on heuristics
4. Reset failed jobs and submit new Slurm allocations
5. Resume monitoring
6. Repeat until success (or max retries exceeded, if `--max-retries` is set)

### Options

```bash
torc watch <workflow_id> \
  -r \                          # Enable automatic recovery (--recover)
  -m 5 \                        # Optional: limit recovery attempts (--max-retries)
  --memory-multiplier 1.5 \     # Memory increase factor for OOM
  --runtime-multiplier 1.5 \    # Runtime increase factor for timeout
  --retry-unknown \             # Also retry jobs with unknown failures
  --recovery-hook "bash fix.sh" \  # Custom recovery script
  -p 60 \                       # Seconds between status checks (--poll-interval)
  -o torc_output \              # Directory for job output files (--output-dir)
  -s \                          # Display job counts during polling (--show-job-counts)
  --auto-schedule \             # Automatically schedule nodes for stranded jobs
  --auto-schedule-threshold 5 \ # Min retry jobs before scheduling (default: 5)
  --auto-schedule-cooldown 1800 \      # Seconds between auto-schedule attempts (default: 1800)
  --auto-schedule-stranded-timeout 7200 \ # Schedule stranded jobs after this time (default: 7200)
  --partition standard \        # Fixed Slurm partition (bypass auto-detection)
  --walltime 04:00:00           # Fixed walltime (bypass auto-calculation)
```

### Custom Recovery Hooks

For failures that Torc can't handle automatically (not OOM or timeout), you can provide a custom
recovery script using `--recovery-hook`. This is useful for domain-specific recovery logic, such as
adjusting Apache Spark cluster sizes or fixing configuration issues.

```bash
torc watch 42 --recover --recovery-hook "bash fix-spark-cluster.sh"
```

The hook receives the workflow ID in two ways:

- **As an argument**: `bash fix-spark-cluster.sh 42`
- **As an environment variable**: `TORC_WORKFLOW_ID=42`

Your script can use torc CLI commands to query and modify the workflow:

```bash
#!/bin/bash
# fix-spark-cluster.sh - Example recovery hook for Spark jobs

WORKFLOW_ID="${1:-$TORC_WORKFLOW_ID}"  # first argument, falling back to the environment variable

# Find failed jobs
FAILED_JOBS=$(torc jobs list "$WORKFLOW_ID" --status failed -f json | jq -r '.[].id')

for JOB_ID in $FAILED_JOBS; do
    # Get current resource requirements
    JOB_INFO=$(torc jobs get $JOB_ID -f json)
    RR_ID=$(echo "$JOB_INFO" | jq -r '.resource_requirements_id')

    # Check if this is a Spark job that needs more nodes
    # (your logic here - parse logs, check error messages, etc.)

    # Update resource requirements
    torc resource-requirements update $RR_ID --num-nodes 16

    echo "Updated job $JOB_ID to use 16 nodes"
done
```

When a recovery hook is provided:

1. Jobs with unknown failures are automatically included for retry
2. The hook runs **before** `reset-status` is called
3. If the hook fails (non-zero exit), auto-recovery stops with an error
4. After the hook succeeds, failed jobs are reset and retried

## Auto-Scheduling for Failure Handlers

When using [failure handlers](./failure-handlers.md) that create retry jobs, the originally planned
compute capacity may not be sufficient. The `--auto-schedule` option enables automatic scheduling of
additional Slurm nodes when:

1. **No schedulers available**: If there are ready jobs but no active or pending Slurm allocations,
   new schedulers are immediately regenerated and submitted.

2. **Retry jobs accumulating**: If there are active schedulers but retry jobs (jobs with
   `attempt_id > 1`) are accumulating beyond the threshold, additional schedulers are submitted
   after the cooldown period.

This is particularly useful for workflows with failure handlers that retry failed jobs, ensuring
those retries get scheduled without manual intervention.

### Example: Failure Handler with Auto-Scheduling

```bash
# Submit a workflow with failure handlers
torc submit-slurm --account my_project workflow.yaml

# Watch with auto-scheduling enabled (uses defaults)
torc watch $WORKFLOW_ID --auto-schedule
```

With default settings:

- If all Slurm allocations complete but retry jobs remain, new allocations are submitted
- If 5+ retry jobs accumulate while allocations are running, additional capacity is scheduled
- After scheduling, the system waits 30 minutes before considering another auto-schedule
- If fewer than 5 retry jobs are waiting for 2 hours, they're scheduled anyway (stranded timeout)
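
The default behavior described above can be read as a three-rule decision. The sketch below is a hypothetical model of that decision, not Torc's code: the function name, its arguments, and the way the rules are ordered and combined (for instance, that the stranded timeout ignores the cooldown) are assumptions.

```bash
# Hypothetical decision function mirroring the rules above; not Torc's actual code.
THRESHOLD=5            # --auto-schedule-threshold
COOLDOWN=1800          # --auto-schedule-cooldown (seconds)
STRANDED_TIMEOUT=7200  # --auto-schedule-stranded-timeout (seconds)

# Arguments: ready jobs, active or pending allocations, waiting retry jobs,
#            seconds since the last auto-schedule, seconds the retry jobs have waited.
should_auto_schedule() {
    ready=$1; allocations=$2; retry_jobs=$3; since_last=$4; waited=$5

    # Rule 1: ready jobs but nothing to run them on, so schedule immediately.
    if [ "$ready" -gt 0 ] && [ "$allocations" -eq 0 ]; then
        echo "schedule: no allocations"; return 0
    fi
    # Rule 2: enough retry jobs have piled up and the cooldown has passed.
    if [ "$retry_jobs" -ge "$THRESHOLD" ] && [ "$since_last" -ge "$COOLDOWN" ]; then
        echo "schedule: retry threshold"; return 0
    fi
    # Rule 3: a few retry jobs have been stranded longer than the timeout.
    if [ "$retry_jobs" -gt 0 ] && [ "$waited" -ge "$STRANDED_TIMEOUT" ]; then
        echo "schedule: stranded timeout"; return 0
    fi
    echo "wait"; return 1
}
```

For example, `should_auto_schedule 6 2 6 600 100` prints `wait` because only 10 minutes of the 30-minute cooldown have elapsed, while the same call with `2000` seconds since the last attempt prints `schedule: retry threshold`.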

## Choosing the Right Command

| Use Case                          | Command                  |
| --------------------------------- | ------------------------ |
| One-shot recovery after failure   | `torc recover`           |
| Continuous monitoring             | `torc watch -r`          |
| Preview what recovery would do    | `torc recover --dry-run` |
| Production long-running workflows | `torc watch -r`          |
| Manual investigation, then retry  | `torc recover`           |

## Complete Workflow Example

### 1. Submit a Workflow

```bash
torc submit-slurm --account myproject workflow.yaml
```

Output:

```
Created workflow 42 with 100 jobs
Submitted to Slurm with 10 allocations
```

### 2. Start Watching with Auto-Recovery

```bash
torc watch 42 --recover --show-job-counts
```

> **Note:** The `--show-job-counts` flag is optional. Without it, the command polls silently until
> completion, which reduces server load for large workflows.

Output:

```
Watching workflow 42 (poll interval: 60s, recover enabled, unlimited retries, job counts enabled)
  completed=0, running=10, pending=0, failed=0, blocked=90
  completed=25, running=10, pending=0, failed=0, blocked=65
  ...
  completed=95, running=0, pending=0, failed=5, blocked=0
Workflow 42 is complete

Workflow completed with failures:
  - Failed: 5
  - Canceled: 0
  - Terminated: 0
  - Completed: 95

Attempting automatic recovery (attempt 1)

Diagnosing failures...
Applying recovery heuristics...
  Job 107 (train_model_7): OOM detected, increasing memory 8g -> 12g
  Job 112 (train_model_12): OOM detected, increasing memory 8g -> 12g
  Job 123 (train_model_23): OOM detected, increasing memory 8g -> 12g
  Job 131 (train_model_31): OOM detected, increasing memory 8g -> 12g
  Job 145 (train_model_45): OOM detected, increasing memory 8g -> 12g
  Applied fixes: 5 OOM, 0 timeout

Resetting failed jobs...
Regenerating Slurm schedulers and submitting...

Recovery initiated. Resuming monitoring...

Watching workflow 42 (poll interval: 60s, recover enabled, unlimited retries, job counts enabled)
  completed=95, running=5, pending=0, failed=0, blocked=0
  ...
Workflow 42 is complete

Workflow completed successfully (100 jobs)
```

### 3. If No Recoverable Jobs Found

If all failures are from unknown causes (not OOM or timeout):

```
Applying recovery heuristics...
  2 job(s) with unknown failure cause (skipped, use --retry-unknown to include)

No recoverable jobs found. 2 job(s) failed with unknown causes.
Use --retry-unknown to retry jobs with unknown failure causes.
Or use the Torc MCP server with your AI assistant to investigate.
```

This prevents wasting allocation time on jobs that likely have script or data bugs.

### 4. If Max Retries Exceeded

If `--max-retries` is set and failures persist after that many attempts:

```
Max retries (3) exceeded. Manual intervention required.
Use the Torc MCP server with your AI assistant to investigate.
```

At this point, you can use the MCP server with an AI assistant to investigate the root cause.

## Log Files

All `torc watch` output is logged to both the terminal and a log file:

```
<output-dir>/watch_<hostname>_<workflow_id>.log
```

For example: `torc_output/watch_myhost_42.log`

This ensures you have a complete record of the watch session even if your terminal disconnects.
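
To follow the log from a second terminal, you can build the path from the pattern above. The helper name `watch_log_path` is hypothetical, and the sketch assumes the `<hostname>` part matches the output of the `hostname` command on the machine running `torc watch`; check the actual file name in your output directory if it does not.

```bash
# Hypothetical helper: build the watch log path from the documented pattern.
# The third argument overrides the host name; by default `hostname` is used.
watch_log_path() {
    output_dir=$1; workflow_id=$2; host=${3:-$(hostname)}
    echo "${output_dir}/watch_${host}_${workflow_id}.log"
}

# Follow the log for workflow 42:
# tail -f "$(watch_log_path torc_output 42)"
```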

## When to Use Manual Recovery

Automatic recovery works well for resource-related failures, but some situations require manual
intervention:

### Use Manual Recovery When:

1. **Jobs keep failing after max retries**
   - The heuristics aren't solving the problem
   - Need to investigate root cause

2. **Unknown failure modes**
   - Exit codes that don't indicate OOM/timeout
   - Application-specific errors

3. **Code bugs**
   - Jobs fail consistently with same error
   - No resource issue detected

4. **Cost optimization**
   - Want to analyze actual usage before increasing
   - Need to decide whether job is worth more resources

### MCP Server for Manual Recovery

The Torc MCP server provides tools for AI-assisted investigation:

| Tool                         | Purpose                          |
| ---------------------------- | -------------------------------- |
| `get_workflow_status`        | Get overall workflow status      |
| `list_failed_jobs`           | List failed jobs with error info |
| `get_job_logs`               | Read stdout/stderr logs          |
| `check_resource_utilization` | Detailed resource analysis       |
| `update_job_resources`       | Manually adjust resources        |
| `resubmit_workflow`          | Regenerate Slurm schedulers      |

## Best Practices

### 1. Start with Conservative Resources

Start with modest resource requests and let auto-recovery increase them where needed:

- Jobs that succeed keep their original allocation
- Only failing jobs get increased resources
- Avoids wasting HPC resources on over-provisioned jobs

### 2. Set Max Retries When Appropriate

By default, `torc watch` retries indefinitely until the workflow succeeds. Use `--max-retries` to
limit recovery attempts if needed:

```bash
--max-retries 5  # Limit to 5 recovery attempts
```

This can prevent wasting allocation time on jobs that will never succeed.

### 3. Use Appropriate Multipliers

For memory-bound jobs:

```bash
--memory-multiplier 2.0  # Double on OOM
```

For jobs whose runtime is hard to predict, where you want larger increases:

```bash
--runtime-multiplier 2.0  # Double runtime on timeout
```

### 4. Run in tmux or screen

**Always run `torc watch` inside tmux or screen** for long-running workflows. HPC workflows can run
for hours or days, and you don't want to lose your monitoring session if:

- Your SSH connection drops
- Your laptop goes to sleep
- You need to disconnect and reconnect later

Using [tmux](https://github.com/tmux/tmux/wiki) (recommended):

```bash
# Start a new tmux session
tmux new -s torc-watch

# Run the watch command
torc watch 42 --recover --poll-interval 300 --show-job-counts

# Detach from session: press Ctrl+b, then d
# Reattach later: tmux attach -t torc-watch
```

Using screen:

```bash
screen -S torc-watch
torc watch 42 --recover --poll-interval 300 --show-job-counts
# Detach: Ctrl+a, then d
# Reattach: screen -r torc-watch
```

### 5. Check Resource Utilization Afterward

After completion, review actual usage:

```bash
torc reports check-resource-utilization 42
```

This helps tune future job specifications.

## Troubleshooting

### Jobs Stuck in "Running" Status

If jobs appear stuck in "running" status after a Slurm allocation ended:

1. This usually means the allocation was terminated unexpectedly (timeout, node failure, etc.)
2. The `torc recover` command automatically handles this as its first step
3. To manually clean up without triggering recovery, use:
   ```bash
   torc workflows sync-status <workflow_id>
   ```
4. To preview what would be cleaned up:
   ```bash
   torc workflows sync-status <workflow_id> --dry-run
   ```

See [Debugging Slurm Workflows](../hpc/debugging-slurm.md#orphaned-jobs-and-status-synchronization)
for more details.

### Jobs Keep Failing After Recovery

If jobs fail repeatedly with the same error:

1. Check if the error is resource-related (OOM/timeout)
2. Review job logs: `torc jobs logs <job_id>`
3. Check if there's a code bug
4. Use the MCP server with an AI assistant to investigate

### No Slurm Schedulers Generated

If `torc slurm regenerate` fails:

1. Ensure workflow was created with `--account` option
2. Check HPC profile is detected: `torc hpc detect`
3. Specify profile explicitly: `--profile kestrel`

### Resource Limits Too High

If jobs are requesting more resources than partitions allow:

1. Check partition limits: `torc hpc partitions <profile>`
2. Use smaller multipliers
3. Consider splitting jobs into smaller pieces
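
To judge how quickly repeated OOM recoveries approach a partition limit, it can help to work out the growth ahead of time. The sketch below assumes the multiplier compounds on every recovery attempt and that results round up to whole gigabytes; both are assumptions for illustration, and `retries_within_limit` is a hypothetical helper.

```bash
# Hypothetical helper: count how many OOM recoveries fit under a memory limit.
# Arguments: starting memory (GB), partition limit (GB), multiplier (> 1).
retries_within_limit() {
    awk -v mem="$1" -v limit="$2" -v factor="$3" 'BEGIN {
        if (factor <= 1) { print "multiplier must be greater than 1"; exit 1 }
        attempts = 0
        while (1) {
            next_mem = mem * factor
            rounded = int(next_mem); if (rounded < next_mem) rounded++
            if (rounded > limit) break      # the next increase would not fit
            mem = rounded; attempts++
        }
        print attempts
    }'
}
```

Starting from 8 GB with a hypothetical 240 GB node limit, `retries_within_limit 8 240 1.5` prints `8` and `retries_within_limit 8 240 2.0` prints `4`, so a larger multiplier reaches the ceiling in half as many attempts.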

## Comparison: Automatic vs Manual Recovery

| Feature                | Automatic            | Manual/AI-Assisted      |
| ---------------------- | -------------------- | ----------------------- |
| Human involvement      | None                 | Interactive             |
| Speed                  | Fast                 | Depends on human        |
| Handles OOM/timeout    | Yes                  | Yes                     |
| Handles unknown errors | Retry only           | Full investigation      |
| Cost optimization      | Basic                | Can be sophisticated    |
| Use case               | Production workflows | Debugging, optimization |

## Implementation Details

### The Watch Command Flow

1. Poll `is_workflow_complete` API
2. Print status updates
3. On completion, check for failures
4. If failures and recover enabled:
   - Run `torc reports check-resource-utilization --include-failed`
   - Parse results for `likely_oom` and `likely_timeout` flags
   - Update resource requirements via API
   - Run `torc workflows reset-status --failed-only --reinitialize`
   - Run `torc slurm regenerate --submit`
   - Increment retry counter
   - Resume polling
5. Exit 0 on success, exit 1 on max retries exceeded (if `--max-retries` is set)
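
Because the exit code distinguishes success from exhausted retries, a calling script can branch on it. The wrapper below is a minimal sketch: `watch_and_report` is a hypothetical name, and it treats any non-zero exit as needing attention rather than assuming 1 is the only failure code.

```bash
# Hypothetical wrapper: run the watch and report the outcome from its exit code.
watch_and_report() {
    workflow_id=$1; max_retries=$2

    status=0
    torc watch "$workflow_id" --recover --max-retries "$max_retries" || status=$?

    if [ "$status" -eq 0 ]; then
        echo "workflow $workflow_id succeeded"
    else
        echo "workflow $workflow_id needs manual attention (exit $status)"
    fi
    return "$status"
}
```

A batch script could call `watch_and_report 42 5` and send a notification only when the function returns non-zero.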

### The Regenerate Command Flow

1. Query jobs with status uninitialized/ready/blocked
2. Group by resource requirements
3. For each group:
   - Find best partition using HPC profile
   - Calculate jobs per node
   - Determine number of allocations needed
   - Create scheduler config
4. Update jobs with new scheduler reference
5. Submit allocations via sbatch
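
The "jobs per node" and "allocations needed" steps come down to integer arithmetic. The sketch below shows one plausible version for a single group of identical jobs; it assumes one node per allocation and that all jobs should be able to run at once, and it ignores runtime and walltime. Torc's actual calculation may differ, and `allocations_needed` is a hypothetical helper.

```bash
# Hypothetical sizing arithmetic for one group of identical jobs.
# Arguments: job count, CPUs per job, memory per job (GB),
#            CPUs per node, memory per node (GB).
allocations_needed() {
    job_count=$1; job_cpus=$2; job_mem=$3; node_cpus=$4; node_mem=$5

    # Jobs per node is limited by whichever resource runs out first.
    by_cpu=$((node_cpus / job_cpus))
    by_mem=$((node_mem / job_mem))
    per_node=$by_cpu
    if [ "$by_mem" -lt "$per_node" ]; then
        per_node=$by_mem
    fi

    if [ "$per_node" -lt 1 ]; then
        echo "job does not fit on one node" >&2
        return 1
    fi

    # Ceiling division: enough single-node allocations to hold every job.
    echo $(( (job_count + per_node - 1) / per_node ))
}
```

For 100 jobs needing 4 CPUs and 12 GB each on hypothetical 104-core, 240 GB nodes, memory is the limiting resource (20 jobs per node), so `allocations_needed 100 4 12 104 240` prints `5`.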

## See Also

- [Configurable Failure Handlers](./failure-handlers.md) - Per-job retry with exit-code-specific
  recovery
- [Resource Monitoring](../../core/monitoring/resource-monitoring.md) - Understanding resource
  tracking