# Workflow Recovery
Torc provides mechanisms for recovering workflows when Slurm allocations are preempted or fail
before completing all jobs. The `torc slurm regenerate` command creates new schedulers and
allocations for pending jobs.
## The Recovery Problem
When running workflows on Slurm, allocations can fail or be preempted before all jobs complete. This
leaves workflows in a partial state with:
1. **Ready/uninitialized jobs** - Jobs that were waiting to run but never got scheduled
2. **Blocked jobs** - Jobs whose dependencies haven't completed yet
3. **Orphaned running jobs** - Jobs still marked as "running" in the database even though their
Slurm allocation has terminated
Simply creating new Slurm schedulers and submitting allocations isn't enough because:
1. **Orphaned jobs block recovery**: Jobs stuck in "running" status prevent the workflow from being
considered complete, blocking recovery precondition checks
2. **Duplicate allocations**: If the workflow had `on_workflow_start` actions to schedule nodes,
those actions would fire again when the workflow is reinitialized, creating duplicate allocations
3. **Missing allocations for blocked jobs**: Blocked jobs will eventually become ready, but there's
no mechanism to schedule new allocations for them
## Orphan Detection
Before recovery can proceed, orphaned jobs must be detected and their status corrected. This is
handled by the **orphan detection** module (`src/client/commands/orphan_detection.rs`).
### How It Works
The orphan detection logic checks for three types of orphaned resources:
1. **Active allocations with terminated Slurm jobs**: ScheduledComputeNodes marked as "active" in
the database, but whose Slurm job is no longer running (verified via `squeue`)
2. **Pending allocations that disappeared**: ScheduledComputeNodes marked as "pending" whose Slurm
job no longer exists (cancelled or failed before starting)
3. **Running jobs with no active compute nodes**: Jobs marked as "running" but with no active
compute nodes to process them (fallback for non-Slurm cases)
```mermaid
flowchart TD
A[Start Orphan Detection] --> B[List active ScheduledComputeNodes]
B --> C{For each Slurm allocation}
C --> D[Check squeue for job status]
D --> E{Job still running?}
E -->|Yes| C
E -->|No| F[Find jobs on this allocation]
F --> G[Mark jobs as failed]
G --> H[Update ScheduledComputeNode to complete]
H --> C
C --> I[List pending ScheduledComputeNodes]
I --> J{For each pending allocation}
J --> K[Check squeue for job status]
K --> L{Job exists?}
L -->|Yes| J
L -->|No| M[Update ScheduledComputeNode to complete]
M --> J
J --> N[Check for running jobs with no active nodes]
N --> O[Mark orphaned jobs as failed]
O --> P[Done]
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#6c757d,color:#fff
style D fill:#4a9eff,color:#fff
style E fill:#6c757d,color:#fff
style F fill:#4a9eff,color:#fff
style G fill:#dc3545,color:#fff
style H fill:#4a9eff,color:#fff
style I fill:#4a9eff,color:#fff
style J fill:#6c757d,color:#fff
style K fill:#4a9eff,color:#fff
style L fill:#6c757d,color:#fff
style M fill:#4a9eff,color:#fff
style N fill:#4a9eff,color:#fff
style O fill:#dc3545,color:#fff
style P fill:#28a745,color:#fff
```
### Integration Points
Orphan detection is integrated into two commands:
1. **`torc recover`**: Runs orphan detection automatically as the first step before checking
preconditions. This ensures that orphaned jobs don't block recovery.
2. **`torc workflows sync-status`**: Standalone command to run orphan detection without triggering a
full recovery. Useful for debugging or when you want to clean up orphaned jobs without submitting
new allocations.
### The `torc watch` Command
The `torc watch` command also performs orphan detection during its polling loop. When it detects
that no valid Slurm allocations exist (via a quick `squeue` check), it runs the full orphan
detection logic to clean up any orphaned jobs before checking if the workflow can make progress.
## Recovery Actions
The recovery system uses **ephemeral recovery actions** to solve these problems.
### How It Works
When `torc slurm regenerate` runs:
```mermaid
flowchart TD
A[torc slurm regenerate] --> B[Fetch pending jobs]
B --> C{Has pending jobs?}
C -->|No| D[Exit - nothing to do]
C -->|Yes| E[Build WorkflowGraph from pending jobs]
E --> F[Mark existing schedule_nodes actions as executed]
F --> G[Group jobs using scheduler_groups]
G --> H[Create schedulers for each group]
H --> I[Update jobs with scheduler assignments]
I --> J[Create on_jobs_ready recovery actions for deferred groups]
J --> K{Submit allocations?}
K -->|Yes| L[Submit Slurm allocations]
K -->|No| M[Done]
L --> M
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#6c757d,color:#fff
style D fill:#6c757d,color:#fff
style E fill:#4a9eff,color:#fff
style F fill:#4a9eff,color:#fff
style G fill:#4a9eff,color:#fff
style H fill:#4a9eff,color:#fff
style I fill:#4a9eff,color:#fff
style J fill:#ffc107,color:#000
style K fill:#6c757d,color:#fff
style L fill:#ffc107,color:#000
style M fill:#28a745,color:#fff
```
### Step 1: Mark Existing Actions as Executed
All existing `schedule_nodes` actions are marked as executed using the `claim_action` API. This
prevents them from firing again and creating duplicate allocations:
```mermaid
sequenceDiagram
participant R as regenerate
participant S as Server
participant A as workflow_action table
R->>S: get_workflow_actions(workflow_id)
S-->>R: [action1, action2, ...]
loop For each schedule_nodes action
R->>S: claim_action(action_id)
S->>A: UPDATE executed=1, executed_at=NOW()
end
```
### Step 2: Group Jobs Using WorkflowGraph
The system builds a `WorkflowGraph` from pending jobs and uses `scheduler_groups()` to group them by
`(resource_requirements, has_dependencies)`. This aligns with the behavior of `torc slurm generate`:
- **Jobs without dependencies**: Can be scheduled immediately with `on_workflow_start`
- **Jobs with dependencies** (deferred): Need `on_jobs_ready` recovery actions to schedule when they
become ready
```mermaid
flowchart TD
subgraph pending["Pending Jobs"]
A[Job A: no deps, rr=default]
B[Job B: no deps, rr=default]
C[Job C: depends on A, rr=default]
D[Job D: no deps, rr=gpu]
end
subgraph groups["Scheduler Groups"]
G1[Group 1: default, no deps<br/>Jobs: A, B]
G2[Group 2: default, has deps<br/>Jobs: C]
G3[Group 3: gpu, no deps<br/>Jobs: D]
end
A --> G1
B --> G1
C --> G2
D --> G3
style A fill:#4a9eff,color:#fff
style B fill:#4a9eff,color:#fff
style C fill:#ffc107,color:#000
style D fill:#17a2b8,color:#fff
style G1 fill:#28a745,color:#fff
style G2 fill:#28a745,color:#fff
style G3 fill:#28a745,color:#fff
```
### Step 3: Create Recovery Actions for Deferred Groups
For groups with `has_dependencies = true`, the system creates `on_jobs_ready` recovery actions.
These actions:
- Have `is_recovery = true` to mark them as ephemeral
- Use a `_deferred` suffix in the scheduler name
- Trigger when the blocked jobs become ready
- Schedule additional Slurm allocations for those jobs
```mermaid
flowchart LR
subgraph workflow["Original Workflow"]
A[Job A: blocked] --> C[Job C: blocked]
B[Job B: blocked] --> C
end
subgraph actions["Recovery Actions"]
RA["on_jobs_ready: schedule_nodes<br/>job_ids: (A, B)<br/>is_recovery: true"]
RC["on_jobs_ready: schedule_nodes<br/>job_ids: (C)<br/>is_recovery: true"]
end
style A fill:#6c757d,color:#fff
style B fill:#6c757d,color:#fff
style C fill:#6c757d,color:#fff
style RA fill:#ffc107,color:#000
style RC fill:#ffc107,color:#000
```
## Recovery Action Lifecycle
Recovery actions are ephemeral - they exist only during the recovery period:
```mermaid
stateDiagram-v2
[*] --> Created: regenerate creates action
Created --> Executed: Jobs become ready, action triggers
Executed --> Deleted: Workflow reinitialized
Created --> Deleted: Workflow reinitialized
classDef created fill:#ffc107,color:#000
classDef executed fill:#28a745,color:#fff
classDef deleted fill:#6c757d,color:#fff
class Created created
class Executed executed
class Deleted deleted
```
When a workflow is reinitialized (e.g., after resetting jobs), all recovery actions are deleted and
original actions are reset to their initial state. This ensures a clean slate for the next run.
## Database Schema
Recovery actions are tracked using the `is_recovery` column in the `workflow_action` table:
| `is_recovery` | INTEGER | 0 = normal action, 1 = recovery action |
### Behavior Differences
| On `reset_actions_for_reinitialize` | Reset `executed` to 0 | Deleted entirely |
| Created by | Workflow spec | `torc slurm regenerate` |
| Purpose | Configured behavior | Temporary recovery |
## Usage
```bash
# Regenerate schedulers for pending jobs
torc slurm regenerate <workflow_id> --account <account>
# With automatic submission
torc slurm regenerate <workflow_id> --account <account> --submit
# Using a specific HPC profile
torc slurm regenerate <workflow_id> --account <account> --profile kestrel
```
## Implementation Details
The recovery logic is implemented in:
- `src/client/commands/orphan_detection.rs`: Shared orphan detection logic used by `recover`,
`watch`, and `workflows sync-status`
- `src/client/commands/recover.rs`: Main recovery command implementation
- `src/client/commands/slurm.rs`: `handle_regenerate` function for Slurm scheduler regeneration
- `src/client/workflow_graph.rs`: `WorkflowGraph::from_jobs()` and `scheduler_groups()` methods
- `src/server/api/workflow_actions.rs`: `reset_actions_for_reinitialize` function
- `migrations/20251225000000_add_is_recovery_to_workflow_action.up.sql`: Schema migration
Key implementation notes:
1. **WorkflowGraph construction**: A `WorkflowGraph` is built from pending jobs using `from_jobs()`,
which reconstructs the dependency structure from `depends_on_job_ids`
2. **Scheduler grouping**: Jobs are grouped using `scheduler_groups()` by
`(resource_requirements, has_dependencies)`, matching `slurm generate` behavior
3. **Deferred schedulers**: Groups with dependencies get a `_deferred` suffix in the scheduler name
4. **Allocation calculation**: Number of allocations is based on job count and resources per node
5. **Recovery actions**: Only deferred groups (jobs with dependencies) get `on_jobs_ready` recovery
actions