
# Tutorial: Graceful Job Termination on HPC

This tutorial teaches you how to configure Torc workflows so that long-running jobs receive an early
warning signal before Slurm kills them, giving them time to save progress so that an automatic retry
can resume from the last checkpoint.

## Learning Objectives

By the end of this tutorial, you will:

- Understand how `srun_termination_signal` delivers early SIGTERM to running jobs
- Write a Python job that catches SIGTERM and shuts down gracefully
- Use the shutdown-flag pattern to stop a long-running loop cleanly
- Use a non-zero exit code with a failure handler so Torc automatically retries the job
- Configure a complete Torc workflow with early termination support

## Prerequisites

- Torc server running
- Access to a Slurm cluster
- Basic familiarity with submitting Torc workflows (see
  [Quick Start (HPC/Slurm)](../../getting-started/quick-start-hpc.md))

## Background: Why Graceful Termination Matters

On HPC systems, jobs have a fixed wall-time. When time runs out, Slurm kills the process immediately
with SIGKILL. Any unsaved work—training progress, partial results, intermediate state—is lost.

Torc's `srun_termination_signal` feature tells Slurm to send a catchable signal (SIGTERM) **before**
the hard kill. Your job can trap that signal, finish the current iteration, save a checkpoint, and
exit with a dedicated exit code that tells Torc to retry automatically.

### Timeline of Events

```mermaid
graph LR
    A["Job starts"] -->|normal execution| B["SIGTERM"]
    B -->|"120 seconds"| C["Step timeout"]

    style A fill:#4a9eff,color:#fff
    style B fill:#e8a735,color:#fff
    style C fill:#d9534f,color:#fff
```

With `srun_termination_signal: "TERM@120"`, your job gets 120 seconds of warning before the srun
step's time limit expires.
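
To make the timeline concrete, the sketch below computes when the warning is expected for a given step time limit. `parse_signal_spec` and `warning_offset` are hypothetical helper names used only for illustration; they are not part of Torc or Slurm, and the parser handles only the `NAME@SECONDS` form used in this tutorial.

```python
def parse_signal_spec(spec):
    """Split a spec such as "TERM@120" into (signal_name, lead_seconds)."""
    name, _, lead = spec.partition("@")
    if not name or not lead.isdigit():
        raise ValueError(f"expected NAME@SECONDS, got {spec!r}")
    return name, int(lead)


def warning_offset(time_limit_s, spec):
    """Seconds after step start at which the warning signal is expected."""
    _, lead = parse_signal_spec(spec)
    # A lead time longer than the limit means the warning is due at start.
    return max(time_limit_s - lead, 0)


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    print(warning_offset(two_hours, "TERM@120"))  # 7080
```

For a two-hour step, the signal is expected about 7080 seconds in. The Slurm documentation for `--signal` notes that delivery can happen up to 60 seconds earlier than requested, so treat the result as approximate.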

## Step 1: Write the Python Job

Save this as `simulate.py`:

```python
#!/usr/bin/env python3
"""Long-running simulation that handles SIGTERM for graceful shutdown."""

import json
import os
import signal
import sys
import time

# Exit code that signals "checkpointed, please retry".
# Must match the exit_codes in the workflow's failure_handler.
EXIT_CHECKPOINT = 42

# --- Shutdown flag -----------------------------------------------------------
# The SIGTERM handler sets this flag. The main loop checks it on every
# iteration and breaks out when it becomes True.
shutdown_requested = False


def handle_sigterm(signum, frame):
    """Set the shutdown flag when SIGTERM is received."""
    global shutdown_requested
    print(f"SIGTERM received (signal {signum}). Will stop after current iteration.")
    shutdown_requested = True


# Register the handler BEFORE doing any work.
signal.signal(signal.SIGTERM, handle_sigterm)

# --- Checkpoint helpers ------------------------------------------------------
CHECKPOINT_PATH = os.environ.get("CHECKPOINT_PATH", "checkpoint.json")


def load_checkpoint():
    """Load the last saved iteration, or start from 0."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            data = json.load(f)
        print(f"Resumed from checkpoint at iteration {data['iteration']}")
        return data["iteration"], data["accumulator"]
    return 0, 0.0


def save_checkpoint(iteration, accumulator):
    """Atomically save progress to disk."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"iteration": iteration, "accumulator": accumulator}, f)
    os.replace(tmp, CHECKPOINT_PATH)  # atomic on POSIX
    print(f"Checkpoint saved at iteration {iteration}")


# --- Main loop ---------------------------------------------------------------
def main():
    iteration, accumulator = load_checkpoint()
    total_iterations = 100_000

    print(f"Starting simulation from iteration {iteration}")
    while iteration < total_iterations:
        # Check the shutdown flag at the top of every iteration.
        if shutdown_requested:
            print("Shutdown flag is set. Saving checkpoint and exiting.")
            save_checkpoint(iteration, accumulator)
            sys.exit(EXIT_CHECKPOINT)

        # Simulate one unit of work.
        accumulator += iteration * 0.001
        iteration += 1

        # Periodic progress and checkpoint.
        if iteration % 1000 == 0:
            print(f"Iteration {iteration}/{total_iterations}  accumulator={accumulator:.4f}")
            save_checkpoint(iteration, accumulator)

        time.sleep(0.01)  # simulate compute time

    print(f"Simulation complete. Final accumulator={accumulator:.4f}")
    save_checkpoint(iteration, accumulator)


if __name__ == "__main__":
    main()
```

### Key Design Points

1. **Non-zero exit code.** When SIGTERM arrives before the simulation finishes, the script exits
   with code 42 (`EXIT_CHECKPOINT`). This tells Torc the job did not complete successfully, so the
   failure handler can automatically schedule a retry. Exit code 0 is reserved for genuine
   completion.

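2. **Global shutdown flag.** The signal handler only prints a one-line notice and sets
   `shutdown_requested = True`. It does no checkpointing and no cleanup, because signal handlers
   should be minimal. If the job prints heavily, consider dropping the `print()` from the handler:
   Python can raise `RuntimeError` when a handler writes to a stream that the interrupted code was
   already writing to.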

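3. **Loop checks the flag.** Every iteration starts with `if shutdown_requested:`. This guarantees
   the current iteration finishes before the job starts saving state. It also means one iteration
   plus the checkpoint write must fit comfortably inside the 120-second warning window.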

4. **Atomic checkpoint.** Writing to a `.tmp` file and calling `os.replace()` prevents a corrupted
   checkpoint if the process is killed during the write.

5. **Handler registered early.** `signal.signal(signal.SIGTERM, handle_sigterm)` runs before the
   main loop so the handler is active from the start.
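
The shutdown-flag pattern can be tried locally, without Slurm, by having a toy loop send SIGTERM to itself. The sketch below is a minimal illustration, assuming Python 3.8+ (for `signal.raise_signal`) and that it runs in the main thread; `run_loop` and its parameters are hypothetical names and are not part of `simulate.py`.

```python
import signal

shutdown_requested = False


def handle_sigterm(signum, frame):
    """Only record the request; the loop decides when to stop."""
    global shutdown_requested
    shutdown_requested = True


def run_loop(total_iterations, signal_at):
    """Run a toy loop that sends SIGTERM to itself at iteration `signal_at`.

    Returns the number of iterations completed before the loop stopped.
    """
    global shutdown_requested
    shutdown_requested = False
    previous = signal.signal(signal.SIGTERM, handle_sigterm)
    completed = 0
    try:
        while completed < total_iterations:
            if shutdown_requested:
                break  # simulate.py would checkpoint and exit 42 here
            if completed == signal_at:
                signal.raise_signal(signal.SIGTERM)  # stand-in for Slurm
            completed += 1  # the "work" for this iteration
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)  # restore old handler
    return completed


if __name__ == "__main__":
    print(run_loop(total_iterations=1000, signal_at=3))
```

The iteration during which the signal arrives still runs to completion, and the loop stops at the next flag check instead of being interrupted mid-iteration.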

## Step 2: Create the Workflow Specification

Save as `graceful_termination.yaml`:

```yaml
name: graceful_termination_demo
description: Demonstrates early SIGTERM with automatic checkpoint-and-retry

execution_config:
  srun_termination_signal: "TERM@120"

failure_handlers:
  - name: checkpoint_retry
    rules:
      # Exit code 42 means "checkpointed, please retry"
      - exit_codes: [42]
        max_retries: 100

resource_requirements:
  - name: sim_resources
    num_cpus: 2
    num_nodes: 1
    memory: 4g
    runtime: PT2H

jobs:
  - name: simulate
    command: python3 simulate.py
    resource_requirements: sim_resources
    failure_handler: checkpoint_retry

slurm_schedulers:
  - name: scheduler
    account: my_project
    partition: standard
    nodes: 1
    walltime: "02:00:00"

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: scheduler
    scheduler_type: slurm
    num_allocations: 1
```

Three pieces work together here:

- **`execution_config.srun_termination_signal: "TERM@120"`** tells Slurm to send SIGTERM 120 seconds
  before the step time limit. Torc passes this to every `srun` invocation as
  `srun --signal=TERM@120`.

- **`failure_handlers`** with exit code 42 tells Torc to automatically retry the job whenever it
  exits with code 42. Each retry picks up from the last checkpoint, so the simulation makes progress
  across multiple Slurm allocations.

- **`max_retries`** controls how many checkpoint cycles the job is allowed. Set this high enough to
  cover the worst case: if each allocation provides ~2 hours of compute and the total job needs ~40
  hours, you need at least 20 retries. Setting it generously (e.g., 100) is safe — the job exits
  with code 0 once it finishes, so unused retries cost nothing.
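
The retry estimate above can be worked through as a short calculation. This sketch assumes, pessimistically, that the whole warning window is lost to checkpointing on every attempt, and it ignores queue wait and startup overhead; `retries_needed` is a hypothetical helper, not a Torc function.

```python
import math


def retries_needed(total_hours, allocation_hours, warning_seconds=120):
    """Estimate retries: attempts needed minus the first attempt."""
    usable_hours = allocation_hours - warning_seconds / 3600
    if usable_hours <= 0:
        raise ValueError("warning window consumes the whole allocation")
    attempts = math.ceil(total_hours / usable_hours)
    return max(attempts - 1, 0)


if __name__ == "__main__":
    # 40 hours of work in 2-hour allocations with a 120-second warning.
    print(retries_needed(40, 2))  # 20
```

With a 120-second warning, each 2-hour allocation yields slightly less than 2 hours of compute, so 40 hours of work takes 21 attempts, which is 20 retries.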

## Step 3: Submit and Run

Since the spec already includes `slurm_schedulers` and `actions`, submit directly:

```bash
torc submit graceful_termination.yaml
```

Or, if you prefer to auto-generate schedulers from job resource requirements:

```bash
torc slurm generate --account my_project -o graceful_termination.yaml graceful_termination.yaml
torc submit graceful_termination.yaml
```

## Step 4: Observe the Behavior

Monitor the workflow:

```bash
torc tui
```

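When the srun step nears its time limit, the job's stdout shows output like the following. With the
script's 0.01-second sleep, all 100,000 iterations take only about 17 minutes, so to see the signal
in practice, use a time limit shorter than that or raise `total_iterations`.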

```
Iteration 47000/100000  accumulator=1104476.5000
Checkpoint saved at iteration 47000
SIGTERM received (signal 15). Will stop after current iteration.
Shutdown flag is set. Saving checkpoint and exiting.
Checkpoint saved at iteration 47001
```

The job exits with code 42, so Torc marks it as **failed** and the failure handler automatically
schedules a retry. The next attempt loads the checkpoint and continues:

```
Resumed from checkpoint at iteration 47001
Starting simulation from iteration 47001
Iteration 48000/100000  accumulator=1151976.0000
...
```

This cycle repeats until the simulation finishes all iterations and exits with code 0, at which
point Torc marks the job as **completed**.

Note that each retry doesn't need to finish the entire remaining work — it only needs to make
**some** forward progress before the next checkpoint. This is the expected behavior for long-running
jobs whose total compute time exceeds a single Slurm allocation. The job spreads its work across as
many allocations as needed, and `max_retries` just needs to be high enough to cover the total number
of checkpoint cycles.
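
One way to confirm that retries are making forward progress is to read the checkpoint file between attempts. The sketch below relies on the `checkpoint.json` layout and the `total_iterations` value from `simulate.py`; `checkpoint_progress` is a hypothetical helper, not a Torc command.

```python
import json
import os


def checkpoint_progress(path="checkpoint.json", total_iterations=100_000):
    """Return the completed fraction (0.0 to 1.0) recorded in the checkpoint."""
    if not os.path.exists(path):
        return 0.0  # no checkpoint yet, so nothing has been saved
    with open(path) as f:
        data = json.load(f)
    return min(data["iteration"] / total_iterations, 1.0)


if __name__ == "__main__":
    print(f"{checkpoint_progress():.1%} complete")
```

If the reported fraction does not grow between retries, the job is probably being stopped before it reaches its first checkpoint, and the time limit or checkpoint interval needs adjusting.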

## How It Works Under the Hood

1. **`execution_config.srun_termination_signal: "TERM@120"`** is stored on the workflow record in
   the Torc database.

2. When the job runner launches a job inside a Slurm allocation, it builds an `srun` command that
   includes `--signal=TERM@120`.

3. Slurm's step manager sends SIGTERM to the job's process group 120 seconds before `--time`
   expires.

4. Python's signal handler sets `shutdown_requested = True`.

5. The main loop sees the flag, saves the checkpoint, and calls `sys.exit(42)`.

6. Torc sees exit code 42 (non-zero), marks the job as failed, and consults the failure handler.

7. The failure handler matches exit code 42 and schedules a retry (up to `max_retries` times).

8. On retry, the script calls `load_checkpoint()` and resumes from where it left off.

9. When all iterations finish, the script exits with code 0 and Torc marks it as completed.
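
Steps 6 through 9 amount to a small decision rule, sketched below as a toy model. This is an illustration of the behavior described in this tutorial, not Torc's actual implementation; `next_state` and the rule dictionaries (which mirror the YAML `rules` entries) are assumptions, as is the outcome once retries are exhausted.

```python
def next_state(exit_code, retries_used, rules):
    """Toy model: decide what happens after a job exits.

    `rules` is a list of dicts with "exit_codes" and "max_retries",
    mirroring the failure handler in the workflow spec.
    """
    if exit_code == 0:
        return "completed"
    for rule in rules:
        matches = exit_code in rule["exit_codes"]
        if matches and retries_used < rule["max_retries"]:
            return "retry"
    return "failed"  # no matching rule, or retries exhausted (assumed)


if __name__ == "__main__":
    rules = [{"exit_codes": [42], "max_retries": 100}]
    print(next_state(42, 0, rules))    # retry
    print(next_state(42, 100, rules))  # failed
    print(next_state(1, 0, rules))     # failed
    print(next_state(0, 5, rules))     # completed
```

The model also shows why exit code 42 should be used only for "checkpointed, please retry": any other non-zero code, such as a crash, matches no rule and is not retried by this handler.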

## What You Learned

In this tutorial, you learned:

- How to set `srun_termination_signal` in a workflow spec for early warning before timeout
- The shutdown-flag pattern: signal handler sets a flag, main loop checks it each iteration
- How to write atomic checkpoints that survive unexpected kills
- How to use a dedicated exit code with a failure handler for automatic checkpoint-and-retry

## Next Steps

- [Automatic Failure Recovery](./automatic-recovery.md) — Configure Torc to automatically retry or
  recover failed jobs