gflow 0.4.14

A lightweight, single-node job scheduler written in Rust.
Documentation
# Job Lifecycle

This guide explains the complete lifecycle of jobs in gflow, including state transitions, status checking, and recovery operations.

## Job States

gflow jobs can be in one of seven states:

| State | Short | Description |
|-------|-------|-------------|
| **Queued** | PD | Job is waiting to run (pending dependencies or resources) |
| **Hold** | H | Job is on hold by user request |
| **Running** | R | Job is currently executing |
| **Finished** | CD | Job completed successfully |
| **Failed** | F | Job terminated with an error |
| **Cancelled** | CA | Job was cancelled by user or system |
| **Timeout** | TO | Job exceeded its time limit |

### State Categories

**Active States** (job is not yet complete):
- Queued, Hold, Running

**Completed States** (job has finished):
- Finished, Failed, Cancelled, Timeout

## State Transition Diagram

The following diagram keeps only the core transitions. Completed states are terminal.

```mermaid
---
showToolbar: true
---
flowchart LR
    Submit([Submit]) --> Queued[Queued]
    Queued -->|ready| Running[Running]
    Queued -->|hold| Hold[Hold]
    Queued -->|cancel / dependency failed| Cancelled[Cancelled]
    Hold -->|release| Queued
    Hold -->|cancel| Cancelled
    Running -->|exit 0| Finished[Finished]
    Running -->|exit != 0| Failed[Failed]
    Running -->|cancel| Cancelled
    Running -->|time limit| Timeout[Timeout]
```

Use the toolbar in the top-right corner to zoom, fit, download, or enter fullscreen.

### State Transition Rules

**From Queued**:
- **Running**: When dependencies are met AND resources are available
-**Hold**: User runs `gjob hold <job_id>`
-**Cancelled**: User runs `gcancel <job_id>` OR a dependency fails (with auto-cancel enabled)

**From Hold**:
- **Queued**: User runs `gjob release <job_id>`
-**Cancelled**: User runs `gcancel <job_id>`

**From Running**:
- **Finished**: Job script/command exits with code 0
-**Failed**: Job script/command exits with non-zero code
-**Cancelled**: User runs `gcancel <job_id>`
-**Timeout**: Job exceeds its time limit (set with `--time`)

**From Completed States**:
- No transitions (final states)
- Use `gjob redo <job_id>` to create a new job with the same parameters

## Job State Reasons

Jobs in certain states have an associated reason that provides more context:

| State | Reason | Description |
|-------|--------|-------------|
| Queued | `WaitingForDependency` | Job is waiting for parent jobs to finish |
| Queued | `WaitingForResources` | Job is waiting for available GPUs/memory |
| Hold | `JobHeldUser` | Job was put on hold by user request |
| Cancelled | `CancelledByUser` | User explicitly cancelled the job |
| Cancelled | `DependencyFailed:<job_id>` | Job was auto-cancelled because job `<job_id>` failed |
| Cancelled | `SystemError:<msg>` | Job was cancelled due to a system error |

View the reason with `gjob show <job_id>` or `gqueue -f JOBID,ST,REASON`.

## Status Checking Workflow

The following diagram shows a simplified check -> action -> recheck loop:

```mermaid
---
showToolbar: true
---
flowchart TD
    Check([Run gqueue -f JOBID,ST,REASON]) --> State{State?}

    State -->|Queued| QueuedReason{Reason?}
    QueuedReason -->|WaitingForDependency| Dep[Check parent jobs<br/>gqueue -t]
    QueuedReason -->|WaitingForResources| Res[Check resources<br/>ginfo]
    Dep --> Recheck([Recheck later])
    Res --> Recheck

    State -->|Hold| Release[Release job<br/>gjob release ID]
    Release --> Recheck

    State -->|Running| Monitor[Monitor logs or attach<br/>gjob log ID / gjob attach ID]
    Monitor --> Recheck

    State -->|Finished| Done([Done])

    State -->|Failed| Retry[Inspect logs and redo if fixed<br/>gjob log ID / gjob redo ID]
    Retry --> Recheck

    State -->|Cancelled| CancelReason{Reason?}
    CancelReason -->|CancelledByUser| Stop([No further action])
    CancelReason -->|DependencyFailed| Cascade[Fix parent and redo<br/>gjob redo PARENT_ID --cascade]
    Cascade --> Recheck

    State -->|Timeout| MoreTime[Redo with more time<br/>gjob redo ID --time HH:MM:SS]
    MoreTime --> Recheck
```

## See Also

- [Job Dependencies]./job-dependencies - Complete guide to job dependencies
- [Job Submission]./job-submission - Job submission options
- [Time Limits]./time-limits - Managing job timeouts