gflow 0.4.13

A lightweight, single-node job scheduler written in Rust.
# GPU Management

gflow detects NVIDIA GPUs (via NVML) and allocates them to jobs by setting `CUDA_VISIBLE_DEVICES`.

## Quick Start

```bash
# Start the daemon (if not already running)
gflowd up

# See availability + current allocations
ginfo

# Submit a GPU job
gbatch --gpus 1 python train.py

# Track jobs and allocations
gqueue -s Running,Queued -f JOBID,NAME,ST,NODES,NODELIST(REASON)
```

## Inspect GPUs

```bash
ginfo
```

Example output:
```
PARTITION  GPUS  NODES  STATE      JOB(REASON)
gpu        1     1      idle
gpu        1     0      allocated  5 (train-resnet)
```

- `NODES` shows the physical GPU indices.
- If a GPU is busy but not allocated by gflow, it may appear with a reason (when available).

Non-gflow GPU usage:
- If NVML reports running compute processes on a GPU, gflow treats it as unavailable (often shown as `Unmanaged`) and will not allocate it.
- gflow does not preempt/kill non-gflow processes; jobs wait until the GPU becomes idle.
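The availability policy above can be sketched in a few lines. This is an illustrative model, not gflow's actual implementation: a GPU is allocatable only when every compute process on it was started by the scheduler.

```python
# Illustrative sketch (not gflow source): filter out GPUs that have
# compute processes the scheduler did not start, mirroring the
# "Unmanaged" policy described above.

def allocatable_gpus(process_map, managed_pids=()):
    """process_map: {gpu_index: [pids of running compute processes]}.
    A GPU is allocatable only if every process on it is scheduler-managed."""
    managed = set(managed_pids)
    return sorted(
        idx for idx, pids in process_map.items()
        if all(pid in managed for pid in pids)
    )

# GPU 0 is idle, GPU 1 runs a foreign process, GPU 2 runs a managed job.
print(allocatable_gpus({0: [], 1: [4242], 2: [555]}, managed_pids=[555]))
# -> [0, 2]
```

GPU 1 stays unallocated until its foreign process exits; nothing is preempted.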

To check which GPUs the scheduler is allowed to allocate (per-GPU allowed vs restricted status):

```bash
gctl show-gpus
```

### Requirements

- NVIDIA GPU(s) + driver
- NVML library available (`libnvidia-ml.so`)

Quick check:

```bash
nvidia-smi
gflowd up
ginfo
```

On systems without GPUs, gflow still works; only GPU allocation is unavailable.

## Request GPUs

```bash
gbatch --gpus 1 python train.py
gbatch --gpus 2 python multi_gpu_train.py
```

When a job starts, gflow assigns **physical GPU indices** and exports them via `CUDA_VISIBLE_DEVICES` (which frameworks typically renumber starting from `0`).
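As a concrete illustration of that renumbering (a sketch, not gflow code): if the job is given physical GPUs 1 and 3, CUDA-aware frameworks see them as logical devices 0 and 1.

```python
import os

# Simulate what the scheduler exports for a job given physical GPUs 1 and 3.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

physical = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
# Frameworks renumber the visible devices starting from 0.
logical_to_physical = {logical: int(phys) for logical, phys in enumerate(physical)}
print(logical_to_physical)  # -> {0: 1, 1: 3}
```

So `cuda:0` inside the job maps to physical GPU 1, and `cuda:1` to physical GPU 3.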

To see allocated GPU IDs:

```bash
gqueue -s Running -f JOBID,NAME,ST,NODES,NODELIST(REASON)
gjob show <job_id>
```

### Shared GPU Mode

Use shared mode when you want multiple jobs to co-locate on one physical GPU.

```bash
gbatch --gpus 1 --shared --gpu-memory 20G python train.py
```

- `--shared` jobs only share with other `--shared` jobs.
- `--shared` requires a per-GPU VRAM limit via `--gpu-memory` (alias: `--max-gpu-mem`).
- `--memory` (`--max-mem`) is still host RAM, not GPU VRAM.
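To make the co-location arithmetic concrete, here is a hypothetical helper (not part of gflow) that parses a size like the `20G` passed to `--gpu-memory` and checks whether several shared jobs fit on one GPU:

```python
# Hypothetical helper (assumed semantics, not gflow source): parse a
# human-readable VRAM limit like "20G" into bytes and check co-location.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_mem(spec: str) -> int:
    spec = spec.strip().upper()
    if spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)  # plain bytes

def fits(requests, total_bytes):
    """Can all shared jobs' VRAM requests co-locate on one GPU?"""
    return sum(parse_mem(r) for r in requests) <= total_bytes

# Two 20G jobs fit on a 48G GPU; three do not.
print(fits(["20G", "20G"], parse_mem("48G")))         # -> True
print(fits(["20G", "20G", "20G"], parse_mem("48G")))  # -> False
```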

### GPU Visibility

```bash
#!/bin/bash
# GFLOW --gpus 2

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
python train.py
```

## Restrict Which GPUs gflow Uses

Limit which physical GPUs the scheduler is allowed to allocate (affects new allocations only):

```bash
gctl set-gpus 0,2
gctl show-gpus

# Or via daemon CLI flag (overrides config)
gflowd restart --gpus 0-3
```
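The selections above mix two spellings: a comma list (`0,2`) and a range (`0-3`). A small parser sketch with the assumed semantics (not gflow's actual code) shows how both expand to the same kind of index set:

```python
# Illustrative parser (assumed semantics, not gflow source) for GPU
# selections like "0,2" or "0-3" as passed to set-gpus / --gpus.
def parse_gpu_spec(spec: str) -> list[int]:
    indices = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.update(range(int(lo), int(hi) + 1))  # inclusive range
        else:
            indices.add(int(part))
    return sorted(indices)

print(parse_gpu_spec("0,2"))  # -> [0, 2]
print(parse_gpu_spec("0-3"))  # -> [0, 1, 2, 3]
```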

See also: [Configuration -> GPU Selection](./configuration#gpu-selection).

## Choose GPU Allocation Strategy

When multiple GPUs are available for a job, you can choose how gflow selects them:

- `sequential` (default): picks lower indices first.
- `random`: randomizes GPU selection order.

```toml
[daemon]
gpu_allocation_strategy = "sequential"
# gpu_allocation_strategy = "random"
```

Or override on daemon startup:

```bash
gflowd up --gpu-allocation-strategy random
```
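The two strategies can be sketched as follows (assumed behavior based on the descriptions above, not gflow source):

```python
import random

# Sketch of the two strategies: choose `count` GPUs from the free list.
def select_gpus(free, count, strategy="sequential", rng=None):
    if strategy == "sequential":
        return sorted(free)[:count]           # lowest indices first
    if strategy == "random":
        rng = rng or random.Random()
        return rng.sample(list(free), count)  # uniform random choice
    raise ValueError(f"unknown strategy: {strategy}")

free = [0, 1, 2, 3]
print(select_gpus(free, 2))                     # -> [0, 1]
print(select_gpus(free, 2, strategy="random"))  # two distinct indices from free
```

`random` can help spread wear and thermal load across GPUs on a shared box; `sequential` keeps allocations predictable.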

## Troubleshooting

### Job not getting GPU

```bash
ginfo                                                  # any idle GPUs left?
gqueue -j <job_id> -f JOBID,ST,NODES,NODELIST(REASON)  # why is the job waiting?
gctl show-gpus                                         # is the GPU restricted?
```

### Job sees wrong GPUs

```bash
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"  # run inside the job script
gqueue -f JOBID,NODELIST(REASON)                   # what gflow actually allocated
```

### Out of memory

```bash
# Check per-GPU free and used VRAM
nvidia-smi --query-gpu=memory.free,memory.used --format=csv
```

If shared jobs fail with OOM, verify `--gpu-memory` is set and sized appropriately for each job.
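To turn that check into something scriptable, here is a sketch (with made-up sample data) that parses the CSV produced by the `nvidia-smi` query above and flags GPUs whose free memory falls below a job's `--gpu-memory` request:

```python
# Sketch: parse nvidia-smi's CSV output and flag GPUs with too little
# free VRAM. The sample text below is illustrative, not real output.
sample = """memory.free [MiB], memory.used [MiB]
31000 MiB, 9000 MiB
1200 MiB, 38800 MiB"""

def low_memory_gpus(csv_text, needed_mib):
    rows = csv_text.strip().splitlines()[1:]  # skip the CSV header
    flagged = []
    for idx, row in enumerate(rows):
        free = int(row.split(",")[0].split()[0])  # "31000 MiB" -> 31000
        if free < needed_mib:
            flagged.append(idx)
    return flagged

print(low_memory_gpus(sample, 20 * 1024))  # -> [1]
```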

## See Also

- [Job Submission](./job-submission) - Complete job submission guide
- [Job Dependencies](./job-dependencies) - Workflow management
- [Time Limits](./time-limits) - Job timeout management
- [Quick Reference](../reference/quick-reference) - Command cheat sheet