# Resource Monitoring Reference
Technical reference for Torc's resource monitoring system.
## Configuration Options
The `resource_monitor` section has one shared sampling interval and separate nested scopes for jobs
and compute nodes:
```yaml
resource_monitor:
  sample_interval_seconds: 5
  jobs:
    enabled: true
    granularity: summary
  compute_node:
    enabled: true
    granularity: time_series
```
| Option | Type | Default | Description |
|---|---|---|---|
| `sample_interval_seconds` | integer | `10` | Seconds between all resource samples |
| `generate_plots` | boolean | `false` | Emit HTML plots after the job runner exits |
| `jobs` | JobMonitorConfig | none | Per-job CPU and memory monitoring |
| `compute_node` | ComputeNodeMonitorConfig | none | Overall compute-node CPU/memory monitoring |
For backwards compatibility, top-level `enabled` and `granularity` fields are still accepted and
apply to job monitoring when `jobs` is omitted:
```yaml
resource_monitor:
  enabled: true
  granularity: time_series
  sample_interval_seconds: 5
```
New workflow specs should use the explicit `jobs` block.
### Job Monitoring
`resource_monitor.jobs` controls per-job CPU and memory monitoring. Summary mode stores peak and
average values on job results. Time-series mode also stores per-sample values in a resource metrics
database.
### Compute Node Monitoring
To opt in to overall compute-node CPU and memory monitoring, add a nested `compute_node` block. The
compute-node monitor supports `granularity: "summary"` and `granularity: "time_series"`. Summary
mode stores peak and average values for the runner lifetime on the compute node record. Time-series
mode also stores per-sample values. The current compute-node monitor records CPU and memory; GPU
monitoring is reserved for a future extension.
### Granularity Modes
**Summary mode** (`"summary"`):
- Stores only peak and average values per job
- Metrics stored in the main database results table
- Minimal storage overhead
**Time series mode** (`"time_series"`):
- Stores samples at regular intervals
- Creates separate SQLite database per workflow run
- Database location:
`<output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db`
### Sample Interval Guidelines
| Expected job runtime | Suggested `sample_interval_seconds` |
|---|---|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds |
| > 4 hours | 10-30 seconds |
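These guidelines translate directly into a small lookup. A sketch of such a helper (`suggest_sample_interval` is hypothetical, not part of the Torc CLI; the concrete values within each range are one reasonable choice):

```python
def suggest_sample_interval(expected_runtime_hours: float) -> int:
    """Map an expected job runtime to a sample_interval_seconds value
    following the guideline table above."""
    if expected_runtime_hours < 1:
        return 2    # short jobs: fine-grained 1-2 s sampling
    if expected_runtime_hours <= 4:
        return 5    # medium jobs: 5 s interval
    return 30       # long jobs: coarse 10-30 s sampling

print(suggest_sample_interval(0.5))  # 2
print(suggest_sample_interval(8.0))  # 30
```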
## Time Series Database Schema
### `job_resource_samples` Table
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `job_id` | INTEGER | Torc job ID |
| `timestamp` | REAL | Unix timestamp |
| `cpu_percent` | REAL | CPU utilization percentage |
| `memory_bytes` | INTEGER | Memory usage in bytes |
| `num_processes` | INTEGER | Process count including children |
### `job_metadata` Table
| Column | Type | Description |
|---|---|---|
| `job_id` | INTEGER | Primary key, Torc job ID |
| `job_name` | TEXT | Human-readable job name |
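The two tables above can be joined on `job_id` to recover per-job peak and average metrics from a time-series database. A minimal sketch, using an in-memory SQLite database populated with the documented schema (a real workflow would open the `resource_metrics_*.db` file instead; the sample values are illustrative):

```python
import sqlite3

# Open the time-series database. ":memory:" stands in for the real path,
# <output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE job_resource_samples (
    id INTEGER PRIMARY KEY, job_id INTEGER, timestamp REAL,
    cpu_percent REAL, memory_bytes INTEGER, num_processes INTEGER);
CREATE TABLE job_metadata (job_id INTEGER PRIMARY KEY, job_name TEXT);
""")
db.execute("INSERT INTO job_metadata VALUES (15, 'train_model')")
db.executemany(
    "INSERT INTO job_resource_samples "
    "(job_id, timestamp, cpu_percent, memory_bytes, num_processes) "
    "VALUES (?, ?, ?, ?, ?)",
    [(15, 0.0, 50.0, 4_000_000_000, 3), (15, 5.0, 90.0, 10_500_000_000, 5)],
)

# Aggregate samples into summary-style peak/average values per job.
rows = db.execute("""
    SELECT m.job_name,
           MAX(s.cpu_percent)  AS peak_cpu,
           AVG(s.cpu_percent)  AS avg_cpu,
           MAX(s.memory_bytes) AS peak_mem
    FROM job_resource_samples s JOIN job_metadata m USING (job_id)
    GROUP BY s.job_id
""").fetchall()
print(rows)  # [('train_model', 90.0, 70.0, 10500000000)]
```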
### `system_resource_samples` Table
This table is always created in the resource metrics database, but rows are only written when
`resource_monitor.compute_node` is enabled with `granularity` set to `"time_series"`. If
compute-node monitoring is disabled or summary-only, the table remains empty.
| Column | Type | Description |
|---|---|---|
| `timestamp` | INTEGER | Unix timestamp |
| `cpu_percent` | REAL | Overall CPU utilization |
| `memory_bytes` | INTEGER | Used system memory in bytes |
| `total_memory_bytes` | INTEGER | Total system memory in bytes |
### `system_resource_summary` Table
This table is always created in the resource metrics database, but a row is only written when
compute-node time-series monitoring is enabled. Summary-only compute-node monitoring stores these
values on the compute node record instead, leaving this table empty.
| Column | Type | Description |
|---|---|---|
| `sample_count` | INTEGER | Number of system samples |
| `peak_cpu_percent` | REAL | Peak overall CPU utilization |
| `avg_cpu_percent` | REAL | Average CPU utilization |
| `peak_memory_bytes` | INTEGER | Peak used system memory |
| `avg_memory_bytes` | INTEGER | Average used system memory |
### Compute Node Summary Fields
When `resource_monitor.compute_node.enabled` is true, Torc stores overall summary metrics on the
compute node record:
| Field | Description |
|---|---|
| `sample_count` | Number of system samples |
| `peak_cpu_percent` | Peak overall CPU utilization |
| `avg_cpu_percent` | Average CPU utilization |
| `peak_memory_bytes` | Peak used system memory |
| `avg_memory_bytes` | Average used system memory |
These fields are shown by `torc compute-nodes get`, `torc compute-nodes list`, the TUI compute nodes
view, and the dashboard compute nodes table.
## Summary Metrics in Results
When using summary mode, the following fields are added to job results:
| Field | Type | Description |
|---|---|---|
| `peak_cpu_percent` | float | Maximum CPU percentage observed |
| `avg_cpu_percent` | float | Average CPU percentage |
| `peak_memory_gb` | float | Maximum memory in GB |
| `avg_memory_gb` | float | Average memory in GB |
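The relationship between raw samples and these four fields can be sketched in a few lines. This is an illustration, not Torc's implementation; it assumes "GB" means decimal gigabytes (10^9 bytes), which you should verify against your Torc version if exact units matter:

```python
# Illustrative sample data: CPU in percent, memory in bytes.
cpu_samples = [40.0, 80.0, 60.0]
memory_samples = [2_000_000_000, 4_000_000_000, 3_000_000_000]

# Derive the summary-mode result fields (GB assumed decimal, 1e9 bytes).
summary = {
    "peak_cpu_percent": max(cpu_samples),
    "avg_cpu_percent": sum(cpu_samples) / len(cpu_samples),
    "peak_memory_gb": max(memory_samples) / 1e9,
    "avg_memory_gb": sum(memory_samples) / len(memory_samples) / 1e9,
}
print(summary)  # peak_cpu 80.0, avg_cpu 60.0, peak_mem 4.0 GB, avg_mem 3.0 GB
```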
## check-resource-utilization JSON Output
When using `--format json`:
```json
{
"workflow_id": 123,
"run_id": null,
"total_results": 10,
"over_utilization_count": 3,
"violations": [
{
"job_id": 15,
"job_name": "train_model",
"resource_type": "Memory",
"specified": "8.00 GB",
"peak_used": "10.50 GB",
"over_utilization": "+31.3%"
}
]
}
```
| Field | Description |
|---|---|
| `workflow_id` | Workflow being analyzed |
| `run_id` | Specific run ID if provided, otherwise `null` for latest |
| `total_results` | Total number of completed jobs analyzed |
| `over_utilization_count` | Number of violations found |
| `violations` | Array of violation details |
### Violation Object
| Field | Description |
|---|---|
| `job_id` | Job ID with violation |
| `job_name` | Human-readable job name |
| `resource_type` | `"Memory"`, `"CPU"`, or `"Runtime"` |
| `specified` | Resource requirement from workflow spec |
| `peak_used` | Actual peak usage observed |
| `over_utilization` | Percentage over/under specification |
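The JSON output is straightforward to post-process. A minimal sketch that parses a violation and recomputes the over-utilization percentage from `specified` and `peak_used` (the report's `over_utilization` string rounds to one decimal place):

```python
import json

# A trimmed check-resource-utilization report, mirroring the example above.
report = json.loads("""{
  "violations": [
    {"job_id": 15, "resource_type": "Memory",
     "specified": "8.00 GB", "peak_used": "10.50 GB",
     "over_utilization": "+31.3%"}
  ]
}""")

for v in report["violations"]:
    specified = float(v["specified"].split()[0])  # "8.00 GB" -> 8.0
    peak = float(v["peak_used"].split()[0])       # "10.50 GB" -> 10.5
    pct = (peak - specified) / specified * 100
    print(f"job {v['job_id']}: {pct:+.2f}% over")  # job 15: +31.25% over
```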
## correct-resources JSON Output
When using `torc -f json workflows correct-resources`:
```json
{
"status": "success",
"workflow_id": 123,
"dry_run": false,
"no_downsize": false,
"memory_multiplier": 1.2,
"cpu_multiplier": 1.2,
"runtime_multiplier": 1.2,
"resource_requirements_updated": 2,
"jobs_analyzed": 5,
"memory_corrections": 1,
"runtime_corrections": 1,
"cpu_corrections": 1,
"downsize_memory_corrections": 2,
"downsize_runtime_corrections": 2,
"downsize_cpu_corrections": 0,
"adjustments": [
{
"resource_requirements_id": 10,
"direction": "upscale",
"job_ids": [15],
"job_names": ["train_model"],
"memory_adjusted": true,
"original_memory": "8g",
"new_memory": "13g",
"max_peak_memory_bytes": 10500000000
},
{
"resource_requirements_id": 11,
"direction": "downscale",
"job_ids": [20, 21],
"job_names": ["preprocess_a", "preprocess_b"],
"memory_adjusted": true,
"original_memory": "32g",
"new_memory": "3g",
"max_peak_memory_bytes": 2147483648,
"runtime_adjusted": true,
"original_runtime": "PT4H",
"new_runtime": "PT12M"
}
]
}
```
### Top-Level Fields
| Field | Description |
|---|---|
| `memory_multiplier` | Memory safety multiplier used |
| `cpu_multiplier` | CPU safety multiplier used |
| `runtime_multiplier` | Runtime safety multiplier used |
| `resource_requirements_updated` | Number of resource requirements changed |
| `jobs_analyzed` | Number of jobs with violations analyzed |
| `memory_corrections` | Jobs affected by memory upscaling |
| `runtime_corrections` | Jobs affected by runtime upscaling |
| `cpu_corrections` | Jobs affected by CPU upscaling |
| `downsize_memory_corrections` | Jobs affected by memory downsizing |
| `downsize_runtime_corrections` | Jobs affected by runtime downsizing |
| `downsize_cpu_corrections` | Jobs affected by CPU downsizing |
| `adjustments` | Array of per-resource-requirement adjustment details |
### Adjustment Object
| Field | Description |
|---|---|
| `resource_requirements_id` | ID of the resource requirement being adjusted |
| `direction` | `"upscale"` or `"downscale"` |
| `job_ids` | Job IDs sharing this resource requirement |
| `job_names` | Human-readable job names |
| `memory_adjusted` | Whether memory was changed |
| `original_memory` | Previous memory setting (if adjusted) |
| `new_memory` | New memory setting (if adjusted) |
| `max_peak_memory_bytes` | Maximum peak memory observed across jobs |
| `runtime_adjusted` | Whether runtime was changed |
| `original_runtime` | Previous runtime setting (if adjusted) |
| `new_runtime` | New runtime setting (if adjusted) |
| `cpu_adjusted` | Whether CPU count was changed (omitted when false) |
| `original_cpus` | Previous CPU count (if adjusted) |
| `new_cpus` | New CPU count (if adjusted) |
| `max_peak_cpu_percent` | Maximum peak CPU percentage observed across jobs |
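The memory values in the example output are consistent with multiplying peak usage by the safety multiplier and rounding up to whole gigabytes. The sketch below reconstructs that arithmetic for illustration only; the ceil-to-whole-GB rounding is an assumption inferred from the example, not a documented guarantee:

```python
import math

def adjusted_memory_gb(max_peak_memory_bytes: int, multiplier: float = 1.2) -> int:
    """Hypothetical reconstruction: new memory = ceil(peak * multiplier / 1e9) GB."""
    return math.ceil(max_peak_memory_bytes * multiplier / 1e9)

# Upscale case: 10.5 GB peak * 1.2 = 12.6 GB -> matches new_memory "13g".
print(adjusted_memory_gb(10_500_000_000))  # 13
# Downscale case: ~2.15 GB peak * 1.2 = ~2.58 GB -> matches new_memory "3g".
print(adjusted_memory_gb(2_147_483_648))   # 3
```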
## plot-resources Output Files
| File | Description |
|---|---|
| `resource_plot_job_<id>.html` | Per-job timeline with CPU, memory, process count |
| `resource_plot_cpu_all_jobs.html` | CPU comparison across all jobs |
| `resource_plot_memory_all_jobs.html` | Memory comparison across all jobs |
| `resource_plot_summary.html` | Bar chart dashboard of peak vs average |
| `resource_plot_system_timeline.html` | Overall system CPU and memory over time |
| `resource_plot_system_summary.html` | Overall system peak and average values |
All plots are self-contained HTML files using Plotly.js with:
- Interactive hover tooltips
- Zoom and pan controls
- Legend toggling
- Export options (PNG, SVG)
## Monitored Metrics
| Metric | Unit | Description |
|---|---|---|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job's process tree |
### Process Tree Tracking
The monitoring system automatically tracks child processes spawned by jobs. When a job creates
worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated
metrics.
## Slurm Accounting Stats
When running inside a Slurm allocation, Torc calls `sacct` after each job step completes and stores
the results in the `slurm_stats` table. These complement the sysinfo-based metrics above with
Slurm-native cgroup measurements.
### Fields
| Field | `sacct` column | Description |
|---|---|---|
| `max_rss_bytes` | `MaxRSS` | Peak resident-set size (from cgroups) |
| `max_vm_size_bytes` | `MaxVMSize` | Peak virtual memory size |
| `max_disk_read_bytes` | `MaxDiskRead` | Peak disk read bytes |
| `max_disk_write_bytes` | `MaxDiskWrite` | Peak disk write bytes |
| `ave_cpu_seconds` | `AveCPU` | Average CPU time in seconds |
| `node_list` | `NodeList` | Nodes used by the job step |
Additional identifying fields stored per record: `workflow_id`, `job_id`, `run_id`, `attempt_id`,
`slurm_job_id`.
Fields are `null` when:
- The job ran locally (no `SLURM_JOB_ID` in the environment)
- `sacct` is not available on the node
- The step was not found in the Slurm accounting database at collection time
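The first two conditions can be checked up front from Python or a shell script. A minimal sketch, assuming only the documented environment variable and command name (`slurm_stats_available` is a hypothetical helper, not part of Torc):

```python
import os
import shutil

def slurm_stats_available() -> bool:
    """Return True only when Slurm accounting stats could be collected:
    inside a Slurm allocation and with sacct reachable on PATH."""
    inside_allocation = "SLURM_JOB_ID" in os.environ
    sacct_on_path = shutil.which("sacct") is not None
    return inside_allocation and sacct_on_path

print(slurm_stats_available())
```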
### Viewing Stats
```bash
torc slurm stats <workflow_id>
torc slurm stats <workflow_id> --job-id <job_id>
torc -f json slurm stats <workflow_id>
```
## Performance Characteristics
- Single background monitoring thread regardless of job count
- Typical overhead: <1% CPU even with 1-second sampling
- Uses native OS APIs via the `sysinfo` crate
- Non-blocking async design