# Resource Monitoring Reference
Technical reference for Torc's resource monitoring system.
## Configuration Options
The `resource_monitor` section has one shared sampling interval and separate nested scopes for jobs
and compute nodes:
```yaml
resource_monitor:
  sample_interval_seconds: 5
  jobs:
    enabled: true
    granularity: summary
  compute_node:
    enabled: true
    granularity: time_series
```
| Option | Type | Default | Description |
|---|---|---|---|
| `sample_interval_seconds` | integer | `10` | Seconds between all resource samples |
| `generate_plots` | boolean | `false` | Emit HTML plots after the job runner exits |
| `jobs` | JobMonitorConfig | none | Per-job CPU and memory monitoring |
| `compute_node` | ComputeNodeMonitorConfig | none | Overall compute-node CPU/memory monitoring |
For backwards compatibility, top-level `enabled` and `granularity` fields are still accepted and
apply to job monitoring when `jobs` is omitted:
```yaml
resource_monitor:
  enabled: true
  granularity: time_series
  sample_interval_seconds: 5
```
New workflow specs should use the explicit `jobs` block.
### Job Monitoring
`resource_monitor.jobs` controls per-job CPU and memory monitoring. Summary mode stores peak and
average values on job results. Time-series mode also stores per-sample values in a resource metrics
database.
### Compute Node Monitoring
To opt in to overall compute-node CPU and memory monitoring, add a nested `compute_node` block. The
compute-node monitor supports `granularity: "summary"` and `granularity: "time_series"`. Summary
mode stores peak and average values for the runner lifetime on the compute node record. Time-series
mode also stores per-sample values. The current compute-node monitor records CPU and memory; GPU
monitoring is reserved for a future extension.
### Granularity Modes
**Summary mode** (`"summary"`):
- Stores only peak and average values per job
- Metrics stored in the main database results table
- Minimal storage overhead
**Time series mode** (`"time_series"`):
- Stores samples at regular intervals
- Creates separate SQLite database per workflow run
- Database location:
`<output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db`
### Sample Interval Guidelines
| Expected job runtime | Suggested `sample_interval_seconds` |
|---|---|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds |
| > 4 hours | 10-30 seconds |
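These guidelines translate directly into a small lookup. A sketch of such a helper (`suggest_sample_interval` is hypothetical, not part of the Torc CLI; the concrete values within each range are one reasonable choice):

```python
def suggest_sample_interval(expected_runtime_hours: float) -> int:
    """Map an expected job runtime to a sample_interval_seconds value
    following the guideline table above."""
    if expected_runtime_hours < 1:
        return 2    # short jobs: fine-grained 1-2 s sampling
    if expected_runtime_hours <= 4:
        return 5    # medium jobs: 5 s interval
    return 30       # long jobs: coarse 10-30 s sampling

print(suggest_sample_interval(0.5))  # 2
print(suggest_sample_interval(8.0))  # 30
```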
## Time Series Database Schema
### `job_resource_samples` Table
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `job_id` | INTEGER | Torc job ID |
| `timestamp` | REAL | Unix timestamp |
| `cpu_percent` | REAL | CPU utilization percentage |
| `memory_bytes` | INTEGER | Memory usage in bytes |
| `num_processes` | INTEGER | Process count including children |
### `job_metadata` Table
| Column | Type | Description |
|---|---|---|
| `job_id` | INTEGER | Primary key, Torc job ID |
| `job_name` | TEXT | Human-readable job name |
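The two tables above can be joined on `job_id` to recover per-job peak and average metrics from a time-series database. A minimal sketch, using an in-memory SQLite database populated with the documented schema (a real workflow would open the `resource_metrics_*.db` file instead; the sample values are illustrative):

```python
import sqlite3

# Open the time-series database. ":memory:" stands in for the real path,
# <output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE job_resource_samples (
    id INTEGER PRIMARY KEY, job_id INTEGER, timestamp REAL,
    cpu_percent REAL, memory_bytes INTEGER, num_processes INTEGER);
CREATE TABLE job_metadata (job_id INTEGER PRIMARY KEY, job_name TEXT);
""")
db.execute("INSERT INTO job_metadata VALUES (15, 'train_model')")
db.executemany(
    "INSERT INTO job_resource_samples "
    "(job_id, timestamp, cpu_percent, memory_bytes, num_processes) "
    "VALUES (?, ?, ?, ?, ?)",
    [(15, 0.0, 50.0, 4_000_000_000, 3), (15, 5.0, 90.0, 10_500_000_000, 5)],
)

# Aggregate samples into summary-style peak/average values per job.
rows = db.execute("""
    SELECT m.job_name,
           MAX(s.cpu_percent)  AS peak_cpu,
           AVG(s.cpu_percent)  AS avg_cpu,
           MAX(s.memory_bytes) AS peak_mem
    FROM job_resource_samples s JOIN job_metadata m USING (job_id)
    GROUP BY s.job_id
""").fetchall()
print(rows)  # [('train_model', 90.0, 70.0, 10500000000)]
```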
### `system_resource_samples` Table
This table is always created in the resource metrics database, but rows are only written when
`resource_monitor.compute_node` is enabled with `granularity` set to `"time_series"`. If
compute-node monitoring is disabled or summary-only, the table remains empty.
| Column | Type | Description |
|---|---|---|
| `timestamp` | INTEGER | Unix timestamp |
| `cpu_percent` | REAL | Overall CPU utilization |
| `memory_bytes` | INTEGER | Used system memory in bytes |
| `total_memory_bytes` | INTEGER | Total system memory in bytes |
### `system_resource_summary` Table
This table is always created in the resource metrics database, but a row is only written when
compute-node time-series monitoring is enabled. Summary-only compute-node monitoring stores these
values on the compute node record instead, leaving this table empty.
| Column | Type | Description |
|---|---|---|
| `sample_count` | INTEGER | Number of system samples |
| `peak_cpu_percent` | REAL | Peak overall CPU utilization |
| `avg_cpu_percent` | REAL | Average CPU utilization |
| `peak_memory_bytes` | INTEGER | Peak used system memory |
| `avg_memory_bytes` | INTEGER | Average used system memory |
### Compute Node Summary Fields
When `resource_monitor.compute_node.enabled` is true, Torc stores overall summary metrics on the
compute node record:
| Field | Description |
|---|---|
| `sample_count` | Number of system samples |
| `peak_cpu_percent` | Peak overall CPU utilization |
| `avg_cpu_percent` | Average CPU utilization |
| `peak_memory_bytes` | Peak used system memory |
| `avg_memory_bytes` | Average used system memory |
These fields are shown by `torc compute-nodes get`, `torc compute-nodes list`, the TUI compute nodes
view, and the dashboard compute nodes table.
## Summary Metrics in Results
When using summary mode, the following fields are added to job results:
| Field | Type | Description |
|---|---|---|
| `peak_cpu_percent` | float | Maximum CPU percentage observed |
| `avg_cpu_percent` | float | Average CPU percentage |
| `peak_memory_gb` | float | Maximum memory in GB |
| `avg_memory_gb` | float | Average memory in GB |
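The relationship between raw samples and these four fields can be sketched in a few lines. This is an illustration, not Torc's implementation; it assumes "GB" means decimal gigabytes (10^9 bytes), which you should verify against your Torc version if exact units matter:

```python
# Illustrative sample data: CPU in percent, memory in bytes.
cpu_samples = [40.0, 80.0, 60.0]
memory_samples = [2_000_000_000, 4_000_000_000, 3_000_000_000]

# Derive the summary-mode result fields (GB assumed decimal, 1e9 bytes).
summary = {
    "peak_cpu_percent": max(cpu_samples),
    "avg_cpu_percent": sum(cpu_samples) / len(cpu_samples),
    "peak_memory_gb": max(memory_samples) / 1e9,
    "avg_memory_gb": sum(memory_samples) / len(memory_samples) / 1e9,
}
print(summary)  # peak_cpu 80.0, avg_cpu 60.0, peak_mem 4.0 GB, avg_mem 3.0 GB
```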
## check-resource-utilization JSON Output
When using `--format json`:
```json
{
"workflow_id": 123,
"run_id": null,
"total_results": 10,
"over_utilization_count": 3,
"violations": [
{
"job_id": 15,
"job_name": "train_model",
"resource_type": "Memory",
"specified": "8.00 GB",
"peak_used": "10.50 GB",
"over_utilization": "+31.3%"
}
]
}
```
| Field | Description |
|---|---|
| `workflow_id` | Workflow being analyzed |
| `run_id` | Specific run ID if provided, otherwise `null` for latest |
| `total_results` | Total number of completed jobs analyzed |
| `over_utilization_count` | Number of violations found |
| `violations` | Array of violation details |
### Violation Object
| Field | Description |
|---|---|
| `job_id` | Job ID with violation |
| `job_name` | Human-readable job name |
| `resource_type` | `"Memory"`, `"CPU"`, or `"Runtime"` |
| `specified` | Resource requirement from workflow spec |
| `peak_used` | Actual peak usage observed |
| `over_utilization` | Percentage over/under specification |
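The JSON output is straightforward to post-process. A minimal sketch that parses a violation and recomputes the over-utilization percentage from `specified` and `peak_used` (the report's `over_utilization` string rounds to one decimal place):

```python
import json

# A trimmed check-resource-utilization report, mirroring the example above.
report = json.loads("""{
  "violations": [
    {"job_id": 15, "resource_type": "Memory",
     "specified": "8.00 GB", "peak_used": "10.50 GB",
     "over_utilization": "+31.3%"}
  ]
}""")

for v in report["violations"]:
    specified = float(v["specified"].split()[0])  # "8.00 GB" -> 8.0
    peak = float(v["peak_used"].split()[0])       # "10.50 GB" -> 10.5
    pct = (peak - specified) / specified * 100
    print(f"job {v['job_id']}: {pct:+.2f}% over")  # job 15: +31.25% over
```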
## correct-resources JSON Output
When using `torc -f json workflows correct-resources`:
```json
{
"status": "success",
"workflow_id": 123,
"dry_run": false,
"no_downsize": false,
"memory_multiplier": 1.2,
"cpu_multiplier": 1.2,
"runtime_multiplier": 1.2,
"resource_requirements_updated": 2,
"jobs_analyzed": 5,
"memory_corrections": 1,
"runtime_corrections": 1,
"cpu_corrections": 1,
"downsize_memory_corrections": 2,
"downsize_runtime_corrections": 2,
"downsize_cpu_corrections": 0,
"adjustments": [
{
"resource_requirements_id": 10,
"direction": "upscale",
"job_ids": [15],
"job_names": ["train_model"],
"memory_adjusted": true,
"original_memory": "8g",
"new_memory": "13g",
"max_peak_memory_bytes": 10500000000
},
{
"resource_requirements_id": 11,
"direction": "downscale",
"job_ids": [20, 21],
"job_names": ["preprocess_a", "preprocess_b"],
"memory_adjusted": true,
"original_memory": "32g",
"new_memory": "3g",
"max_peak_memory_bytes": 2147483648,
"runtime_adjusted": true,
"original_runtime": "PT4H",
"new_runtime": "PT12M"
}
]
}
```
### Top-Level Fields
| Field | Description |
|---|---|
| `memory_multiplier` | Memory safety multiplier used |
| `cpu_multiplier` | CPU safety multiplier used |
| `runtime_multiplier` | Runtime safety multiplier used |
| `resource_requirements_updated` | Number of resource requirements changed |
| `jobs_analyzed` | Number of jobs with violations analyzed |
| `memory_corrections` | Jobs affected by memory upscaling |
| `runtime_corrections` | Jobs affected by runtime upscaling |
| `cpu_corrections` | Jobs affected by CPU upscaling |
| `downsize_memory_corrections` | Jobs affected by memory downsizing |
| `downsize_runtime_corrections` | Jobs affected by runtime downsizing |
| `downsize_cpu_corrections` | Jobs affected by CPU downsizing |
| `adjustments` | Array of per-resource-requirement adjustment details |
### Adjustment Object
| Field | Description |
|---|---|
| `resource_requirements_id` | ID of the resource requirement being adjusted |
| `direction` | `"upscale"` or `"downscale"` |
| `job_ids` | Job IDs sharing this resource requirement |
| `job_names` | Human-readable job names |
| `memory_adjusted` | Whether memory was changed |
| `original_memory` | Previous memory setting (if adjusted) |
| `new_memory` | New memory setting (if adjusted) |
| `max_peak_memory_bytes` | Maximum peak memory observed across jobs |
| `runtime_adjusted` | Whether runtime was changed |
| `original_runtime` | Previous runtime setting (if adjusted) |
| `new_runtime` | New runtime setting (if adjusted) |
| `cpu_adjusted` | Whether CPU count was changed (omitted when false) |
| `original_cpus` | Previous CPU count (if adjusted) |
| `new_cpus` | New CPU count (if adjusted) |
| `max_peak_cpu_percent` | Maximum peak CPU percentage observed across jobs |
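The memory values in the example output are consistent with multiplying peak usage by the safety multiplier and rounding up to whole gigabytes. The sketch below reconstructs that arithmetic for illustration only; the ceil-to-whole-GB rounding is an assumption inferred from the example, not a documented guarantee:

```python
import math

def adjusted_memory_gb(max_peak_memory_bytes: int, multiplier: float = 1.2) -> int:
    """Hypothetical reconstruction: new memory = ceil(peak * multiplier / 1e9) GB."""
    return math.ceil(max_peak_memory_bytes * multiplier / 1e9)

# Upscale case: 10.5 GB peak * 1.2 = 12.6 GB -> matches new_memory "13g".
print(adjusted_memory_gb(10_500_000_000))  # 13
# Downscale case: ~2.15 GB peak * 1.2 = ~2.58 GB -> matches new_memory "3g".
print(adjusted_memory_gb(2_147_483_648))   # 3
```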
## plot-resources Output Files
| File | Description |
|---|---|
| `resource_plot_job_<id>.html` | Per-job timeline with CPU, memory, process count |
| `resource_plot_cpu_all_jobs.html` | CPU comparison across all jobs |
| `resource_plot_memory_all_jobs.html` | Memory comparison across all jobs |
| `resource_plot_summary.html` | Bar chart dashboard of peak vs average |
| `resource_plot_system_timeline.html` | Overall system CPU and memory over time |
| `resource_plot_system_summary.html` | Overall system peak and average values |
All plots are self-contained HTML files using Plotly.js with:
- Interactive hover tooltips
- Zoom and pan controls
- Legend toggling
- Export options (PNG, SVG)
## Monitored Metrics
| Metric | Unit | Description |
|---|---|---|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job's process tree |
### Process Tree Tracking
The monitoring system automatically tracks child processes spawned by jobs. When a job creates
worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated
metrics.
## Slurm Accounting Stats
When running inside a Slurm allocation, Torc calls `sacct` after each job step completes and stores
the results in the `slurm_stats` table. These complement the sysinfo-based metrics above with
Slurm-native cgroup measurements.
### Fields
| Field | `sacct` column | Description |
|---|---|---|
| `max_rss_bytes` | `MaxRSS` | Peak resident-set size (from cgroups) |
| `max_vm_size_bytes` | `MaxVMSize` | Peak virtual memory size |
| `max_disk_read_bytes` | `MaxDiskRead` | Peak disk read bytes |
| `max_disk_write_bytes` | `MaxDiskWrite` | Peak disk write bytes |
| `ave_cpu_seconds` | `AveCPU` | Average CPU time in seconds |
| `node_list` | `NodeList` | Nodes used by the job step |
Additional identifying fields stored per record: `workflow_id`, `job_id`, `run_id`, `attempt_id`,
`slurm_job_id`.
Fields are `null` when:
- The job ran locally (no `SLURM_JOB_ID` in the environment)
- `sacct` is not available on the node
- The step was not found in the Slurm accounting database at collection time
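The first two conditions can be checked up front from Python or a shell script. A minimal sketch, assuming only the documented environment variable and command name (`slurm_stats_available` is a hypothetical helper, not part of Torc):

```python
import os
import shutil

def slurm_stats_available() -> bool:
    """Return True only when Slurm accounting stats could be collected:
    inside a Slurm allocation and with sacct reachable on PATH."""
    inside_allocation = "SLURM_JOB_ID" in os.environ
    sacct_on_path = shutil.which("sacct") is not None
    return inside_allocation and sacct_on_path

print(slurm_stats_available())
```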
### Viewing Stats
```bash
torc slurm stats <workflow_id>
torc slurm stats <workflow_id> --job-id <job_id>
torc -f json slurm stats <workflow_id>
```
## Performance Characteristics
- Single background monitoring thread regardless of job count
- Typical overhead: <1% CPU even with 1-second sampling
- Uses native OS APIs via the `sysinfo` crate
- Non-blocking async design