# Debugging Workflows
When workflows fail or produce unexpected results, Torc provides several debugging tools to help
you identify and resolve the problem. The primary tools are:
- **`torc results list`**: Prints a table of return codes for each job execution (non-zero means
failure)
- **`torc results list --include-logs`**: Generates a detailed JSON report containing job results
and all associated log file paths
- **`torc logs analyze <output-dir>`**: Analyzes log files for known error patterns (see
[Working with Logs](working-with-logs.md))
- **torc-dash Debugging tab**: Interactive web interface for visual debugging with log file viewer
## Overview
Torc automatically captures return codes and multiple log files for each job execution:
- **Job stdout/stderr**: Output from your job commands
- **Job runner logs**: Internal logs from the Torc job runner
- **Slurm logs**: Additional logs when using Slurm scheduler (see
[Debugging Slurm Workflows](debugging-slurm.md))
The `results list --include-logs` command consolidates all this information into a single JSON
report, making it easy to locate and examine relevant log files for debugging.
## Quick Start
View the job return codes in a table:
```bash
torc results list <workflow_id>
```
```
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Avg CPU %  │ Completion Time          │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4  │ 6      │ 2     │ 1      │ 1           │ 1.01      │ 73.8MB   │ 21.9%      │ 2025-11-13T13:35:43.289Z │ Done   │
│ 5  │ 4      │ 2     │ 1      │ 0           │ 1.01      │ 118.1MB  │ 301.3%     │ 2025-11-13T13:35:43.393Z │ Done   │
│ 6  │ 5      │ 2     │ 1      │ 0           │ 1.01      │ 413.6MB  │ 19.9%      │ 2025-11-13T13:35:43.499Z │ Done   │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯
Total: 3 results
```
View only failed jobs:
```bash
torc results list <workflow_id> --failed
```
```
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Avg CPU %  │ Completion Time          │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4  │ 6      │ 2     │ 1      │ 1           │ 1.01      │ 73.8MB   │ 21.9%      │ 2025-11-13T13:35:43.289Z │ Done   │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯
```
Generate a debugging report for a workflow:
```bash
# Generate report for a specific workflow
torc results list --include-logs <workflow_id>
# Specify custom output directory (default: "torc_output")
torc results list --include-logs <workflow_id> --output-dir /path/to/output
# Include all workflow runs (default: only latest run)
torc results list --include-logs <workflow_id> --all-runs
# Interactive workflow selection (if workflow_id omitted)
torc results list --include-logs
```
The command outputs a comprehensive JSON report to stdout. Redirect it to a file for easier
analysis:
```bash
torc results list --include-logs <workflow_id> > debug_report.json
```
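Because warnings about missing log files are printed to stderr, redirecting only stdout keeps the
JSON file clean. If you also want to keep those warnings, a small wrapper can capture both streams
separately. This is a minimal sketch: `save_report` and the `.warnings.log` suffix are hypothetical
names chosen for illustration, not part of Torc.

```bash
# Hypothetical helper (not part of Torc): save the JSON report to a file and
# keep anything printed on stderr in a separate file next to it.
save_report() {
    local workflow_id="$1" report="${2:-debug_report.json}"
    torc results list --include-logs "$workflow_id" \
        > "$report" 2> "${report%.json}.warnings.log"
}

# Usage: save_report <workflow_id> debug_report.json
```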
## Report Structure
### Top-Level Fields
The JSON report includes workflow-level information:
```json
{
"workflow_id": 123,
"workflow_name": "my_pipeline",
"workflow_user": "researcher",
"all_runs": false,
"total_results": 5,
"results": [...]
}
```
**Fields**:
- `workflow_id`: Unique identifier for the workflow
- `workflow_name`: Human-readable workflow name
- `workflow_user`: Owner of the workflow
- `all_runs`: Whether report includes all historical runs or just the latest
- `total_results`: Number of job results in the report
- `results`: Array of individual job result records
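
As a quick sanity check on these fields, the following sketch prints a one-line summary of a saved
report. It assumes `jq` is installed and that the report contains the top-level fields listed
above; `report_summary` is a hypothetical helper name, not a Torc command.

```bash
# Hypothetical helper (not part of Torc): summarize a saved debug report.
# Assumes jq is installed and the report uses the top-level fields above.
report_summary() {
    local report="$1"
    jq -r '"workflow \(.workflow_id) (\(.workflow_name)) owned by \(.workflow_user): \(.total_results) results, all_runs=\(.all_runs)"' "$report"
}

# Usage: report_summary debug_report.json
```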
### Job Result Records
Each entry in the `results` array contains detailed information about a single job execution:
```json
{
"job_id": 456,
"job_name": "preprocess_data",
"status": "Done",
"run_id": 1,
"return_code": 0,
"completion_time": "2024-01-15T14:30:00.000Z",
"exec_time_minutes": 5.2,
"compute_node_id": 789,
"compute_node_type": "local",
"job_stdout": "torc_output/job_stdio/job_456.o",
"job_stderr": "torc_output/job_stdio/job_456.e",
"job_runner_log": "torc_output/job_runner_hostname_123_1.log"
}
```
**Core Fields**:
- `job_id`: Unique identifier for the job
- `job_name`: Human-readable job name from workflow spec
- `status`: Job status (Done, Terminated, Failed, etc.)
- `run_id`: Workflow run number (increments on reinitialization)
- `return_code`: Exit code from job command (0 = success)
- `completion_time`: ISO 8601 timestamp when job completed
- `exec_time_minutes`: Duration of job execution in minutes
**Compute Node Fields**:
- `compute_node_id`: ID of the compute node that executed the job
- `compute_node_type`: Type of compute node ("local" or "slurm")
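
For a compact view of these fields, one option is to flatten each record into a tab-separated line.
The sketch below lists results longest-running first. It assumes `jq` is installed and that every
record has a numeric `exec_time_minutes`; `results_by_duration` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of Torc): print one tab-separated line per
# job result, longest-running first. Assumes jq and the record fields above.
results_by_duration() {
    local report="$1"
    jq -r '.results
           | sort_by(-.exec_time_minutes)
           | .[]
           | [.job_id, .job_name, .status, .return_code, .exec_time_minutes]
           | @tsv' "$report"
}

# Usage: results_by_duration debug_report.json | head -n 10
```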
## Log File Paths
The report includes paths to all log files associated with each job. The specific files depend on
the compute node type.
### Local Runner Log Files
For jobs executed by the local job runner (`compute_node_type: "local"`):
```json
{
"job_stdout": "torc_output/job_stdio/job_456.o",
"job_stderr": "torc_output/job_stdio/job_456.e",
"job_runner_log": "torc_output/job_runner_hostname_123_1.log"
}
```
**Log File Descriptions**:
1. **job_stdout** (`torc_output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.o`):
   - Standard output from your job command
   - Contains print statements, normal program output
   - **Use for**: Checking expected output, debugging logic errors
2. **job_stderr** (`torc_output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.e`):
   - Standard error from your job command
   - Contains error messages, warnings, stack traces
   - **Use for**: Investigating crashes, exceptions, error messages
3. **job_runner_log** (`torc_output/job_runner_<hostname>_<workflow_id>_<run_id>.log`):
   - Internal Torc job runner logging
   - Shows job lifecycle events, resource monitoring, process management
   - **Use for**: Understanding Torc's job execution behavior, timing issues
**Log path format conventions**:
- Job stdio logs include the job ID in the filename
- Runner logs use hostname, workflow ID, and run ID
- All paths are relative to the specified `--output-dir`
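
Since the report lists every path explicitly, it is easy to re-check later (for example, after
moving or archiving logs) whether those paths still resolve from your current directory. The
following is a minimal sketch; `report_missing_files` is a hypothetical helper, and the usage
comment assumes `jq` and a saved report.

```bash
# Hypothetical helper (not part of Torc): read file paths from stdin, one
# per line, and report the ones that do not exist. Returns non-zero if any
# path is missing.
report_missing_files() {
    local missing=0 path
    while IFS= read -r path; do
        [ -z "$path" ] && continue
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage (assumes jq and a saved report):
#   jq -r '.results[] | .job_stdout, .job_stderr, .job_runner_log' debug_report.json |
#       sort -u | report_missing_files
```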
### Slurm Runner Log Files
For jobs executed via Slurm scheduler (`compute_node_type: "slurm"`), additional log files are
available including Slurm stdout/stderr, environment logs, and dmesg logs.
See [Debugging Slurm Workflows](debugging-slurm.md) for detailed information about Slurm-specific
log files and debugging tools.
## Using the torc-dash Debugging Tab
The torc-dash web interface provides an interactive Debugging tab for visual debugging of workflow
jobs. This is often the quickest way to investigate failed jobs without using command-line tools.
### Accessing the Debugging Tab
1. Start torc-dash (standalone mode recommended for quick setup):
```bash
torc-dash --standalone
```
2. Open your browser to `http://localhost:8090`
3. Select a workflow from the dropdown in the sidebar
4. Click the **Debugging** tab in the navigation
### Features
#### Job Results Report
The Debugging tab provides a report generator with the following options:
- **Output Directory**: Specify where job logs are stored (default: `torc_output`). This must match
the directory used during workflow execution.
- **Include all runs**: Check this to see results from all workflow runs, not just the latest.
Useful for comparing job behavior across reinitializations.
- **Show only failed jobs**: Filter to display only jobs with non-zero return codes. This is checked
by default to help you focus on problematic jobs.
Click **Generate Report** to fetch job results from the server.
#### Job Results Table
After generating a report, the Debugging tab displays an interactive table showing:
- **Job ID**: Unique identifier for the job
- **Job Name**: Human-readable name from the workflow spec
- **Status**: Job completion status (Done, Terminated, etc.)
- **Return Code**: Exit code (0 = success, non-zero = failure)
- **Execution Time**: Duration in minutes
- **Run ID**: Which workflow run the result is from
Click any row to select a job and view its log files.
#### Log File Viewer
When you select a job from the table, the Log File Viewer displays:
- **stdout tab**: Standard output from the job command
  - Shows print statements and normal program output
  - Useful for checking expected behavior and debugging logic
- **stderr tab**: Standard error from the job command
  - Shows error messages, warnings, and stack traces
  - Primary location for investigating crashes and exceptions
Each tab includes:
- **Copy Path** button: Copy the full file path to clipboard
- **File path display**: Shows where the log file is located
- **Scrollable content viewer**: Dark-themed viewer for easy reading
### Quick Debugging Workflow with torc-dash
1. Open torc-dash and select your workflow from the sidebar
2. Go to the **Debugging** tab
3. Ensure "Show only failed jobs" is checked
4. Click **Generate Report**
5. Click on a failed job in the results table
6. Review the **stderr** tab for error messages
7. Check the **stdout** tab for context about what the job was doing
### When to Use torc-dash vs CLI
**Use torc-dash Debugging tab when:**
- You want a visual, interactive debugging experience
- You need to quickly scan multiple failed jobs
- You're investigating jobs and want to easily switch between stdout/stderr
- You prefer not to construct `jq` queries manually
**Use CLI tools (`torc results list --include-logs`) when:**
- You need to automate failure detection in CI/CD
- You want to save reports for archival or version control
- You're working on a remote server without browser access
- You need to process results programmatically
## Common Debugging Workflows
### Investigating Failed Jobs
When a job fails, follow these steps:
1. **Generate the debug report**:
```bash
torc results list --include-logs <workflow_id> > debug_report.json
```
2. **Find the failed job** using `jq` or similar tool:
```bash
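# Jobs with a non-zero return code
jq '.results[] | select(.return_code != 0)' debug_report.json
# Or filter by job status (a failed job can still have status "Done")
jq '.results[] | select(.status == "Done")' debug_report.json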
```
3. **Check the job's stderr** for error messages:
```bash
STDERR_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stderr' debug_report.json)
cat "$STDERR_PATH"
```
4. **Review job stdout** for context:
```bash
STDOUT_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stdout' debug_report.json)
cat "$STDOUT_PATH"
```
5. **Check runner logs** for execution issues:
```bash
LOG_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_runner_log' debug_report.json)
cat "$LOG_PATH"
```
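
When several jobs fail at once, the steps above can be wrapped in a loop instead of being repeated
per job name. This is a minimal sketch under the assumption that `jq` is installed and that the
report paths resolve from the current directory; `show_failed_stderr` is a hypothetical helper
name.

```bash
# Hypothetical helper (not part of Torc): print the last lines of stderr
# for every result with a non-zero return code. Assumes jq is installed.
show_failed_stderr() {
    local report="$1" lines="${2:-20}"
    jq -r '.results[] | select(.return_code != 0) | [.job_name, .job_stderr] | @tsv' "$report" |
    while IFS="$(printf '\t')" read -r name path; do
        echo "=== $name ($path) ==="
        if [ -f "$path" ]; then
            tail -n "$lines" "$path"
        else
            echo "(stderr file not found)"
        fi
    done
}

# Usage: show_failed_stderr debug_report.json 40
```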
### Searching Log Files with Grep
Torc's log messages use a structured `key=value` format that makes them easy to search with `grep`.
This is especially useful for tracing specific jobs or workflows across multiple log files.
**Search for all log entries related to a specific workflow:**
```bash
# Find all log lines for workflow 123
grep -r "workflow_id=123" torc_output/
# Find all log lines for workflow 123 in job runner logs only
grep -r "workflow_id=123" torc_output/job_runner_*.log
```
**Search for a specific job:**
```bash
# Find all log lines for job 456
grep -r "job_id=456" torc_output/
# Find log lines for job 456 with more context (2 lines before/after)
grep -r -C 2 "job_id=456" torc_output/
```
**Combine workflow and job searches:**
```bash
# Find log lines for job 456 in workflow 123
grep -r "workflow_id=123" torc_output/ | grep "job_id=456"
# Alternative using extended regex
grep -rE "workflow_id=123.*job_id=456" torc_output/
```
**Search for specific runs or attempts:**
```bash
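# Find all log lines for run 2 of workflow 123
# (assumes log messages include a run_id key)
grep -r "workflow_id=123" torc_output/ | grep -w "run_id=2"
# Find retry attempts for a specific job
# (assumes retry messages mention "attempt"; adjust to match your logs)
grep -r "job_id=456" torc_output/ | grep -i "attempt"
# Find entries for a specific compute node
grep -r "compute_node_id=789" torc_output/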
```
**Common log message patterns to search for:**
```bash
# Find job start events
grep -r "Job started workflow_id=" torc_output/
# Find job completion events
grep -r "Job completed workflow_id=" torc_output/
# Find failed jobs
grep -r "status=failed" torc_output/
# Find all job process completions with return codes
grep -r "Job process completed" torc_output/ | grep "return_code="
```
**Tip**: Redirect grep output to a file for easier analysis of large result sets:
```bash
grep -r "workflow_id=123" torc_output/ > workflow_123_logs.txt
```
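
To see at a glance which jobs produce the most log activity, the same `key=value` convention can be
tallied with standard tools. The sketch below counts `job_id=<number>` occurrences under an output
directory. It assumes job IDs are logged in exactly that form; `log_lines_per_job` is a
hypothetical helper name.

```bash
# Hypothetical helper (not part of Torc): count job_id=<number> occurrences
# under an output directory, most frequently mentioned job first. Assumes
# the key=value log format described above.
log_lines_per_job() {
    local output_dir="${1:-torc_output}"
    grep -rhoE 'job_id=[0-9]+' "$output_dir" | sort | uniq -c | sort -rn
}

# Usage: log_lines_per_job torc_output | head -n 5
```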
### Example: Complete Debugging Session
```bash
# 1. Generate report
torc results list --include-logs 123 > report.json
# 2. Check overall success/failure counts
echo "Total jobs: $(jq '.total_results' report.json)"
echo "Failed jobs: $(jq '[.results[] | select(.return_code != 0)] | length' report.json)"
# 3. List all failed jobs with their names
jq -r '.results[] | select(.return_code != 0) | "\(.job_id): \(.job_name) (exit code: \(.return_code))"' report.json
# Output:
# 456: process_batch_2 (exit code: 1)
# 789: validate_results (exit code: 2)
# 4. Examine stderr for first failure
cat "$(jq -r '[.results[] | select(.return_code != 0)][0].job_stderr' report.json)"
# Output might show:
# FileNotFoundError: [Errno 2] No such file or directory: 'input/batch_2.csv'
# 5. Check if job dependencies completed successfully
# (The missing file might be an output from a previous job)
jq -r '.results[] | "\(.job_name): \(.status) (exit code: \(.return_code))"' report.json
```
### Debugging Across Multiple Runs
When a workflow has been reinitialized multiple times, compare runs to identify regressions:
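```bash
# Generate report with all historical runs
torc results list --include-logs <workflow_id> --all-runs > full_history.json
# Compare return codes across runs for a specific job
jq -r '.results[] | select(.job_name == "my_failing_job") | "Run \(.run_id): exit code \(.return_code)"' full_history.json
# Output:
# Run 1: exit code 0
# Run 2: exit code 1
# Run 3: exit code 0
# Run 4: exit code 1
# Extract stderr paths for failed runs
jq -r '.results[] | select(.job_name == "my_failing_job" and .return_code != 0) | "Run \(.run_id): \(.job_stderr)"' full_history.json
```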
### Log File Missing Warnings
The `results list --include-logs` command automatically checks for log file existence and prints
warnings to stderr if files are missing:
```
Warning: job stdout log file does not exist for job 456: torc_output/job_stdio/job_456.o
Warning: job runner log file does not exist for job 456: torc_output/job_runner_host1_123_1.log
```
**Common causes of missing log files**:
1. **Wrong output directory**: Ensure `--output-dir` matches the directory used during workflow
execution
2. **Logs not yet written**: Job may still be running or failed to start
3. **Logs cleaned up**: Files may have been manually deleted
4. **Path mismatch**: Output directory moved or renamed after execution
**Solution**: Verify the output directory and ensure it matches what was passed to `torc run` or
`torc slurm schedule-nodes`.
## Output Directory Management
The `--output-dir` parameter must match the directory used during workflow execution:
### Local Runner
```bash
# Execute workflow with specific output directory
torc run <workflow_id> /path/to/my_output
# Generate report using the same directory
torc results list --include-logs <workflow_id> --output-dir /path/to/my_output
```
### Slurm Scheduler
```bash
# Submit jobs to Slurm with output directory
torc slurm schedule-nodes <workflow_id> --output-dir /path/to/my_output
# Generate report using the same directory
torc results list --include-logs <workflow_id> --output-dir /path/to/my_output
```
**Default behavior**: If `--output-dir` is not specified, both the runner and the
`results list --include-logs` command default to `torc_output`.
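
One way to keep the two directories from drifting apart is to define the path once and pass it to
both commands. The following is a minimal sketch for the local runner, reusing the invocations
shown above; `OUTPUT_DIR`, `run_and_report`, and the report filename are hypothetical names, not
Torc settings or commands.

```bash
# Sketch: keep the output directory in one shell variable so that execution
# and reporting always use the same path. OUTPUT_DIR and run_and_report are
# hypothetical names, not Torc settings or commands.
OUTPUT_DIR="${OUTPUT_DIR:-my_output}"

run_and_report() {
    local workflow_id="$1"
    mkdir -p "$OUTPUT_DIR"
    # Run the workflow, then generate the report even if the run failed.
    torc run "$workflow_id" "$OUTPUT_DIR"
    torc results list --include-logs "$workflow_id" \
        --output-dir "$OUTPUT_DIR" > "$OUTPUT_DIR/report_${workflow_id}.json"
}

# Usage: OUTPUT_DIR=/path/to/my_output; run_and_report <workflow_id>
```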
## Best Practices
1. **Generate reports after each run**: Create a debug report immediately after workflow execution
for easier troubleshooting
2. **Archive reports with logs**: Store the JSON report alongside log files for future reference
```bash
torc results list --include-logs "$WF_ID" > "torc_output/report_${WF_ID}_$(date +%Y%m%d_%H%M%S).json"
```
3. **Use version control**: Commit debug reports for important workflow runs to track changes over
time
4. **Automate failure detection**: Use the report in CI/CD pipelines to automatically detect and
report failures
5. **Check warnings**: Pay attention to warnings about missing log files - they often indicate
configuration issues
6. **Combine with resource monitoring**: Use `torc results list --include-logs` for log files and
`torc workflows check-resources` for performance issues
```bash
torc workflows check-resources "$WF_ID"
torc results list --include-logs "$WF_ID" > report.json
```
7. **Filter large reports**: For workflows with many jobs, filter the report to focus on relevant
jobs
```bash
jq '{workflow_id, workflow_name, results: [.results[] | select(.return_code != 0)]}' report.json
```
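
For the failure-detection practice above, a CI step only needs an exit code. The sketch below
returns non-zero when a saved report contains any failed result and lists the offenders on stderr.
It assumes `jq` is installed; `check_report` is a hypothetical helper name, not a Torc command.

```bash
# Hypothetical CI gate (not part of Torc): return non-zero when a saved
# report contains any result with a non-zero return code. Assumes jq.
check_report() {
    local report="$1" failed
    failed=$(jq '[.results[] | select(.return_code != 0)] | length' "$report") || return 2
    if [ "$failed" -gt 0 ]; then
        echo "$failed job(s) failed:" >&2
        jq -r '.results[] | select(.return_code != 0) | "  \(.job_id): \(.job_name) (exit code: \(.return_code))"' "$report" >&2
        return 1
    fi
    echo "All jobs succeeded."
}

# Usage in a CI script: check_report report.json || exit 1
```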
## Troubleshooting Common Issues
### "Output directory does not exist" Error
**Cause**: The specified `--output-dir` path doesn't exist.
**Solution**: Verify the directory exists and the path is correct:
```bash
ls -ld torc_output/ # Check if directory exists
torc results list --include-logs <workflow_id> --output-dir "$(pwd)/torc_output"
```
### Empty Results Array
**Cause**: No job results exist for the workflow (the jobs have not been initialized or have not run yet).
**Solution**: Check workflow status and ensure jobs have completed:
```bash
torc status <workflow_id>
torc results list <workflow_id> # Verify results exist
```
### All Log Paths Show Warnings
**Cause**: Output directory mismatch between execution and report generation.
**Solution**: Verify the output directory used during execution:
```bash
# Check where logs actually are
find . -name "job_*.o" -o -name "job_runner_*.log"
# Use correct output directory in report
torc results list --include-logs <workflow_id> --output-dir <correct_path>
```
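
If the search turns up many files, it can be easier to list candidate directories directly. The
sketch below prints every directory that contains a `job_stdio` subdirectory, which is where job
stdout and stderr files are stored; each result is a candidate value for `--output-dir`.
`find_output_dirs` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of Torc): list directories that look like
# Torc output directories, i.e. that contain a job_stdio subdirectory.
find_output_dirs() {
    local root="${1:-.}" dir
    find "$root" -type d -name job_stdio 2>/dev/null | while IFS= read -r dir; do
        dirname "$dir"
    done
}

# Usage: find_output_dirs ~/projects
```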
## Related Commands
- **`torc results list`**: View summary of job results in table format
- **`torc status`**: Check overall workflow status
- **`torc results list --include-logs`**: Generate debug report with all log file paths
- **`torc workflows check-resources`**: Analyze resource usage and find over-utilized jobs
- **`torc jobs list`**: View all jobs and their current status
- **`torc-dash`**: Launch web interface with interactive Debugging tab
- **`torc tui`**: Launch terminal UI for workflow monitoring
## See Also
- [Working with Logs](working-with-logs.md) — Bundling and analyzing logs
- [Debugging Slurm Workflows](debugging-slurm.md) — Slurm-specific debugging tools