torc 0.12.3

Workflow management system
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this
repository.

## Project Overview

**Torc** is a distributed workflow orchestration system designed for managing complex computational
pipelines with job dependencies, resource requirements, and distributed execution. The system uses a
client-server architecture where:

- **Server**: REST API service (Rust) that manages workflow state, job dependencies, and provides a
  SQLite database for persistence
- **Unified CLI**: Single `torc` binary providing all client functionality (workflow management, job
  execution, TUI, resource plotting)
- **Feature-Gated Binaries**: Optional specialized binaries (`torc-server`, `torc-dash`,
  `torc-mcp-server`, `torc-slurm-job-runner`) built via feature flags
- **Python Client**: CLI and library for workflow management (primarily for Python-based workflows)

**Core Concepts:**

- **Workflows**: Top-level containers for related computational jobs with a unique ID
- **Jobs**: Individual computational tasks with dependencies, resource requirements, and status
  tracking (uninitialized → blocked → ready → pending → running →
  completed/failed/canceled)
- **Files & User Data**: Input/output artifacts that establish implicit job dependencies
- **Resource Requirements**: CPU, memory, GPU, and runtime specifications per job
- **Schedulers**: Local or Slurm-based job execution environments
- **Compute Nodes**: Available compute resources for job execution
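
To make these concepts concrete, a minimal spec might look like the following. This is an
illustrative sketch only; field names follow the concepts above and may not match the exact
`WorkflowSpec` schema (see `examples/sample_workflow.yaml` for the authoritative format):

```yaml
# Hypothetical two-job workflow sketch
name: demo
jobs:
  - name: preprocess
    command: python preprocess.py --out data.parquet
    resource_requirements:
      memory: "4g"
      runtime: PT30M
  - name: train
    command: python train.py --in data.parquet
    depends_on: [preprocess]   # explicit dependency by job name
```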

## Code Quality Requirements

**All code changes must pass the following checks before being committed:**

```bash
cargo fmt -- --check                                  # Rust formatting
cargo clippy --all --all-targets --all-features -- -D warnings  # Rust linting
dprint check                                          # Markdown formatting
```

These checks are enforced by pre-commit hooks installed via `cargo-husky`. The hooks run
automatically on every commit attempt.

**Key requirements:**

- **Rust code**: Must compile without clippy warnings. Use
  `cargo clippy --all --all-targets --all-features -- -D warnings` to verify.
- **Markdown files**: Must comply with dprint formatting with a maximum line length of 100
  characters. Run `dprint fmt` to auto-format or `dprint check` to verify.
- **Before committing**: Always run the checks manually if unsure. The pre-commit hook will block
  commits that fail any check.

For detailed style guidelines, see `docs/src/style-guide.md`.

## Repository Structure

```
torc/
├── src/                 # Main torc library (single crate, feature-gated)
│   ├── bin/             # Binary entry points (torc-server, torc-dash, etc.)
│   ├── client/          # Client modules
│   │   ├── commands/    # CLI command handlers
│   │   ├── apis/        # Generated API client
│   │   ├── workflow_spec.rs    # Workflow specification system
│   │   ├── workflow_manager.rs # Workflow lifecycle management
│   │   ├── job_runner.rs       # Local job execution engine
│   │   └── async_cli_command.rs # Non-blocking job execution
│   ├── server/          # Server implementation modules
│   │   └── api/         # Modular API implementations
│   ├── mcp_server/      # MCP server modules (tools, server)
│   ├── tui/             # Interactive terminal UI modules
│   ├── run_jobs_cmd.rs  # Job runner command module
│   ├── tui_runner.rs    # TUI runner command module
│   ├── plot_resources_cmd.rs # Plot resources command module
│   ├── main.rs          # Unified CLI entry point
│   ├── lib.rs           # Library root
│   └── models.rs        # Shared data models
├── torc-server/         # Server database migrations
│   └── migrations/      # SQLx migration files
├── torc-dash/           # Dashboard static assets
│   └── static/          # Web UI files
├── python_client/       # Python CLI client and library
│   ├── src/torc/        # Python package
│   └── pyproject.toml   # Python project configuration
└── examples/            # Example workflow specifications
```

## Component-Specific Guidance

**For server development**:

- Server build, test, and run commands
- Database migration management
- API endpoint implementation patterns
- Job status lifecycle and critical operations
- Database schema and concurrency model

**For Rust client development**:

- All client code is in `src/client/`
- Unified CLI is built with: `cargo build --bin torc --features "client,tui,plot_resources"`
- Workflow specification system in `src/client/workflow_spec.rs` (JSON/JSON5/YAML/KDL formats)
- Workflow manager and job runner in `src/client/`
- API client integration patterns in `src/client/apis/`
- Resource management and job execution in `src/client/job_runner.rs`

## Quick Start Commands

### Server Operations

```bash
# Build and run server (requires DATABASE_URL in .env)
cargo build --features server-bin --bin torc-server
cargo run --features server-bin --bin torc-server -- run --host localhost -p 8080

# Database migrations
sqlx migrate run --source torc-server/migrations
sqlx migrate revert --source torc-server/migrations
```

### Unified CLI Operations

```bash
# Build unified torc CLI
cargo build --workspace --release

# Set server URL (optional, defaults to localhost:8080)
export TORC_API_URL="http://localhost:8080/torc-service/v1"

# Quick workflow execution (convenience commands)
./target/release/torc run examples/sample_workflow.yaml    # Create and run locally
./target/release/torc submit examples/sample_workflow.yaml # Submit (requires scheduler actions)
./target/release/torc submit-slurm --account myproject examples/sample_workflow.yaml  # Auto-generate Slurm schedulers

# Or use explicit workflow management
./target/release/torc workflows create examples/sample_workflow.yaml
./target/release/torc workflows create-slurm --account myproject examples/sample_workflow.yaml  # With Slurm schedulers
./target/release/torc workflows submit <workflow_id>  # Submit to scheduler
./target/release/torc workflows run <workflow_id>     # Run locally

# Other commands
./target/release/torc tui                              # Launch interactive TUI
./target/release/torc plot-resources output/resource_metrics.db # Generate plots
./target/release/torc workflows list                   # List workflows
./target/release/torc jobs list <workflow_id>          # View job status

# Run tests
cargo test -- --test-threads 1

# Run specific test
cargo test test_get_ready_jobs -- --nocapture
```

### Feature-Gated Binaries

```bash
# Build individual binaries
cargo build --release                          # torc CLI (default features)
cargo build --release --features server-bin    # torc-server + torc-htpasswd
cargo build --release --features dash          # torc-dash
cargo build --release --features mcp-server    # torc-mcp-server
cargo build --release --features slurm-runner  # torc-slurm-job-runner
cargo build --release --all-features           # All binaries
```

## Architecture Overview

### Server Architecture

The server uses a **modular API structure** where each resource type (workflows, jobs, files,
events, etc.) has its own module in `src/server/api/`. Key architectural decisions:

- **Async Tokio Runtime**: 8-worker-thread runtime handles concurrent HTTP requests
- **SQLite with Write Locks**: `BEGIN IMMEDIATE TRANSACTION` prevents race conditions in job
  selection
- **Foreign Key Cascades**: Deleting a workflow automatically removes all associated resources
- **OpenAPI-Generated Base**: Core types and routing generated from OpenAPI spec

**Critical Thread Safety**: The `claim_next_jobs` endpoint uses database-level write locks to
prevent multiple workers from double-allocating jobs to different clients.

### Client Architecture

The Rust client provides a **unified CLI and library interface** with these key components:

1. **Workflow Specification System** (`src/client/workflow_spec.rs`): Declarative workflow
   definitions in JSON/JSON5/YAML with automatic dependency resolution and name-to-ID mapping

2. **Workflow Manager** (`src/client/workflow_manager.rs`): Handles workflow lifecycle (start,
   restart, initialization, validation)

3. **Job Runner** (`src/client/job_runner.rs`): Local parallel job execution with resource
   management (CPU, memory, GPU tracking) and polling-based status updates

4. **Async CLI Command** (`src/client/async_cli_command.rs`): Non-blocking subprocess execution for
   running jobs without blocking the runner

5. **Command Modules**: Binary-specific command modules (`src/run_jobs_cmd.rs`, `src/tui_runner.rs`,
   `src/plot_resources_cmd.rs`) that are reused by the unified CLI and feature-gated binaries

6. **Interactive TUI** (`src/tui/`): Terminal-based UI for workflow monitoring and management

### Data Flow

1. **Workflow Creation**: User creates workflow from spec file → Server creates workflow, files,
   jobs, dependencies → Returns workflow_id

2. **Initialization**: Client calls `initialize_jobs` → Server builds dependency graph from
   file/user_data relationships → Jobs with satisfied dependencies marked `ready`

3. **Execution**: Job runner polls server for ready jobs → Checks available resources → Submits jobs
   via AsyncCliCommand → Monitors completion → Updates server status → Triggers dependent jobs

### Logging

In log messages that refer to database records, use the key=value format
`"workflow_id={} job_id={}"` so that log parsing tools can extract the IDs mechanically.

## Testing Strategy

### Rust Client Tests

- Integration tests in `tests/`
- Use `serial_test` attribute for tests that modify shared state
- Test utilities in `tests/common/`
- Run with: `cargo test` from rust-client directory

## Important Notes

### Job Status as Integer

Job status values are stored as INTEGER (0-10) in the database, not strings:

- 0 = uninitialized
- 1 = blocked
- 2 = ready
- 3 = pending
- 4 = running
- 5 = completed
- 6 = failed
- 7 = canceled
- 8 = terminated
- 9 = disabled
- 10 = pending_failed (awaiting AI classification)

### Resource Formats

- **Memory**: String format like "1m", "2g", "512k"
- **Runtime**: ISO8601 duration format (e.g., "PT30M" for 30 minutes, "PT2H" for 2 hours, "P0DT1M"
  for 1 minute)
- **Timestamps**: Unix timestamps as float64 for file modification times
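
Assuming a spec uses these formats directly, a resource block might read as follows (a sketch; the
field names are illustrative, only the value formats are documented above):

```yaml
resource_requirements:
  memory: "2g"     # string with k/m/g suffix
  runtime: PT2H    # ISO8601 duration: 2 hours
  num_cpus: 4
```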

### Dependencies

- **Explicit**: Defined in `job_depends_on` table
- **Implicit**: Derived from file and user_data input/output relationships
- **Resolution**: Job specifications use names (`depends_on`), which are resolved to IDs during
  creation

### Job and File Parameterization

JobSpec and FileSpec support **parameterization** to automatically generate multiple instances from
a single specification:

- **Purpose**: Create parameter sweeps, hyperparameter tuning, or multi-dataset workflows without
  manual duplication
- **Syntax**: Add `parameters` field with parameter names and values (ranges, lists, or single
  values)
- **Expansion**: During `create_workflow_from_spec()`, parameterized specs are expanded via
  Cartesian product before creation
- **Template Substitution**: Use `{param_name}` or `{param_name:format}` in names, commands, paths,
  and dependencies
- **Format Specifiers**:
  - `{i:03d}` for zero-padded integers (e.g., 001, 042, 100)
  - `{lr:.4f}` for float precision (e.g., 0.0010, 0.1000)
- **Parameter Formats**:
  - Integer ranges: `"1:100"` (inclusive) or `"0:100:10"` (with step)
  - Float ranges: `"0.0:1.0:0.1"`
  - Lists: `"[1,5,10]"` or `"['train','test','validation']"`
- **Multi-dimensional**: Multiple parameters create Cartesian product (e.g., 3 learning rates × 3
  batch sizes = 9 jobs)
- **Implementation**: `parameter_expansion.rs` module with `ParameterValue` enum and expansion
  functions
- **Examples**: See `examples/hundred_jobs_parameterized.yaml`, `hyperparameter_sweep.yaml`, and
  `data_pipeline_parameterized.yaml`
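
Putting the pieces above together, a parameterized job spec might look like this (a sketch using
the documented syntax; exact field placement may differ from the real schema, see the example files
named above):

```yaml
jobs:
  - name: train_{i:03d}_lr{lr:.4f}
    command: python train.py --seed {i} --lr {lr}
    parameters:
      i: "1:3"             # integer range, inclusive: 1, 2, 3
      lr: "[0.001,0.01]"   # list of floats
# Cartesian product expands this single spec into 3 × 2 = 6 jobs,
# e.g. train_001_lr0.0010, train_001_lr0.0100, ..., train_003_lr0.0100
```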

### Pagination

List endpoints support `offset` and `limit` query parameters:

- Default limit: 10,000 records
- Maximum limit: 10,000 records (enforced)

### Job Completion and Unblocking

**CRITICAL**: Job completions trigger unblocking of dependent jobs via a background task for
performance reasons.

- When a job completes, `manage_job_status_change` sets `unblocking_processed = 0` and signals the
  background task
- The background task runs periodically and processes all pending unblocks in batch
- **Do NOT add direct calls to `unblock_jobs_waiting_for` in the completion path** - this would hurt
  performance
- The API endpoint `manage_status_change` should NOT be used to set completion statuses; use
  `complete_job` instead
- Tests simulating job completions MUST use `complete_job` (not `manage_status_change`) to ensure
  proper unblocking

### OpenAPI Code Generation

- The server and client originally used OpenAPI-generated code for base types and routing, but the
  code is now maintained by hand.
- Implement business logic in non-generated modules (e.g., `src/server/api/*.rs`)

## Common Tasks

### Adding a New API Endpoint

1. Update OpenAPI spec (api/openapi.yaml)
2. Regenerate API code (`cd api && bash make_api_clients.sh`)
3. Add implementation in appropriate `src/server/api/*.rs` module
4. Update client API in `src/client/apis/`
5. Add CLI command handler if needed in `src/client/commands/`

### Adding a New CLI Subcommand

For a new subcommand (e.g., `torc workflows correct-resources`):

1. **Implement handler function** in `src/client/commands/{module}.rs`
   - Follow existing pattern from other commands in the same file
   - Use `#[command(...)]` attributes for clap configuration

2. **Add to enum variant** in the `#[derive(Subcommand)]` enum
   - Add field struct with `#[arg(...)]` attributes for options/flags
   - Use `#[command(name = "...")]` to set the subcommand name

3. **Update help template** (if applicable)
   - For `workflows` commands: Update `WORKFLOWS_HELP_TEMPLATE` constant at top of file
   - Add entry to the appropriate category with description (format: `command_name   Description`)
   - Use ANSI color codes for consistency: `\x1b[1;36m` for command, `\x1b[1;32m` for category

4. **Remove `hide = true`** if command should be visible
   - Default behavior shows command in help unless explicitly hidden

5. **Add well-formatted help text** in `#[command(...)]` attribute
   - Use `after_long_help = "..."` for detailed examples
   - Examples are shown when user runs `torc workflows command-name --help`

6. **Wire up in match statement**
   - Add case in the match block that calls your handler function (usually around line 3400+)

### Creating a Workflow from Specification

1. Write workflow spec file (JSON/JSON5/YAML) following `WorkflowSpec` format
2. See `examples/sample_workflow.json` for complete example
3. Run: `torc workflows create <spec_file>`
4. The command creates all components (workflow, jobs, files, user_data, schedulers) atomically
5. If any step fails, the entire workflow is rolled back

### Running a Workflow Locally

**Quick method:**

- `torc run <spec_file>` - Create from spec and run locally in one step
- `torc run <workflow_id>` - Run existing workflow locally

**Explicit method:**

1. Create workflow: `torc workflows create <spec_file>`
2. Run workflow: `torc workflows run <workflow_id>`
3. Monitor progress: `torc workflows status <workflow_id>`
4. View job results: `torc jobs list <workflow_id>`
5. Launch interactive UI: `torc tui`

### Submitting a Workflow to Scheduler

**Quick method (Slurm with auto-generated schedulers):**

- `torc submit-slurm --account <account> <spec_file>` - Auto-generate Slurm schedulers, create
  workflow, and submit

**Quick method (pre-configured schedulers):**

- `torc submit <spec_file>` - Create from spec and submit (requires on_workflow_start/schedule_nodes
  action in spec)
- `torc submit <workflow_id>` - Submit existing workflow to scheduler

**Explicit method:**

1. Create workflow: `torc workflows create <spec_file>` or
   `torc workflows create-slurm --account <account> <spec_file>`
2. Submit workflow: `torc workflows submit <workflow_id>`

### Debugging

**Server SQL Queries**:

```bash
RUST_LOG=sqlx=debug cargo run --bin torc-server -- run
```

**Client Verbose Output**:

```bash
RUST_LOG=debug torc workflows list
```

**Database Inspection**:

```bash
sqlite3 server/db/sqlite/dev.db
```

## Configuration

### Server Configuration

- `DATABASE_URL`: SQLite database path (configured in `.env`)
- Default: `sqlite:db/sqlite/dev.db`

### Client Configuration

- `TORC_API_URL`: Torc service URL (env var or `--url` flag)
- Default: `http://localhost:8080/torc-service/v1`
- `USER` or `USERNAME`: Workflow owner (auto-detected from environment)

## Development Workflow

1. **Start Server**: `cargo run --features server-bin --bin torc-server -- run`
2. **Build Unified CLI**: `cargo build --release --bin torc --features "client,tui,plot_resources"`
3. **Quick Execution**: `torc run examples/sample_workflow.yaml` OR
   `torc submit examples/sample_workflow.yaml`
4. **Or Explicit**: `torc workflows create examples/sample_workflow.yaml`, then
   `torc workflows run <id>`

**Note**: The server is run as a separate binary (`torc-server run`), not through the unified CLI.

## CLI Commands Quick Reference

**Quick Workflow Execution** (convenience commands):

- `torc run <spec_file|id>` - Create from spec and run locally, or run existing workflow
- `torc submit <spec_file|id>` - Submit workflow to scheduler (requires pre-configured scheduler
  actions)

**Workflow Management**:

- `torc workflows create <file>` - Create workflow from specification
- `torc workflows new` - Create empty workflow interactively
- `torc workflows list` - List all workflows
- `torc workflows submit <id>` - Submit workflow to scheduler (requires
  on_workflow_start/schedule_nodes action)
- `torc workflows run <id>` - Run workflow locally
- `torc workflows initialize <id>` - Initialize workflow (set up dependencies without execution)
- `torc workflows status <id>` - Check workflow status

**Job Management**:

- `torc jobs list <workflow_id>` - List jobs for workflow
- `torc jobs get <job_id>` - Get job details
- `torc jobs update <job_id>` - Update job status

**Reports**:

- `torc reports summary <workflow_id>` - Workflow execution summary and job statistics
- `torc reports results <workflow_id>` - Job execution results with resource metrics
- `torc reports check-resource-utilization <workflow_id>` - Check for resource violations

**Execution**:

- `torc run <workflow_spec_or_id>` - Run workflow locally (top-level command)
- `torc submit <workflow_spec_or_id>` - Submit workflow to scheduler (top-level command)
- `torc submit-slurm --account <account> <spec_file>` - Submit with auto-generated Slurm schedulers
- `torc tui` - Interactive terminal UI

**Global Options** (available on all commands):

- `--url <URL>` - Torc server URL (can also use `TORC_API_URL` env var)
- `-f, --format <FORMAT>` - Output format (table or json)

## Additional Resources

- Example workflow specifications: `examples/`