# fleche

A CLI tool for submitting and managing jobs on remote Slurm clusters via SSH. Eliminates the need for manual SSH, rsync, and sbatch boilerplate by providing a single command interface.
## Features

- Submit jobs to remote Slurm clusters via SSH
- Sync project code respecting `.gitignore`, plus explicit input files
- Stream output in real-time by default
- Track job status and download outputs
- Direct SSH execution for quick tests without Slurm
- Exec mode for configured jobs that bypass Slurm (`exec = true`)
- Local execution for running jobs on your machine
- Job chaining via shared workspace
- Job dependencies with `--after` for sequential workflows
- Automatic retries with exponential backoff
- Parameterized jobs via environment variable overrides
- Job tagging for organization and filtering
- Job notes for annotating experiments (with search)
- Job archiving to hide completed jobs without deletion
- Resource statistics via sacct integration
- SOCKS proxy for routing traffic through the cluster
- Shell completions for bash, zsh, and fish
## Installation

```sh
# Build from source
cargo build --release

# The binary is at target/release/fleche

# Or install globally
cargo install --path .
```
## Quick Start

```sh
# Initialize a new project
fleche init

# Edit fleche.toml to configure your remote host and jobs
# Then validate your config
fleche check

# Preview what would be submitted
fleche run train --dry-run

# Submit a job (streams output by default)
fleche run train

# Submit without streaming
fleche run train --bg

# Check status
fleche status

# View logs (defaults to most recent job)
fleche logs

# Download results
fleche download
```
## Configuration

fleche looks for `fleche.toml` in the current directory or parent directories.
### Minimal Example

```toml
[remote]
host = "cluster"        # SSH host from ~/.ssh/config
base_path = "~/fleche"  # Where projects are stored on remote

[jobs.train]
command = "python train.py"
```
### Full Example

```toml
[project]
name = "my-project"        # Optional, defaults to directory name

[remote]
host = "cluster"
base_path = "~/fleche"

[env]                      # Environment variables for all jobs
CACHE_DIR = "/scratch/cache"
OMP_NUM_THREADS = "1"

[slurm]                    # Default Slurm settings
partition = "gpu"
time = "4:00:00"
gpus = 1

[jobs.train]
command = "python train.py"
inputs = ["data/"]         # gitignored files to copy to workspace
outputs = ["checkpoints/"] # files to download after completion

[jobs.train.slurm]         # Override Slurm settings for this job
gpus = 4
time = "24:00:00"
memory = "64G"

[jobs.train.env]           # Additional env vars for this job
RUN_MODE = "default"

[jobs.setup]
command = "bash setup.sh"
exec = true                # Run directly via SSH, skip Slurm
```
## Environment Variable Substitution

Config values support `${VAR}` patterns that expand to built-in variables, system environment variables, previously-defined `[env]` entries, or variables from a `.env` file:

```toml
[project]
name = "graphmind"

[remote]
base_path = "/scratch/${USER}/fleche"

[env]
CACHE = "/scratch/${USER}/cache"
UV_CACHE_DIR = "${CACHE}/uv"
# ${PROJECT} expands to project.name
VIRTUAL_ENV = "${CACHE}/${PROJECT}/.venv"
```
For project-specific variables, create a `.env` file (gitignored):

```
# .env
SSH_USER=k21220155
```

```toml
# fleche.toml
[remote]
base_path = "/scratch/users/${SSH_USER}/fleche"
```

This enables user-agnostic configs that can be committed to version control. Use `${VAR:-default}` for optional variables with fallbacks.
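For instance, a fallback might look like the following sketch (the `WANDB_MODE` variable name is purely illustrative):

```toml
[env]
# Uses the system's WANDB_MODE if set, otherwise "offline"
WANDB_MODE = "${WANDB_MODE:-offline}"
```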
## Separate Job Files

Jobs can also be defined in `fleche/*.toml`. The filename becomes the job name:

```
fleche/
  train.toml
  eval.toml
  experiments/ablation.toml   # -> job name: experiments/ablation
```
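A job file holds the same settings as an inline job table; a minimal sketch, assuming the `command`/`outputs` key names used in the configuration examples above:

```toml
# fleche/eval.toml
command = "python eval.py"
outputs = ["results/"]

[slurm]
time = "1:00:00"
```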
## Commands

| Command | Description |
|---|---|
| `fleche run [job\|cmd] [opts]` | Submit a job to the cluster |
| `fleche rerun <job-id>` | Re-run a previous job with same settings |
| `fleche exec <cmd>` | Run command directly via SSH (no Slurm) |
| `fleche status [job-id]` | Show job status (defaults to listing all) |
| `fleche logs [job-id]` | View job output (defaults to most recent) |
| `fleche download [job-id]` | Pull output files (defaults to most recent) |
| `fleche cancel [job-id]` | Cancel a job (defaults to most recent active) |
| `fleche clean [job-id]` | Remove job and remote files |
| `fleche tags` | List all unique tags across jobs |
| `fleche stats [job-id]` | Show resource usage (time, CPU, memory) |
| `fleche note <job-id> [text]` | View or set job note |
| `fleche wait [job-id]` | Wait for job to complete |
| `fleche proxy -- <cmd>` | Run command through SOCKS proxy to remote |
| `fleche jobs` | List available jobs from configuration |
| `fleche init` | Create starter config |
| `fleche check` | Validate config |
| `fleche guide` | Print comprehensive usage guide |
| `fleche completions <shell>` | Generate shell completions |
All commands except `run`, `rerun`, `exec`, `tags`, `jobs`, `init`, `check`, and `guide` support `--tag` for filtering.
### Run Options

- `--env KEY=VALUE`: set an environment variable for the job (repeatable)
- `--tag <tag>`: tag the job for later filtering
- `--after <job-id>`: start only after another job completes
- `--retry <n>`: retry failed jobs with exponential backoff
- `--note <text>`: attach a note to the job
- `--bg`: submit without streaming output
- `--dry-run`: preview the sbatch script without submitting
### Status Options

- `--tag <tag>`: show only jobs with the given tag
- `--archived`: include archived jobs in the listing
### Filtering Options

Most commands support `--tag` for filtering:

```sh
fleche status --tag ablation
fleche logs --tag ablation
```
## Common Workflows

### Parameterized Jobs

Use `--env` to pass parameters:

```sh
fleche run train --env CONFIG=small.yaml --env EPOCHS=10
```

In your command, reference them as `$CONFIG` and `$EPOCHS`.
### Quick GPU Test

Override the job's command to test the environment: this uses `train`'s Slurm config but runs a different command (for example, a quick `nvidia-smi`).
### Ad-hoc Commands

Run without a job definition:

```sh
fleche run "python scripts/preprocess.py"
```
### Direct SSH Execution

For quick tests without waiting in the Slurm queue:

```sh
fleche exec "nvidia-smi"
```

This syncs your project and runs the command directly over SSH.
### Exec Mode (Configured Direct Execution)

For jobs that should always run directly via SSH (bypassing Slurm), set `exec = true`. Unlike `fleche exec`, exec mode jobs are tracked with full status/logs/cancel support:

```toml
[jobs.setup]
command = "bash setup.sh"
exec = true
```
### Local Execution

Run jobs on your local machine instead of a remote cluster:

```sh
# Run locally via CLI flag
fleche run train --host local
```

```toml
# Or configure in fleche.toml
[remote]
host = "local"
```

Local jobs run directly in the project directory with logs in `.fleche/jobs/{id}/`.

Use `--host local` with `fleche exec` for quick local command execution:

```sh
fleche exec --host local "python train.py"
```
### Job Chaining

Jobs share a workspace, so outputs from one job are available to the next:

```sh
fleche run train   # writes checkpoints/ into the workspace
fleche run eval    # reads checkpoints/ from the same workspace
```

No need for explicit dependencies: files persist in the shared workspace.
### Tagging Jobs

Add tags to track and filter experiments:

```sh
# Tag jobs when submitting
fleche run train --tag ablation

# Filter status by tag
fleche status --tag ablation

# View logs from most recent job with specific tag
fleche logs --tag ablation

# Download outputs from most recent job with tag
fleche download --tag ablation

# Cancel all jobs with a specific tag
fleche cancel --tag ablation

# Clean up old experiment jobs
fleche clean --tag ablation
```

Tags are shown in status output below each job that has them.
## Monitoring

```sh
# View logs (defaults to most recent job)
fleche logs

# Show only the last 50 lines
fleche logs --tail 50

# Show only stdout or only stderr
fleche logs --stdout
fleche logs --stderr

# Stream logs in real-time (Ctrl+C to disconnect; job keeps running)
fleche logs --follow

# Pull outputs while job is still running
fleche download

# Download a specific path
fleche download checkpoints/

# Download only specific file types (searches inside directories)
fleche download --include "*.png"

# Download everything except checkpoints
fleche download --exclude "checkpoints/"

# Preview what would be downloaded
fleche download --dry-run
```
## Cleanup

```sh
# Remove a specific job
fleche clean train-20260115-153042-847-x7k2

# Remove all completed/failed jobs
fleche clean --all

# Remove jobs older than 7 days
fleche clean --older-than 7d

# Also delete the shared workspace
fleche clean --workspace

# Archive a job (hide without deleting)
fleche clean 1 --archive

# Restore an archived job
fleche clean 1 --restore
```
## Listing Available Jobs

```sh
fleche jobs
```

Shows all configured jobs with their commands.
## SOCKS Proxy

Route traffic through the cluster's network:

```sh
fleche proxy -- curl http://internal-service.example:8080
```

Opens an SSH SOCKS tunnel, sets proxy environment variables, runs the command, and tears down the tunnel.
## Architecture

fleche runs entirely on your local machine. All cluster interaction happens via standard Unix tools:

- `ssh` for remote command execution (`sbatch`, `squeue`, `scancel`, `sacct`)
- `rsync` for file synchronization

There is no agent or daemon on the remote server. This approach leverages your existing SSH configuration (`~/.ssh/config`), ssh-agent, ProxyJump, etc.
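Conceptually, a single remote submission is close to what you would type by hand. A rough sketch, with illustrative host, paths, and job ID (the exact commands fleche issues are an internal detail):

```sh
# Sync project code, respecting .gitignore
rsync -az --filter=':- .gitignore' ./ cluster:~/fleche/my-project/.fleche/workspace/

# Submit the generated batch script from the workspace
ssh cluster 'cd ~/fleche/my-project/.fleche/workspace && sbatch ../jobs/train-abc123/job.sbatch'

# Poll job status
ssh cluster 'squeue --me'
```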
## Remote Directory Structure

All jobs share a workspace directory:

```
<base_path>/<project>/
  .fleche/
    workspace/        # Shared workspace (project code + inputs)
      train.py
      data/
      checkpoints/
    jobs/             # Per-job logs and metadata
      train-abc123/
        job.sbatch
        job.out
        job.err
      eval-def456/
        ...
```

- Project code is synced to `workspace/`, respecting `.gitignore`
- Files in `inputs` are copied to `workspace/` (for gitignored data)
- Job commands run with `workspace/` as their working directory
- Job logs go to `jobs/<job-id>/`
- `fleche download` copies `outputs` from `workspace/` to local
## File Locations

| Purpose | Location |
|---|---|
| Project config | `fleche.toml` in repository root |
| Job definitions | `fleche/*.toml` in repository root |
| Job registry | `~/.config/fleche/jobs.db` (SQLite) |
| Remote workspace | `<base_path>/<project>/.fleche/workspace/` |
| Remote job logs | `<base_path>/<project>/.fleche/jobs/<id>/` |
## Job Lifecycle

1. Config loaded from `fleche.toml` and `fleche/*.toml`
2. Job resolved with merged settings (global -> job -> CLI)
3. Job ID generated with timestamp and random suffix
4. Remote directories created (workspace + job dir)
5. Project code synced to workspace via rsync (respects `.gitignore`)
6. Input files synced to workspace
7. sbatch script generated and uploaded to job dir (or exec script if `exec = true`)
8. Job submitted to Slurm (or started directly via SSH if exec mode)
9. Job recorded in local registry
10. Output streamed (unless `--bg`)
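The script generated in step 7 might look roughly like this sketch, with directives drawn from the Slurm options below and the log filenames from the remote directory structure (the real template is an internal detail of fleche):

```sh
#!/bin/bash
# Illustrative generated sbatch script; fleche's actual template may differ
#SBATCH --job-name=train-20260115-153042-847-x7k2
#SBATCH --partition=gpu
#SBATCH --time=4:00:00
#SBATCH --gpus=1
#SBATCH --output=job.out
#SBATCH --error=job.err

cd "$WORKSPACE"                 # job commands run from workspace/
export CACHE_DIR=/scratch/cache # [env] entries are exported
python train.py
```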
## Slurm Options

These can be set in config or passed via CLI:

| Option | sbatch flag | Example |
|---|---|---|
| `partition` | `--partition` | `gpu` |
| `time` | `--time` | `8:00:00` |
| `gpus` | `--gpus` | `1` |
| `cpus` | `--cpus-per-task` | `16` |
| `memory` | `--mem` | `32G` |
| `constraint` | `--constraint` | `a100` |
| `nodes` | `--nodes` | `2` |
| `exclude` | `--exclude` | `node01` |
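Passed via CLI, an override might look like this (assuming the CLI flag names mirror the option names in the table above):

```sh
fleche run train --partition gpu --time 8:00:00 --gpus 2 --memory 64G
```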
## Job Status Values

| Status | Description |
|---|---|
| `pending` | Submitted, waiting in queue |
| `running` | Currently executing |
| `completed` | Finished successfully |
| `failed` | Finished with error |
| `cancelled` | Cancelled by user |
## Requirements

- Rust 1.85+ (for building, required by the Rust 2024 edition)
- SSH access to the remote cluster
- rsync installed locally and on the cluster
- Slurm scheduler on the remote cluster (not required for `exec = true` jobs)
## Platform Support

| Platform | Remote Jobs | Local Jobs |
|---|---|---|
| Linux | Full | Full |
| macOS | Full | Full |
| WSL | Full | Full |
| Windows | Full* | Foreground only |

\*Requires `ssh` and `rsync` in `PATH`.
## Numeric Index Aliases

`fleche status` shows a `#` column with 1-based indices. Use these numbers anywhere a job ID is accepted:

```sh
fleche logs 1
fleche download 2
fleche cancel 3
```
## Tips

- Use `--dry-run` to preview the sbatch script before submitting
- Use `fleche check` to validate config after editing
- Job IDs look like `train-20260115-153042-847-x7k2`, or just use `fleche logs 1`
- Ctrl+C during streaming disconnects but doesn't cancel the job
- Use `fleche exec` for quick ad-hoc tests without Slurm queue wait
- Use `exec = true` in config for jobs that should always bypass Slurm
- Jobs share workspace, so chained jobs can read each other's outputs
- Use `--retry` for flaky jobs that may fail due to transient issues
- Use `--note` to document experiment parameters for future reference
- Use `fleche logs --note <pattern>` to find jobs by note content
- Use `--archive` to hide old jobs without deleting them
- Use `fleche jobs` to see what jobs are available in the project
- Use `fleche proxy` to route traffic through the cluster's network
- Enable shell completions: `fleche completions bash >> ~/.bashrc`
- The job registry is at `~/.config/fleche/jobs.db`
## License

GPLv3