fleche 6.14.1

Remote job runner for Slurm clusters

A CLI tool for submitting and managing jobs on remote Slurm clusters via SSH. Eliminates the need for manual SSH, rsync, and sbatch boilerplate by providing a single command interface.

Features

  • Submit jobs to remote Slurm clusters via SSH
  • Sync project code respecting .gitignore, plus explicit input files
  • Stream output in real-time by default
  • Track job status and download outputs
  • Direct SSH execution for quick tests without Slurm
  • Exec mode for configured jobs that bypass Slurm (exec = true)
  • Local execution for running jobs on your machine
  • Job chaining via shared workspace
  • Job dependencies with --after for sequential workflows
  • Automatic retries with exponential backoff
  • Parameterized jobs via environment variable overrides
  • Job tagging for organization and filtering
  • Job notes for annotating experiments (with search)
  • Job archiving to hide completed jobs without deletion
  • Resource statistics via sacct integration
  • SOCKS proxy for routing traffic through the cluster
  • Shell completions for bash, zsh, and fish

Installation

# Build from source
cargo build --release
# The binary is at target/release/fleche

# Or install globally
cargo install --path .

Quick Start

# Initialize a new project
fleche init

# Edit fleche.toml to configure your remote host and jobs
# Then validate your config
fleche check

# Preview what would be submitted
fleche run <job-name> --dry-run

# Submit a job (streams output by default)
fleche run <job-name>

# Submit without streaming
fleche run <job-name> --bg

# Check status
fleche status

# View logs (defaults to most recent job)
fleche logs

# Download results
fleche download

Configuration

fleche looks for fleche.toml in the current directory or parent directories.

Minimal Example

[remote]
host = "cluster"          # SSH host from ~/.ssh/config
base_path = "~/fleche"    # Where projects are stored on remote

[jobs.train]
command = "python train.py"

Full Example

[project]
name = "my-project"       # Optional, defaults to directory name

[remote]
host = "cluster"
base_path = "~/fleche"

[env]                     # Environment variables for all jobs
HF_HOME = "/scratch/cache"
PYTHONUNBUFFERED = "1"

[slurm]                   # Default Slurm settings
partition = "gpu"
time = "4:00:00"
gpus = 1

[jobs.train]
command = "python train.py"
inputs = ["data/"]        # gitignored files to copy to workspace
outputs = ["checkpoints/"]  # files to download after completion

[jobs.train.slurm]        # Override Slurm settings for this job
gpus = 4
time = "24:00:00"
memory = "64G"

[jobs.train.env]          # Additional env vars for this job
CONFIG = "default"

[jobs.setup]
command = "bash setup.sh"
exec = true               # Run directly via SSH, skip Slurm

Environment Variable Substitution

Config values support ${VAR} patterns that expand to built-in variables, system environment variables, previously-defined [env] entries, or variables from a .env file:

[project]
name = "graphmind"

[remote]
base_path = "/scratch/${USER}/fleche"

[env]
CACHE = "/scratch/${USER}/cache"
UV_CACHE = "${CACHE}/uv"
# ${PROJECT} expands to project.name
UV_PROJECT_ENVIRONMENT = "${CACHE}/${PROJECT}/.venv"

For project-specific variables, create a .env file (gitignored):

# .env
SSH_USER=k21220155
# fleche.toml
[remote]
base_path = "/scratch/users/${SSH_USER}/fleche"

This enables user-agnostic configs that can be committed to version control. Use ${VAR:-default} for optional variables with fallbacks.
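
For instance, a fallback keeps the config working even when the optional variable is unset (SCRATCH_ROOT here is a hypothetical variable, not one fleche defines):

```toml
# fleche.toml
[remote]
# Uses $SCRATCH_ROOT if set (e.g. via .env), otherwise /scratch
base_path = "${SCRATCH_ROOT:-/scratch}/${USER}/fleche"
```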

Separate Job Files

Jobs can also be defined in fleche/*.toml. The filename becomes the job name:

fleche/
  train.toml
  eval.toml
  experiments/ablation.toml  # -> job name: experiments/ablation
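
The per-file schema is not spelled out above; assuming each file simply carries the body of a [jobs.<name>] table (keys taken from the full example, so treat the exact shape as an assumption), fleche/train.toml might look like:

```toml
# fleche/train.toml -> job name: train
command = "python train.py"
inputs = ["data/"]
outputs = ["checkpoints/"]
```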

Commands

Command                       Description
fleche run [job|cmd] [opts]   Submit a job to the cluster
fleche rerun <job-id>         Re-run a previous job with the same settings
fleche exec <cmd>             Run a command directly via SSH (no Slurm)
fleche status [job-id]        Show job status (defaults to listing all)
fleche logs [job-id]          View job output (defaults to most recent)
fleche download [job-id]      Pull output files (defaults to most recent)
fleche cancel [job-id]        Cancel a job (defaults to most recent active)
fleche clean [job-id]         Remove a job and its remote files
fleche tags                   List all unique tags across jobs
fleche stats [job-id]         Show resource usage (time, CPU, memory)
fleche note <job-id> [text]   View or set a job note
fleche wait [job-id]          Wait for a job to complete
fleche proxy -- <cmd>         Run a command through a SOCKS proxy to the remote
fleche jobs                   List available jobs from the configuration
fleche init                   Create a starter config
fleche check                  Validate the config
fleche guide                  Print a comprehensive usage guide
fleche completions <shell>    Generate shell completions

All commands except run, rerun, exec, tags, jobs, init, check, and guide support --tag for filtering.

Run Options

fleche run <job-name> [options]

Options:
  --command <cmd>       Override or provide command
  --env <KEY=VALUE>     Set environment variable (repeatable)
  --tag <KEY=VALUE>     Add tag for filtering (repeatable)
  --note <text>         Add a note/annotation to the job
  --host <host>         Run on specific host ("local" for local execution)
  --exec                Run directly via SSH instead of submitting to Slurm
  --after <job-id>      Run after another job completes successfully
  --retry <n>           Retry up to n times on failure (exponential backoff)
  --partition <name>    Override Slurm partition
  --time <duration>     Override wall time
  --gpus <n>            Override GPU count
  --cpus <n>            Override CPU count
  --memory <size>       Override memory
  --constraint <str>    Override constraint
  --bg                  Run in background (don't stream output)
  --dry-run             Print sbatch script without submitting

Status Options

fleche status [job-id] [options]

Options:
  --filter <status>     Filter by status (repeatable: pending, running, completed, failed, cancelled)
  --tag <KEY=VALUE>     Filter by tag (repeatable)
  -n, --last <N>        Number of jobs to show (default: 20)
  --archived            Show only archived jobs
  --all-jobs            Show all jobs including archived

Filtering Options

Most commands support --tag for filtering:

fleche logs --tag <KEY=VALUE>       # Logs from most recent job with tag
fleche download --tag <KEY=VALUE>   # Download from most recent job with tag
fleche cancel --tag <KEY=VALUE>     # Cancel most recent active job with tag
fleche cancel --all --tag <K=V>     # Cancel all active jobs with tag
fleche clean --all --tag <KEY=VALUE>  # Clean all finished jobs with tag

Common Workflows

Parameterized Jobs

Use --env to pass parameters:

fleche run train --env CONFIG=llama_basic --env EPOCHS=100

In your command, reference as $CONFIG and $EPOCHS.
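
A job script can then pick these up as ordinary environment variables; a minimal sketch, with illustrative fallback values for runs where no overrides are passed:

```shell
#!/bin/bash
# Parameters arrive via the environment (set by --env).
# The :- fallbacks below are only illustrative defaults.
CONFIG="${CONFIG:-default}"
EPOCHS="${EPOCHS:-10}"
echo "training with config=$CONFIG for $EPOCHS epochs"
```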

Quick GPU Test

Override command to test environment:

fleche run train --command "nvidia-smi"

Uses train's Slurm config but runs a different command.

Ad-hoc Commands

Run without a job definition:

fleche run "python test.py" --partition cpu --time 0:30:00

Direct SSH Execution

For quick tests without waiting in the Slurm queue:

fleche exec "python test.py"
fleche exec "ls -la"

This syncs your project and runs the command directly over SSH.

Exec Mode (Configured Direct Execution)

For jobs that should always run directly via SSH (bypassing Slurm), set exec = true. Unlike fleche exec, exec mode jobs are tracked with full status/logs/cancel support:

[jobs.setup]
command = "bash setup.sh"
exec = true

fleche run setup            # Runs directly via SSH (foreground)
fleche run setup --bg       # Runs in background
fleche run train --exec     # Override: skip Slurm for this run only

Local Execution

Run jobs on your local machine instead of a remote cluster:

# Run locally via CLI flag
fleche run train --host local

# Or configure in fleche.toml
[jobs.test]
command = "python test.py"
host = "local"

Local jobs run directly in the project directory with logs in .fleche/jobs/{id}/. Use --host local with fleche exec for quick local command execution:

fleche exec "python -c 'print(1+1)'" --host local

Job Chaining

Jobs share a workspace, so outputs from one job are available to the next:

fleche run train          # Creates checkpoints/
fleche run eval           # Can read checkpoints/ from train
fleche download           # Download results from eval

No explicit dependencies are needed; files persist in the shared workspace.
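
A chained pair might be configured like this (eval.py and its argument are placeholders, not part of fleche):

```toml
[jobs.train]
command = "python train.py"
outputs = ["checkpoints/"]

[jobs.eval]
# Runs in the same workspace, so checkpoints/ from train is visible.
command = "python eval.py --checkpoints checkpoints/"
```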

Tagging Jobs

Add tags to track and filter experiments:

# Tag jobs when submitting
fleche run train --tag experiment=ablation --tag model=8b
fleche run train --tag experiment=baseline --tag model=8b

# Filter status by tag
fleche status --tag experiment=ablation
fleche status --tag model=8b --filter running

# View logs from most recent job with specific tag
fleche logs --tag experiment=ablation

# Download outputs from most recent job with tag
fleche download --tag experiment=ablation

# Cancel all jobs with a specific tag
fleche cancel --all --tag experiment=test

# Clean up old experiment jobs
fleche clean --all --tag experiment=old
fleche clean --older-than 7d --tag experiment=ablation

Tags are shown in status output below each job that has them.

Monitoring

# View logs (defaults to most recent job)
fleche logs

# Show only the last 50 lines
fleche logs -n 50

# Show only stdout or only stderr
fleche logs --stdout
fleche logs --stderr

# Stream logs in real-time (Ctrl+C to disconnect; job keeps running)
fleche logs --follow

# Pull outputs while job is still running
fleche download --partial

# Download a specific path
fleche download --path results/metrics.json

# Download only specific file types (searches inside directories)
fleche download --filter "*.json" --filter "*.csv"

# Download everything except checkpoints
fleche download --filter "!checkpoints/**"

# Preview what would be downloaded
fleche download --dry-run

Cleanup

# Remove a specific job
fleche clean <job-id>

# Remove all completed/failed jobs
fleche clean --all

# Remove jobs older than 7 days
fleche clean --older-than 7d

# Also delete the shared workspace
fleche clean --all --workspace

# Archive a job (hide without deleting)
fleche clean --archive <job-id>

# Restore an archived job
fleche clean --unarchive <job-id>

Listing Available Jobs

fleche jobs

Shows all configured jobs with their commands.

SOCKS Proxy

Route traffic through the cluster's network:

fleche proxy -- curl https://example.com
fleche proxy -- wget https://huggingface.co/model/weights.bin

Opens an SSH SOCKS tunnel, sets proxy environment variables, runs the command, and tears down the tunnel.

Architecture

fleche runs entirely on your local machine. All cluster interaction happens via standard Unix tools:

  • ssh for remote command execution (sbatch, squeue, scancel, sacct)
  • rsync for file synchronization

There is no agent or daemon on the remote server. This approach leverages your existing SSH configuration (~/.ssh/config), ssh-agent, ProxyJump, etc.
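
The .gitignore-aware sync relies on a standard rsync feature; here is a small local demonstration of that filter rule (fleche's exact rsync invocation is not documented here, so treat the flags as illustrative):

```shell
# Toy project: one tracked file, one gitignored file.
mkdir -p demo/src demo/dst
printf '*.log\n' > demo/src/.gitignore
touch demo/src/keep.py demo/src/skip.log

# ':- .gitignore' makes rsync read exclude patterns from
# .gitignore files found in each directory it visits.
rsync -a --filter=':- .gitignore' demo/src/ demo/dst/

ls demo/dst    # skip.log was excluded
```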

Remote Directory Structure

All jobs share a workspace directory:

<base_path>/<project>/
  .fleche/
    workspace/          # Shared workspace (project code + inputs)
      train.py
      data/
      checkpoints/
    jobs/               # Per-job logs and metadata
      train-abc123/
        job.sbatch
        job.out
        job.err
      eval-def456/
        ...

  • Project code is synced to workspace/, respecting .gitignore
  • Files in inputs are copied to workspace/ (for gitignored data)
  • Job commands run with workspace/ as their working directory
  • Job logs go to jobs/<job-id>/
  • fleche download copies outputs from workspace/ to local

File Locations

Purpose            Location
Project config     fleche.toml in the repository root
Job definitions    fleche/*.toml in the repository root
Job registry       ~/.config/fleche/jobs.db (SQLite)
Remote workspace   <base_path>/<project>/.fleche/workspace/
Remote job logs    <base_path>/<project>/.fleche/jobs/<id>/

Job Lifecycle

  1. Config loaded from fleche.toml and fleche/*.toml
  2. Job resolved with merged settings (global -> job -> CLI)
  3. Job ID generated with timestamp and random suffix
  4. Remote directories created (workspace + job dir)
  5. Project code synced to workspace via rsync (respects .gitignore)
  6. Input files synced to workspace
  7. sbatch script generated and uploaded to job dir (or exec script if exec = true)
  8. Job submitted to Slurm (or started directly via SSH if exec mode)
  9. Job recorded in local registry
  10. Output streamed (unless --bg)
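
The script from step 7 is plain sbatch input; a hedged sketch of its likely shape, with every path, directive, and value purely illustrative (use --dry-run to see the real script):

```
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --time=4:00:00
#SBATCH --gpus=1
#SBATCH --output=/home/user/fleche/my-project/.fleche/jobs/train-abc123/job.out
#SBATCH --error=/home/user/fleche/my-project/.fleche/jobs/train-abc123/job.err

# merged [env] / [jobs.train.env] values
export PYTHONUNBUFFERED=1
export CONFIG=default

cd /home/user/fleche/my-project/.fleche/workspace
python train.py
```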

Slurm Options

These can be set in config or passed via CLI:

Option       sbatch flag        Example
partition    --partition        gpu
time         --time             8:00:00
gpus         --gpus             1
cpus         --cpus-per-task    16
memory       --mem              32G
constraint   --constraint       a100
nodes        --nodes            2
exclude      --exclude          node01

Job Status Values

Status      Description
pending     Submitted, waiting in the queue
running     Currently executing
completed   Finished successfully
failed      Finished with an error
cancelled   Cancelled by the user

Requirements

  • Rust 1.85+ (for building; required by the Rust 2024 edition)
  • SSH access to the remote cluster
  • rsync installed locally and on the cluster
  • Slurm scheduler on the remote cluster (not required for exec = true jobs)

Platform Support

Platform   Remote Jobs   Local Jobs
Linux      Full          Full
macOS      Full          Full
WSL        Full          Full
Windows    Full*         Foreground only

*Requires ssh and rsync in PATH.

Numeric Index Aliases

fleche status shows a # column with 1-based indices. Use these numbers anywhere a job ID is accepted:

fleche logs 1           # Logs for most recent job
fleche cancel 1         # Cancel most recent job
fleche download 2       # Download outputs from job #2
fleche status 1         # Detailed status for job #1

Tips

  • Use --dry-run to preview the sbatch script before submitting
  • Use fleche check to validate config after editing
  • Job IDs look like train-20260115-153042-847-x7k2 — or just use fleche logs 1
  • Ctrl+C during streaming disconnects but doesn't cancel the job
  • Use fleche exec for quick ad-hoc tests without Slurm queue wait
  • Use exec = true in config for jobs that should always bypass Slurm
  • Jobs share workspace, so chained jobs can read each other's outputs
  • Use --retry for flaky jobs that may fail due to transient issues
  • Use --note to document experiment parameters for future reference
  • Use fleche logs --note <pattern> to find jobs by note content
  • Use --archive to hide old jobs without deleting them
  • Use fleche jobs to see what jobs are available in the project
  • Use fleche proxy to route traffic through the cluster's network
  • Enable shell completions: fleche completions bash >> ~/.bashrc
  • The job registry is at ~/.config/fleche/jobs.db

License

GPLv3