fleche 6.22.0

Remote job runner for Slurm clusters
---
name: fleche
description: Reference documentation for fleche CLI (remote Slurm job runner). Use when working with fleche.toml, submitting or monitoring jobs, downloading results, or troubleshooting fleche.
---

# Fleche (Remote Job Submission)

`fleche` is a utility for running jobs on remote Slurm clusters via SSH.
Configuration is in `fleche.toml`. Run `fleche skill --install` to install this reference for AI coding agents.

## Key Concepts

- **Check `fleche.toml` first** for available jobs (or run `fleche jobs`)
- Most commands default to the most recent job if no job ID is given
- Short ID suffix works (e.g., `x7k2` instead of full `train-20260115-153042-847-x7k2`)
- Numeric index aliases from `fleche status` work anywhere a job ID is accepted (e.g., `fleche logs 1`)
- Config supports `${VAR}` substitution from env vars, `.env` file, and `${PROJECT}` built-in
- **`--filter` vs `--tag` vs `--name`**: `--filter` is for job STATUS, `--tag` is for your custom tags, `--name` is regex on job ID
- Use `--json` flag on supported commands for machine-readable output

## Quick Start

```bash
fleche init                          # Create starter fleche.toml
fleche check                         # Validate config
fleche run <job> --dry-run           # Preview sbatch script + files to sync
fleche run <job>                     # Submit and stream output
fleche run <job> --bg                # Submit without streaming
fleche run <job> --bg --notify       # Background + terminal notification
fleche run <job> --ntfy my-topic     # Push notifications via ntfy.sh
fleche wait <job-id>                 # Wait for completion
fleche status                        # Check status
fleche logs                          # View logs (most recent job)
fleche download                      # Download results
```

## Running Jobs

```bash
fleche run <job>                              # Submit and stream output (Ctrl+C disconnects, job keeps running)
fleche run <job> --bg                         # Run in background (--notify for alerts)
fleche run <job> --env VAR=value --tag key=value  # Set env vars and tags
fleche run <job> --note "description"         # Add note to document experiment
fleche run <job> --command "nvidia-smi"       # Override command (keeps job's Slurm config)
fleche run <job> --dry-run                    # Preview sbatch script + files to sync without submitting
fleche run <job> --host local                 # Run locally instead of on remote Slurm cluster
fleche run <job> --after <job-id>             # Run after another job completes (dependency)
fleche run <job> --retry 3                    # Auto-retry on failure with exponential backoff
fleche run <job> --exec                       # Bypass Slurm, run directly via SSH for this run
fleche run <job> --ntfy my-topic              # Push notifications via ntfy.sh on state changes
fleche run "command" --gpus 1 --time 1:00:00  # Adhoc Slurm command (no job definition)
fleche rerun <job-id>                         # Re-run previous job with same settings
fleche exec <cmd>                             # Run directly via SSH, no Slurm (quick tests)
fleche exec <cmd> --no-sync                   # Skip project sync (code already on remote)
fleche exec <cmd> --host local                # Run command locally without SSH
```

## Monitoring

```bash
fleche status -n 20                     # Show last 20 jobs
  --filter running                      #   Filter by status (running/pending/completed/failed/cancelled)
  --tag key=value                       #   Filter by tag
  --name 'pattern'                      #   Filter by job ID regex (substring match, use ^/$ to anchor)
  --archived                            #   Show only archived jobs
  --all-jobs                            #   Show all jobs including archived
fleche logs [job-id]                    # View logs (--raw to strip ANSI, --follow to stream)
  -n 50                                 #   Show only last N lines
  --stdout / --stderr                   #   Show only one stream
  --note 'pattern'                      #   Filter by note content (case-insensitive regex)
fleche wait [job-id]                    # Wait for completion (--notify for alerts, --ntfy for push)
fleche stats [job-id]                   # Show resource usage (elapsed time, CPU time, max memory)
fleche note <job-id> [text]             # View or set job note
fleche ping                             # Check Slurm cluster health
fleche check                            # Validate config after editing
fleche check --remote                   # Validate config against remote server (SSH, Slurm, disk space)
fleche doctor                           # Comprehensive troubleshooting diagnostics
fleche compare <a> <b>                  # Compare two job configurations side-by-side
fleche tags                             # List unique tags across all jobs
fleche jobs                             # List available jobs from configuration
fleche proxy -- <cmd>                   # Route traffic through SSH SOCKS tunnel to remote host
```

## Results

```bash
fleche download [job-id]                # Download output files (--partial while job running)
  --filter "*.json"                     #   Download only specific file types (repeatable, recursive)
  --filter "!checkpoints/**"            #   Exclude files/directories with ! prefix
  --dry-run                             #   Preview what would be downloaded
```

## Cleanup

```bash
fleche cancel [job-id]                  # Cancel job (--all for all active, --tag to filter)
fleche cancel --dry-run                 # Preview what would be cancelled
fleche clean [job-id]                   # Archive job (default: hides without deleting)
fleche clean --all                      # Archive all finished jobs
fleche clean --all --filter failed      # Archive only failed jobs
fleche clean --older-than 2h -y         # Archive old jobs periodically
fleche clean --delete [job-id]          # Permanently delete job and remote files
fleche clean --delete --archived --all  # Delete all archived jobs
fleche clean --delete --workspace       # Also delete shared workspace (use with caution)
fleche clean --unarchive [job-id]       # Restore archived job
fleche clean --dry-run                  # Preview what would be done
```

## Configuration

fleche looks for `fleche.toml` in the current directory or parent directories.

### Minimal Example

```toml
[remote]
host = "cluster"          # SSH host from ~/.ssh/config
base_path = "~/fleche"    # Where projects are stored on remote

[jobs.train]
command = "python train.py"
```

### Full Example

```toml
[project]
name = "my-project"       # Optional, defaults to directory name

[remote]
host = "cluster"
base_path = "~/fleche"

[env]                     # Environment variables for all jobs
HF_HOME = "/scratch/cache"
PYTHONUNBUFFERED = "1"

[slurm]                   # Default Slurm settings
partition = "gpu"
time = "4:00:00"
gpus = 1

[jobs.train]
command = "python train.py"
inputs = ["data/"]        # gitignored files to copy to workspace
outputs = ["checkpoints/"] # files to download after completion

[jobs.train.slurm]        # Override Slurm settings for this job
gpus = 4
time = "24:00:00"
memory = "64G"

[jobs.train.env]          # Additional env vars for this job
CONFIG = "default"

[jobs.setup]
command = "bash setup.sh"
exec = true               # Run directly via SSH, skip Slurm

# Optional settings to tune behavior
[settings]
# default_list_limit = 20           # Jobs shown in fleche status
# poll_interval_local_secs = 2      # Status check interval for local jobs
# poll_interval_remote_secs = 5     # Status check interval for remote jobs
# ssh_timeout_secs = 60             # SSH command timeout
# ssh_connect_timeout_secs = 30     # SSH connection timeout
# retry_base_delay_secs = 30        # Base delay for --retry backoff
```

### Environment Variable Substitution

Config values support `${VAR}` substitution, resolved from (highest precedence first):
1. CLI `--env` overrides (e.g., `--env DATASET=orc`)
2. Built-in variables (`${PROJECT}` = value of `project.name`)
3. Job-specific `[jobs.<name>.env]` entries
4. Global `[env]` entries (in definition order)
5. System environment variables (e.g., `$USER`, `$HOME`)
6. Variables from `.env` file in the project directory

This means `--env` can override any variable used in commands, inputs, or outputs.

```toml
[project]
name = "graphmind"

[remote]
base_path = "/scratch/${USER}/fleche"

[env]
CACHE = "/scratch/${USER}/cache"
UV_CACHE = "${CACHE}/uv"
# Use ${PROJECT} to avoid hardcoding the project name
UV_PROJECT_ENVIRONMENT = "${CACHE}/${PROJECT}/.venv"
```
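The lookup order listed above behaves like a first-match search through layers. The following is a minimal sketch in plain shell, not fleche's actual implementation: `lookup` and `resolve` are hypothetical helpers, the layer contents are sample values, and only the first four layers are modelled.

```bash
# Illustrative sketch only: first-match lookup, highest precedence first.
# lookup/resolve are hypothetical helpers, not part of fleche.

# Print the value of KEY ($1) from a newline-separated list of KEY=VALUE pairs ($2).
lookup() {
  printf '%s\n' "$2" | while IFS='=' read -r k v; do
    if [ "$k" = "$1" ]; then
      printf '%s\n' "$v"
      break
    fi
  done
}

# Walk the layers in the order given and print the first non-empty match.
resolve() {
  key="$1"
  shift
  for layer in "$@"; do
    value="$(lookup "$key" "$layer")"
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  done
  return 1
}

# Sample layers (assumed values for illustration)
cli_env='DATASET=orc'                # --env DATASET=orc
builtin_vars='PROJECT=graphmind'     # built-in ${PROJECT}
job_env='CONFIG=llama_orc'           # [jobs.<name>.env]
global_env='DATASET=default_dataset
CONFIG=base_config'

resolve DATASET "$cli_env" "$builtin_vars" "$job_env" "$global_env"   # prints: orc
resolve CONFIG  "$cli_env" "$builtin_vars" "$job_env" "$global_env"   # prints: llama_orc
```

The sketch treats an empty value as "not set", which may or may not match how fleche handles empty entries.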

Use `${VAR:-default}` for optional variables:

```toml
[remote]
base_path = "${SCRATCH:-/tmp}/${USER}/fleche"
```
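The `${VAR:-default}` form has the same spelling as the shell's own parameter expansion. For comparison, this is how the shell itself evaluates it; that fleche treats unset and empty variables the same way is an assumption, and `base_path_for` is a hypothetical helper.

```bash
# Plain shell behaviour, shown for comparison only. Whether fleche also
# falls back on an empty (not just unset) variable is assumed, not confirmed.
base_path_for() {
  # $1 = user name; SCRATCH is read from the environment
  printf '%s\n' "${SCRATCH:-/tmp}/$1/fleche"
}

unset SCRATCH
base_path_for alice      # prints: /tmp/alice/fleche

SCRATCH=/scratch
base_path_for alice      # prints: /scratch/alice/fleche
```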

### Using .env Files

For project-specific variables, create a `.env` file:

```bash
# .env (gitignored)
SSH_USER=k21220155
SCRATCH=/scratch/users/k21220155
```

```toml
# fleche.toml
[remote]
base_path = "${SCRATCH}/fleche"
```

This enables user-agnostic configs that can be committed to version control.

### Forwarding .env Variables to Jobs

By default, `.env` variables are only used for `${VAR}` expansion in config values.
They are NOT exported into job environments. To inject all variables from a dotenv
file as exports in the sbatch script, use the `dotenv` option:

```toml
# fleche.toml
dotenv = ".env"           # All vars from .env are exported in every job
```

Per-job override (replaces global, not additive):

```toml
dotenv = ".env"

[jobs.train]
dotenv = ".env.train"     # This job uses .env.train instead of .env
```

Precedence (lowest to highest):
1. `dotenv` file variables
2. Global `[env]`
3. Job-specific `[jobs.<name>.env]`
4. CLI `--env`

The configured file must exist — unlike the implicit `.env` lookup, a missing
`dotenv` file is an error.
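
Conceptually, forwarding turns each `KEY=VALUE` line of the dotenv file into an `export` line in the job script. A rough sketch of that transformation in plain shell, assuming simple unquoted values: `dotenv_to_exports` is a hypothetical helper, and the exact lines fleche generates may differ (`fleche run <job> --dry-run` shows the real script).

```bash
# Hypothetical helper: read dotenv-style lines on stdin, print export lines.
# Assumes simple KEY=VALUE lines; values containing single quotes are not handled.
dotenv_to_exports() {
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) ;;                    # skip blank lines and comments
      *=*)
        key="${line%%=*}"            # text before the first '='
        value="${line#*=}"           # text after the first '='
        printf "export %s='%s'\n" "$key" "$value"
        ;;
    esac
  done
}

printf '%s\n' '# .env (gitignored)' 'SSH_USER=k21220155' 'SCRATCH=/scratch/users/k21220155' \
  | dotenv_to_exports
# prints:
#   export SSH_USER='k21220155'
#   export SCRATCH='/scratch/users/k21220155'
```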

### Separate Job Files

Jobs can also be defined in `fleche/*.toml`. The filename becomes the job name:

```
fleche/
  train.toml
  eval.toml
  inference.toml
```

## Common Workflows

### Parameterised Jobs

Use `--env` to pass parameters or override defaults:

```toml
# fleche/train.toml
command = "python train.py --dataset ${DATASET} --config ${CONFIG}"

[env]
DATASET = "default_dataset"   # Default value
CONFIG = "base_config"        # Default value
```

```bash
# Override defaults from CLI
fleche run train --env DATASET=orc --env CONFIG=llama_orc

# The command becomes: python train.py --dataset orc --config llama_orc
```

CLI `--env` values override config defaults during `${VAR}` expansion.

### Quick GPU Test

Override command to test environment:

```bash
fleche run train --command "nvidia-smi"
```

This uses train's Slurm config (partition, gpus) but runs a different command.

### Ad-hoc Commands

Run without a job definition:

```bash
fleche run "python test.py" --partition cpu --time 0:30:00
```

### Direct SSH Execution (No Slurm)

For quick tests or non-GPU work, use `fleche exec` to bypass Slurm:

```bash
fleche exec "python test.py"
fleche exec "ls -la"
```

This syncs your project and runs the command directly over SSH.
Use `--no-sync` to skip syncing (useful when code is already on the remote):

```bash
fleche exec "python test.py" --no-sync
```

### Exec Mode (Configured Direct Execution)

For jobs that should always run directly via SSH (bypassing Slurm), set `exec = true`
in the job definition. Unlike `fleche exec`, exec mode jobs are tracked in the registry
with full support for status, logs, cancel, wait, retry, and background execution.

```toml
[jobs.setup]
command = "bash setup.sh"
exec = true
```

```bash
# Run in foreground (streams output)
fleche run setup

# Run in background
fleche run setup --bg

# All standard operations work
fleche status
fleche logs
fleche cancel
fleche wait
```

Use `--exec` to override any job to run directly:

```bash
fleche run train --exec   # Bypasses Slurm for this run only
```

Slurm options are ignored for exec jobs (a warning is shown if any are set).

### Local Execution

Run jobs on your local machine instead of a remote cluster:

```bash
# Run locally via CLI flag
fleche run train --host local
```

Or configure it in `fleche.toml`:

```toml
[jobs.test]
command = "python test.py"
host = "local"
```

Local jobs run directly in the project directory with logs in `.fleche/jobs/{id}/`.
Use `--host local` with `fleche exec` for quick local command execution:

```bash
fleche exec "python -c 'print(1+1)'" --host local
```

### Tagging Jobs

Add tags to track and filter experiments:

```bash
# Tag jobs when submitting
fleche run train --tag experiment=ablation --tag model=8b
fleche run train --tag experiment=baseline --tag model=8b

# Filter status by tag
fleche status --tag experiment=ablation
fleche status --tag model=8b --filter running

# Filter by job name (regex pattern, implicit .* around)
fleche status --name 123             # jobs containing "123"
fleche status --name '^train'        # jobs starting with "train"
fleche status --name 'ablation$'     # jobs ending with "ablation"

# View logs from most recent job with specific tag
fleche logs --tag experiment=ablation

# Download outputs from most recent job with tag
fleche download --tag experiment=ablation

# Cancel all jobs with a specific tag
fleche cancel --all --tag experiment=test

# Clean up old experiment jobs
fleche clean --all --tag experiment=old
fleche clean --older-than 7d --tag experiment=ablation
```

Tags are shown in status output below each job that has them.

### Monitoring

```bash
# View logs (defaults to most recent job)
fleche logs

# Show only the last 50 lines
fleche logs -n 50

# Show only stdout or only stderr
fleche logs --stdout
fleche logs --stderr

# Stream logs in real-time (Ctrl+C to disconnect; job keeps running)
fleche logs --follow

# Pull outputs while job is still running
fleche download --partial

# Download only specific file types (searches inside directories)
fleche download --filter "*.json" --filter "*.csv"

# Download everything except checkpoints
fleche download --filter "!checkpoints/**"

# Preview what would be downloaded without actually downloading
fleche download --dry-run
fleche download --dry-run --filter "*.json"
```
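
One way to read the `--filter` patterns above is as include globs plus `!`-prefixed exclude globs. The sketch below uses the shell's own `case` globbing to illustrate that reading. How fleche actually combines several patterns is an assumption here (excludes win; if any include is given, at least one must match), and `keep_path` is a hypothetical helper.

```bash
# Hypothetical helper: succeed if PATH ($1) passes the given filter patterns.
# Assumed semantics: any matching '!pattern' rejects the path; if include
# patterns are present, at least one of them must match.
keep_path() {
  path="$1"
  shift
  included=""
  has_include=""
  for pat in "$@"; do
    case "$pat" in
      '!'*)
        # Exclude pattern: drop the leading '!' and match the rest as a glob
        case "$path" in
          ${pat#?}) return 1 ;;
        esac
        ;;
      *)
        has_include=1
        # In a case pattern, '*' also matches '/', so '*.json' matches nested files
        case "$path" in
          $pat) included=1 ;;
        esac
        ;;
    esac
  done
  [ -z "$has_include" ] || [ -n "$included" ]
}

for f in results/metrics.json results/plot.png checkpoints/step1/model.pt; do
  if keep_path "$f" '*.json' '!checkpoints/**'; then
    echo "download $f"
  else
    echo "skip     $f"
  fi
done
# prints:
#   download results/metrics.json
#   skip     results/plot.png
#   skip     checkpoints/step1/model.pt
```

Use `fleche download --dry-run` with the same filters to see what fleche itself would select.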

### Job Chaining

Jobs share a workspace, so outputs from one job are available to the next:

```bash
fleche run train          # Creates checkpoints/
fleche run eval           # Can read checkpoints/ from train
fleche download           # Download results from eval
```

No need for explicit dependencies - files persist in the shared workspace.

### Job Dependencies

Use `--after` to run a job only after another completes successfully:

```bash
# Submit training job
fleche run train --bg
# Job ID: train-20260119-120000-abc1

# Submit eval to run after train completes
fleche run eval --after abc1
```

The second job waits in the Slurm queue until the dependency finishes with exit code 0.
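
In plain Slurm, "start only after another job exits with code 0" is written with sbatch's `--dependency=afterok:<id>` option. The sketch below only builds that option string. That fleche emits exactly this flag is an assumption, and `dependency_flag` is a hypothetical helper. Slurm expects its own numeric job ID here, not the fleche job ID.

```bash
# Hypothetical helper: build the sbatch dependency option for a Slurm job ID.
dependency_flag() {
  case "$1" in
    ''|*[!0-9]*)
      echo "expected a numeric Slurm job ID, got: $1" >&2
      return 1
      ;;
  esac
  printf '%s%s\n' '--dependency=afterok:' "$1"
}

dependency_flag 12345    # prints: --dependency=afterok:12345
```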

### Automatic Retries

Use `--retry` to automatically retry failed jobs with exponential backoff:

```bash
# Retry up to 3 times on failure (30s, 60s, 120s delays)
fleche run train --retry 3
```

Each retry creates a new job ID. Works for both Slurm and local jobs (foreground only).
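
The delays quoted above (30s, 60s, 120s) are consistent with a base delay that doubles on every attempt. A minimal sketch of that schedule, assuming plain doubling from the base delay (cf. `retry_base_delay_secs`); `retry_delays` is a hypothetical helper, not fleche code.

```bash
# Hypothetical helper: print the wait before each retry, doubling every time.
retry_delays() {
  delay="$1"      # base delay in seconds
  retries="$2"    # number of retries, as passed to --retry
  i=0
  while [ "$i" -lt "$retries" ]; do
    printf '%ss\n' "$delay"
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
}

retry_delays 30 3
# prints:
#   30s
#   60s
#   120s
```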

### Job Notes

Annotate jobs with notes for later reference:

```bash
# Add note when submitting
fleche run train --note "testing new learning rate"

# Add or update note later
fleche note <job-id> "increased batch size to 64"

# View note
fleche note <job-id>

# Notes are also shown in fleche status <job-id>

# Search logs by note content (case-insensitive regex)
fleche logs --note "learning rate"
fleche logs --note "experiment.*baseline"
```

### Archiving Jobs

By default, `fleche clean` archives jobs (hides them from listings without deleting):

```bash
# Archive a job (default behavior)
fleche clean <job-id>

# Archive all finished jobs
fleche clean --all

# Archive only failed jobs
fleche clean --all --filter failed

# View archived jobs
fleche status --archived

# View all jobs including archived
fleche status --all-jobs

# Restore an archived job
fleche clean --unarchive <job-id>

# Restore all archived jobs
fleche clean --unarchive --all

# Permanently delete jobs (removes files)
fleche clean --delete <job-id>

# Delete all archived jobs
fleche clean --delete --archived --all

# Delete archived jobs older than 30 days
fleche clean --delete --archived --older-than 30d
```

Archived jobs are hidden from `fleche status` by default but their data is preserved.
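
The `--older-than` values used above (`7d`, `30d`) are a number followed by a unit. A sketch of how such a value maps to seconds: only `h` and `d` appear in this reference, so the hypothetical helper `duration_secs` handles just those two and makes no claim about other units fleche may accept.

```bash
# Hypothetical helper: convert "<number><unit>" to seconds (h = hours, d = days).
duration_secs() {
  num="${1%?}"            # everything except the last character
  unit="${1##*[0-9]}"     # whatever follows the last digit
  case "$num" in
    ''|*[!0-9]*) echo "invalid duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    h) echo $(( num * 3600 )) ;;
    d) echo $(( num * 86400 )) ;;
    *) echo "unsupported unit in: $1" >&2; return 1 ;;
  esac
}

duration_secs 2h     # prints: 7200
duration_secs 7d     # prints: 604800
```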

### Resource Statistics

View resource usage for completed Slurm jobs:

```bash
# Stats for most recent job
fleche stats

# Stats for last 5 jobs
fleche stats -n 5

# Stats for specific job
fleche stats <job-id>
```

Shows elapsed time, CPU time, max memory, node, and allocated resources from sacct.

Resource usage is also shown in `fleche status <job-id>` for finished Slurm jobs:

```bash
fleche status <job-id>
# ...
#   Resource usage:
#     Node:         gpu-node01
#     Elapsed:      01:23:45
#     CPU time:     02:30:00
#     Max memory:   4096K
#     Resources:    4 CPU, 1 GPU, 16G mem
```

### Push Notifications (ntfy.sh)

Get push notifications on your phone or desktop when jobs change state:

```bash
# Notify on all state changes (submitted, running, completed/failed)
fleche run train --ntfy my-topic

# Works with background jobs
fleche run train --bg --ntfy my-topic

# Wait for an existing job with notifications
fleche wait <job-id> --ntfy my-topic

# Re-run with notifications
fleche rerun <job-id> --ntfy my-topic
```

Subscribe to notifications at `https://ntfy.sh/my-topic` or install the
ntfy app on your phone. Choose a unique topic name to avoid conflicts.

Notifications are sent for each state transition:
- **Submitted** — job entered the Slurm queue (low priority)
- **Running** — job started executing (default priority)
- **Completed** — job finished successfully (high priority)
- **Failed** — job failed (urgent priority)
- **Cancelled** — job was cancelled (high priority)

Job notes (from `--note`) are included in the notification body when present.
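
The state-to-priority mapping above can be written as a small lookup, and a comparable notification can be sent by hand with `curl` using ntfy's `Priority` header. This is a sketch: `ntfy_priority` is a hypothetical helper, and the exact title and body that fleche sends are not reproduced here.

```bash
# Hypothetical helper: map a job state to the ntfy priority listed above.
ntfy_priority() {
  case "$1" in
    submitted)           echo low ;;
    running)             echo default ;;
    completed|cancelled) echo high ;;
    failed)              echo urgent ;;
    *) echo "unknown state: $1" >&2; return 1 ;;
  esac
}

ntfy_priority failed     # prints: urgent

# Sending a notification by hand (not run here, needs network access):
#   curl -H "Priority: $(ntfy_priority failed)" -d "train failed" https://ntfy.sh/my-topic
```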

## Commands Reference

| Command | Description |
|---------|-------------|
| `fleche run [job\|cmd] [opts]` | Submit a job via Slurm (or directly with `--exec`, locally with `--host local`) |
| `fleche run <job> --dry-run` | Preview sbatch script and list the project/input files that would be synced |
| `fleche rerun <job-id>` | Re-run a previous job with same settings |
| `fleche exec <cmd>` | Run command directly via SSH (or locally with `--host local`, `--no-sync` to skip sync) |
| `fleche status [job-id\|#N]` | Show job status (defaults to listing all) |
| `fleche status -n 50` | Show last 50 jobs |
| `fleche status --filter running` | Filter by status (repeatable) |
| `fleche status --name <pattern>` | Filter by name (regex, implicit `.*` around) |
| `fleche status --tag <k=v>` | Filter jobs by tag |
| `fleche logs [job-id]` | View job output (defaults to most recent) |
| `fleche logs --raw` | Strip ANSI codes (auto when piped) |
| `fleche logs --tag <k=v>` | Logs from most recent job with tag |
| `fleche logs --note <pattern>` | Logs from most recent job matching note |
| `fleche download [job-id]` | Pull output files (defaults to most recent) |
| `fleche download --filter <pat>` | Filter by glob, searches inside directories (`!` to exclude) |
| `fleche download --dry-run` | Preview what would be downloaded |
| `fleche download --tag <k=v>` | Download from most recent job with tag |
| `fleche cancel [job-id]` | Cancel a job (defaults to most recent active) |
| `fleche cancel --all [--tag <k=v>]` | Cancel all (or tagged) active jobs |
| `fleche cancel --dry-run` | Show what would be cancelled without cancelling |
| `fleche clean [job-id]` | Archive job (hide from listings) |
| `fleche clean --all [--tag <k=v>]` | Archive all (or tagged) finished jobs |
| `fleche clean --all --filter failed` | Archive only failed jobs |
| `fleche clean --older-than <dur>` | Archive jobs older than duration |
| `fleche clean --delete [job-id]` | Permanently delete job and remote files |
| `fleche clean --delete --archived --all` | Delete all archived jobs |
| `fleche clean --delete --workspace` | Also delete shared workspace |
| `fleche clean --dry-run` | Show what would be done without doing it |
| `fleche clean --unarchive [job-id]` | Restore archived job |
| `fleche status --archived` | Show only archived jobs |
| `fleche status --all-jobs` | Show all jobs including archived |
| `fleche tags` | List all unique tags across jobs |
| `fleche wait [job-id]` | Wait for job to complete |
| `fleche wait --notify` | Wait and send terminal notification when done |
| `fleche wait --ntfy <topic>` | Wait and send push notifications via ntfy.sh |
| `fleche stats [job-id]` | Show resource usage (time, CPU, memory, node) |
| `fleche stats -n 5` | Show stats for last N jobs |
| `fleche note <job-id> [text]` | View or set job note |
| `fleche ping` | Check Slurm cluster health |
| `fleche init` | Create starter config |
| `fleche check` | Validate config |
| `fleche check --remote` | Also validate against remote server |
| `fleche doctor` | Comprehensive troubleshooting diagnostics |
| `fleche compare <a> <b>` | Compare two job configurations side-by-side |
| `fleche proxy -- <cmd>` | Run command through SOCKS proxy to remote host |
| `fleche jobs` | List available jobs from configuration |
| `fleche skill` | Print this skill reference |
| `fleche skill --install project` | Install skill to current project |
| `fleche skill --install global` | Install skill to user config |
| `fleche completions <shell>` | Generate shell completions (bash/zsh/fish) |

## Slurm Options

These can be set in config or passed via CLI:

| Option | sbatch flag | Example |
|--------|-------------|---------|
| `--partition` | --partition | `--partition gpu` |
| `--time` | --time | `--time 8:00:00` |
| `--gpus` | --gpus | `--gpus 1` |
| `--cpus` | --cpus-per-task | `--cpus 16` |
| `--memory` | --mem | `--memory 32G` |
| `--constraint` | --constraint | `--constraint a100` |
| `--nodes` | --nodes | `--nodes 2` |
| `--exclude` | --exclude | `--exclude node01,node02` |
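
To make the mapping concrete, here is a sketch that turns an option name and value into the corresponding `#SBATCH` line, following the table above. `sbatch_directive` is a hypothetical helper, and the script fleche actually generates may be laid out differently; `fleche run <job> --dry-run` shows the real one.

```bash
# Hypothetical helper: print the #SBATCH line for a fleche option name and value.
sbatch_directive() {
  opt="$1"
  value="$2"
  case "$opt" in
    cpus)   flag="cpus-per-task" ;;    # --cpus maps to --cpus-per-task
    memory) flag="mem" ;;              # --memory maps to --mem
    partition|time|gpus|constraint|nodes|exclude) flag="$opt" ;;
    *) echo "unknown option: $opt" >&2; return 1 ;;
  esac
  printf '#SBATCH --%s=%s\n' "$flag" "$value"
}

sbatch_directive memory 64G       # prints: #SBATCH --mem=64G
sbatch_directive cpus 16          # prints: #SBATCH --cpus-per-task=16
sbatch_directive time 24:00:00    # prints: #SBATCH --time=24:00:00
```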

## Remote Directory Structure

All jobs share a workspace directory:

```
<base_path>/<project>/
  .fleche/
    workspace/          # Shared workspace (project code + inputs)
      train.py
      data/
      checkpoints/
    jobs/               # Per-job logs and metadata
      train-abc123/
        job.sbatch
        job.out
        job.err
      eval-def456/
        ...
```

- Project code is synced to `workspace/`, respecting `.gitignore`
- Files in `inputs` are copied to `workspace/` (for gitignored data)
- Job commands run with `workspace/` as their working directory
- Job logs go to `jobs/<job-id>/`
- `fleche download` copies `outputs` from `workspace/` to local
- An empty `inputs`/`outputs` entry (e.g. a `${VAR}` that expands to `""`) is
  rejected and the job is not run — set the variable to a real path or remove
  the entry

## JSON Output

Use the global `--json` flag to get machine-readable output from any supported
command. This is useful for scripting, piping to `jq`, or when fleche is driven
by an AI agent.

```bash
fleche status --json                   # List jobs as JSON
fleche status --json <job-id>          # Detailed status as JSON
fleche jobs --json                     # Available job definitions
fleche tags --json                     # All tags
fleche stats --json                    # Resource stats
fleche wait --json                     # Wait and get final status as JSON
fleche cancel --dry-run --json         # Preview cancellation as JSON
fleche clean --all --dry-run --json    # Preview cleanup as JSON
```

The `--json` flag is supported by: `status`, `jobs`, `tags`, `stats`, `wait`,
`cancel`, and `clean`.

## Dry Run

Use `--dry-run` to preview what a command would do without side effects:

```bash
fleche run train --dry-run             # Preview sbatch script + project/input files to sync
fleche download --dry-run              # Preview downloads
fleche cancel --all --dry-run          # Preview cancellation
fleche clean --older-than 7d --dry-run # Preview cleanup
fleche clean --all --dry-run           # Preview cleanup
```

## Troubleshooting

### Validate Configuration

```bash
fleche check                           # Check config syntax locally
fleche check --remote                  # Validate against remote server
```

The `--remote` flag tests:
- SSH connectivity with timing
- Slurm controller availability
- Partition existence and node count
- Constraint validity for the partition
- Base path writability
- Available disk space

### Comprehensive Diagnostics

```bash
fleche doctor
```

Runs a full diagnostic check including:
- Local tools (ssh, rsync)
- Configuration validity
- Job registry health (stale jobs, old jobs)
- Remote connection and Slurm status
- Disk space warnings

### Compare Job Configurations

```bash
fleche compare <job-a> <job-b>
```

Shows differences in command, Slurm settings, environment, tags, and more.
Useful for debugging why one job succeeded while another failed.

### Numeric Index Aliases

`fleche status` shows a `#` column with 1-based indices (1 = most recent).
Use these numbers anywhere a job ID is accepted:

```bash
fleche status
# #  ID                                            STATUS       SLURM ID     CREATED
# 1  train-20260301-120000-abc1                    running      12345        2026-03-01 12:00
# 2  eval-20260228-090000-def2                     completed    12340        2026-02-28 09:00
# 3  train-20260227-150000-ghi3                    failed       12335        2026-02-27 15:00

# Use index instead of job ID
fleche logs 1           # Logs for most recent job
fleche cancel 1         # Cancel most recent job
fleche download 2       # Download outputs from job #2
fleche stats 3          # Stats for job #3
fleche status 1         # Detailed status for job #1
```

Indices are stable within a session — they correspond to position in the
unfiltered global list. Filtered views may show gaps (e.g., `#1, #4, #7`)
but the numbers always resolve to the same job.
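
The two shorthand forms, numeric index and short ID suffix, can be pictured as a lookup over the unfiltered job list, most recent first. The sketch below is an illustration, not fleche's resolver: `resolve_job_ref` is a hypothetical helper, and it assumes that an all-digit reference is always an index (how fleche disambiguates an all-digit suffix is not stated here).

```bash
# Hypothetical helper: resolve an index or ID suffix ($1) against a
# newline-separated list of job IDs ($2), most recent first.
resolve_job_ref() {
  ref="$1"
  ids="$2"
  case "$ref" in
    ''|*[!0-9]*)
      # Not purely numeric: return the first ID that ends with the reference
      printf '%s\n' "$ids" | while IFS= read -r id; do
        case "$id" in
          *"$ref") printf '%s\n' "$id"; break ;;
        esac
      done
      ;;
    *)
      # Purely numeric: 1-based position in the list
      [ "$ref" -ge 1 ] || return 1
      printf '%s\n' "$ids" | sed -n "${ref}p"
      ;;
  esac
}

job_ids='train-20260301-120000-abc1
eval-20260228-090000-def2
train-20260227-150000-ghi3'

resolve_job_ref 2 "$job_ids"       # prints: eval-20260228-090000-def2
resolve_job_ref ghi3 "$job_ids"    # prints: train-20260227-150000-ghi3
```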

### SOCKS Proxy

Route traffic through the remote host using an SSH SOCKS tunnel:

```bash
fleche proxy -- curl https://example.com              # Route through cluster
fleche proxy -- wget https://huggingface.co/weights   # Download via cluster network
fleche proxy --port 1080 -- curl https://example.com  # Use specific port
fleche proxy --host other -- curl https://example.com  # Override host
```

The tunnel opens automatically, sets `ALL_PROXY`/`HTTP_PROXY`/`HTTPS_PROXY`
environment variables on the child process, and closes when the command exits.

## Tips

- Use `--dry-run` to preview the sbatch script before submitting
- Use `fleche check --remote` to validate config against the server
- Use `fleche doctor` when things aren't working as expected
- Job IDs look like `train-20260115-153042-847-x7k2` (use suffix like `x7k2` for short)
- Use numeric indices from `fleche status` for quick access (e.g., `fleche logs 1`)
- The job registry is at `~/.config/fleche/jobs.db`
- Ctrl+C during streaming disconnects but doesn't cancel the job
- Exit codes are tracked and shown in `fleche status <job-id>` and failure messages
- Raw Slurm state (e.g., TIMEOUT, OUT_OF_MEMORY, PREEMPTED) is shown in `fleche status <job-id>` for Slurm jobs
- Slurm resources at submission (partition, memory, time, GPUs, etc.) are shown in `fleche status <job-id>` — useful after Slurm purges the job record
- Resource usage (elapsed time, CPU, memory, node) is shown in `fleche status <job-id>` for finished Slurm jobs
- Use `fleche exec` for quick ad-hoc tests without Slurm queue wait
- Use `exec = true` in config for jobs that should always bypass Slurm
- Jobs share workspace, so chained jobs can read each other's outputs
- Use `--retry` for flaky jobs that may fail due to transient issues
- Use `--note` to document experiment parameters for future reference
- Use `fleche clean` to archive old jobs without deleting them
- Use `fleche jobs` to see what jobs are available in the project
- Use `fleche proxy` to route traffic through the cluster's network
- Use `--ntfy <topic>` to get push notifications on your phone via ntfy.sh
- Enable shell completions: `fleche completions bash >> ~/.bashrc`