flodl 0.4.0

floDl — a flow-graph deep learning framework built on libtorch
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
<p align="center">
  <img src="https://raw.githubusercontent.com/fab2s/floDl/main/docs/floDl.png" alt="floDl" width="640">
</p>

<h1 align="center">floDl</h1>

<p align="center">
A Rust-native deep learning framework built on libtorch.<br>
Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.
</p>

<p align="center">
  <a href="https://flodl.dev"><img src="https://img.shields.io/badge/web-flodl.dev-6c8cff" alt="Website"></a>
  <a href="https://github.com/fab2s/floDl/actions"><img src="https://github.com/fab2s/floDl/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://crates.io/crates/flodl"><img src="https://img.shields.io/crates/v/flodl.svg" alt="crates.io"></a>
  <a href="https://docs.rs/flodl"><img src="https://docs.rs/flodl/badge.svg" alt="docs.rs"></a>
  <a href="https://github.com/fab2s/floDl/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT License"></a>
</p>

<p align="center">
  <a href="#if-you-know-pytorch-you-know-flodl">PyTorch Users</a> &bull;
  <a href="https://flodl.dev/thesis"><b>Thesis</b></a> &bull;
  <a href="#getting-started">Getting Started</a> &bull;
  <a href="#the-graph-builder">Graph Builder</a> &bull;
  <a href="#graph-tree-hierarchical-composition">Graph Tree</a> &bull;
  <a href="#the-training-experience">Training</a> &bull;
  <a href="#multi-gpu-training">Multi-GPU</a> &bull;
  <a href="#pytorch-parity">Parity</a> &bull;
  <a href="#performance">Benchmarks</a> &bull;
  <a href="https://github.com/fab2s/floDl/blob/main/ROADMAP.md">Roadmap</a> &bull;
  <a href="https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md">Migration Guide</a> &bull;
  <a href="https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md">Data Loading</a>
</p>

---

## If You Know PyTorch, You Know floDl

<table>
<tr><th>PyTorch</th><th>floDl</th></tr>
<tr><td>

```python
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)

pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

</td><td>

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;
```

</td></tr>
</table>

Same concepts, same names, same GPU kernels underneath. The `?` operator
replaces silent failures with compile-time error handling. `Drop` replaces the
garbage collector. The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) covers
every op, module, and pattern.

> **New to Rust?** Read [Rust for PyTorch Users]https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md — 10 patterns in 15 minutes.

## Getting Started

**With the CLI** (recommended, no Rust needed):

```bash
curl -sL https://flodl.dev/fdl -o fdl && chmod +x fdl
./fdl setup          # detect hardware, download libtorch, configure build environment
./fdl init my-proj   # scaffold a new project with training template
```

The `fdl` script auto-downloads a pre-compiled CLI binary (~750KB, pure Rust,
no libtorch dependency). It detects your GPUs, downloads the right libtorch
variant, and configures Docker or native builds. See the [full CLI
reference](docs/cli.md) for all commands.

**One-liner with Docker** (no Rust, no setup):

```bash
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project
make build    # first build (~5 min, downloads libtorch)
make run      # train the model
```

**Native** -- [Rust](https://rustup.rs/) 1.85+ and libtorch:

```bash
./fdl libtorch download    # auto-detects CPU or CUDA
cargo add flodl && cargo build
```

For CUDA: `cargo add flodl --features cuda` + [CUDA toolkit](https://developer.nvidia.com/cuda-downloads).

Both paths generate an annotated training template. Edit `src/main.rs` to
build your model:

```rust
use flodl::*;

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?)     // residual connection
    .through(Linear::new(16, 2)?)
    .build()?;

let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);
model.train();

for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}
```

## The Graph Builder

floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no `nn.Module` subclassing.

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                        // activation
    .through(LayerNorm::new(16)?)         // normalization
    .also(Linear::new(16, 16)?)           // residual connection
    .through(Linear::new(16, 2)?)         // output projection
    .build()?;
```

`build()` returns a `Graph` that implements `Module` — you can nest it
inside other graphs. Things get interesting when architectures get complex:

```rust
let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;
```

Every construct — `split/merge`, `also`, `loop_body`, `gate`, `switch`, `map`,
`tag/using` — composes cleanly. Forward references (`using` before `tag`) carry
state across calls, enabling recurrent architectures without special-casing.

| Method | What it does |
|--------|-------------|
| `from(m).through(m)` | Linear chain |
| `also(m)` | Residual: `input + m(input)` |
| `fork(m)` | Side branch: capture output as tag, stream continues |
| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |
| `tag(name)` / `using(refs)` | Named references — backward or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond` / `until_cond` | Conditional loops |
| `gate(router, modules![...])` | Soft routing — weighted combination |
| `switch(selector, modules![...])` | Hard routing — only selected branch |
| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |
| `input(names)` | Auxiliary graph inputs for multi-input architectures |

See the **[Graph Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** and
the [full showcase](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/).

## Graph Tree: Hierarchical Composition

This is where floDl goes beyond PyTorch. Graphs nest inside graphs with
**label-path addressing** — dot-separated paths that let you reach into any
subgraph from the root. Train components independently, compose them into
larger architectures, and control training phases declaratively.

```rust
// Build components independently
let scan = FlowBuilder::from(scan_net).tag("hidden")
    .label("scan").build()?;

let read = FlowBuilder::from(read_net).tag("confidence")
    .label("read").build()?;

let encoder = FlowBuilder::from(scan)
    .through(read)
    .label("encoder").build()?;

// Compose into full model
let model = FlowBuilder::from(encoder)
    .through(classifier)
    .build()?;
```

### Dotted paths reach anywhere

Every tag and subgraph is addressable through dotted paths from the root:

```rust
model.validate_path("encoder")?;                 // -> Subgraph
model.validate_path("encoder.scan.hidden")?;      // -> Tag (three levels deep)
model.validate_path("encoder.read.confidence")?;  // -> Tag
```

### Declarative training phases

Freeze and thaw entire subtrees by path — no manual parameter iteration:

```rust
// Phase 1: train only the classifier, encoder is frozen
model.freeze("encoder")?;
let fresh_params = model.parameters();  // only unfrozen params
let mut opt = Adam::new(&fresh_params, 1e-3);
// ... train ...

// Phase 2: thaw scan, keep read frozen (it's proven)
model.thaw("encoder.scan")?;
let mut opt = Adam::with_groups()
    .group(&model.parameters_at("encoder.scan")?, 1e-4)  // low LR
    .group(&model.parameters_at("classifier")?, 1e-3)
    .build();
```

### Subgraph checkpoints

Train a component standalone, save it, load it into a larger model:

```rust
// Pre-trained encoder saved earlier
encoder.save_checkpoint("encoder_v1.fdl.gz")?;

// Load into the composed model — namespace + hash validated
model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;
model.freeze("encoder.read")?;  // lock what's proven
```

### Cross-boundary observation

Metrics flow up through the tree automatically:

```rust
model.record_at("encoder.scan.loss", scan_loss)?;
model.record_at("encoder.read.accuracy", read_acc)?;
model.record_scalar("total_loss", total)?;

model.flush(&[]);  // single call flushes the entire tree

// Trends across boundaries — drive training decisions
if model.trend_at("encoder.scan.loss")?.stalled(10, 1e-4) {
    model.thaw("encoder.read")?;  // scan stalled, unfreeze read
}

// Monitor sees all metrics with dotted names automatically
monitor.log(epoch, elapsed, &model);
// -> total_loss, encoder.scan.loss, encoder.read.accuracy
```

This is progressive model composition: each component is trained and
validated independently before becoming a building block in a larger
architecture. Checkpoints, metrics, and training phases compose just like
the graphs themselves.

See the full **[Graph Tree Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)**.

## The Training Experience

### Training Monitor

Drop-in monitor with adaptive ETA, resource tracking, and a live web
dashboard — no external dependencies, no separate process.

```rust
use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?;  // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...
    monitor.log(epoch, t.elapsed(), &model);  // sees entire graph tree
}
monitor.finish();
```

```
  epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]
  epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch 100/100  loss=0.0012  [23ms]             VRAM: 2.1/6.0 GB (82%)
  training complete in 2.8s  | loss: 0.0012
```

<p align="center">
  <a href="https://flodl.dev/benchmark">
    <img src="https://raw.githubusercontent.com/fab2s/floDl/main/docs/dashboard.gif" alt="floDl live training dashboard — click for interactive version" width="800">
  </a>
</p>
<p align="center"><em><a href="https://flodl.dev/benchmark">Interactive benchmark dashboard</a> — real data from a 100-epoch training run</em></p>

The live dashboard updates via Server-Sent Events (no WebSocket, no npm),
tracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and
all past epochs backfill instantly.

```rust
monitor.save_html("training_report.html");  // self-contained archive
monitor.export_csv("training.csv")?;         // for external analysis
```

### Observation and Trend Queries

Tags double as observation points. Collect metrics during training and use
trend queries to make programmatic training decisions:

```rust
for epoch in 0..num_epochs {
    for (input, target) in &batches {
        let pred = graph.forward(&input)?;
        graph.collect(&["hidden"])?;                 // from graph tag
        graph.record_scalar("loss", loss.item()?);   // external metric
    }
    graph.flush(&["hidden", "loss"]);

    // Programmatic training control
    if graph.trend("loss").stalled(5, 1e-4) {
        optimizer.set_lr(optimizer.lr() * 0.5);      // decay LR
    }
    if graph.trend("loss").converged(5, 1e-5) {
        break;                                        // early stopping
    }
}
```

| Method | What it does |
|--------|-------------|
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric aggregation |
| `g.record_scalar(tag, value)` | Inject external metrics (loss, accuracy) |
| `g.trend(tag).slope(n)` | OLS slope over last n epochs |
| `g.trend(tag).stalled(n, tol)` | Is \|slope\| below tolerance? |
| `g.trend(tag).improving(n)` | Is loss decreasing? |
| `g.trend(tag).converged(n, tol)` | Is variance below tolerance? |
| `g.trends(tags).all_improving(n)` | Group queries across branches |

### Visualization

```rust
let svg = g.svg(Some("model.svg"))?;              // architecture diagram
g.svg_with_profile(Some("profile.svg"))?;          // timing heatmap
g.plot_html("training.html", &["loss", "head"])?;  // interactive curves
```

See the **[Training Monitor Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** and
the **[Observation example](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/)**.

## Multi-GPU Training

`Ddp::setup()` gives you transparent heterogeneous multi-GPU training with
zero changes to your training loop. floDl detects your GPUs, picks the best
strategy, and balances work automatically: the slowest GPU anchors the pace
while faster ones run ahead intelligently.

**Graph DDP** -- one line to go from single-GPU to multi-GPU:

```rust
// Detect GPUs, replicate model, set optimizer, enable training
Ddp::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

// Training loop is IDENTICAL for 1 or N GPUs
for batch in model.epoch(0) {
    let loss = model.forward_batch(&batch?)?;
    model.step()?;  // AllReduce + sync + optimizer + zero_grad
}
```

**DDP Builder** -- thread-per-GPU, works with any `Module`:

```rust
let state = Ddp::builder(model_factory, optim_factory, train_fn)
    .dataset(dataset)
    .batch_size(32)
    .num_epochs(10)
    .policy(ApplyPolicy::Cadence)       // ElChe for mixed GPUs
    .backend(AverageBackend::Nccl)      // or Cpu for A/B testing
    .run()?
    .join()?;
```

| | Graph DDP | DDP Builder |
|---|---|---|
| **Works with** | `Graph` builder | Any `Module` |
| **GPU model** | Scatter per batch | Thread per GPU (Local SGD) |
| **Mixed GPUs** | El Che auto-enabled | `ApplyPolicy` x `AverageBackend` |
| **Setup** | One line (`Ddp::setup`) | Builder pattern |
| **Dashboard** | Integrated | Stderr logging |

**A/B testing**: swap `AverageBackend::Nccl` for `AverageBackend::Cpu`
with one line. If loss curves match, you have validated the cheaper
backend for your workload.

See the **[Multi-GPU Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md)**,
**[DDP Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md)**,
**[Data Loading Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md)**, and
**[DDP Reference](https://github.com/fab2s/floDl/blob/main/docs/ddp.md)**.

### Validation suite — `ddp-bench`

The repo ships with [`ddp-bench/`](https://github.com/fab2s/floDl/tree/main/ddp-bench),
a workspace member that reproduces published training setups (Logistic /
MLP / LeNet-5 / ResNet-20 / Char-RNN / GPT-nano / Conv-AE on MNIST,
CIFAR-10, Shakespeare) to build scientifically valid solo baselines, then
measures DDP/ElChe convergence quality against them across all 8
backend × policy combinations:

```bash
fdl ddp-bench --list                       # list models and modes
fdl ddp-bench quick                        # 1-epoch smoke test
fdl ddp-bench validate                     # full sweep vs structured baselines
fdl ddp-bench --model gpt-nano --mode nccl-cadence --epochs 50 --lr-scale 2
fdl ddp-bench --report runs/report.md      # convergence report from saved runs
```

Every run produces a high-frequency `Timeline` (CPU/GPU utilization, sync
events, anchor changes, idle gaps) saved as JSON / CSV / interactive HTML
under `runs/<model>/<mode>/`.

### Built-in datasets

The framework ships ready-to-use parsers for common benchmarks (all
implement `BatchDataSet`, plug straight into `DataLoader::builder`):

```rust
use flodl::data::datasets::{Cifar10, Mnist, Shakespeare};

let mnist = Mnist::parse(&images_gz, &labels_gz)?;
let cifar = Cifar10::parse(&[&batch1, &batch2, /* ... */])?;
let text  = Shakespeare::parse(&corpus, /*seq_len=*/ 128)?;
```

`ddp-bench` downloads and caches the underlying files on first run.

## PyTorch Parity

floDl covers the modules, losses, and optimizers you actually use:

| Category | Count | Highlights |
|----------|------:|-----------|
| **NN Modules** | 30+ | `Linear`, `Conv1d`/`2d`/`3d` + transpose, `GRU`/`LSTM`, `MultiheadAttention`, `Bilinear`, all norms (`Layer`/`RMS`/`Group`/`Batch`/`Instance`), all pooling, `Embedding`/`EmbeddingBag`, `PixelShuffle`, `Upsample`, `Unfold`/`Fold` |
| **Activations** | 17 | `ReLU`, `LeakyReLU`, `ELU`, `GELU`, `SiLU`, `Mish`, `SELU`, `Softplus`, `Hardswish`, `PReLU`, `Softmax`, ... |
| **Losses** | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |
| **Optimizers** | 7 | `SGD`, `Adam`, `AdamW`, `RMSprop`, `Adagrad`, `RAdam`, `NAdam` — all with parameter groups |
| **Schedulers** | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |
| **Init** | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |
| **Tensor Ops** | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |
| **Autograd** | 90+ | Differentiable backward for every op above |

Fused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient
clipping via foreach ops. Mixed precision with `AutocastGuard` + `GradScaler`.
CUDA Graphs for replay-based training.

The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) has side-by-side
code for every op, module, and pattern.

## Performance

Same CUDA kernels as PyTorch — the difference comes from what happens
*between* kernel launches. Ten models, ten interleaved rounds, locked GPU
clocks (RTX 5060 Ti, v0.3.0 vs PyTorch 2.10.0):

| Model | PyTorch | flodl | Delta |
|---|---:|---:|---:|
| transformer | 3183.0 ms | 2199.8 ms | **-31%** |
| mlp | 291.1 ms | 207.0 ms | **-29%** |
| residual_tower | 406.9 ms | 309.7 ms | **-24%** |
| feedback_fixed | 275.3 ms | 231.3 ms | **-16%** |
| gated_routing | 248.0 ms | 217.3 ms | **-12%** |
| iterative_refine | 230.7 ms | 206.0 ms | **-11%** |
| gru_seq | 1105.1 ms | 1057.5 ms | **-4%** |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% |
| lstm_seq | 692.3 ms | 692.3 ms | 0% |
| convnet | 1298.0 ms | 1298.2 ms | 0% |

Wins 8 of 10, ties 2, zero regressions. The ties (convnet, lstm_seq) are
compute-bound -- both frameworks saturate the GPU, confirming identical
CUDA kernels. The gap appears where framework overhead matters:
dispatch-bound architectures (transformer -31%, mlp -29%), graph routing
(residual_tower -24%), and recurrent loops (feedback_fixed -16%).

**[Benchmark Report](https://github.com/fab2s/floDl/blob/main/docs/benchmark.md)** |
[Interactive dashboard](https://flodl.dev/benchmark)

## Why Rust for Deep Learning?

**Deterministic memory.** Python adds ~3-5 us of framework overhead per GPU
op. Go's GC can't manage VRAM — an [earlier Go implementation](https://github.com/fab2s/goDl)
required 5 phases of lifecycle management (refcounting, GC callbacks, VRAM
budgets, pending-free queues). Rust replaces all of that with
`impl Drop for Tensor`. Memory is freed the instant a tensor leaves scope.

**Zero-cost safety.** Every op returns `Result<T>` — no silent failures.
Ownership ensures tensors are freed exactly once. The borrow checker
prevents data races at compile time.

**Same GPU kernels.** floDl binds libtorch — the C++ library under
PyTorch. CUDA, cuBLAS, cuDNN are identical. floDl replaces the dispatch
path, autograd tracking, and graph execution.

## Features Reference

<details>
<summary><strong>Training Tools</strong></summary>

| Tool | What it does |
|------|-------------|
| `clip_grad_norm` / `clip_grad_value` | Fused gradient clipping (2 kernels total via foreach ops) |
| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoints, structural hash, partial loading, `LoadReport` |
| `migrate_checkpoint` | Remap parameter names across versions |
| `Parameter::freeze` / `unfreeze` | Per-parameter gradient control |
| `GradScaler` | Dynamic loss scaling for fp16 training |
| `cast_parameters` | Cast model parameters to any dtype |
| `CpuWorker` / `ModelSnapshot` | Background checkpoint saving |
| `CudaGraph` | Capture/replay training steps for fixed-shape models |

</details>

<details>
<summary><strong>Module Traits</strong></summary>

Beyond `forward`/`parameters`, `Module` provides optional methods the graph
recognizes automatically:

| Method | What happens |
|--------|-------------|
| `as_named_input()` | `using()` refs arrive as a named map |
| `reset()` | Loops auto-call before iterating — clears per-forward state |
| `detach_state()` | Break gradient chains on retained state |
| `sub_modules()` | Recursive device placement, training mode, parameter collection |

</details>

<details>
<summary><strong>Build Profiles</strong></summary>

```toml
# Optimize floDl in dev builds — your code stays fast to compile.
[profile.dev.package.flodl]
opt-level = 3

[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
```

| Profile | flodl | Your code | Typical rebuild |
|---------|-------|-----------|-----------------|
| `cargo build` | `-O3` (cached) | `-O0` (fast) | < 2s |
| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |

</details>

<details>
<summary><strong>Multi-GPU (DDP)</strong></summary>

| Component | What it does |
|-----------|-------------|
| `Ddp::setup` | One-liner: detect GPUs, distribute, set optimizer, train |
| `Ddp::builder` | Thread-per-GPU with Local SGD, any Module |
| `ApplyPolicy` | Sync / Cadence / Async (when to average) |
| `AverageBackend` | Nccl / Cpu (how to average, A/B testable) |
| `ElChe` | Heterogeneous GPU cadence strategy |
| `NcclComms` / `NcclRankComm` | NCCL AllReduce, Broadcast, abort handles |
| `CudaEvent` / `CudaStream` | Async GPU-CPU pipeline, timing |
| `DataLoader` | Resident/streaming/distributed, VRAM-aware prefetch, auto OOM fallback |

</details>

### Numerical Verification

Every differentiable path is verified against finite-difference gradients:
- 117 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)
- 1027 library tests, zero clippy warnings — all tests run on both CPU and CUDA

### Hardware Compatibility

Developed and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell
(RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl
links libtorch's stable C API, which supports every architecture the driver
supports. If `nvidia-smi` works, floDl trains on it.

## Documentation

### Choose your path

| Background | Start here |
|-----------|-----------|
| **New to Rust** | [Rust for PyTorch Users]https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md — 10 patterns in 15 minutes |
| **Know Rust, new to DL** | [Tensors]https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md then [Training]https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md |
| **Know PyTorch** | [Porting Guide]https://github.com/fab2s/floDl/blob/main/docs/porting.md (or `/port` with AI) then [Graph Builder]https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md |
| **Scaling to multi-GPU** | [Multi-GPU Training]https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md then [DDP Builder]https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md |
| **Just show me code** | [`quickstart`]https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/ or [`showcase`]https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/ |

### Tutorials

0. **[Rust for PyTorch Users]https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md** — 10 Rust patterns in 15 minutes
1. **[Tensors]https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md** — creation, ops, memory, CUDA
2. **[Autograd]https://github.com/fab2s/floDl/blob/main/docs/tutorials/02-autograd.md** — variables, gradients, backward
3. **[Modules]https://github.com/fab2s/floDl/blob/main/docs/tutorials/03-modules.md** — all layers, convolutions, RNNs, attention, normalization
4. **[Training]https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md** — losses, optimizers, mixed precision, full loop
5. **[Graph Builder]https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md** — fluent API from simple to complex
6. **[Advanced Graphs]https://github.com/fab2s/floDl/blob/main/docs/tutorials/06-advanced-graphs.md** — forward refs, loops, gates, switches
7. **[Visualization]https://github.com/fab2s/floDl/blob/main/docs/tutorials/07-visualization.md** — DOT/SVG, profiling heatmaps
8. **[Utilities]https://github.com/fab2s/floDl/blob/main/docs/tutorials/08-utilities.md** — checkpoints, clipping, freezing, initialization, scheduling, verbosity-gated logging
9. **[Training Monitor]https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md** — ETA, resource tracking, live dashboard
10. **[Graph Tree]https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md** — hierarchical composition, freeze/thaw, subgraph checkpoints
11. **[Multi-GPU Training]https://github.com/fab2s/floDl/blob/main/docs/tutorials/11-multi-gpu.md** — Ddp::setup, El Che, auto-balancing, DataLoader integration
12. **[DDP Builder]https://github.com/fab2s/floDl/blob/main/docs/tutorials/12-async-ddp.md** — thread-per-GPU, Local SGD, A/B testable backends
13. **[Data Loading]https://github.com/fab2s/floDl/blob/main/docs/tutorials/13-data-loading.md** — DataLoader, resident/streaming modes, VRAM-aware prefetch, DDP integration

### Examples

- [`quickstart`]https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/ — build, train, and monitor a model with residual connections
- [`sine_wave`]https://github.com/fab2s/floDl/tree/main/flodl/examples/sine_wave/ — sine regression with monitor, checkpoint round-trip
- [`mixed_precision`]https://github.com/fab2s/floDl/tree/main/flodl/examples/mixed_precision/ — float16 training with `GradScaler`
- [`transfer_learning`]https://github.com/fab2s/floDl/tree/main/flodl/examples/transfer_learning/ — checkpoint, partial load, freeze, fine-tune
- [`schedulers`]https://github.com/fab2s/floDl/tree/main/flodl/examples/schedulers/ — warmup + cosine + plateau composition
- [`observation`]https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/ — collect, flush, trend queries, early stopping
- [`showcase`]https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/ — every graph builder method in one graph

### Porting from PyTorch

- **[Porting Guide]https://github.com/fab2s/floDl/blob/main/docs/porting.md** — module mapping, FlowBuilder patterns, training loop translation
- **[AI-assisted porting]https://github.com/fab2s/floDl/tree/main/ai/skills/port/** — point any AI coding assistant at the skill guide for automated translation. With Claude Code: `/port my_model.py`
- **`fdl api-ref`** — generate a structured API reference for your flodl version. Used by AI tools and useful on its own.

### Architecture

```
+-----------------------------------------------------------+
|  User Code / Model Definitions                            |
+-----------------------------------------------------------+
|  monitor/  ETA, resource tracking, live web dashboard     |
+-----------------------------------------------------------+
|  graph/    Fluent builder, graph tree, execution, DOT/SVG |
+-----------------------------------------------------------+
|  data/     DataLoader, resident/streaming, prefetch       |
+-----------------------------------------------------------+
|  nn/       Modules, losses, optimizers, DDP, NCCL         |
+-----------------------------------------------------------+
|  autograd/ Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
|  tensor/   Owned tensors with Drop, CPU + CUDA            |
+-----------------------------------------------------------+
|  flodl-sys   FFI bindings to libtorch C++ shim            |
+-----------------------------------------------------------+
|  libtorch / CUDA / NCCL                                   |
+-----------------------------------------------------------+
```

## Story

floDl started as a question: what would a deep learning framework look like
if you designed it around Rust's ownership model instead of fighting a garbage
collector?

An [earlier attempt in Go](https://github.com/fab2s/goDl) proved the
architecture — the graph builder, the module system, the observation engine —
but hit a wall: Go's GC cannot manage GPU memory deterministically. That
required building five layers of memory management infrastructure on top of
the language, not with it.

Rust solved this at the language level. `impl Drop for Tensor` replaced
hundreds of lines of lifecycle management. The graph builder, module
composition, and design philosophy carried forward; the memory fights didn't.

## License

floDl is open-sourced software licensed under the [MIT license](https://github.com/fab2s/floDl/blob/main/LICENSE).