flodl 0.1.1

floDl — a flow-graph deep learning framework built on libtorch
<p align="center">
  <img src="docs/floDl.png" alt="floDl" width="640">
</p>

<h1 align="center">floDl</h1>

<p align="center">
A Rust-native deep learning framework built on libtorch.<br>
Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.
</p>

<p align="center">
  <a href="https://github.com/fab2s/floDl/actions"><img src="https://github.com/fab2s/floDl/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://crates.io/crates/flodl"><img src="https://img.shields.io/crates/v/flodl.svg" alt="crates.io"></a>
  <a href="https://docs.rs/flodl"><img src="https://docs.rs/flodl/badge.svg" alt="docs.rs"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT License"></a>
</p>

<p align="center">
  <a href="#getting-started">Getting Started</a> &bull;
  <a href="#the-graph-builder">Graph Builder</a> &bull;
  <a href="#training-monitor">Training Monitor</a> &bull;
  <a href="#features">Features</a> &bull;
  <a href="docs/tutorials/01-tensors.md">Tutorials</a> &bull;
  <a href="docs/pytorch_migration.md">PyTorch Migration</a> &bull;
  <a href="docs/troubleshooting.md">Troubleshooting</a> &bull;
  <a href="#architecture">Architecture</a>
</p>

---

## If You Know PyTorch, You Know floDl

<table>
<tr><th>PyTorch</th><th>floDl</th></tr>
<tr><td>

```python
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)

pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

</td><td>

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;
```

</td></tr>
</table>

Same concepts, same names, same GPU kernels underneath. The `?` operator
replaces silent failures with explicit, compiler-enforced error handling.
`Drop` replaces the garbage collector. The
[full migration guide](docs/pytorch_migration.md) covers every op, module,
and pattern.

## Getting Started

**Prerequisite:** [Docker](https://docs.docker.com/get-docker/) (no Rust or
libtorch needed on your machine — everything runs in containers).

Create a new project with one command:

```bash
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project
make build    # first build (~5 min, downloads libtorch)
make run      # train the template model
```

This generates a complete project with Dockerfiles, Makefile, and an annotated
training template. Edit `src/main.rs` to build your model.

> **New to Rust?** Read [Rust for PyTorch Users](docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes.

## The Graph Builder

floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no graph construction commands.

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                        // activation
    .through(LayerNorm::new(16)?)         // normalization
    .also(Linear::new(16, 16)?)           // residual connection
    .through(Linear::new(16, 2)?)         // output projection
    .build()?;
```

That's a trainable model. `also` adds the residual — input flows through the
Linear *and* gets added to its output. `build()` returns a `Graph` that
implements `Module` — you can nest it inside other graphs.

Things get interesting when architectures get complex:

```rust
let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;
```

Every construct — `split/merge`, `also`, `loop_body`, `gate`, `switch`, `map`,
`tag/using` — composes cleanly. Sub-graphs nest like any module. Forward
references (`using` before `tag`) carry state across calls, enabling recurrent
architectures without special-casing. Enough to express transformers,
mixture-of-experts, iterative refinement, attention with memory, or any
architecture you can draw as a data flow graph.
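
Conceptually, a forward reference is a named slot that survives between forward calls. The following standalone sketch mimics that mechanism with plain floats; it is illustrative only, and the `TagStore` type is not part of flodl's actual internals:

```rust
use std::collections::HashMap;

// Illustrative stand-in: a tiny store showing how `tag`/`using`
// can carry a value from one forward pass into the next.
struct TagStore {
    slots: HashMap<String, f64>,
}

impl TagStore {
    fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    // `using("memory")`: read the previously tagged value (0.0 on first call).
    fn using(&self, name: &str) -> f64 {
        *self.slots.get(name).unwrap_or(&0.0)
    }

    // `tag("memory")`: store this call's value for the next forward pass.
    fn tag(&mut self, name: &str, value: f64) {
        self.slots.insert(name.to_string(), value);
    }
}

fn main() {
    let mut store = TagStore::new();
    let mut outputs = Vec::new();
    for input in [1.0, 2.0, 3.0] {
        // A StateAdd-like step: output = input + previous tagged state.
        let out = input + store.using("memory");
        store.tag("memory", out);
        outputs.push(out);
    }
    assert_eq!(outputs, vec![1.0, 3.0, 6.0]);
}
```

Because the slot persists across calls, each forward pass sees the previous pass's output: the essence of a recurrent connection, without any special-cased RNN machinery.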

See the **[Graph Builder Tutorial](docs/tutorials/05-graph-builder.md)** and
the [full showcase](flodl/examples/showcase/) that exercises every builder
method.

## Training Monitor

Drop-in training monitor with adaptive ETA, system resource tracking, and a
live web dashboard — no external dependencies, no separate process.

```rust
use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?;  // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...

    monitor.log(epoch, t.elapsed(), &[("loss", loss_val), ("lr", lr)]);
}
monitor.finish();
```

Terminal output adapts automatically — duration and ETA switch between hours,
minutes, seconds, and milliseconds as needed:

```
  epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]
  epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch 100/100  loss=0.0012  [23ms]             VRAM: 2.1/6.0 GB (82%)
  training complete in 2.8s  | loss: 0.0012
```

### Live dashboard

Call `monitor.serve(port)` and open the URL in a browser. The page updates
in real time via Server-Sent Events — no polling, no WebSocket, no npm.

<p align="center">
  <a href="https://flodl.dev/benchmark">
    <img src="docs/dashboard.gif" alt="floDl live training dashboard — click for interactive version" width="800">
  </a>
</p>
<p align="center"><em><a href="https://flodl.dev/benchmark">Interactive benchmark dashboard</a> — real data from a 100-epoch training run</em></p>

The dashboard includes:

| Panel | What it shows |
|-------|--------------|
| **Header** | Epoch counter, progress bar, ETA, elapsed time |
| **Metrics chart** | All logged metrics (loss, lr, ...) as live canvas chart |
| **Resource chart** | CPU%, GPU%, RAM%, VRAM% over time |
| **Resource bars** | Current usage with values (e.g., `VRAM: 2.1/6.0 GB`) |
| **Epoch log** | Every epoch, newest first, with duration and resources |
| **Graph SVG** | Collapsible architecture diagram (via `monitor.watch(&model)`) |

Late join works — open the dashboard mid-training and it backfills all
past epochs instantly.

### Resource tracking

| Metric | Source | Availability |
|--------|--------|-------------|
| CPU % | `/proc/stat` delta | Linux |
| RAM | `/proc/meminfo` | Linux |
| GPU utilization % | NVML (dynamic `dlopen`) | NVIDIA GPU + driver |
| VRAM used/total | `cudaMemGetInfo` via FFI | CUDA builds |

Resources that aren't available are silently omitted. CPU-only builds show
CPU and RAM; CUDA builds add GPU and VRAM automatically.
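
On Linux those CPU/RAM sources are plain procfs reads. For illustration, here is a minimal standalone sketch of reading RAM figures from `/proc/meminfo` (not flodl's actual monitor code; the `meminfo_kb` helper is made up for this example):

```rust
use std::fs;

// Parse a "MemTotal:       16384 kB"-style line from /proc/meminfo into kB.
fn meminfo_kb(field: &str) -> Option<u64> {
    let text = fs::read_to_string("/proc/meminfo").ok()?;
    text.lines()
        .find(|l| l.starts_with(field))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    // Linux only; on other platforms the read fails and we print nothing,
    // mirroring the "silently omitted" behavior described above.
    if let (Some(total), Some(avail)) = (meminfo_kb("MemTotal:"), meminfo_kb("MemAvailable:")) {
        let used_pct = 100.0 * (total - avail) as f64 / total as f64;
        println!("RAM: {:.1}% of {:.1} GB", used_pct, total as f64 / 1_048_576.0);
    }
}
```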

### Export

```rust
monitor.save_html("training_report.html");  // self-contained dashboard archive
monitor.write_log("training.log")?;          // human-readable log
monitor.export_csv("training.csv")?;         // metrics + resources as CSV
```

`save_html` writes a complete dashboard at `finish()` — all metrics, resource
charts, and graph SVG baked into a single HTML file. Open it in any browser,
no server needed. Set it once before training and forget about it.

See the full **[Training Monitor Tutorial](docs/tutorials/09-monitor.md)**.

## Quick Start

### With Docker (recommended)

No Rust or libtorch needed — everything runs in containers:

```bash
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project && make run
```

### Without Docker

**Requirements:** Rust 1.85+ and [libtorch](https://pytorch.org/get-started/locally/)
(C++/libtorch variant).

```bash
cargo add flodl
```

Set `LIBTORCH_PATH` to your libtorch directory and `LD_LIBRARY_PATH` to
include `$LIBTORCH_PATH/lib`. For CUDA, also set `CUDA_HOME` and enable
the feature: `cargo add flodl --features cuda`.
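
For example, with libtorch unpacked under your home directory (the paths below are placeholders; point them at your actual install):

```shell
# Adjust to wherever you unpacked libtorch.
export LIBTORCH_PATH="$HOME/libtorch"
export LD_LIBRARY_PATH="$LIBTORCH_PATH/lib:${LD_LIBRARY_PATH:-}"

# Only needed for CUDA builds.
export CUDA_HOME=/usr/local/cuda
```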

See [libtorch downloads](https://pytorch.org/get-started/locally/) (pick the
C++/libtorch variant) and [CUDA toolkit](https://developer.nvidia.com/cuda-downloads)
if you need GPU support.

**Develop floDl itself:**
```bash
git clone https://github.com/fab2s/floDl.git
cd floDl
make image      # build dev container (Rust + libtorch)
make test       # run all tests (CPU)
make cuda-test  # run all tests on CUDA (requires NVIDIA GPU)
make test-all   # CPU first, then CUDA if a GPU is available
make clippy     # lint
make shell      # interactive shell in container
```

### Train a model

```rust
use flodl::*;

// Build the model.
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

// Set up training.
let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);
model.train();

// Training loop.
for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}
```

## Features

### Core Stack

| Layer | What it does |
|-------|-------------|
| **Tensor** | Owned RAII tensors with `Drop`, `Clone`. CPU and CUDA. |
| **Autograd** | Reverse-mode automatic differentiation. Full backward for every op. |
| **NN Modules** | `Linear`, `Conv2d`, `ConvTranspose2d`, `LayerNorm`, `BatchNorm`/`BatchNorm2d`, `Dropout`, `Dropout2d`, `Embedding`, `GRUCell`, `LSTMCell` |
| **Activations** | `Identity`, `ReLU`, `Sigmoid`, `Tanh`, `GELU`, `SiLU` |
| **Losses** | `mse_loss`, `cross_entropy_loss`, `bce_with_logits_loss`, `l1_loss`, `smooth_l1_loss`, `kl_div_loss` |
| **Optimizers** | `SGD` (with momentum), `Adam`, `AdamW` — all support parameter groups for per-group LR |
| **LR Scheduling** | `StepDecay`, `CosineScheduler`, `WarmupScheduler` (composable), `PlateauScheduler` |
| **Mixed Precision** | `Float16`/`BFloat16` dtype casting, `GradScaler` for loss scaling |
| **Monitor** | Human-readable ETA, CPU/GPU/RAM/VRAM tracking, live web dashboard |

### Graph Builder

| Method | What it does |
|--------|-------------|
| `from(m).through(m)` | Linear chain |
| `fork(m)` | Side branch: runs module, captures output as tag, stream continues unchanged |
| `input(names)` | Auxiliary graph inputs, accessible via `using(name)` — multi-input graphs |
| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |
| `also(m)` | Residual connection: `input + m(input)` |
| `tag(name)` / `using(refs)` | Named references — backward (same pass) or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond(cond, max)` | Condition before body (0..max iterations) |
| `loop_body(body).until_cond(cond, max)` | Condition after body (1..max iterations) |
| `gate(router, modules![...])` | Soft routing — all experts execute, weighted combination |
| `switch(selector, modules![...])` | Hard routing — only selected branch executes |
| `map(body).each()` | Apply body to each element along dim 0 |
| `map(body).over(tag)` | Iterate over a tagged tensor |
| `map(body).slices(n)` | Decompose last dim into n slices, map, recompose |
| `.batched()` | Fast path for Map — full batch in one call |
| `tag_group(name)` | Name parallel branches: `split(...).tag_group("head")` |

### Training Tools

| Tool | What it does |
|------|-------------|
| `clip_grad_norm` | L2 norm gradient clipping |
| `clip_grad_value` | Element-wise gradient clamping |
| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoint with partial loading, persists parameters + buffers, structural hash validation, `LoadReport` (file path or `Write`/`Read`) |
| `Parameter::freeze` / `unfreeze` | Disable/enable gradient tracking per parameter |
| `xavier_uniform/normal` | Weight initialization (also `kaiming_*` via `nn::init`) |
| LR schedulers | `StepDecay`, `CosineScheduler`, `WarmupScheduler`, `PlateauScheduler` (composable) |
| `GradScaler` | Dynamic loss scaling for mixed precision (float16) training |
| `cast_parameters` | Cast model parameters to any dtype |
| **Background** | `CpuWorker` (work queue), `ModelSnapshot` / `snapshot_cpu()` — offload checkpoints & eval to a background thread |

### Module Traits

Beyond the core `forward`/`parameters` methods, `Module` provides optional
methods that the graph recognizes automatically:

| Method | Default | What happens |
|--------|---------|-------------|
| `as_named_input()` | `None` | Returns `&dyn NamedInputModule` — loop and node `using()` refs arrive as a named map |
| `reset()` | no-op | Loops auto-call before iterating — clears per-forward state |
| `detach_state()` | no-op | `graph.detach_state()` propagates — breaks gradient chains on retained state |

Stateful modules just override `reset()` and/or `detach_state()` directly —
no separate trait impls needed. Modules that own child modules implement
`sub_modules()` for recursive device placement, training mode, and parameter
collection.
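
To make the shape concrete, here is a standalone sketch of a stateful module overriding `reset()`. The trait below is a simplified stand-in for illustration, not flodl's actual `Module` definition (which also covers parameters, device placement, `sub_modules()`, and more):

```rust
// Simplified stand-in for the Module trait described above.
trait Module {
    fn forward(&mut self, x: f64) -> f64;
    fn reset(&mut self) {}        // default: no-op
    fn detach_state(&mut self) {} // default: no-op
}

// A running-sum module that retains state across forward calls.
struct Accumulator {
    state: f64,
}

impl Module for Accumulator {
    fn forward(&mut self, x: f64) -> f64 {
        self.state += x;
        self.state
    }
    // A loop construct would call this before iterating,
    // clearing per-forward state.
    fn reset(&mut self) {
        self.state = 0.0;
    }
}

fn main() {
    let mut m = Accumulator { state: 0.0 };
    assert_eq!(m.forward(2.0), 2.0);
    assert_eq!(m.forward(3.0), 5.0);
    m.reset(); // as a loop would do before its next run
    assert_eq!(m.forward(1.0), 1.0);
}
```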

### Observation & Trends

Tags double as observation points — collect metrics during training, flush
to epoch history, and query trends to drive training decisions:

```rust
for epoch in 0..num_epochs {
    for (input, target) in &batches {
        let pred = graph.forward(&input)?;
        graph.collect(&["hidden"])?;                 // from graph tag

        let loss = mse_loss(&pred, &target)?;
        graph.record_scalar("loss", loss.item()?);   // external metric
    }
    graph.flush(&["hidden", "loss"]);

    if graph.trend("loss").stalled(5, 1e-4) {
        // decay learning rate
    }
}
```

| Method | What it does |
|--------|-------------|
| `g.tagged(tag)` | Access a tagged node's output after forward |
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric collection |
| `g.record_scalar(tag, value)` | Inject external metrics |
| `g.trend(tag)` | Epoch-level trend: `slope`, `stalled`, `improving`, `converged` |
| `g.trends(tags)` | Group trends: `all_improving`, `any_stalled`, `mean_slope` |
| `g.end_step()` / `g.end_epoch()` | Training housekeeping |

### Visualization

```rust
println!("{}", g.dot());                       // Graphviz DOT with parameter counts
let svg = g.svg(Some("model.svg"))?;          // render to SVG

// Timing-annotated: nodes colored green->yellow->red by execution time.
g.enable_profiling();
g.forward(&input)?;
g.svg_with_profile(Some("profile.svg"))?;

// Training curves as self-contained HTML.
g.plot_html("training.html", &["loss", "head"])?;
g.export_trends("metrics.csv", &["loss"])?;
```

### Numerical Verification

Every differentiable path is verified against finite-difference gradients:
- 37 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW)
- 329 library tests, zero clippy warnings — all tests run on both CPU and CUDA
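
The idea behind those checks, in miniature: compare an analytic derivative against a central finite difference. (A standalone illustration; the library's checks run over real ops and tensors.)

```rust
// Central-difference gradient check for f(x) = x^3.
fn f(x: f64) -> f64 {
    x * x * x
}

// Analytic derivative: d/dx x^3 = 3x^2.
fn analytic_grad(x: f64) -> f64 {
    3.0 * x * x
}

// Numeric derivative: (f(x+h) - f(x-h)) / 2h, accurate to O(h^2).
fn numeric_grad(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    let (x, h) = (2.0, 1e-5);
    let diff = (analytic_grad(x) - numeric_grad(f, x, h)).abs();
    assert!(diff < 1e-6);
}
```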

## Why Rust for Deep Learning?

### The memory management problem

Python adds ~3–5 µs of framework overhead to every GPU operation. For
architectures built on many small sequential operations — recurrent steps,
iterative refinement, multi-head attention — this overhead dominates.

Go solves the dispatch overhead with compiled binaries and goroutines, but
Go's garbage collector cannot manage VRAM deterministically. GPU memory lives
in libtorch's C++ allocator — invisible to Go's GC. An earlier Go
implementation required a 5-phase memory management system: atomic refcounting,
saved-tensor lifecycle, GC callbacks, VRAM budgets, and autograd Scope.
Hundreds of lines of `runtime.KeepAlive`, `Retain()`/`Release()`, and
pending-free queues.

Rust's ownership model eliminates all of this. `Tensor` owns a C++ handle.
`Drop` frees it immediately when it goes out of scope. No GC, no finalizers,
no reference counting, no VRAM budget heuristics, no KeepAlive. Five phases
of memory management infrastructure replaced by a single `impl Drop for Tensor`.
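
A standalone sketch of that pattern, where the handle and the "free" side are stand-ins for the real libtorch FFI calls:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for the C++ allocator: counts live "tensor" handles.
static LIVE: AtomicUsize = AtomicUsize::new(0);

struct Tensor {
    _handle: usize, // would be a raw pointer into libtorch in reality
}

impl Tensor {
    fn new() -> Self {
        let _handle = LIVE.fetch_add(1, Ordering::SeqCst);
        Tensor { _handle }
    }
}

impl Drop for Tensor {
    // Freed exactly once, deterministically, at end of scope;
    // the real impl calls the libtorch shim's free function here.
    fn drop(&mut self) {
        LIVE.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _a = Tensor::new();
        let _b = Tensor::new();
        assert_eq!(LIVE.load(Ordering::SeqCst), 2);
    } // both dropped here: no GC, no finalizer queue, no KeepAlive
    assert_eq!(LIVE.load(Ordering::SeqCst), 0);
}
```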

### Zero-cost safety

Rust's type system catches errors at compile time that other languages defer
to runtime:

- **Ownership**: tensors are freed exactly once, exactly when no longer needed
- **Result types**: every fallible operation returns `Result<T>` — no silent
  error propagation, no nil pointer panics
- **No data races**: the borrow checker prevents concurrent mutation bugs
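
For instance, a fallible op in this style returns `Result`, and `?` forces the caller to acknowledge the failure path. The types below are illustrative, not flodl's:

```rust
// Illustrative: a shape-checked elementwise add in the Result style above.
#[derive(Debug, PartialEq)]
struct ShapeError(String);

fn add(a: &[f64], b: &[f64]) -> Result<Vec<f64>, ShapeError> {
    if a.len() != b.len() {
        return Err(ShapeError(format!(
            "shape mismatch: {} vs {}",
            a.len(),
            b.len()
        )));
    }
    Ok(a.iter().zip(b).map(|(x, y)| x + y).collect())
}

fn main() -> Result<(), ShapeError> {
    // `?` propagates the error upward instead of failing silently.
    let sum = add(&[1.0, 2.0], &[3.0, 4.0])?;
    assert_eq!(sum, vec![4.0, 6.0]);

    // A mismatch is an Err the compiler makes you handle, not a crash.
    assert!(add(&[1.0], &[1.0, 2.0]).is_err());
    Ok(())
}
```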

### Same GPU kernels

floDl binds libtorch — the same C++ library that powers PyTorch. The actual
GPU math (CUDA kernels, cuBLAS, cuDNN) is identical. floDl replaces everything
above: the dispatch path, autograd tracking, module composition, and graph
execution.

## Performance

floDl runs the same CUDA kernels as PyTorch — the performance difference comes
from what happens *between* kernel launches: dispatch overhead, autograd
bookkeeping, and memory management. Rust eliminates Python's per-op overhead
and the GC pauses that plague Go.

Measured on a real training workload (FBRL letter recognition — recurrent
attention with a 9-component loss stack), same model, same data, same GPU:

| Metric | PyTorch 2.5.1 | flodl | Delta |
|--------|--------------|-------|-------|
| Avg epoch | 50.1s | 42.1s | **-16%** |
| GPU utilization | ~80% (spiky) | 88-92% (flat) | more stable |
| VRAM | 2,805 MB | 2,977 MB | +6%* |

\* Static libtorch linkage + monitor thread + gzip checkpoint compression.

Full methodology, raw data, and reproduction commands:
**[Benchmark Report](docs/benchmark.md)** |
[Raw artifacts](https://github.com/fab2s/fbrl/tree/102225b) (both sides, committed)

### Build profiles

Add this to your project's `Cargo.toml` to get optimized floDl with fast
recompilation of your own code:

```toml
# Optimize floDl in dev builds — your code stays fast to compile.
# After the first build, only your graph code recompiles.
[profile.dev.package.flodl]
opt-level = 3

[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
```

| Profile | flodl | Your code | Typical rebuild |
|---------|-------|-----------|-----------------|
| `cargo build` | `-O3` (cached) | `-O0` (fast) | < 2s |
| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |

The GPU kernels (cuBLAS, cuDNN) run at the same speed regardless of Rust
optimization level — the profile settings affect graph dispatch, autograd
bookkeeping, and module overhead.

## Hardware Compatibility

floDl is developed and tested on an NVIDIA GTX 1060 (6 GB VRAM, Pascal
architecture). It works out of the box — no version pinning, no feature
flags, no workarounds.

This matters because PyTorch dropped Pascal support after version 2.5.1.
Training on older GPUs now requires pinning `torch==2.5.1` and hoping
nothing in your dependency tree pulls a newer version. floDl sidesteps
this entirely: it links against libtorch's stable C API, which continues
to support every CUDA architecture that the driver supports.

If your GPU runs `nvidia-smi`, floDl can train on it.

## Architecture

```
+-----------------------------------------------------------+
|  User Code / Model Definitions                            |
+-----------------------------------------------------------+
|  monitor/  ETA, resource tracking, live web dashboard     |
+-----------------------------------------------------------+
|  graph/    Fluent builder, execution, DOT/SVG             |
+-----------------------------------------------------------+
|  nn/       Modules, losses, optimizers, checkpoints       |
+-----------------------------------------------------------+
|  autograd/ Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
|  tensor/   Owned tensors with Drop, CPU + CUDA            |
+-----------------------------------------------------------+
|  flodl-sys   FFI bindings to libtorch C++ shim            |
+-----------------------------------------------------------+
|  libtorch / CUDA / CPU                                    |
+-----------------------------------------------------------+
```

floDl is developed and tested on **NVIDIA CUDA** (Pascal and newer) and
**CPU**. Since floDl binds libtorch — not CUDA directly — additional backends
(AMD ROCm, Apple MPS, Intel XPU) are architecturally possible but not yet
exposed or tested. Contributions welcome — see [CONTRIBUTING.md](CONTRIBUTING.md).

## Documentation

### Choose your path

| Background | Start here |
|-----------|-----------|
| **New to Rust** | [Rust for PyTorch Users](docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes |
| **Know Rust, new to DL** | [Tensors](docs/tutorials/01-tensors.md) then [Training](docs/tutorials/04-training.md) |
| **Know PyTorch** | [PyTorch Migration Guide](docs/pytorch_migration.md) then [Graph Builder](docs/tutorials/05-graph-builder.md) |
| **Just show me code** | [`quickstart`](flodl/examples/quickstart/) or [`showcase`](flodl/examples/showcase/) |

### Tutorials

Step-by-step guides from basics to advanced, each with code examples:

0. **[Rust for PyTorch Users](docs/tutorials/00-rust-primer.md)** — 10 Rust patterns in 15 minutes (new to Rust? start here)
1. **[Tensors](docs/tutorials/01-tensors.md)** — creation, ops, error handling, memory
2. **[Autograd](docs/tutorials/02-autograd.md)** — variables, gradients, backward pass
3. **[Modules](docs/tutorials/03-modules.md)** — Linear, Conv2d, normalization, RNN cells
4. **[Training](docs/tutorials/04-training.md)** — losses, optimizers, full training loop
5. **[Graph Builder](docs/tutorials/05-graph-builder.md)** — the fluent API from simple to complex
6. **[Advanced Graphs](docs/tutorials/06-advanced-graphs.md)** — forward refs, loops, gates, switches
7. **[Visualization](docs/tutorials/07-visualization.md)** — DOT/SVG output, reading diagrams
8. **[Utilities](docs/tutorials/08-utilities.md)** — checkpoints, clipping, freezing, initialization
9. **[Training Monitor](docs/tutorials/09-monitor.md)** — ETA, resource tracking, live web dashboard

### Design

- [Benchmark](docs/benchmark.md) — flodl vs PyTorch head-to-head with raw data
- [Roadmap](docs/design/roadmap.md) — development plan and port status
- [Trajectory Thesis](docs/design/trajectory-thesis.md) — geometric intuition behind the project

### Examples

- [`quickstart`](flodl/examples/quickstart/) — build, train, and monitor a model with residual connections
- [`sine_wave`](flodl/examples/sine_wave/) — sine regression with monitor, checkpoint round-trip
- [`mixed_precision`](flodl/examples/mixed_precision/) — float16 training with `GradScaler`
- [`transfer_learning`](flodl/examples/transfer_learning/) — checkpoint, partial load, freeze, fine-tune
- [`schedulers`](flodl/examples/schedulers/) — warmup + cosine + plateau composition
- [`observation`](flodl/examples/observation/) — collect, flush, trend queries, early stopping
- [`showcase`](flodl/examples/showcase/) — every graph builder method in one graph

## Story

floDl started as a question: what would a deep learning framework look like
if you designed it around Rust's ownership model instead of fighting a garbage
collector?

An [earlier attempt in Go](https://github.com/fab2s/goDl) proved the
architecture — the graph builder, the module system, the observation engine —
but hit a wall: Go's GC cannot manage GPU memory deterministically. That
required building five layers of memory management infrastructure on top of
the language, not with it.

Rust solved this at the language level. `impl Drop for Tensor` replaced
hundreds of lines of lifecycle management. The graph builder, module
composition, and design philosophy carried forward; the memory fights didn't.

## License

floDl is open-sourced software licensed under the [MIT license](./LICENSE).