<p align="center">
<img src="https://raw.githubusercontent.com/fab2s/floDl/main/docs/floDl.png" alt="floDl" width="640">
</p>
<h1 align="center">floDl</h1>
<p align="center">
A Rust-native deep learning framework built on libtorch.<br>
Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.
</p>
<p align="center">
<a href="https://flodl.dev"><img src="https://img.shields.io/badge/web-flodl.dev-6c8cff" alt="Website"></a>
<a href="https://github.com/fab2s/floDl/actions"><img src="https://github.com/fab2s/floDl/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
<a href="https://crates.io/crates/flodl"><img src="https://img.shields.io/crates/v/flodl.svg" alt="crates.io"></a>
<a href="https://docs.rs/flodl"><img src="https://docs.rs/flodl/badge.svg" alt="docs.rs"></a>
<a href="https://github.com/fab2s/floDl/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT License"></a>
</p>
<p align="center">
<a href="#if-you-know-pytorch-you-know-flodl">PyTorch Users</a> •
<a href="#getting-started">Getting Started</a> •
<a href="#the-graph-builder">Graph Builder</a> •
<a href="#graph-tree-hierarchical-composition">Graph Tree</a> •
<a href="#the-training-experience">Training</a> •
<a href="#pytorch-parity">Parity</a> •
<a href="#performance">Benchmarks</a> •
<a href="https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md">Migration Guide</a>
</p>
---
## If You Know PyTorch, You Know floDl
<table>
<tr><th>PyTorch</th><th>floDl</th></tr>
<tr><td>

```python
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)
pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

</td><td>

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;
let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;
```

</td></tr>
</table>
Same concepts, same names, same GPU kernels underneath. The `?` operator
replaces silent failures with explicit, compiler-enforced error handling.
`Drop` replaces the
garbage collector. The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) covers
every op, module, and pattern.
> **New to Rust?** Read [Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes.
## Getting Started
**With Docker** (no Rust or libtorch needed):
```bash
make build # first build (~5 min, downloads libtorch)
make run # train the template model
```
**Without Docker** — [Rust](https://rustup.rs/) 1.85+ and libtorch:
```bash
cargo add flodl   # auto-detects CPU or CUDA
```
For CUDA: `cargo add flodl --features cuda` + [CUDA toolkit](https://developer.nvidia.com/cuda-downloads).
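If you prefer editing `Cargo.toml` by hand, the equivalent entry looks roughly like this (the version requirement below is illustrative — check crates.io for the current release):

```toml
[dependencies]
# Default build targets CPU; the "cuda" feature enables GPU support.
flodl = { version = "0.1", features = ["cuda"] }
```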
Both paths generate an annotated training template. Edit `src/main.rs` to
build your model:
```rust
use flodl::*;
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?) // residual connection
    .through(Linear::new(16, 2)?)
    .build()?;

let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);

model.train();
for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}
```
## The Graph Builder
floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no `nn.Module` subclassing.
```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                // activation
    .through(LayerNorm::new(16)?) // normalization
    .also(Linear::new(16, 16)?)   // residual connection
    .through(Linear::new(16, 2)?) // output projection
    .build()?;
```
`build()` returns a `Graph` that implements `Module` — you can nest it
inside other graphs. Things get interesting when architectures get complex:
```rust
let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;
```
Every construct — `split/merge`, `also`, `loop_body`, `gate`, `switch`, `map`,
`tag/using` — composes cleanly. Forward references (`using` before `tag`) carry
state across calls, enabling recurrent architectures without special-casing.

| Construct | Meaning |
|---|---|
| `from(m).through(m)` | Linear chain |
| `also(m)` | Residual: `input + m(input)` |
| `fork(m)` | Side branch: capture output as tag, stream continues |
| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |
| `tag(name)` / `using(refs)` | Named references — backward or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond` / `until_cond` | Conditional loops |
| `gate(router, modules![...])` | Soft routing — weighted combination |
| `switch(selector, modules![...])` | Hard routing — only selected branch |
| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |
| `input(names)` | Auxiliary graph inputs for multi-input architectures |
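The forward-reference row above is the building block for recurrence. A minimal sketch, with illustrative layer sizes (`StateAdd` is borrowed from the showcase graph earlier in this section):

```rust
// Hypothetical recurrent cell: `using(&["state"])` reads the value that
// `tag("state")` will write, so each forward call sees the previous state.
let cell = FlowBuilder::from(Linear::new(8, 8)?)
    .through(StateAdd).using(&["state"]).tag("state") // forward reference
    .through(Linear::new(8, 2)?)
    .build()?;
```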
See the **[Graph Builder Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** and
the [full showcase](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/).
## Graph Tree: Hierarchical Composition
This is where floDl goes beyond PyTorch. Graphs nest inside graphs with
**label-path addressing** — dot-separated paths that let you reach into any
subgraph from the root. Train components independently, compose them into
larger architectures, and control training phases declaratively.
```rust
// Build components independently
let scan = FlowBuilder::from(scan_net).tag("hidden")
    .label("scan").build()?;
let read = FlowBuilder::from(read_net).tag("confidence")
    .label("read").build()?;
let encoder = FlowBuilder::from(scan)
    .through(read)
    .label("encoder").build()?;

// Compose into full model
let model = FlowBuilder::from(encoder)
    .through(classifier)
    .build()?;
```
### Dotted paths reach anywhere
Every tag and subgraph is addressable through dotted paths from the root:
```rust
model.validate_path("encoder")?; // -> Subgraph
model.validate_path("encoder.scan.hidden")?; // -> Tag (three levels deep)
model.validate_path("encoder.read.confidence")?; // -> Tag
```
### Declarative training phases
Freeze and thaw entire subtrees by path — no manual parameter iteration:
```rust
// Phase 1: train only the classifier, encoder is frozen
model.freeze("encoder")?;
let fresh_params = model.parameters(); // only unfrozen params
let mut opt = Adam::new(&fresh_params, 1e-3);
// ... train ...
// Phase 2: thaw scan, keep read frozen (it's proven)
model.thaw("encoder.scan")?;
let mut opt = Adam::with_groups()
.group(&model.parameters_at("encoder.scan")?, 1e-4) // low LR
.group(&model.parameters_at("classifier")?, 1e-3)
.build();
```
### Subgraph checkpoints
Train a component standalone, save it, load it into a larger model:
```rust
// Pre-trained encoder saved earlier
encoder.save_checkpoint("encoder_v1.fdl.gz")?;
// Load into the composed model — namespace + hash validated
model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;
model.freeze("encoder.read")?; // lock what's proven
```
### Cross-boundary observation
Metrics flow up through the tree automatically:
```rust
model.record_at("encoder.scan.loss", scan_loss)?;
model.record_at("encoder.read.accuracy", read_acc)?;
model.record_scalar("total_loss", total)?;
model.flush(&[]); // single call flushes the entire tree
// Trends across boundaries — drive training decisions
if model.trend_at("encoder.scan.loss")?.stalled(10, 1e-4) {
model.thaw("encoder.read")?; // scan stalled, unfreeze read
}
// Monitor sees all metrics with dotted names automatically
monitor.log(epoch, elapsed, &model);
// -> total_loss, encoder.scan.loss, encoder.read.accuracy
```
This is progressive model composition: each component is trained and
validated independently before becoming a building block in a larger
architecture. Checkpoints, metrics, and training phases compose just like
the graphs themselves.
See the full **[Graph Tree Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)**.
## The Training Experience
### Training Monitor
Drop-in monitor with adaptive ETA, resource tracking, and a live web
dashboard — no external dependencies, no separate process.
```rust
use flodl::monitor::Monitor;
let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?; // optional: live dashboard at http://localhost:3000
for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...
    monitor.log(epoch, t.elapsed(), &model); // sees entire graph tree
}
monitor.finish();
```
```
epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]
epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)
epoch 100/100  loss=0.0012  [23ms]            VRAM: 2.1/6.0 GB (82%)
```
<p align="center">
<a href="https://flodl.dev/benchmark">
<img src="https://raw.githubusercontent.com/fab2s/floDl/main/docs/dashboard.gif" alt="floDl live training dashboard — click for interactive version" width="800">
</a>
</p>
<p align="center"><em><a href="https://flodl.dev/benchmark">Interactive benchmark dashboard</a> — real data from a 100-epoch training run</em></p>
The live dashboard updates via Server-Sent Events (no WebSocket, no npm),
tracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and
all past epochs backfill instantly.
```rust
monitor.save_html("training_report.html"); // self-contained archive
monitor.export_csv("training.csv")?; // for external analysis
```
### Observation and Trend Queries
Tags double as observation points. Collect metrics during training and use
trend queries to make programmatic training decisions:
```rust
for epoch in 0..num_epochs {
    for (input, target) in &batches {
        let pred = graph.forward(&input)?;
        let loss = mse_loss(&pred, &target)?;
        graph.collect(&["hidden"])?;               // from graph tag
        graph.record_scalar("loss", loss.item()?); // external metric
    }
    graph.flush(&["hidden", "loss"]);

    // Programmatic training control
    if graph.trend("loss").stalled(5, 1e-4) {
        optimizer.set_lr(optimizer.lr() * 0.5); // decay LR
    }
    if graph.trend("loss").converged(5, 1e-5) {
        break; // early stopping
    }
}
```

| Call | Purpose |
|---|---|
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric aggregation |
| `g.record_scalar(tag, value)` | Inject external metrics (loss, accuracy) |
| `g.trend(tag).slope(n)` | OLS slope over last n epochs |
| `g.trend(tag).stalled(n, tol)` | Is \|slope\| below tolerance? |
| `g.trend(tag).improving(n)` | Is loss decreasing? |
| `g.trend(tag).converged(n, tol)` | Is variance below tolerance? |
| `g.trends(tags).all_improving(n)` | Group queries across branches |
### Visualization
```rust
let svg = g.svg(Some("model.svg"))?; // architecture diagram
g.svg_with_profile(Some("profile.svg"))?; // timing heatmap
g.plot_html("training.html", &["loss", "head"])?; // interactive curves
```
See the **[Training Monitor Tutorial](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** and
the **[Observation example](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/)**.
## PyTorch Parity
floDl covers the modules, losses, and optimizers you actually use:

| Area | Count | Coverage |
|---|---|---|
| **NN Modules** | 30+ | `Linear`, `Conv1d`/`2d`/`3d` + transpose, `GRU`/`LSTM`, `MultiheadAttention`, `Bilinear`, all norms (`Layer`/`RMS`/`Group`/`Batch`/`Instance`), all pooling, `Embedding`/`EmbeddingBag`, `PixelShuffle`, `Upsample`, `Unfold`/`Fold` |
| **Activations** | 17 | `ReLU`, `LeakyReLU`, `ELU`, `GELU`, `SiLU`, `Mish`, `SELU`, `Softplus`, `Hardswish`, `PReLU`, `Softmax`, ... |
| **Losses** | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |
| **Optimizers** | 7 | `SGD`, `Adam`, `AdamW`, `RMSprop`, `Adagrad`, `RAdam`, `NAdam` — all with parameter groups |
| **Schedulers** | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |
| **Init** | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |
| **Tensor Ops** | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |
| **Autograd** | 90+ | Differentiable backward for every op above |
Fused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient
clipping via foreach ops. Mixed precision with `AutocastGuard` + `GradScaler`.
CUDA Graphs for replay-based training.
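A hedged sketch of how `AutocastGuard` and `GradScaler` slot into a training step — the two type names come from the feature list above, but the method names (`scale`, `step`, `update`) mirror the PyTorch convention and are assumptions here, not confirmed signatures:

```rust
// Sketch only: assumes a PyTorch-style scale/step/update API on GradScaler.
let mut scaler = GradScaler::new();
{
    let _autocast = AutocastGuard::new(); // fp16 compute inside this scope
    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;
    scaler.scale(&loss)?.backward()?;     // backprop through the scaled loss
}
scaler.step(&mut optimizer)?;             // unscale grads, skip step on overflow
scaler.update();                          // adapt the loss scale
```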
The [full migration guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) has side-by-side
code for every op, module, and pattern.
## Performance
Same CUDA kernels as PyTorch — the difference comes from what happens
*between* kernel launches. Seven models, ten interleaved rounds, locked GPU
clocks (RTX 5060 Ti, v0.1.3 vs PyTorch 2.6.0):

| Model | PyTorch 2.6.0 | floDl v0.1.3 | Δ | ±PyTorch | ±floDl |
|---|---|---|---|---|---|
| mlp | 271.0 ms | 188.5 ms | **-30%** | ±10.1 | ±2.9 |
| convnet | 1189.4 ms | 1190.5 ms | +0% | ±2.7 | ±1.0 |
| gru_seq | 1015.3 ms | 949.7 ms | **-6%** | ±222.4 | ±10.8 |
| residual_tower | 371.3 ms | 278.6 ms | **-25%** | ±25.9 | ±3.6 |
| gated_routing | 222.6 ms | 196.9 ms | **-12%** | ±13.8 | ±2.6 |
| iterative_refine | 208.7 ms | 186.7 ms | **-11%** | ±27.2 | ±5.6 |
| feedback_fixed | 250.2 ms | 207.2 ms | **-17%** | ±27.3 | ±8.7 |
floDl wins 6 of 7 on speed, with 3-20x tighter run-to-run variance across
every model. The convnet tie shows both frameworks dispatch identical CUDA
kernels — the gap on the other models comes from Rust eliminating Python's
per-op dispatch overhead.
**[Benchmark Report](https://github.com/fab2s/floDl/blob/main/docs/benchmark.md)** |
[Interactive dashboard](https://flodl.dev/benchmark)
## Why Rust for Deep Learning?
**Deterministic memory.** Python adds ~3-5 µs of framework overhead per GPU
op. Go's GC can't manage VRAM — an [earlier Go implementation](https://github.com/fab2s/goDl)
required 5 phases of lifecycle management (refcounting, GC callbacks, VRAM
budgets, pending-free queues). Rust replaces all of that with
`impl Drop for Tensor`. Memory is freed the instant a tensor leaves scope.
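A small illustration of what that means in practice — the constructor and device-placement calls below are written in the spirit of the tensor tutorial and are illustrative, not exact signatures:

```rust
// Illustrative: VRAM is released the moment the block ends, via Drop —
// no collector pass, no pending-free queue.
{
    let activations = Tensor::randn(&[4096, 4096])?.to_device(Device::Cuda(0))?;
    let _scores = activations.relu()?;
}   // <- both tensors (and their GPU memory) are freed right here
```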
**Zero-cost safety.** Every op returns `Result<T>` — no silent failures.
Ownership ensures tensors are freed exactly once. The borrow checker
prevents data races at compile time.
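For instance, a shape mismatch comes back as an `Err` value you handle at the call site — a sketch, with the variable names being illustrative:

```rust
// mse_loss returns a Result: propagate with `?` or handle it in place.
match mse_loss(&pred, &mismatched_target) {
    Ok(loss) => println!("loss = {}", loss.item()?),
    Err(e) => eprintln!("shape mismatch surfaced as an Err: {e}"),
}
```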
**Same GPU kernels.** floDl binds libtorch — the C++ library under
PyTorch. CUDA, cuBLAS, cuDNN are identical. floDl replaces the dispatch
path, autograd tracking, and graph execution.
## Features Reference
<details>
<summary><strong>Training Tools</strong></summary>

| Tool | Description |
|---|---|
| `clip_grad_norm` / `clip_grad_value` | Fused gradient clipping (2 kernels total via foreach ops) |
| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoints, structural hash, partial loading, `LoadReport` |
| `migrate_checkpoint` | Remap parameter names across versions |
| `Parameter::freeze` / `unfreeze` | Per-parameter gradient control |
| `GradScaler` | Dynamic loss scaling for fp16 training |
| `cast_parameters` | Cast model parameters to any dtype |
| `CpuWorker` / `ModelSnapshot` | Background checkpoint saving |
| `CudaGraph` | Capture/replay training steps for fixed-shape models |
</details>
<details>
<summary><strong>Module Traits</strong></summary>
Beyond `forward`/`parameters`, `Module` provides optional methods the graph
recognizes automatically:

| Method | Behavior |
|---|---|
| `as_named_input()` | `using()` refs arrive as a named map |
| `reset()` | Loops auto-call before iterating — clears per-forward state |
| `detach_state()` | Break gradient chains on retained state |
| `sub_modules()` | Recursive device placement, training mode, parameter collection |
</details>
<details>
<summary><strong>Build Profiles</strong></summary>
```toml
# Optimize floDl in dev builds — your code stays fast to compile.
[profile.dev.package.flodl]
opt-level = 3
[profile.dev.package.flodl-sys]
opt-level = 3
# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
```

| Command | floDl | Your code | Rebuild |
|---|---|---|---|
| `cargo build` | `-O3` (cached) | `-O0` (fast) | < 2s |
| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |
</details>
### Numerical Verification
Every differentiable path is verified against finite-difference gradients:
- 117 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)
- 769 library tests, zero clippy warnings — all tests run on both CPU and CUDA
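The idea behind a finite-difference check, as a scalar sketch (this shows the concept, not floDl's actual test harness):

```rust
// Central difference: df/dx ≈ (f(x + eps) - f(x - eps)) / (2 * eps).
// The analytic gradient from backward() must agree within tolerance.
fn agrees_with_numeric<F: Fn(f64) -> f64>(f: F, x: f64, analytic: f64) -> bool {
    let eps = 1e-5;
    let numeric = (f(x + eps) - f(x - eps)) / (2.0 * eps);
    (numeric - analytic).abs() < 1e-4
}

// e.g. for f(x) = x * x, autograd should report 2x:
assert!(agrees_with_numeric(|x| x * x, 3.0, 6.0));
```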
### Hardware Compatibility
Developed and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell
(RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl
links libtorch's stable C API, which supports every architecture the driver
supports. If `nvidia-smi` works, floDl trains on it.
## Documentation
### Choose your path

| Background | Start here |
|---|---|
| **New to Rust** | [Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes |
| **Know Rust, new to DL** | [Tensors](https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md) then [Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md) |
| **Know PyTorch** | [Migration Guide](https://github.com/fab2s/floDl/blob/main/docs/pytorch_migration.md) then [Graph Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md) |
| **Just show me code** | [`quickstart`](https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/) or [`showcase`](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/) |
### Tutorials
0. **[Rust for PyTorch Users](https://github.com/fab2s/floDl/blob/main/docs/tutorials/00-rust-primer.md)** — 10 Rust patterns in 15 minutes
1. **[Tensors](https://github.com/fab2s/floDl/blob/main/docs/tutorials/01-tensors.md)** — creation, ops, memory, CUDA
2. **[Autograd](https://github.com/fab2s/floDl/blob/main/docs/tutorials/02-autograd.md)** — variables, gradients, backward
3. **[Modules](https://github.com/fab2s/floDl/blob/main/docs/tutorials/03-modules.md)** — all layers, convolutions, RNNs, attention, normalization
4. **[Training](https://github.com/fab2s/floDl/blob/main/docs/tutorials/04-training.md)** — losses, optimizers, mixed precision, full loop
5. **[Graph Builder](https://github.com/fab2s/floDl/blob/main/docs/tutorials/05-graph-builder.md)** — fluent API from simple to complex
6. **[Advanced Graphs](https://github.com/fab2s/floDl/blob/main/docs/tutorials/06-advanced-graphs.md)** — forward refs, loops, gates, switches
7. **[Visualization](https://github.com/fab2s/floDl/blob/main/docs/tutorials/07-visualization.md)** — DOT/SVG, profiling heatmaps
8. **[Utilities](https://github.com/fab2s/floDl/blob/main/docs/tutorials/08-utilities.md)** — checkpoints, clipping, freezing, initialization, scheduling
9. **[Training Monitor](https://github.com/fab2s/floDl/blob/main/docs/tutorials/09-monitor.md)** — ETA, resource tracking, live dashboard
10. **[Graph Tree](https://github.com/fab2s/floDl/blob/main/docs/tutorials/10-graph-tree.md)** — hierarchical composition, freeze/thaw, subgraph checkpoints
### Examples
- [`quickstart`](https://github.com/fab2s/floDl/tree/main/flodl/examples/quickstart/) — build, train, and monitor a model with residual connections
- [`sine_wave`](https://github.com/fab2s/floDl/tree/main/flodl/examples/sine_wave/) — sine regression with monitor, checkpoint round-trip
- [`mixed_precision`](https://github.com/fab2s/floDl/tree/main/flodl/examples/mixed_precision/) — float16 training with `GradScaler`
- [`transfer_learning`](https://github.com/fab2s/floDl/tree/main/flodl/examples/transfer_learning/) — checkpoint, partial load, freeze, fine-tune
- [`schedulers`](https://github.com/fab2s/floDl/tree/main/flodl/examples/schedulers/) — warmup + cosine + plateau composition
- [`observation`](https://github.com/fab2s/floDl/tree/main/flodl/examples/observation/) — collect, flush, trend queries, early stopping
- [`showcase`](https://github.com/fab2s/floDl/tree/main/flodl/examples/showcase/) — every graph builder method in one graph
### Architecture
```
+-----------------------------------------------------------+
| monitor/ ETA, resource tracking, live web dashboard |
+-----------------------------------------------------------+
| graph/ Fluent builder, graph tree, execution, DOT/SVG |
+-----------------------------------------------------------+
| nn/ Modules, losses, optimizers, checkpoints |
+-----------------------------------------------------------+
| autograd/ Reverse-mode AD, gradient tracking |
+-----------------------------------------------------------+
| tensor/ Owned tensors with Drop, CPU + CUDA |
+-----------------------------------------------------------+
| flodl-sys FFI bindings to libtorch C++ shim |
+-----------------------------------------------------------+
| libtorch / CUDA / CPU |
+-----------------------------------------------------------+
```
## Story
floDl started as a question: what would a deep learning framework look like
if you designed it around Rust's ownership model instead of fighting a garbage
collector?
An [earlier attempt in Go](https://github.com/fab2s/goDl) proved the
architecture — the graph builder, the module system, the observation engine —
but hit a wall: Go's GC cannot manage GPU memory deterministically. That
required building five layers of memory management infrastructure on top of
the language, not with it.
Rust solved this at the language level. `impl Drop for Tensor` replaced
hundreds of lines of lifecycle management. The graph builder, module
composition, and design philosophy carried forward; the memory fights didn't.
## License
floDl is open-sourced software licensed under the [MIT license](https://github.com/fab2s/floDl/blob/main/LICENSE).