If You Know PyTorch, You Know floDl
let model = from(/* input layer */)
    .through(/* hidden layer */)
    .through(/* hidden layer */)
    .through(/* output layer */)
    .build()?;
let pred = model.forward(/* input batch */)?;
let loss = mse_loss(/* prediction, target */)?;
loss.backward()?;
optimizer.step()?;
Same concepts, same names, same GPU kernels underneath. The ? operator
replaces silent failures with compile-time error handling. Drop replaces the
garbage collector. The full migration guide covers
every op, module, and pattern.
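A minimal, framework-agnostic sketch of what that means in practice: a shape mismatch becomes a Result the caller must handle, and the ? operator propagates it instead of letting the computation continue silently. The toy dot function below stands in for a tensor op; it is not floDl's API.

```rust
// Toy stand-in for a tensor op: returns an error on shape mismatch instead of
// broadcasting silently or panicking later.
fn dot(a: &[f32], b: &[f32]) -> Result<f32, String> {
    if a.len() != b.len() {
        return Err(format!("shape mismatch: {} vs {}", a.len(), b.len()));
    }
    Ok(a.iter().zip(b).map(|(x, y)| x * y).sum())
}

fn main() -> Result<(), String> {
    let pred: Vec<f32> = vec![0.1, 0.7, 0.2];
    let target: Vec<f32> = vec![0.0, 1.0, 0.0];
    let score = dot(&pred, &target)?; // `?` propagates the error to the caller
    println!("score = {score}");
    Ok(())
}
```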
New to Rust? Read Rust for PyTorch Users — 10 patterns in 15 minutes.
Getting Started
With Docker, no Rust or libtorch is needed.
Without Docker — Rust 1.85+ and libtorch; setup auto-detects CPU or CUDA.
For CUDA: cargo add flodl --features cuda, plus the CUDA toolkit.
Both paths generate an annotated training template. Edit src/main.rs to
build your model:
use flodl::prelude::*; // import path assumed; the generated template has the exact one
let model = from(/* input layer */)
    .through(/* hidden layer */)
    .through(/* activation */)
    .also(/* residual block */) // residual connection
    .through(/* output layer */)
    .build()?;
let params = model.parameters();
let mut optimizer = Adam::new(params, /* learning rate */); // optimizer type and signature assumed
model.train();
for batch in &batches {
    /* forward, loss, backward, optimizer step */
}
The Graph Builder
floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no nn.Module subclassing.
let model = from(/* input layer */)
    .through(/* activation */)
    .through(/* normalization */)
    .also(/* residual block */) // residual connection
    .through(/* output projection */)
    .build()?;
build() returns a Graph that implements Module — you can nest it
inside other graphs. Things get interesting when architectures get complex:
let g = from(/* first module */).tag(/* name */)
    .split(/* modules![...] */).merge(/* Add or Mean */)
    .loop_body(/* body */).for_n(/* n */).tag(/* name */)
    .gate(/* router */, /* modules![...] */).using(/* refs */)
    .switch(/* selector */, /* modules![...] */).using(/* refs */)
    .through(/* module */).using(/* refs */).tag(/* name */)
    .loop_body(/* body */).while_cond(/* condition */)
    .through(/* output module */)
    .build()?;
Every construct — split/merge, also, loop_body, gate, switch, map,
tag/using — composes cleanly. Forward references (using before tag) carry
state across calls, enabling recurrent architectures without special-casing.
| Method | What it does |
|---|---|
| from(m).through(m) | Linear chain |
| also(m) | Residual: input + m(input) |
| fork(m) | Side branch: capture output as tag, stream continues |
| split(modules![...]).merge(op) | Parallel branches, merged by Add or Mean |
| tag(name) / using(refs) | Named references — backward or forward (across calls) |
| loop_body(body).for_n(n) | Fixed iteration with BPTT |
| loop_body(body).while_cond / until_cond | Conditional loops |
| gate(router, modules![...]) | Soft routing — weighted combination |
| switch(selector, modules![...]) | Hard routing — only selected branch |
| map(body).each() / .over(tag) / .slices(n) | Element-wise, tagged, or sliced iteration |
| input(names) | Auxiliary graph inputs for multi-input architectures |
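For a feel of how these compose, here is a hedged sketch assembled from the signatures above. The layer constructors (Linear::new, ReLU) and the Add merge value are illustrative assumptions rather than floDl's exact types; the Graph Builder Tutorial has the real signatures.

```rust
// Illustrative only: constructor names and arguments are assumptions.
let left = Linear::new(64, 128);
let right = Linear::new(64, 128);

let model = from(Linear::new(16, 64))   // stem
    .through(ReLU)                      // activation
    .tag("hidden")                      // name this point for later using() refs
    .split(modules![left, right])       // two parallel branches
    .merge(Add)                         // merged elementwise (Add or Mean)
    .also(Linear::new(128, 128))        // residual: x + m(x)
    .through(Linear::new(128, 10))      // output projection
    .build()?;
```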
See the Graph Builder Tutorial and the full showcase.
Graph Tree: Hierarchical Composition
This is where floDl goes beyond PyTorch. Graphs nest inside graphs with label-path addressing — dot-separated paths that let you reach into any subgraph from the root. Train components independently, compose them into larger architectures, and control training phases declaratively.
// Build components independently
let scan = from(/* layers */).tag(/* observation point */)
    .label("scan").build()?;
let read = from(/* layers */).tag(/* observation point */)
    .label("read").build()?;
let encoder = from(scan)
    .through(read)
    .label("encoder").build()?;

// Compose into full model
let model = from(encoder)
    .through(/* classifier head */)
    .build()?;
Dotted paths reach anywhere
Every tag and subgraph is addressable through dotted paths from the root:
model.validate_path(/* "encoder" */)?;            // -> Subgraph
model.validate_path(/* "encoder.scan.<tag>" */)?; // -> Tag (three levels deep)
model.validate_path(/* "<label>.<tag>" */)?;      // -> Tag
Declarative training phases
Freeze and thaw entire subtrees by path — no manual parameter iteration:
// Phase 1: train only the classifier, encoder is frozen
model.freeze(/* "encoder" */)?;
let fresh_params = model.parameters(); // only unfrozen params
let mut opt = Adam::new(fresh_params, /* learning rate */); // optimizer type and signature assumed
// ... train ...

// Phase 2: thaw scan, keep read frozen (it's proven)
model.thaw(/* "encoder.scan" */)?;
let mut opt = Adam::with_groups() // parameter-group builder; optimizer type assumed
    .group(/* encoder.scan params, low LR */) // low LR
    .group(/* remaining params, base LR */)
    .build();
Subgraph checkpoints
Train a component standalone, save it, load it into a larger model:
// Pre-trained encoder saved earlier
encoder.save_checkpoint?;
// Load into the composed model — namespace + hash validated
model.load_subgraph_checkpoint?;
model.freeze?; // lock what's proven
Cross-boundary observation
Metrics flow up through the tree automatically:
model.record_at(/* "encoder.scan.loss", value */)?;
model.record_at(/* "encoder.read.accuracy", value */)?;
model.record_scalar(/* "total_loss", value */)?;
model.flush(); // single call flushes the entire tree

// Trends across boundaries — drive training decisions
if model.trend_at(/* "encoder.scan.loss" */)?.stalled(/* n, tol */) {
    /* e.g. thaw a component or stop early */
}

// Monitor sees all metrics with dotted names automatically
monitor.log(/* epoch metrics */);
// -> total_loss, encoder.scan.loss, encoder.read.accuracy
This is progressive model composition: each component is trained and validated independently before becoming a building block in a larger architecture. Checkpoints, metrics, and training phases compose just like the graphs themselves.
See the full Graph Tree Tutorial.
The Training Experience
Training Monitor
Drop-in monitor with adaptive ETA, resource tracking, and a live web dashboard — no external dependencies, no separate process.
use flodl::monitor::Monitor; // import path assumed
let mut monitor = Monitor::new(/* total epochs */);
monitor.serve(/* port */)?; // optional: live dashboard at http://localhost:3000
for epoch in 0..num_epochs {
    /* train one epoch, then report it to the monitor */
}
monitor.finish();
epoch 1/100 loss=1.5264 [49ms ETA 4.8s]
epoch 10/100 loss=0.3817 [25ms ETA 2.2s] VRAM: 2.1/6.0 GB (82%)
epoch 50/100 loss=0.0023 [24ms ETA 1.2s] VRAM: 2.1/6.0 GB (82%)
epoch 100/100 loss=0.0012 [23ms] VRAM: 2.1/6.0 GB (82%)
training complete in 2.8s | loss: 0.0012
The live dashboard updates via Server-Sent Events (no WebSocket, no npm), tracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and all past epochs backfill instantly.
monitor.save_html(/* path */); // self-contained archive
monitor.export_csv(/* path */)?; // for external analysis
Observation and Trend Queries
Tags double as observation points. Collect metrics during training and use trend queries to make programmatic training decisions:
for epoch in 0..num_epochs { /* collect batch metrics, flush at epoch end, query trends */ }
| Method | What it does |
|---|---|
| g.collect(tags) / g.flush(tags) | Batch -> epoch metric aggregation |
| g.record_scalar(tag, value) | Inject external metrics (loss, accuracy) |
| g.trend(tag).slope(n) | OLS slope over last n epochs |
| g.trend(tag).stalled(n, tol) | Is \|slope\| below tolerance? |
| g.trend(tag).improving(n) | Is loss decreasing? |
| g.trend(tag).converged(n, tol) | Is variance below tolerance? |
| g.trends(tags).all_improving(n) | Group queries across branches |
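Put together, an early-stopping check driven by these queries might look like the sketch below. The tag name, window sizes, and tolerances are illustrative, and error handling is simplified where return types are not shown in the table.

```rust
// Illustrative early-stopping loop built from the trend-query methods above.
for epoch in 0..num_epochs {
    // ... run this epoch's batches; tagged points accumulate metrics ...
    g.flush(&["loss"]); // fold the epoch's batch metrics into epoch-level metrics

    if epoch >= 10 {
        let t = g.trend("loss");
        // Stop once the loss curve has flattened (|slope| under tolerance)
        // or its variance has collapsed over the last five epochs.
        if t.stalled(5, 1e-4) || t.converged(5, 1e-6) {
            println!("loss plateaued, stopping at epoch {epoch}");
            break;
        }
    }
}
```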
Visualization
let svg = g.svg()?; // architecture diagram
g.svg_with_profile(/* output path */)?; // timing heatmap
g.plot_html(/* output path */)?; // interactive curves
See the Training Monitor Tutorial and the Observation example.
PyTorch Parity
floDl covers the modules, losses, and optimizers you actually use:
| Category | Count | Highlights |
|---|---|---|
| NN Modules | 30+ | Linear, Conv1d/2d/3d + transpose, GRU/LSTM, MultiheadAttention, Bilinear, all norms (Layer/RMS/Group/Batch/Instance), all pooling, Embedding/EmbeddingBag, PixelShuffle, Upsample, Unfold/Fold |
| Activations | 17 | ReLU, LeakyReLU, ELU, GELU, SiLU, Mish, SELU, Softplus, Hardswish, PReLU, Softmax, ... |
| Losses | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |
| Optimizers | 7 | SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam — all with parameter groups |
| Schedulers | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |
| Init | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |
| Tensor Ops | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |
| Autograd | 90+ | Differentiable backward for every op above |
Fused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient
clipping via foreach ops. Mixed precision with AutocastGuard + GradScaler.
CUDA Graphs for replay-based training.
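For context on what GradScaler-style dynamic loss scaling does, here is a framework-agnostic sketch of the bookkeeping. It shows the general technique, not floDl's GradScaler API: the loss is multiplied by a scale factor before backward so small fp16 gradients do not underflow, gradients are unscaled before the optimizer step, the scale is halved and the step skipped when overflow appears, and it grows again after a long run of clean steps.

```rust
// Conceptual sketch of dynamic loss scaling, not floDl's GradScaler API.
struct LossScaler {
    scale: f32,           // multiply the loss by this before backward
    growth_interval: u32, // clean steps required before raising the scale
    clean_steps: u32,
}

impl LossScaler {
    fn new() -> Self {
        Self { scale: 65536.0, growth_interval: 2000, clean_steps: 0 }
    }

    /// `grads_finite`: were the unscaled gradients free of inf/NaN this step?
    /// Returns whether the optimizer step should be applied.
    fn update(&mut self, grads_finite: bool) -> bool {
        if grads_finite {
            self.clean_steps += 1;
            if self.clean_steps >= self.growth_interval {
                self.scale *= 2.0; // cautiously raise the scale
                self.clean_steps = 0;
            }
            true
        } else {
            self.scale *= 0.5; // overflow: halve the scale and skip this step
            self.clean_steps = 0;
            false
        }
    }
}

fn main() {
    let mut scaler = LossScaler::new();
    let apply_step = scaler.update(true); // pretend this step's grads were finite
    println!("scale = {}, apply step: {}", scaler.scale, apply_step);
}
```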
The full migration guide has side-by-side code for every op, module, and pattern.
Performance
Same CUDA kernels as PyTorch — the difference comes from what happens between kernel launches. Seven models, ten interleaved rounds, locked GPU clocks (RTX 5060 Ti, v0.1.3 vs PyTorch 2.6.0):
| Model | PyTorch | flodl | Delta | Py σ | Rs σ |
|---|---|---|---|---|---|
| mlp | 271.0 ms | 188.5 ms | -30% | ±10.1 | ±2.9 |
| convnet | 1189.4 ms | 1190.5 ms | +0% | ±2.7 | ±1.0 |
| gru_seq | 1015.3 ms | 949.7 ms | -6% | ±222.4 | ±10.8 |
| residual_tower | 371.3 ms | 278.6 ms | -25% | ±25.9 | ±3.6 |
| gated_routing | 222.6 ms | 196.9 ms | -12% | ±13.8 | ±2.6 |
| iterative_refine | 208.7 ms | 186.7 ms | -11% | ±27.2 | ±5.6 |
| feedback_fixed | 250.2 ms | 207.2 ms | -17% | ±27.3 | ±8.7 |
Wins 6 of 7 on speed, 3-20x tighter variance across every model. The convnet tie proves both frameworks dispatch identical CUDA kernels — the gap comes from Rust eliminating Python's per-op dispatch overhead.
Benchmark Report | Interactive dashboard
Why Rust for Deep Learning?
Deterministic memory. Python adds ~3-5 us of framework overhead per GPU
op. Go's GC can't manage VRAM — an earlier Go implementation
required 5 phases of lifecycle management (refcounting, GC callbacks, VRAM
budgets, pending-free queues). Rust replaces all of that with
impl Drop for Tensor. Memory is freed the instant a tensor leaves scope.
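A minimal sketch of that mechanism, with a toy struct standing in for a tensor's device allocation:

```rust
// Toy stand-in for a device allocation: freed the moment it leaves scope,
// with no garbage collector involved.
struct DeviceBuffer {
    bytes: usize, // pretend this tracks a VRAM allocation
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // In a real binding this would call the allocator's free routine.
        println!("freed {} bytes", self.bytes);
    }
}

fn main() {
    {
        let activations = DeviceBuffer { bytes: 4 * 1024 * 1024 };
        println!("using {} bytes", activations.bytes);
    } // `activations` is dropped here, at a known line, deterministically
    println!("scope closed; memory already returned");
}
```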
Zero-cost safety. Every op returns Result<T> — no silent failures.
Ownership ensures tensors are freed exactly once. The borrow checker
prevents data races at compile time.
Same GPU kernels. floDl binds libtorch — the C++ library under PyTorch. CUDA, cuBLAS, cuDNN are identical. floDl replaces the dispatch path, autograd tracking, and graph execution.
Features Reference
| Tool | What it does |
|---|---|
| clip_grad_norm / clip_grad_value | Fused gradient clipping (2 kernels total via foreach ops) |
| save_checkpoint / load_checkpoint | Named .fdl checkpoints, structural hash, partial loading, LoadReport |
| migrate_checkpoint | Remap parameter names across versions |
| Parameter::freeze / unfreeze | Per-parameter gradient control |
| GradScaler | Dynamic loss scaling for fp16 training |
| cast_parameters | Cast model parameters to any dtype |
| CpuWorker / ModelSnapshot | Background checkpoint saving |
| CudaGraph | Capture/replay training steps for fixed-shape models |
Beyond forward/parameters, Module provides optional methods the graph
recognizes automatically:
| Method | What happens |
|---|---|
| as_named_input() | using() refs arrive as a named map |
| reset() | Loops auto-call before iterating — clears per-forward state |
| detach_state() | Break gradient chains on retained state |
| sub_modules() | Recursive device placement, training mode, parameter collection |
# Optimize floDl in dev builds — your code stays fast to compile.
# Package names follow the flodl and flodl-sys crates.
[profile.dev.package.flodl]
opt-level = 3

[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
| Profile | flodl | Your code | Typical rebuild |
|---|---|---|---|
| cargo build | -O3 (cached) | -O0 (fast) | < 2s |
| cargo build --release | -O3 + LTO | -O3 + LTO | full link |
Numerical Verification
Every differentiable path is verified against finite-difference gradients:
- 117 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)
- 769 library tests, zero clippy warnings — all tests run on both CPU and CUDA
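For reference, a finite-difference check compares an analytic gradient against a central-difference estimate, (f(x + h) - f(x - h)) / 2h. A generic sketch of the idea (not floDl's actual test harness):

```rust
// Generic central-difference gradient check, shown on f(x) = x^2.
fn f(x: f64) -> f64 {
    x * x
}

fn analytic_grad(x: f64) -> f64 {
    2.0 * x // d/dx of x^2
}

fn numeric_grad(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    let x = 1.3;
    let h = 1e-5;
    let diff = (analytic_grad(x) - numeric_grad(f, x, h)).abs();
    assert!(diff < 1e-6, "gradient check failed: diff = {diff}");
    println!("gradient check passed (|diff| = {diff:.2e})");
}
```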
Hardware Compatibility
Developed and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell
(RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl
links libtorch's stable C API, which supports every architecture the driver
supports. If nvidia-smi works, floDl trains on it.
Documentation
Choose your path
| Background | Start here |
|---|---|
| New to Rust | Rust for PyTorch Users — 10 patterns in 15 minutes |
| Know Rust, new to DL | Tensors then Training |
| Know PyTorch | Migration Guide then Graph Builder |
| Just show me code | quickstart or showcase |
Tutorials
- Rust for PyTorch Users — 10 Rust patterns in 15 minutes
- Tensors — creation, ops, memory, CUDA
- Autograd — variables, gradients, backward
- Modules — all layers, convolutions, RNNs, attention, normalization
- Training — losses, optimizers, mixed precision, full loop
- Graph Builder — fluent API from simple to complex
- Advanced Graphs — forward refs, loops, gates, switches
- Visualization — DOT/SVG, profiling heatmaps
- Utilities — checkpoints, clipping, freezing, initialization, scheduling
- Training Monitor — ETA, resource tracking, live dashboard
- Graph Tree — hierarchical composition, freeze/thaw, subgraph checkpoints
Examples
- quickstart — build, train, and monitor a model with residual connections
- sine_wave — sine regression with monitor, checkpoint round-trip
- mixed_precision — float16 training with GradScaler
- transfer_learning — checkpoint, partial load, freeze, fine-tune
- schedulers — warmup + cosine + plateau composition
- observation — collect, flush, trend queries, early stopping
- showcase — every graph builder method in one graph
Architecture
+-----------------------------------------------------------+
| User Code / Model Definitions |
+-----------------------------------------------------------+
| monitor/ ETA, resource tracking, live web dashboard |
+-----------------------------------------------------------+
| graph/ Fluent builder, graph tree, execution, DOT/SVG |
+-----------------------------------------------------------+
| nn/ Modules, losses, optimizers, checkpoints |
+-----------------------------------------------------------+
| autograd/ Reverse-mode AD, gradient tracking |
+-----------------------------------------------------------+
| tensor/ Owned tensors with Drop, CPU + CUDA |
+-----------------------------------------------------------+
| flodl-sys FFI bindings to libtorch C++ shim |
+-----------------------------------------------------------+
| libtorch / CUDA / CPU |
+-----------------------------------------------------------+
Story
floDl started as a question: what would a deep learning framework look like if you designed it around Rust's ownership model instead of fighting a garbage collector?
An earlier attempt in Go proved the architecture — the graph builder, the module system, the observation engine — but hit a wall: Go's GC cannot manage GPU memory deterministically. That required building five layers of memory management infrastructure on top of the language, not with it.
Rust solved this at the language level. impl Drop for Tensor replaced
hundreds of lines of lifecycle management. The graph builder, module
composition, and design philosophy carried forward; the memory fights didn't.
License
floDl is open-sourced software licensed under the MIT license.