If You Know PyTorch, You Know floDl
```rust
// Module types and layer sizes here are illustrative.
let model = from(Linear::new(784, 256)?)
    .through(ReLU)
    .through(LayerNorm::new(256)?)
    .through(Linear::new(256, 10)?)
    .build()?;

let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &y)?;
loss.backward()?;
optimizer.step()?;
```
Same concepts, same names, same GPU kernels underneath. The ? operator
replaces silent failures with compile-time error handling. Drop replaces the
garbage collector. The full migration guide covers
every op, module, and pattern.
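How `?` and Drop replace runtime error handling and garbage collection can be shown in miniature with toy types (a sketch, not the flodl API — `Tensor`, `mse_loss`, and `train_step` here are stand-ins):

```rust
// Toy tensor: owns its storage, freed deterministically via Drop.
struct Tensor {
    data: Vec<f32>,
}

impl Drop for Tensor {
    // Runs the moment the owner goes out of scope — no GC, no finalizer.
    fn drop(&mut self) {}
}

// A fallible op returns Result; a shape mismatch is an Err the caller
// must handle, instead of silently producing garbage.
fn mse_loss(pred: &Tensor, target: &Tensor) -> Result<f32, String> {
    if pred.data.len() != target.data.len() {
        return Err("shape mismatch".into());
    }
    let n = pred.data.len() as f32;
    let sum: f32 = pred
        .data
        .iter()
        .zip(&target.data)
        .map(|(p, t)| (p - t) * (p - t))
        .sum();
    Ok(sum / n)
}

fn train_step() -> Result<f32, String> {
    let pred = Tensor { data: vec![1.0, 2.0] };
    let target = Tensor { data: vec![1.0, 4.0] };
    let loss = mse_loss(&pred, &target)?; // `?` bubbles any Err up
    Ok(loss)
    // pred and target are dropped (freed) right here.
}

fn main() {
    assert_eq!(train_step(), Ok(2.0));
}
```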
Getting Started
Prerequisite: Docker (no Rust or libtorch needed on your machine — everything runs in containers).
Create a new project with one command.
This generates a complete project with Dockerfiles, Makefile, and an annotated
training template. Edit src/main.rs to build your model.
New to Rust? Read Rust for PyTorch Users — 10 patterns in 15 minutes.
The Graph Builder
floDl's fluent graph builder lets you describe complex architectures as readable data flow — no boilerplate, no graph construction commands.
```rust
// Layer sizes are illustrative.
let model = from(Linear::new(128, 256)?)
    .through(ReLU)                    // activation
    .through(LayerNorm::new(256)?)    // normalization
    .also(Linear::new(256, 256)?)     // residual connection
    .through(Linear::new(256, 10)?)   // output projection
    .build()?;
```
That's a trainable model. also adds the residual — input flows through the
Linear and gets added to its output. build() returns a Graph that
implements Module — you can nest it inside other graphs.
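What `also` computes can be stated in a few lines of plain Rust; `f` below is a hypothetical stand-in for the Linear module, not a flodl function:

```rust
// Stand-in for the Linear layer: any function of the input.
fn f(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * 2.0).collect()
}

// The residual pattern: output = input + f(input).
fn also(x: &[f32]) -> Vec<f32> {
    let fx = f(x);
    x.iter().zip(&fx).map(|(a, b)| a + b).collect()
}

fn main() {
    assert_eq!(also(&[1.0, 3.0]), vec![3.0, 9.0]);
}
```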
Things get interesting when architectures get complex:
```rust
// Module names here are placeholders; see the showcase for a real graph.
let g = from(encoder).tag("enc")
    .split(modules![head_a, head_b]).merge(Add)
    .loop_body(refine).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using("enc")
    .switch(selector, modules![path_a, path_b]).using("refined")
    .through(attention).using("enc").tag("ctx")
    .loop_body(step).while_cond(halt, 8)
    .through(decoder)
    .build()?;
```
Every construct — split/merge, also, loop_body, gate, switch, map,
tag/using — composes cleanly. Sub-graphs nest like any module. Forward
references (using before tag) carry state across calls, enabling recurrent
architectures without special-casing. Enough to express transformers,
mixture-of-experts, iterative refinement, attention with memory, or any
architecture you can draw as a data flow graph.
See the Graph Builder Tutorial and the full showcase that exercises every builder method.
Training Monitor
Drop-in training monitor with adaptive ETA, system resource tracking, and a live web dashboard — no external dependencies, no separate process.
```rust
use flodl::monitor::Monitor; // import path assumed

let mut monitor = Monitor::new(num_epochs); // constructor arguments illustrative
monitor.serve(3000)?; // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    // ... train one epoch, then log the epoch's metrics to the monitor ...
}
monitor.finish();
```
Terminal output adapts automatically — duration and ETA switch between hours, minutes, seconds, and milliseconds as needed:
```text
epoch 1/100 loss=1.5264 [49ms ETA 4.8s]
epoch 10/100 loss=0.3817 [25ms ETA 2.2s] VRAM: 2.1/6.0 GB (82%)
epoch 50/100 loss=0.0023 [24ms ETA 1.2s] VRAM: 2.1/6.0 GB (82%)
epoch 100/100 loss=0.0012 [23ms] VRAM: 2.1/6.0 GB (82%)
training complete in 2.8s | loss: 0.0012
```
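The adaptive formatting can be sketched in a few lines; the unit thresholds and the mean-based ETA below are illustrative, not the Monitor's actual implementation:

```rust
// Pick the largest unit that keeps the number readable.
fn format_duration(ms: f64) -> String {
    if ms < 1000.0 {
        format!("{:.0}ms", ms)
    } else if ms < 60_000.0 {
        format!("{:.1}s", ms / 1000.0)
    } else if ms < 3_600_000.0 {
        format!("{:.1}m", ms / 60_000.0)
    } else {
        format!("{:.1}h", ms / 3_600_000.0)
    }
}

// Naive ETA: mean epoch duration times remaining epochs.
fn eta_ms(epoch_times_ms: &[f64], epochs_left: usize) -> f64 {
    let mean = epoch_times_ms.iter().sum::<f64>() / epoch_times_ms.len() as f64;
    mean * epochs_left as f64
}

fn main() {
    assert_eq!(format_duration(49.0), "49ms");
    assert_eq!(format_duration(4800.0), "4.8s");
    assert_eq!(format_duration(90_000.0), "1.5m");
    assert_eq!(format_duration(eta_ms(&[25.0, 24.0, 23.0], 200)), "4.8s");
}
```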
Live dashboard
Call monitor.serve(port) and open the URL in a browser. The page updates
in real time via Server-Sent Events — no polling, no WebSocket, no npm.
The dashboard includes:
| Panel | What it shows |
|---|---|
| Header | Epoch counter, progress bar, ETA, elapsed time |
| Metrics chart | All logged metrics (loss, lr, ...) as live canvas chart |
| Resource chart | CPU%, GPU%, RAM%, VRAM% over time |
| Resource bars | Current usage with values (e.g., VRAM: 2.1/6.0 GB) |
| Epoch log | Every epoch, newest first, with duration and resources |
| Graph SVG | Collapsible architecture diagram (via monitor.watch(&model)) |
Late join works — open the dashboard mid-training and it backfills all past epochs instantly.
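Server-Sent Events are plain HTTP: each message is a `data:` line terminated by a blank line, which is what lets the browser's EventSource stream updates without a polling loop. A minimal sketch of the wire format (the JSON payload shape is assumed, not the dashboard's actual schema):

```rust
// Frame one SSE message: "data: <payload>\n\n".
// The trailing blank line marks the end of the event.
fn sse_event(json_payload: &str) -> String {
    format!("data: {}\n\n", json_payload)
}

fn main() {
    let msg = sse_event(r#"{"epoch":10,"loss":0.3817}"#);
    assert_eq!(msg, "data: {\"epoch\":10,\"loss\":0.3817}\n\n");
    assert!(msg.ends_with("\n\n")); // blank line terminates the event
}
```

Late join falls out of the same framing: the server replays buffered events to a newly connected client.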
Resource tracking
| Metric | Source | Availability |
|---|---|---|
| CPU % | /proc/stat delta | Linux |
| RAM | /proc/meminfo | Linux |
| GPU utilization % | NVML (dynamic dlopen) | NVIDIA GPU + driver |
| VRAM used/total | cudaMemGetInfo via FFI | CUDA builds |
Resources that aren't available are silently omitted. CPU-only builds show CPU and RAM; CUDA builds add GPU and VRAM automatically.
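For illustration, CPU% from a /proc/stat delta works like this. The field order follows the kernel's documented `cpu user nice system idle iowait ...` layout; the code is a sketch, not floDl's monitor source:

```rust
// Compute CPU utilization between two samples of /proc/stat's "cpu" line.
// Utilization is the non-idle share of the elapsed tick delta.
fn cpu_percent(prev: &[u64], curr: &[u64]) -> f64 {
    let total = |s: &[u64]| s.iter().sum::<u64>();
    let idle = |s: &[u64]| s[3] + s.get(4).copied().unwrap_or(0); // idle + iowait
    let dt = (total(curr) - total(prev)) as f64;
    let di = (idle(curr) - idle(prev)) as f64;
    100.0 * (dt - di) / dt
}

fn main() {
    // Two synthetic samples: 100 ticks elapsed, 60 of them idle.
    let prev = [100, 0, 50, 800, 50];
    let curr = [130, 0, 60, 850, 60];
    assert_eq!(cpu_percent(&prev, &curr), 40.0);
}
```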
Export
```rust
// File paths are illustrative.
monitor.save_html("dashboard.html");  // self-contained dashboard archive
monitor.write_log("train.log")?;      // human-readable log
monitor.export_csv("metrics.csv")?;   // metrics + resources as CSV
```
save_html writes a complete dashboard at finish() — all metrics, resource
charts, and graph SVG baked into a single HTML file. Open it in any browser,
no server needed. Set it once before training and forget about it.
See the full Training Monitor Tutorial.
Quick Start
With Docker (recommended)
No Rust or libtorch needed — everything runs in containers.
Without Docker
Requirements: Rust 1.85+ and libtorch (C++/libtorch variant).
Set LIBTORCH_PATH to your libtorch directory and LD_LIBRARY_PATH to
include $LIBTORCH_PATH/lib. For CUDA, also set CUDA_HOME and enable
the feature: cargo add flodl --features cuda.
See libtorch downloads (pick the C++/libtorch variant) and CUDA toolkit if you need GPU support.
Develop floDl itself:
Train a model
```rust
use flodl::prelude::*; // glob import path assumed

// Build the model (layer sizes illustrative).
let model = from(Linear::new(1, 64)?)
    .through(ReLU)
    .through(Linear::new(64, 64)?)
    .also(Linear::new(64, 64)?)
    .through(Linear::new(64, 1)?)
    .build()?;

// Set up training.
let params = model.parameters();
let mut optimizer = Adam::new(params, 1e-3); // optimizer arguments illustrative
model.train();

// Training loop (body reconstructed from the ops shown above).
for (x, y) in &batches {
    let pred = model.forward(x)?;
    let loss = mse_loss(&pred, y)?;
    loss.backward()?;
    optimizer.step()?;
}
```
Features
Core Stack
| Layer | What it does |
|---|---|
| Tensor | Owned RAII tensors with Drop, Clone. CPU and CUDA. |
| Autograd | Reverse-mode automatic differentiation. Full backward for every op. |
| NN Modules | Linear, Conv2d, ConvTranspose2d, LayerNorm, BatchNorm/BatchNorm2d, Dropout, Dropout2d, Embedding, GRUCell, LSTMCell |
| Activations | Identity, ReLU, Sigmoid, Tanh, GELU, SiLU |
| Losses | mse_loss, cross_entropy_loss, bce_with_logits_loss, l1_loss, smooth_l1_loss, kl_div_loss |
| Optimizers | SGD (with momentum), Adam, AdamW — all support parameter groups for per-group LR |
| LR Scheduling | StepDecay, CosineScheduler, WarmupScheduler (composable), PlateauScheduler |
| Mixed Precision | Float16/BFloat16 dtype casting, GradScaler for loss scaling |
| Monitor | Human-readable ETA, CPU/GPU/RAM/VRAM tracking, live web dashboard |
Graph Builder
| Method | What it does |
|---|---|
| `from(m).through(m)` | Linear chain |
| `fork(m)` | Side branch: runs module, captures output as tag, stream continues unchanged |
| `input(names)` | Auxiliary graph inputs, accessible via `using(name)` — multi-input graphs |
| `split(modules![...]).merge(op)` | Parallel branches, merged by Add or Mean |
| `also(m)` | Residual connection: input + m(input) |
| `tag(name)` / `using(refs)` | Named references — backward (same pass) or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond(cond, max)` | Condition before body (0..max iterations) |
| `loop_body(body).until_cond(cond, max)` | Condition after body (1..max iterations) |
| `gate(router, modules![...])` | Soft routing — all experts execute, weighted combination |
| `switch(selector, modules![...])` | Hard routing — only selected branch executes |
| `map(body).each()` | Apply body to each element along dim 0 |
| `map(body).over(tag)` | Iterate over a tagged tensor |
| `map(body).slices(n)` | Decompose last dim into n slices, map, recompose |
| `.batched()` | Fast path for Map — full batch in one call |
| `tag_group(name)` | Name parallel branches: `split(...).tag_group("head")` |
Training Tools
| Tool | What it does |
|---|---|
| `clip_grad_norm` | L2 norm gradient clipping |
| `clip_grad_value` | Element-wise gradient clamping |
| `save_checkpoint` / `load_checkpoint` | Named .fdl checkpoint with partial loading, persists parameters + buffers, structural hash validation, LoadReport (file path or Write/Read) |
| `Parameter::freeze` / `unfreeze` | Disable/enable gradient tracking per parameter |
| `xavier_uniform`/`normal` | Weight initialization (also kaiming_* via nn::init) |
| LR schedulers | StepDecay, CosineScheduler, WarmupScheduler, PlateauScheduler (composable) |
| `GradScaler` | Dynamic loss scaling for mixed precision (float16) training |
| `cast_parameters` | Cast model parameters to any dtype |
| Background | CpuWorker (work queue), ModelSnapshot / snapshot_cpu() — offload checkpoints & eval to a background thread |
Module Traits
Beyond the core forward/parameters methods, Module provides optional
methods that the graph recognizes automatically:
| Method | Default | What happens |
|---|---|---|
| `as_named_input()` | None | Returns `&dyn NamedInputModule` — loop and node `using()` refs arrive as a named map |
| `reset()` | no-op | Loops auto-call before iterating — clears per-forward state |
| `detach_state()` | no-op | `graph.detach_state()` propagates — breaks gradient chains on retained state |
Stateful modules just override reset() and/or detach_state() directly —
no separate trait impls needed. Modules that own child modules implement
sub_modules() for recursive device placement, training mode, and parameter
collection.
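The shape of these defaults can be sketched with a toy trait (not flodl's actual `Module` definition): leaf modules get working no-ops for free, and `parameters()` recurses through `sub_modules()`:

```rust
// Toy module trait: default methods mirror the pattern described above.
trait Module {
    fn own_parameters(&self) -> Vec<f32>;

    // Leaves return no children; containers override this.
    fn sub_modules(&self) -> Vec<&dyn Module> {
        Vec::new()
    }

    // Default no-op — loops can call it on every module safely.
    fn reset(&self) {}

    // Recursive collection: own parameters, then each child's, in order.
    fn parameters(&self) -> Vec<f32> {
        let mut ps = self.own_parameters();
        for m in self.sub_modules() {
            ps.extend(m.parameters());
        }
        ps
    }
}

struct Linear { w: f32 }
impl Module for Linear {
    fn own_parameters(&self) -> Vec<f32> { vec![self.w] }
}

struct Block { a: Linear, b: Linear }
impl Module for Block {
    fn own_parameters(&self) -> Vec<f32> { Vec::new() }
    fn sub_modules(&self) -> Vec<&dyn Module> {
        vec![&self.a as &dyn Module, &self.b]
    }
}

fn main() {
    let block = Block { a: Linear { w: 1.0 }, b: Linear { w: 2.0 } };
    assert_eq!(block.parameters(), vec![1.0, 2.0]);
}
```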
Observation & Trends
Tags double as observation points — collect metrics during training, flush to epoch history, and query trends to drive training decisions:
```rust
// Sketch — collection points and method names follow the table below.
for epoch in 0..num_epochs {
    // ... per-batch forward/backward, with g.collect(&["loss"]) after each step ...
    g.flush(&["loss"]);          // fold batch metrics into epoch history
    if g.trend("loss").stalled {
        // adjust the learning rate, stop early, ...
    }
    g.end_epoch();
}
```
| Method | What it does |
|---|---|
| `g.tagged(tag)` | Access a tagged node's output after forward |
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric collection |
| `g.record_scalar(tag, value)` | Inject external metrics |
| `g.trend(tag)` | Epoch-level trend: slope, stalled, improving, converged |
| `g.trends(tags)` | Group trends: all_improving, any_stalled, mean_slope |
| `g.end_step()` / `g.end_epoch()` | Training housekeeping |
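One common way to define such a trend (assumed here for illustration, not necessarily flodl's exact formula) is the least-squares slope of the metric's epoch history, with "stalled" meaning a near-zero slope:

```rust
// Least-squares slope of y over epoch index 0, 1, 2, ...
fn slope(history: &[f64]) -> f64 {
    let n = history.len() as f64;
    let mean_x = (n - 1.0) / 2.0;
    let mean_y = history.iter().sum::<f64>() / n;
    let mut num = 0.0;
    let mut den = 0.0;
    for (i, y) in history.iter().enumerate() {
        let dx = i as f64 - mean_x;
        num += dx * (y - mean_y);
        den += dx * dx;
    }
    num / den
}

fn main() {
    assert_eq!(slope(&[4.0, 3.0, 2.0, 1.0]), -1.0); // loss falling: improving
    assert!(slope(&[0.5, 0.5, 0.5]).abs() < 1e-12); // flat: stalled
}
```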
Visualization
```rust
// Accessor names and file paths here are assumed.
println!("{}", g.dot()); // Graphviz DOT with parameter counts
let svg = g.svg()?;      // render to SVG

// Timing-annotated: nodes colored green->yellow->red by execution time.
g.enable_profiling();
g.forward(&x)?;
g.svg_with_profile()?;

// Training curves as self-contained HTML.
g.plot_html("curves.html")?;
g.export_trends("trends.csv")?;
```
Numerical Verification
Every differentiable path is verified against finite-difference gradients:
- 37 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW)
- 329 library tests, zero clippy warnings — all tests run on both CPU and CUDA
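The idea behind finite-difference verification, sketched on a scalar function: the analytic gradient must agree with a central difference to within the truncation error (here f(x) = x², df/dx = 2x):

```rust
fn f(x: f64) -> f64 { x * x }

// What autograd would produce for f.
fn analytic_grad(x: f64) -> f64 { 2.0 * x }

// Central difference: error is O(eps²) plus floating-point noise.
fn numeric_grad(x: f64) -> f64 {
    let eps = 1e-5;
    (f(x + eps) - f(x - eps)) / (2.0 * eps)
}

fn main() {
    let x = 3.0;
    let err = (analytic_grad(x) - numeric_grad(x)).abs();
    assert!(err < 1e-6); // the two gradients must agree closely
}
```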
Why Rust for Deep Learning?
The memory management problem
Python adds ~3-5 µs of framework overhead to every GPU operation. For architectures built on many small sequential operations — recurrent steps, iterative refinement, multi-head attention — this overhead dominates.
Go solves the dispatch overhead with compiled binaries and goroutines, but
Go's garbage collector cannot manage VRAM deterministically. GPU memory lives
in libtorch's C++ allocator — invisible to Go's GC. An earlier Go
implementation required a 5-phase memory management system: atomic refcounting,
saved-tensor lifecycle, GC callbacks, VRAM budgets, and autograd Scope.
Hundreds of lines of runtime.KeepAlive, Retain()/Release(), and
pending-free queues.
Rust's ownership model eliminates all of this. Tensor owns a C++ handle.
Drop frees it immediately when it goes out of scope. No GC, no finalizers,
no reference counting, no VRAM budget heuristics, no KeepAlive. Five phases
of memory management infrastructure replaced by a single impl Drop for Tensor.
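The pattern in miniature, with a counter standing in for the C++ free call:

```rust
use std::cell::Cell;
use std::rc::Rc;

struct Tensor {
    freed: Rc<Cell<usize>>, // stand-in for "did the native free run?"
}

impl Drop for Tensor {
    fn drop(&mut self) {
        // In floDl this would call into the C++ shim to free the handle.
        self.freed.set(self.freed.get() + 1);
    }
}

// Returns (frees observed inside the scope, frees observed after it).
fn scope_demo() -> (usize, usize) {
    let freed = Rc::new(Cell::new(0));
    let mid;
    {
        let _a = Tensor { freed: freed.clone() };
        let _b = Tensor { freed: freed.clone() };
        mid = freed.get(); // both tensors still alive here
    } // scope ends: both dropped here, deterministically
    (mid, freed.get())
}

fn main() {
    assert_eq!(scope_demo(), (0, 2));
}
```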
Zero-cost safety
Rust's type system catches errors at compile time that other languages defer to runtime:
- Ownership: tensors are freed exactly once, exactly when no longer needed
- Result types: every fallible operation returns `Result<T>` — no silent error propagation, no nil pointer panics
- No data races: the borrow checker prevents concurrent mutation bugs
Same GPU kernels
floDl binds libtorch — the same C++ library that powers PyTorch. The actual GPU math (CUDA kernels, cuBLAS, cuDNN) is identical. floDl replaces everything above: the dispatch path, autograd tracking, module composition, and graph execution.
Performance
floDl runs the same CUDA kernels as PyTorch — the performance difference comes from what happens between kernel launches: dispatch overhead, autograd bookkeeping, and memory management. Rust eliminates Python's per-op overhead and the GC pauses that plague Go.
Measured on a real training workload (FBRL letter recognition — recurrent attention with a 9-component loss stack), same model, same data, same GPU:
| Metric | PyTorch 2.5.1 | flodl | Delta |
|---|---|---|---|
| Avg epoch | 50.1s | 42.1s | -16% |
| GPU utilization | ~80% (spiky) | 88-92% (flat) | more stable |
| VRAM | 2,805 MB | 2,977 MB | +6%* |
* Static libtorch linkage + monitor thread + gzip checkpoint compression.
Full methodology, raw data, and reproduction commands: Benchmark Report | Raw artifacts (both sides, committed)
Build profiles
Add this to your project's Cargo.toml to get optimized floDl with fast
recompilation of your own code:
```toml
# Optimize floDl in dev builds — your code stays fast to compile.
# After the first build, only your graph code recompiles.
# (Package names assumed: the flodl crate and its -sys crate.)
[profile.dev.package.flodl]
opt-level = 3

[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
```
| Profile | flodl | Your code | Typical rebuild |
|---|---|---|---|
| `cargo build` | -O3 (cached) | -O0 (fast) | < 2s |
| `cargo build --release` | -O3 + LTO | -O3 + LTO | full link |
The GPU kernels (cuBLAS, cuDNN) run at the same speed regardless of Rust optimization level — the profile settings affect graph dispatch, autograd bookkeeping, and module overhead.
Hardware Compatibility
floDl is developed and tested on an NVIDIA GTX 1060 (6 GB VRAM, Pascal architecture). It works out of the box — no version pinning, no feature flags, no workarounds.
This matters because PyTorch dropped Pascal support after version 2.5.1.
Training on older GPUs now requires pinning torch==2.5.1 and hoping
nothing in your dependency tree pulls a newer version. floDl sidesteps
this entirely: it links against libtorch's stable C API, which continues
to support every CUDA architecture that the driver supports.
If your GPU runs nvidia-smi, floDl can train on it.
Architecture
```text
+-----------------------------------------------------------+
|             User Code / Model Definitions                 |
+-----------------------------------------------------------+
| monitor/   ETA, resource tracking, live web dashboard     |
+-----------------------------------------------------------+
| graph/     Fluent builder, execution, DOT/SVG             |
+-----------------------------------------------------------+
| nn/        Modules, losses, optimizers, checkpoints       |
+-----------------------------------------------------------+
| autograd/  Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
| tensor/    Owned tensors with Drop, CPU + CUDA            |
+-----------------------------------------------------------+
| flodl-sys  FFI bindings to libtorch C++ shim              |
+-----------------------------------------------------------+
|                 libtorch / CUDA / CPU                     |
+-----------------------------------------------------------+
```
floDl is developed and tested on NVIDIA CUDA (Pascal and newer) and CPU. Since floDl binds libtorch — not CUDA directly — additional backends (AMD ROCm, Apple MPS, Intel XPU) are architecturally possible but not yet exposed or tested. Contributions welcome — see CONTRIBUTING.md.
Documentation
Choose your path
| Background | Start here |
|---|---|
| New to Rust | Rust for PyTorch Users — 10 patterns in 15 minutes |
| Know Rust, new to DL | Tensors then Training |
| Know PyTorch | PyTorch Migration Guide then Graph Builder |
| Just show me code | quickstart or showcase |
Tutorials
Step-by-step guides from basics to advanced, each with code examples:
- Rust for PyTorch Users — 10 Rust patterns in 15 minutes (new to Rust? start here)
- Tensors — creation, ops, error handling, memory
- Autograd — variables, gradients, backward pass
- Modules — Linear, Conv2d, normalization, RNN cells
- Training — losses, optimizers, full training loop
- Graph Builder — the fluent API from simple to complex
- Advanced Graphs — forward refs, loops, gates, switches
- Visualization — DOT/SVG output, reading diagrams
- Utilities — checkpoints, clipping, freezing, initialization
- Training Monitor — ETA, resource tracking, live web dashboard
Design
- Benchmark — flodl vs PyTorch head-to-head with raw data
- Roadmap — development plan and port status
- Trajectory Thesis — geometric intuition behind the project
Examples
- quickstart — build, train, and monitor a model with residual connections
- sine_wave — sine regression with monitor, checkpoint round-trip
- mixed_precision — float16 training with GradScaler
- transfer_learning — checkpoint, partial load, freeze, fine-tune
- schedulers — warmup + cosine + plateau composition
- observation — collect, flush, trend queries, early stopping
- showcase — every graph builder method in one graph
Story
floDl started as a question: what would a deep learning framework look like if you designed it around Rust's ownership model instead of fighting a garbage collector?
An earlier attempt in Go proved the architecture — the graph builder, the module system, the observation engine — but hit a wall: Go's GC cannot manage GPU memory deterministically. That required building five layers of memory management infrastructure on top of the language, not with it.
Rust solved this at the language level. impl Drop for Tensor replaced
hundreds of lines of lifecycle management. The graph builder, module
composition, and design philosophy carried forward; the memory fights didn't.
License
floDl is open-sourced software licensed under the MIT license.