<p align="center">
  <img src="docs/floDl.png" alt="floDl" width="640">
</p>

<h1 align="center">floDl</h1>

<p align="center">
A Rust-native deep learning framework built on libtorch.<br>
Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.
</p>

<p align="center">
  <a href="https://github.com/fab2s/floDl/actions"><img src="https://github.com/fab2s/floDl/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://crates.io/crates/flodl"><img src="https://img.shields.io/crates/v/flodl.svg" alt="crates.io"></a>
  <a href="https://docs.rs/flodl"><img src="https://docs.rs/flodl/badge.svg" alt="docs.rs"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT License"></a>
</p>

<p align="center">
  <a href="#getting-started">Getting Started</a> &bull;
  <a href="#the-graph-builder">Graph Builder</a> &bull;
  <a href="#training-monitor">Training Monitor</a> &bull;
  <a href="#features">Features</a> &bull;
  <a href="docs/tutorials/01-tensors.md">Tutorials</a> &bull;
  <a href="docs/pytorch_migration.md">PyTorch Migration</a> &bull;
  <a href="docs/troubleshooting.md">Troubleshooting</a> &bull;
  <a href="#architecture">Architecture</a>
</p>

---

## If You Know PyTorch, You Know floDl

<table>
<tr><th>PyTorch</th><th>floDl</th></tr>
<tr><td>

```python
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)

pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

</td><td>

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;
```

</td></tr>
</table>

Same concepts, same names, same GPU kernels underneath. Every fallible call
returns a `Result`, so the compiler forces errors to be handled, and the `?`
operator propagates them explicitly instead of letting them fail silently.
`Drop` replaces the garbage collector. The
[full migration guide](docs/pytorch_migration.md) covers every op, module, and
pattern.
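
To make the role of `?` concrete, here is a minimal standalone sketch that uses only the standard library. `ShapeError`, `linear`, and `forward` are hypothetical names invented for this illustration and are not part of floDl's API; the point is only that a shape mismatch surfaces as an `Err` value that the caller must handle.

```rust
// Standalone sketch: none of these names come from floDl.
#[derive(Debug, PartialEq)]
enum ShapeError {
    Mismatch { expected: usize, got: usize },
}

// A toy dense layer: each row of `weights` produces one output value.
fn linear(input: &[f32], weights: &[Vec<f32>]) -> Result<Vec<f32>, ShapeError> {
    let mut out = Vec::with_capacity(weights.len());
    for row in weights {
        if row.len() != input.len() {
            return Err(ShapeError::Mismatch {
                expected: row.len(),
                got: input.len(),
            });
        }
        let dot: f32 = row.iter().zip(input).map(|(w, x)| w * x).sum();
        out.push(dot);
    }
    Ok(out)
}

// `?` returns early with the error, otherwise unwraps the value.
fn forward(input: &[f32]) -> Result<Vec<f32>, ShapeError> {
    let w1 = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]]; // 2 -> 3
    let w2 = vec![vec![1.0, 1.0, 1.0]]; // 3 -> 1
    let hidden = linear(input, &w1)?;
    let out = linear(&hidden, &w2)?;
    Ok(out)
}

fn main() {
    println!("{:?}", forward(&[1.0, 2.0])); // Ok([6.0])
    println!("{:?}", forward(&[1.0])); // Err(Mismatch { expected: 2, got: 1 })
}
```

Ignoring the returned `Result` triggers a compiler warning, and using the value requires unwrapping it first, so the failure path cannot be skipped by accident.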

## Getting Started

**Prerequisite:** [Docker](https://docs.docker.com/get-docker/) (no Rust or
libtorch needed on your machine — everything runs in containers).

Create a new project with one command:

```bash
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project
make build    # first build (~5 min, downloads libtorch)
make run      # train the template model
```

This generates a complete project with Dockerfiles, Makefile, and an annotated
training template. Edit `src/main.rs` to build your model.

> **New to Rust?** Read [Rust for PyTorch Users](docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes.

## The Graph Builder

floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no graph construction commands.

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                        // activation
    .through(LayerNorm::new(16)?)         // normalization
    .also(Linear::new(16, 16)?)           // residual connection
    .through(Linear::new(16, 2)?)         // output projection
    .build()?;
```

That's a trainable model. `also` adds the residual — input flows through the
Linear *and* gets added to its output. `build()` returns a `Graph` that
implements `Module` — you can nest it inside other graphs.
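
The arithmetic behind `also` can be sketched without floDl at all. The helper below is a hypothetical stand-in that works on plain `Vec<f32>` values; it only illustrates the `input + m(input)` rule and assumes the branch preserves the input shape.

```rust
// Standalone sketch of a residual connection: output = input + module(input).
// `also` here is an illustrative free function, not floDl's builder method.
fn also<F>(input: &[f32], module: F) -> Vec<f32>
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let branch = module(input);
    assert_eq!(branch.len(), input.len(), "residual branch must preserve shape");
    input.iter().zip(&branch).map(|(x, y)| x + y).collect()
}

fn main() {
    // A toy "module" that doubles every element.
    let double = |x: &[f32]| x.iter().map(|v| v * 2.0).collect::<Vec<f32>>();
    let out = also(&[1.0, 2.0, 3.0], double);
    println!("{:?}", out); // [3.0, 6.0, 9.0]
}
```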

Things get interesting when architectures get complex:

```rust
let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;
```

Every construct — `split/merge`, `also`, `loop_body`, `gate`, `switch`, `map`,
`tag/using` — composes cleanly. Sub-graphs nest like any module. Forward
references (`using` before `tag`) carry state across calls, enabling recurrent
architectures without special-casing. Enough to express transformers,
mixture-of-experts, iterative refinement, attention with memory, or any
architecture you can draw as a data flow graph.

See the **[Graph Builder Tutorial](docs/tutorials/05-graph-builder.md)** and
the [full showcase](flodl/examples/showcase/) that exercises every builder
method.

## Training Monitor

Drop-in training monitor with adaptive ETA, system resource tracking, and a
live web dashboard — no external dependencies, no separate process.

```rust
use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?;  // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...

    monitor.log(epoch, t.elapsed(), &[("loss", loss_val), ("lr", lr)]);
}
monitor.finish();
```

Terminal output adapts automatically — duration and ETA switch between hours,
minutes, seconds, and milliseconds as needed:

```
  epoch   1/100  loss=1.5264  [49ms  ETA 4.8s]
  epoch  10/100  loss=0.3817  [25ms  ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch  50/100  loss=0.0023  [24ms  ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)
  epoch 100/100  loss=0.0012  [23ms]             VRAM: 2.1/6.0 GB (82%)
  training complete in 2.8s  | loss: 0.0012
```
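
The unit switching shown above could be implemented along these lines. This is a sketch under assumptions: the thresholds, the `2m 05s` style for longer durations, and the naive ETA (mean epoch time multiplied by remaining epochs) are illustrative choices, not the monitor's actual rules.

```rust
use std::time::Duration;

// Pick the coarsest unit that keeps the number readable.
// Thresholds and output style are assumptions made for this sketch.
fn format_duration(d: Duration) -> String {
    let secs = d.as_secs_f64();
    if secs < 1.0 {
        format!("{}ms", d.as_millis())
    } else if secs < 60.0 {
        format!("{:.1}s", secs)
    } else if secs < 3600.0 {
        let total = d.as_secs();
        format!("{}m {:02}s", total / 60, total % 60)
    } else {
        let total = d.as_secs();
        format!("{}h {:02}m", total / 3600, (total % 3600) / 60)
    }
}

// Naive ETA: mean epoch duration times the number of epochs left.
fn eta(mean_epoch: Duration, remaining_epochs: u32) -> Duration {
    mean_epoch * remaining_epochs
}

fn main() {
    println!("{}", format_duration(Duration::from_millis(49)));
    println!("{}", format_duration(eta(Duration::from_millis(24), 50)));
    println!("{}", format_duration(Duration::from_secs(3725)));
}
```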

### Live dashboard

Call `monitor.serve(port)` and open the URL in a browser. The page updates
in real time via Server-Sent Events — no polling, no WebSocket, no npm.

<p align="center">
  <a href="https://flodl.dev/benchmark">
    <img src="docs/dashboard.gif" alt="floDl live training dashboard — click for interactive version" width="800">
  </a>
</p>
<p align="center"><em><a href="https://flodl.dev/benchmark">Interactive benchmark dashboard</a> — real data from a 100-epoch training run</em></p>

The dashboard includes:

| Panel | What it shows |
|-------|--------------|
| **Header** | Epoch counter, progress bar, ETA, elapsed time |
| **Metrics chart** | All logged metrics (loss, lr, ...) as live canvas chart |
| **Resource chart** | CPU%, GPU%, RAM%, VRAM% over time |
| **Resource bars** | Current usage with values (e.g., `VRAM: 2.1/6.0 GB`) |
| **Epoch log** | Every epoch, newest first, with duration and resources |
| **Graph SVG** | Collapsible architecture diagram (via `monitor.watch(&model)`) |

Late join works — open the dashboard mid-training and it backfills all
past epochs instantly.

### Resource tracking

| Metric | Source | Availability |
|--------|--------|-------------|
| CPU % | `/proc/stat` delta | Linux |
| RAM | `/proc/meminfo` | Linux |
| GPU utilization % | NVML (dynamic `dlopen`) | NVIDIA GPU + driver |
| VRAM used/total | `cudaMemGetInfo` via FFI | CUDA builds |

Resources that aren't available are silently omitted. CPU-only builds show
CPU and RAM; CUDA builds add GPU and VRAM automatically.
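
For readers curious about the `/proc/stat` delta technique, the sketch below shows the usual calculation: take two snapshots of the aggregate `cpu` line and compare how much of the elapsed time was idle. The struct and function names are hypothetical, and the sample lines are made up so the example also runs on non-Linux machines; floDl's own implementation may differ.

```rust
// One snapshot of the aggregate "cpu" line from /proc/stat.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CpuSample {
    idle: u64,  // idle + iowait jiffies
    total: u64, // sum of the first eight counters
}

fn parse_cpu_line(line: &str) -> Option<CpuSample> {
    let mut fields = line.split_whitespace();
    if fields.next()? != "cpu" {
        return None;
    }
    // Field order: user nice system idle iowait irq softirq steal.
    // The guest counters that follow are already included in user/nice.
    let values: Vec<u64> = fields
        .take(8)
        .map(|f| f.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    if values.len() < 5 {
        return None;
    }
    Some(CpuSample {
        idle: values[3] + values[4],
        total: values.iter().sum(),
    })
}

// Busy percentage between two snapshots; None if no time has passed.
fn cpu_percent(prev: CpuSample, cur: CpuSample) -> Option<f64> {
    let total = cur.total.checked_sub(prev.total)?;
    let idle = cur.idle.checked_sub(prev.idle)?;
    if total == 0 {
        return None;
    }
    Some(100.0 * total.saturating_sub(idle) as f64 / total as f64)
}

fn main() {
    // Made-up snapshots; on Linux these would be read from /proc/stat.
    let prev = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0").unwrap();
    let cur = parse_cpu_line("cpu  160 0 90 900 50 0 0 0 0 0").unwrap();
    println!("{:?}", cpu_percent(prev, cur)); // Some(50.0)
}
```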

### Export

```rust
monitor.save_html("training_report.html");  // self-contained dashboard archive
monitor.write_log("training.log")?;          // human-readable log
monitor.export_csv("training.csv")?;         // metrics + resources as CSV
```

`save_html` writes a complete dashboard at `finish()` — all metrics, resource
charts, and graph SVG baked into a single HTML file. Open it in any browser,
no server needed. Set it once before training and forget about it.

See the full **[Training Monitor Tutorial](docs/tutorials/09-monitor.md)**.

## Quick Start

### With Docker (recommended)

No Rust or libtorch needed — everything runs in containers:

```bash
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project && make run
```

### Without Docker

**Requirements:** Rust 1.85+ and [libtorch](https://pytorch.org/get-started/locally/)
(C++/libtorch variant).

```bash
cargo add flodl
```

Set `LIBTORCH_PATH` to your libtorch directory and `LD_LIBRARY_PATH` to
include `$LIBTORCH_PATH/lib`. For CUDA, also set `CUDA_HOME` and enable
the feature: `cargo add flodl --features cuda`.

See [libtorch downloads](https://pytorch.org/get-started/locally/) (pick the
C++/libtorch variant) and [CUDA toolkit](https://developer.nvidia.com/cuda-downloads)
if you need GPU support.

**Develop floDl itself:**
```bash
git clone https://github.com/fab2s/floDl.git
cd floDl
make image      # build dev container (Rust + libtorch)
make test       # run all tests (CPU)
make cuda-test  # run all tests on CUDA (requires NVIDIA GPU)
make test-all   # CPU first, then CUDA if a GPU is available
make clippy     # lint
make shell      # interactive shell in container
```

### Train a model

```rust
use flodl::*;

// Build the model.
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?)
    .through(Linear::new(16, 2)?)
    .build()?;

// Set up training.
let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);
model.train();

// Training loop.
for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);

    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;

    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}
```

## Features

### Core Stack

| Layer | What it does |
|-------|-------------|
| **Tensor** | Owned RAII tensors with `Drop`, `Clone`. CPU and CUDA. |
| **Autograd** | Reverse-mode automatic differentiation. Full backward for every op. |
| **NN Modules** | `Linear`, `Conv2d`, `ConvTranspose2d`, `LayerNorm`, `BatchNorm`/`BatchNorm2d`, `Dropout`, `Dropout2d`, `Embedding`, `GRUCell`, `LSTMCell` |
| **Activations** | `Identity`, `ReLU`, `Sigmoid`, `Tanh`, `GELU`, `SiLU` |
| **Losses** | `mse_loss`, `cross_entropy_loss`, `bce_with_logits_loss`, `l1_loss`, `smooth_l1_loss`, `kl_div_loss` |
| **Optimizers** | `SGD` (with momentum), `Adam`, `AdamW` — all support parameter groups for per-group LR |
| **LR Scheduling** | `StepDecay`, `CosineScheduler`, `WarmupScheduler` (composable), `PlateauScheduler` |
| **Mixed Precision** | `Float16`/`BFloat16` dtype casting, `GradScaler` for loss scaling |
| **Monitor** | Human-readable ETA, CPU/GPU/RAM/VRAM tracking, live web dashboard |
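
As an illustration of what composing a warmup with a cosine schedule means numerically, here is a standalone sketch of one common formulation: linear warmup to the base rate, then cosine decay to a floor. The function name, its parameters, and the step-based indexing are assumptions for this example and do not describe the signatures of `WarmupScheduler` or `CosineScheduler`.

```rust
use std::f64::consts::PI;

// Linear warmup for `warmup` steps, then cosine decay from `base_lr`
// down to `min_lr` at step `total`. Illustrative only.
fn warmup_cosine_lr(step: usize, warmup: usize, total: usize, base_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        return base_lr * (step + 1) as f64 / warmup as f64;
    }
    let span = total.saturating_sub(warmup).max(1) as f64;
    // Progress through the decay phase, clamped to [0, 1].
    let t = ((step - warmup) as f64 / span).min(1.0);
    min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (PI * t).cos())
}

fn main() {
    for step in [0, 9, 10, 60, 110] {
        println!("step {:3}: lr = {:.5}", step, warmup_cosine_lr(step, 10, 110, 0.01, 0.0));
    }
}
```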

### Graph Builder

| Method | What it does |
|--------|-------------|
| `from(m).through(m)` | Linear chain |
| `fork(m)` | Side branch: runs module, captures output as tag, stream continues unchanged |
| `input(names)` | Auxiliary graph inputs, accessible via `using(name)` — multi-input graphs |
| `split(modules![...]).merge(op)` | Parallel branches, merged by `Add` or `Mean` |
| `also(m)` | Residual connection: `input + m(input)` |
| `tag(name)` / `using(refs)` | Named references — backward (same pass) or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond(cond, max)` | Condition before body (0..max iterations) |
| `loop_body(body).until_cond(cond, max)` | Condition after body (1..max iterations) |
| `gate(router, modules![...])` | Soft routing — all experts execute, weighted combination |
| `switch(selector, modules![...])` | Hard routing — only selected branch executes |
| `map(body).each()` | Apply body to each element along dim 0 |
| `map(body).over(tag)` | Iterate over a tagged tensor |
| `map(body).slices(n)` | Decompose last dim into n slices, map, recompose |
| `.batched()` | Fast path for Map — full batch in one call |
| `tag_group(name)` | Name parallel branches: `split(...).tag_group("head")` |
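
The difference between the two conditional loops in the table (condition before the body versus after it) can be sketched with plain closures. The helpers below are hypothetical, and they assume the condition means "halt when true", which is a guess based on the `halt_condition` name used earlier; they only demonstrate the 0..max versus 1..max iteration counts.

```rust
// Condition checked BEFORE the body: may run zero times.
fn while_cond<S>(
    mut state: S,
    body: impl Fn(S) -> S,
    halt: impl Fn(&S) -> bool,
    max: usize,
) -> (S, usize) {
    let mut iters = 0;
    while iters < max && !halt(&state) {
        state = body(state);
        iters += 1;
    }
    (state, iters)
}

// Condition checked AFTER the body: runs at least once when max >= 1.
fn until_cond<S>(
    mut state: S,
    body: impl Fn(S) -> S,
    halt: impl Fn(&S) -> bool,
    max: usize,
) -> (S, usize) {
    let mut iters = 0;
    while iters < max {
        state = body(state);
        iters += 1;
        if halt(&state) {
            break;
        }
    }
    (state, iters)
}

fn main() {
    // The halt condition already holds for the initial state.
    println!("{:?}", while_cond(10, |x| x + 1, |x| *x >= 10, 5)); // (10, 0)
    println!("{:?}", until_cond(10, |x| x + 1, |x| *x >= 10, 5)); // (11, 1)
}
```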

### Training Tools

| Tool | What it does |
|------|-------------|
| `clip_grad_norm` | L2 norm gradient clipping |
| `clip_grad_value` | Element-wise gradient clamping |
| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoint with partial loading, persists parameters + buffers, structural hash validation, `LoadReport` (file path or `Write`/`Read`) |
| `Parameter::freeze` / `unfreeze` | Disable/enable gradient tracking per parameter |
| `xavier_uniform/normal` | Weight initialization (also `kaiming_*` via `nn::init`) |
| LR schedulers | `StepDecay`, `CosineScheduler`, `WarmupScheduler`, `PlateauScheduler` (composable) |
| `GradScaler` | Dynamic loss scaling for mixed precision (float16) training |
| `cast_parameters` | Cast model parameters to any dtype |
| **Background** | `CpuWorker` (work queue), `ModelSnapshot` / `snapshot_cpu()` — offload checkpoints & eval to a background thread |
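
For reference, the two clipping strategies in the table differ as follows: norm clipping rescales all gradients by one shared factor so their direction is preserved, while value clipping clamps each element independently. The sketch below operates on plain `Vec<f32>` gradients; `clip_by_norm` and `clip_by_value` are illustrative names, and the small epsilon in the scale factor is an assumption borrowed from common practice.

```rust
// Rescale all gradients so their global L2 norm is at most `max_norm`.
// Returns the norm measured before clipping.
fn clip_by_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    let total = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|v| v * v)
        .sum::<f32>()
        .sqrt();
    if total > max_norm {
        // Epsilon guards against division by a vanishing norm.
        let scale = max_norm / (total + 1e-6);
        for g in grads.iter_mut() {
            for v in g.iter_mut() {
                *v *= scale;
            }
        }
    }
    total
}

// Clamp every gradient element into [-limit, limit].
fn clip_by_value(grads: &mut [Vec<f32>], limit: f32) {
    for g in grads.iter_mut() {
        for v in g.iter_mut() {
            *v = v.clamp(-limit, limit);
        }
    }
}

fn main() {
    let mut grads = vec![vec![3.0], vec![4.0]];
    let norm = clip_by_norm(&mut grads, 1.0);
    println!("norm before clipping: {norm}, after: {grads:?}");

    let mut grads = vec![vec![-2.0, 0.5, 3.0]];
    clip_by_value(&mut grads, 1.0);
    println!("{grads:?}");
}
```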

### Module Traits

Beyond the core `forward`/`parameters` methods, `Module` provides optional
methods that the graph recognizes automatically:

| Method | Default | What happens |
|--------|---------|-------------|
| `as_named_input()` | `None` | Returns `&dyn NamedInputModule` — loop and node `using()` refs arrive as a named map |
| `reset()` | no-op | Loops auto-call before iterating — clears per-forward state |
| `detach_state()` | no-op | `graph.detach_state()` propagates — breaks gradient chains on retained state |

Stateful modules just override `reset()` and/or `detach_state()` directly —
no separate trait impls needed. Modules that own child modules implement
`sub_modules()` for recursive device placement, training mode, and parameter
collection.
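
The "override only what you need" pattern relies on default trait methods. The sketch below shows the idea with a hypothetical `SketchModule` trait on scalar inputs; it is deliberately simplified (for instance it takes `&mut self`) and is not the real `Module` definition.

```rust
// Hypothetical trait mirroring the idea of optional hooks with no-op defaults.
trait SketchModule {
    fn forward(&mut self, x: f32) -> f32;
    fn reset(&mut self) {}        // default: nothing to clear
    fn detach_state(&mut self) {} // default: nothing to detach
}

// Stateless module: the defaults are enough.
struct Scale(f32);

impl SketchModule for Scale {
    fn forward(&mut self, x: f32) -> f32 {
        x * self.0
    }
}

// Stateful module: overrides `reset` to clear its accumulator.
struct RunningSum {
    total: f32,
}

impl SketchModule for RunningSum {
    fn forward(&mut self, x: f32) -> f32 {
        self.total += x;
        self.total
    }
    fn reset(&mut self) {
        self.total = 0.0;
    }
}

// A loop driver that calls `reset` before iterating, whatever the module is.
fn run_loop(module: &mut dyn SketchModule, inputs: &[f32]) -> Vec<f32> {
    module.reset();
    inputs.iter().map(|&x| module.forward(x)).collect()
}

fn main() {
    let mut acc = RunningSum { total: 0.0 };
    println!("{:?}", run_loop(&mut acc, &[1.0, 2.0, 3.0])); // [1.0, 3.0, 6.0]
    println!("{:?}", run_loop(&mut acc, &[1.0, 2.0, 3.0])); // same again after reset
    println!("{:?}", run_loop(&mut Scale(2.0), &[1.0, 2.0, 3.0])); // [2.0, 4.0, 6.0]
}
```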

### Observation & Trends

Tags double as observation points — collect metrics during training, flush
to epoch history, and query trends to drive training decisions:

```rust
for epoch in 0..num_epochs {
    for (input, target) in &batches {
        let pred = graph.forward(&input)?;
        graph.collect(&["hidden"])?;                 // from graph tag

        let loss = mse_loss(&pred, &target)?;
        graph.record_scalar("loss", loss.item()?);   // external metric
    }
    graph.flush(&["hidden", "loss"]);

    if graph.trend("loss").stalled(5, 1e-4) {
        // decay learning rate
    }
}
```

| Method | What it does |
|--------|-------------|
| `g.tagged(tag)` | Access a tagged node's output after forward |
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric collection |
| `g.record_scalar(tag, value)` | Inject external metrics |
| `g.trend(tag)` | Epoch-level trend: `slope`, `stalled`, `improving`, `converged` |
| `g.trends(tags)` | Group trends: `all_improving`, `any_stalled`, `mean_slope` |
| `g.end_step()` / `g.end_epoch()` | Training housekeeping |
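
One plausible way to compute such trend queries is a least-squares slope over the most recent epochs. The sketch below is a guess at the idea, not floDl's implementation: in particular, treating `stalled(window, tol)` as "the absolute slope over the last `window` epochs is below `tol`" is an assumption made for this example.

```rust
// Least-squares slope of `values` against their index (0, 1, 2, ...).
fn slope(values: &[f64]) -> Option<f64> {
    let n = values.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mean_x = (nf - 1.0) / 2.0;
    let mean_y = values.iter().sum::<f64>() / nf;
    let mut num = 0.0;
    let mut den = 0.0;
    for (i, y) in values.iter().enumerate() {
        let dx = i as f64 - mean_x;
        num += dx * (y - mean_y);
        den += dx * dx;
    }
    Some(num / den)
}

// Assumed meaning: the metric barely moved over the last `window` epochs.
fn stalled(history: &[f64], window: usize, tol: f64) -> bool {
    if window < 2 || history.len() < window {
        return false;
    }
    let recent = &history[history.len() - window..];
    slope(recent).map_or(false, |s| s.abs() < tol)
}

fn main() {
    let loss = [1.0, 0.5, 0.2, 0.2, 0.2, 0.2, 0.2];
    println!("slope over all epochs: {:?}", slope(&loss));
    println!("stalled over last 5:   {}", stalled(&loss, 5, 1e-4));
}
```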

### Visualization

```rust
println!("{}", g.dot());                       // Graphviz DOT with parameter counts
let svg = g.svg(Some("model.svg"))?;          // render to SVG

// Timing-annotated: nodes colored green->yellow->red by execution time.
g.enable_profiling();
g.forward(&input)?;
g.svg_with_profile(Some("profile.svg"))?;

// Training curves as self-contained HTML.
g.plot_html("training.html", &["loss", "head"])?;
g.export_trends("metrics.csv", &["loss"])?;
```

### Numerical Verification

Every differentiable path is verified against finite-difference gradients:
- 37 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW)
- 329 library tests, zero clippy warnings — all tests run on both CPU and CUDA
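
A finite-difference check of this kind compares an analytic gradient against a central-difference estimate. The sketch below shows the technique on a hand-written scalar function; the function names, the step size, and the tolerance are illustrative assumptions and are not taken from floDl's test suite.

```rust
// Central-difference estimate of the gradient of `f` at `x`.
fn numerical_grad(f: impl Fn(&[f64]) -> f64, x: &[f64], eps: f64) -> Vec<f64> {
    let mut probe = x.to_vec();
    let mut grad = Vec::with_capacity(x.len());
    for i in 0..x.len() {
        let orig = probe[i];
        probe[i] = orig + eps;
        let plus = f(&probe);
        probe[i] = orig - eps;
        let minus = f(&probe);
        probe[i] = orig; // restore before moving to the next coordinate
        grad.push((plus - minus) / (2.0 * eps));
    }
    grad
}

// True if every analytic component is within `tol` of the numerical one.
fn grad_check(analytic: &[f64], numerical: &[f64], tol: f64) -> bool {
    analytic.len() == numerical.len()
        && analytic
            .iter()
            .zip(numerical)
            .all(|(a, n)| (a - n).abs() <= tol)
}

// f(x) = x0^2 + 3 * x0 * x1, with gradient [2*x0 + 3*x1, 3*x0].
fn f(x: &[f64]) -> f64 {
    x[0] * x[0] + 3.0 * x[0] * x[1]
}

fn f_grad(x: &[f64]) -> Vec<f64> {
    vec![2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]
}

fn main() {
    let x = [1.0, 2.0];
    let numerical = numerical_grad(f, &x, 1e-5);
    println!("analytic:  {:?}", f_grad(&x));
    println!("numerical: {:?}", numerical);
    println!("match: {}", grad_check(&f_grad(&x), &numerical, 1e-6));
}
```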

## Why Rust for Deep Learning?

### The memory management problem

Python adds ~3-5 µs of framework overhead to every GPU operation. For
architectures built on many small sequential operations — recurrent steps,
iterative refinement, multi-head attention — this overhead dominates.

Go solves the dispatch overhead with compiled binaries and goroutines, but
Go's garbage collector cannot manage VRAM deterministically. GPU memory lives
in libtorch's C++ allocator — invisible to Go's GC. An earlier Go
implementation required a 5-phase memory management system: atomic refcounting,
saved-tensor lifecycle, GC callbacks, VRAM budgets, and autograd Scope.
Hundreds of lines of `runtime.KeepAlive`, `Retain()`/`Release()`, and
pending-free queues.

Rust's ownership model eliminates all of this. `Tensor` owns a C++ handle.
`Drop` frees it immediately when it goes out of scope. No GC, no finalizers,
no reference counting, no VRAM budget heuristics, no KeepAlive. Five phases
of memory management infrastructure replaced by a single `impl Drop for Tensor`.
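
The deterministic release described above can be demonstrated with a toy handle type. In this sketch `DeviceBuffer` is a hypothetical stand-in for a tensor handle, and a `Cell<usize>` counter plays the role of the allocator's live-byte total; no real GPU memory is involved.

```rust
use std::cell::Cell;

// Hypothetical handle: registers its size on creation, releases it in Drop.
struct DeviceBuffer<'a> {
    bytes: usize,
    allocator: &'a Cell<usize>, // stands in for the allocator's live-byte count
}

impl<'a> DeviceBuffer<'a> {
    fn new(bytes: usize, allocator: &'a Cell<usize>) -> Self {
        allocator.set(allocator.get() + bytes);
        DeviceBuffer { bytes, allocator }
    }
}

impl Drop for DeviceBuffer<'_> {
    fn drop(&mut self) {
        // Runs at the exact point the owner goes out of scope.
        self.allocator.set(self.allocator.get() - self.bytes);
    }
}

// Records the live-byte count at each scope boundary.
fn trace_lifetimes() -> Vec<usize> {
    let allocator = Cell::new(0);
    let mut trace = Vec::new();
    {
        let _weights = DeviceBuffer::new(1024, &allocator);
        trace.push(allocator.get());
        {
            let _scratch = DeviceBuffer::new(4096, &allocator);
            trace.push(allocator.get());
        } // _scratch is released here
        trace.push(allocator.get());
    } // _weights is released here
    trace.push(allocator.get());
    trace
}

fn main() {
    println!("{:?}", trace_lifetimes()); // [1024, 5120, 1024, 0]
}
```

The live-byte count drops at each closing brace, with no collector pass or explicit release call in between.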

### Zero-cost safety

Rust's type system catches errors at compile time that other languages defer
to runtime:

- **Ownership**: tensors are freed exactly once, exactly when no longer needed
- **Result types**: every fallible operation returns `Result<T>` — no silent
  error propagation, no nil pointer panics
- **No data races**: the borrow checker prevents concurrent mutation bugs

### Same GPU kernels

floDl binds libtorch — the same C++ library that powers PyTorch. The actual
GPU math (CUDA kernels, cuBLAS, cuDNN) is identical. floDl replaces everything
above: the dispatch path, autograd tracking, module composition, and graph
execution.

## Performance

floDl runs the same CUDA kernels as PyTorch — the performance difference comes
from what happens *between* kernel launches: dispatch overhead, autograd
bookkeeping, and memory management. Rust eliminates Python's per-op overhead
and the GC pauses that plague Go.

Measured on a real training workload (FBRL letter recognition — recurrent
attention with a 9-component loss stack), same model, same data, same GPU:

| Metric | PyTorch 2.5.1 | flodl | Delta |
|--------|--------------|-------|-------|
| Avg epoch | 50.1s | 42.1s | **-16%** |
| GPU utilization | ~80% (spiky) | 88-92% (flat) | more stable |
| VRAM | 2,805 MB | 2,977 MB | +6%* |

\* The extra VRAM is attributed to static libtorch linkage, the monitor thread, and gzip checkpoint compression.

Full methodology, raw data, and reproduction commands:
**[Benchmark Report](docs/benchmark.md)** |
[Raw artifacts](https://github.com/fab2s/fbrl/tree/102225b) (both sides, committed)

### Build profiles

Add this to your project's `Cargo.toml` to get optimized floDl with fast
recompilation of your own code:

```toml
# Optimize floDl in dev builds — your code stays fast to compile.
# After the first build, only your graph code recompiles.
[profile.dev.package.flodl]
opt-level = 3

[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
```

| Profile | flodl | Your code | Typical rebuild |
|---------|-------|-----------|-----------------|
| `cargo build` | `-O3` (cached) | `-O0` (fast) | < 2s |
| `cargo build --release` | `-O3` + LTO | `-O3` + LTO | full link |

The GPU kernels (cuBLAS, cuDNN) run at the same speed regardless of Rust
optimization level — the profile settings affect graph dispatch, autograd
bookkeeping, and module overhead.

## Hardware Compatibility

floDl is developed and tested on an NVIDIA GTX 1060 (6 GB VRAM, Pascal
architecture). It works out of the box — no version pinning, no feature
flags, no workarounds.

This matters because PyTorch dropped Pascal support after version 2.5.1.
Training on older GPUs now requires pinning `torch==2.5.1` and hoping
nothing in your dependency tree pulls a newer version. floDl sidesteps
this entirely: it links against libtorch's stable C API, which continues
to support every CUDA architecture that the driver supports.

If your GPU runs `nvidia-smi`, floDl can train on it.

## Architecture

```
+-----------------------------------------------------------+
|  User Code / Model Definitions                            |
+-----------------------------------------------------------+
|  monitor/  ETA, resource tracking, live web dashboard     |
+-----------------------------------------------------------+
|  graph/    Fluent builder, execution, DOT/SVG             |
+-----------------------------------------------------------+
|  nn/       Modules, losses, optimizers, checkpoints       |
+-----------------------------------------------------------+
|  autograd/ Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
|  tensor/   Owned tensors with Drop, CPU + CUDA            |
+-----------------------------------------------------------+
|  flodl-sys   FFI bindings to libtorch C++ shim            |
+-----------------------------------------------------------+
|  libtorch / CUDA / CPU                                    |
+-----------------------------------------------------------+
```

floDl is developed and tested on **NVIDIA CUDA** (Pascal and newer) and
**CPU**. Since floDl binds libtorch — not CUDA directly — additional backends
(AMD ROCm, Apple MPS, Intel XPU) are architecturally possible but not yet
exposed or tested. Contributions welcome — see [CONTRIBUTING.md](CONTRIBUTING.md).

## Documentation

### Choose your path

| Background | Start here |
|-----------|-----------|
| **New to Rust** | [Rust for PyTorch Users](docs/tutorials/00-rust-primer.md) — 10 patterns in 15 minutes |
| **Know Rust, new to DL** | [Tensors](docs/tutorials/01-tensors.md) then [Training](docs/tutorials/04-training.md) |
| **Know PyTorch** | [PyTorch Migration Guide](docs/pytorch_migration.md) then [Graph Builder](docs/tutorials/05-graph-builder.md) |
| **Just show me code** | [`quickstart`](flodl/examples/quickstart/) or [`showcase`](flodl/examples/showcase/) |

### Tutorials

Step-by-step guides from basics to advanced, each with code examples:

0. **[Rust for PyTorch Users](docs/tutorials/00-rust-primer.md)** — 10 Rust patterns in 15 minutes (new to Rust? start here)
1. **[Tensors](docs/tutorials/01-tensors.md)** — creation, ops, error handling, memory
2. **[Autograd](docs/tutorials/02-autograd.md)** — variables, gradients, backward pass
3. **[Modules](docs/tutorials/03-modules.md)** — Linear, Conv2d, normalization, RNN cells
4. **[Training](docs/tutorials/04-training.md)** — losses, optimizers, full training loop
5. **[Graph Builder](docs/tutorials/05-graph-builder.md)** — the fluent API from simple to complex
6. **[Advanced Graphs](docs/tutorials/06-advanced-graphs.md)** — forward refs, loops, gates, switches
7. **[Visualization](docs/tutorials/07-visualization.md)** — DOT/SVG output, reading diagrams
8. **[Utilities](docs/tutorials/08-utilities.md)** — checkpoints, clipping, freezing, initialization
9. **[Training Monitor](docs/tutorials/09-monitor.md)** — ETA, resource tracking, live web dashboard

### Design

- [Benchmark](docs/benchmark.md) — flodl vs PyTorch head-to-head with raw data
- [Roadmap](docs/design/roadmap.md) — development plan and port status
- [Trajectory Thesis](docs/design/trajectory-thesis.md) — geometric intuition behind the project

### Examples

- [`quickstart`](flodl/examples/quickstart/) — build, train, and monitor a model with residual connections
- [`sine_wave`](flodl/examples/sine_wave/) — sine regression with monitor, checkpoint round-trip
- [`mixed_precision`](flodl/examples/mixed_precision/) — float16 training with `GradScaler`
- [`transfer_learning`](flodl/examples/transfer_learning/) — checkpoint, partial load, freeze, fine-tune
- [`schedulers`](flodl/examples/schedulers/) — warmup + cosine + plateau composition
- [`observation`](flodl/examples/observation/) — collect, flush, trend queries, early stopping
- [`showcase`](flodl/examples/showcase/) — every graph builder method in one graph

## Story

floDl started as a question: what would a deep learning framework look like
if you designed it around Rust's ownership model instead of fighting a garbage
collector?

An [earlier attempt in Go](https://github.com/fab2s/goDl) proved the
architecture — the graph builder, the module system, the observation engine —
but hit a wall: Go's GC cannot manage GPU memory deterministically. That
required building five layers of memory management infrastructure on top of
the language, not with it.

Rust solved this at the language level. `impl Drop for Tensor` replaced
hundreds of lines of lifecycle management. The graph builder, module
composition, and design philosophy carried forward; the memory fights didn't.

## License

floDl is open-source software licensed under the [MIT license](./LICENSE).