Module dfdx::nn


High-level neural network building blocks such as Linear, activations, and tuples as Modules. Also includes .save() & .load() for all Modules.

Mutable vs Immutable forwards

This is provided as two separate traits:

  1. ModuleMut::forward_mut() which receives &mut self.
  2. Module::forward() which receives &self.

This has nothing to do with whether gradients are being tracked or not. It only controls whether the module itself can be modified. Both OwnedTape and NoneTape can still be passed to both, and all modules should conform to this expected behavior.

In general, ModuleMut::forward_mut() should be used during training, and Module::forward() during evaluation/testing/inference/validation.

A few modules in this crate behave differently in these two functions: Dropout and DropoutOneIn, for example, apply dropout() in ModuleMut::forward_mut() and do nothing in Module::forward().
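
For example, a minimal sketch (assuming the dfdx prelude is in scope; constructor names are illustrative and may vary between versions):

use dfdx::prelude::*;

let mut model: Linear<5, 2> = Default::default();
let x_train: Tensor1D<5> = TensorCreator::zeros();
let x_eval: Tensor1D<5> = TensorCreator::zeros();

// training: forward_mut() takes &mut self, so stateful modules (e.g. dropout) can act
let y_train = model.forward_mut(x_train.trace());

// evaluation: forward() takes &self; the module itself is left untouched
let y_eval = model.forward(x_eval);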

Initializing

All modules implement Default, and this initializes all parameters to 0.0. The intention is then to call ResetParams::reset_params(), which randomizes the parameters:

let mut rng = rand::thread_rng(); // any rand::Rng works here
let mut model: Linear<5, 2> = Default::default(); // set all params to 0
model.reset_params(&mut rng); // randomize weights

Sequential models

Tuples implement Module, so you can string multiple modules together; a short usage sketch follows the network definitions below.

Here’s a single layer MLP:

type Mlp = (Linear<5, 3>, ReLU, Linear<3, 2>);

Here’s a more complex feedforward network that takes vectors of 5 elements and maps them to 2 elements.

type ComplexNetwork = (
    DropoutOneIn<2>, // 1. dropout 50% of input
    Linear<5, 3>,    // 2. pass into a linear layer
    LayerNorm1D<3>,  // 3. normalize elements
    ReLU,            // 4. activate with relu
    Residual<(       // 5. residual connection that adds input to the result of its sub-layers
        Linear<3, 3>,// 5.a. apply a linear layer
        ReLU,        // 5.b. apply ReLU
    )>,              // 5.c. the input to the residual is added back in after the sub-layers
    Linear<3, 2>,    // 6. apply another linear layer
);
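
A tuple model is built and called like any other Module. Here is a rough usage sketch for the Mlp alias above (assuming the dfdx prelude and the rand crate are available):

use dfdx::prelude::*;

let mut rng = rand::thread_rng();
let mut model: Mlp = Default::default(); // all params start at 0
model.reset_params(&mut rng);            // randomize weights

let x: Tensor1D<5> = TensorCreator::zeros();
let y: Tensor1D<2> = model.forward(x); // runs Linear -> ReLU -> Linear in order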

Saving and Loading

Use the SaveToNpz::save() and LoadFromNpz::load() traits. All modules provided here implement them, including tuples. These all save to/from .npz files, which are basically zip files containing multiple .npy files.
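
For example, a minimal round-trip sketch (assuming the traits are in scope via the prelude; the path is illustrative):

use dfdx::prelude::*;

let model: (Linear<5, 3>, ReLU, Linear<3, 2>) = Default::default();
model.save("dfdx-model.npz")?;

// load the parameters back into a freshly constructed model of the same type
let mut restored: (Linear<5, 3>, ReLU, Linear<3, 2>) = Default::default();
restored.load("dfdx-model.npz")?;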

This is implemented to be fairly portable. For example, you can load a simple MLP into PyTorch like so:

import torch
import numpy as np

# `mlp` is assumed to be a torch.nn.Module whose parameter names match the entries in the .npz file
state_dict = {k: torch.from_numpy(v) for k, v in np.load("dfdx-model.npz").items()}
mlp.load_state_dict(state_dict)

Structs

Unit struct that impls Module as calling abs() on input.
Average pool with 2d kernel that operates on images (3d) and batches of images (4d). Each patch reduces to the average of the values in the patch.
Applies average pooling over an entire image, fully reducing the height and width dimensions.
Requires Nightly Performs 2d convolutions on 3d and 4d images.
Unit struct that impls Module as calling cos() on input.
Does nothing as a Module, and calls dropout() as ModuleMut with probability self.p.
Does nothing as a Module, and calls dropout() as ModuleMut with probability 1.0 / N.
Unit struct that impls Module as calling exp() on input.
Requires Nightly Flattens 3d tensors to 1d, and 4d tensors to 2d.
A residual connection R around F: F(x) + R(x), as introduced in Deep Residual Learning for Image Recognition.
Implements layer normalization as described in Layer Normalization.
A linear transformation of the form weight * x + bias, where weight is a matrix, x is a vector or matrix, and bias is a vector.
Unit struct that impls Module as calling ln() on input.
Max pool with 2d kernel that operates on images (3d) and batches of images (4d). Each patch reduces to the maximum value in that patch.
Applies max pooling over an entire image, fully reducing the height and width dimensions.
Minimum pool with 2d kernel that operates on images (3d) and batches of images (4d). Each patch reduces to the minimum of the values in the patch.
Applies min pooling over an entire image, fully reducing the height and width dimensions.
Requires Nightly A multi-head attention layer.
Unit struct that impls Module as calling relu() on input.
Repeats T N times. This requires that T’s input is the same as its output.
A residual connection around F: F(x) + x, as introduced in Deep Residual Learning for Image Recognition.
Unit struct that impls Module as calling sigmoid() on input.
Unit struct that impls Module as calling sin() on input.
Unit struct that impls Module as calling softmax() on input.
Splits input into multiple heads. T should be a tuple, where every element of the tuple accepts the same input type.
Unit struct that impls Module as calling sqrt() on input.
Unit struct that impls Module as calling square() on input.
Unit struct that impls Module as calling tanh() on input.
Requires Nightly Transformer architecture as described in Attention is all you need.
Requires Nightly A transformer decoder.
Requires Nightly A transformer decoder block. Different than the normal transformer block as this self attention accepts an additional sequence from the encoder.
Requires Nightly A single transformer encoder block

Enums

Error that can happen while loading data from a .npz zip archive.

Traits

Something that can be loaded from a .npz file (which is a zip file).
Immutable forward of Input that produces Module::Output. See ModuleMut for mutable forward.
Mutable forward of Input that produces ModuleMut::Output. See Module for immutable forward.
Something that can reset its parameters.
Something that can be saved to a .npz (which is a .zip).

Functions

Reads data from a file already in a zip archive named filename.
Writes data to a new file in a zip archive named filename.

Type Definitions

Requires Nightly A transformer encoder.