nbml

A minimal machine learning library built on ndarray for low-level ML algorithm development in Rust.

Unlike high-level frameworks, nbml provides bare primitives and a lightweight optimizer API for building custom neural networks from scratch. If you want comfortable abstractions, see Burn. If you want to understand what's happening under the hood and have full control, nbml gives you the building blocks.

Features

Core primitives: Transformers, Attention, LSTM, Conv2D, Feedforward layers, etc
Activation functions: ReLU, Sigmoid, Tanh, etc
Layers: Softmax, LayerNorm, Sequence Pooling
Optimizers: AdamW, SGD
Utilities: Variable Sequence Batching, Gradient Clipping, Gumbel Softmax, Plots, etc
Minimal abstractions: Direct ndarray integration for custom algorithms

Quick Start

use nbml::nn::FFN;
use nbml::f::Activation;
use nbml::optim::{AdamW, Optimizer, ToParams};

// Build a simple feedforward network
let mut model = FFN::new(vec![
    (784, 12, Activation::Relu),
    (12, 1, Activation::Sigmoid),
]);

// Create optimizer
let mut optimizer = AdamW::default().with(&mut model);

// Training loop (simplified)
for batch in training_data {
    let output = model.forward(batch.x, true);
    let loss = cross_entropy(&output, &batch.y);
    let grad = model.backward(loss);
    optimizer.step(&mut model);
    model.zero_grads();
}

Architecture

NN Layers (`nbml::nn`)

Layer: Single nonlinear projection layer
FFN: Feedforward network with configurable layers
RNN: Vanilla recurrent neural network
LSTM: Long Short-Term Memory Network
GRU: Gated Recurrent Unit
ESN: Echo-state network, fixed recurrence + readout
LSM: Liquid state machine
RNNReservoir: RNN reservoir (used by ESN)
SNNReservoir: Spiking neural network reservoir (used by LSM)
Conv2D: Explicit Im2Col Conv2D layer (CPU efficient, memory hungry)
PatchwiseConv2D: Patchwise Conv2D layer (CPU hungry, memory efficient)
LinearSSM: Discrete Linear SSM
Attention: Core softmax attention primitive
SelfAttention: Multi-head self attention
CrossAttention: Multi-head cross attention
LinearAttention: Linear self attention with recurrent matrix-valued state. Subquadratic alternative to softmax attention (Katharopoulos et al., 2020)
GatedLinearAttention: Gated linear attention with matrix-valued state and outer-product gating (Yang et al., 2024)
DeltaNet: Linear attention with delta rule for error-corrected state updates (Yang et al., 2024)
GatedDeltaNet: Gated delta network combining data-dependent gating with delta rule updates (Yang et al., 2025)
Transformer: Transformer encoder/decoder block
LinearTransformer: Transformer block using linear self attention instead of softmax attention
GlaTransformer: Transformer block using gated linear attention.
DeltaNetTransformer: Transformer block using DeltaNet linear attention.
GdnTransformer: Transformer block using Gated DeltaNet linear attention.

Layers (`nbml::layers`)

Layers that are only useful as components of other modules:

Dropout: Stateful dropout layer
Embedding: Learnable token embeddings
L2Norm: L2 normalization
Linear: Affine transformation
Softmax: Row-wise softmax
LayerNorm: Layer normalization
SequencePooling: Sequence mean-pooling

Optimizers (`nbml::optim`)

`ToParams`

ToParams connects your model's weights and gradients to the optimizer. Implement params() to return a list of Param entries, each pairing a weight array with its gradient. The optimizer reads these pointers on each step to update weights in-place — no ownership transfer, no framework magic.

Param::new and with_grad accept any ndarray dimension (Array1, Array2, Array3, etc.), so you don't need separate methods per shape:

pub struct Affine {
    w: Array2<f32>,
    b: Array1<f32>,

    d_w: Array2<f32>,
    d_b: Array1<f32>,
}

impl ToParams for Affine {
    fn params(&mut self) -> Vec<Param> {
        vec![
            Param::new(&mut self.w).with_grad(&mut self.d_w),
            Param::new(&mut self.b).with_grad(&mut self.d_b),
        ]
    }
}

Params compose — bubble them up from sub-modules to build arbitrary architectures:

pub struct AffineAffine {
    affine1: Affine,
    affine2: Affine,
}

impl ToParams for AffineAffine {
    fn params(&mut self) -> Vec<Param> {
        let mut params = vec![];
        params.append(&mut self.affine1.params());
        params.append(&mut self.affine2.params());
        params
    }
}

ToParams also provides zero_grads() to reset all gradient arrays after an optimizer step:

let mut aa = AffineAffine::new();
aa.forward(x, true) // <- implement this yourself
aa.backward(d_loss) // <- implement this yourself
optimizer.step(&mut aa);
aa.zero_grads();

Available optimizers:

AdamW: Adaptive moment estimation with weight decay
SGD: Stochastic gradient descent

Use .with(&mut impl ToParams) to initialize a stateful optimizer (like AdamW) for your network:

let mut model = AffineAffine::new();
let mut optim = AdamW::default().with(&mut model); // creates momentum/variance state for all parameters

`ToIntermediates`

ToIntermediates lets you snapshot and restore a module's cached activations (the values stored during forward() that backward() needs for gradient computation). This enables training loops that aren't possible in standard frameworks:

Recursive / weight-tied depth: Forward the same module N times, stashing intermediates between each call. During backward, restore each stash in reverse to compute correct weight gradients for every application.
Online learning with rollback: Checkpoint recurrent state mid-sequence, run an optimizer step, then restore and continue from the checkpoint.

Implement intermediates() to return mutable references to your cached values:

impl ToIntermediates for MyLayer {
    fn intermediates(&mut self) -> Vec<&mut dyn Intermediate> {
        vec![&mut self.cache.x, &mut self.cache.z]
    }
}

Then stash_intermediates() and apply_intermediates() work automatically:

let mut model = MyLayer::new();

// Forward pass A
model.forward(x_a, true);
let stash_a = model.stash_intermediates();

// Forward pass B (overwrites cache)
model.forward(x_b, true);
model.backward(d_loss_b); // correct grads for B

// Restore A's cache, compute A's grads
model.apply_intermediates(stash_a);
model.backward(d_loss_a); // correct grads for A

Intermediates are returned from stash_intermediates as Vec<ArrayD<T>> - aliased as IntermediateCache.

Activation Functions (`nbml::f`)

use nbml::f;

let x = Array2::from_vec(vec![-1.0, 0.0, 1.0]);
let activated = f::relu(&x);

Includes derivatives for backpropagation: d_relu, d_tanh, d_sigmoid, etc.

Design Philosophy

nbml is designed for:

Experimentation / Research: Prototyping of novel architectures, through full control of forward and backward passes
Nonstandard Architectures: A lot more freedom without autograd running the show
Transparency: No hidden magic, every operation is explicit
Compute-Constrained Deployment: Lightweight + no C deps. Very quick for small models

nbml is not designed for:

Large Scale Production deployment (use PyTorch, TensorFlow, or Burn)
Automatic differentiation (you wire up the backward pass for custom modules)
GPU acceleration (CPU-only via ndarray)

The included nn primitives are technically plug and play, but when composing them you will have to wire backward() yourself.

pub struct SequenceClassifier {
    pub transformer: GlaTransformer,
    pub pooling: SequencePooling,
    pub readout: Layer,
}

impl SequenceClassifier {
    pub fn new(d_model: usize) -> Self { ... }

    pub fn forward(&mut self, x: Array3<f32>, grad: bool) -> Array2<f32> {
        // on-the-spot mask
        let mask = Array3::ones((x.dim().0, x.dim().1, x.dim().1));
        let x = self.transformer.forward(x, mask.clone(), grad); // (B, S, D)
        let x = self.pooling.forward(x, mask); // (B, D)
        let x = self.readout.forward(x); // (B, D)

        x
    }

    pub fn backward(&mut self, d_loss: Array2<f32>) -> Array3<f32> {
        let d_loss = self.readout.backward(d_loss); // (B, D)
        let d_loss = self.pooling.backward(d_loss); // (B, S, D)
        let d_loss = self.transformer.backward(d_loss); // (B, S, D)

        d_loss
    }
}

Examples

Custom LSTM Training

use nbml::nn::LSTM;
use nbml::optim::{AdamW, Optimizer, ToParams};

let mut lstm = LSTM::new(
    128     // d_model or feature dimension
);
let mut optimizer = AdamW::default().with(&mut lstm);

// where batch.dim() is (batch_size, seq_len, features)
// and features == lstm.d_model == (128 in this case)

for batch in data {
    let output = lstm.forward(batch, true);
    let loss = compute_loss(&output, &target);
    let grad = lstm.backward(loss);
    optimizer.step(&mut lstm);
    lstm.zero_grads();
}

Multi-Head Attention

use nbml::nn::SelfAttention;

let mut attention = SelfAttention::new(
    512,  // d_in
    64,   // d_head
    8     // n_head
);

// where input.dim() is (batch_size, seq_len, features)
// features == d_in == (512 in this case)
// and mask == (batch_size, seq_len, seq_len)
// with each element as 1. or 0. depending on whether or not the token
// is padding

let output = attention.forward(
    input, // (batch_size, seq_len, features)
    mask,  // binary mask, (batch_size, seq_len, seq_len)
    true    // grad
);

Transformer Decoder

use nbml::nn::Transformer;
use nbml::ndarray::Array3;

let mut transformer = Transformer::new_decoder(
    512,  // d_in
    64,   // d_head
    8,    // n_head
);

let y_pred = transformer.forward(
    input, // (batch_size, seq_len, features)
    mask,  // binary mask, (batch_size, seq_len, seq_len)
    true    // grad
);

// some bs.
let d_y_pred = Array3::ones(y_pred.dim());
transformer.backward(d_y_pred);
transformer.zero_grads();

nbml 0.6.2