# nbml
A minimal machine learning library built on ndarray for low-level ML algorithm development in Rust.
Unlike high-level frameworks, nbml provides bare primitives and a lightweight optimizer API for building custom neural networks from scratch. If you want comfortable abstractions, see Burn. If you want to understand what's happening under the hood and have full control, nbml gives you the building blocks.
## Features
- Core primitives: Attention, LSTM, RNN, Conv2D, feedforward layers, etc.
- Activation functions: ReLU, Sigmoid, Tanh, Softmax, etc.
- Optimizers: AdamW, SGD
- Utilities: variable-sequence batching, gradient clipping, Gumbel-Softmax, plots, etc.
- Minimal abstractions: Direct ndarray integration for custom algorithms
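Several of these utilities are small enough to sketch standalone. For example, gradient clipping by global L2 norm comes down to one rescale; the function below illustrates the technique itself (plain slices, not nbml's ndarray-based API):

```rust
/// Scale all gradients in place so their global L2 norm is at most `max_norm`.
fn clip_grad_norm(grads: &mut [f64], max_norm: f64) {
    let norm = grads.iter().map(|g| g * g).sum::<f64>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

fn main() {
    let mut grads = vec![3.0, 4.0]; // global norm = 5.0
    clip_grad_norm(&mut grads, 1.0);
    // rescaled to norm 1.0: [0.6, 0.8]
    assert!((grads[0] - 0.6).abs() < 1e-12);
    assert!((grads[1] - 0.8).abs() < 1e-12);
    println!("{:?}", grads);
}
```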
## Quick Start
```rust
use nbml::nn::{Activation, FFN};
use nbml::optim::{AdamW, ToParams};

// Build a simple feedforward network (layer sizes are illustrative)
let mut model = FFN::new(&[784, 128, 10], Activation::ReLU);

// Create optimizer and register the model's parameters
let mut optimizer = AdamW::default().with(&mut model);

// Training loop (simplified)
for batch in training_data {
    // forward pass, hand-written backward pass, optimizer step
}
```
## Architecture
### NN Layers (nbml::nn)
- Layer: Single nonlinear projection layer
- FFN: Feedforward network with configurable layers
- LSTM: Long Short-Term Memory network
- RNN: Vanilla recurrent neural network
- ESN: Echo-state network, fixed recurrence + readout
- LayerNorm: Layer normalization
- Pooling: Sequence mean-pooling
- Conv2D: Explicit im2col Conv2D layer (CPU efficient, memory hungry)
- PatchwiseConv2D: Patchwise Conv2D layer (CPU hungry, memory efficient)
- LinearSSM: Discrete linear SSM
- Attention: Core attention primitive
- SelfAttention: Multi-head self-attention
- CrossAttention: Multi-head cross-attention
- Transformer: Transformer encoder/decoder block
- GatedLinearAttention: Multi-head gated linear attention with matrix-valued state and outer-product gating (Yang et al., 2024)
- AttentionHead: Multi-head self-attention mechanism (deprecated; use SelfAttention)
- TransformerEncoder: Pre-norm transformer encoder (deprecated; use Transformer::new_encoder())
- TransformerDecoder: Pre-norm transformer decoder (deprecated; use Transformer::new_decoder())
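The attention layers above are all built around the same primitive: scaled dot-product attention, softmax(QK^T / sqrt(d))V. A minimal single-head, unmasked sketch of that computation in plain `Vec`s (for intuition only; nbml's layers operate on ndarray arrays):

```rust
/// softmax(Q K^T / sqrt(d)) V for a single head, no masking.
/// q, k, v are (seq_len, d) matrices as nested Vecs.
fn attention(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let d = q[0].len() as f64;
    q.iter()
        .map(|qi| {
            // attention logits of this query against every key, scaled by sqrt(d)
            let logits: Vec<f64> = k
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f64>() / d.sqrt())
                .collect();
            // numerically stable softmax over keys
            let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
            let exps: Vec<f64> = logits.iter().map(|l| (l - max).exp()).collect();
            let z: f64 = exps.iter().sum();
            // output row = softmax-weighted sum of value rows
            let mut out = vec![0.0; v[0].len()];
            for (w, vj) in exps.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(vj) {
                    *o += w / z * x;
                }
            }
            out
        })
        .collect()
}

fn main() {
    let q = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let k = q.clone();
    let v = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let out = attention(&q, &k, &v);
    // each output row is a convex combination of the value rows
    assert!(out[0][0] > 1.0 && out[0][0] < 3.0);
    println!("{:?}", out);
}
```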
### Optimizers (nbml::optim)
Implement the `ToParams` trait for gradient-based optimization (the `Param` type and method signature below are simplified for illustration):

```rust
impl ToParams for Affine {
    fn to_params(&mut self) -> Vec<Param> {
        // hand the optimizer mutable views of each weight and its gradient
        vec![
            Param::new(&mut self.w, &mut self.d_w),
            Param::new(&mut self.b, &mut self.d_b),
        ]
    }
}
```

You can bubble params up from nested layers:

```rust
impl ToParams for AffineAffine {
    fn to_params(&mut self) -> Vec<Param> {
        // concatenate the params of both child layers
        let mut params = self.first.to_params();
        params.extend(self.second.to_params());
        params
    }
}
```

`ToParams` will also let you zero gradients:

```rust
let mut aa = AffineAffine::new();
aa.forward(&x);       // <- implement this yourself
aa.backward(&d_out);  // <- implement this yourself
aa.zero_grads();
```
Available optimizers:
- AdamW: Adaptive moment estimation with bias correction
- SGD: Stochastic gradient descent with optional momentum
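For intuition, the momentum variant of SGD keeps one velocity buffer per parameter and folds each new gradient into it. A standalone sketch of the update rule itself (not nbml's optimizer internals; the `mu`/`lr` names are illustrative):

```rust
/// One SGD-with-momentum step: v <- mu * v + g; w <- w - lr * v.
fn sgd_momentum_step(w: &mut [f64], v: &mut [f64], g: &[f64], lr: f64, mu: f64) {
    for i in 0..w.len() {
        v[i] = mu * v[i] + g[i];
        w[i] -= lr * v[i];
    }
}

fn main() {
    let mut w = vec![1.0];
    let mut v = vec![0.0];
    // two steps with the same gradient: the velocity accumulates
    sgd_momentum_step(&mut w, &mut v, &[1.0], 0.1, 0.9);
    assert!((w[0] - 0.9).abs() < 1e-12); // 1.0 - 0.1 * 1.0
    sgd_momentum_step(&mut w, &mut v, &[1.0], 0.1, 0.9);
    assert!((w[0] - 0.71).abs() < 1e-12); // v = 1.9, so 0.9 - 0.19
    println!("{:?}", w);
}
```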
Use `.with(&mut impl ToParams)` to prepare a stateful optimizer (like AdamW) for your network:

```rust
let mut model = Model::new(/* ... */);
// AdamW allocates momentum and variance buffers for every parameter in model
let mut optim = AdamW::default().with(&mut model);
```
### Activation Functions (nbml::f)
```rust
use nbml::f;
use ndarray::Array1;

// call signatures simplified for illustration
let x = Array1::from_vec(vec![-1.0, 0.0, 2.0]);
let activated = f::relu(&x);
let probs = f::softmax(&x);
```
Includes derivatives for backpropagation: `d_relu`, `d_tanh`, `d_sigmoid`, etc.
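These pairs are just the textbook function/derivative definitions. For example, sigmoid and its derivative s(x)(1 - s(x)), sketched standalone on scalars (nbml's versions operate element-wise on ndarray arrays), with a finite-difference check:

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Derivative of sigmoid: s(x) * (1 - s(x)).
fn d_sigmoid(x: f64) -> f64 {
    let s = sigmoid(x);
    s * (1.0 - s)
}

fn main() {
    // check the analytic derivative against a central finite difference
    let x = 0.3;
    let h = 1e-6;
    let fd = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h);
    assert!((d_sigmoid(x) - fd).abs() < 1e-8);
    println!("d_sigmoid({x}) = {}", d_sigmoid(x));
}
```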
## Design Philosophy
nbml is designed for:
- Experimentation / Research: Prototyping novel architectures with full control over the forward and backward passes
- Transparency: No hidden magic, every operation is explicit
- Compute-Constrained Deployment: Lightweight with no C dependencies; very fast for small models
nbml is not designed for:
- Large-scale production deployment (use PyTorch, TensorFlow, or Burn)
- Automatic differentiation (you write the backward pass)
- GPU acceleration (CPU-only via ndarray)
- Plug-and-play models (you build everything yourself)
## Examples
### Custom LSTM Training
```rust
use nbml::nn::LSTM;
use nbml::optim::{AdamW, ToParams};

// d_model = 128 (hyperparameters are illustrative)
let mut lstm = LSTM::new(128);
let mut optimizer = AdamW::default().with(&mut lstm);

// where batch.dim() is (batch_size, seq_len, features)
// and features == lstm.d_model (128 in this case)
for batch in data {
    // forward, hand-written backward, optimizer step
}
```
### Multi-Head Attention
```rust
use nbml::nn::SelfAttention;

// d_in = 512, head count illustrative
let mut attention = SelfAttention::new(512, 8);

// where input.dim() is (batch_size, seq_len, features),
// features == d_in (512 in this case),
// and mask.dim() is (batch_size, seq_len, seq_len),
// with each element 1. or 0. depending on whether or not the token
// is padding
let output = attention.forward(&input, &mask);
```
### Transformer Decoder
```rust
use nbml::nn::{Activation, Transformer};
use ndarray::Array3;

// hyperparameters are illustrative
let mut transformer = Transformer::new_decoder(/* ... */);

let y_pred = transformer.forward(&x);

// dummy upstream gradient, just to demonstrate the API
let d_y_pred = Array3::<f64>::ones(y_pred.dim());
transformer.backward(&d_y_pred);
transformer.zero_grads();
```