Expand description
A Rust library for performing GLM regression with data represented in
ndarrays.
The ndarray-linalg crate is used to allow
optimization of linear algebra operations with BLAS.
Numerical accuracy cannot be guaranteed in every case. The tests include several comparisons
with R’s glm and glmnet packages, but some cases may not be covered directly or may involve
inherent ambiguities or imprecision.
§Feature summary:
- Many common exponential families (see table below for full list)
- Generic over floating-point type
- L1 (lasso), L2 (ridge), and elastic net regularization
- Statistical tests of fit result
- Weighted observations (both frequency and variance)
- Alternative and custom link functions
§Setting up BLAS backend
See the backend features of
ndarray-linalg
for a description of the available BLAS configurations. You do not need to
include ndarray-linalg in your crate; simply provide the feature you need to
ndarray-glm and it will be forwarded to ndarray-linalg.
Examples using OpenBLAS are shown here. In principle you should also be able to use Netlib or Intel MKL, although these backends are untested.
§System OpenBLAS (recommended)
Ensure that the development OpenBLAS library is installed on your system. In
Debian/Ubuntu, for instance, this means installing libopenblas-dev. Then, put the
following into your crate’s Cargo.toml:
ndarray = { version = "0.17", features = ["blas"] }
ndarray-glm = { version = "0.1", features = ["openblas-system"] }
§Compile OpenBLAS from source
This option does not require OpenBLAS to be installed on your system, but the
initial compile time will be very long. Use the following lines in your crate’s
Cargo.toml.
ndarray = { version = "0.17", features = ["blas"] }
ndarray-glm = { version = "0.1", features = ["openblas-static"] }
§Examples:
Basic linear regression:
use ndarray_glm::{array, Linear, ModelBuilder};
let data_y = array![0.3, 1.3, 0.7];
let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
let fit = model.fit().unwrap();
// The result is a flat array with the first term as the intercept.
println!("Fit result: {}", fit.result);
L2 regularization:
use ndarray_glm::{array, Linear, ModelBuilder};
let data_y = array![0.3, 1.3, 0.7];
let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
// The model builder standardizes the design matrix internally by default, so the penalty
// is applied uniformly across features regardless of their scale. The returned coefficients
// are always in the original (un-standardized) coordinate system.
let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
// L2 (ridge) regularization can be applied with l2_reg().
let fit = model.fit_options().l2_reg(1e-5).fit().unwrap();
println!("Fit result: {}", fit.result);
Logistic regression with a non-canonical link function. The fit options may need adjusting, as such fits are typically more difficult to converge:
use ndarray_glm::{array, Logistic, logistic_link::Cloglog, ModelBuilder};
let data_y = array![true, false, false, true, true];
let data_x = array![[0.5, 0.2], [0.1, 0.3], [0.2, 0.6], [0.6, 0.3], [0.4, 0.4]];
let model = ModelBuilder::<Logistic<Cloglog>>::data(&data_y, &data_x).build().unwrap();
let fit = model.fit_options().max_iter(256).l2_reg(0.1).fit().unwrap();
println!("Fit result: {}", fit.result);
§Generalized linear models
A generalized linear model (GLM) describes the expected value of a response
variable $y$ through a link function $g$ applied to a linear combination of
$K-1$ feature variables $x_k$ (plus an intercept term):
g(\text{E}[y]) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{K-1} x_{K-1}
The right-hand side is the linear predictor $\omega = \mathbf{x}^\top \boldsymbol{\beta}$.
With $N$ observations the data matrix $\mathbf{X}$ has a leading column of
ones (for the intercept) and the model relates the response vector
$\mathbf{y}$ to $\mathbf{X}$ and parameters $\boldsymbol{\beta}$.
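In matrix form, with superscripts indexing the $N$ observations, this reads:
g(\text{E}[\mathbf{y}]) = \mathbf{X}\boldsymbol{\beta}, \qquad \mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & \cdots & x_{K-1}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_{K-1}^{(N)} \end{pmatrix}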
§Exponential family
GLMs assume the response follows a distribution from the exponential family. In canonical form the density is
f(y;\eta) = \exp\!\bigl[\eta\, y - A(\eta) + B(y)\bigr]
where $\eta$ is the natural parameter, $A(\eta)$ is the log-partition
function, and $B(y)$ depends only on the observation.
The expected value and variance of $y$ follow directly from $A$:
\text{E}[y] = A'(\eta), \qquad \text{Var}[y] = \phi\, A''(\eta)
where $\phi$ is the dispersion parameter (fixed to 1 for some families and estimated from
the data for others).
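As a concrete example, the Bernoulli distribution underlying logistic regression can be put in this canonical form:
f(y;\mu) = \mu^y (1-\mu)^{1-y} = \exp\!\left[ y \log\frac{\mu}{1-\mu} + \log(1-\mu) \right]
so that $\eta = \log[\mu/(1-\mu)]$ (the logit), $A(\eta) = \log(1 + e^\eta)$, and $B(y) = 0$. Differentiating recovers $A'(\eta) = \mu$ and $A''(\eta) = \mu(1-\mu)$, matching the logistic variance function in the table below.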
§Link and variance functions
The link function $g$ maps the expected response $\mu = \text{E}[y]$ to
the linear predictor:
g(\mu) = \omega, \qquad \mu = g^{-1}(\omega)
The canonical link is the one for which $\eta = \omega$, i.e. the natural
parameter equals the linear predictor. In general, in terms of an arbitrary link function $g$
and canonical link function $g_0$, we have $\eta(\omega) = g_0(g^{-1}(\omega))$. Non-canonical
links are supported in principle (see link).
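For instance, pairing the logistic (Bernoulli) family with the complementary log-log link $g(\mu) = \log[-\log(1-\mu)]$, whose inverse is $g^{-1}(\omega) = 1 - e^{-e^\omega}$, gives:
\eta(\omega) = g_0\!\left(g^{-1}(\omega)\right) = \log\frac{1 - e^{-e^\omega}}{e^{-e^\omega}}
where $g_0$ is the canonical logit link.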
The variance function $V(\mu)$ characterizes how the variance of $y$
depends on the mean, independent of the choice of link:
\text{Var}[y] = \phi\, V(\mu)
§Supported families
| Family | Canonical link | $V(\mu)$ | $\phi$ |
|---|---|---|---|
| Linear (Gaussian) | Identity | $1$ | estimated |
| Logistic (Bernoulli) | Logit | $\mu(1-\mu)$ | $1$ |
| Poisson | Log | $\mu$ | $1$ |
| Binomial (fixed $n$) | Logit | $\mu(1-\mu/n)$ | $1$ |
| Exponential | $-1/\mu$ | $\mu^2$ | $1$ |
| Gamma | $-1/\mu$ | $\mu^2$ | estimated |
| InvGaussian (inverse Gaussian) | $-1/(2\mu^2)$ | $\mu^3$ | estimated |
§Fitting via IRLS
Parameters are estimated by maximum likelihood using iteratively reweighted
least squares (IRLS). Each step solves for the update
$\Delta\boldsymbol{\beta}$:
-\mathbf{H}(\boldsymbol{\beta}) \cdot \Delta\boldsymbol{\beta} = \mathbf{J}(\boldsymbol{\beta})
where $\mathbf{J}$ and $\mathbf{H}$ are the gradient and Hessian of the
log-likelihood. With the canonical link and a diagonal variance matrix
$\mathbf{S}$ ($S_{ii} = \text{Var}[y^{(i)} | \eta]$) this simplifies to:
(\mathbf{X}^\top \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta} = \mathbf{X}^\top \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
Observation weights $\mathbf{W}$ (inverse dispersion per observation)
generalize this to:
(\mathbf{X}^\top \mathbf{W} \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta} = \mathbf{X}^\top \mathbf{W} \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
The iteration converges when the log-likelihood is concave, which is guaranteed with the canonical link.
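The canonical-link update can be sketched in plain Rust. This is a minimal illustration, not the crate's internal implementation: logistic regression with one feature plus an intercept, so the 2x2 system $(\mathbf{X}^\top \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta} = \mathbf{X}^\top (\mathbf{y} - \boldsymbol{\mu})$ can be solved by Cramer's rule instead of a BLAS backend.

```rust
// Minimal IRLS sketch for logistic regression with the canonical (logit) link.
// Uses plain slices rather than ndarray to stay dependency-free.

fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Fit [intercept, slope] by IRLS on a single-feature dataset.
fn irls_logistic(x: &[f64], y: &[f64], max_iter: usize, tol: f64) -> [f64; 2] {
    let mut beta = [0.0_f64; 2];
    for _ in 0..max_iter {
        // Accumulate J = X^T (y - mu) and H = X^T S X, with S_ii = mu_i (1 - mu_i).
        let (mut j0, mut j1) = (0.0, 0.0);
        let (mut h00, mut h01, mut h11) = (0.0, 0.0, 0.0);
        for (&xi, &yi) in x.iter().zip(y) {
            let mu = sigmoid(beta[0] + beta[1] * xi);
            let r = yi - mu;
            let s = mu * (1.0 - mu);
            j0 += r;
            j1 += r * xi;
            h00 += s;
            h01 += s * xi;
            h11 += s * xi * xi;
        }
        // Solve the 2x2 system H * db = J by Cramer's rule.
        let det = h00 * h11 - h01 * h01;
        let db0 = (j0 * h11 - j1 * h01) / det;
        let db1 = (h00 * j1 - h01 * j0) / det;
        beta[0] += db0;
        beta[1] += db1;
        if db0.abs() + db1.abs() < tol {
            break;
        }
    }
    beta
}

fn main() {
    let x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0];
    let y = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0];
    let beta = irls_logistic(&x, &y, 50, 1e-10);
    println!("intercept = {:.4}, slope = {:.4}", beta[0], beta[1]);
}
```

With a non-canonical link the Hessian gains extra terms, which is one reason those fits are harder to converge.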
§Regularization
Optional penalty terms discourage large parameter values:
- L2 (ridge): adds $\frac{\lambda_2}{2} \sum_k |\beta_k|^2$ to the negative log-likelihood, equivalent to a Gaussian prior on the coefficients.
- L1 (lasso): adds $\lambda_1 \sum_k |\beta_k|$, implemented via ADMM, which drives small coefficients to exactly zero.
- Elastic net: combines both L1 and L2 penalties.
In all cases the intercept ($\beta_0$) is excluded from the penalty.
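Combining the penalties above, the elastic-net objective minimized by the fit can be written as:
-\ell(\boldsymbol{\beta}) + \lambda_1 \sum_{k \geq 1} |\beta_k| + \frac{\lambda_2}{2} \sum_{k \geq 1} \beta_k^2
where $\ell$ is the log-likelihood and the sums start at $k = 1$ to exclude the intercept $\beta_0$. Setting $\lambda_1 = 0$ or $\lambda_2 = 0$ recovers pure ridge or lasso, respectively.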
For a more complete mathematical reference, see the derivation notes.
Re-exports§
pub use model::ModelBuilder;
Modules§
- error
- define the error enum for the result of regressions
- exp_link
- Link functions for exponential regression
- gamma_link
- Link functions for gamma regression
- inv_gauss_link
- Link functions for inverse Gaussian regression
- link
- Traits and utilities for link functions.
- logistic_link
- Link functions for logistic regression
- model
- Collect data for and configure a model
- num
- numerical trait constraints
Structs§
- Binomial
- Binomial regression with a fixed N. Non-canonical link functions are not possible at this time due to the awkward ergonomics with the const trait parameter N.
- Dataset
- Exponential
- Exponential regression
- Fit
- the result of a successful GLM fit
- FitOptions
- Specifies the fitting options
- Gamma
- Gamma regression
- InvGaussian
- Inverse Gaussian regression with the canonical link function $g_0(\mu) = -1/(2\mu^2)$.
- IrlsStep
- Represents a step in the IRLS. Holds the current guess and likelihood.
- Linear
- Linear regression with constant variance (Ordinary least squares).
- Logistic
- Logistic regression
- Poisson
- Poisson regression over an unsigned integer type.
Type Aliases§
- Array1
- one-dimensional array
- Array2
- two-dimensional array
- ArrayView1
- one-dimensional array view
- ArrayView2
- two-dimensional array view