//! A Rust library for performing GLM regression with data represented in
//! [`ndarray`](https://docs.rs/ndarray/)s.
//! The [`ndarray-linalg`](https://docs.rs/ndarray-linalg/) crate is used to allow
//! optimization of linear algebra operations with BLAS.
//!
//! This crate is in beta and the interface may change significantly. The tests include several
//! comparisons with R's `glm` and `glmnet` packages, but some cases may not be covered directly or
//! involve inherent ambiguities or imprecisions.
//!
//! # Feature summary:
//!
//! * Linear, logistic, Poisson, and binomial regression (more to come)
//! * Generic over floating-point type
//! * L1 (lasso), L2 (ridge), and elastic net regularization
//! * Statistical tests of fit result
//! * Alternative and custom link functions
//!
//! # Setting up BLAS backend
//!
//! See the [backend features of
//! `ndarray-linalg`](https://github.com/rust-ndarray/ndarray-linalg#backend-features)
//! for a description of the available BLAS configurations. You do not need to
//! include `ndarray-linalg` in your crate; simply provide the feature you need to
//! `ndarray-glm` and it will be forwarded to `ndarray-linalg`.
//!
//! Examples using OpenBLAS are shown here. In principle you should also be able to use
//! Netlib or Intel MKL, although these backends are untested.
//!
//! ## System OpenBLAS (recommended)
//!
//! Ensure that the development OpenBLAS library is installed on your system. In
//! Debian/Ubuntu, for instance, this means installing `libopenblas-dev`. Then, put the
//! following into your crate's `Cargo.toml`:
//! ```text
//! ndarray = { version = "0.17", features = ["blas"] }
//! ndarray-glm = { version = "0.0.15", features = ["openblas-system"] }
//! ```
//!
//! ## Compile OpenBLAS from source
//!
//! This option does not require OpenBLAS to be installed on your system, but the
//! initial compile time will be very long. Use the following lines in your crate's
//! `Cargo.toml`:
//! ```text
//! ndarray = { version = "0.17", features = ["blas"] }
//! ndarray-glm = { version = "0.0.15", features = ["openblas-static"] }
//! ```
//!
//! # Examples:
//!
//! Basic linear regression:
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit().unwrap();
//! // The result is a flat array with the first term as the intercept.
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! L2 regularization:
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! // The model builder standardizes the design matrix internally by default, so the penalty
//! // is applied uniformly across features regardless of their scale. The returned coefficients
//! // are always in the original (un-standardized) coordinate system.
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! // L2 (ridge) regularization can be applied with l2_reg().
//! let fit = model.fit_options().l2_reg(1e-5).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! Logistic regression with a non-canonical link function. The fit options may need adjusting,
//! since fits with non-canonical links are typically more difficult to converge:
//! ```
//! use ndarray_glm::{array, Logistic, logistic_link::Cloglog, ModelBuilder};
//!
//! let data_y = array![true, false, false, true, true];
//! let data_x = array![[0.5, 0.2], [0.1, 0.3], [0.2, 0.6], [0.6, 0.3], [0.4, 0.4]];
//! let model = ModelBuilder::<Logistic<Cloglog>>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit_options().max_iter(256).l2_reg(0.1).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! # Generalized linear models
//!
//! A generalized linear model (GLM) describes the expected value of a response
//! variable $`y`$ through a *link function* $`g`$ applied to a linear combination of
//! $`K-1`$ feature variables $`x_k`$ (plus an intercept term):
//!
//! ```math
//! g(\text{E}[y]) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{K-1} x_{K-1}
//! ```
//!
//! The right-hand side is the *linear predictor* $`\omega = \mathbf{x}^\top \boldsymbol{\beta}`$.
//! With $`N`$ observations the data matrix $`\mathbf{X}`$ has a leading column of
//! ones (for the intercept) and the model relates the response vector
//! $`\mathbf{y}`$ to $`\mathbf{X}`$ and parameters $`\boldsymbol{\beta}`$.
//!
//! ## Exponential family
//!
//! GLMs assume the response follows a distribution from the exponential family.
//! In canonical form the density is
//!
//! ```math
//! f(y;\eta) = \exp\!\bigl[\eta\, y - A(\eta) + B(y)\bigr]
//! ```
//!
//! where $`\eta`$ is the *natural parameter*, $`A(\eta)`$ is the *log-partition
//! function*, and $`B(y)`$ depends only on the observation.
//! The expected value and variance of $`y`$ follow directly from $`A`$:
//!
//! ```math
//! \text{E}[y] = A'(\eta), \qquad \text{Var}[y] = \phi\, A''(\eta)
//! ```
//!
//! where $`\phi`$ is the *dispersion parameter* (fixed to 1 for some families and estimated from
//! the data for others).
//!
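//! As a quick numerical illustration (independent of this crate's API), the Poisson family has
//! $`A(\eta) = e^\eta`$, so the mean $`A'(\eta)`$ is also $`e^\eta`$; a finite-difference check
//! confirms the identity $`\text{E}[y] = A'(\eta)`$:
//! ```
//! // Poisson log-partition function A(eta) = exp(eta).
//! fn a(eta: f64) -> f64 {
//!     eta.exp()
//! }
//!
//! let eta = 0.5_f64;
//! let h = 1e-6;
//! // Central finite difference approximating A'(eta).
//! let a_prime = (a(eta + h) - a(eta - h)) / (2.0 * h);
//! let mu = eta.exp(); // the Poisson mean for natural parameter eta
//! assert!((a_prime - mu).abs() < 1e-8);
//! ```
//!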
//! ## Link and variance functions
//!
//! The *link function* $`g`$ maps the expected response $`\mu = \text{E}[y]`$ to
//! the linear predictor:
//!
//! ```math
//! g(\mu) = \omega, \qquad \mu = g^{-1}(\omega)
//! ```
//!
//! The *canonical link* is the one for which $`\eta = \omega`$, i.e. the natural
//! parameter equals the linear predictor. In general, in terms of an arbitrary link function $`g`$
//! and the canonical link function $`g_0`$, we have $`\eta(\omega) = g_0(g^{-1}(\omega))`$.
//! Non-canonical links are supported in principle (see [`link`]).
//!
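//! As a concrete sketch in plain Rust (independent of this crate's internals), the logit link of
//! the logistic family and its inverse, the sigmoid, satisfy $`g(g^{-1}(\omega)) = \omega`$;
//! since the logit is canonical, $`\eta = \omega`$ here as well:
//! ```
//! // Logit link g(mu) and its inverse, the logistic sigmoid.
//! fn logit(mu: f64) -> f64 {
//!     (mu / (1.0 - mu)).ln()
//! }
//! fn sigmoid(omega: f64) -> f64 {
//!     1.0 / (1.0 + (-omega).exp())
//! }
//!
//! let omega = 1.25_f64;
//! let mu = sigmoid(omega);
//! // The round trip recovers the linear predictor.
//! assert!((logit(mu) - omega).abs() < 1e-12);
//! ```
//!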
//! The *variance function* $`V(\mu)`$ characterizes how the variance of $`y`$
//! depends on the mean, independent of the choice of link:
//!
//! ```math
//! \text{Var}[y] = \phi\, V(\mu)
//! ```
//!
//! ## Supported families
//!
//! | Family | Canonical link | $`V(\mu)`$ | $`\phi`$ |
//! |--------|---------------|-----------|---------|
//! | [`Linear`] (Gaussian) | Identity | $`1`$ | estimated |
//! | [`Logistic`] (Bernoulli) | Logit | $`\mu(1-\mu)`$ | $`1`$ |
//! | [`Poisson`] | Log | $`\mu`$ | $`1`$ |
//! | [`Binomial`] (fixed $`n`$) | Logit | $`\mu(1-\mu/n)`$ | $`1`$ |
//!
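//! A Poisson fit of count data follows the same builder pattern as the examples above
//! (a sketch, assuming the Poisson response is supplied as unsigned integer counts):
//! ```
//! use ndarray_glm::{array, ModelBuilder, Poisson};
//!
//! let data_y = array![1_u32, 0, 3, 2, 1];
//! let data_x = array![[0.2, 0.1], [0.1, 0.3], [0.8, 0.5], [0.5, 0.4], [0.3, 0.2]];
//! let model = ModelBuilder::<Poisson>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!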
//! ## Fitting via IRLS
//!
//! Parameters are estimated by maximum likelihood using iteratively reweighted
//! least squares (IRLS). Each step solves for the update
//! $`\Delta\boldsymbol{\beta}`$:
//!
//! ```math
//! -\mathbf{H}(\boldsymbol{\beta}) \cdot \Delta\boldsymbol{\beta} = \mathbf{J}(\boldsymbol{\beta})
//! ```
//!
//! where $`\mathbf{J}`$ and $`\mathbf{H}`$ are the gradient and Hessian of the
//! log-likelihood. With the canonical link and a diagonal variance matrix
//! $`\mathbf{S}`$ ($`S_{ii} = \text{Var}[y^{(i)} | \eta]`$) this simplifies to:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta}
//!   = \mathbf{X}^\top \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! ```
//!
//! Observation weights $`\mathbf{W}`$ (inverse dispersion per observation)
//! generalize this to:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{W} \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta}
//!   = \mathbf{X}^\top \mathbf{W} \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! ```
//!
//! The iteration converges when the log-likelihood is concave, which is
//! guaranteed with the canonical link.
//!
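//! One such update can be sketched in plain Rust for a two-parameter logistic model (a toy
//! illustration with a hand-rolled 2x2 solve; the crate itself delegates the linear algebra to
//! `ndarray-linalg`):
//! ```
//! // One IRLS step with the canonical (logit) link:
//! // solve (X^T S X) d = X^T [y - sigmoid(X b)] for the update d,
//! // where S_ii = mu_i (1 - mu_i).
//! fn sigmoid(z: f64) -> f64 {
//!     1.0 / (1.0 + (-z).exp())
//! }
//!
//! fn irls_step(x: &[[f64; 2]], y: &[f64], beta: [f64; 2]) -> [f64; 2] {
//!     // Accumulate the symmetric 2x2 matrix X^T S X and the vector X^T (y - mu).
//!     let (mut a11, mut a12, mut a22) = (0.0, 0.0, 0.0);
//!     let (mut j1, mut j2) = (0.0, 0.0);
//!     for (xi, &yi) in x.iter().zip(y) {
//!         let mu = sigmoid(beta[0] * xi[0] + beta[1] * xi[1]);
//!         let s = mu * (1.0 - mu);
//!         a11 += s * xi[0] * xi[0];
//!         a12 += s * xi[0] * xi[1];
//!         a22 += s * xi[1] * xi[1];
//!         let r = yi - mu;
//!         j1 += r * xi[0];
//!         j2 += r * xi[1];
//!     }
//!     // Solve the 2x2 system by Cramer's rule and apply the update.
//!     let det = a11 * a22 - a12 * a12;
//!     let d0 = (j1 * a22 - j2 * a12) / det;
//!     let d1 = (a11 * j2 - a12 * j1) / det;
//!     [beta[0] + d0, beta[1] + d1]
//! }
//!
//! // Leading column of ones for the intercept, then one feature.
//! let x = [[1.0, 0.5], [1.0, -0.2], [1.0, 0.8], [1.0, -0.6]];
//! let y = [1.0, 0.0, 1.0, 1.0];
//! let mut beta = [0.0, 0.0];
//! for _ in 0..10 {
//!     beta = irls_step(&x, &y, beta);
//! }
//! assert!(beta.iter().all(|b| b.is_finite()));
//! ```
//!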
//! ## Regularization
//!
//! Optional penalty terms discourage large parameter values:
//!
//! - **L2 (ridge):** adds $`\frac{\lambda_2}{2} \sum |\beta_k|^2`$ to the
//!   negative log-likelihood, equivalent to a Gaussian prior on the
//!   coefficients.
//! - **L1 (lasso):** adds $`\lambda_1 \sum |\beta_k|`$, implemented via ADMM,
//!   which drives small coefficients to exactly zero.
//! - **Elastic net:** combines both L1 and L2 penalties.
//!
//! In all cases the intercept ($`\beta_0`$) is excluded from the penalty.
//!
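//! An L1 (lasso) fit looks just like the earlier L2 example (a sketch, assuming an `l1_reg`
//! fit option analogous to the `l2_reg` option shown above):
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! // L1 regularization via ADMM; small coefficients may be driven to exactly zero.
//! let fit = model.fit_options().l1_reg(1e-3).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!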
//! For a more complete mathematical reference, see the
//! [derivation notes](https://felix-clark.github.io/src/tex/glm-math/main.pdf).

#![doc(html_root_url = "https://docs.rs/crate/ndarray-glm")]
mod data;
pub mod error;
mod fit;
mod glm;
mod irls;
pub mod link;
mod math;
pub mod model;
pub mod num;
mod regularization;
mod response;

// Import some common names into the top-level namespace
pub use {
    fit::Fit,
    model::ModelBuilder,
    response::logistic::link as logistic_link,
    response::{binomial::Binomial, linear::Linear, logistic::Logistic, poisson::Poisson},
};

// re-export common structs from ndarray
pub use ndarray::{Array1, Array2, ArrayView1, ArrayView2, array};