ndarray_glm/lib.rs
//! A Rust library for performing GLM regression with data represented in
//! [`ndarray`](https://docs.rs/ndarray/)s.
//! The [`ndarray-linalg`](https://docs.rs/ndarray-linalg/) crate is used to allow
//! optimization of linear algebra operations with BLAS.
//!
//! Numerical accuracy cannot be 100% guaranteed. The tests include several comparisons with
//! R's `glm` and `glmnet` packages, but some cases may not be covered directly or may involve
//! inherent ambiguities or imprecisions.
//!
//! # Feature summary:
//!
//! * Many common exponential families (see table below for full list)
//! * Generic over floating-point type
//! * L1 (lasso), L2 (ridge), and elastic net regularization
//! * Statistical tests of fit result
//! * Weighted observations (both frequency and variance)
//! * Alternative and custom link functions
//!
//! # Setting up BLAS backend
//!
//! See the [backend features of
//! `ndarray-linalg`](https://github.com/rust-ndarray/ndarray-linalg#backend-features)
//! for a description of the available BLAS configurations. You do not need to
//! include `ndarray-linalg` in your crate; simply provide the feature you need to
//! `ndarray-glm` and it will be forwarded to `ndarray-linalg`.
//!
//! Examples using OpenBLAS are shown here. In principle you should also be able to use
//! Netlib or Intel MKL, although these backends are untested.
//!
//! ## System OpenBLAS (recommended)
//!
//! Ensure that the development OpenBLAS library is installed on your system. In
//! Debian/Ubuntu, for instance, this means installing `libopenblas-dev`. Then, put the
//! following into your crate's `Cargo.toml`:
//! ```text
//! ndarray = { version = "0.17", features = ["blas"] }
//! ndarray-glm = { version = "0.1", features = ["openblas-system"] }
//! ```
//!
//! ## Compile OpenBLAS from source
//!
//! This option does not require OpenBLAS to be installed on your system, but the
//! initial compile time will be very long. Use the following lines in your crate's
//! `Cargo.toml`:
//! ```text
//! ndarray = { version = "0.17", features = ["blas"] }
//! ndarray-glm = { version = "0.1", features = ["openblas-static"] }
//! ```
//!
//! # Examples:
//!
//! Basic linear regression:
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit().unwrap();
//! // The result is a flat array with the first term as the intercept.
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! L2 regularization:
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! // The model builder standardizes the design matrix internally by default, so the penalty
//! // is applied uniformly across features regardless of their scale. The returned coefficients
//! // are always in the original (un-standardized) coordinate system.
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! // L2 (ridge) regularization can be applied with l2_reg().
//! let fit = model.fit_options().l2_reg(1e-5).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! Logistic regression with a non-canonical link function. The fit options may need adjusting,
//! as such fits are typically more difficult to converge:
//! ```
//! use ndarray_glm::{array, Logistic, logistic_link::Cloglog, ModelBuilder};
//!
//! let data_y = array![true, false, false, true, true];
//! let data_x = array![[0.5, 0.2], [0.1, 0.3], [0.2, 0.6], [0.6, 0.3], [0.4, 0.4]];
//! let model = ModelBuilder::<Logistic<Cloglog>>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit_options().max_iter(256).l2_reg(0.1).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! # Generalized linear models
//!
//! A generalized linear model (GLM) describes the expected value of a response
//! variable $`y`$ through a *link function* $`g`$ applied to a linear combination of
//! $`K-1`$ feature variables $`x_k`$ (plus an intercept term):
//!
//! ```math
//! g(\text{E}[y]) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{K-1} x_{K-1}
//! ```
//!
//! The right-hand side is the *linear predictor* $`\omega = \mathbf{x}^\top \boldsymbol{\beta}`$.
//! With $`N`$ observations the data matrix $`\mathbf{X}`$ has a leading column of
//! ones (for the intercept) and the model relates the response vector
//! $`\mathbf{y}`$ to $`\mathbf{X}`$ and parameters $`\boldsymbol{\beta}`$.
//!
//! ## Exponential family
//!
//! GLMs assume the response follows a distribution from the exponential family.
//! In canonical form the density is
//!
//! ```math
//! f(y;\eta) = \exp\!\bigl[\eta\, y - A(\eta) + B(y)\bigr]
//! ```
//!
//! where $`\eta`$ is the *natural parameter*, $`A(\eta)`$ is the *log-partition
//! function*, and $`B(y)`$ depends only on the observation.
//! The expected value and variance of $`y`$ follow directly from $`A`$:
//!
//! ```math
//! \text{E}[y] = A'(\eta), \qquad \text{Var}[y] = \phi\, A''(\eta)
//! ```
//!
//! where $`\phi`$ is the *dispersion parameter* (fixed to 1 for some families and estimated from
//! the data for others).
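//!
//! For example, the Poisson distribution with mean $`\mu`$ fits this form:
//!
//! ```math
//! f(y;\mu) = \frac{\mu^y e^{-\mu}}{y!} = \exp\!\bigl[y \ln\mu - \mu - \ln y!\bigr]
//! ```
//!
//! so $`\eta = \ln\mu`$, $`A(\eta) = e^\eta`$, and $`B(y) = -\ln y!`$. Differentiating gives
//! $`\text{E}[y] = A'(\eta) = e^\eta = \mu`$ and $`\text{Var}[y] = A''(\eta) = \mu`$ (with
//! $`\phi = 1`$), recovering the familiar Poisson mean and variance.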
//!
//! ## Link and variance functions
//!
//! The *link function* $`g`$ maps the expected response $`\mu = \text{E}[y]`$ to
//! the linear predictor:
//!
//! ```math
//! g(\mu) = \omega, \qquad \mu = g^{-1}(\omega)
//! ```
//!
//! The *canonical link* is the one for which $`\eta = \omega`$, i.e. the natural
//! parameter equals the linear predictor. In general, in terms of an arbitrary link function $`g`$
//! and canonical link function $`g_0`$, we have $`\eta(\omega) = g_0(g^{-1}(\omega))`$. Non-canonical
//! links are supported in principle (see [`link`]).
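//!
//! For instance, pairing the logistic (Bernoulli) family, whose canonical link is the logit
//! $`g_0(\mu) = \ln\frac{\mu}{1-\mu}`$, with the complementary log-log link
//! $`g(\mu) = \ln(-\ln(1-\mu))`$ gives $`\mu = g^{-1}(\omega) = 1 - e^{-e^\omega}`$, and therefore
//!
//! ```math
//! \eta(\omega) = g_0(g^{-1}(\omega)) = \ln\bigl(e^{e^\omega} - 1\bigr)
//! ```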
//!
//! The *variance function* $`V(\mu)`$ characterizes how the variance of $`y`$
//! depends on the mean, independent of the choice of link:
//!
//! ```math
//! \text{Var}[y] = \phi\, V(\mu)
//! ```
//!
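//! In the canonical parameterization, $`V(\mu) = A''(\eta(\mu))`$ with $`\eta(\mu) = (A')^{-1}(\mu)`$.
//! For the logistic family, for example, $`A(\eta) = \ln(1 + e^\eta)`$, so
//! $`\mu = A'(\eta) = \frac{1}{1 + e^{-\eta}}`$ and $`V(\mu) = A''(\eta) = \mu(1 - \mu)`$,
//! matching the table below.
//!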
//! ## Supported families
//!
//! | Family | Canonical link | $`V(\mu)`$ | $`\phi`$ |
//! |--------|---------------|-----------|---------|
//! | [`Linear`] (Gaussian) | Identity | $`1`$ | estimated |
//! | [`Logistic`] (Bernoulli) | Logit | $`\mu(1-\mu)`$ | $`1`$ |
//! | [`Poisson`] | Log | $`\mu`$ | $`1`$ |
//! | [`Binomial`] (fixed $`n`$) | Logit | $`\mu(1-\mu/n)`$ | $`1`$ |
//! | [`Exponential`] | $`-1/\mu`$ | $`\mu^2`$ | $`1`$ |
//! | [`Gamma`] | $`-1/\mu`$ | $`\mu^2`$ | estimated |
//! | [`InvGaussian`] (inverse Gaussian) | $`-1/(2\mu^2)`$ | $`\mu^3`$ | estimated |
//!
//! ## Fitting via IRLS
//!
//! Parameters are estimated by maximum likelihood using iteratively reweighted
//! least squares (IRLS). Each step solves for the update
//! $`\Delta\boldsymbol{\beta}`$:
//!
//! ```math
//! -\mathbf{H}(\boldsymbol{\beta}) \cdot \Delta\boldsymbol{\beta} = \mathbf{J}(\boldsymbol{\beta})
//! ```
//!
//! where $`\mathbf{J}`$ and $`\mathbf{H}`$ are the gradient and Hessian of the
//! log-likelihood. With the canonical link and a diagonal variance matrix
//! $`\mathbf{S}`$ ($`S_{ii} = \text{Var}[y^{(i)} | \eta]`$) this simplifies to:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta}
//! = \mathbf{X}^\top \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! ```
//!
//! Observation weights $`\mathbf{W}`$ (inverse dispersion per observation)
//! generalize this to:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{W} \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta}
//! = \mathbf{X}^\top \mathbf{W} \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! ```
//!
//! The iteration converges when the log-likelihood is concave, which is
//! guaranteed with the canonical link.
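//!
//! As a check on this formula, for the [`Linear`] family with the identity link and unit
//! dispersion, $`g^{-1}`$ is the identity and $`\mathbf{S} = \mathbf{I}`$, so the step reduces
//! to the ordinary least-squares normal equations:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{X})\,(\boldsymbol{\beta} + \Delta\boldsymbol{\beta})
//! = \mathbf{X}^\top \mathbf{y}
//! ```
//!
//! and the iteration converges in a single step from any starting point.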
//!
//! ## Regularization
//!
//! Optional penalty terms discourage large parameter values:
//!
//! - **L2 (ridge):** adds $`\frac{\lambda_2}{2} \sum |\beta_k|^2`$ to the
//!   negative log-likelihood, equivalent to a Gaussian prior on the
//!   coefficients.
//! - **L1 (lasso):** adds $`\lambda_1 \sum |\beta_k|`$, implemented via ADMM,
//!   which drives small coefficients to exactly zero.
//! - **Elastic net:** combines both L1 and L2 penalties.
//!
//! In all cases the intercept ($`\beta_0`$) is excluded from the penalty.
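//!
//! For instance, an L2 penalty contributes $`-\lambda_2 \boldsymbol{\beta}`$ to the gradient and
//! $`-\lambda_2 \mathbf{I}`$ to the Hessian of the objective (with the intercept entry zeroed),
//! so each IRLS step with the canonical link becomes:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{S} \mathbf{X} + \lambda_2 \mathbf{I})\, \Delta\boldsymbol{\beta}
//! = \mathbf{X}^\top \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! - \lambda_2 \boldsymbol{\beta}
//! ```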
//!
//! For a more complete mathematical reference, see the
//! [derivation notes](https://felix-clark.github.io/src/tex/glm-math/main.pdf).

mod data;
pub mod error;
mod fit;
mod glm;
mod irls;
pub mod link;
mod math;
pub mod model;
pub mod num;
mod regularization;
mod response;

// Import some common names into the top-level namespace
pub use {
    data::Dataset,
    fit::{Fit, options::FitOptions},
    irls::IrlsStep,
    model::ModelBuilder,
    response::exponential::link as exp_link,
    response::gamma::link as gamma_link,
    response::inverse_gaussian::link as inv_gauss_link,
    response::logistic::link as logistic_link,
    response::{
        binomial::Binomial, exponential::Exponential, gamma::Gamma, inverse_gaussian::InvGaussian,
        linear::Linear, logistic::Logistic, poisson::Poisson,
    },
};

// re-export common structs from ndarray
pub use ndarray::{Array1, Array2, ArrayView1, ArrayView2, array};