ndarray_glm/lib.rs
//! A Rust library for performing GLM regression with data represented in
//! [`ndarray`](https://docs.rs/ndarray/)s.
//! The [`ndarray-linalg`](https://docs.rs/ndarray-linalg/) crate is used to allow
//! optimization of linear algebra operations with BLAS.
//!
//! This crate is in beta and the interface may change significantly. The tests include several
//! comparisons with R's `glmnet` package, but some cases may not be covered directly or may
//! involve inherent ambiguities or imprecisions.
//!
//! # Feature summary
//!
//! * Linear, logistic, Poisson, and binomial regression (more to come)
//! * Generic over floating-point type
//! * L1 (lasso), L2 (ridge), and elastic net regularization
//! * Statistical tests of fit result
//! * Alternative and custom link functions
//!
//! # Setting up a BLAS backend
//!
//! See the [backend features of
//! `ndarray-linalg`](https://github.com/rust-ndarray/ndarray-linalg#backend-features)
//! for a description of the available BLAS configurations. You do not need to
//! include `ndarray-linalg` in your crate; simply provide the feature you need to
//! `ndarray-glm` and it will be forwarded to `ndarray-linalg`.
//!
//! Examples using OpenBLAS are shown here. In principle you should also be able to use
//! Netlib or Intel MKL, although those backends are untested.
//!
//! ## System OpenBLAS (recommended)
//!
//! Ensure that the development OpenBLAS library is installed on your system. On
//! Debian/Ubuntu, for instance, this means installing `libopenblas-dev`. Then, put the
//! following into your crate's `Cargo.toml`:
//! ```text
//! ndarray = { version = "0.17", features = ["blas"] }
//! ndarray-glm = { version = "0.0.14", features = ["openblas-system"] }
//! ```
//!
//! ## Compile OpenBLAS from source
//!
//! This option does not require OpenBLAS to be installed on your system, but the
//! initial compile time will be very long. Put the following lines in your crate's
//! `Cargo.toml`:
//! ```text
//! ndarray = { version = "0.17", features = ["blas"] }
//! ndarray-glm = { version = "0.0.14", features = ["openblas-static"] }
//! ```
//!
//! # Examples
//!
//! Basic linear regression:
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit().unwrap();
//! // The result is a flat array with the first term as the intercept.
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! Data standardization and L2 regularization:
//! ```
//! use ndarray_glm::{array, Linear, ModelBuilder, utility::standardize};
//!
//! let data_y = array![0.3, 1.3, 0.7];
//! let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
//! // The design matrix can optionally be standardized, subtracting the mean of each
//! // independent variable and dividing by its standard deviation.
//! let data_x = standardize(data_x);
//! let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build().unwrap();
//! // L2 (ridge) regularization can be applied with l2_reg().
//! let fit = model.fit_options().l2_reg(1e-5).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
//!
//! Logistic regression with a non-canonical link function. The fit options may need adjusting,
//! since fits with non-canonical links typically have more difficulty converging:
//! ```
//! use ndarray_glm::{array, Logistic, logistic_link::Cloglog, ModelBuilder};
//!
//! let data_y = array![true, false, false, true, true];
//! let data_x = array![[0.5, 0.2], [0.1, 0.3], [0.2, 0.6], [0.6, 0.3], [0.4, 0.4]];
//! let model = ModelBuilder::<Logistic<Cloglog>>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit_options().max_iter(64).l2_reg(1e-3).fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```
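//!
//! Poisson regression for count data. This is a minimal sketch: it assumes the
//! response is given as unsigned integer counts, analogous to the `bool` responses
//! used for logistic regression above.
//! ```
//! use ndarray_glm::{array, ModelBuilder, Poisson};
//!
//! let data_y = array![1u32, 0, 3, 2, 1];
//! let data_x = array![[0.1], [0.4], [0.8], [0.5], [0.3]];
//! let model = ModelBuilder::<Poisson>::data(&data_y, &data_x).build().unwrap();
//! let fit = model.fit().unwrap();
//! println!("Fit result: {}", fit.result);
//! ```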
//!
//! # Generalized linear models
//!
//! A generalized linear model (GLM) describes the expected value of a response
//! variable $`y`$ through a *link function* $`g`$ applied to a linear combination of
//! $`K-1`$ feature variables $`x_k`$ (plus an intercept term):
//!
//! ```math
//! g(\text{E}[y]) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{K-1} x_{K-1}
//! ```
//!
//! The right-hand side is the *linear predictor* $`\omega = \mathbf{x}^\top \boldsymbol{\beta}`$.
//! With $`N`$ observations the data matrix $`\mathbf{X}`$ has a leading column of
//! ones (for the intercept) and the model relates the response vector
//! $`\mathbf{y}`$ to $`\mathbf{X}`$ and parameters $`\boldsymbol{\beta}`$, as written out below.
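//!
//! Stacking the observations, each row of $`\mathbf{X}`$ holds one observation's
//! features prefixed by a one:
//!
//! ```math
//! g(\text{E}[\mathbf{y}]) = \mathbf{X}\boldsymbol{\beta}, \qquad
//! \mathbf{X} = \begin{pmatrix}
//! 1 & x_1^{(1)} & \cdots & x_{K-1}^{(1)} \\
//! \vdots & \vdots & & \vdots \\
//! 1 & x_1^{(N)} & \cdots & x_{K-1}^{(N)}
//! \end{pmatrix}
//! ```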
//!
//! ## Exponential family
//!
//! GLMs assume the response follows a distribution from the exponential family.
//! In canonical form the density is
//!
//! ```math
//! f(y;\eta) = \exp\!\bigl[\eta\, y - A(\eta) + B(y)\bigr]
//! ```
//!
//! where $`\eta`$ is the *natural parameter*, $`A(\eta)`$ is the *log-partition
//! function*, and $`B(y)`$ depends only on the observation.
//! The expected value and variance of $`y`$ follow directly from $`A`$:
//!
//! ```math
//! \text{E}[y] = A'(\eta), \qquad \text{Var}[y] = \phi\, A''(\eta)
//! ```
//!
//! where $`\phi`$ is the *dispersion parameter* (fixed to 1 for the logistic and
//! Poisson families, estimated for the linear/Gaussian family).
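//!
//! As a concrete check of these definitions, the Bernoulli distribution can be put
//! in canonical form by rewriting its density:
//!
//! ```math
//! f(y;p) = p^y (1-p)^{1-y}
//! = \exp\!\Bigl[y \log\tfrac{p}{1-p} + \log(1-p)\Bigr]
//! ```
//!
//! so $`\eta = \log\frac{p}{1-p}`$, $`B(y) = 0`$, and $`A(\eta) = \log(1 + e^\eta)`$.
//! Differentiating recovers $`\text{E}[y] = A'(\eta) = p`$ and
//! $`\text{Var}[y] = A''(\eta) = p(1-p)`$.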
//!
//! ## Link and variance functions
//!
//! The *link function* $`g`$ maps the expected response $`\mu = \text{E}[y]`$ to
//! the linear predictor:
//!
//! ```math
//! g(\mu) = \omega, \qquad \mu = g^{-1}(\omega)
//! ```
//!
//! The *canonical link* is the one for which $`\eta = \omega`$, i.e. the natural
//! parameter equals the linear predictor. Non-canonical links are also supported
//! (see [`link`]).
//!
//! The *variance function* $`V(\mu)`$ characterizes how the variance of $`y`$
//! depends on the mean, independent of the choice of link:
//!
//! ```math
//! \text{Var}[y] = \phi\, V(\mu)
//! ```
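//!
//! Since $`\mu = A'(\eta)`$, the variance function is determined by the
//! log-partition function alone:
//!
//! ```math
//! V(\mu) = A''\bigl((A')^{-1}(\mu)\bigr)
//! ```
//!
//! For the Bernoulli example above this gives $`V(\mu) = \mu(1-\mu)`$, the logistic
//! entry in the table below.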
//!
//! ## Supported families
//!
//! | Family | Canonical link | $`V(\mu)`$ | $`\phi`$ |
//! |--------|----------------|------------|----------|
//! | [`Linear`] (Gaussian) | Identity | $`1`$ | estimated |
//! | [`Logistic`] (Bernoulli) | Logit | $`\mu(1-\mu)`$ | $`1`$ |
//! | [`Poisson`] | Log | $`\mu`$ | $`1`$ |
//! | [`Binomial`] (fixed $`n`$) | Logit | $`\mu(n-\mu)/n`$ | $`1`$ |
//!
//! ## Fitting via IRLS
//!
//! Parameters are estimated by maximum likelihood using iteratively reweighted
//! least squares (IRLS). Each step solves for the update
//! $`\Delta\boldsymbol{\beta}`$:
//!
//! ```math
//! -\mathbf{H}(\boldsymbol{\beta}) \cdot \Delta\boldsymbol{\beta} = \mathbf{J}(\boldsymbol{\beta})
//! ```
//!
//! where $`\mathbf{J}`$ and $`\mathbf{H}`$ are the gradient and Hessian of the
//! log-likelihood. With the canonical link and a diagonal variance matrix
//! $`\mathbf{S}`$ ($`S_{ii} = \text{Var}[y^{(i)} | \eta]`$) this simplifies to:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta}
//! = \mathbf{X}^\top \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! ```
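//!
//! The simplification holds because, with the canonical link, the gradient and
//! Hessian of the log-likelihood reduce to
//!
//! ```math
//! \mathbf{J} = \mathbf{X}^\top \bigl[\mathbf{y} - \boldsymbol{\mu}\bigr], \qquad
//! \mathbf{H} = -\mathbf{X}^\top \mathbf{S}\, \mathbf{X}
//! ```
//!
//! where $`\boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\beta})`$.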
//!
//! Observation weights $`\mathbf{W}`$ (inverse dispersion per observation)
//! generalize this to:
//!
//! ```math
//! (\mathbf{X}^\top \mathbf{W} \mathbf{S} \mathbf{X})\, \Delta\boldsymbol{\beta}
//! = \mathbf{X}^\top \mathbf{W} \bigl[\mathbf{y} - g^{-1}(\mathbf{X}\boldsymbol{\beta})\bigr]
//! ```
//!
//! The iteration converges when the log-likelihood is concave, which is
//! guaranteed with the canonical link.
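//!
//! To make one update concrete, here is a minimal, self-contained sketch of a
//! single IRLS step for the logistic (canonical logit) case with one feature plus
//! an intercept, solving the 2x2 system by Cramer's rule. It is for illustration
//! only and is not the solver this crate uses internally.
//!
//! ```
//! use ndarray::Axis;
//! use ndarray_glm::{array, Array1, Array2};
//!
//! // One IRLS/Newton step for the logit link:
//! // (X^T S X) dbeta = X^T (y - mu), mu = sigmoid(X beta), S_ii = mu_i (1 - mu_i).
//! fn irls_step(x: &Array2<f64>, y: &Array1<f64>, beta: &Array1<f64>) -> Array1<f64> {
//!     let mu = x.dot(beta).mapv(|e| 1.0 / (1.0 + (-e).exp())); // inverse logit
//!     let s = mu.mapv(|m| m * (1.0 - m)); // diagonal of the variance matrix S
//!     let sx = x * &s.insert_axis(Axis(1)); // scale row i of X by S_ii
//!     let h = x.t().dot(&sx); // X^T S X (the negated Hessian)
//!     let j = x.t().dot(&(y - &mu)); // gradient X^T (y - mu)
//!     // Solve the 2x2 system h * db = j by Cramer's rule.
//!     let det = h[[0, 0]] * h[[1, 1]] - h[[0, 1]] * h[[1, 0]];
//!     let db = array![
//!         (j[0] * h[[1, 1]] - j[1] * h[[0, 1]]) / det,
//!         (h[[0, 0]] * j[1] - h[[1, 0]] * j[0]) / det,
//!     ];
//!     beta + &db
//! }
//!
//! // The leading column of ones provides the intercept term.
//! let x = array![[1.0, 0.5], [1.0, -0.2], [1.0, 0.3], [1.0, 0.6], [1.0, -0.4]];
//! let y = array![1.0, 0.0, 1.0, 0.0, 1.0];
//! let mut beta = array![0.0, 0.0];
//! for _ in 0..8 {
//!     beta = irls_step(&x, &y, &beta);
//! }
//! println!("estimate after 8 IRLS steps: {}", beta);
//! ```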
//!
//! ## Regularization
//!
//! Optional penalty terms discourage large parameter values:
//!
//! - **L2 (ridge):** adds $`\frac{\lambda_2}{2} \sum |\beta_k|^2`$ to the
//! negative log-likelihood, equivalent to a Gaussian prior on the
//! coefficients.
//! - **L1 (lasso):** adds $`\lambda_1 \sum |\beta_k|`$, implemented via ADMM,
//! which drives small coefficients to exactly zero.
//! - **Elastic net:** combines both L1 and L2 penalties, as in the objective below.
//!
//! In all cases the intercept ($`\beta_0`$) is excluded from the penalty.
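//!
//! Combining all terms, the fit minimizes the penalized objective
//!
//! ```math
//! -\ell(\boldsymbol{\beta})
//! + \lambda_1 \sum_{k \ge 1} |\beta_k|
//! + \frac{\lambda_2}{2} \sum_{k \ge 1} \beta_k^2
//! ```
//!
//! where $`\ell`$ is the log-likelihood; setting $`\lambda_1`$ or $`\lambda_2`$ to
//! zero recovers a pure ridge or lasso penalty, respectively.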
//!
//! For a more complete mathematical reference, see the
//! [derivation notes](https://felix-clark.github.io/src/tex/glm-math/main.pdf).

#![doc(html_root_url = "https://docs.rs/crate/ndarray-glm")]
pub mod error;
mod fit;
mod glm;
mod irls;
pub mod link;
mod math;
pub mod model;
pub mod num;
mod regularization;
mod response;
pub mod utility;

// Import some common names into the top-level namespace
pub use {
    fit::Fit,
    model::ModelBuilder,
    response::logistic::link as logistic_link,
    response::{binomial::Binomial, linear::Linear, logistic::Logistic, poisson::Poisson},
};

// Re-export common structs from ndarray.
pub use ndarray::{Array1, Array2, ArrayView1, ArrayView2, array};