ndarray-glm 0.1.0

Performs regression for generalized linear models using IRLS on data stored in arrays.

Coverage
80.95%
68 out of 84 items documented2 out of 31 items with examples
Size
Source code size: 609.92 kB This is the summed size of all the files inside the crates.io package for this release.
Documentation size: 20.05 MB This is the summed size of all files generated by rustdoc for all configured targets
Ø build duration
this release: 1m 19s Average build duration of successful builds.
all releases: 52s Average build duration of successful builds in releases after 2024-10-23.
Links
felix-clark/ndarray-glm
34 0 9
crates.io
Dependencies
Versions
Owners

ndarray-glm

Rust library for solving linear, logistic, and other generalized linear models (GLMs), using the ndarray-linalg module.

Downloads

Status

Numerical accuracy cannot be fully guaranteed, but the tests do include several checks against R's glm and glmnet packages. Some edge cases may be excluded and others may involve inherent ambiguities or imprecisions.

The regression algorithm uses iteratively re-weighted least squares (IRLS) with a line-search procedure applied when the next iteration of guesses does not increase the likelihood.

Suggestions (via issues) and pull requests are welcome.

Prerequisites

The recommended approach is to use a system BLAS implementation. For instance, to install OpenBLAS on Debian/Ubuntu:

sudo apt update && sudo apt install -y libopenblas-dev

or on Arch:

sudo pacman -Syu blas-openblas

(or perhaps just openblas, which is a dependency of blas-openblas). Regardless of the installation method, these libraries permit use of this crate with the openblas-system feature.

To use an alternative backend or to build a static BLAS implementation, refer to the ndarray-linalg documentation. Use this crate with the appropriate feature flag and it will be forwarded to ndarray-linalg.

Example

To use in your crate, add the following to the Cargo.toml:

ndarray = { version = "0.17", features = ["blas"]}
ndarray-glm = { version = "0.1", features = ["openblas-system"] }

An example for linear regression is shown below. The library is generic over floating point type (f32 or f64).

use ndarray_glm::{array, Linear, ModelBuilder};

// define some test data
let data_y = array![0.3, 1.3, 0.7];
let data_x = array![[0.1, 0.2], [-0.4, 0.1], [0.2, 0.4]];
let model = ModelBuilder::<Linear>::data(&data_y, &data_x).build()?;

let fit = model.fit()?;
// Or instead, to e.g. apply L2 (ridge) regularization:
let fit = model.fit_options().l2_reg(1e-5).fit()?;

// The result is a simple array of the MLE estimators, including the intercept
// term in the 0th index.
println!("Fit result: {}", fit.result);

By default, the X data is standardized (mean-subtracted and scaled by the std dev) for internal calculations, but the regression results are transformed back to the external scale for the user. This reduces the risk of scale-dependent numerical issues, and puts all features on the same footing with regards to any regularization. This can be disabled with no_standardize() in the ModelBuilder but is designed to be hands-off for the user, so it's recommended to keep it in most cases.

Some common non-canonical link functions are available (see e.g. exp_link and logistic_link), and additional custom ones can be defined by the user. The link module documentation covers how to select a provided link, how to implement a custom link (canonical or non-canonical), and how to use the TestLink trait to run built-in consistency tests against a custom implementation.

Features

Exponential family distributions
- Linear
- Logistic
- Poisson
- Binomial
- Exponential
- Gamma
- Inverse Gaussian
Linear offsets
Generic over floating point type
Regularization
- L2 (ridge)
- L1 (lasso) via ADMM
- Elastic Net (L1 + L2)
Automatic internal data standardization (can be disabled)
Weighted regressions (frequency and/or variance weights)
Non-canonical link functions
Fit statistics (see documentation on Fit struct)
- Residuals
  - Response, Pearson, deviance
  - Standardized and Studentized variants
  - Working/partial
  - Quantile (with stats feature)
- Coefficient covariance matrix
- Dispersion (for families with free dispersion)
- Leave-one-out (LOO) computations
  - One-step approximations
  - Exact re-fits
  - Residuals
  - Coefficients
- Information criteria
- Model significance
  - Wald test
  - Score test
  - Likelihood ratio test
- P-values for model and covariates (with the stats feature)
- ... and others

Troubleshooting

Lasso/L1 regularization can converge slowly in some cases, particularly when the data is poorly-behaved, seperable, etc.

The following tips are recommended things to try if facing convergence issues generally, but are more likely to be necessary in a L1 regularization problem.

Standardize the feature data
Use f32 instead of f64
Increase the tolerance and/or the maximum number of iterations
Include a small L2 regularization as well.

If you encounter problems that persist even after these techniques are applied, please file an issue so the algorithm can be improved.

Rendering equations in docs

To render the docs from source with the equations properly rendered, the KaTeX header must be included explicitly.

RUSTDOCFLAGS="--html-in-header katex-header.html" cargo doc --no-deps --open

References

notes on generalized linear models
Generalized Linear Models and Extensions by Hardin & Hilbe