anofox-regression 0.3.2

A robust statistics library for regression analysis in Rust.

This library provides sklearn-style regression estimators with full statistical inference support, including standard errors, t-statistics, p-values, confidence intervals, and prediction intervals.

Features

  • Multiple Regression Methods

    • Ordinary Least Squares (OLS)
    • Weighted Least Squares (WLS)
    • Ridge Regression (L2 regularization)
    • Elastic Net (L1 + L2 regularization)
    • Recursive Least Squares (RLS) with online learning
    • Bounded Least Squares (BLS/NNLS) with box constraints
    • Dynamic Linear Model (LmDynamic) with time-varying coefficients
    • Tweedie GLM (Gaussian, Poisson, Gamma, Inverse-Gaussian, Compound Poisson-Gamma)
    • Poisson GLM (Log, Identity, Sqrt links) with offset support
    • Negative Binomial GLM (overdispersed count data with theta estimation)
    • Binomial GLM (Logistic, Probit, Complementary log-log)
    • Augmented Linear Model (ALM) with 24+ distributions (Normal, Laplace, Student-t, Gamma, Beta, etc.)
  • Smoothing & Classification

    • LOWESS (Locally Weighted Scatterplot Smoothing)
    • AID (Automatic Identification of Demand) classifier for demand patterns
  • Loss Functions

    • MAE, MSE, RMSE, MAPE, sMAPE, MASE, pinball loss
  • Statistical Inference

    • Coefficient standard errors, t-statistics, and p-values
    • Confidence intervals for coefficients
    • Prediction intervals (R-style predict(..., interval="prediction"))
    • Confidence intervals for mean response
  • Model Diagnostics

    • R², Adjusted R², RMSE
    • F-statistic and p-value
    • AIC, AICc, BIC, Log-likelihood
    • Residual analysis (standardized, studentized)
    • GLM residuals (Pearson, deviance, working)
    • Leverage and influence measures (Cook's distance, DFFITS)
    • Variance Inflation Factor (VIF) for multicollinearity detection
  • Robust Handling

    • Automatic detection of collinear/constant columns
    • Rank-deficient matrix handling
    • Edge cases (extreme weights, near-singular matrices)
    • R-compatible NA handling (na.omit, na.exclude, na.fail, na.pass)
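The loss functions listed above follow their standard definitions. As a quick reference, here is a minimal pure-Rust sketch of MAE and pinball loss (illustrative only; the crate's own function names and signatures may differ):

```rust
/// Mean absolute error: average of |y_i - yhat_i|.
fn mae(y: &[f64], yhat: &[f64]) -> f64 {
    y.iter().zip(yhat).map(|(a, b)| (a - b).abs()).sum::<f64>() / y.len() as f64
}

/// Pinball (quantile) loss at quantile tau in (0, 1);
/// at tau = 0.5 it equals half the MAE.
fn pinball(y: &[f64], yhat: &[f64], tau: f64) -> f64 {
    y.iter()
        .zip(yhat)
        .map(|(a, b)| {
            let d = a - b;
            if d >= 0.0 { tau * d } else { (tau - 1.0) * d }
        })
        .sum::<f64>()
        / y.len() as f64
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let yhat = [1.5, 2.0, 2.0];
    println!("MAE = {}", mae(&y, &yhat));                   // 0.5
    println!("pinball(0.5) = {}", pinball(&y, &yhat, 0.5)); // 0.25
}
```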

Installation

Add to your Cargo.toml:

[dependencies]
anofox-regression = "0.3"

Quick Start

Basic OLS Regression

use anofox_regression::prelude::*;
use faer::{Mat, Col};

// Create sample data
let x = Mat::from_fn(100, 2, |i, j| (i + j) as f64 * 0.1);
let y = Col::from_fn(100, |i| 1.0 + 2.0 * i as f64 * 0.1);

// Fit OLS model
let model = OlsRegressor::builder()
    .with_intercept(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// Access results
println!("R² = {:.4}", fitted.r_squared());
println!("Coefficients: {:?}", fitted.coefficients());
println!("Intercept: {:?}", fitted.intercept());

// Make predictions
let x_new = Mat::from_fn(5, 2, |i, j| (i + j + 100) as f64 * 0.1);
let predictions = fitted.predict(&x_new);

Prediction Intervals

use anofox_regression::prelude::*;

let fitted = OlsRegressor::builder()
    .with_intercept(true)
    .build()
    .fit(&x, &y)
    .unwrap();

// 95% prediction intervals for new observations
let result = fitted.predict_with_interval(
    &x_new,
    Some(IntervalType::Prediction),
    0.95,
);
println!("Fit: {:?}", result.fit);
println!("Lower: {:?}", result.lower);
println!("Upper: {:?}", result.upper);

// 95% confidence intervals for mean response
let result = fitted.predict_with_interval(
    &x_new,
    Some(IntervalType::Confidence),
    0.95,
);

Weighted Least Squares

use anofox_regression::prelude::*;

let weights = Col::from_fn(100, |i| 1.0 / (i + 1) as f64);

let model = WlsRegressor::builder()
    .with_intercept(true)
    .weights(weights)
    .build();
let fitted = model.fit(&x, &y).unwrap();

Ridge Regression

use anofox_regression::prelude::*;

let model = RidgeRegressor::builder()
    .with_intercept(true)
    .lambda(0.1)
    .compute_inference(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// Access inference statistics
let result = fitted.result();
if let Some(se) = &result.std_errors {
    println!("Standard errors: {:?}", se);
}

Elastic Net

use anofox_regression::prelude::*;

let model = ElasticNetRegressor::builder()
    .with_intercept(true)
    .lambda(1.0)
    .alpha(0.5)  // 0 = Ridge, 1 = Lasso
    .max_iterations(1000)
    .tolerance(1e-6)
    .build();
let fitted = model.fit(&x, &y).unwrap();

println!("Non-zero coefficients: {}", fitted.n_nonzero());

Recursive Least Squares (Online Learning)

use anofox_regression::prelude::*;

let model = RlsRegressor::builder()
    .with_intercept(true)
    .forgetting_factor(0.99)  // Recent data weighted more
    .build();
let mut fitted = model.fit(&x, &y).unwrap();

// Online update with new observation
let x_new = Col::from_fn(2, |j| j as f64);
let y_new = 5.0;
let prediction = fitted.update(&x_new, y_new);

Bounded Least Squares (NNLS)

use anofox_regression::prelude::*;

// Non-negative least squares (all coefficients >= 0)
let model = BlsRegressor::nnls().build();
let fitted = model.fit(&x, &y).unwrap();

// Custom box constraints: lower <= coefficients <= upper
let model = BlsRegressor::builder()
    .lower_bounds(vec![0.0, -1.0, 0.0])
    .upper_bounds(vec![1.0, 1.0, f64::INFINITY])
    .build();
let fitted = model.fit(&x, &y).unwrap();

Tweedie GLM

use anofox_regression::prelude::*;

// Gamma regression with log link (insurance claims, positive continuous data)
let model = TweedieRegressor::gamma()
    .with_intercept(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

println!("Deviance: {}", fitted.deviance);
println!("Null deviance: {}", fitted.null_deviance);

// Poisson regression (count data)
let model = TweedieRegressor::poisson()
    .with_intercept(true)
    .build();

// Compound Poisson-Gamma (zero-inflated continuous data)
let model = TweedieRegressor::builder()
    .var_power(1.5)  // Between 1 (Poisson) and 2 (Gamma)
    .link_power(0.0) // Log link
    .with_intercept(true)
    .build();

Poisson GLM (Count Data)

use anofox_regression::prelude::*;

// Poisson regression with log link (count data)
let model = PoissonRegressor::log()
    .with_intercept(true)
    .compute_inference(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

println!("Deviance: {}", fitted.deviance);
println!("Dispersion: {}", fitted.dispersion);

// Predict counts
let counts = fitted.predict_count(&x_new);

// Poisson with identity link
let model = PoissonRegressor::identity()
    .with_intercept(true)
    .build();

// Poisson with sqrt link
let model = PoissonRegressor::sqrt()
    .with_intercept(true)
    .build();

// Rate modeling with offset (for exposure)
// y_i ~ Poisson(exposure_i * rate), log(E[y]) = log(exposure) + Xβ
let exposures = Col::from_fn(100, |i| (i + 1) as f64);
let offset = Col::from_fn(100, |i| exposures[i].ln());
let model = PoissonRegressor::log()
    .with_intercept(true)
    .offset(offset)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// Predict with new offset (exposure = 2 for all new observations)
let x_new = Mat::from_fn(5, 2, |i, j| (i + j) as f64);
let new_offset = Col::from_fn(5, |_| 2.0_f64.ln());
let rates = fitted.predict_with_offset(&x_new, &new_offset);

Negative Binomial GLM (Overdispersed Count Data)

use anofox_regression::prelude::*;

// Negative binomial with automatic theta estimation (like MASS::glm.nb)
let model = NegativeBinomialRegressor::builder()
    .with_intercept(true)
    .estimate_theta(true)  // Estimate dispersion parameter
    .build();
let fitted = model.fit(&x, &y).unwrap();

println!("Estimated theta: {}", fitted.theta);
println!("Overdispersion ratio: {}", fitted.overdispersion_ratio());
println!("Deviance: {}", fitted.deviance);

// Negative binomial with fixed theta
let model = NegativeBinomialRegressor::with_theta(2.0)
    .with_intercept(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// Rate modeling with offset
let exposures = Col::from_fn(100, |i| (i + 1) as f64);
let offset = Col::from_fn(100, |i| exposures[i].ln());
let model = NegativeBinomialRegressor::builder()
    .with_intercept(true)
    .offset(offset)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// GLM residuals
let pearson = fitted.pearson_residuals();
let deviance = fitted.deviance_residuals();

Binomial GLM (Logistic Regression)

use anofox_regression::prelude::*;

// Logistic regression (binary classification)
let model = BinomialRegressor::logistic()
    .with_intercept(true)
    .compute_inference(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// Predict probabilities
let probs = fitted.predict_probability(&x_new);

// Predict with standard errors and confidence intervals
let pred = fitted.predict_with_se(
    &x_new,
    PredictionType::Response,  // Probability scale
    Some(IntervalType::Confidence),
    0.95,
);
println!("Predicted probability: {:?}", pred.fit);
println!("Standard error: {:?}", pred.se);
println!("95% CI: [{:?}, {:?}]", pred.lower, pred.upper);

// Probit regression
let model = BinomialRegressor::probit()
    .with_intercept(true)
    .build();

// Complementary log-log regression
let model = BinomialRegressor::cloglog()
    .with_intercept(true)
    .build();

// GLM residuals
let pearson = fitted.pearson_residuals();
let deviance = fitted.deviance_residuals();
let working = fitted.working_residuals();

Augmented Linear Model (ALM)

The ALM supports 24+ distributions for maximum likelihood regression, based on the greybox R package.

use anofox_regression::prelude::*;

// Laplace (LAD) regression - robust to outliers
let model = AlmRegressor::builder()
    .distribution(AlmDistribution::Laplace)
    .with_intercept(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

println!("Log-likelihood: {}", fitted.log_likelihood);
println!("AIC: {}", fitted.result().aic);

// Student-t regression - heavy-tailed errors
let model = AlmRegressor::builder()
    .distribution(AlmDistribution::StudentT)
    .with_intercept(true)
    .build();
let fitted = model.fit(&x, &y).unwrap();

// Log-Normal for positive skewed data
let model = AlmRegressor::builder()
    .distribution(AlmDistribution::LogNormal)
    .with_intercept(true)
    .build();

// Beta regression for proportions in (0, 1)
let model = AlmRegressor::builder()
    .distribution(AlmDistribution::Beta)
    .with_intercept(true)
    .build();

// Gamma for positive continuous data
let model = AlmRegressor::builder()
    .distribution(AlmDistribution::Gamma)
    .with_intercept(true)
    .build();

// Custom link function
let model = AlmRegressor::builder()
    .distribution(AlmDistribution::Normal)
    .link(LinkFunction::Log)
    .with_intercept(true)
    .build();

NA Handling

use anofox_regression::{NaAction, NaHandler};

// Process data with NA values (represented as f64::NAN)
let result = NaHandler::process(&x, &y, NaAction::Omit).unwrap();
println!("Rows removed: {}", result.na_info.n_removed);

// na.exclude: Remove NA but expand output back to original length
let result = NaHandler::process(&x, &y, NaAction::Exclude).unwrap();
// `fitted_residuals` here stands for residuals from a model fitted on the NA-free rows
let residuals_expanded = result.na_info.expand(&fitted_residuals);

// na.fail: Error if any NA present
let result = NaHandler::process(&x, &y, NaAction::Fail);
assert!(result.is_err());

// na.pass: Keep NA values (solver must handle them)
let result = NaHandler::process(&x, &y, NaAction::Pass).unwrap();

Model Diagnostics

use anofox_regression::prelude::*;

let fitted = OlsRegressor::builder()
    .with_intercept(true)
    .build()
    .fit(&x, &y)
    .unwrap();

let result = fitted.result();

// Goodness of fit
println!("R² = {:.4}", result.r_squared);
println!("Adjusted R² = {:.4}", result.adj_r_squared);
println!("RMSE = {:.4}", result.rmse);

// F-test
println!("F-statistic = {:.4}", result.f_statistic);
println!("F p-value = {:.4}", result.f_pvalue);

// Information criteria
println!("AIC = {:.4}", result.aic);
println!("BIC = {:.4}", result.bic);

// Residual diagnostics
let std_resid = standardized_residuals(&result.residuals, result.mse);
let leverage = compute_leverage(&x);
let cooks_d = cooks_distance(&result.residuals, &leverage, result.mse, result.n_parameters);

// Detect influential points
let influential = influential_cooks(&cooks_d, result.n_observations);

// Variance Inflation Factor for multicollinearity
let vif = variance_inflation_factor(&x);

API Reference

Regression Result Fields

Field Description
coefficients Estimated regression coefficients
intercept Intercept term (if fitted)
std_errors Standard errors of coefficients
t_statistics t-statistics for coefficients
p_values Two-tailed p-values
conf_interval_lower/upper Confidence intervals for coefficients
r_squared Coefficient of determination
adj_r_squared Adjusted R²
mse Mean squared error
rmse Root mean squared error
f_statistic F-statistic for overall model
f_pvalue p-value for F-test
aic Akaike Information Criterion
aicc Corrected AIC
bic Bayesian Information Criterion
log_likelihood Log-likelihood
residuals Model residuals
fitted_values Predicted values on training data

Tweedie GLM Result Fields

Field Description
deviance Total deviance of fitted model
null_deviance Deviance of intercept-only model
dispersion Estimated dispersion parameter
iterations Number of IRLS iterations

Tweedie Family (var_power)

var_power Distribution Use Case
0 Gaussian Standard linear regression
1 Poisson Count data
1-2 Compound Poisson-Gamma Zero-inflated continuous (insurance, rainfall)
2 Gamma Positive continuous
3 Inverse-Gaussian Positive, right-skewed
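The table above is generated by a single variance function, V(μ) = μ^p (up to the dispersion φ). A one-function sketch in plain Rust (not the crate's API) makes the special cases concrete:

```rust
/// Tweedie variance function V(mu) = mu^p, with dispersion phi factored out.
/// p = 0 gives a constant (Gaussian), p = 1 gives mu (Poisson),
/// p = 2 gives mu^2 (Gamma), p = 3 gives mu^3 (Inverse-Gaussian).
fn tweedie_variance(mu: f64, var_power: f64) -> f64 {
    mu.powf(var_power)
}

fn main() {
    let mu = 4.0;
    println!("Gaussian (p=0): {}", tweedie_variance(mu, 0.0)); // 1
    println!("Poisson  (p=1): {}", tweedie_variance(mu, 1.0)); // 4
    println!("Gamma    (p=2): {}", tweedie_variance(mu, 2.0)); // 16
}
```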

Poisson GLM Result Fields

Field Description
deviance Total deviance of fitted model
null_deviance Deviance of intercept-only model
dispersion Estimated dispersion parameter
iterations Number of IRLS iterations

Poisson Link Functions

Link Function Inverse Use Case
Log ln(μ) exp(η) Canonical, most common
Identity μ η Linear relationships
Sqrt √μ η² Alternative for count data

Negative Binomial GLM Result Fields

Field Description
deviance Total deviance of fitted model
null_deviance Deviance of intercept-only model
theta Estimated or fixed dispersion parameter
dispersion Estimated dispersion parameter
iterations Number of iterations

Negative Binomial Parameters

Parameter Description
theta Dispersion parameter (size). Larger = less overdispersion.
estimate_theta If true, estimate theta via alternating ML. Default: true.

When to use Negative Binomial vs Poisson:

  • Use Poisson when Var(Y) ≈ E[Y]
  • Use Negative Binomial when Var(Y) > E[Y] (overdispersion)
  • NB variance: V(μ) = μ + μ²/θ (approaches Poisson as θ → ∞)
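A quick pre-modeling check for overdispersion is to compare the sample variance of the counts to their mean. A minimal sketch in plain Rust (not part of the crate's API):

```rust
/// Ratio of sample variance to sample mean for count data.
/// Values well above 1 suggest overdispersion, favouring the
/// negative binomial over the Poisson family.
fn dispersion_ratio(y: &[f64]) -> f64 {
    let n = y.len() as f64;
    let mean = y.iter().sum::<f64>() / n;
    let var = y.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    var / mean
}

fn main() {
    let counts = [0.0, 1.0, 0.0, 7.0, 2.0, 0.0, 9.0, 1.0];
    let ratio = dispersion_ratio(&counts);
    println!("Var/Mean = {ratio:.2}");
    if ratio > 1.5 {
        println!("overdispersed: consider the negative binomial family");
    }
}
```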

Binomial GLM Result Fields

Field Description
deviance Total deviance of fitted model
null_deviance Deviance of intercept-only model
iterations Number of IRLS iterations

Binomial Link Functions

Link Function Inverse Use Case
Logit log(p/(1-p)) 1/(1+exp(-η)) Standard logistic regression
Probit Φ⁻¹(p) Φ(η) Dose-response, bioassay
Cloglog log(-log(1-p)) 1-exp(-exp(η)) Asymmetric, extreme events

GLM Residual Types

Type Formula Use Case
Pearson (y - μ) / √V(μ) Outlier detection, overdispersion
Deviance sign(y - μ) × √d_i Model fit assessment
Working (y - μ) × (dη/dμ) IRLS algorithm diagnostics
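For the Poisson family (V(μ) = μ, canonical log link) the formulas above reduce to simple closed forms. A hypothetical sketch in plain Rust, independent of the crate's residual methods:

```rust
/// Pearson residual for the Poisson family: (y - mu) / sqrt(V(mu)) with V(mu) = mu.
fn pearson_poisson(y: f64, mu: f64) -> f64 {
    (y - mu) / mu.sqrt()
}

/// Deviance residual for the Poisson family:
/// sign(y - mu) * sqrt(2 * (y * ln(y / mu) - (y - mu))),
/// taking y * ln(y / mu) = 0 when y = 0.
fn deviance_poisson(y: f64, mu: f64) -> f64 {
    let term = if y > 0.0 { y * (y / mu).ln() } else { 0.0 };
    (y - mu).signum() * (2.0 * (term - (y - mu))).sqrt()
}

fn main() {
    println!("Pearson:  {}", pearson_poisson(3.0, 4.0)); // -0.5
    println!("Deviance: {}", deviance_poisson(3.0, 4.0));
}
```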

ALM (Augmented Linear Model) Result Fields

Field Description
log_likelihood Maximized log-likelihood
scale Estimated scale parameter
iterations Number of IRLS iterations

ALM Distributions

Category Distributions
Continuous Normal, Laplace, Student-t, Logistic, Asymmetric Laplace, Generalised Normal, S
Log-transformed Log-Normal, Log-Laplace, Log-S, Log-Generalised Normal
Positive Gamma, Inverse Gaussian, Exponential, Folded Normal
Box-Cox Box-Cox Normal
Proportions Beta, Logit-Normal
Count data Poisson, Negative Binomial, Binomial, Geometric
Ordinal Cumulative Logistic, Cumulative Normal

ALM Link Functions

Link Function Inverse Typical Use
Identity μ = η η Normal, Laplace, Student-t
Log μ = exp(η) ln(μ) Gamma, Poisson, Log-Normal
Logit μ = 1/(1+exp(-η)) ln(μ/(1-μ)) Beta, Binomial
Probit μ = Φ(η) Φ⁻¹(μ) Ordinal models
Inverse μ = 1/η 1/μ Inverse Gaussian
Sqrt μ = η² √μ Count data (alternative)
Cloglog μ = 1-exp(-exp(η)) ln(-ln(1-μ)) Asymmetric binary

Prediction Types (GLM)

Type Description
PredictionType::Response Predictions on response scale (probabilities for binomial)
PredictionType::Link Predictions on link scale (log-odds for logit)

Interval Types

  • IntervalType::Prediction - Prediction interval for new observations (wider)
  • IntervalType::Confidence - Confidence interval for mean response (narrower)
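The wider/narrower distinction is easiest to see in the intercept-only model, where the half-widths have closed forms: CI = t·s/√n and PI = t·s·√(1 + 1/n). A small plain-Rust sketch (the t quantile is passed in to stay dependency-free; this is not the crate's API):

```rust
/// Half-widths of the confidence and prediction intervals for the
/// intercept-only model, given residual sd `s`, sample size `n`,
/// and the appropriate t quantile.
fn interval_half_widths(s: f64, n: usize, t: f64) -> (f64, f64) {
    let n = n as f64;
    let ci = t * s * (1.0 / n).sqrt();       // uncertainty in the mean only
    let pi = t * s * (1.0 + 1.0 / n).sqrt(); // adds the noise of a new observation
    (ci, pi)
}

fn main() {
    let (ci, pi) = interval_half_widths(2.0, 25, 2.06);
    println!("CI half-width: {ci:.3}"); // shrinks toward 0 as n grows
    println!("PI half-width: {pi:.3}"); // stays near t * s even for large n
    assert!(pi > ci);
}
```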

Lambda Scaling

For Ridge and Elastic Net, use LambdaScaling::Glmnet to match R's glmnet package:

let model = RidgeRegressor::builder()
    .lambda(0.1)
    .lambda_scaling(LambdaScaling::Glmnet)  // lambda * n
    .build();

Examples

The examples/ directory contains runnable examples demonstrating each regression method:

Example Description
ols.rs Ordinary Least Squares regression
wls.rs Weighted Least Squares regression
ridge.rs Ridge regression (L2 regularization)
elastic_net.rs Elastic Net (L1 + L2 regularization)
rls.rs Recursive Least Squares with online learning
bls.rs Bounded Least Squares / NNLS
tweedie.rs Tweedie GLM (Gaussian, Poisson, Gamma)
poisson.rs Poisson GLM for count data
negative_binomial.rs Negative Binomial GLM for overdispersed counts
binomial.rs Binomial GLM / Logistic regression
alm.rs Augmented Linear Model (24+ distributions)
lm_dynamic.rs Dynamic linear model with time-varying coefficients
lowess.rs LOWESS smoothing
aid.rs AID (Automatic Identification of Demand) classification
collinearity_intervals.rs Handling collinearity in prediction intervals

Run an example with:

cargo run --example ols
cargo run --example poisson

Validation

This library is validated against R's statistical functions:

  • lm() for OLS
  • lm() with weights for WLS
  • glmnet::glmnet() for Ridge and Elastic Net
  • nnls::nnls() for Non-negative Least Squares
  • statmod::tweedie() for Tweedie GLM
  • glm(..., family=poisson) for Poisson GLM (log, identity, sqrt links)
  • MASS::glm.nb() for Negative Binomial GLM with theta estimation
  • glm(..., family=binomial) for Binomial GLM (logit, probit, cloglog)
  • glm(..., family=Gamma) for Gamma GLM
  • glm(..., family=inverse.gaussian) for Inverse Gaussian GLM
  • glm(..., offset=...) for rate modeling with offset terms
  • greybox::alm() for Augmented Linear Model (Normal, Laplace, Student-t, Log-Normal, Gamma, etc.)
  • residuals(..., type="pearson/deviance/working") for GLM residuals
  • predict(..., se.fit=TRUE) for GLM predictions with standard errors
  • na.omit(), na.exclude(), na.fail(), na.pass() for NA handling
  • predict(..., interval="prediction") for prediction intervals
  • cooks.distance(), hatvalues(), rstandard() for diagnostics
  • car::vif() for variance inflation factors

All tests ensure numerical agreement with R within appropriate tolerances (708 tests total).

Dependencies

  • faer - High-performance linear algebra
  • statrs - Statistical distributions

License

MIT License