Module statistics

Functions and tools for evaluating polynomial fits and scoring models

This module provides functions and types to evaluate how well a polynomial model fits a dataset, and to score models when automatically selecting polynomial degrees.

§Model Fit / Regression Diagnostics

  • r_squared: Proportion of variance explained by the model. Higher is better (0 to 1).
  • adjusted_r_squared: R² adjusted for number of predictors. Use to compare models of different degrees.
  • residual_variance: Unbiased estimate of variance of errors after fitting. Used for confidence intervals.
  • residual_normality: Likelihood that the residuals are normally distributed. Results near 0 or 1 indicate non-normality; results in between are consistent with, but do not guarantee, normality.
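The first two diagnostics reduce to short formulas. The sketch below is a hand-rolled illustration of those formulas, not the crate's `r_squared` or `adjusted_r_squared` (whose signatures take iterators); the helper names are hypothetical:

```rust
/// Illustrative R² = 1 − SS_res / SS_tot (not the crate's `r_squared`).
fn r_squared_sketch(y: &[f64], y_fit: &[f64]) -> f64 {
    let mean = y.iter().sum::<f64>() / y.len() as f64;
    let ss_tot: f64 = y.iter().map(|v| (v - mean).powi(2)).sum();
    let ss_res: f64 = y.iter().zip(y_fit).map(|(v, f)| (v - f).powi(2)).sum();
    1.0 - ss_res / ss_tot
}

/// Adjusted R² penalizes extra predictors:
/// 1 − (1 − R²)·(n − 1)/(n − p − 1), so adding a predictor only helps
/// the score if it improves the fit enough to offset the penalty.
fn adjusted_r_squared_sketch(r2: f64, n: usize, predictors: usize) -> f64 {
    1.0 - (1.0 - r2) * (n as f64 - 1.0) / (n as f64 - predictors as f64 - 1.0)
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let y_fit = [1.1, 1.9, 3.05];
    let r2 = r_squared_sketch(&y, &y_fit);
    println!("R² ≈ {r2:.4}");
    println!("adjusted R² ≈ {:.4}", adjusted_r_squared_sketch(r2, y.len(), 1));
}
```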

§Confidence Intervals

  • ConfidenceBand: Represents a confidence interval with lower and upper bounds, determined by a given probability.
  • Confidence: Enum for common confidence levels (68%, 95%, 99%).
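For a normally distributed estimate, a band at one of these levels is just the central value plus or minus a z-score times the standard deviation. The struct and function below are hypothetical stand-ins sketching that relationship, not the module's `ConfidenceBand` or `Confidence` types:

```rust
/// Hypothetical stand-in for `ConfidenceBand`: a central estimate
/// with lower and upper bounds at value ± z·σ.
struct BandSketch {
    lower: f64,
    value: f64,
    upper: f64,
}

/// Approximate two-sided z-scores for the standard confidence levels.
fn z_score(level_pct: u8) -> f64 {
    match level_pct {
        68 => 1.0,
        95 => 1.96,
        99 => 2.576,
        _ => panic!("unsupported confidence level"),
    }
}

fn band(value: f64, stddev: f64, level_pct: u8) -> BandSketch {
    let half_width = z_score(level_pct) * stddev;
    BandSketch { lower: value - half_width, value, upper: value + half_width }
}

fn main() {
    let b = band(10.0, 0.5, 95);
    println!("95% band: [{:.2}, {:.2}] around {:.2}", b.lower, b.upper, b.value);
}
```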

§Model Selection

  • DegreeBound: Enum to specify constraints on polynomial degree when automatically selecting it.

§Error Metrics

  • mean_absolute_error: Average absolute difference between observed and predicted values. Lower is better.
  • mean_squared_error: Average squared difference between observed and predicted values. Lower is better.
  • root_mean_squared_error: Square root of MSE, giving error in same units as observed values. Lower is better.
  • huber_log_likelihood: Robust error metric less sensitive to outliers. Higher is better.
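The relationships among these metrics are easy to see in code. Below is a hand-rolled sketch of MAE, MSE, RMSE, and the single-residual Huber loss; signatures differ from the crate's functions, and only the Huber constant 1.345 (matching `huber_const`) is taken from this page:

```rust
/// Mean absolute error: average |observed − predicted|.
fn mae(y: &[f64], y_fit: &[f64]) -> f64 {
    y.iter().zip(y_fit).map(|(a, b)| (a - b).abs()).sum::<f64>() / y.len() as f64
}

/// Mean squared error: average (observed − predicted)².
fn mse(y: &[f64], y_fit: &[f64]) -> f64 {
    y.iter().zip(y_fit).map(|(a, b)| (a - b).powi(2)).sum::<f64>() / y.len() as f64
}

/// RMSE is just the square root of MSE, restoring the units of y.
fn rmse(y: &[f64], y_fit: &[f64]) -> f64 {
    mse(y, y_fit).sqrt()
}

/// Huber loss for one residual: quadratic inside ±k, linear outside.
/// The linear tails are what make it less sensitive to outliers.
fn huber(residual: f64, k: f64) -> f64 {
    let a = residual.abs();
    if a <= k { 0.5 * residual * residual } else { k * (a - 0.5 * k) }
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let y_fit = [1.1, 1.9, 3.05];
    println!("MAE  = {:.4}", mae(&y, &y_fit));
    println!("RMSE = {:.4}", rmse(&y, &y_fit));
    // A large residual contributes linearly, not quadratically:
    println!("Huber(3.0) = {:.4}", huber(3.0, 1.345));
}
```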

§Descriptive Statistics

  • mean, median: Central tendency of a sequence of values.
  • stddev_and_mean: Standard deviation and mean of a sequence of values.
  • median_absolute_deviation, median_squared_deviation: Robust measures of dispersion.
  • skewness_and_kurtosis: Skewness and excess kurtosis of a dataset.
  • spread: Range (spread) of a dataset.

§Model Fit vs Model Selection

  • Model Fit: How well does the model explain the data? Use r_squared or residual_variance.

    • r_squared returns a value between 0 and 1:
    • 0 = model explains none of the variance.
    • 1 = model perfectly fits the data.
  • Model Selection: Choosing the best polynomial degree to avoid overfitting.

    • Use crate::score.
    • Options:
      • AIC: Akaike Information Criterion, more lenient penalty for complexity.
      • BIC: Bayesian Information Criterion, stricter penalty for complexity.
    • Lower scores are better, but a score is not a measure of goodness-of-fit outside the context of model selection.
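For intuition, the textbook least-squares forms of these criteria are AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), where k is the number of fitted parameters. The sketch below uses those forms; the crate's `Aic`/`Bic` score providers may use a different (but order-equivalent) scaling, and the data is made up:

```rust
/// Residual sum of squares.
fn rss(y: &[f64], y_fit: &[f64]) -> f64 {
    y.iter().zip(y_fit).map(|(a, b)| (a - b).powi(2)).sum()
}

/// Textbook least-squares AIC: n·ln(RSS/n) + 2k.
fn aic(y: &[f64], y_fit: &[f64], k: f64) -> f64 {
    let n = y.len() as f64;
    n * (rss(y, y_fit) / n).ln() + 2.0 * k
}

/// Textbook least-squares BIC: n·ln(RSS/n) + k·ln(n).
fn bic(y: &[f64], y_fit: &[f64], k: f64) -> f64 {
    let n = y.len() as f64;
    n * (rss(y, y_fit) / n).ln() + k * n.ln()
}

fn main() {
    // Five points roughly on a line; y_fit from a hypothetical degree-1 fit.
    let y = [1.0, 2.1, 2.9, 4.2, 5.0];
    let y_fit = [1.04, 2.04, 3.04, 4.04, 5.04];
    // AIC charges 2 per parameter; BIC charges ln(n), which exceeds 2 once
    // n > e² ≈ 7.4, making BIC the stricter criterion on larger datasets.
    println!("AIC = {:.3}", aic(&y, &y_fit, 2.0));
    println!("BIC = {:.3}", bic(&y, &y_fit, 2.0));
}
```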

§Examples

use polyfit::statistics::r_squared;
use polyfit::score::{Aic, ModelScoreProvider};

let y = vec![1.0, 2.0, 3.0];
let y_fit = vec![1.1, 1.9, 3.05];

// Goodness-of-fit
let r2 = r_squared(y.iter().copied(), y_fit.iter().copied());
println!("R² = {r2}");

// Model scoring
let score = Aic.score(y.into_iter(), y_fit.into_iter(), 3.0);
println!("AIC score = {score}");

Structs§

ConfidenceBand
Represents a predicted range for model outputs at a given confidence level. The band contains the central estimate (value) and the upper and lower bounds.
DerivationError
Error information when a derivative check fails. See is_derivative
DomainNormalizer
Normalizes values from one range to another.
UncertainValue
A value with an associated amount of uncertainty, represented by a mean and a standard deviation.

Enums§

Confidence
Standard Z-score confidence levels for fitted models.
CvStrategy
Strategy for selecting the number of folds (k) in k-fold cross-validation.
DegreeBound
Limits the maximum degree considered when searching for the best-fitting polynomial degree. The choice of degree bound can significantly impact the model's performance and its ability to generalize.
Tolerance
Specifies a tolerance level for numerical comparisons.

Functions§

adjusted_r_squared
Computes the adjusted R-squared value.
bayes_factor
Computes the Bayes factor between two polynomial models.
cross_validation_split
Splits the data into k folds for cross-validation based on the specified strategy.
folded_rmse
Computes the Root Mean Square Error (RMSE) for the given data and model predictions, by splitting the data into folds.
huber_const
Returns the standard Huber constant (1.345).
huber_log_likelihood
Computes the log-likelihood of the Huber loss for a set of data points.
huber_loss
Computes the Huber loss for a single residual.
is_derivative
Checks if f_prime is the derivative of polynomial f.
mean
Computes the arithmetic mean of a sequence of values.
mean_absolute_error
Computes the mean absolute error (MAE) between two sets of values.
mean_squared_error
Computes the mean squared error (MSE) between two sets of values.
median
Computes the median of a sequence of values.
median_absolute_deviation
Computes the median absolute deviation (MAD) between two sets of values.
median_squared_deviation
Computes the median squared deviation (MSD) between two sets of values.
r_squared
Calculate the R-squared value for a set of data.
residual_normality
Returns a score measuring if the residuals can be normally distributed.
residual_variance
Computes the residual variance of a model’s predictions.
robust_r_squared
Uses Huber loss to compute a robust R-squared value.
root_mean_squared_error
Computes the root mean squared error (RMSE) between two sets of values.
skewness_and_kurtosis
Computes the skewness and excess kurtosis of a dataset.
spread
Computes the range (spread) of a dataset.
stddev_and_mean
Computes the standard deviation and mean of a sequence of values.