Module statistics

Functions and tools for evaluating polynomial fits and scoring models

This module provides functions and types to evaluate how well a polynomial model fits a dataset, and to score models when automatically selecting polynomial degrees.

§Model Fit / Regression Diagnostics

  • r_squared: Proportion of variance explained by the model. Higher is better (0 to 1).
  • adjusted_r_squared: R² adjusted for number of predictors. Use to compare models of different degrees.
  • residual_variance: Unbiased estimate of variance of errors after fitting. Used for confidence intervals.
  • residual_normality: Likelihood that the residuals are normally distributed. Results near 0 or 1 indicate non-normality; results in between are consistent with, but do not guarantee, normality.
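The first two diagnostics reduce to short formulas. The sketch below is a hand-rolled illustration of those formulas, not the crate's `r_squared` or `adjusted_r_squared` (whose signatures take iterators); the helper names are hypothetical:

```rust
/// Illustrative R² = 1 − SS_res / SS_tot (not the crate's `r_squared`).
fn r_squared_sketch(y: &[f64], y_fit: &[f64]) -> f64 {
    let mean = y.iter().sum::<f64>() / y.len() as f64;
    let ss_tot: f64 = y.iter().map(|v| (v - mean).powi(2)).sum();
    let ss_res: f64 = y.iter().zip(y_fit).map(|(v, f)| (v - f).powi(2)).sum();
    1.0 - ss_res / ss_tot
}

/// Adjusted R² penalizes extra predictors:
/// 1 − (1 − R²)·(n − 1)/(n − p − 1), so adding a predictor only helps
/// the score if it improves the fit enough to offset the penalty.
fn adjusted_r_squared_sketch(r2: f64, n: usize, predictors: usize) -> f64 {
    1.0 - (1.0 - r2) * (n as f64 - 1.0) / (n as f64 - predictors as f64 - 1.0)
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let y_fit = [1.1, 1.9, 3.05];
    let r2 = r_squared_sketch(&y, &y_fit);
    println!("R² ≈ {r2:.4}");
    println!("adjusted R² ≈ {:.4}", adjusted_r_squared_sketch(r2, y.len(), 1));
}
```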

§Confidence Intervals

  • ConfidenceBand: Represents a confidence interval with lower and upper bounds, determined by a given probability.
  • Confidence: Enum for common confidence levels (68%, 95%, 99%).
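For a normally distributed estimate, a band at one of these levels is just the central value plus or minus a z-score times the standard deviation. The struct and function below are hypothetical stand-ins sketching that relationship, not the module's `ConfidenceBand` or `Confidence` types:

```rust
/// Hypothetical stand-in for `ConfidenceBand`: a central estimate
/// with lower and upper bounds at value ± z·σ.
struct BandSketch {
    lower: f64,
    value: f64,
    upper: f64,
}

/// Approximate two-sided z-scores for the standard confidence levels.
fn z_score(level_pct: u8) -> f64 {
    match level_pct {
        68 => 1.0,
        95 => 1.96,
        99 => 2.576,
        _ => panic!("unsupported confidence level"),
    }
}

fn band(value: f64, stddev: f64, level_pct: u8) -> BandSketch {
    let half_width = z_score(level_pct) * stddev;
    BandSketch { lower: value - half_width, value, upper: value + half_width }
}

fn main() {
    let b = band(10.0, 0.5, 95);
    println!("95% band: [{:.2}, {:.2}] around {:.2}", b.lower, b.upper, b.value);
}
```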

§Model Selection

  • DegreeBound: Enum to specify constraints on polynomial degree when automatically selecting it.

§Error Metrics

  • mean_absolute_error: Average absolute difference between observed and predicted values. Lower is better.
  • mean_squared_error: Average squared difference between observed and predicted values. Lower is better.
  • root_mean_squared_error: Square root of MSE, giving error in same units as observed values. Lower is better.
  • huber_log_likelihood: Robust error metric less sensitive to outliers. Higher is better.
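The relationships among these metrics are easy to see in code. Below is a hand-rolled sketch of MAE, MSE, RMSE, and the single-residual Huber loss; signatures differ from the crate's functions, and only the Huber constant 1.345 (matching `huber_const`) is taken from this page:

```rust
/// Mean absolute error: average |observed − predicted|.
fn mae(y: &[f64], y_fit: &[f64]) -> f64 {
    y.iter().zip(y_fit).map(|(a, b)| (a - b).abs()).sum::<f64>() / y.len() as f64
}

/// Mean squared error: average (observed − predicted)².
fn mse(y: &[f64], y_fit: &[f64]) -> f64 {
    y.iter().zip(y_fit).map(|(a, b)| (a - b).powi(2)).sum::<f64>() / y.len() as f64
}

/// RMSE is just the square root of MSE, restoring the units of y.
fn rmse(y: &[f64], y_fit: &[f64]) -> f64 {
    mse(y, y_fit).sqrt()
}

/// Huber loss for one residual: quadratic inside ±k, linear outside.
/// The linear tails are what make it less sensitive to outliers.
fn huber(residual: f64, k: f64) -> f64 {
    let a = residual.abs();
    if a <= k { 0.5 * residual * residual } else { k * (a - 0.5 * k) }
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let y_fit = [1.1, 1.9, 3.05];
    println!("MAE  = {:.4}", mae(&y, &y_fit));
    println!("RMSE = {:.4}", rmse(&y, &y_fit));
    // A large residual contributes linearly, not quadratically:
    println!("Huber(3.0) = {:.4}", huber(3.0, 1.345));
}
```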

§Descriptive Statistics

  • mean, median: Central tendency of a sequence of values.
  • stddev_and_mean: Standard deviation and mean of a sequence of values.
  • median_absolute_deviation, median_squared_deviation: Robust measures of dispersion.
  • skewness_and_kurtosis: Skewness and excess kurtosis of a dataset.
  • spread: Range (spread) of a dataset.

§Model Fit vs Model Selection

  • Model Fit: How well does the model explain the data? Use r_squared or residual_variance.

    • r_squared returns a value between 0 and 1:
    • 0 = model explains none of the variance.
    • 1 = model perfectly fits the data.
  • Model Selection: Choosing the best polynomial degree to avoid overfitting.

    • Use crate::score.
    • Options:
      • AIC: Akaike Information Criterion, more lenient penalty for complexity.
      • BIC: Bayesian Information Criterion, stricter penalty for complexity.
    • Lower scores are better, but a score is not a measure of goodness-of-fit outside the context of model selection.
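For intuition, the textbook least-squares forms of these criteria are AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), where k is the number of fitted parameters. The sketch below uses those forms; the crate's `Aic`/`Bic` score providers may use a different (but order-equivalent) scaling, and the data is made up:

```rust
/// Residual sum of squares.
fn rss(y: &[f64], y_fit: &[f64]) -> f64 {
    y.iter().zip(y_fit).map(|(a, b)| (a - b).powi(2)).sum()
}

/// Textbook least-squares AIC: n·ln(RSS/n) + 2k.
fn aic(y: &[f64], y_fit: &[f64], k: f64) -> f64 {
    let n = y.len() as f64;
    n * (rss(y, y_fit) / n).ln() + 2.0 * k
}

/// Textbook least-squares BIC: n·ln(RSS/n) + k·ln(n).
fn bic(y: &[f64], y_fit: &[f64], k: f64) -> f64 {
    let n = y.len() as f64;
    n * (rss(y, y_fit) / n).ln() + k * n.ln()
}

fn main() {
    // Five points roughly on a line; y_fit from a hypothetical degree-1 fit.
    let y = [1.0, 2.1, 2.9, 4.2, 5.0];
    let y_fit = [1.04, 2.04, 3.04, 4.04, 5.04];
    // AIC charges 2 per parameter; BIC charges ln(n), which exceeds 2 once
    // n > e² ≈ 7.4, making BIC the stricter criterion on larger datasets.
    println!("AIC = {:.3}", aic(&y, &y_fit, 2.0));
    println!("BIC = {:.3}", bic(&y, &y_fit, 2.0));
}
```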

§Examples

use polyfit::statistics::r_squared;
use polyfit::score::{Aic, ModelScoreProvider};

let y = vec![1.0, 2.0, 3.0];
let y_fit = vec![1.1, 1.9, 3.05];

// Goodness-of-fit
let r2 = r_squared(y.iter().copied(), y_fit.iter().copied());
println!("R² = {r2}");

// Model scoring
let score = Aic.score(y.into_iter(), y_fit.into_iter(), 3.0);
println!("AIC score = {score}");

Structs§

ConfidenceBand
Represents a predicted range for model outputs at a given confidence level. The band contains the central estimate (value) and the upper and lower bounds.
DerivationError
Error information when a derivative check fails. See is_derivative
DomainNormalizer
Normalizes values from one range to another.
UncertainValue
A value with an associated amount of uncertainty, represented by a mean and a standard deviation.

Enums§

Confidence
Standard Z-score confidence levels for fitted models.
CvStrategy
Strategy for selecting the number of folds (k) in k-fold cross-validation.
DegreeBound
Limits the maximum degree considered when searching for the best-fitting polynomial degree. The choice of degree bound can significantly impact the model's performance and its ability to generalize.
Tolerance
Specifies a tolerance level for numerical comparisons.

Functions§

adjusted_r_squared
Computes the adjusted R-squared value.
bayes_factor
Computes the Bayes factor between two polynomial models.
cross_validation_split
Splits the data into k folds for cross-validation based on the specified strategy.
folded_rmse
Computes the Root Mean Square Error (RMSE) for the given data and model predictions, by splitting the data into folds.
huber_const
Returns the standard Huber constant (1.345).
huber_log_likelihood
Computes the log-likelihood of the Huber loss for a set of data points.
huber_loss
Computes the Huber loss for a single residual.
is_derivative
Checks if f_prime is the derivative of polynomial f.
mean
Computes the arithmetic mean of a sequence of values.
mean_absolute_error
Computes the mean absolute error (MAE) between two sets of values.
mean_squared_error
Computes the mean squared error (MSE) between two sets of values.
median
Computes the median of a sequence of values.
median_absolute_deviation
Computes the median absolute deviation (MAD) between two sets of values.
median_squared_deviation
Computes the median squared deviation (MSD) between two sets of values.
r_squared
Calculate the R-squared value for a set of data.
residual_normality
Returns a score measuring if the residuals can be normally distributed.
residual_variance
Computes the residual variance of a model’s predictions.
robust_r_squared
Uses Huber loss to compute a robust R-squared value.
root_mean_squared_error
Computes the root mean squared error (RMSE) between two sets of values.
skewness_and_kurtosis
Computes the skewness and excess kurtosis of a dataset.
spread
Computes the range (spread) of a dataset.
stddev_and_mean
Computes the standard deviation and mean of a sequence of values.