# Mathematical Foundations and Implementation Details of the statsrust Library


## 1. Introduction
This document provides an in-depth exploration of the mathematical foundations and implementation details of the statsrust statistical library. Unlike the high-level overview provided in the README, this document delves into the rigorous mathematical theory, numerical considerations, and algorithmic implementations that form the core of the library's functionality.
The statsrust library is built upon three foundational pillars:
- **Descriptive Statistics**: Precise implementation of measures of central tendency, position, and variability
- **Non-parametric Methods**: Sophisticated Kernel Density Estimation (KDE) with multiple kernel functions
- **Parametric Distributions**: Complete algebraic framework for normal distribution operations
This document details the mathematical derivations, numerical stability considerations, and implementation choices that ensure the library's accuracy and reliability in statistical computations.
## 2. Kernel Function Theory and Implementation
### 2.1 Mathematical Foundation of Kernel Functions
Kernel functions are central to non-parametric density estimation. Formally, a kernel function $K(t)$ must satisfy two critical properties:
1. **Normalization**: $\int_{-\infty}^{\infty} K(t)dt = 1$
2. **Non-negativity**: $K(t) \geq 0$ for all $t \in \mathbb{R}$
These properties guarantee that the kernel represents a valid probability density function (PDF). The normalization ensures the total probability integrates to 1, while non-negativity ensures all density values are meaningful.
#### Proof of Normalization for Common Kernels
Let's verify the normalization property for the Epanechnikov kernel:
$$K(t) = \begin{cases}
\frac{3}{4}(1 - t^2) & \text{if } |t| \leq 1 \\
0 & \text{otherwise}
\end{cases}$$
Computing the integral:
$$\int_{-\infty}^{\infty} K(t)dt = \int_{-1}^{1} \frac{3}{4}(1 - t^2)dt = \frac{3}{4}\left[t - \frac{t^3}{3}\right]_{-1}^{1} = \frac{3}{4}\left[\left(1 - \frac{1}{3}\right) - \left(-1 + \frac{1}{3}\right)\right] = \frac{3}{4} \cdot \frac{4}{3} = 1$$
This rigorous verification ensures mathematical correctness in implementation.
### 2.2 Kernel Function Characteristics and Implementation
Each kernel implementation in statsrust provides four essential mathematical components:
#### 2.2.1 Kernel Function (PDF)
The probability density function is directly used in density estimation calculations. For computational efficiency, we implement these as closure functions.
*Example: Gaussian kernel PDF as a closure (a minimal sketch; the exact closure signature used in statsrust may differ)*
```rust
// Gaussian kernel: K(t) = exp(-t^2 / 2) / sqrt(2 * pi)
let gaussian_pdf = |t: f64| (-0.5 * t * t).exp() / (2.0 * std::f64::consts::PI).sqrt();
```
#### 2.2.2 Cumulative Distribution Function (CDF)
The CDF is defined as $F(t) = \int_{-\infty}^{t} K(u)du$ and is crucial for random sample generation. For kernels without closed-form CDFs, we implement numerical integration.
*Example: Epanechnikov CDF Derivation*
$$F(t) = \int_{-1}^{t} \frac{3}{4}(1 - u^2)du = \frac{3}{4}\left[u - \frac{u^3}{3}\right]_{-1}^{t} = \frac{3}{4}\left(t - \frac{t^3}{3} + \frac{2}{3}\right)$$
This is implemented as:
```rust
// Epanechnikov CDF: 0 below the support, 1 above it, closed form in between.
|t: f64| {
    if t <= -1.0 { 0.0 }
    else if t >= 1.0 { 1.0 }
    else { 0.75 * (t - t.powi(3) / 3.0 + 2.0 / 3.0) }
}
```
#### 2.2.3 Inverse CDF (Quantile Function)
For random sample generation, we need $F^{-1}(p) = \inf\{t \in \mathbb{R} \mid F(t) \geq p\}$ for $p \in (0,1)$. For complex kernels like Quartic, this requires solving polynomial equations.
*Example: Epanechnikov Inverse CDF Derivation*
Given $p = \frac{3}{4}\left(t - \frac{t^3}{3} + \frac{2}{3}\right)$, we solve for $t$:
$$t^3 - 3t + 4\left(p - \frac{1}{2}\right) = 0$$
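Since the CDF is strictly increasing on $[-1, 1]$, this cubic can be solved in closed form or numerically. As a sketch, a simple bisection on the monotone CDF is robust and sufficient (the function name and iteration count below are illustrative, not statsrust's internals):
```rust
// Bisection on the monotone CDF F(t) = 3/4 * (t - t^3/3 + 2/3) over [-1, 1].
fn epanechnikov_inv_cdf(p: f64) -> f64 {
    let cdf = |t: f64| 0.75 * (t - t.powi(3) / 3.0 + 2.0 / 3.0);
    let (mut lo, mut hi) = (-1.0_f64, 1.0_f64);
    for _ in 0..60 {                       // ~60 halvings reach f64 precision
        let mid = 0.5 * (lo + hi);
        if cdf(mid) < p { lo = mid; } else { hi = mid; }
    }
    0.5 * (lo + hi)
}
```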
#### 2.2.4 Support and Computational Efficiency
The support of a kernel, defined as $\{t \in \mathbb{R} \mid K(t) > 0\}$, is critical for computational efficiency. For bounded kernels, we implement binary search to limit calculations:
```rust
let i = sorted.partition_point(|&v| v < x - bandwidth);
let j = sorted.partition_point(|&v| v <= x + bandwidth);
```
This reduces complexity from $O(n)$ to $O(\log n + k)$ where $k$ is the number of points in the kernel's support.
### 2.3 Mathematical Properties and Error Analysis
For each kernel, we perform rigorous error analysis to ensure numerical stability:
- **Numerical Integration**: For CDF implementations without closed forms, we use adaptive quadrature with error bounds
## 3. Numerical Stability in Statistical Calculations
### 3.1 Geometric Mean Implementation
The geometric mean formula $G = \left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}}$ is susceptible to overflow/underflow with large datasets. We implement the mathematically equivalent but numerically stable form:
$$G = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln(x_i)\right)$$
**Mathematical Derivation**:
$$\ln(G) = \ln\left(\left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}}\right) = \frac{1}{n}\sum_{i=1}^{n} \ln(x_i)$$
This transformation is valid for $x_i > 0$ (a requirement of the geometric mean). The implementation includes domain validation, sketched here with a simplified error type:
```rust
// Reject non-positive values, then take exp of the mean of the logs.
// (Error type simplified for illustration; statsrust's error enum may differ.)
if data.iter().any(|&x| x <= 0.0) {
    return Err("geometric mean requires strictly positive values".into());
}
let log_mean = data.iter().map(|&x| x.ln()).sum::<f64>() / data.len() as f64;
Ok(log_mean.exp())
```
### 3.2 Variance: Two-Pass Algorithm
Variance is computed in two passes: the first pass computes the mean, and the second sums squared deviations from it, avoiding the catastrophic cancellation of the naive one-pass sum-of-squares formula. The relative error in the two-pass algorithm is bounded by $O(\epsilon + \epsilon\sqrt{n})$ where $\epsilon$ is machine epsilon, compared to $O(\epsilon n)$ for the one-pass algorithm in worst-case scenarios.
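A minimal sketch of the two-pass computation (illustrative only; statsrust's actual function signature and error handling may differ):
```rust
// Two-pass sample variance: compute the mean first, then sum squared deviations.
fn variance_two_pass(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    data.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
}
```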
### 3.3 Pearson Correlation: Numerical Considerations
The Pearson correlation formula:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Can be numerically unstable when:
1. The denominator is close to zero (near-constant inputs)
2. There is catastrophic cancellation in the numerator
**Implementation Strategy**:
1. Check for constant inputs before computation
2. Use centered data for all calculations
3. Implement a threshold for near-zero denominators to avoid division by values close to machine epsilon
**Mathematical Justification**:
The correlation coefficient can be expressed as:
$$r = \frac{s_{xy}}{\sqrt{s_{xx} \cdot s_{yy}}}$$
where $s_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})$, etc.
Our implementation computes these terms using:
```rust
let x_mean = mean(x)?;
let y_mean = mean(y)?;
let s_xy: f64 = x.iter().zip(y.iter())
.map(|(&xi, &yi)| (xi - x_mean) * (yi - y_mean))
.sum();
let s_xx: f64 = x.iter().map(|&xi| (xi - x_mean).powi(2)).sum();
let s_yy: f64 = y.iter().map(|&yi| (yi - y_mean).powi(2)).sum();
// r = s_xy / sqrt(s_xx * s_yy), guarded against a near-zero denominator
```
This approach minimizes rounding errors while maintaining computational efficiency.
## 4. Kernel Density Estimation: Advanced Implementation Details
### 4.1 Mathematical Derivation of KDE
The kernel density estimator is defined as:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
Where:
- $n$ is the sample size
- $h > 0$ is the bandwidth (smoothing parameter)
- $K$ is the kernel function
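As a sketch of the estimator above (a hypothetical free function; the library's KDE type organizes this logic differently), point evaluation is a single pass over the data:
```rust
// Point evaluation: (1 / (n * h)) * sum_i K((x - x_i) / h)
fn kde_pdf(data: &[f64], h: f64, kernel: impl Fn(f64) -> f64, x: f64) -> f64 {
    let n = data.len() as f64;
    data.iter().map(|&xi| kernel((x - xi) / h)).sum::<f64>() / (n * h)
}
```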
**Asymptotic Properties**:
- As $h \to 0$, $\hat{f}_h(x)$ converges to a sum of delta functions at the data points
- As $h \to \infty$, $\hat{f}_h(x)$ converges to a single kernel centered at the sample mean
**Optimal Bandwidth Selection**:
For the Gaussian kernel, the asymptotically optimal bandwidth is:
$$h_{\text{opt}} = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{\frac{1}{5}}$$
where $\hat{\sigma}$ is the sample standard deviation.
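A minimal sketch of this rule of thumb, assuming a helper function that is not part of statsrust's public API:
```rust
// Silverman's rule of thumb: h = (4 * sigma^5 / (3 * n))^(1/5)
fn silverman_bandwidth(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let sigma = (data.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)).sqrt();
    (4.0 * sigma.powi(5) / (3.0 * n)).powf(0.2)
}
```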
### 4.2 CDF Estimation and Implementation
The cumulative distribution function is estimated as:
$$\hat{F}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \Phi\left(\frac{x - x_i}{h}\right)$$
where $\Phi$ is the CDF of the kernel function.
**Mathematical Justification**:
This follows from the relationship between PDF and CDF:
$$\hat{F}_h(x) = \int_{-\infty}^{x} \hat{f}_h(t)dt = \frac{1}{nh} \int_{-\infty}^{x} \sum_{i=1}^{n} K\left(\frac{t - x_i}{h}\right)dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\frac{x - x_i}{h}} K(u)du = \frac{1}{n} \sum_{i=1}^{n} \Phi\left(\frac{x - x_i}{h}\right)$$
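The corresponding evaluation is again a single averaging pass; the sketch below assumes a `kernel_cdf` closure standing in for $\Phi$ (names and signature are illustrative):
```rust
// CDF estimate: the plain average of the kernel CDF evaluated at (x - x_i) / h.
fn kde_cdf(data: &[f64], h: f64, kernel_cdf: impl Fn(f64) -> f64, x: f64) -> f64 {
    data.iter().map(|&xi| kernel_cdf((x - xi) / h)).sum::<f64>() / data.len() as f64
}
```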
### 4.3 Random Sample Generation: Theoretical Foundation
The random sample generation algorithm is based on inverse transform sampling:
1. Select a random data point $X_i$ uniformly from the dataset
2. Generate $U \sim \text{Uniform}(0,1)$
3. Compute $X = X_i + h \cdot \Phi^{-1}(U)$, where $\Phi^{-1}$ is the inverse CDF of the kernel
**Theoretical Proof**:
Let $F_X(x)$ be the CDF of the generated samples. Then:
$$F_X(x) = P(X \leq x) = \frac{1}{n} \sum_{i=1}^{n} P\left(X_i + h \cdot \Phi^{-1}(U) \leq x\right) = \frac{1}{n} \sum_{i=1}^{n} P\left(\Phi^{-1}(U) \leq \frac{x - X_i}{h}\right)$$
$$= \frac{1}{n} \sum_{i=1}^{n} P\left(U \leq \Phi\left(\frac{x - X_i}{h}\right)\right) = \frac{1}{n} \sum_{i=1}^{n} \Phi\left(\frac{x - X_i}{h}\right) = \hat{F}_h(x)$$
This confirms that the generated samples follow the estimated density.
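A sketch of the sampling step, assuming the `rand` crate and an inverse-CDF closure for the kernel (names and signatures are illustrative, not statsrust's API):
```rust
use rand::Rng; // assumes the `rand` crate

// One draw from the estimated density: pick a data point uniformly,
// then add h times an inverse-transform draw from the kernel.
fn kde_sample<R: Rng>(rng: &mut R, data: &[f64], h: f64,
                      kernel_inv_cdf: impl Fn(f64) -> f64) -> f64 {
    let i = rng.gen_range(0..data.len());
    let u: f64 = rng.gen(); // U ~ Uniform(0, 1)
    data[i] + h * kernel_inv_cdf(u)
}
```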
### 4.4 Computational Optimization for Bounded Kernels
For kernels with bounded support $[-1,1]$, the KDE calculation at point $x$ only requires data points in $[x-h, x+h]$. We implement this using binary search:
```rust
let i = sorted.partition_point(|&v| v < x - bandwidth);
let j = sorted.partition_point(|&v| v <= x + bandwidth);
```
**Complexity Analysis**:
- Naive implementation: $O(n)$ per evaluation
- Optimized implementation: $O(\log n + k)$ per evaluation, where $k$ is the number of points in $[x-h, x+h]$
- For small bandwidths, $k \ll n$, leading to substantial performance improvements
## 5. Normal Distribution Model: Mathematical Rigor
### 5.1 Distribution Operations: Theoretical Foundation
The statsrust library implements algebraic operations on normal distributions with mathematical rigor.
#### 5.1.1 Scalar Operations
For a normal distribution $\mathcal{N}(\mu, \sigma)$ and scalar $c$:
- **Addition**: $\mathcal{N}(\mu, \sigma) + c = \mathcal{N}(\mu + c, \sigma)$
- **Subtraction**: $\mathcal{N}(\mu, \sigma) - c = \mathcal{N}(\mu - c, \sigma)$
- **Multiplication**: $\mathcal{N}(\mu, \sigma) \times c = \mathcal{N}(\mu \times c, \sigma \times |c|)$
- **Division**: $\mathcal{N}(\mu, \sigma) \div c = \mathcal{N}(\mu \div c, \sigma \div |c|)$
**Mathematical Proof for Multiplication**:
If $X \sim \mathcal{N}(\mu, \sigma)$, then $Y = cX$ has:
- $E[Y] = cE[X] = c\mu$
- $\text{Var}(Y) = c^2\text{Var}(X) = c^2\sigma^2$
- Thus, $Y \sim \mathcal{N}(c\mu, |c|\sigma)$
The absolute value in the standard deviation ensures it remains non-negative.
#### 5.1.2 Distribution Operations (Independent Distributions)
For independent normal distributions $\mathcal{N}(\mu_1, \sigma_1)$ and $\mathcal{N}(\mu_2, \sigma_2)$:
- **Addition**: $\mathcal{N}(\mu_1, \sigma_1) + \mathcal{N}(\mu_2, \sigma_2) = \mathcal{N}(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})$
- **Subtraction**: $\mathcal{N}(\mu_1, \sigma_1) - \mathcal{N}(\mu_2, \sigma_2) = \mathcal{N}(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})$
**Mathematical Proof**:
The sum of independent normal random variables is normal with:
- Mean: $E[X+Y] = E[X] + E[Y] = \mu_1 + \mu_2$
- Variance: $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) = \sigma_1^2 + \sigma_2^2$ (by independence)
Thus, $X+Y \sim \mathcal{N}(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})$.
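As a sketch of how such an operation can be expressed in Rust (an illustrative type, not necessarily statsrust's actual struct or operator implementations):
```rust
use std::ops::Add;

// Illustrative type; statsrust's actual struct and field names may differ.
#[derive(Clone, Copy, Debug)]
struct NormalDist { mu: f64, sigma: f64 }

// Sum of independent normals: means add, variances add.
impl Add for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: NormalDist) -> NormalDist {
        NormalDist {
            mu: self.mu + rhs.mu,
            sigma: (self.sigma * self.sigma + rhs.sigma * rhs.sigma).sqrt(),
        }
    }
}
```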
### 5.2 Distribution Overlap: Precise Calculation
The overlap between two normal distributions is defined as the area under the overlapping region of their PDFs.
**Mathematical Derivation**:
Given two normal distributions with PDFs $f_1(x)$ and $f_2(x)$, the overlap is:
$$\text{overlap} = \int_{-\infty}^{\infty} \min(f_1(x), f_2(x))dx$$
To compute this:
1. Find intersection points by solving $f_1(x) = f_2(x)$
2. This leads to a quadratic equation: $ax^2 + bx + c = 0$
3. The solutions (if they exist) are $x_1, x_2 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$
For normal distributions $\mathcal{N}(\mu_1, \sigma_1)$ and $\mathcal{N}(\mu_2, \sigma_2)$:
$$a = \frac{1}{2\sigma_1^2} - \frac{1}{2\sigma_2^2}$$
$$b = \frac{\mu_2}{\sigma_2^2} - \frac{\mu_1}{\sigma_1^2}$$
$$c = \frac{\mu_1^2}{2\sigma_1^2} - \frac{\mu_2^2}{2\sigma_2^2} - \ln\left(\frac{\sigma_2}{\sigma_1}\right)$$
The overlap area can then be computed as:
$$\text{overlap} = 1.0 - |F_1(x_1) - F_2(x_1)| - |F_1(x_2) - F_2(x_2)|$$
where $F_1$ and $F_2$ are the CDFs of the respective distributions.
**Implementation Considerations**:
- Handle cases with 0, 1, or 2 intersection points
- Use precise CDF calculations to avoid numerical errors
- Ensure the result is within [0.0, 1.0]
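A sketch of the intersection-point computation using the coefficients above (a hypothetical helper; the equal-variance case reduces to a single linear intersection and must be handled separately):
```rust
// Intersection points of the two PDFs via the quadratic a*x^2 + b*x + c = 0.
fn pdf_intersections(mu1: f64, s1: f64, mu2: f64, s2: f64) -> Option<(f64, f64)> {
    let a = 1.0 / (2.0 * s1 * s1) - 1.0 / (2.0 * s2 * s2);
    let b = mu2 / (s2 * s2) - mu1 / (s1 * s1);
    let c = mu1 * mu1 / (2.0 * s1 * s1) - mu2 * mu2 / (2.0 * s2 * s2) - (s2 / s1).ln();
    if a == 0.0 { return None; }               // equal variances: single intersection
    let disc = b * b - 4.0 * a * c;
    if disc < 0.0 { return None; }
    let r = disc.sqrt();
    Some(((-b - r) / (2.0 * a), (-b + r) / (2.0 * a)))
}
```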
## 6. Quantile Calculation: Mathematical Analysis
### 6.1 Inclusive vs. Exclusive Methods
The statsrust library implements two primary methods for quantile calculation with distinct mathematical foundations.
#### 6.1.1 Inclusive Method
**Assumption**: The data includes the boundaries of the population.
**Mathematical Formulation** (with $N$ data points, $n$ requested intervals, and cut index $i = 1, \dots, n-1$):
$$j = \left\lfloor i \cdot \frac{N-1}{n} \right\rfloor$$
$$\delta = i \cdot (N-1) \mod n$$
$$\text{quantile} = \frac{\text{values}[j] \cdot (n-\delta) + \text{values}[j+1] \cdot \delta}{n}$$
**Theoretical Justification**:
This method corresponds to the definition where the $p$-quantile is the value below which a proportion $p$ of observations fall. It ensures the minimum and maximum values are included as possible quantiles.
**Statistical Properties**:
- Quantiles always within the range of the data
- Appropriate when data represents the entire population
- Matches Microsoft Excel's PERCENTILE.INC function
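A sketch of the inclusive computation following the formulation above (illustrative signature; statsrust's validation and error handling may differ):
```rust
// Inclusive quantile cut points for sorted data (N points, n intervals).
fn quantiles_inclusive(sorted: &[f64], n: usize) -> Vec<f64> {
    let big_n = sorted.len();
    (1..n).map(|i| {
        let j = i * (big_n - 1) / n;          // floor(i * (N - 1) / n)
        let delta = i * (big_n - 1) % n;      // i * (N - 1) mod n
        (sorted[j] * (n - delta) as f64 + sorted[j + 1] * delta as f64) / n as f64
    }).collect()
}
```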
#### 6.1.2 Exclusive Method
**Assumption**: The data does not include the boundaries of the population.
**Mathematical Formulation** (same notation as above):
$$j = \left\lfloor i \cdot \frac{N+1}{n} \right\rfloor$$
$$\delta = i \cdot (N+1) - j \cdot n$$
$$\text{quantile} = \frac{\text{values}[j-1] \cdot (n-\delta) + \text{values}[j] \cdot \delta}{n}$$
**Theoretical Justification**:
This method assumes the data is a sample from a larger population and provides unbiased estimates of population quantiles.
**Statistical Properties**:
- Quantiles may be outside the range of the data
- More appropriate for estimating population quantiles
- Matches Microsoft Excel's PERCENTILE.EXC function
### 6.2 Error Analysis for Quantile Methods
**Bias Analysis**:
- Inclusive method: Biased toward the sample when estimating population quantiles
- Exclusive method: Less biased for population quantile estimation
**Variance Comparison**:
For small samples, the inclusive method generally has lower variance but higher bias, while the exclusive method has higher variance but lower bias.
**Optimal Method Selection**:
The choice between methods depends on:
- Whether the data represents the entire population or a sample
- The specific quantile being estimated
- The underlying distribution of the data
## 7. Advanced Numerical Methods and Error Analysis
### 7.1 Inverse CDF Approximation for Complex Kernels
For kernels like Quartic where the inverse CDF has no closed-form solution, we implement high-precision approximations.
**Quartic Kernel CDF**:
$$\Phi(t) = \int_{-1}^{t} \frac{15}{16}(1 - u^2)^2 du = \frac{15}{16}\left[u - \frac{2u^3}{3} + \frac{u^5}{5}\right]_{-1}^{t} = \frac{15}{16}\left(t - \frac{2t^3}{3} + \frac{t^5}{5} + \frac{8}{15}\right)$$
**Inverse CDF Challenge**:
Solving $\Phi(t) = p$ requires finding the root of:
$$t^5 - \frac{10}{3}t^3 + 5t - \frac{16p}{3} + \frac{8}{3} = 0$$
**Implementation Strategy**:
1. Use polynomial approximation for $p \in [0.01, 0.99]$
2. Apply Newton-Raphson method for refinement
3. Use analytical solutions for edge cases ($p < 0.01$ or $p > 0.99$)
**Error Bounds**:
Our implementation ensures the approximation error is less than $10^{-10}$ across the entire domain.
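A sketch of the Newton-Raphson refinement step (starting point, iteration count, and guards are illustrative choices rather than statsrust's exact constants):
```rust
// Newton-Raphson for the quartic inverse CDF: Phi(t) - p has derivative K(t).
fn quartic_inv_cdf(p: f64) -> f64 {
    let cdf = |t: f64| 15.0 / 16.0 * (t - 2.0 * t.powi(3) / 3.0 + t.powi(5) / 5.0 + 8.0 / 15.0);
    let pdf = |t: f64| 15.0 / 16.0 * (1.0 - t * t).powi(2);
    let mut t = 2.0 * p - 1.0;                       // crude initial guess inside [-1, 1]
    for _ in 0..50 {
        t = (t - (cdf(t) - p) / pdf(t).max(1e-12))   // guard against a zero derivative
            .clamp(-1.0, 1.0);                       // stay within the kernel's support
    }
    t
}
```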
### 7.2 Comprehensive Error Analysis Framework
The statsrust library implements a rigorous error analysis framework for all statistical operations.
#### 7.2.1 Forward Error Analysis
For each operation, we derive bounds on the output error given input errors. For a function $f$ with Lipschitz constant $L$, an input perturbation $\Delta x$ bounds the output error by:
$$|f(x + \Delta x) - f(x)| \leq L \cdot |\Delta x|$$
*Example: Variance Calculation*
The condition number for variance calculation is:
$$\kappa = \frac{\sqrt{\sum(x_i - \bar{x})^4}}{\sum(x_i - \bar{x})^2}$$
This quantifies how sensitive the variance is to input perturbations.
#### 7.2.2 Backward Error Analysis
We ensure that computed results correspond to exact solutions of slightly perturbed problems:
$$\hat{f}(x) = f(x + \Delta x)$$
#### 7.2.3 Error Propagation in Composite Operations
For operations involving multiple steps (e.g., correlation calculation), we track error propagation:
$$\sigma_{\text{result}}^2 = \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 \sigma_{x_i}^2$$
This guides our implementation choices to minimize overall error.
## 8. Mathematical Validation and Testing Framework
### 8.1 Statistical Property Verification
For each statistical function, we verify key mathematical properties:
*Example: Variance Verification*
1. **Non-negativity**: $\text{variance}(X) \geq 0$ for all $X$
2. **Translation Invariance**: $\text{variance}(X + c) = \text{variance}(X)$
3. **Scale Property**: $\text{variance}(cX) = c^2\text{variance}(X)$
4. **Additivity for Independent Variables**: $\text{variance}(X+Y) = \text{variance}(X) + \text{variance}(Y)$ when $X \perp Y$
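The translation and scale properties can be checked directly with a property-style test; the sketch below assumes a `variance` function returning `Result<f64, _>` (adjust to the actual API):
```rust
// Property checks: variance(X + c) == variance(X) and variance(cX) == c^2 * variance(X).
#[test]
fn variance_translation_and_scale_properties() {
    let x = vec![2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    let shifted: Vec<f64> = x.iter().map(|v| v + 100.0).collect();
    let scaled: Vec<f64> = x.iter().map(|v| 3.0 * v).collect();
    let v = variance(&x).unwrap();
    assert!((variance(&shifted).unwrap() - v).abs() < 1e-9);      // translation invariance
    assert!((variance(&scaled).unwrap() - 9.0 * v).abs() < 1e-9); // scale property
}
```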
### 8.2 Numerical Stability Testing
We implement rigorous numerical stability tests:
1. **Pathological Cases**:
- Values with large magnitude but small variance
- Sequences with values close to machine epsilon
- Alternating positive/negative values causing cancellation
2. **Precision Comparison**:
- Compare against high-precision reference implementations
- Measure relative error across different input scales
- Verify stability properties theoretically and empirically
### 8.3 Statistical Correctness Verification
For distribution-related functions, we verify:
1. **PDF Normalization**: $\int_{-\infty}^{\infty} f(x)dx = 1$ within numerical tolerance
2. **CDF Properties**: Monotonicity, limits at $\pm\infty$, and continuity
3. **Quantile-CDF Relationship**: $F(Q(p)) = p$ for all $p \in (0,1)$
4. **Distribution Operations**: Verify algebraic properties (e.g., normal + normal = normal)
## 9. Conclusion: Mathematical Integrity in Practice
The statsrust library exemplifies how rigorous mathematical theory translates into practical, reliable statistical software. Through careful attention to:
- **Mathematical correctness**: Ensuring all implementations adhere to statistical theory
- **Numerical stability**: Implementing algorithms that minimize floating-point errors
- **Error analysis**: Quantifying and bounding potential inaccuracies
- **Theoretical validation**: Verifying properties through both proof and testing
the library achieves a balance between theoretical purity and practical utility. This mathematical foundation enables statsrust to deliver accurate statistical computations across diverse applications, from data science to scientific computing.
This document is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Original author: statsrust Authors