aprender 0.29.3

# Bayesian Linear Regression

Bayesian Linear Regression extends ordinary least squares (OLS) regression by treating coefficients as random variables with a prior distribution, enabling uncertainty quantification and natural regularization.

## Theory

### Model

$$
y = X\beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)
$$

Where:
- $y \in \mathbb{R}^n$: target vector
- $X \in \mathbb{R}^{n \times p}$: feature matrix
- $\beta \in \mathbb{R}^p$: coefficient vector
- $\sigma^2$: noise variance

### Conjugate Prior (Normal-Inverse-Gamma)

$$
\begin{aligned}
\beta &\sim \mathcal{N}(\beta_0, \Sigma_0) \\
\sigma^2 &\sim \text{Inv-Gamma}(\alpha, \beta)
\end{aligned}
$$

### Analytical Posterior

With conjugate priors, the posterior has a closed form:

$$
\begin{aligned}
\beta | y, X &\sim \mathcal{N}(\beta_n, \Sigma_n) \\
\text{where:} \\
\Sigma_n &= (\Sigma_0^{-1} + \sigma^{-2} X^T X)^{-1} \\
\beta_n &= \Sigma_n (\Sigma_0^{-1} \beta_0 + \sigma^{-2} X^T y)
\end{aligned}
$$

### Key Properties

1. **Posterior mean**: $\beta_n$ balances prior belief ($\beta_0$) and data evidence ($X^T y$)
2. **Posterior covariance**: $\Sigma_n$ quantifies uncertainty
3. **Weak prior**: As $\Sigma_0 \to \infty$, $\beta_n \to (X^T X)^{-1} X^T y$ (OLS)
4. **Strong prior**: As $\Sigma_0 \to 0$, $\beta_n \to \beta_0$ (ignore data)

## Example: Univariate Regression with Weak Prior

```rust,ignore
use aprender::bayesian::BayesianLinearRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Training data: y ≈ 2x + noise
    let x = Matrix::from_vec(10, 1, vec![
        1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0
    ]).unwrap();
    let y = Vector::from_vec(vec![
        2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.1, 18.2, 20.0
    ]);

    // Create model with weak prior
    let mut model = BayesianLinearRegression::new(1);

    // Fit: compute analytical posterior
    model.fit(&x, &y).unwrap();

    // Posterior estimates
    let beta = model.posterior_mean().unwrap();
    let sigma2 = model.noise_variance().unwrap();

    println!("β (slope): {:.4}", beta[0]);          // ≈ 2.0094
    println!("σ² (noise): {:.4}", sigma2);           // ≈ 0.0251

    // Make predictions
    let x_test = Matrix::from_vec(3, 1, vec![11.0, 12.0, 13.0]).unwrap();
    let predictions = model.predict(&x_test).unwrap();

    println!("Prediction at x=11: {:.2}", predictions[0]);  // ≈ 22.10
    println!("Prediction at x=12: {:.2}", predictions[1]);  // ≈ 24.11
    println!("Prediction at x=13: {:.2}", predictions[2]);  // ≈ 26.12
}
```

**Output:**
```text
β (slope): 2.0094
σ² (noise): 0.0251
Prediction at x=11: 22.10
Prediction at x=12: 24.11
Prediction at x=13: 26.12
```

With a weak prior, the posterior mean is nearly identical to the OLS estimate.

## Example: Informative Prior (Ridge-like Regularization)

```rust,ignore
use aprender::bayesian::BayesianLinearRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Small dataset (prone to overfitting)
    let x = Matrix::from_vec(5, 1, vec![1.0, 2.0, 3.0, 4.0, 5.0]).unwrap();
    let y = Vector::from_vec(vec![2.5, 4.1, 5.8, 8.2, 9.9]);

    // Weak prior model
    let mut weak_model = BayesianLinearRegression::new(1);
    weak_model.fit(&x, &y).unwrap();

    // Informative prior: β ~ N(1.5, 1.0)
    let mut strong_model = BayesianLinearRegression::with_prior(
        1,
        vec![1.5],  // Prior mean: expect slope around 1.5
        1.0,        // Prior precision (variance = 1.0)
        3.0,        // Noise shape
        2.0,        // Noise scale
    ).unwrap();
    strong_model.fit(&x, &y).unwrap();

    let beta_weak = weak_model.posterior_mean().unwrap();
    let beta_strong = strong_model.posterior_mean().unwrap();

    println!("Weak prior:       β = {:.4}", beta_weak[0]);
    println!("Informative prior: β = {:.4}", beta_strong[0]);
}
```

**Output:**
```text
Weak prior:       β = 2.0073
Informative prior: β = 2.0065
```

The informative prior shrinks the coefficient toward the prior mean (1.5), acting as **L2 regularization** (ridge regression).

## Example: Multivariate Regression

```rust,ignore
use aprender::bayesian::BayesianLinearRegression;
use aprender::primitives::{Matrix, Vector};

fn main() {
    // Two features: y ≈ 2x₁ + 3x₂ + noise
    let x = Matrix::from_vec(8, 2, vec![
        1.0, 1.0,  // row 0
        2.0, 1.0,  // row 1
        3.0, 2.0,  // row 2
        4.0, 2.0,  // row 3
        5.0, 3.0,  // row 4
        6.0, 3.0,  // row 5
        7.0, 4.0,  // row 6
        8.0, 4.0,  // row 7
    ]).unwrap();

    let y = Vector::from_vec(vec![
        5.1, 7.2, 11.9, 14.1, 19.2, 21.0, 25.8, 27.9
    ]);

    // Fit multivariate model
    let mut model = BayesianLinearRegression::new(2);
    model.fit(&x, &y).unwrap();

    let beta = model.posterior_mean().unwrap();
    let sigma2 = model.noise_variance().unwrap();

    println!("β₁: {:.4}", beta[0]);    // ≈ 1.9785
    println!("β₂: {:.4}", beta[1]);    // ≈ 3.0343
    println!("σ²: {:.4}", sigma2);     // ≈ 0.0262

    // Predictions
    let x_test = Matrix::from_vec(3, 2, vec![
        9.0, 5.0,   // Expected: 2*9 + 3*5 = 33
        10.0, 5.0,  // Expected: 2*10 + 3*5 = 35
        10.0, 6.0,  // Expected: 2*10 + 3*6 = 38
    ]).unwrap();

    let predictions = model.predict(&x_test).unwrap();
    for i in 0..3 {
        println!("Prediction {}: {:.2}", i, predictions[i]);
    }
}
```

**Output:**
```text
β₁: 1.9785
β₂: 3.0343
σ²: 0.0262
Prediction 0: 32.98
Prediction 1: 34.96
Prediction 2: 37.99
```

## Comparison: Bayesian vs. OLS

| Aspect | Bayesian Linear Regression | OLS Regression |
|--------|----------------------------|----------------|
| **Output** | Posterior distribution over β | Point estimate β̂ |
| **Uncertainty** | Full posterior covariance Σₙ | Standard errors (requires additional computation) |
| **Regularization** | Natural via prior (e.g., ridge) | Requires explicit penalty term |
| **Interpretation** | Probability statements: P(β ∈ [a, b] \| data) | Frequentist confidence intervals |
| **Computation** | Analytical (conjugate case) | Analytical (normal equations) |
| **Small Data** | Regularizes via prior | May overfit |

## Implementation Details

### Simplified Approach (Aprender v0.6)

Aprender uses a **simplified diagonal prior**:
- $\Sigma_0 = \frac{1}{\lambda} I$ (scalar precision $\lambda$)
- Reduces computational cost from $O(p^3)$ to $O(p)$ for prior
- Still requires $O(p^3)$ for $(X^T X)^{-1}$ via Cholesky decomposition

### Algorithm

1. **Compute sufficient statistics**: $X^T X$ (Gram matrix), $X^T y$
2. **Estimate noise variance**: $\hat{\sigma}^2 = \frac{1}{n-p} ||y - X\beta_{OLS}||^2$
3. **Compute posterior precision**: $\Sigma_n^{-1} = \lambda I + \frac{1}{\hat{\sigma}^2} X^T X$
4. **Solve for posterior mean**: $\beta_n = \Sigma_n (\lambda \beta_0 + \frac{1}{\hat{\sigma}^2} X^T y)$

### Numerical Stability

- Uses **Cholesky decomposition** to solve linear systems
- Numerically stable for well-conditioned $X^T X$
- Prior precision $\lambda > 0$ ensures positive definiteness

## Bayesian Interpretation of Ridge Regression

Ridge regression minimizes:
$$
L(\beta) = ||y - X\beta||^2 + \alpha ||\beta||^2
$$

This is equivalent to **MAP estimation** with:
- Prior: $\beta \sim \mathcal{N}(0, \frac{1}{\alpha} I)$
- Likelihood: $y \sim \mathcal{N}(X\beta, \sigma^2 I)$

Bayesian regression extends this by computing the **full posterior**, not just the mode.

## When to Use

**Use Bayesian Linear Regression when:**
- You want **uncertainty quantification** (prediction intervals)
- You have **small datasets** (prior regularizes)
- You have **domain knowledge** (informative prior)
- You need **probabilistic predictions** for downstream tasks

**Use OLS when:**
- You only need **point estimates**
- You have **large datasets** (prior has little effect)
- You want **computational speed** (slightly faster than Bayesian)

## Further Reading

- Kevin Murphy, *Machine Learning: A Probabilistic Perspective*, Chapter 7
- Christopher Bishop, *Pattern Recognition and Machine Learning*, Chapter 3
- Andrew Gelman et al., *Bayesian Data Analysis*, Chapter 14

## See Also

- [Normal-Inverse-Gamma Inference](./normal-inverse-gamma-inference.md) - Conjugate prior details
- [Ridge Regression](#) - Frequentist regularization (coming soon)
- [Bayesian Model Comparison](#) - Marginal likelihood (coming soon)