# Mathematical Foundations of Machine Learning
---
## 1. Linear Algebra Essentials
### 1.1 Vector Operations
The dot product of two vectors is defined as:
$$
\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i
$$
The cross product in three dimensions:
$$
\vec{a} \times \vec{b} = \begin{vmatrix} \hat{i} & \hat{j} & \hat{k} \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}
$$
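The two definitions above can be sketched in plain Python. The helper names `dot` and `cross` and the sample vectors are illustrative choices, not part of the original text; the cross product is written out as the cofactor expansion of the determinant.

```python
def dot(a, b):
    """Dot product: sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(ai * bi for ai, bi in zip(a, b))


def cross(a, b):
    """Cross product of two 3-vectors, from expanding the determinant."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    return [a2 * b3 - a3 * b2,
            a3 * b1 - a1 * b3,
            a1 * b2 - a2 * b1]


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = cross(a, b)
print(dot(a, b))             # 32.0
print(c)                     # [-3.0, 6.0, -3.0]
print(dot(a, c), dot(b, c))  # 0.0 0.0
```

The last line illustrates that the cross product is orthogonal to both of its inputs.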
### 1.2 Matrix Decomposition
The Singular Value Decomposition (SVD) factorizes a matrix as:
$$
A = U \Sigma V^T
$$
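where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix whose non-negative diagonal entries are the singular values of $A$.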
The eigenvalue equation for a square matrix $A$, with eigenvalue $\lambda$ and eigenvector $\vec{v} \neq \vec{0}$:
$$
A \vec{v} = \lambda \vec{v}
$$
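One way to see the eigenvalue equation in action is power iteration, a standard method brought in here for illustration rather than taken from the text above. The sketch below assumes a small symmetric matrix with a unique dominant eigenvalue; `mat_vec`, `power_iteration`, and the sample matrix are hypothetical names and values.

```python
import math


def mat_vec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]


def power_iteration(A, iterations=100):
    """Estimate the dominant eigenvalue and a unit eigenvector of A."""
    v = [1.0] * len(A)
    for _ in range(iterations):
        w = mat_vec(A, v)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    # Rayleigh quotient v^T A v, valid because v has unit length.
    lam = sum(vi * wi for vi, wi in zip(v, mat_vec(A, v)))
    return lam, v


A = [[2.0, 1.0], [1.0, 3.0]]   # eigenvalues are (5 +/- sqrt(5)) / 2
lam, v = power_iteration(A)
# Residual of the eigenvalue equation: A v - lambda v should be close to zero.
residual = [avi - lam * vi for avi, vi in zip(mat_vec(A, v), v)]
print(lam)        # approximately 3.618
print(residual)   # entries close to 0
```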
---
## 2. Calculus and Optimization
### 2.1 Gradient Descent
The gradient descent update rule:
$$
\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta_t)
$$
where $\alpha$ is the learning rate and $J(\theta)$ is the cost function.
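As a worked example under an assumed cost, take the one-dimensional quadratic $J(\theta) = \tfrac{1}{2}\theta^2$ (chosen only for illustration), so that $\nabla_\theta J(\theta) = \theta$. The update rule then reduces to:
$$
\theta_{t+1} = \theta_t - \alpha \theta_t = (1 - \alpha)\,\theta_t \quad\Longrightarrow\quad \theta_t = (1 - \alpha)^t\,\theta_0
$$
For this cost the iterates converge to the minimizer $\theta = 0$ exactly when $|1 - \alpha| < 1$, that is $0 < \alpha < 2$. With $\alpha = 0.1$, each step shrinks $\theta$ by a factor of $0.9$.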
### 2.2 Chain Rule for Backpropagation
$$
\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}
$$
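The three factors of the chain rule can be checked numerically on a single weight. This is a minimal sketch under assumptions not stated in the text: one input with $z = w x + b$, a sigmoid activation $a$, and a squared-error loss $L = \tfrac{1}{2}(a - y)^2$. All function names and sample values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, b, y):
    """Assumed loss: L = 0.5 * (a - y)^2 with a = sigmoid(w * x + b)."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2


def grad_w(w, x, b, y):
    """dL/dw as the product dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dL_da = a - y            # derivative of the squared-error loss
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    dz_dw = x                # derivative of z = w * x + b with respect to w
    return dL_da * da_dz * dz_dw


w, x, b, y = 0.5, 2.0, -0.3, 1.0
analytic = grad_w(w, x, b, y)
eps = 1e-6
# Central finite difference as an independent check of the analytic gradient.
numeric = (loss(w + eps, x, b, y) - loss(w - eps, x, b, y)) / (2 * eps)
print(analytic, numeric)   # the two values should agree to many digits
```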
### 2.3 Integration
The expected value of a continuous random variable:
$$
E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx
$$
The Gaussian integral:
$$
\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}
$$
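Both integrals can be approximated numerically. The sketch below uses a midpoint rule on a truncated interval, which is an assumption made for illustration; the exponential density with rate 2 is likewise a hypothetical choice for the expected-value example.

```python
import math


def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


# Gaussian integral: the tails beyond |x| = 10 are negligible.
gauss = midpoint_integral(lambda x: math.exp(-x * x), -10.0, 10.0)
print(gauss, math.sqrt(math.pi))   # both approximately 1.7724539

# E[X] for the density f(x) = rate * exp(-rate * x) on x >= 0; the exact mean is 1 / rate.
rate = 2.0
mean = midpoint_integral(lambda x: x * rate * math.exp(-rate * x), 0.0, 60.0)
print(mean)                        # approximately 0.5
```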
---
## 3. Probability and Statistics
### 3.1 Bayes' Theorem
$$
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
$$
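Bayes' theorem gives the posterior $P(A \mid B)$ from the likelihood $P(B \mid A)$, the prior $P(A)$, and the evidence $P(B)$. A minimal sketch with hypothetical numbers: a condition with 1% prevalence, and a test with 95% sensitivity and a 5% false-positive rate. The function and parameter names are illustrative.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) via Bayes' theorem, with P(B) from the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))   # 0.161
```

Even with an accurate test, the low prior keeps the posterior at roughly 16%.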
### 3.2 Normal Distribution
The probability density function of the normal distribution:
$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$
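The density translates directly into code. The sketch below compares a hand-written version against `statistics.NormalDist` from the Python standard library; `normal_pdf` and the sample parameters are illustrative names and values.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


print(normal_pdf(0.0))   # about 0.3989, the peak of the standard normal
difference = abs(normal_pdf(0.3, mu=1.0, sigma=2.0) - NormalDist(1.0, 2.0).pdf(0.3))
print(difference < 1e-12)   # True
```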
### 3.3 Maximum Likelihood Estimation
The log-likelihood function for $n$ independent observations $x_1, \dots, x_n$:
$$
\ell(\theta) = \sum_{i=1}^{n} \log P(x_i | \theta)
$$
The maximum likelihood estimate is the parameter value that maximizes this function:
$$
\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)
$$
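For a concrete case, consider a Bernoulli model with unknown success probability $p$, an assumption introduced here for illustration. The sketch maximizes the log-likelihood over a grid and compares the result with the known closed form, the sample mean. The data are hypothetical coin flips.

```python
import math


def log_likelihood(p, data):
    """Bernoulli log-likelihood: sum over i of log P(x_i | p)."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)


data = [1, 0, 1, 1, 0, 1, 1, 1]            # hypothetical observations
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: log_likelihood(p, data))
print(p_hat)                   # 0.75
print(sum(data) / len(data))   # 0.75, the closed-form estimate
```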
---
## 4. Neural Network Formulations
### 4.1 Forward Pass
For a single neuron:
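$$
z = \sum_{i=1}^{n} w_i x_i + b, \qquad a = \sigma(z)
$$
where $w_i$ are the weights, $b$ is the bias, and $\sigma$ here denotes the activation function rather than the standard deviation of Section 3.2.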
Common activation functions:
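| Function | Formula | Range | Derivative |
|---|---|---|---|
| Sigmoid | 1/(1 + e^(-x)) | (0, 1) | f(x)(1 - f(x)) |
| Tanh | (e^x - e^(-x))/(e^x + e^(-x)) | (-1, 1) | 1 - f(x)^2 |
| ReLU | max(0, x) | [0, inf) | 0 for x < 0, 1 for x > 0 |
| Softmax | e^(x_i) / sum_j e^(x_j) | (0, 1) | ds_i/dx_j = s_i(delta_ij - s_j), a Jacobian |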
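The forward pass and the activation functions listed above can be sketched in a few lines of standard-library Python. The function names and the sample weights, inputs, and bias are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def softmax(xs):
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def neuron(weights, inputs, bias, activation):
    """Forward pass of one neuron: z = sum(w_i * x_i) + b, a = activation(z)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


a = neuron([0.5, -0.25], [2.0, 4.0], 0.1, sigmoid)        # here z = 0.1
print(a)                                                   # about 0.525
print(neuron([0.5, -0.25], [2.0, 4.0], 0.1, math.tanh))    # same z, tanh activation
print(softmax([1.0, 2.0, 3.0]))                            # three probabilities that sum to 1
```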
### 4.2 Loss Functions
Cross-entropy loss:
$$
L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)
$$
Mean squared error:
$$
L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
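A short sketch of both losses follows. It assumes that `y` is a one-hot (or otherwise non-negative) target vector and that every predicted probability paired with a non-zero target is strictly positive. The names and sample values are illustrative.

```python
import math


def cross_entropy(y, y_hat):
    """L = -sum_i y_i * log(y_hat_i); terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(pred) for yi, pred in zip(y, y_hat) if yi > 0)


def mean_squared_error(y, y_hat):
    """L = (1/n) * sum_i (y_i - y_hat_i)^2."""
    return sum((yi - pred) ** 2 for yi, pred in zip(y, y_hat)) / len(y)


ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
mse = mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(ce)    # -log(0.7), about 0.357
print(mse)   # 1.25 / 3, about 0.417
```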
### 4.3 Regularization
L2 regularization (Ridge):
$$
J_{reg}(\theta) = J(\theta) + \frac{\lambda}{2} \sum_{j=1}^{p} \theta_j^2
$$
L1 regularization (Lasso):
$$
J_{reg}(\theta) = J(\theta) + \lambda \sum_{j=1}^{p} |\theta_j|
$$
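The two penalties differ only in how they measure the size of $\theta$. In the sketch below the base cost $J(\theta) = 1.0$, the parameter vector, and $\lambda = 0.1$ are hypothetical values chosen to make the arithmetic easy to follow.

```python
def l2_penalty(theta, lam):
    """Ridge term: (lambda / 2) * sum of squared parameters."""
    return 0.5 * lam * sum(t * t for t in theta)


def l1_penalty(theta, lam):
    """Lasso term: lambda * sum of absolute parameter values."""
    return lam * sum(abs(t) for t in theta)


theta = [0.5, -1.5, 2.0]
base_cost = 1.0   # stands in for J(theta)
lam = 0.1
ridge = base_cost + l2_penalty(theta, lam)
lasso = base_cost + l1_penalty(theta, lam)
print(ridge)   # about 1.325, since 0.05 * (0.25 + 2.25 + 4.0) = 0.325
print(lasso)   # about 1.4, since 0.1 * (0.5 + 1.5 + 2.0) = 0.4
```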
---
## 5. Information Theory
### 5.1 Entropy
Shannon entropy:
$$
H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)
$$
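A minimal sketch of the entropy formula, using the common convention that outcomes with zero probability contribute nothing. The function name and sample distributions are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.25] * 4))   # 2.0
print(entropy([0.9, 0.1]))   # about 0.469
```

A fair coin carries exactly one bit, while a biased coin carries less.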
### 5.2 KL Divergence
$$
D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$
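The sum can be sketched as follows. The formula above leaves the base of the logarithm unspecified, so this example assumes natural logarithms (nats), and it assumes $Q(x) > 0$ wherever $P(x) > 0$. The distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))   # about 0.511
print(kl_divergence(q, p))   # about 0.368
print(kl_divergence(p, p))   # 0.0
```

The first two lines differ, which illustrates that KL divergence is not symmetric in its arguments.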
### 5.3 Mutual Information
$$
I(X; Y) = \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{P(x) \, P(y)}
$$
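Mutual information measures how far a joint distribution is from the product of its marginals. The sketch below computes it in nats from a joint probability table; the table layout `joint[x][y]` and both example tables are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """I(X; Y) in nats from a joint table with joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]          # marginal P(x)
    py = [sum(col) for col in zip(*joint)]    # marginal P(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log(pxy / (px[i] * py[j]))
    return total


independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))   # 0.0
print(mutual_information(dependent))     # log(2), about 0.693
```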
---
## 6. Advanced Topics
### 6.1 Attention Mechanism (Transformers)
The scaled dot-product attention:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$
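The attention formula can be sketched without any array library by treating $Q$, $K$, and $V$ as lists of row vectors. This is a single-head sketch with no masking or batching, and the names and sample matrices are illustrative.

```python
import math


def softmax(row):
    m = max(row)   # shift by the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    scale = math.sqrt(len(K[0]))   # d_k is the key dimension
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the rows of V.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)   # about [[1.66, 2.66]]: the query matches the first key more closely
```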
### 6.2 Variational Inference (ELBO)
$$
\log P(x) \geq E_{q(z|x)}[\log P(x|z)] - D_{KL}(q(z|x) \| P(z))
$$
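The bound can be checked on a toy model with a binary latent variable, where the evidence $P(x)$ is cheap to compute exactly. Everything numeric below (the prior, the likelihood of the single observed $x$, and the candidate distributions $q$) is a hypothetical choice for illustration.

```python
import math

prior = [0.6, 0.4]        # P(z) for z in {0, 1}
likelihood = [0.2, 0.9]   # P(x | z) for the one observed x

evidence = sum(pz * lx for pz, lx in zip(prior, likelihood))
log_evidence = math.log(evidence)


def elbo(q):
    """E_q[log P(x|z)] - D_KL(q || P(z)) for a distribution q over z."""
    expected_ll = sum(qz * math.log(lx) for qz, lx in zip(q, likelihood))
    kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior) if qz > 0)
    return expected_ll - kl


# The exact posterior P(z | x) makes the bound tight.
posterior = [pz * lx / evidence for pz, lx in zip(prior, likelihood)]
print(log_evidence)       # about -0.734
print(elbo([0.5, 0.5]))   # about -0.878, strictly below log P(x)
print(elbo(posterior))    # matches log P(x) up to rounding
```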
### 6.3 Fourier Transform
$$
\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i x \xi} dx
$$
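Under this convention the Gaussian $f(x) = e^{-\pi x^2}$ is its own Fourier transform, which makes it a convenient test case. The sketch below approximates the integral with a midpoint rule on a truncated interval; the interval, step count, and function names are assumptions made for illustration.

```python
import cmath
import math


def fourier_transform(f, xi, a=-8.0, b=8.0, n=4000):
    """Approximate the integral of f(x) * exp(-2*pi*i*x*xi) over [a, b]."""
    h = (b - a) / n
    total = 0j
    for k in range(n):
        x = a + (k + 0.5) * h
        total += f(x) * cmath.exp(-2j * math.pi * x * xi)
    return total * h


def gaussian(x):
    return math.exp(-math.pi * x * x)


value = fourier_transform(gaussian, 0.5)
print(value.real, gaussian(0.5))   # both about 0.456
print(abs(value.imag) < 1e-9)      # True: the transform of an even real function is real
```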
---
## Code Example: Gradient Descent in Rust
```rust
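struct GradientDescent {
    learning_rate: f64,
    max_iterations: usize,
    tolerance: f64,
}

impl GradientDescent {
    /// Runs gradient descent from `x0` and returns the final iterate.
    /// The objective `_f` is not evaluated; only `grad` drives the updates.
    fn minimize<F, G>(&self, _f: F, grad: G, x0: &[f64]) -> Vec<f64>
    where
        F: Fn(&[f64]) -> f64,
        G: Fn(&[f64]) -> Vec<f64>,
    {
        let mut x = x0.to_vec();
        for _ in 0..self.max_iterations {
            let g = grad(&x);
            // Stop once the Euclidean norm of the gradient falls below the tolerance.
            let norm: f64 = g.iter().map(|gi| gi * gi).sum::<f64>().sqrt();
            if norm < self.tolerance {
                break;
            }
            // Update step: x <- x - learning_rate * gradient.
            for (xi, gi) in x.iter_mut().zip(g.iter()) {
                *xi -= self.learning_rate * gi;
            }
        }
        x
    }
}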
```
## Code Example: Matrix Operations in Python
```python
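import numpy as np


def svd_compress(matrix, k):
    """Compress matrix using top-k singular values."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values and their singular vectors.
    U_k = U[:, :k]
    S_k = np.diag(S[:k])
    Vt_k = Vt[:k, :]
    # The product has the shape of the input and rank at most k.
    return U_k @ S_k @ Vt_k


def gradient_descent(f, grad_f, x0, lr=0.01, epochs=1000):
    """Simple gradient descent optimizer."""
    x = np.array(x0, dtype=float)
    history = [f(x)]  # objective value at every iterate
    for _ in range(epochs):
        # Accept gradients returned as lists as well as arrays.
        g = np.asarray(grad_f(x), dtype=float)
        x -= lr * g
        history.append(f(x))
    return x, history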
```
---
## Summary Table
| Concept | Formula | Application |
|---|---|---|
| Gradient Descent | theta = theta - alpha * grad(J) | Optimization |
| Backpropagation | dL/dw = dL/da * da/dz * dz/dw | Training |
| Bayes' Theorem | P(A\|B) = P(B\|A)P(A)/P(B) | Classification |
| Cross-Entropy | L = -sum(y*log(y_hat)) | Loss function |
| Attention | softmax(QK^T/sqrt(d))V | Transformers |
| KL Divergence | sum(P*log(P/Q)) | Distribution comparison |
| Fourier Transform | integral(f*e^(-2pi*i*x*xi)) | Signal processing |