# Probability Calibration Theory
Calibration ensures that predicted probabilities reflect true likelihoods: when a model predicts 70% confidence, it should be correct 70% of the time.
## Why Calibration Matters
### Miscalibrated Models
```
Prediction: 90% confident it's a cat
Reality: Only 60% of 90%-confident predictions are cats
```
**Consequences:**
- Decision-making based on wrong probabilities
- Risk underestimation in safety-critical systems
- Ensemble weighting fails
### Calibrated Models
```
Prediction: 70% confident it's a cat
Reality: 70% of 70%-confident predictions are cats
```
## Measuring Calibration
### Reliability Diagram
Plot predicted probability vs actual frequency:
```
Accuracy │ ·
│ ·
│ · Perfect calibration (diagonal)
│ ·
│·
└──────────
Confidence
```
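A minimal sketch of how the curve behind a reliability diagram can be computed; the equal-width binning and function names are illustrative, not a standard API:
```
import numpy as np

def reliability_curve(confidences, correct, n_bins=10):
    """Per-bin mean confidence and accuracy for a reliability diagram."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    mean_conf, mean_acc = [], []
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            mean_conf.append(confidences[mask].mean())
            mean_acc.append(correct[mask].mean())
    return np.array(mean_conf), np.array(mean_acc)

# Plot mean_acc against mean_conf; points on the diagonal indicate perfect calibration.
```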
### Expected Calibration Error (ECE)
```
ECE = Σ_b (nᵦ/N) · |acc(b) - conf(b)|
```
Where:
- B = number of bins
- nᵦ = samples in bin b
- N = total number of samples
- acc(b) = accuracy in bin b
- conf(b) = mean confidence in bin b
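Using the same binning as in the reliability diagram, ECE can be computed with a short sketch like the following (the bin count and array-based interface are assumptions, not a fixed convention):
```
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```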
### Maximum Calibration Error (MCE)
```
MCE = max_b |acc(b) - conf(b)|
```
Worst-case miscalibration.
### Brier Score
```
BS = (1/N) Σᵢ (pᵢ - yᵢ)²
```
A proper scoring rule that combines calibration and refinement; lower is better.
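As a sketch, the binary Brier score is a one-line mean squared error between predicted probabilities and 0/1 outcomes:
```
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return float(np.mean((np.asarray(probs) - np.asarray(labels)) ** 2))

# brier_score([0.9, 0.2, 0.7], [1, 0, 1]) ≈ 0.047
```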
## Calibration Methods
### Temperature Scaling
Simple and effective post-hoc calibration:
```
p_calibrated = softmax(logits / T)
```
Optimize T on validation set:
```
T* = argmin_T NLL(softmax(logits/T), y_val)
```
Typically T > 1 (softens overconfident predictions).
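A minimal NumPy/SciPy sketch of fitting T by a bounded 1-D search on a validation set; the search bounds and array shapes (logits of shape (n, classes), integer class labels) are assumptions:
```
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T, logits, labels):
    """Negative log-likelihood of labels under softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """T* = argmin_T NLL on the validation set via a bounded scalar search."""
    res = minimize_scalar(nll_at_temperature, bounds=(0.05, 20.0),
                          args=(val_logits, val_labels), method="bounded")
    return res.x
```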
### Platt Scaling
Logistic regression on model outputs:
```
p_calibrated = 1 / (1 + exp(a·z + b))
```
Learn a, b on a validation set, where z is the model's uncalibrated score (e.g., a logit or decision value).
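A sketch using scikit-learn's LogisticRegression on the raw scores; the near-zero regularization via a large C is an assumption to approximate a plain maximum-likelihood fit:
```
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(val_scores, val_labels):
    """Fit a logistic (sigmoid) map from score z to calibrated probability."""
    lr = LogisticRegression(C=1e6)                       # large C ≈ unregularized fit
    lr.fit(np.asarray(val_scores).reshape(-1, 1), val_labels)
    return lr

def platt_calibrate(lr, scores):
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```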
### Isotonic Regression
Non-parametric calibration:
```
Map predicted probability to calibrated probability
using monotonic (isotonic) function
```
No parametric assumptions, but needs more data.
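scikit-learn's IsotonicRegression gives a direct sketch; the small p_val / y_val arrays below are placeholder validation data for illustration:
```
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Placeholder validation data: predicted probabilities and 0/1 labels.
p_val = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.65])
y_val = np.array([0, 0, 1, 1, 1, 0])

# Fit a monotone (isotonic) map from predicted probability to empirical frequency.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_val, y_val)

p_calibrated = iso.predict(np.array([0.3, 0.7]))   # apply to new predictions
```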
### Histogram Binning
```
For each confidence bin [a, b):
calibrated_prob = empirical_accuracy_in_bin
```
Simple but discontinuous.
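A sketch with equal-width bins; the bin count and the empty-bin fallback are assumptions:
```
import numpy as np

def fit_histogram_binning(p_val, y_val, n_bins=10):
    """Calibrated value for each bin = empirical positive rate of validation points in it."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.clip(np.digitize(p_val, edges) - 1, 0, n_bins - 1)
    bin_probs = np.empty(n_bins)
    for b in range(n_bins):
        mask = ids == b
        # Fall back to the bin midpoint when a bin holds no validation samples.
        bin_probs[b] = y_val[mask].mean() if mask.any() else (edges[b] + edges[b + 1]) / 2
    return edges, bin_probs

def apply_histogram_binning(p, edges, bin_probs):
    ids = np.clip(np.digitize(p, edges) - 1, 0, len(bin_probs) - 1)
    return bin_probs[ids]
```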
### Beta Calibration
```
p_calibrated = 1 / (1 + 1/(exp(c) · p^a / (1-p)^b))
```
Three-parameter model, handles asymmetric errors.
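A sketch of the beta map itself; in practice a, b, c are fit by maximum likelihood on a validation set (fitting code omitted here):
```
import numpy as np

def beta_calibrate(p, a, b, c):
    """Beta calibration map: 1 / (1 + 1/(e^c · p^a / (1-p)^b))."""
    p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid division by zero at 0 and 1
    return 1.0 / (1.0 + 1.0 / (np.exp(c) * p**a / (1 - p)**b))
```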
## When Models Miscalibrate
### Overconfidence
Modern neural networks are typically overconfident (Guo et al., 2017):

| Model | ECE (before calibration) | ECE (after calibration) |
|---|---|---|
| ResNet-110 | 4.5% | 1.2% |
| DenseNet-40 | 3.8% | 0.9% |
**Causes:**
- Cross-entropy loss encourages extreme predictions
- Batch normalization
- Overparameterization
### Underconfidence
Less common, but occurs with:
- Heavy regularization
- Ensemble disagreement
- Out-of-distribution inputs
## Calibration for Multi-Class
### Per-Class Calibration
```
For each class k:
    fit a separate calibrator on (pₖ, 1[y = k])
```
Separate calibrator per class (one-vs-rest); calibrated outputs are typically renormalized to sum to one.
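A sketch of one-vs-rest calibration; isotonic regression is used here only as an example of the per-class calibrator:
```
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_per_class_calibrators(val_probs, val_labels, n_classes):
    """Fit one one-vs-rest calibrator per class on validation probabilities."""
    calibrators = []
    for k in range(n_classes):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(val_probs[:, k], (val_labels == k).astype(float))
        calibrators.append(iso)
    return calibrators

def per_class_calibrate(probs, calibrators):
    cal = np.column_stack([c.predict(probs[:, k]) for k, c in enumerate(calibrators)])
    # Renormalize so each row sums to 1 (guard against an all-zero row).
    return cal / np.clip(cal.sum(axis=1, keepdims=True), 1e-12, None)
```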
### Focal Calibration
```
L = -Σᵢ (1-pᵢ)^γ log(pᵢ)
```
Training with focal loss tends to produce better-calibrated networks: the (1 − pᵢ)^γ factor down-weights easy examples and discourages pushing confidences to extremes.
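A sketch of the focal loss on softmax outputs; γ = 2 is a common choice, and the loss is averaged rather than summed here:
```
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean of -(1 - p_true)^γ · log(p_true) over the batch."""
    p_true = probs[np.arange(len(labels)), labels]    # probability of the true class
    p_true = np.clip(p_true, 1e-12, 1.0)
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))
```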
## Calibration Under Distribution Shift
Challenge: Calibration degrades on OOD data.
### Domain-Aware Calibration
```
T_domain = T_base · domain_adjustment
```
### Ensemble Temperature
```
p = Σₖ wₖ · softmax(logits/Tₖ)
```
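A sketch of the mixture above; the member temperatures Tₖ and weights wₖ are assumed to be given (e.g., fit per domain or per member on validation data):
```
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_temperature(logits, temperatures, weights):
    """Weighted mixture of softmaxes at several temperatures (weights should sum to 1)."""
    return sum(w * softmax(logits / T) for w, T in zip(weights, temperatures))
```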
## Conformal Prediction
Provide prediction sets with a coverage guarantee:
```
C(x) = {y : s(x,y) ≤ τ}
```
Where τ is chosen so that:
```
P(y* ∈ C(x)) ≥ 1 - α
```
**Properties:**
- Distribution-free
- Finite-sample guarantee
- No model assumptions
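A split-conformal sketch; the score s(x, y) = 1 − p_y(x) is one common choice rather than the only one, and the finite-sample quantile correction follows the standard split-conformal recipe:
```
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """τ = the ⌈(n+1)(1-α)⌉/n empirical quantile of calibration scores 1 - p_y."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q, method="higher")

def prediction_set(probs, tau):
    """All labels y with score 1 - p_y ≤ τ."""
    return [np.where(1.0 - p <= tau)[0] for p in probs]
```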
## Selective Prediction
Abstain when uncertain:
```
If max(p) < threshold:
return "I don't know"
```
Trade-off: coverage vs accuracy on non-abstained predictions.
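A minimal sketch of confidence-thresholded abstention; the threshold value is an assumption to be tuned for the desired coverage:
```
import numpy as np

def selective_predict(probs, threshold=0.8):
    """Predicted class when max confidence clears the threshold, else -1 (abstain)."""
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(conf >= threshold, preds, -1)

# Coverage = fraction of non-abstained inputs; selective accuracy is measured on that subset.
```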
## References
- Guo, C., et al. (2017). "On Calibration of Modern Neural Networks." ICML.
- Platt, J. (1999). "Probabilistic Outputs for Support Vector Machines."
- Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." ICML.
- Angelopoulos, A., & Bates, S. (2021). "A Gentle Introduction to Conformal Prediction." arXiv.