# Classification Metrics Theory
**Chapter Status**: ✅ 100% Working (All metrics verified)
| ✅ Working | 4+ | All verified in src/metrics/mod.rs |
| ⏳ In Progress | 0 | - |
| ⬜ Not Implemented | 0 | - |
*Last tested: 2025-11-19*
*Aprender version: 0.3.0*
*Test file: src/metrics/mod.rs tests*
---
## Overview
Classification metrics evaluate how well a model predicts discrete classes. Unlike regression, we're not measuring "how far off"—we're measuring "right or wrong."
**Key Metrics**:
- **Accuracy**: Fraction of correct predictions
- **Precision**: Of predicted positives, how many are correct?
- **Recall**: Of actual positives, how many did we find?
- **F1 Score**: Harmonic mean of precision and recall
**Why This Matters**:
Accuracy alone can be misleading. A spam filter with 99% accuracy that marks all email as "not spam" is useless. We need precision and recall to understand performance fully.
---
## Mathematical Foundation
### The Confusion Matrix
All classification metrics derive from the **confusion matrix**:
```text
Predicted
Pos Neg
Actual Pos TP FN
Neg FP TN
TP = True Positives (correctly predicted positive)
TN = True Negatives (correctly predicted negative)
FP = False Positives (incorrectly predicted positive - Type I error)
FN = False Negatives (incorrectly predicted negative - Type II error)
```
### Accuracy
**Definition**:
```text
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= Correct / Total
```
**Range**: [0, 1], higher is better
**Weakness**: Misleading with imbalanced classes
**Example**:
```text
Dataset: 95% negative, 5% positive
Model: Always predict negative
Accuracy = 95% (looks good!)
But: Model is useless (finds zero positives)
```
### Precision
**Definition**:
```text
Precision = TP / (TP + FP)
= True Positives / All Predicted Positives
```
**Interpretation**: "Of all items I labeled positive, what fraction are actually positive?"
**Use Case**: When false positives are costly
- Spam filter marking important email as spam
- Medical diagnosis triggering unnecessary treatment
### Recall (Sensitivity, True Positive Rate)
**Definition**:
```text
Recall = TP / (TP + FN)
= True Positives / All Actual Positives
```
**Interpretation**: "Of all actual positives, what fraction did I find?"
**Use Case**: When false negatives are costly
- Cancer screening missing actual cases
- Fraud detection missing actual fraud
### F1 Score
**Definition**:
```text
F1 = 2 * (Precision * Recall) / (Precision + Recall)
= Harmonic mean of Precision and Recall
```
**Why harmonic mean?** Punishes extreme imbalance. If either precision or recall is very low, F1 is low.
**Example**:
- Precision = 1.0, Recall = 0.01 → Arithmetic mean = 0.505 (misleading)
- F1 = 2 * (1.0 * 0.01) / (1.0 + 0.01) = 0.02 (realistic)
---
## Implementation in Aprender
### Example: Binary Classification Metrics
```rust,ignore
use aprender::metrics::{accuracy, precision, recall, f1_score};
use aprender::primitives::Vector;
let y_true = Vector::from_vec(vec![1.0, 0.0, 1.0, 1.0, 0.0, 1.0]);
let y_pred = Vector::from_vec(vec![1.0, 0.0, 0.0, 1.0, 0.0, 1.0]);
// TP TN FN TP TN TP
// Confusion Matrix:
// TP = 3, TN = 2, FP = 0, FN = 1
// Accuracy: (3+2)/(3+2+0+1) = 5/6 = 0.833
let acc = accuracy(&y_true, &y_pred);
println!("Accuracy: {:.3}", acc); // 0.833
// Precision: 3/(3+0) = 1.0 (no false positives)
let prec = precision(&y_true, &y_pred);
println!("Precision: {:.3}", prec); // 1.000
// Recall: 3/(3+1) = 0.75 (one false negative)
let rec = recall(&y_true, &y_pred);
println!("Recall: {:.3}", rec); // 0.750
// F1: 2*(1.0*0.75)/(1.0+0.75) = 0.857
let f1 = f1_score(&y_true, &y_pred);
println!("F1: {:.3}", f1); // 0.857
```
**Test References**:
- `src/metrics/mod.rs::tests::test_accuracy`
- `src/metrics/mod.rs::tests::test_precision`
- `src/metrics/mod.rs::tests::test_recall`
- `src/metrics/mod.rs::tests::test_f1_score`
---
## Choosing the Right Metric
### Decision Guide
```text
Are classes balanced (roughly 50/50)?
├─ YES → Accuracy is reasonable
└─ NO → Use Precision/Recall/F1
Which error is more costly?
├─ False Positives worse → Maximize Precision
├─ False Negatives worse → Maximize Recall
└─ Both equally bad → Maximize F1
Examples:
- Email spam (FP bad): High Precision
- Cancer screening (FN bad): High Recall
- General classification: F1 Score
```
### Metric Comparison
| Metric | Formula | Range | Best For | Weakness |
|--------|---------|-------|----------|----------|
| **Accuracy** | (TP+TN)/Total | [0,1] | Balanced classes | Imbalanced data |
| **Precision** | TP/(TP+FP) | [0,1] | Minimizing FP | Ignores FN |
| **Recall** | TP/(TP+FN) | [0,1] | Minimizing FN | Ignores FP |
| **F1** | 2PR/(P+R) | [0,1] | Balancing P&R | Equal weight to P&R |
---
## Precision-Recall Trade-off
**Key Insight**: You can't maximize both precision and recall simultaneously (except for perfect classifier).
### Example: Spam Filter Threshold
```text
Threshold | Precision | Recall | F1
----------|-----------|--------|----
0.9 | 0.95 | 0.60 | 0.74 (conservative)
0.5 | 0.80 | 0.85 | 0.82 (balanced)
0.1 | 0.50 | 0.98 | 0.66 (aggressive)
```
**Choosing threshold**:
- High threshold → High precision, low recall (few predictions, mostly correct)
- Low threshold → Low precision, high recall (many predictions, some wrong)
- Middle ground → Maximize F1
---
## Practical Considerations
### Imbalanced Classes
**Problem**: 1% positive class (fraud detection, rare disease)
**Bad Baseline**:
```rust,ignore
// Always predict negative
// Accuracy = 99% (misleading!)
// Recall = 0% (finds no positives - useless)
```
**Solution**: Use precision, recall, F1 instead of accuracy
### Multi-class Classification
For multi-class, compute metrics per class then average:
- **Macro-average**: Average across classes (each class weighted equally)
- **Micro-average**: Aggregate TP/FP/FN across all classes
**Example** (3 classes):
```text
Class A: Precision = 0.9
Class B: Precision = 0.8
Class C: Precision = 0.5
Macro-avg Precision = (0.9 + 0.8 + 0.5) / 3 = 0.73
```
---
## Verification Through Tests
Classification metrics have comprehensive test coverage:
**Property Tests**:
1. Perfect predictions → All metrics = 1.0
2. All wrong predictions → All metrics = 0.0
3. Metrics are in [0, 1] range
4. F1 ≤ min(Precision, Recall)
**Test Reference**: `src/metrics/mod.rs` validates these properties
---
## Real-World Application
### Evaluating Logistic Regression
```rust,ignore
use aprender::classification::LogisticRegression;
use aprender::metrics::{accuracy, precision, recall, f1_score};
use aprender::traits::Classifier;
// Train model
let mut model = LogisticRegression::new();
model.fit(&x_train, &y_train).unwrap();
// Predict on test set
let y_pred = model.predict(&x_test);
// Evaluate with multiple metrics
let acc = accuracy(&y_test, &y_pred);
let prec = precision(&y_test, &y_pred);
let rec = recall(&y_test, &y_pred);
let f1 = f1_score(&y_test, &y_pred);
println!("Accuracy: {:.3}", acc); // e.g., 0.892
println!("Precision: {:.3}", prec); // e.g., 0.875
println!("Recall: {:.3}", rec); // e.g., 0.910
println!("F1 Score: {:.3}", f1); // e.g., 0.892
// Decision: F1 > 0.85 → Accept model
```
**Case Study**: [Logistic Regression](../examples/logistic-regression.md) uses these metrics
---
## Further Reading
### Peer-Reviewed Paper
**Powers (2011)** - *Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation*
- **Relevance**: Comprehensive survey of classification metrics
- **Link**: [arXiv](https://arxiv.org/abs/2010.16061) (publicly accessible)
- **Key Contribution**: Unifies many metrics under single framework
- **Advanced Topics**: ROC curves, AUC, informedness
- **Applied in**: `src/metrics/mod.rs`
### Related Chapters
- [Logistic Regression Theory](./logistic-regression.md) - Binary classification model
- [Regression Metrics Theory](./regression-metrics.md) - For continuous targets
- [Cross-Validation Theory](./cross-validation.md) - Using metrics in CV
---
## Summary
**What You Learned**:
- ✅ Confusion matrix: TP, TN, FP, FN
- ✅ Accuracy: Simple but misleading with imbalance
- ✅ Precision: Minimizes false positives
- ✅ Recall: Minimizes false negatives
- ✅ F1: Balances precision and recall
- ✅ Choose metric based on: class balance, cost of errors
**Verification Guarantee**: All classification metrics extensively tested (10+ tests) in `src/metrics/mod.rs`. Property tests verify mathematical properties.
**Quick Reference**:
- **Balanced classes**: Accuracy
- **Imbalanced classes**: Precision/Recall/F1
- **FP costly**: Precision
- **FN costly**: Recall
- **Balance both**: F1
---
**Next Chapter**: [Cross-Validation Theory](./cross-validation.md)
**Previous Chapter**: [Logistic Regression Theory](./logistic-regression.md)