aprender 0.29.3

# Classification Metrics Theory

<!-- DOC_STATUS_START -->
**Chapter Status**: ✅ 100% Working (All metrics verified)

| Status | Count | Examples |
|--------|-------|----------|
| ✅ Working | 4+ | All verified in src/metrics/mod.rs |
| ⏳ In Progress | 0 | - |
| ⬜ Not Implemented | 0 | - |

*Last tested: 2025-11-19*
*Aprender version: 0.3.0*
*Test file: src/metrics/mod.rs tests*
<!-- DOC_STATUS_END -->

---

## Overview

Classification metrics evaluate how well a model predicts discrete classes. Unlike regression, we're not measuring "how far off"—we're measuring "right or wrong."

**Key Metrics**:
- **Accuracy**: Fraction of correct predictions
- **Precision**: Of predicted positives, how many are correct?
- **Recall**: Of actual positives, how many did we find?
- **F1 Score**: Harmonic mean of precision and recall

**Why This Matters**:
Accuracy alone can be misleading. A spam filter with 99% accuracy that marks all email as "not spam" is useless. We need precision and recall to understand performance fully.

---

## Mathematical Foundation

### The Confusion Matrix

All classification metrics derive from the **confusion matrix**:

```text
                Predicted
                Pos    Neg
Actual  Pos    TP     FN
        Neg    FP     TN

TP = True Positives  (correctly predicted positive)
TN = True Negatives  (correctly predicted negative)
FP = False Positives (incorrectly predicted positive - Type I error)
FN = False Negatives (incorrectly predicted negative - Type II error)
```

### Accuracy

**Definition**:
```text
Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = Correct / Total
```

**Range**: [0, 1], higher is better

**Weakness**: Misleading with imbalanced classes

**Example**:
```text
Dataset: 95% negative, 5% positive
Model: Always predict negative
Accuracy = 95% (looks good!)
But: Model is useless (finds zero positives)
```

### Precision

**Definition**:
```text
Precision = TP / (TP + FP)
          = True Positives / All Predicted Positives
```

**Interpretation**: "Of all items I labeled positive, what fraction are actually positive?"

**Use Case**: When false positives are costly
- Spam filter marking important email as spam
- Medical diagnosis triggering unnecessary treatment

### Recall (Sensitivity, True Positive Rate)

**Definition**:
```text
Recall = TP / (TP + FN)
       = True Positives / All Actual Positives
```

**Interpretation**: "Of all actual positives, what fraction did I find?"

**Use Case**: When false negatives are costly
- Cancer screening missing actual cases
- Fraud detection missing actual fraud

### F1 Score

**Definition**:
```text
F1 = 2 * (Precision * Recall) / (Precision + Recall)
   = Harmonic mean of Precision and Recall
```

**Why harmonic mean?** Punishes extreme imbalance. If either precision or recall is very low, F1 is low.

**Example**:
- Precision = 1.0, Recall = 0.01 → Arithmetic mean = 0.505 (misleading)
- F1 = 2 * (1.0 * 0.01) / (1.0 + 0.01) = 0.02 (realistic)

---

## Implementation in Aprender

### Example: Binary Classification Metrics

```rust,ignore
use aprender::metrics::{accuracy, precision, recall, f1_score};
use aprender::primitives::Vector;

let y_true = Vector::from_vec(vec![1.0, 0.0, 1.0, 1.0, 0.0, 1.0]);
let y_pred = Vector::from_vec(vec![1.0, 0.0, 0.0, 1.0, 0.0, 1.0]);
//                                  TP   TN   FN   TP   TN   TP
// Confusion Matrix:
// TP = 3, TN = 2, FP = 0, FN = 1

// Accuracy: (3+2)/(3+2+0+1) = 5/6 = 0.833
let acc = accuracy(&y_true, &y_pred);
println!("Accuracy: {:.3}", acc); // 0.833

// Precision: 3/(3+0) = 1.0 (no false positives)
let prec = precision(&y_true, &y_pred);
println!("Precision: {:.3}", prec); // 1.000

// Recall: 3/(3+1) = 0.75 (one false negative)
let rec = recall(&y_true, &y_pred);
println!("Recall: {:.3}", rec); // 0.750

// F1: 2*(1.0*0.75)/(1.0+0.75) = 0.857
let f1 = f1_score(&y_true, &y_pred);
println!("F1: {:.3}", f1); // 0.857
```

**Test References**:
- `src/metrics/mod.rs::tests::test_accuracy`
- `src/metrics/mod.rs::tests::test_precision`
- `src/metrics/mod.rs::tests::test_recall`
- `src/metrics/mod.rs::tests::test_f1_score`

---

## Choosing the Right Metric

### Decision Guide

```text
Are classes balanced (roughly 50/50)?
├─ YES → Accuracy is reasonable
└─ NO → Use Precision/Recall/F1

Which error is more costly?
├─ False Positives worse → Maximize Precision
├─ False Negatives worse → Maximize Recall
└─ Both equally bad → Maximize F1

Examples:
- Email spam (FP bad): High Precision
- Cancer screening (FN bad): High Recall
- General classification: F1 Score
```

### Metric Comparison

| Metric | Formula | Range | Best For | Weakness |
|--------|---------|-------|----------|----------|
| **Accuracy** | (TP+TN)/Total | [0,1] | Balanced classes | Imbalanced data |
| **Precision** | TP/(TP+FP) | [0,1] | Minimizing FP | Ignores FN |
| **Recall** | TP/(TP+FN) | [0,1] | Minimizing FN | Ignores FP |
| **F1** | 2PR/(P+R) | [0,1] | Balancing P&R | Equal weight to P&R |

---

## Precision-Recall Trade-off

**Key Insight**: You can't maximize both precision and recall simultaneously (except for perfect classifier).

### Example: Spam Filter Threshold

```text
Threshold | Precision | Recall | F1
----------|-----------|--------|----
  0.9     |   0.95    |  0.60  | 0.74  (conservative)
  0.5     |   0.80    |  0.85  | 0.82  (balanced)
  0.1     |   0.50    |  0.98  | 0.66  (aggressive)
```

**Choosing threshold**:
- High threshold → High precision, low recall (few predictions, mostly correct)
- Low threshold → Low precision, high recall (many predictions, some wrong)
- Middle ground → Maximize F1

---

## Practical Considerations

### Imbalanced Classes

**Problem**: 1% positive class (fraud detection, rare disease)

**Bad Baseline**:
```rust,ignore
// Always predict negative
// Accuracy = 99% (misleading!)
// Recall = 0% (finds no positives - useless)
```

**Solution**: Use precision, recall, F1 instead of accuracy

### Multi-class Classification

For multi-class, compute metrics per class then average:

- **Macro-average**: Average across classes (each class weighted equally)
- **Micro-average**: Aggregate TP/FP/FN across all classes

**Example** (3 classes):
```text
Class A: Precision = 0.9
Class B: Precision = 0.8
Class C: Precision = 0.5

Macro-avg Precision = (0.9 + 0.8 + 0.5) / 3 = 0.73
```

---

## Verification Through Tests

Classification metrics have comprehensive test coverage:

**Property Tests**:
1. Perfect predictions → All metrics = 1.0
2. All wrong predictions → All metrics = 0.0
3. Metrics are in [0, 1] range
4. F1 ≤ min(Precision, Recall)

**Test Reference**: `src/metrics/mod.rs` validates these properties

---

## Real-World Application

### Evaluating Logistic Regression

```rust,ignore
use aprender::classification::LogisticRegression;
use aprender::metrics::{accuracy, precision, recall, f1_score};
use aprender::traits::Classifier;

// Train model
let mut model = LogisticRegression::new();
model.fit(&x_train, &y_train).unwrap();

// Predict on test set
let y_pred = model.predict(&x_test);

// Evaluate with multiple metrics
let acc = accuracy(&y_test, &y_pred);
let prec = precision(&y_test, &y_pred);
let rec = recall(&y_test, &y_pred);
let f1 = f1_score(&y_test, &y_pred);

println!("Accuracy:  {:.3}", acc);   // e.g., 0.892
println!("Precision: {:.3}", prec);  // e.g., 0.875
println!("Recall:    {:.3}", rec);   // e.g., 0.910
println!("F1 Score:  {:.3}", f1);    // e.g., 0.892

// Decision: F1 > 0.85 → Accept model
```

**Case Study**: [Logistic Regression](../examples/logistic-regression.md) uses these metrics

---

## Further Reading

### Peer-Reviewed Paper

**Powers (2011)** - *Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation*
- **Relevance**: Comprehensive survey of classification metrics
- **Link**: [arXiv](https://arxiv.org/abs/2010.16061) (publicly accessible)
- **Key Contribution**: Unifies many metrics under single framework
- **Advanced Topics**: ROC curves, AUC, informedness
- **Applied in**: `src/metrics/mod.rs`

### Related Chapters

- [Logistic Regression Theory](./logistic-regression.md) - Binary classification model
- [Regression Metrics Theory](./regression-metrics.md) - For continuous targets
- [Cross-Validation Theory](./cross-validation.md) - Using metrics in CV

---

## Summary

**What You Learned**:
- ✅ Confusion matrix: TP, TN, FP, FN
- ✅ Accuracy: Simple but misleading with imbalance
- ✅ Precision: Minimizes false positives
- ✅ Recall: Minimizes false negatives
- ✅ F1: Balances precision and recall
- ✅ Choose metric based on: class balance, cost of errors

**Verification Guarantee**: All classification metrics extensively tested (10+ tests) in `src/metrics/mod.rs`. Property tests verify mathematical properties.

**Quick Reference**:
- **Balanced classes**: Accuracy
- **Imbalanced classes**: Precision/Recall/F1
- **FP costly**: Precision
- **FN costly**: Recall
- **Balance both**: F1

---

**Next Chapter**: [Cross-Validation Theory](./cross-validation.md)

**Previous Chapter**: [Logistic Regression Theory](./logistic-regression.md)