Evaluation Metrics
Accuracy alone is misleading. Precision, recall, and F1 score give a fuller picture of model performance, especially when classes are imbalanced or when false positives and false negatives carry different costs.
Intuition First
Imagine a hospital alarm that detects cancer from a scan.
If it never fires, it detects 0% of cancers, which is terrible. If it always fires, it catches 100% of cancers but also flags every healthy patient, which is just as bad.
Accuracy doesn't capture this tradeoff. A model that predicts "no cancer" for everyone scores 99% accuracy if only 1% of patients have cancer — and is completely useless.
You need metrics that separate being cautious from being thorough.
What's Actually Happening
All classification metrics are built from four outcomes when the model makes a binary prediction:
| | Model says Positive | Model says Negative |
|---|---|---|
| Actually Positive | True Positive (TP) ✓ | False Negative (FN) ✗ |
| Actually Negative | False Positive (FP) ✗ | True Negative (TN) ✓ |
- TP: Correctly detected a positive case (caught the cancer)
- TN: Correctly identified a negative case (correctly cleared a healthy patient)
- FP: False alarm (flagged a healthy patient as sick) — Type I error
- FN: Missed a real positive (cleared a sick patient) — Type II error
Every metric is a different combination of these four numbers.
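As a quick sketch of how these counts come out of a pair of label arrays, here is the standard scikit-learn idiom (the tiny arrays are made up for illustration):

from sklearn.metrics import confusion_matrix
import numpy as np
# Made-up labels: 1 = positive (cancer), 0 = negative (healthy)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
# Rows are actual classes, columns are predicted classes, ordered [0, 1]:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=4, FP=1, FN=1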
Build the Idea Step-by-Step
Formal Explanation
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
What fraction of all predictions were correct? Simple, but misleading when classes are imbalanced.
When it fails: 99% of emails are not spam. A model that says "not spam" for everything gets 99% accuracy — and catches zero spam.
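A quick back-of-the-envelope version of that spam example (counts are illustrative):

# 1000 emails, 10 of them spam; the model predicts "not spam" for everything
tp, tn, fp, fn = 0, 990, 0, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99, even though zero spam is caught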
Precision
Precision = TP / (TP + FP)
Of all the times the model predicted positive, how often was it actually positive?
High precision matters when false positives are costly. Example: flagging a legitimate transaction as fraud. Each false alarm costs customer trust and manual review time.
Recall (Sensitivity)
Recall = TP / (TP + FN)
Of all the actual positive cases, how many did the model catch?
High recall matters when false negatives are costly. Example: missing a cancer diagnosis. A false negative (missed cancer) is much worse than a false positive (unnecessary follow-up).
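A minimal sketch contrasting the two on made-up counts for the cancer scenario (100 patients, 10 of whom are actually sick):

# Hypothetical screening results: the model flags 20 patients, 9 of them actually sick
tp, fp, fn, tn = 9, 11, 1, 79
precision = tp / (tp + fp)  # of the flagged patients, how many are sick?
recall = tp / (tp + fn)     # of the sick patients, how many were flagged?
print(f"Precision: {precision:.2f}")  # 0.45 (many false alarms)
print(f"Recall:    {recall:.2f}")     # 0.90 (only 1 of 10 cancers missed)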
F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean of precision and recall. It rewards high scores on both — if either is very low, F1 is pulled down.
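To see the penalty, compare the harmonic mean with a plain average for a model that is precise but misses most positives (numbers chosen for illustration):

precision, recall = 0.90, 0.10
arithmetic_mean = (precision + recall) / 2          # 0.50, looks respectable
f1 = 2 * precision * recall / (precision + recall)  # 0.18, exposes the weak recall
print(f"mean: {arithmetic_mean:.2f}, F1: {f1:.2f}")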
Use F1 when:
- Classes are imbalanced
- You care about both false positives and false negatives
- You need a single number that reflects the precision/recall tradeoff
Key Properties / Rules
| Metric | Formula | Best when... | Fails when... |
|---|---|---|---|
| Accuracy | (TP+TN)/total | Balanced classes | Imbalanced classes |
| Precision | TP/(TP+FP) | FP cost is high | FN cost is high |
| Recall | TP/(TP+FN) | FN cost is high | FP cost is high |
| F1 | 2·P·R/(P+R) | Both costs matter | Only one cost matters |
The Precision-Recall Tradeoff
Adjusting the classification threshold (default: 0.5) moves you along the tradeoff:
- Lower threshold → model flags more things as positive → higher recall, lower precision
- Higher threshold → model is more selective → higher precision, lower recall
This is a design decision based on the cost of each error type.
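A small sketch of that tradeoff, thresholding made-up probability scores at a few cutoffs (the scores and class balance are invented for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score
rng = np.random.default_rng(0)
y_true = np.array([0] * 90 + [1] * 10)
# Fake scores: positives tend to score higher, but the two classes overlap
scores = np.concatenate([rng.uniform(0.0, 0.6, 90), rng.uniform(0.3, 1.0, 10)])
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# Raising the threshold trades recall away for precision; lowering it does the reverse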
Why It Matters
In real systems, accuracy is almost never the right metric:
- Fraud detection: High recall (catch most fraud), acceptable precision (some false alarms are OK)
- Medical diagnosis: High recall (don't miss disease), careful about precision
- Search / recommendations: Precision matters most (return relevant results, not noise)
- Spam filtering: High precision for inbox (don't filter legitimate email), some recall loss OK
Choosing the right metric is part of defining the problem. A model optimized for the wrong metric will behave unexpectedly in production.
Common Pitfalls
- Reporting accuracy on imbalanced datasets. Always also report precision, recall, and F1. If the positive class is rare, accuracy is nearly always misleading.
- Confusing precision and recall. A useful mnemonic: Precision = "of what I predicted positive, how many actually were?" Recall = "of all actual positives, how many did I recover?"
- Optimizing F1 when you care more about one error type. If false negatives are catastrophic (cancer, security threats), optimize for recall directly — not F1.
- Forgetting to set the threshold explicitly. The default 0.5 threshold is arbitrary. Use a precision-recall curve to pick a threshold that matches your cost tradeoffs.
- Choosing macro vs. micro averaging without thinking. For multi-class problems, macro-F1 treats all classes equally (useful when classes are imbalanced), while micro-F1 is dominated by the largest class; see the sketch below.
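A quick illustration of how the two averages diverge on an imbalanced three-class problem (toy labels; `average=` is scikit-learn's parameter):

import numpy as np
from sklearn.metrics import f1_score
# Toy multi-class labels: class 0 dominates, classes 1 and 2 are rare
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_pred = np.array([0] * 95 + [2] * 5)  # class 1 is never predicted
print(f1_score(y_true, y_pred, average="micro"))                   # ~0.95, dominated by class 0
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.66, the missed class drags it down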
Examples
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report
)
import numpy as np
# Imbalanced scenario: 90% negative, 10% positive
y_true = np.array([0]*90 + [1]*10)
# Model A: always predicts negative
y_pred_bad = np.zeros(100, dtype=int)
print("Model A (always negative):")
print(f" Accuracy: {accuracy_score(y_true, y_pred_bad):.2f}") # 0.90 — looks great!
print(f" Precision: {precision_score(y_true, y_pred_bad, zero_division=0):.2f}") # 0.00
print(f" Recall: {recall_score(y_true, y_pred_bad):.2f}") # 0.00
print(f" F1: {f1_score(y_true, y_pred_bad):.2f}") # 0.00
# Model B: a real classifier
y_pred_good = np.array([0]*85 + [1]*5 + [0]*2 + [1]*8)  # 85 TN, 5 FP, 2 FN, 8 TP
print("\nModel B (real classifier):")
print(f" Accuracy: {accuracy_score(y_true, y_pred_good):.2f}") # 0.93
print(f" Precision: {precision_score(y_true, y_pred_good):.2f}") # 0.62
print(f" Recall: {recall_score(y_true, y_pred_good):.2f}") # 0.80
print(f" F1: {f1_score(y_true, y_pred_good):.2f}") # 0.70
# Full report
print(classification_report(y_true, y_pred_good))
Manual computation:
Given: TP=8, FP=5, TN=85, FN=2
Accuracy = (8 + 85) / 100 = 0.93
Precision = 8 / (8 + 5) = 0.62
Recall = 8 / (8 + 2) = 0.80
F1 = 2 × 0.62 × 0.80 / (0.62 + 0.80) = 0.70
Threshold tuning:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Get probability scores from the model
y_scores = np.random.rand(100)  # placeholder: replace with model.predict_proba(X)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
# precisions and recalls have one more entry than thresholds; drop the final point,
# then pick the highest threshold that still achieves the target recall
target_recall = 0.80
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
chosen_threshold = thresholds[idx]
print(f"Threshold for recall ≥ 0.80: {chosen_threshold:.3f}")
print(f"Precision at that threshold: {precisions[idx]:.3f}")
# Visualize the full tradeoff
plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()