Evaluation Metrics
Accuracy alone is misleading. Precision, recall, and F1 score give a fuller picture of model performance, especially when classes are imbalanced or when false positives and false negatives carry different costs.
Intuition First
Imagine a hospital alarm that detects cancer from a scan.
If it never fires, it detects 0% of cancers, which is terrible. If it always fires, it catches 100% of cancers but also flags every healthy patient, which is just as bad.
Accuracy doesn't capture this tradeoff. A model that predicts "no cancer" for everyone scores 99% accuracy if only 1% of patients have cancer — and is completely useless.
You need metrics that separate being cautious from being thorough.
What's Actually Happening
All classification metrics are built from four outcomes when the model makes a binary prediction:
| | Model says Positive | Model says Negative |
|---|---|---|
| Actually Positive | True Positive (TP) ✓ | False Negative (FN) ✗ |
| Actually Negative | False Positive (FP) ✗ | True Negative (TN) ✓ |
- TP: Correctly detected a positive case (caught the cancer)
- TN: Correctly identified a negative case (correctly cleared a healthy patient)
- FP: False alarm (flagged a healthy patient as sick) — Type I error
- FN: Missed a real positive (cleared a sick patient) — Type II error
Every metric is a different combination of these four numbers.
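As a quick sketch of how these counts come out of a pair of label arrays, here is the standard scikit-learn idiom (the tiny arrays are made up for illustration):

from sklearn.metrics import confusion_matrix
import numpy as np
# Made-up labels: 1 = positive (cancer), 0 = negative (healthy)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
# Rows are actual classes, columns are predicted classes, ordered [0, 1]:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=4, FP=1, FN=1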
Build the Idea Step-by-Step
Formal Explanation
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
What fraction of all predictions were correct? Simple, but misleading when classes are imbalanced.
When it fails: 99% of emails are not spam. A model that says "not spam" for everything gets 99% accuracy — and catches zero spam.
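A quick back-of-the-envelope version of that spam example (counts are illustrative):

# 1000 emails, 10 of them spam; the model predicts "not spam" for everything
tp, tn, fp, fn = 0, 990, 0, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99, even though zero spam is caught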
Precision
Precision = TP / (TP + FP)
Of all the times the model predicted positive, how often was it actually positive?
High precision matters when false positives are costly. Example: flagging a legitimate transaction as fraud. Each false alarm costs customer trust and manual review time.
Recall (Sensitivity)
Recall = TP / (TP + FN)
Of all the actual positive cases, how many did the model catch?
High recall matters when false negatives are costly. Example: missing a cancer diagnosis. A false negative (missed cancer) is much worse than a false positive (unnecessary follow-up).
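A minimal sketch contrasting the two on made-up counts for the cancer scenario (100 patients, 10 of whom are actually sick):

# Hypothetical screening results: the model flags 20 patients, 9 of them actually sick
tp, fp, fn, tn = 9, 11, 1, 79
precision = tp / (tp + fp)  # of the flagged patients, how many are sick?
recall = tp / (tp + fn)     # of the sick patients, how many were flagged?
print(f"Precision: {precision:.2f}")  # 0.45 (many false alarms)
print(f"Recall:    {recall:.2f}")     # 0.90 (only 1 of 10 cancers missed)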
F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean of precision and recall. It rewards high scores on both — if either is very low, F1 is pulled down.
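To see the penalty, compare the harmonic mean with a plain average for a model that is precise but misses most positives (numbers chosen for illustration):

precision, recall = 0.90, 0.10
arithmetic_mean = (precision + recall) / 2          # 0.50, looks respectable
f1 = 2 * precision * recall / (precision + recall)  # 0.18, exposes the weak recall
print(f"mean: {arithmetic_mean:.2f}, F1: {f1:.2f}")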
Use F1 when:
- Classes are imbalanced
- You care about both false positives and false negatives
- You need a single number that reflects the precision/recall tradeoff
Key Properties / Rules
| Metric | Formula | Best when... | Fails when... |
|---|---|---|---|
| Accuracy | (TP+TN)/total | Balanced classes | Imbalanced classes |
| Precision | TP/(TP+FP) | FP cost is high | FN cost is high |
| Recall | TP/(TP+FN) | FN cost is high | FP cost is high |
| F1 | 2·P·R/(P+R) | Both costs matter | Only one cost matters |
The Precision-Recall Tradeoff
Adjusting the classification threshold (default: 0.5) moves you along the tradeoff:
- Lower threshold → model flags more things as positive → higher recall, lower precision
- Higher threshold → model is more selective → higher precision, lower recall
This is a design decision based on the cost of each error type.
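A small sketch of that tradeoff, thresholding made-up probability scores at a few cutoffs (the scores and class balance are invented for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score
rng = np.random.default_rng(0)
y_true = np.array([0] * 90 + [1] * 10)
# Fake scores: positives tend to score higher, but the two classes overlap
scores = np.concatenate([rng.uniform(0.0, 0.6, 90), rng.uniform(0.3, 1.0, 10)])
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# Raising the threshold trades recall away for precision; lowering it does the reverse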
Why It Matters
In real systems, accuracy is almost never the right metric:
- Fraud detection: High recall (catch most fraud), acceptable precision (some false alarms are OK)
- Medical diagnosis: High recall (don't miss disease), careful about precision
- Search / recommendations: Precision matters most (return relevant results, not noise)
- Spam filtering: High precision for inbox (don't filter legitimate email), some recall loss OK
Choosing the right metric is part of defining the problem. A model optimized for the wrong metric will behave unexpectedly in production.
Common Pitfalls
- Reporting accuracy on imbalanced datasets. Always also report precision, recall, and F1. If the positive class is rare, accuracy is nearly always misleading.
- Confusing precision and recall. A useful mnemonic: Precision = "of what I predicted positive, how many actually were?" Recall = "of all actual positives, how many did I recover?"
- Optimizing F1 when you care more about one error type. If false negatives are catastrophic (cancer, security threats), optimize for recall directly — not F1.
- Forgetting to set the threshold explicitly. The default 0.5 threshold is arbitrary. Use a precision-recall curve to pick a threshold that matches your cost tradeoffs.
- Choosing macro vs. micro averaging without thinking. For multi-class problems, macro-F1 treats all classes equally (useful when classes are imbalanced), while micro-F1 is dominated by the largest class; see the sketch below.
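A quick illustration of how the two averages diverge on an imbalanced three-class problem (toy labels; `average=` is scikit-learn's parameter):

import numpy as np
from sklearn.metrics import f1_score
# Toy multi-class labels: class 0 dominates, classes 1 and 2 are rare
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_pred = np.array([0] * 95 + [2] * 5)  # class 1 is never predicted
print(f1_score(y_true, y_pred, average="micro"))                   # ~0.95, dominated by class 0
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.66, the missed class drags it down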
Examples
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report
)
import numpy as np
# Imbalanced scenario: 90% negative, 10% positive
y_true = np.array([0]*90 + [1]*10)
# Model A: always predicts negative
y_pred_bad = np.zeros(100, dtype=int)
print("Model A (always negative):")
print(f" Accuracy: {accuracy_score(y_true, y_pred_bad):.2f}") # 0.90 — looks great!
print(f" Precision: {precision_score(y_true, y_pred_bad, zero_division=0):.2f}") # 0.00
print(f" Recall: {recall_score(y_true, y_pred_bad):.2f}") # 0.00
print(f" F1: {f1_score(y_true, y_pred_bad):.2f}") # 0.00
# Model B: a real classifier
y_pred_good = np.array([0]*85 + [1]*5 + [0]*2 + [1]*8)  # 85 TN, 5 FP, 2 FN, 8 TP
print("\nModel B (real classifier):")
print(f" Accuracy: {accuracy_score(y_true, y_pred_good):.2f}") # 0.93
print(f" Precision: {precision_score(y_true, y_pred_good):.2f}") # 0.62
print(f" Recall: {recall_score(y_true, y_pred_good):.2f}") # 0.80
print(f" F1: {f1_score(y_true, y_pred_good):.2f}") # 0.70
# Full report
print(classification_report(y_true, y_pred_good))
Manual computation:
Given: TP=8, FP=5, TN=85, FN=2
Accuracy = (8 + 85) / 100 = 0.93
Precision = 8 / (8 + 5) = 0.62
Recall = 8 / (8 + 2) = 0.80
F1 = 2 × 0.62 × 0.80 / (0.62 + 0.80) = 0.70
Threshold tuning:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Get probability scores from the model
y_scores = np.random.rand(100)  # placeholder: replace with model.predict_proba(X)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
# precisions and recalls have one more entry than thresholds; drop the final point,
# then pick the highest threshold that still achieves the target recall
target_recall = 0.80
idx = np.where(recalls[:-1] >= target_recall)[0][-1]
chosen_threshold = thresholds[idx]
print(f"Threshold for recall ≥ 0.80: {chosen_threshold:.3f}")
print(f"Precision at that threshold: {precisions[idx]:.3f}")
# Visualize the full tradeoff
plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()