
Evaluation Metrics

Accuracy alone is misleading. Precision, recall, and F1 score give you a complete picture of model performance — especially when classes are imbalanced or when the cost of false positives and false negatives is different.

Intuition First

Imagine a hospital alarm that detects cancer from a scan.

If it never fires, it detects 0% of cancers — terrible. If it always fires, it catches 100% of cancers but also flags every healthy patient — also terrible.

Accuracy doesn't capture this tradeoff. A model that predicts "no cancer" for everyone scores 99% accuracy if only 1% of patients have cancer — and is completely useless.

You need metrics that separate being cautious from being thorough.


What's Actually Happening

All classification metrics are built from four outcomes when the model makes a binary prediction:

                      Model says Positive        Model says Negative
Actually Positive     True Positive (TP) ✓       False Negative (FN) ✗
Actually Negative     False Positive (FP) ✗      True Negative (TN) ✓
  • TP: Correctly detected a positive case (caught the cancer)
  • TN: Correctly identified a negative case (correctly cleared a healthy patient)
  • FP: False alarm (flagged a healthy patient as sick) — Type I error
  • FN: Missed a real positive (cleared a sick patient) — Type II error

Every metric is a different combination of these four numbers.
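
These counts can be read directly from scikit-learn's confusion matrix. A minimal sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")  # TP=3  FP=1  TN=3  FN=1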


Build the Idea Step-by-Step

  1. Make predictions
  2. Build the confusion matrix: TP, FP, TN, FN
  3. Accuracy: overall correctness
  4. Precision: of all positives predicted, how many were real?
  5. Recall: of all real positives, how many did we catch?
  6. F1: balance precision and recall into one number

Formal Explanation

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

What fraction of all predictions were correct? Simple, but misleading when classes are imbalanced.

When it fails: 99% of emails are not spam. A model that says "not spam" for everything gets 99% accuracy — and catches zero spam.
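
With hypothetical counts (1,000 emails, 10 of them spam), the arithmetic makes the failure concrete:

Accuracy    = (TP + TN) / total = (0 + 990) / 1000 = 0.99
Spam caught = TP = 0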


Precision

Precision = TP / (TP + FP)

Of all the times the model predicted positive, how often was it actually positive?

High precision matters when false positives are costly. Example: flagging a legitimate transaction as fraud. Each false alarm costs customer trust and manual review time.
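
A worked sketch with hypothetical fraud-screening counts:

# Hypothetical counts: 40 transactions flagged as fraud, 25 of them actually fraudulent
tp, fp = 25, 15
precision = tp / (tp + fp)
print(f"Precision: {precision:.2f}")  # 0.62, so almost 4 in 10 alerts are false alarms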


Recall (Sensitivity)

Recall = TP / (TP + FN)

Of all the actual positive cases, how many did the model catch?

High recall matters when false negatives are costly. Example: missing a cancer diagnosis. A false negative (missed cancer) is much worse than a false positive (unnecessary follow-up).
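
And a worked sketch with hypothetical screening counts:

# Hypothetical counts: 50 patients actually have the disease, the model flags 45 of them
tp, fn = 45, 5
recall = tp / (tp + fn)
print(f"Recall: {recall:.2f}")  # 0.90, meaning 5 of the 50 real cases were missed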


F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean of precision and recall. It rewards high scores on both — if either is very low, F1 is pulled down.
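
To see why the harmonic mean is used rather than a plain average, compare the two on a deliberately lopsided (hypothetical) pair of scores:

precision, recall = 1.00, 0.10  # hypothetical: a very selective model that misses most positives
arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)
print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.55, looks respectable
print(f"F1 (harmonic):   {f1:.2f}")               # 0.18, dragged down by the weak recall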

Use F1 when:

  • Classes are imbalanced
  • You care about both false positives and false negatives
  • You need a single number that reflects the precision/recall tradeoff

Key Properties / Rules

Metric      Formula          Best when...            Fails when...
Accuracy    (TP+TN)/total    Balanced classes        Imbalanced classes
Precision   TP/(TP+FP)       FP cost is high         FN cost is high
Recall      TP/(TP+FN)       FN cost is high         FP cost is high
F1          2·P·R/(P+R)      Both costs matter       Only one cost matters

The Precision-Recall Tradeoff

Adjusting the classification threshold (default: 0.5) moves you along the tradeoff:

  • Lower threshold → model flags more things as positive → higher recall, lower precision
  • Higher threshold → model is more selective → higher precision, lower recall

This is a design decision based on the cost of each error type.
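
A small sketch of this tradeoff, using made-up labels and scores in place of real model probabilities:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities (stand-ins for model.predict_proba(X)[:, 1])
y_true   = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.55, 0.6, 0.7, 0.75, 0.9])

for t in (0.3, 0.5, 0.7):
    y_pred = (y_scores >= t).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")

# Raising the threshold trades recall for precision:
#   threshold=0.3  precision=0.50  recall=1.00
#   threshold=0.5  precision=0.60  recall=0.75
#   threshold=0.7  precision=0.67  recall=0.50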


Why It Matters

In real systems, accuracy is almost never the right metric:

  • Fraud detection: High recall (catch most fraud), acceptable precision (some false alarms are OK)
  • Medical diagnosis: High recall (don't miss disease), careful about precision
  • Search / recommendations: Precision matters most (return relevant results, not noise)
  • Spam filtering: High precision for inbox (don't filter legitimate email), some recall loss OK

Choosing the right metric is part of defining the problem. A model optimized for the wrong metric will behave unexpectedly in production.


Common Pitfalls

  • Reporting accuracy on imbalanced datasets. Always also report precision, recall, and F1. If the positive class is rare, accuracy is nearly always misleading.
  • Confusing precision and recall. A useful mnemonic: Precision = "of what I predicted positive, how many actually were?" Recall = "of all actual positives, how many did I recover?"
  • Optimizing F1 when you care more about one error type. If false negatives are catastrophic (cancer, security threats), optimize for recall directly — not F1.
  • Forgetting to set the threshold explicitly. The default 0.5 threshold is arbitrary. Use a precision-recall curve to pick a threshold that matches your cost tradeoffs.
  • Using macro vs. micro averaging without thinking. For multi-class problems, macro-F1 treats all classes equally (useful when classes are imbalanced), while micro-F1 is dominated by the largest class (see the sketch below).
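
A quick sketch of that last point, using hypothetical multi-class labels where the rare class is never predicted correctly:

from sklearn.metrics import f1_score

# Hypothetical 3-class labels; class 2 is rare and the model never predicts it
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

print(f"micro-F1: {f1_score(y_true, y_pred, average='micro'):.2f}")                   # 0.79, dominated by the large classes
print(f"macro-F1: {f1_score(y_true, y_pred, average='macro', zero_division=0):.2f}")  # 0.57, penalized for the missed rare class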

Examples

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)
import numpy as np

# Imbalanced scenario: 90% negative, 10% positive
y_true = np.array([0]*90 + [1]*10)

# Model A: always predicts negative
y_pred_bad = np.zeros(100, dtype=int)
print("Model A (always negative):")
print(f"  Accuracy:  {accuracy_score(y_true, y_pred_bad):.2f}")   # 0.90 — looks great!
print(f"  Precision: {precision_score(y_true, y_pred_bad, zero_division=0):.2f}")  # 0.00
print(f"  Recall:    {recall_score(y_true, y_pred_bad):.2f}")      # 0.00
print(f"  F1:        {f1_score(y_true, y_pred_bad):.2f}")          # 0.00

# Model B: a real classifier
y_pred_good = np.array([0]*85 + [1]*5 + [0]*2 + [1]*8)
print("\nModel B (real classifier):")
print(f"  Accuracy:  {accuracy_score(y_true, y_pred_good):.2f}")   # 0.93
print(f"  Precision: {precision_score(y_true, y_pred_good):.2f}")  # 0.62
print(f"  Recall:    {recall_score(y_true, y_pred_good):.2f}")     # 0.80
print(f"  F1:        {f1_score(y_true, y_pred_good):.2f}")         # 0.70

# Full report
print(classification_report(y_true, y_pred_good))

Manual computation:

Given: TP=8, FP=5, TN=85, FN=2

Accuracy  = (8 + 85) / 100 = 0.93
Precision = 8 / (8 + 5)   = 0.62
Recall    = 8 / (8 + 2)   = 0.80
F1        = 2 × 0.62 × 0.80 / (0.62 + 0.80) = 0.70

Threshold tuning:

from sklearn.metrics import precision_recall_curve

# Get probability scores from the model
y_scores = np.random.rand(100)  # placeholder; replace with model.predict_proba(X)[:,1]

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# Pick the highest threshold that still gives recall >= 0.80
# (recalls[:-1] aligns with thresholds; recall falls as the threshold rises)
target_recall = 0.80
valid = recalls[:-1] >= target_recall
chosen_threshold = thresholds[valid][-1]
print(f"Threshold for recall ≥ 0.80: {chosen_threshold:.3f}")

Review Questions