
Bayes' Theorem

Bayes' theorem is the rule for updating your beliefs when you get new evidence. Prior belief + new data → updated belief. It's the mathematical foundation for how any rational agent should reason under uncertainty.

Intuition First

Imagine you're a doctor. A patient tests positive for a rare disease. How worried should you be?

Your answer depends on two things:

  1. How reliable is the test? (Does it actually detect the disease?)
  2. How rare is the disease in the first place?

Bayes' theorem is the formula that combines these. It says: your updated belief (posterior) depends on both the strength of the evidence and your starting belief (prior).

Without Bayes, people often dramatically overreact to a positive test for a rare condition — they forget the prior.


What's Actually Happening

You start with a prior: your belief before seeing evidence. Example: "This disease affects 1% of people."

You observe evidence: a positive test result.

The likelihood tells you how probable that evidence was, assuming the hypothesis is true: "If you have the disease, the test is positive 95% of the time."

Bayes' theorem combines these to give you the posterior: your updated belief after the evidence.

The key insight: rare events stay rare even after positive evidence. If only 1 in 100 people has the disease, most positive tests still come from healthy people — because there are so many more healthy people.
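Concretely, take 10,000 people with the numbers from the Examples section below (1% prevalence, 95% sensitivity, 5% false positive rate): 100 have the disease and about 95 of them test positive, while 9,900 are healthy and about 495 of them also test positive. Of the roughly 590 positives, only 95, about 16%, are real cases.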


Build the Idea Step-by-Step

Prior: P(hypothesis) — starting belief
Likelihood: P(evidence | hypothesis) — how probable is this evidence, assuming the hypothesis is true?
Evidence: P(evidence) — how common is this evidence overall?
Posterior: P(hypothesis | evidence) = prior × likelihood / evidence
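
Translated directly into code, a minimal sketch using the disease numbers from the Examples section below:

prior = 0.01                          # P(hypothesis): 1% prevalence
likelihood = 0.95                     # P(evidence | hypothesis): sensitivity
evidence = 0.95 * 0.01 + 0.05 * 0.99  # P(evidence), via total probability
posterior = prior * likelihood / evidence
print(f"{posterior:.3f}")             # ≈ 0.160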

Formal Explanation

P(A | B) = P(B | A) · P(A) / P(B)

In the context of hypothesis H and evidence E:

P(H | E) = P(E | H) · P(H) / P(E)

Where:

  • P(H), the prior: belief in H before seeing E
  • P(E | H), the likelihood: probability of evidence E if H is true
  • P(E), the marginal likelihood: overall probability of seeing E (normalizes everything)
  • P(H | E), the posterior: belief in H after seeing E

Computing P(E) using the law of total probability:

P(E) = P(E | H) · P(H) + P(E | not H) · P(not H)
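
Plugging in the numbers from the medical example below: P(E) = 0.95 · 0.01 + 0.05 · 0.99 = 0.0595.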

Key Properties / Rules

Term                   Name                 Meaning
P(H)                   Prior                Belief before evidence
P(E|H)                 Likelihood           How probable is E if H is true?
P(E)                   Evidence / Marginal  Normalizing constant
P(H|E)                 Posterior            Belief after evidence
P(H|E) ∝ P(E|H)·P(H)   Proportionality      Often compute unnormalized first
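
A minimal sketch of that last row, reusing the disease numbers from the Examples section; the dictionary layout here is purely illustrative:

prior      = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.95, "healthy": 0.05}  # P(positive | each hypothesis)

unnormalized = {h: likelihood[h] * prior[h] for h in prior}
total = sum(unnormalized.values())               # this sum is exactly P(E)
posterior = {h: u / total for h, u in unnormalized.items()}
print(posterior)  # {'disease': ≈0.16, 'healthy': ≈0.84}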

Why It Matters

Spam filters use Bayes: P(spam | word "free") = P(word "free" | spam) · P(spam) / P(word "free"). The prior is the base rate of spam; the likelihood is how often "free" appears in spam emails. Update for every word → posterior probability this email is spam.
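
A toy sketch of that update, with made-up priors and word likelihoods (real filters estimate these from a corpus), computed in log space to avoid underflow:

import math

p_spam = 0.4                                         # prior: base rate of spam
p_word_given_spam = {"free": 0.20, "meeting": 0.01}  # made-up likelihoods
p_word_given_ham  = {"free": 0.02, "meeting": 0.10}

def spam_posterior(words):
    log_spam = math.log(p_spam)
    log_ham  = math.log(1 - p_spam)
    for w in words:                                  # naive: words independent
        log_spam += math.log(p_word_given_spam[w])
        log_ham  += math.log(p_word_given_ham[w])
    # Normalize: P(spam | words) = spam score / (spam score + ham score)
    m = max(log_spam, log_ham)
    return math.exp(log_spam - m) / (math.exp(log_spam - m) + math.exp(log_ham - m))

print(spam_posterior(["free"]))             # ≈ 0.87: "free" raises suspicion
print(spam_posterior(["free", "meeting"]))  # ≈ 0.40: "meeting" pulls it back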

L2 regularization = Gaussian prior. When you train a neural network with weight decay, you're implicitly doing MAP (Maximum A Posteriori) estimation with a Gaussian prior on the weights. The regularization term is the negative log of that prior, up to an additive constant.
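
A quick numeric check of that correspondence (a sketch, with an arbitrary prior width sigma): the negative log of a zero-mean Gaussian prior on the weights equals the L2 penalty with λ = 1/(2σ²), constant aside.

import numpy as np

# -log N(w; 0, sigma^2) = ||w||^2 / (2 sigma^2) + const
sigma = 2.0                         # arbitrary prior width for illustration
lam = 1.0 / (2 * sigma**2)          # the matching weight-decay coefficient

w = np.array([0.5, -1.0, 2.0])      # example weight vector
neg_log_prior = np.sum(w**2) / (2 * sigma**2)   # constant term dropped
l2_penalty = lam * np.sum(w**2)
print(np.isclose(neg_log_prior, l2_penalty))    # True: same quantity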

Bayesian updating is the principled way to update a model as new data arrives — no retraining from scratch, just update the posterior.
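
A minimal sketch of that loop for a binary hypothesis, with a hypothetical helper bayes_update and the test numbers from the Examples section:

def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    # One Bayesian update: yesterday's posterior is today's prior
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

belief = 0.01                            # start from the 1% prior
for _ in range(2):                       # two independent positive tests
    belief = bayes_update(belief, 0.95, 0.05)
print(f"{belief:.3f}")                   # ≈ 0.783, matching the example below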


Common Pitfalls

  • Ignoring the prior (base rate neglect). The most common mistake. A 95%-accurate test for a 1-in-1000 disease still mostly gives false positives. Always ask: "how rare is this in the first place?"
  • Confusing P(A|B) and P(B|A). P(disease|positive test) and P(positive test|disease) are completely different. They can differ by 10× or more. This confusion is called the prosecutor's fallacy and has sent innocent people to prison.
  • Treating the posterior as certain. The posterior is still a probability distribution, not a fact. You can be 90% sure and still be wrong 10% of the time.

Examples

# Medical test example
p_disease = 0.01           # Prior: 1% of population has the disease
p_pos_given_disease = 0.95  # Sensitivity: 95% true positive rate
p_pos_given_healthy = 0.05  # False positive rate: 5%

# P(positive test) — law of total probability
p_healthy = 1 - p_disease
p_positive = (p_pos_given_disease * p_disease +
              p_pos_given_healthy * p_healthy)
print(f"P(positive) = {p_positive:.4f}")   # ≈ 0.059

# Bayes' theorem: P(disease | positive)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ≈ 0.161

# Despite 95% sensitivity, only ~16% of positives actually have the disease:
# the disease is rare (1%), so the prior dominates

# Updating beliefs with multiple pieces of evidence
# After two independent positive tests:
# New prior = posterior from first test
p_disease_after_test1 = p_disease_given_pos
p_positive2 = (p_pos_given_disease * p_disease_after_test1 +
               p_pos_given_healthy * (1 - p_disease_after_test1))
p_disease_after_test2 = (p_pos_given_disease * p_disease_after_test1) / p_positive2
print(f"After two positive tests: {p_disease_after_test2:.3f}")  # ≈ 0.784

Review Questions