Bayes' Theorem
Bayes' theorem is the rule for updating your beliefs when you get new evidence. Prior belief + new data → updated belief. It's the mathematical foundation for how any rational agent should reason under uncertainty.
Intuition First
Imagine you're a doctor. A patient tests positive for a rare disease. How worried should you be?
Your answer depends on two things:
- How reliable is the test? (Does it actually detect the disease?)
- How rare is the disease in the first place?
Bayes' theorem is the formula that combines these. It says: your updated belief (posterior) depends on both the strength of the evidence and your starting belief (prior).
Without Bayes, people often dramatically overreact to a positive test for a rare condition — they forget the prior.
What's Actually Happening
You start with a prior: your belief before seeing evidence. Example: "This disease affects 1% of people."
You observe evidence: a positive test result.
The likelihood tells you how probable that evidence was, assuming the hypothesis is true: "If you have the disease, the test is positive 95% of the time."
Bayes' theorem combines these to give you the posterior: your updated belief after the evidence.
The key insight: rare events stay rare even after positive evidence. If only 1 in 100 people has the disease, most positive tests still come from healthy people — because there are so many more healthy people.
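Run the numbers: in a population of 10,000, about 100 people have a 1% disease and 9,900 don't. A test with a 95% true-positive rate and a 5% false-positive rate flags about 95 of the sick and about 495 of the healthy. So only 95 / (95 + 495) ≈ 16% of positive tests come from people who actually have the disease.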
Build the Idea Step-by-Step
Formal Explanation
P(A | B) = P(B | A) · P(A) / P(B)
In the context of hypothesis H and evidence E:
P(H | E) = P(E | H) · P(H) / P(E)
Where:
- P(H) — prior: belief in H before seeing E
- P(E | H) — likelihood: probability of evidence E if H is true
- P(E) — marginal likelihood: overall probability of seeing E (normalizes everything)
- P(H | E) — posterior: belief in H after seeing E
Computing P(E) using the law of total probability:
P(E) = P(E | H) · P(H) + P(E | not H) · P(not H)
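As a minimal sketch, the whole formal statement fits in a few lines of Python (the function name bayes_posterior and the binary disease/healthy setup are illustrative, not from any library):

def bayes_posterior(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) for a binary hypothesis H."""
    # P(E) via the law of total probability
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    # Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
    return p_e_given_h * prior / p_e

# The running example: 1% prior, 95% sensitivity, 5% false positive rate
print(bayes_posterior(0.01, 0.95, 0.05))  # ≈ 0.161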
Key Properties / Rules
| Term | Name | Meaning |
|---|---|---|
| P(H) | Prior | Belief before evidence |
| P(E|H) | Likelihood | How probable is E if H is true? |
| P(E) | Evidence / Marginal | Normalizing constant |
| P(H|E) | Posterior | Belief after evidence |
| P(H|E) ∝ P(E|H)·P(H) | Proportionality | Often compute unnormalized first |
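The proportionality row is how the computation usually goes in practice when several hypotheses compete: score each one by P(E|H) · P(H), then divide by the sum, which is exactly P(E). A sketch with three hypotheses (the numbers are purely illustrative):

priors = [0.7, 0.2, 0.1]          # P(H_i) for three competing hypotheses
likelihoods = [0.1, 0.5, 0.9]     # P(E | H_i)
unnormalized = [lk * pr for lk, pr in zip(likelihoods, priors)]
total = sum(unnormalized)         # this sum is P(E)
posteriors = [u / total for u in unnormalized]
print(posteriors)                 # ≈ [0.269, 0.385, 0.346], sums to 1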
Why It Matters
Spam filters use Bayes: P(spam | word "free") = P(word "free" | spam) · P(spam) / P(word "free"). The prior is the base rate of spam; the likelihood is how often "free" appears in spam emails. Update for every word → posterior probability this email is spam.
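A hedged sketch of that word-by-word update (the probabilities below are invented for illustration; real filters estimate them from a corpus, and the word-independence assumption is what makes this "naive" Bayes):

import math

p_spam = 0.4                                    # prior: base rate of spam
p_word_given_spam = {"free": 0.20, "meeting": 0.01}
p_word_given_ham  = {"free": 0.02, "meeting": 0.10}

def spam_posterior(words):
    # Log space avoids underflow; summing logs assumes words are
    # independent given the class (the naive Bayes assumption)
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for w in words:
        log_spam += math.log(p_word_given_spam[w])
        log_ham += math.log(p_word_given_ham[w])
    # Normalize: P(spam | words) = 1 / (1 + exp(log_ham - log_spam))
    return 1 / (1 + math.exp(log_ham - log_spam))

print(spam_posterior(["free"]))             # ≈ 0.87, "free" pushes toward spam
print(spam_posterior(["free", "meeting"]))  # ≈ 0.40, "meeting" pulls it back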
L2 regularization = Gaussian prior. When you train a neural network with weight decay, you're implicitly doing MAP (Maximum A Posteriori) estimation with a Gaussian prior on the weights. The regularization term is the negative log of that prior, up to an additive constant.
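A two-line sketch of why, assuming a zero-mean Gaussian prior P(w) ∝ exp(-||w||² / (2σ²)):

log P(w | D) = log P(D | w) + log P(w) + const
             = log P(D | w) - ||w||² / (2σ²) + const

Maximizing the posterior is therefore the same as minimizing loss + λ · ||w||² with λ = 1 / (2σ²), which is exactly weight decay.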
Bayesian updating is the principled way to revise a model as new data arrives: today's posterior becomes tomorrow's prior, so you never retrain from scratch.
Common Pitfalls
- Ignoring the prior (base rate neglect). The most common mistake. A 95%-accurate test for a 1-in-1000 disease still mostly gives false positives; see the quick check after this list. Always ask: "how rare is this in the first place?"
- Confusing P(A|B) and P(B|A). P(disease|positive test) and P(positive test|disease) are completely different. They can differ by 10× or more. This confusion is called the prosecutor's fallacy and has sent innocent people to prison.
- Treating the posterior as certain. The posterior is still a probability distribution, not a fact. You can be 90% sure and still be wrong 10% of the time.
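A quick check of the first pitfall's numbers, reusing the 95% sensitivity and 5% false positive rate from above but with a 1-in-1000 prior:

prior = 0.001
p_positive = 0.95 * prior + 0.05 * (1 - prior)   # ≈ 0.051
posterior = 0.95 * prior / p_positive
print(f"{posterior:.3f}")  # ≈ 0.019: over 98% of positive results are false positives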
Examples
# Medical test example
p_disease = 0.01 # Prior: 1% of population has the disease
p_pos_given_disease = 0.95 # Sensitivity: 95% true positive rate
p_pos_given_healthy = 0.05 # False positive rate: 5%
# P(positive test) — law of total probability
p_healthy = 1 - p_disease
p_positive = (p_pos_given_disease * p_disease +
              p_pos_given_healthy * p_healthy)
print(f"P(positive) = {p_positive:.4f}") # ≈ 0.059
# Bayes' theorem: P(disease | positive)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}") # ≈ 0.161
# Despite 95% sensitivity, only ~16% of people who test positive
# actually have the disease, because the disease is rare (1%): the prior dominates
# Updating beliefs with multiple pieces of evidence
# After two independent positive tests:
# New prior = posterior from first test
p_disease_after_test1 = p_disease_given_pos
p_positive2 = (p_pos_given_disease * p_disease_after_test1 +
               p_pos_given_healthy * (1 - p_disease_after_test1))
p_disease_after_test2 = (p_pos_given_disease * p_disease_after_test1) / p_positive2
print(f"After two positive tests: {p_disease_after_test2:.3f}") # ≈ 0.784