Bayes' Theorem
Bayes' theorem is the rule for updating your beliefs when you get new evidence. Prior belief + new data → updated belief. It's the mathematical foundation for how any rational agent should reason under uncertainty.
Intuition First
Imagine you're a doctor. A patient tests positive for a rare disease. How worried should you be?
Your answer depends on two things:
- How reliable is the test? (Does it actually detect the disease?)
- How rare is the disease in the first place?
Bayes' theorem is the formula that combines these. It says: your updated belief (posterior) depends on both the strength of the evidence and your starting belief (prior).
Without Bayes, people often dramatically overreact to a positive test for a rare condition — they forget the prior.
What's Actually Happening
You start with a prior: your belief before seeing evidence. Example: "This disease affects 1% of people."
You observe evidence: a positive test result.
The likelihood tells you how probable that evidence was, assuming the hypothesis is true: "If you have the disease, the test is positive 95% of the time."
Bayes' theorem combines these to give you the posterior: your updated belief after the evidence.
The key insight: rare events stay rare even after positive evidence. If only 1 in 100 people has the disease, most positive tests still come from healthy people — because there are so many more healthy people.
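Run the numbers: in a population of 10,000, about 100 people have a 1% disease and 9,900 don't. A test with a 95% true-positive rate and a 5% false-positive rate flags about 95 of the sick and about 495 of the healthy. So only 95 / (95 + 495) ≈ 16% of positive tests come from people who actually have the disease.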
Build the Idea Step-by-Step
Formal Explanation
P(A | B) = P(B | A) · P(A) / P(B)
In the context of hypothesis H and evidence E:
P(H | E) = P(E | H) · P(H) / P(E)
Where:
- P(H) — prior: belief in H before seeing E
- P(E | H) — likelihood: probability of evidence E if H is true
- P(E) — marginal likelihood: overall probability of seeing E (normalizes everything)
- P(H | E) — posterior: belief in H after seeing E
Computing P(E) using the law of total probability:
P(E) = P(E | H) · P(H) + P(E | not H) · P(not H)
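As a minimal sketch, the whole formal statement fits in a few lines of Python (the function name bayes_posterior and the binary disease/healthy setup are illustrative, not from any library):

def bayes_posterior(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) for a binary hypothesis H."""
    # P(E) via the law of total probability
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    # Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
    return p_e_given_h * prior / p_e

# The running example: 1% prior, 95% sensitivity, 5% false positive rate
print(bayes_posterior(0.01, 0.95, 0.05))  # ≈ 0.161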
Key Properties / Rules
| Term | Name | Meaning |
|---|---|---|
| P(H) | Prior | Belief before evidence |
| P(E|H) | Likelihood | How probable is E if H is true? |
| P(E) | Evidence / Marginal | Normalizing constant |
| P(H|E) | Posterior | Belief after evidence |
| P(H|E) ∝ P(E|H)·P(H) | Proportionality | Often compute unnormalized first |
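The proportionality row is how the computation usually goes in practice when several hypotheses compete: score each one by P(E|H) · P(H), then divide by the sum, which is exactly P(E). A sketch with three hypotheses (the numbers are purely illustrative):

priors = [0.7, 0.2, 0.1]          # P(H_i) for three competing hypotheses
likelihoods = [0.1, 0.5, 0.9]     # P(E | H_i)
unnormalized = [lk * pr for lk, pr in zip(likelihoods, priors)]
total = sum(unnormalized)         # this sum is P(E)
posteriors = [u / total for u in unnormalized]
print(posteriors)                 # ≈ [0.269, 0.385, 0.346], sums to 1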
Why It Matters
Spam filters use Bayes: P(spam | word "free") = P(word "free" | spam) · P(spam) / P(word "free"). The prior is the base rate of spam; the likelihood is how often "free" appears in spam emails. Update for every word → posterior probability this email is spam.
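A hedged sketch of that word-by-word update (the probabilities below are invented for illustration; real filters estimate them from a corpus, and the word-independence assumption is what makes this "naive" Bayes):

import math

p_spam = 0.4                                    # prior: base rate of spam
p_word_given_spam = {"free": 0.20, "meeting": 0.01}
p_word_given_ham  = {"free": 0.02, "meeting": 0.10}

def spam_posterior(words):
    # Log space avoids underflow; summing logs assumes words are
    # independent given the class (the naive Bayes assumption)
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for w in words:
        log_spam += math.log(p_word_given_spam[w])
        log_ham += math.log(p_word_given_ham[w])
    # Normalize: P(spam | words) = 1 / (1 + exp(log_ham - log_spam))
    return 1 / (1 + math.exp(log_ham - log_spam))

print(spam_posterior(["free"]))             # ≈ 0.87, "free" pushes toward spam
print(spam_posterior(["free", "meeting"]))  # ≈ 0.40, "meeting" pulls it back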
L2 regularization = Gaussian prior. When you train a neural network with weight decay, you're implicitly doing MAP (Maximum A Posteriori) estimation with a Gaussian prior on the weights. The regularization term is the negative log of that prior, up to an additive constant.
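A two-line sketch of why, assuming a zero-mean Gaussian prior P(w) ∝ exp(-||w||² / (2σ²)):

log P(w | D) = log P(D | w) + log P(w) + const
             = log P(D | w) - ||w||² / (2σ²) + const

Maximizing the posterior is therefore the same as minimizing loss + λ · ||w||² with λ = 1 / (2σ²), which is exactly weight decay.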
Bayesian updating is the principled way to revise a model as new data arrives: today's posterior becomes tomorrow's prior, so you never retrain from scratch.
Common Pitfalls
- Ignoring the prior (base rate neglect). The most common mistake. A 95%-accurate test for a 1-in-1000 disease still mostly gives false positives; see the quick check after this list. Always ask: "how rare is this in the first place?"
- Confusing P(A|B) and P(B|A). P(disease|positive test) and P(positive test|disease) are completely different. They can differ by 10× or more. This confusion is called the prosecutor's fallacy and has sent innocent people to prison.
- Treating the posterior as certain. The posterior is still a probability distribution, not a fact. You can be 90% sure and still be wrong 10% of the time.
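A quick check of the first pitfall's numbers, reusing the 95% sensitivity and 5% false positive rate from above but with a 1-in-1000 prior:

prior = 0.001
p_positive = 0.95 * prior + 0.05 * (1 - prior)   # ≈ 0.051
posterior = 0.95 * prior / p_positive
print(f"{posterior:.3f}")  # ≈ 0.019: over 98% of positive results are false positives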
Examples
# Medical test example
p_disease = 0.01 # Prior: 1% of population has the disease
p_pos_given_disease = 0.95 # Sensitivity: 95% true positive rate
p_pos_given_healthy = 0.05 # False positive rate: 5%
# P(positive test) — law of total probability
p_healthy = 1 - p_disease
p_positive = (p_pos_given_disease * p_disease +
              p_pos_given_healthy * p_healthy)
print(f"P(positive) = {p_positive:.4f}") # ≈ 0.059
# Bayes' theorem: P(disease | positive)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}") # ≈ 0.161
# Despite 95% sensitivity, only ~16% of people who test positive
# actually have the disease, because the disease is rare (1%): the prior dominates
# Updating beliefs with multiple pieces of evidence
# After two independent positive tests:
# New prior = posterior from first test
p_disease_after_test1 = p_disease_given_pos
p_positive2 = (p_pos_given_disease * p_disease_after_test1 +
               p_pos_given_healthy * (1 - p_disease_after_test1))
p_disease_after_test2 = (p_pos_given_disease * p_disease_after_test1) / p_positive2
print(f"After two positive tests: {p_disease_after_test2:.3f}") # ≈ 0.784