Probability Rules
Probability is a number between 0 and 1 that measures how likely something is. A handful of rules let you combine and reason about probabilities, and they underpin most of statistics and ML.
Intuition First
Probability answers the question "how likely is this?" on a scale from 0 (impossible) to 1 (certain).
- A fair coin flip: P(heads) = 0.5
- Rolling a 6 on a fair die: P(6) = 1/6 ≈ 0.17
- The sun rising tomorrow: P(sun rises) ≈ 1.0
From just a few rules, you can calculate the probability of any combination of events: "A or B," "A and B," "A given B already happened."
Conditional probability is the most important rule — it's how you update your beliefs when you get new information.
What's Actually Happening
Say you have two events: A = "it rains" and B = "I carry an umbrella."
- Complement: if there's a 30% chance of rain, there's a 70% chance of no rain.
- Union (OR): P(rain OR umbrella) — needs inclusion-exclusion to avoid double-counting the case where both happen.
- Intersection (AND): P(rain AND umbrella) — both happen together.
- Conditional: P(rain | umbrella) — given I have an umbrella, what's the chance it rains? This restricts the world to only cases where I'm carrying an umbrella.
The key insight: conditional probability shrinks the sample space. You're not asking "over all days, how often does it rain?" — you're asking "over days I carry an umbrella, how often does it rain?"
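One way to see the shrinking sample space is to count days directly. The sketch below uses a made-up log of ten days (the data and variable names are illustrative assumptions, not measurements) and asks the same question twice: once over all days, and once over umbrella days only.

```python
# Hypothetical log of 10 days as (rained, carried_umbrella) pairs. Made-up data.
days = [
    (True, True), (True, True), (True, False),
    (False, True), (False, True),
    (False, False), (False, False), (False, False),
    (False, False), (False, False),
]

# Over all days: how often does it rain?
p_rain = sum(rained for rained, _ in days) / len(days)

# Shrink the sample space to umbrella days, then ask the same question.
umbrella_days = [(rained, umb) for rained, umb in days if umb]
p_rain_given_umbrella = sum(rained for rained, _ in umbrella_days) / len(umbrella_days)

print(f"P(rain) = {p_rain:.2f}")
print(f"P(rain | umbrella) = {p_rain_given_umbrella:.2f}")
```

The only difference between the two calculations is which list of days is counted over, which is all that conditioning does.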
Build the Idea Step-by-Step
Formal Explanation
Basic rules:
0 ≤ P(A) ≤ 1
P(not A) = 1 − P(A) (complement)
Addition rule (OR):
P(A or B) = P(A) + P(B) − P(A and B)
(subtract the overlap so it isn't counted twice)
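A small check of the addition rule on a fair die may help. The events here (A = "even", B = "greater than 3") and the helper `prob` are chosen purely for illustration; the sketch compares a direct count of the union with the formula.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}              # fair six-sided die
A = {n for n in outcomes if n % 2 == 0}    # even: {2, 4, 6}
B = {n for n in outcomes if n > 3}         # greater than 3: {4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

direct = prob(A | B)                        # count the union directly
by_rule = prob(A) + prob(B) - prob(A & B)   # addition rule
naive = prob(A) + prob(B)                   # forgets the overlap {4, 6}

print(direct, by_rule, naive)
```

The naive sum counts the outcomes 4 and 6 twice, which is why it overshoots the direct count.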
Multiplication rule (AND):
P(A and B) = P(A | B) · P(B)
= P(B | A) · P(A)
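As an illustration of the multiplication rule, consider drawing two cards from a standard 52-card deck without replacement and asking for two aces. This card example is an assumption added for illustration; the sketch computes the answer with the rule and cross-checks it by brute-force counting.

```python
from fractions import Fraction
from itertools import permutations

# 4 aces and 48 other cards; only "ace or not" matters for this question.
deck = ["A"] * 4 + ["x"] * 48

# Multiplication rule: P(both aces) = P(second ace | first ace) * P(first ace)
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)   # one ace already removed
p_both = p_second_ace_given_first * p_first_ace

# Cross-check by enumerating every ordered pair of distinct cards.
draws = list(permutations(range(len(deck)), 2))
both_aces = sum(1 for i, j in draws if deck[i] == "A" and deck[j] == "A")
p_both_counted = Fraction(both_aces, len(draws))

print(p_both, p_both_counted)
```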
Conditional probability:
P(A | B) = P(A and B) / P(B), defined only when P(B) > 0
Read as: "probability of A, given that B has already occurred."
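Because the definition divides by P(B), it only makes sense when P(B) is positive. A minimal sketch of the formula as a function follows; `conditional` is a hypothetical helper name, and the input checks are one reasonable choice rather than a standard API.

```python
def conditional(p_a_and_b: float, p_b: float) -> float:
    """Return P(A | B) = P(A and B) / P(B). Hypothetical helper."""
    if p_b <= 0:
        raise ValueError("P(A | B) is undefined when P(B) = 0")
    if p_a_and_b > p_b:
        # "A and B" is a sub-case of B, so it cannot be more likely than B.
        raise ValueError("P(A and B) cannot exceed P(B)")
    return p_a_and_b / p_b

p = conditional(0.20, 0.40)
print(f"P(A | B) = {p:.2f}")
```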
Chain rule (extends to any number of events):
P(A, B, C) = P(A | B, C) · P(B | C) · P(C)
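The chain rule is easiest to appreciate with more than three events. A classic illustration (added here as an example, under the assumption of 365 equally likely birthdays) is the probability that n people all have different birthdays: each factor is the probability that the next person avoids every birthday already taken, given that all earlier people were distinct.

```python
def prob_all_distinct(n_people: int, n_days: int = 365) -> float:
    """Chain rule: multiply P(person i is new | all earlier people distinct)."""
    p = 1.0
    for i in range(n_people):
        # i birthdays are already taken, so n_days - i choices remain.
        p *= (n_days - i) / n_days
    return p

p_23 = prob_all_distinct(23)
print(f"P(23 people all have distinct birthdays) = {p_23:.4f}")
```

Each pass through the loop adds one more conditional factor, exactly as the formula adds one term per event.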
Key Properties / Rules
| Rule | Formula | When to use |
|---|---|---|
| Complement | P(not A) = 1 − P(A) | Easier to compute the opposite |
| Addition | P(A∪B) = P(A)+P(B)−P(A∩B) | "At least one of" |
| Multiplication | P(A∩B) = P(A|B)·P(B) | "Both" happen |
| Conditional | P(A|B) = P(A∩B)/P(B) | Given B, what about A? |
| Mutually exclusive | P(A∩B) = 0 | Can't both happen |
| Mutually exclusive and exhaustive | Σ P(Aᵢ) = 1 | The Aᵢ cover every possibility with no overlap |
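The complement row is often the most useful shortcut in practice. As a sketch, take the question "what is the chance of at least one 6 in four rolls of a fair die?" (an example added for illustration, assuming independent rolls so the no-six probabilities multiply). Computing the opposite event is much shorter than listing every way to get one or more sixes.

```python
from fractions import Fraction
from itertools import product

# Complement: P(at least one 6) = 1 - P(no 6 in any roll)
p_no_six_one_roll = Fraction(5, 6)
p_no_six_four_rolls = p_no_six_one_roll ** 4   # assumes independent rolls
p_at_least_one_six = 1 - p_no_six_four_rolls

# Cross-check by enumerating all 6**4 equally likely outcomes.
rolls = list(product(range(1, 7), repeat=4))
counted = Fraction(sum(1 for r in rolls if 6 in r), len(rolls))

print(p_at_least_one_six, float(p_at_least_one_six))
```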
Why It Matters
Language models generate text using conditional probabilities:
P(next token | all previous tokens)
Each generation step applies the conditional probability rule — the model's whole job is estimating this.
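To make this concrete, here is a toy sketch that scores a sentence by multiplying next-token conditionals. It is deliberately simplified: the table conditions on only the previous token (a bigram model) rather than all previous tokens, and every probability in it is made up for illustration.

```python
import math

# Hypothetical P(next token | previous token) tables. Made-up numbers.
bigram = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 0.8, "runs": 0.2},
    "dog": {"sleeps": 0.1, "runs": 0.9},
}

def sequence_log_prob(tokens):
    """Sum log P(token | previous token), starting from the <s> marker."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(bigram[prev][tok])
        prev = tok
    return log_p

log_p = sequence_log_prob(["the", "cat", "sleeps"])
p = math.exp(log_p)   # equals 0.6 * 0.5 * 0.8
print(f"P(the cat sleeps) = {p:.2f}")
```

Summing log-probabilities instead of multiplying raw probabilities is a common design choice because products of many small numbers underflow quickly.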
Softmax output produces a valid probability distribution: all values ≥ 0, sum to 1. This satisfies the probability axioms. Cross-entropy loss measures how far this distribution is from the true one.
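A minimal sketch of both ideas, using only the standard library: the logits are arbitrary example values, and the cross-entropy shown is the common special case where the true distribution puts all its mass on one class.

```python
import math

def softmax(logits):
    """Turn arbitrary real scores into positive values that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Loss when the true distribution is one-hot at true_index."""
    return -math.log(probs[true_index])

logits = [2.0, 1.0, 0.1]                   # example scores, not from a real model
probs = softmax(logits)
loss = cross_entropy(probs, true_index=0)

print([round(p, 3) for p in probs], round(sum(probs), 6))
print(f"cross-entropy = {loss:.3f}")
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.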
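Precision and recall are conditional probabilities that point in opposite directions: precision is P(actually positive | predicted positive) and recall is P(predicted positive | actually positive). On imbalanced datasets the two can differ sharply, and moving a classifier's decision threshold typically raises one while lowering the other.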
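Precision and recall can both be read as conditional probabilities estimated from a confusion matrix. The counts below are hypothetical, chosen to mimic an imbalanced dataset with 20 positives in 1,000 examples.

```python
# Hypothetical confusion-matrix counts for an imbalanced dataset.
tp, fp = 8, 2       # predicted positive: 8 correct, 2 wrong
fn, tn = 12, 978    # predicted negative: 12 missed positives, 978 correct

# Precision = P(actually positive | predicted positive)
precision = tp / (tp + fp)

# Recall = P(predicted positive | actually positive)
recall = tp / (tp + fn)

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

Both numbers share the same numerator (the intersection) and differ only in which event they condition on.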
Common Pitfalls
- Forgetting inclusion-exclusion. P(A or B) ≠ P(A) + P(B) unless A and B are mutually exclusive (can't both happen). Always subtract the overlap.
- Conditioning doesn't mean causation. P(cancer | smoker) > P(cancer) tells you smoking correlates with cancer, not that any given person's cancer was caused by smoking.
- Confusing P(A|B) with P(B|A). These are completely different quantities. P(spam | word "free") ≠ P(word "free" | spam). The confusion between them leads to real errors in reasoning (see Bayes' theorem).
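The last pitfall is easy to check with counts. The email numbers below are invented for illustration; the sketch computes both directions of the conditional from the same table and confirms that Bayes' theorem connects them.

```python
from fractions import Fraction

# Hypothetical mailbox of 1,000 emails. Made-up counts.
total = 1000
spam = 200
free_and_spam = 120     # spam emails containing the word "free"
free_and_ham = 40       # non-spam emails containing the word "free"
free = free_and_spam + free_and_ham

p_free_given_spam = Fraction(free_and_spam, spam)   # condition on spam
p_spam_given_free = Fraction(free_and_spam, free)   # condition on "free"

# Bayes' theorem: P(spam | free) = P(free | spam) * P(spam) / P(free)
p_spam = Fraction(spam, total)
p_free = Fraction(free, total)
via_bayes = p_free_given_spam * p_spam / p_free

print(p_free_given_spam, p_spam_given_free, via_bayes)
```

The two conditionals share a numerator but divide by different totals, so they agree only when P(spam) happens to equal P("free").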
Examples
# Probability rules in Python
p_rain = 0.30
p_umbrella = 0.40
p_rain_and_umbrella = 0.20 # both happen on 20% of days
# Complement
p_no_rain = 1 - p_rain # 0.70
# Addition rule (OR)
p_rain_or_umbrella = p_rain + p_umbrella - p_rain_and_umbrella # 0.50
# Conditional probability: P(rain | umbrella)
p_rain_given_umbrella = p_rain_and_umbrella / p_umbrella # 0.50
# Conditional probability: P(umbrella | rain)
p_umbrella_given_rain = p_rain_and_umbrella / p_rain # 0.67
print(f"P(no rain) = {p_no_rain:.2f}")  # format to 2 decimals to hide floating-point noise
print(f"P(rain or umbrella) = {p_rain_or_umbrella:.2f}")
print(f"P(rain | umbrella) = {p_rain_given_umbrella:.2f}")
print(f"P(umbrella | rain) = {p_umbrella_given_rain:.2f}")
# Note: P(rain|umbrella) ≠ P(umbrella|rain)