Probability Rules

Probability is a number between 0 and 1 that measures how likely something is. A handful of rules let you combine and reason about probabilities, and they underpin most of statistics and ML.

Intuition First

Probability answers the question "how likely is this?" on a scale from 0 (impossible) to 1 (certain).

A fair coin flip: P(heads) = 0.5
Rolling a 6 on a fair die: P(6) = 1/6 ≈ 0.17
The sun rising tomorrow: P(sun rises) ≈ 1.0

From just a few rules, you can calculate the probability of any combination of events: "A or B," "A and B," "A given B already happened."

Conditional probability is the most important rule — it's how you update your beliefs when you get new information.

What's Actually Happening

Say you have two events: A = "it rains" and B = "I carry an umbrella."

Complement: if there's a 30% chance of rain, there's a 70% chance of no rain.
Union (OR): P(rain OR umbrella) — needs inclusion-exclusion to avoid double-counting the case where both happen.
Intersection (AND): P(rain AND umbrella) — both happen together.
Conditional: P(rain | umbrella) — given I have an umbrella, what's the chance it rains? This restricts the world to only cases where I'm carrying an umbrella.

The key insight: conditional probability shrinks the sample space. You're not asking "over all days, how often does it rain?" — you're asking "over days I carry an umbrella, how often does it rain?"
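To make the "shrinking sample space" idea concrete, here is a minimal sketch that counts over a made-up log of ten days. The `days` list is hypothetical data, chosen so that it rains on 30% of days and an umbrella is carried on 40% of them.

```python
from fractions import Fraction

# Hypothetical log of 10 days: (rained, carried_umbrella)
days = [
    (True, True), (True, True), (True, False),
    (False, True), (False, True),
    (False, False), (False, False), (False, False),
    (False, False), (False, False),
]

# Over all days: how often does it rain?
p_rain = Fraction(sum(rained for rained, _ in days), len(days))

# Shrink the sample space to umbrella days only, then ask the same question
umbrella_days = [d for d in days if d[1]]
p_rain_given_umbrella = Fraction(
    sum(rained for rained, _ in umbrella_days), len(umbrella_days)
)

print(p_rain)                 # 3/10
print(p_rain_given_umbrella)  # 1/2
```

The only difference between the two calculations is which days are allowed into the denominator.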

Build the Idea Step-by-Step

Event A with probability P(A)
→ Complement: P(not A) = 1 − P(A)
→ A and B together: multiplication rule
→ A or B: addition rule − overlap
→ A given B: conditional probability P(A|B)

Formal Explanation

Basic rules:

0 ≤ P(A) ≤ 1
P(not A) = 1 − P(A)               (complement)

Addition rule (OR):

P(A or B) = P(A) + P(B) − P(A and B)

(subtract the overlap so it isn't counted twice)
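As a quick check of the addition rule, the sketch below enumerates the outcomes of a fair die. The events A ("even") and B ("greater than 3") are illustrative choices, not part of the original text.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}             # fair six-sided die
A = {o for o in outcomes if o % 2 == 0}   # even: {2, 4, 6}
B = {o for o in outcomes if o > 3}        # greater than 3: {4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

lhs = prob(A | B)                         # count the union directly
rhs = prob(A) + prob(B) - prob(A & B)     # addition rule

print(lhs, rhs)  # 2/3 2/3
```

Without the subtraction, the outcomes 4 and 6 would be counted twice and the sum would come out as 1.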

Multiplication rule (AND):

P(A and B) = P(A | B) · P(B)
           = P(B | A) · P(A)
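One way to see the multiplication rule at work is drawing two cards without replacement. The sketch below assumes a standard 52-card deck and compares the rule against a brute-force count over every ordered pair of cards.

```python
from fractions import Fraction
from itertools import permutations

# Standard deck: rank 1 stands for the ace
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]

# Multiplication rule: P(both aces) = P(second ace | first ace) * P(first ace)
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)   # one ace is already gone
p_both = p_second_ace_given_first * p_first_ace

# Brute-force check over all ordered two-card draws
draws = list(permutations(deck, 2))
both_aces = sum(1 for first, second in draws if first[0] == 1 and second[0] == 1)
p_both_counted = Fraction(both_aces, len(draws))

print(p_both, p_both_counted)  # 1/221 1/221
```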

Conditional probability:

P(A | B) = P(A and B) / P(B)

Read as: "probability of A, given that B has already occurred." The formula is defined only when P(B) > 0.
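The definition can be wrapped in a small helper. This is a sketch: the function name `conditional` and the decision to raise an error when P(B) = 0 are illustrative choices.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """Return P(A | B) = P(A and B) / P(B); undefined when P(B) is 0."""
    if p_b == 0:
        raise ValueError("P(A | B) is undefined when P(B) = 0")
    return p_a_and_b / p_b

# Fair die: A = "roll a 6", B = "roll an even number"
p_six_and_even = Fraction(1, 6)   # 6 is the only outcome in both events
p_even = Fraction(3, 6)
p_six_given_even = conditional(p_six_and_even, p_even)

print(p_six_given_even)  # 1/3
```

Knowing the roll is even raises the chance of a 6 from 1/6 to 1/3, because only three outcomes remain possible.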

Chain rule (extends to any number of events):

P(A, B, C) = P(A | B, C) · P(B | C) · P(C)
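The chain rule can be checked numerically on any joint distribution. The sketch below builds an arbitrary, made-up joint table over three binary variables and confirms that the three factors multiply back to the joint probability; the weights and the helper `marginal` are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over (a, b, c), weights 1..8 out of 36
states = list(product([0, 1], repeat=3))
joint = {s: Fraction(w, 36) for s, w in zip(states, range(1, 9))}

INDEX = {"a": 0, "b": 1, "c": 2}

def marginal(**fixed):
    """Sum the joint over every state that matches the fixed values."""
    return sum(
        p for s, p in joint.items()
        if all(s[INDEX[name]] == value for name, value in fixed.items())
    )

a, b, c = 1, 0, 1
p_c = marginal(c=c)
p_b_given_c = marginal(b=b, c=c) / p_c
p_a_given_bc = marginal(a=a, b=b, c=c) / marginal(b=b, c=c)

chain = p_a_given_bc * p_b_given_c * p_c
print(chain, joint[(a, b, c)])  # 1/6 1/6
```

The intermediate terms cancel in pairs, which is why the same pattern extends to any number of events.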

Key Properties / Rules

| Rule | Formula | When to use |
| --- | --- | --- |
| Complement | `P(not A) = 1 − P(A)` | The opposite event is easier to compute |
| Addition | `P(A∪B) = P(A)+P(B)−P(A∩B)` | "At least one of" |
| Multiplication | `P(A∩B) = P(A\|B)·P(B)` | "Both" happen |
| Conditional | `P(A\|B) = P(A∩B)/P(B)` | Given B, what about A? |
| Mutually exclusive | `P(A∩B) = 0` | Can't both happen |
| Mutually exclusive and exhaustive | `Σ P(Aᵢ) = 1` | The Aᵢ cover every possibility with no overlap |
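The complement rule is most useful for "at least one" questions. As an illustrative sketch, assuming four independent rolls of a fair die, the probability of at least one 6 can be computed directly by enumeration or through the complement "no 6 at all".

```python
from fractions import Fraction
from itertools import product

# Direct route: enumerate all 6**4 = 1296 outcomes of four die rolls
rolls = list(product(range(1, 7), repeat=4))
p_direct = Fraction(sum(1 for r in rolls if 6 in r), len(rolls))

# Complement route: 1 - P(no six in any of the four rolls)
p_no_six = Fraction(5, 6) ** 4
p_via_complement = 1 - p_no_six

print(p_direct, p_via_complement)  # 671/1296 671/1296
```

The complement route needs one multiplication, while the direct route has to inspect every outcome.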

Why It Matters

Language models generate text using conditional probabilities:

P(next token | all previous tokens)

At each generation step the model estimates this conditional distribution. By the chain rule, the probability of a whole sequence is the product of these per-token conditionals.
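A toy illustration of scoring a sequence with per-token conditionals is sketched below. The lookup table `next_token_probs` and all of its numbers are hypothetical; a real language model computes these distributions with a neural network rather than a table.

```python
# Hypothetical next-token probabilities, keyed by the full prefix so far
next_token_probs = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def sequence_probability(tokens):
    """Chain rule: multiply P(token | all previous tokens) across the sequence."""
    p = 1.0
    for i, token in enumerate(tokens):
        prefix = tuple(tokens[:i])
        p *= next_token_probs[prefix][token]
    return p

p_sentence = sequence_probability(["the", "cat", "sat"])
print(f"{p_sentence:.2f}")  # 0.21
```

In practice these products become very small for long sequences, so implementations usually sum log-probabilities instead of multiplying probabilities.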

Softmax turns the model's raw scores into a valid probability distribution: all values are ≥ 0 and they sum to 1, as the probability axioms require. Cross-entropy loss measures how far this distribution is from the true one.
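Both ideas fit in a few lines of plain Python. This is a minimal sketch using only the standard library; the logits and the one-hot target are made-up values, and production code would typically use a numerical library instead.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(true_dist, predicted):
    """Cross-entropy H(true, predicted) = -sum(t * log(q)), skipping t = 0 terms."""
    return -sum(t * math.log(q) for t, q in zip(true_dist, predicted) if t > 0)

logits = [2.0, 1.0, 0.1]                     # hypothetical model scores
probs = softmax(logits)
loss = cross_entropy([1.0, 0.0, 0.0], probs) # true class is index 0

print([round(q, 3) for q in probs])
print(f"sum = {sum(probs):.3f}, loss = {loss:.3f}")
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the correct class, so it shrinks toward 0 as that probability approaches 1.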

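Precision and recall are conditional probabilities with the conditioning reversed: precision is P(actually positive | predicted positive) and recall is P(predicted positive | actually positive). On imbalanced datasets the two can differ sharply, and changing the decision threshold to raise one often lowers the other.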
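Precision and recall can both be read as conditional probabilities estimated from counts. The sketch below uses hypothetical confusion-matrix counts for a rare positive class to show how far apart the two can be.

```python
# Hypothetical confusion-matrix counts for a rare positive class
tp, fp, fn, tn = 8, 12, 2, 978

precision = tp / (tp + fp)   # P(actually positive | predicted positive)
recall = tp / (tp + fn)      # P(predicted positive | actually positive)
base_rate = (tp + fn) / (tp + fp + fn + tn)   # P(actually positive)

print(f"precision = {precision:.2f}")  # 0.40
print(f"recall    = {recall:.2f}")     # 0.80
print(f"base rate = {base_rate:.2f}")  # 0.01
```

Both metrics share the same numerator (true positives) and differ only in which group forms the denominator.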

Common Pitfalls

Forgetting inclusion-exclusion. P(A or B) ≠ P(A) + P(B) unless A and B are mutually exclusive (can't both happen). Always subtract the overlap.
Conditioning doesn't mean causation. P(cancer | smoker) > P(cancer) tells you smoking correlates with cancer, not that any given person's cancer was caused by smoking.
Confusing P(A|B) with P(B|A). These are different quantities and are generally not equal. P(spam | word "free") ≠ P(word "free" | spam). Mixing them up leads to real errors in reasoning (see Bayes' theorem).
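The gap between P(A|B) and P(B|A) is easy to see with numbers. In the sketch below every probability is hypothetical, picked only to illustrate the spam example; the reversal uses nothing beyond the multiplication and addition rules.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration
p_spam = Fraction(1, 5)              # P(spam)
p_free_given_spam = Fraction(3, 5)   # P("free" | spam)
p_free_given_ham = Fraction(1, 20)   # P("free" | not spam)

# Multiplication rule on each branch, then add the two exclusive cases
p_free_and_spam = p_free_given_spam * p_spam
p_free_and_ham = p_free_given_ham * (1 - p_spam)
p_free = p_free_and_spam + p_free_and_ham

# Reverse the conditioning: P(spam | "free")
p_spam_given_free = p_free_and_spam / p_free

print(p_free_given_spam, p_spam_given_free)  # 3/5 3/4
```

Changing the assumed P(spam) changes P(spam | "free") while leaving P("free" | spam) untouched, which is why the two cannot be swapped.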

Examples

# Probability rules in Python
p_rain = 0.30
p_umbrella = 0.40
p_rain_and_umbrella = 0.20  # both happen on 20% of days

# Complement
p_no_rain = 1 - p_rain   # 0.70

# Addition rule (OR)
p_rain_or_umbrella = p_rain + p_umbrella - p_rain_and_umbrella   # 0.50

# Conditional probability: P(rain | umbrella)
p_rain_given_umbrella = p_rain_and_umbrella / p_umbrella   # 0.50

# Conditional probability: P(umbrella | rain)
p_umbrella_given_rain = p_rain_and_umbrella / p_rain   # 0.67

# Format to two decimals: in floating point, 0.30 + 0.40 - 0.20 is not exactly 0.5
print(f"P(no rain) = {p_no_rain:.2f}")
print(f"P(rain or umbrella) = {p_rain_or_umbrella:.2f}")
print(f"P(rain | umbrella) = {p_rain_given_umbrella:.2f}")
print(f"P(umbrella | rain) = {p_umbrella_given_rain:.2f}")
# Note: P(rain|umbrella) ≠ P(umbrella|rain)