Independence
Two events are independent if knowing that one happened tells you nothing about whether the other happened. Independence is the assumption that makes many ML algorithms tractable.
Intuition First
Flip two separate coins. Does the first coin landing heads tell you anything about whether the second coin will land heads? No — they're completely unrelated. That's independence.
Now consider: does knowing it's raining tell you anything about whether someone carries an umbrella? Yes — people are much more likely to carry umbrellas when it rains. These events are dependent.
The test is simple: does learning about one event change your prediction for the other?
What's Actually Happening
For independent events A and B:
- Knowing B happened doesn't change your estimate of A
- The probability of both happening is just the product of their individual probabilities
For dependent events:
- Knowing B shifts your belief about A
- The joint probability doesn't factor cleanly
Conditional independence is subtler: A and B might be dependent overall, but once you know C, they become independent. Example: "I got wet" and "I slipped" are dependent — rainy days cause both. But if you already know "it's raining," knowing I got wet tells you nothing extra about whether I slipped. They become conditionally independent given rain.
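The rain/wet/slip story can be put into numbers. The probabilities below are made up for illustration: given rain, P(wet) = 0.8 and P(slip) = 0.5, with the two independent within each weather condition by construction; given no rain, 0.1 and 0.05. Marginally the joint does not factor, but conditioned on rain it does:

```python
# Hypothetical numbers for the rain/wet/slip example (all assumed).
p_rain = 0.3

# Marginals via the law of total probability
p_wet = p_rain * 0.8 + (1 - p_rain) * 0.1                            # 0.31
p_slip = p_rain * 0.5 + (1 - p_rain) * 0.05                          # 0.185
p_wet_and_slip = p_rain * (0.8 * 0.5) + (1 - p_rain) * (0.1 * 0.05)  # 0.1235

print(p_wet * p_slip)    # 0.05735: marginally, the joint does NOT factor
print(p_wet_and_slip)    # 0.1235, so wet and slip are dependent overall
print(0.8 * 0.5)         # 0.4: but P(wet and slip | rain) = P(wet|rain) * P(slip|rain)
```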
Formal Explanation
Independence: A and B are independent if:
P(A and B) = P(A) · P(B)
Equivalently (when the conditioning event has positive probability):
P(A | B) = P(A) and P(B | A) = P(B)
Conditional independence: A is conditionally independent of B given C if:
P(A and B | C) = P(A | C) · P(B | C)
Notation: A ⊥ B | C
i.i.d. (independent and identically distributed): each sample xᵢ is drawn independently from the same distribution P(x). The joint probability of the full dataset factors as:
P(x₁, x₂, ..., xₙ) = P(x₁) · P(x₂) · ... · P(xₙ)
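One practical consequence of this factorization, sketched here with a standard normal as the assumed distribution: the raw product of per-sample densities underflows to zero for any realistic dataset size, so in practice the log-likelihood (a sum of logs) is used instead.

```python
import numpy as np

# Sketch: under i.i.d., the dataset likelihood is a product of per-sample
# densities. The raw product underflows for large n; the log-sum is stable.
def gauss_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                      # hypothetical i.i.d. sample

likelihood = np.prod(gauss_pdf(x))             # underflows to 0.0
log_likelihood = np.sum(np.log(gauss_pdf(x)))  # stable, around -1400
print(likelihood, log_likelihood)
```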
Key Properties / Rules
| Concept | Meaning |
|---|---|
| Independence test | P(A∩B) = P(A)·P(B) |
| Equivalent test | P(A\|B) = P(A) |
| Not the same as mutually exclusive | Mutually exclusive events are maximally dependent |
| Conditional independence | A ⊥ B \| C: independent once C is known |
| i.i.d. assumption | Each data point drawn independently from same distribution |
Important: mutually exclusive events (can't both happen) are NOT independent — if A happened, B definitely didn't, so knowing A completely determines B.
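A one-die example makes the distinction concrete: a single roll can't land on both 1 and 2, so the joint probability is zero, not the product independence would predict.

```python
# A die roll can't be both a 1 and a 2: mutually exclusive, not independent.
p_one, p_two = 1 / 6, 1 / 6
p_one_and_two = 0.0      # the events can't co-occur
print(p_one * p_two)     # ~0.0278: what independence would predict
print(p_one_and_two)     # 0.0, so the events are (maximally) dependent
```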
Why It Matters
Gradient descent computes gradients on mini-batches. The gradient estimate is only unbiased if the batch samples are independent. If they're correlated (e.g., consecutive frames in a video), the estimate is biased. This is why training data gets shuffled.
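A minimal sketch of why shuffling matters, using a toy dataset sorted by class label (the numbers are illustrative): a sequential mini-batch from the sorted data sees only one class, while a batch drawn after shuffling is a fair sample of the whole dataset.

```python
import numpy as np

# Toy dataset sorted by class label: sequential batches are biased.
labels = np.array([0] * 50 + [1] * 50)   # heavily correlated ordering
sequential_batch = labels[:10]           # all class 0

rng = np.random.default_rng(42)
shuffled = rng.permutation(labels)
shuffled_batch = shuffled[:10]           # typically mixes both classes

print(sequential_batch.mean())           # 0.0: not representative
print(shuffled_batch.mean())             # near 0.5 on average
```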
Naive Bayes classifiers assume all features are conditionally independent given the class. This makes the joint probability easy to compute:
P(features | class) = P(f₁|class) · P(f₂|class) · ... · P(fₙ|class)
It's "naive" because this is almost never exactly true — but it often works well anyway.
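A toy spam-filter step shows how the naive assumption turns the class likelihood into a simple product. All probabilities here are made up for illustration:

```python
import numpy as np

# Toy spam filter: P(word_i appears | class), for two words (assumed numbers).
p_word_given_spam = np.array([0.8, 0.6])
p_word_given_ham = np.array([0.1, 0.3])
p_spam, p_ham = 0.4, 0.6                 # class priors

features = np.array([1, 1])              # both words appear in the email

# Naive assumption: the likelihood factors across features
like_spam = np.prod(np.where(features == 1, p_word_given_spam, 1 - p_word_given_spam))
like_ham = np.prod(np.where(features == 1, p_word_given_ham, 1 - p_word_given_ham))

# Posterior via Bayes' rule
post_spam = like_spam * p_spam / (like_spam * p_spam + like_ham * p_ham)
print(round(post_spam, 3))   # 0.914
```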
Reinforcement learning (experience replay) stores transitions in a buffer and samples randomly. This breaks the temporal correlation between consecutive actions, making updates more stable.
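A minimal sketch of such a buffer, with all names illustrative: transitions arrive in temporal order, but training batches are drawn uniformly at random, so consecutive updates no longer see consecutive (correlated) transitions.

```python
import random
from collections import deque

# Minimal replay buffer sketch (names are illustrative, not a real RL library).
buffer = deque(maxlen=10_000)
for t in range(100):
    # (state, action, reward, next_state) placeholders in temporal order
    buffer.append((f"s{t}", f"a{t}", 0.0, f"s{t + 1}"))

random.seed(0)
batch = random.sample(list(buffer), k=8)   # non-consecutive transitions
print([s for s, _, _, _ in batch])
```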
Common Pitfalls
- Uncorrelated ≠ independent. Zero correlation only rules out a *linear* relationship, not any relationship. If X is symmetric around zero, X and X² are uncorrelated (Cov = 0) yet completely dependent: knowing X determines X² exactly.
- Mutually exclusive events are NOT independent. If P(A) > 0 and P(B) > 0, and A and B can't both happen, then P(A|B) = 0 ≠ P(A). They're maximally dependent.
- Independence is often an assumption, not a fact. Most real-world features are correlated. When we assume i.i.d., we're making an approximation that simplifies the math — often a useful one, but worth knowing it's an assumption.
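The first pitfall is easy to check numerically. For X symmetric around zero, Cov(X, X²) = E[X³] = 0 in theory, yet X² is a deterministic function of X:

```python
import numpy as np

# Uncorrelated but fully dependent: X uniform on [-1, 1] and Y = X².
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2

print(np.cov(x, y)[0, 1])      # very close to 0: uncorrelated
print(np.allclose(y, x ** 2))  # True: knowing x pins y down exactly
```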
Examples
```python
import numpy as np

# Test independence: P(A and B) == P(A) * P(B)?
p_A = 0.4
p_B = 0.3
p_A_and_B = 0.12  # exactly p_A * p_B → independent
print(f"P(A)·P(B) = {p_A * p_B}")         # ≈ 0.12
print(f"P(A and B) = {p_A_and_B}")        # 0.12
print(f"Independent: {np.isclose(p_A * p_B, p_A_and_B)}")  # True

# Dependent events
p_rain = 0.3
p_umbrella = 0.4
p_rain_and_umbrella = 0.20  # > 0.3 * 0.4 = 0.12 → dependent!
print(f"\nRain and umbrella independent? {np.isclose(p_rain * p_umbrella, p_rain_and_umbrella)}")

# i.i.d.: joint probability of a dataset factors as a product.
# Likelihood of observing [heads, tails, heads] from a fair coin:
p_heads = 0.5
observations = ['H', 'T', 'H']
# P(H,T,H) = P(H) * P(T) * P(H) because i.i.d.
p_joint = p_heads * (1 - p_heads) * p_heads
print(f"\nP(H,T,H) = {p_joint}")  # 0.125
```