Mnemosyne

Independence

Two events are independent if knowing one happened tells you nothing about whether the other happened. Independence is the assumption that makes many ML algorithms tractable.

Intuition First

Flip two separate coins. Does the first coin landing heads tell you anything about whether the second coin will land heads? No — they're completely unrelated. That's independence.

Now consider: does knowing it's raining tell you anything about whether someone carries an umbrella? Yes — people are much more likely to carry umbrellas when it rains. These events are dependent.

The test is simple: does learning about one event change your prediction for the other?
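A quick numeric check (a sketch with assumed variable names, not from the notes above): simulate many pairs of fair coin flips and compare P(second coin heads) with P(second coin heads | first coin heads). For independent events the two match.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
first = rng.integers(0, 2, n)      # 1 = heads, 0 = tails
second = rng.integers(0, 2, n)

print(second.mean())               # P(second heads) ≈ 0.5
print(second[first == 1].mean())   # P(second heads | first heads) ≈ 0.5
# Conditioning on the first coin changes nothing: independence.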


What's Actually Happening

For independent events A and B:

  • Knowing B happened doesn't change your estimate of A
  • The probability of both happening is just the product of their individual probabilities

For dependent events:

  • Knowing B shifts your belief about A
  • The joint probability doesn't factor cleanly

Conditional independence is subtler: A and B might be dependent overall, but once you know C, they become independent. Example: "I got wet" and "I slipped" are dependent — rainy days cause both. But if you already know "it's raining," knowing I got wet tells you nothing extra about whether I slipped. They become conditionally independent given rain.
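This is easy to verify by simulation. A minimal sketch (the probabilities below are invented for illustration): rain makes both "wet" and "slipped" more likely, and given the rain status the two are generated independently.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
rain = rng.random(n) < 0.3
# Given rain, wet and slip are generated independently (conditional independence)
wet = rng.random(n) < np.where(rain, 0.8, 0.1)
slip = rng.random(n) < np.where(rain, 0.4, 0.05)

print(slip.mean())              # P(slip): baseline
print(slip[wet].mean())         # P(slip | wet): higher, so dependent overall
print(slip[rain].mean())        # P(slip | rain)
print(slip[wet & rain].mean())  # ≈ P(slip | rain): wet adds nothing once rain is known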


Build the Idea Step-by-Step

  • Events A and B: the basic setting
  • Independent: P(A|B) = P(A)
  • Joint probability: P(A,B) = P(A)·P(B)
  • Conditional independence: A ⊥ B | C
  • i.i.d.: each sample is independent and drawn from the same distribution

Formal Explanation

Independence: A and B are independent if:

P(A and B) = P(A) · P(B)

Equivalently (assuming P(A) > 0 and P(B) > 0):

P(A | B) = P(A)
P(B | A) = P(B)

Conditional independence: A is conditionally independent of B given C if:

P(A and B | C) = P(A | C) · P(B | C)

Notation: A ⊥ B | C

i.i.d. (independent and identically distributed): each sample xᵢ is drawn independently from the same distribution P(x). The joint probability of the full dataset factors as:

P(x₁, x₂, ..., xₙ) = P(x₁) · P(x₂) · ... · P(xₙ)
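In practice this product is computed in log space, because multiplying many probabilities underflows to zero. A minimal sketch (assumed toy data):

import numpy as np

# Log-likelihood of i.i.d. coin flips: the log of the product is a sum of logs
p_heads = 0.5
observations = np.array([1, 0, 1, 1, 0])   # 1 = heads, 0 = tails
per_sample = np.where(observations == 1, p_heads, 1 - p_heads)
log_likelihood = np.log(per_sample).sum()
print(np.exp(log_likelihood))              # 0.5**5 = 0.03125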

Key Properties / Rules

  • Independence test: P(A∩B) = P(A)·P(B)
  • Equivalent test: P(A|B) = P(A)
  • Not the same as mutually exclusive: mutually exclusive events are maximally dependent
  • Conditional independence: A ⊥ B | C (independent once C is known)
  • i.i.d. assumption: each data point drawn independently from the same distribution

Important: mutually exclusive events (can't both happen) are NOT independent — if A happened, B definitely didn't, so knowing A completely determines B.
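A two-line check (illustrative numbers): with P(A) = 0.4, P(B) = 0.3, and P(A and B) = 0 because the events are mutually exclusive, the conditional probability collapses to zero.

p_A, p_B, p_A_and_B = 0.4, 0.3, 0.0
print(p_A_and_B / p_B)   # P(A|B) = 0.0, not 0.4: knowing B rules A out entirely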


Why It Matters

Gradient descent computes gradients on mini-batches. A mini-batch gradient is a good estimate of the full-data gradient only when the batch is sampled at random; a batch of correlated samples (e.g., consecutive frames of a video) reflects just that slice of the data and can point the update in the wrong direction. This is why training data gets shuffled, as the sketch below illustrates.
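One way to see the effect (a sketch with synthetic data; names and numbers are invented): sort a regression dataset by its feature so that consecutive samples are correlated, then compare the gradient from a contiguous batch against one from a random batch.

import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x = np.sort(rng.normal(size=n))   # sorted: consecutive samples are correlated
y = 3.0 * x + rng.normal(size=n)

def grad(xb, yb, w):
    # Gradient of mean squared error for the model y_hat = w * x
    return (2 * (w * xb - yb) * xb).mean()

w = 0.0
idx = rng.choice(n, 256, replace=False)
print(grad(x, y, w))              # full-data gradient, about -6
print(grad(x[:256], y[:256], w))  # contiguous slice: far off, sees only extreme x values
print(grad(x[idx], y[idx], w))    # random batch: close to the full gradient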

Naive Bayes classifiers assume all features are conditionally independent given the class. This makes the joint probability easy to compute:

P(features | class) = P(f₁|class) · P(f₂|class) · ... · P(fₙ|class)

It's "naive" because this is almost never exactly true — but it often works well anyway.

Reinforcement learning (experience replay) stores transitions in a buffer and samples randomly. This breaks the temporal correlation between consecutive actions, making updates more stable.
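A minimal replay-buffer sketch (the transition format is a placeholder): store transitions as they arrive, then sample uniformly at random so a batch mixes experiences from many different time steps.

import random
from collections import deque

buffer = deque(maxlen=10_000)
for t in range(1000):                     # stand-in for the agent's interaction loop
    buffer.append((t, f"state_{t}", f"action_{t % 4}"))

batch = random.sample(list(buffer), 32)   # uniform sampling breaks temporal order
print(sorted(t for t, _, _ in batch))     # time steps scattered across the buffer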


Common Pitfalls

  • Uncorrelated ≠ independent. Zero correlation rules out only a linear relationship, not any relationship. If X is symmetric around zero (e.g., standard normal), X and X² are uncorrelated (Cov = 0) yet completely dependent: knowing X determines X² exactly. See the sketch after this list.
  • Mutually exclusive events are NOT independent. If P(A) > 0 and P(B) > 0, and A and B can't both happen, then P(A|B) = 0 ≠ P(A). They're maximally dependent.
  • Independence is often an assumption, not a fact. Most real-world features are correlated. When we assume i.i.d., we're making an approximation that simplifies the math — often a useful one, but worth knowing it's an assumption.
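The first pitfall is easy to demonstrate (a sketch; the symmetric distribution is the key assumption):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)     # symmetric around zero
y = x ** 2                       # fully determined by x

print(np.corrcoef(x, y)[0, 1])   # ≈ 0: uncorrelated, yet y is a function of x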

Examples

import numpy as np

# Test independence: P(A and B) == P(A) * P(B)?
p_A = 0.4
p_B = 0.3
p_A_and_B = 0.12   # exactly p_A * p_B → independent

print(f"P(A)·P(B) = {p_A * p_B}")         # 0.12
print(f"P(A and B) = {p_A_and_B}")         # 0.12
print(f"Independent: {np.isclose(p_A * p_B, p_A_and_B)}")  # True

# Dependent events
p_rain = 0.3
p_umbrella = 0.4
p_rain_and_umbrella = 0.20  # > 0.3*0.4=0.12 → dependent!
print(f"\nRain and umbrella independent? {np.isclose(p_rain * p_umbrella, p_rain_and_umbrella)}")

# i.i.d.: joint probability of dataset factors as product
# Likelihood of observing [heads, tails, heads] from fair coin:
p_heads = 0.5
observations = ['H', 'T', 'H']
# P(H,T,H) = P(H)*P(T)*P(H) because i.i.d.
p_joint = p_heads * (1 - p_heads) * p_heads
print(f"\nP(H,T,H) = {p_joint}")   # 0.125

Review Questions