Expectation
The expectation of a random variable is its long-run average — the value you'd expect if you repeated the experiment many times. It's the foundation of loss functions, gradient estimates, and reasoning about model performance.
Intuition First
Roll a fair die 1 million times and average the results. You'd get very close to 3.5. Not because 3.5 is a possible outcome — it isn't — but because the values 1 through 6 balance out to 3.5 on average. That's the expectation.
The expectation (also called the mean or expected value) is the probability-weighted average of all possible values. It's the "center of mass" of a distribution.
What's Actually Happening
For a discrete random variable, you average each possible value weighted by its probability:
- A 6 only shows up 1/6 of the time, so it contributes 6 × (1/6) = 1
- A 1 also shows up 1/6 of the time, contributing 1 × (1/6) = 1/6
- Sum all contributions → 3.5
For continuous variables, the sum becomes an integral.
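As a quick sketch, the integral form can be checked numerically — here with `scipy.integrate.quad` and an example density, X ~ N(2, 1), where E[X] should come out to 2:

```python
import numpy as np
from scipy import integrate, stats

# E[X] for X ~ N(2, 1): integrate x · f(x) over the real line
mu, sigma = 2.0, 1.0
f = stats.norm(mu, sigma).pdf
E_X, _ = integrate.quad(lambda x: x * f(x), -np.inf, np.inf)
print(f"E[X] = {E_X:.4f}")  # 2.0000
```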
The most powerful property of expectation is linearity: the expectation of a sum equals the sum of expectations — always, even for dependent variables. This makes expectations easy to reason about in complex systems.
Build the Idea Step-by-Step
Formal Explanation
Discrete:
E[X] = Σₓ x · P(X = x)
Continuous:
E[X] = ∫ x · f(x) dx
Linearity of expectation (holds always, no independence required):
E[aX + b] = a · E[X] + b
E[X + Y] = E[X] + E[Y]
Expectation of a function of X:
E[g(X)] = Σₓ g(x) · P(X = x) (discrete)
E[g(X)] = ∫ g(x) · f(x) dx (continuous)
Note: E[g(X)] ≠ g(E[X]) in general — this is Jensen's inequality (for convex g, E[g(X)] ≥ g(E[X])).
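A quick numerical check of Jensen's direction, using the convex function exp and X ~ N(0, 1) — a case where the exact answer is known: E[exp(X)] is the lognormal mean e^(1/2) ≈ 1.649, while exp(E[X]) = e⁰ = 1.

```python
import numpy as np

# Jensen's inequality for convex g(x) = exp(x) with X ~ N(0, 1)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=1_000_000)
E_gX = np.exp(X).mean()   # E[g(X)] — exact value is e^(1/2) ≈ 1.649
g_EX = np.exp(X.mean())   # g(E[X]) — exact value is e^0 = 1
print(f"E[exp(X)] = {E_gX:.3f}")  # ≈ 1.649
print(f"exp(E[X]) = {g_EX:.3f}")  # ≈ 1.0
```

The gap E[g(X)] − g(E[X]) is strictly positive here, as Jensen's inequality predicts for a convex g.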
Key Properties / Rules
| Property | Formula | Notes |
|---|---|---|
| Definition (discrete) | Σ x·P(X=x) | Probability-weighted average |
| Definition (continuous) | ∫ x·f(x) dx | Integrate with density |
| Linearity (shift) | E[aX+b] = a·E[X]+b | Always holds |
| Linearity (sum) | E[X+Y] = E[X]+E[Y] | No independence needed |
| Constants | E[c] = c | Constants are certain |
| Normal | E[N(μ, σ²)] = μ | Mean is the parameter μ |
| Bernoulli | E[Bernoulli(p)] = p | The probability is the mean |
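The last two table rows can be cross-checked against `scipy.stats`, whose frozen distributions expose exact means (the parameter values 3.0, 2.0, and 0.3 below are just example choices):

```python
from scipy import stats

# E[N(μ, σ²)] = μ and E[Bernoulli(p)] = p, via scipy's exact means
print(stats.norm(3.0, 2.0).mean())   # 3.0 — loc parameter μ
print(stats.bernoulli(0.3).mean())   # 0.3 — success probability p
```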
Why It Matters
The training loss is an empirical expectation. When you average the loss across a mini-batch:
L = (1/n) Σᵢ ℓ(yᵢ, ŷᵢ) ≈ E_{(x,y)~data}[ℓ(y, ŷ)]
You're estimating the true expected loss over the data distribution. Gradient descent minimizes this expected loss.
Larger batches → better estimates of E[loss]. The sample mean converges to the true expectation as batch size grows. Smaller batches = noisier gradient estimates = higher variance updates.
Linearity is why batch averaging works: E[mean of batch gradients] = mean of E[per-sample gradients] = the true expected gradient. The batch estimate is unbiased — at any batch size — because of linearity.
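These two claims — unbiasedness at any batch size, lower variance at larger batch sizes — can be sketched with a toy model ŷ = w·x under squared loss (a hypothetical setup, not any particular framework's API):

```python
import numpy as np

# Mini-batch gradients as estimates of the full-data gradient
rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=10_000)
y = 3.0 * x + rng.normal(0, 0.1, size=10_000)
w = 1.0  # current parameter value

def grad(idx):
    # d/dw of the mean squared error over the indexed samples
    return (2 * x[idx] * (w * x[idx] - y[idx])).mean()

full_grad = grad(np.arange(len(x)))

means, stds = {}, {}
for batch_size in (8, 512):
    est = np.array([grad(rng.integers(0, len(x), size=batch_size))
                    for _ in range(2_000)])
    means[batch_size], stds[batch_size] = est.mean(), est.std()
    print(f"batch={batch_size}: mean grad {est.mean():+.3f} "
          f"(full-data {full_grad:+.3f}), std {est.std():.3f}")
```

Both batch sizes center on the full-data gradient; only the spread of the estimates differs.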
Common Pitfalls
- E[X²] ≠ (E[X])². The expectation of X squared is not the square of the expectation. Their difference is the variance: Var(X) = E[X²] − (E[X])².
- E[f(X)] ≠ f(E[X]). For non-linear functions, you can't just plug in the mean. For a fair coin (X ~ Bernoulli(1/2)), E[X²] = 1/2 but (E[X])² = 1/4. Jensen's inequality tells you which way the gap goes when f is convex or concave.
- The expectation might not be achievable. E[die roll] = 3.5 is never actually rolled. The expectation is a property of the distribution, not a typical outcome.
Examples
```python
import numpy as np

# Discrete: expected value of a fair die
values = np.arange(1, 7)
probs = np.ones(6) / 6
E_die = np.sum(values * probs)
print(f"E[die roll] = {E_die}")  # 3.5

# Verify with many samples
samples = np.random.randint(1, 7, size=100_000)
print(f"Empirical mean: {samples.mean():.3f}")  # ≈ 3.5

# Linearity: E[2X + 1] = 2·E[X] + 1
print(f"E[2X+1] via formula: {2 * E_die + 1}")  # 8.0
print(f"E[2X+1] empirical: {(2*samples+1).mean():.3f}")  # ≈ 8.0

# E[X + Y] = E[X] + E[Y] even when dependent
X = np.random.normal(0, 1, size=10_000)
Y = 2 * X + np.random.normal(0, 0.5, size=10_000)  # Y depends on X
print(f"\nE[X] = {X.mean():.3f}")  # ≈ 0
print(f"E[Y] = {Y.mean():.3f}")  # ≈ 0
print(f"E[X+Y] = {(X+Y).mean():.3f}")  # ≈ 0
print(f"E[X]+E[Y] = {X.mean()+Y.mean():.3f}")  # ≈ 0 — linearity holds even for dependent variables

# E[f(X)] ≠ f(E[X]) for non-linear f
X2 = np.random.normal(0, 1, size=100_000)
print(f"\nE[X²] = {(X2**2).mean():.3f}")  # ≈ 1 (variance of N(0,1))
print(f"(E[X])² = {X2.mean()**2:.6f}")  # ≈ 0 — very different!
```