Mnemosyne

Expectation

The expectation of a random variable is its long-run average — the value you'd expect if you repeated the experiment many times. It's the foundation of loss functions, gradient estimates, and reasoning about model performance.

Intuition First

Roll a fair die 1 million times and average the results. You'd get very close to 3.5. Not because 3.5 is a possible outcome (it isn't), but because the values 1 through 6 balance out to 3.5 on average. That's the expectation.

The expectation (also called the mean or expected value) is the probability-weighted average of all possible values. It's the "center of mass" of a distribution.


What's Actually Happening

For a discrete random variable, you average each possible value weighted by its probability:

  • A 6 only shows up 1/6 of the time, so it contributes 6 × (1/6) = 1
  • A 1 also shows up 1/6 of the time, contributing 1 × (1/6) = 1/6
  • Sum all contributions → 3.5

For continuous variables, the sum becomes an integral.
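The continuous case can be sketched with numerical integration of x · f(x). Here scipy's `quad` is used; the exponential distribution with rate λ = 2 is an arbitrary illustrative choice (its true mean is 1/λ = 0.5):

```python
import numpy as np
from scipy.integrate import quad

# E[X] for an exponential distribution with rate lambda = 2:
# density f(x) = lam * exp(-lam * x) on [0, inf); true mean is 1/lam = 0.5.
lam = 2.0
E_X, _ = quad(lambda x: x * lam * np.exp(-lam * x), 0, np.inf)
print(f"E[X] by integration: {E_X:.4f}")   # 0.5000
```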

The most powerful property of expectation is linearity: the expectation of a sum equals the sum of expectations — always, even for dependent variables. This makes expectations easy to reason about in complex systems.


Build the Idea Step-by-Step

1. Random variable X with outcomes and probabilities
2. Weight each value by its probability
3. Sum = E[X] (discrete) or integrate (continuous)
4. Linearity: E[X+Y] = E[X]+E[Y] always
5. Batch average ≈ E[loss]: the quantity gradient descent minimizes
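The last step can be sketched numerically: averaging per-example losses estimates the expected loss, and the estimate tightens as the batch grows. The squared-error loss, the noise level 0.5, and the constant prediction 0 below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Squared-error loss on noisy targets: y = 1 + noise, prediction y_hat = 0.
# True expected loss is E[(y - 0)^2] = 1 + Var(noise) = 1 + 0.25 = 1.25.
y = 1 + rng.normal(0, 0.5, size=100_000)
losses = (y - 0.0) ** 2

for n in [8, 256, 100_000]:
    print(f"batch size {n:>6}: mean loss = {losses[:n].mean():.3f}")
# The batch average approaches E[loss] = 1.25 as n grows.
```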

Formal Explanation

Discrete:

E[X] = Σₓ x · P(X = x)

Continuous:

E[X] = ∫ x · f(x) dx

Linearity of expectation (holds always, no independence required):

E[aX + b] = a · E[X] + b
E[X + Y]  = E[X] + E[Y]

Expectation of a function of X:

E[g(X)] = Σₓ g(x) · P(X = x)   (discrete)
E[g(X)] = ∫ g(x) · f(x) dx     (continuous)

Note: E[g(X)] ≠ g(E[X]) in general — this is Jensen's inequality (for convex g, E[g(X)] ≥ g(E[X])).
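A quick check of the Jensen direction, using the convex function exp. For X ~ N(0,1) the closed form E[e^X] = e^(1/2) ≈ 1.649 is known, so the gap is easy to see; the sample size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=200_000)

# g(x) = exp(x) is convex, so Jensen gives E[exp(X)] >= exp(E[X]).
# For X ~ N(0,1): E[exp(X)] = exp(1/2) ~ 1.649, while exp(E[X]) = exp(0) = 1.
print(f"E[exp(X)] ≈ {np.exp(X).mean():.3f}")   # ≈ 1.649
print(f"exp(E[X]) ≈ {np.exp(X.mean()):.3f}")   # ≈ 1.000
```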


Key Properties / Rules

  Property                  Formula                  Notes
  Definition (discrete)     E[X] = Σ x·P(X=x)        Probability-weighted average
  Definition (continuous)   E[X] = ∫ x·f(x) dx       Integrate with density
  Linearity (shift)         E[aX+b] = a·E[X]+b       Always holds
  Linearity (sum)           E[X+Y] = E[X]+E[Y]       No independence needed
  Constants                 E[c] = c                 Constants are certain
  Normal                    E[N(μ, σ²)] = μ          Mean is the parameter μ
  Bernoulli                 E[Bernoulli(p)] = p      The probability is the mean
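The Normal and Bernoulli rows can be verified with scipy.stats frozen distributions (the parameter values here are arbitrary):

```python
from scipy import stats

# Closed-form means match the table's entries.
print(stats.norm(loc=2.0, scale=3.0).mean())   # 2.0  (E[N(mu, sigma^2)] = mu)
print(stats.bernoulli(p=0.3).mean())           # 0.3  (E[Bernoulli(p)] = p)
```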

Why It Matters

The training loss is an empirical expectation. When you average the loss across a mini-batch:

L = (1/n) Σᵢ ℓ(yᵢ, ŷᵢ) ≈ E_{(x,y)~data}[ℓ(y, ŷ)]

You're estimating the true expected loss over the data distribution. Gradient descent minimizes this expected loss.

Larger batches → better estimates of E[loss]. The sample mean converges to the true expectation as batch size grows. Smaller batches = noisier gradient estimates = higher variance updates.

Linearity is why batch averaging works. E[mean of batch gradients] = mean of E[per-example gradients] = the true full-data gradient. The batch estimate is unbiased because of linearity.
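That unbiasedness claim can be sketched with a toy scalar model; the quadratic per-example loss, the batch size 32, and the data distribution below are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Scalar model: loss_i(w) = (w - y_i)^2, gradient_i = 2*(w - y_i).
# The full-data gradient is 2*(w - mean(y)); by linearity, a random
# mini-batch's average gradient has exactly this expectation.
y = rng.normal(3.0, 1.0, size=10_000)
w = 0.0
full_grad = 2 * (w - y.mean())

batch_grads = [2 * (w - rng.choice(y, size=32, replace=False).mean())
               for _ in range(2_000)]

print(f"full gradient:           {full_grad:.3f}")
print(f"mean of batch gradients: {np.mean(batch_grads):.3f}")  # ≈ full gradient
```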


Common Pitfalls

  • E[X²] ≠ (E[X])². The expectation of X squared is not the square of the expectation. Their difference is the variance: Var(X) = E[X²] − (E[X])².
  • E[f(X)] ≠ f(E[X]). For non-linear functions, you can't just plug in the mean. For a fair coin with values 0 and 1, E[X²] = 1/2 while (E[X])² = 1/4. Jensen's inequality pins down the direction of this gap for convex functions.
  • The expectation might not be achievable. E[die roll] = 3.5 is never actually rolled. The expectation is a property of the distribution, not a typical outcome.
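The first pitfall doubles as the variance identity Var(X) = E[X²] − (E[X])², which can be checked numerically; Uniform(0,1) is an arbitrary choice with known variance 1/12:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=500_000)

# Var(X) = E[X^2] - (E[X])^2; for Uniform(0,1) this is 1/3 - 1/4 = 1/12.
E_X2 = (X**2).mean()
E_X = X.mean()
print(f"E[X²] - (E[X])² = {E_X2 - E_X**2:.4f}")   # ≈ 0.0833 = 1/12
print(f"np.var(X)       = {X.var():.4f}")          # same quantity
```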

Examples

import numpy as np
from scipy import stats

# Discrete: expected value of a fair die
values = np.arange(1, 7)
probs = np.ones(6) / 6
E_die = np.sum(values * probs)
print(f"E[die roll] = {E_die}")    # 3.5

# Verify with many samples
samples = np.random.randint(1, 7, size=100_000)
print(f"Empirical mean: {samples.mean():.3f}")  # ≈ 3.5

# Linearity: E[2X + 1] = 2·E[X] + 1
print(f"E[2X+1] via formula: {2 * E_die + 1}")   # 8.0
print(f"E[2X+1] empirical:   {(2*samples+1).mean():.3f}")  # ≈ 8.0

# E[X + Y] = E[X] + E[Y] even when dependent
X = np.random.normal(0, 1, size=10_000)
Y = 2 * X + np.random.normal(0, 0.5, size=10_000)  # Y depends on X
print(f"\nE[X] = {X.mean():.3f}")    # ≈ 0
print(f"E[Y] = {Y.mean():.3f}")      # ≈ 0
print(f"E[X+Y] = {(X+Y).mean():.3f}")         # ≈ 0
print(f"E[X]+E[Y] = {X.mean()+Y.mean():.3f}") # ≈ 0 — linearity holds even for dependent!

# E[f(X)] ≠ f(E[X]) for non-linear f
X2 = np.random.normal(0, 1, size=100_000)
print(f"\nE[X²] = {(X2**2).mean():.3f}")    # ≈ 1 (variance of N(0,1))
print(f"(E[X])² = {X2.mean()**2:.6f}")      # ≈ 0 — very different!

Review Questions