Variance and Standard Deviation
Variance measures how spread out a distribution is — the average squared distance from the mean. Standard deviation is its square root, expressed in the same units as the data. These two quantities underlie batch normalization, weight initialization, and uncertainty quantification.
Intuition First
Two students both score an average of 70 on their exams.
- Student A: scored 68, 70, 72, 69, 71 — consistently near 70.
- Student B: scored 30, 100, 50, 90, 80 — wildly unpredictable.
The mean is the same. But something is clearly different: the spread. Student B's scores vary a lot; Student A's barely move.
Variance measures that spread. Standard deviation is the same thing, just in the original units (scores, dollars, meters) rather than squared units.
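A quick numeric check of the two students above (a sketch using NumPy's population variance):

```python
import numpy as np

# The two students' exam scores from above
student_a = np.array([68., 70., 72., 69., 71.])
student_b = np.array([30., 100., 50., 90., 80.])

# Identical means...
print(student_a.mean(), student_b.mean())   # 70.0 70.0

# ...but very different spread (population variance and std)
var_a, var_b = np.var(student_a), np.var(student_b)
print(var_a, var_b)                         # 2.0 680.0
print(np.sqrt(var_a), np.sqrt(var_b))       # ~1.41 vs ~26.08
```

The standard deviations (about 1.4 vs 26 points) are directly comparable to the scores themselves, which is why std is usually the more readable number.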
What's Actually Happening
Variance asks: on average, how far are values from the mean?
But you can't just average the differences from the mean — they cancel out (the positive and negative deviations balance to zero). So you square them first (making everything positive), average, then optionally take the square root to get back to the original scale.
The larger the variance, the more "spread out" the distribution is. Low variance = values cluster tightly. High variance = values scattered widely.
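Squaring before averaging is the whole trick; a minimal sketch:

```python
import numpy as np

x = np.array([2., 4., 6., 8.])
devs = x - x.mean()            # [-3., -1., 1., 3.]

# Raw deviations always cancel to zero, so their average is useless
print(devs.sum())              # 0.0

# Square first so every term is non-negative, then average
variance = (devs ** 2).mean()
std_dev = np.sqrt(variance)    # back in the original units
print(variance, std_dev)       # 5.0 ~2.24
```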
Build the Idea Step-by-Step
Formal Explanation
Variance:
Var(X) = E[(X − μ)²] where μ = E[X]
Equivalent form (often easier to compute):
Var(X) = E[X²] − (E[X])²
Standard deviation:
σ = √Var(X)
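A quick check that the definition and the shortcut form give the same number:

```python
import numpy as np

x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

var_def = ((x - x.mean()) ** 2).mean()        # E[(X - mu)^2]
var_shortcut = (x ** 2).mean() - x.mean()**2  # E[X^2] - (E[X])^2

print(var_def, var_shortcut)                  # 4.0 4.0
print(np.sqrt(var_def))                       # sigma = 2.0
```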
Key rules:
Var(aX + b) = a² · Var(X) (constants multiply as a², additive shifts don't matter)
Var(X + Y) = Var(X) + Var(Y) ONLY if X and Y are independent
Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y) (general case)
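The covariance term is easy to verify empirically; a sketch with two deliberately correlated variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Y shares a component with X, so Cov(X, Y) > 0
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

lhs = np.var(x + y)
cov_xy = np.cov(x, y, ddof=0)[0, 1]
rhs = np.var(x) + np.var(y) + 2 * cov_xy

print(lhs, rhs)                 # equal: the identity holds exactly for samples
print(np.var(x) + np.var(y))    # smaller: naive sum misses the covariance term
```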
Sample variance (from data, not a distribution):
s² = (1/(n−1)) · Σᵢ (xᵢ − x̄)²
The (n−1) instead of n corrects for bias (Bessel's correction).
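The bias is visible by simulation: draw many small samples from a distribution with known variance and compare the two estimators (a sketch; exact numbers vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 5, 50_000                          # many small samples, true var = 1

samples = rng.normal(size=(trials, n))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(biased)     # ≈ 0.8 = (n-1)/n of the truth: systematically low
print(unbiased)   # ≈ 1.0: unbiased
```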
Key Properties / Rules
| Property | Formula | Notes |
|---|---|---|
| Variance | E[(X−μ)²] = E[X²]−(E[X])² | Always ≥ 0 |
| Std deviation | σ = √Var(X) | Same units as X |
| Shift | Var(X+b) = Var(X) | Adding a constant changes nothing |
| Scale | Var(aX) = a²Var(X) | Scaling multiplies variance by a² |
| Sum (independent) | Var(X+Y) = Var(X)+Var(Y) | Only if independent |
| Normal | Var(N(μ,σ²)) = σ² | The second parameter is the variance directly |
| Bernoulli | Var(Bernoulli(p)) = p(1−p) | Maximum at p=0.5 |
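The distribution rows in the table can be sanity-checked by simulation (approximate, seed-dependent):

```python
import numpy as np

rng = np.random.default_rng(1)

# Normal: Var(N(mu, sigma^2)) = sigma^2, independent of mu
normal_var = np.var(rng.normal(loc=2.0, scale=3.0, size=200_000))
print(normal_var)                  # ≈ 9.0

# Bernoulli: Var = p(1 - p); compare simulated vs exact
p = 0.3
bern_var = np.var(rng.binomial(1, p, size=200_000))
print(bern_var, p * (1 - p))       # both ≈ 0.21
```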
Why It Matters
Batch normalization standardizes activations to have zero mean and unit variance:
x_norm = (x − mean) / std
This stabilizes training by ensuring activations don't drift. Without it, small variations in early layers get amplified through the network (exploding/vanishing activations).
Weight initialization (He, Xavier) carefully sets the variance of initial weights so that the variance of activations stays roughly constant across layers. Too large: activations explode. Too small: gradients vanish.
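A toy sketch of why the weight scale matters (the `relu_stack_var` helper and its scales are illustrative assumptions, not a real He/Xavier implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_stack_var(x, weight_std, n_layers=20, fan=256):
    """Push x through n_layers of random linear + ReLU layers and
    return the final activation variance (illustrative helper)."""
    for _ in range(n_layers):
        w = rng.normal(0.0, weight_std, size=(fan, fan))
        x = np.maximum(0.0, x @ w)
    return x.var()

x0 = rng.normal(size=(64, 256))

# He-style init: std = sqrt(2 / fan_in) keeps ReLU activation variance stable
he_var = relu_stack_var(x0, np.sqrt(2.0 / 256))
# Too-small init: variance collapses toward zero layer by layer
tiny_var = relu_stack_var(x0, 0.01)

print(he_var)    # stays on the order of 1 across 20 layers
print(tiny_var)  # vanishingly small
```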
Uncertainty quantification: a model that outputs both a mean and variance for its prediction is being explicit about its confidence. Low variance prediction = confident. High variance = uncertain.
The bias-variance tradeoff: model error = bias² + variance + noise. Increasing model capacity reduces bias (it can fit more patterns) but increases variance (it's more sensitive to which training data it saw). Regularization reduces variance at the cost of some bias.
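The variance half of the tradeoff can be sketched directly: refit models of different capacity on resampled training sets and measure how much the prediction at a fixed point moves (`prediction_spread` is an illustrative helper, not a full bias-variance decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)

def prediction_spread(degree, trials=300, n=20, x_test=0.5):
    """Variance of a degree-`degree` polynomial fit's prediction at
    x_test across freshly resampled noisy training sets."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.var(preds)

low = prediction_spread(degree=1)    # rigid model: predictions barely move
high = prediction_spread(degree=10)  # flexible model: predictions swing widely
print(low, high)
```

The high-degree fit has lower bias but its prediction at the same point varies far more from one training sample to the next, which is exactly the variance term in the error decomposition.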
Common Pitfalls
- Std deviation is not variance. Variance is in squared units (dollars²); std deviation is in original units (dollars). Always check which one is appropriate for a given context — std deviation is usually more interpretable.
- Var(X + Y) ≠ Var(X) + Var(Y) in general. This only holds when X and Y are independent. For correlated variables, the covariance term matters. Forgetting this underestimates total uncertainty in models with correlated errors.
- Sample variance uses n−1, not n. Dividing by n gives a biased underestimate. NumPy's `np.var()` uses n by default (population variance); pass `ddof=1` for the unbiased sample variance, or use `np.std(x, ddof=1)` for the sample standard deviation.
Examples
```python
import numpy as np

# Manual variance calculation
x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])
mu = x.mean()                       # 5.0
deviations = x - mu
squared_devs = deviations ** 2
variance = squared_devs.mean()      # population variance (divide by n)
std_dev = np.sqrt(variance)
print(f"Mean: {mu}")
print(f"Variance: {variance}")      # 4.0
print(f"Std deviation: {std_dev}")  # 2.0

# Using numpy
print(f"\nnp.var (population): {np.var(x)}")    # 4.0 (divides by n)
print(f"np.var (sample): {np.var(x, ddof=1)}")  # 4.571... (divides by n-1)
print(f"np.std (sample): {np.std(x, ddof=1)}")  # 2.138...

# Key rule: Var(aX + b) = a²·Var(X)
a, b = 3, 10
transformed = a * x + b
print(f"\nVar(X) = {np.var(x):.3f}")
print(f"Var(3X+10) = {np.var(transformed):.3f}")  # 9 * 4.0 = 36.0
print(f"a²·Var(X) = {a**2 * np.var(x):.3f}")      # 36.0 — matches

# Bernoulli variance: maximum at p = 0.5
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    var = p * (1 - p)
    print(f"Bernoulli(p={p}): Var = {var:.3f}")
# Maximum is at p=0.5: Var = 0.25

# Batch normalization example
activations = np.random.normal(loc=5, scale=3, size=128)
normalized = (activations - activations.mean()) / activations.std()
print(f"\nNormalized mean: {normalized.mean():.6f}")  # ≈ 0
print(f"Normalized std: {normalized.std():.6f}")      # ≈ 1
```