Variance and Standard Deviation
Variance measures how spread out a distribution is — the average squared distance from the mean. Standard deviation is its square root, expressed in the same units as the data. These two quantities underlie batch normalization, weight initialization, and uncertainty quantification.
Intuition First
Two students both score an average of 70 on their exams.
- Student A: scored 68, 70, 72, 69, 71 — consistently near 70.
- Student B: scored 30, 100, 50, 90, 80 — wildly unpredictable.
The mean is the same. But something is clearly different: the spread. Student B's scores vary a lot; Student A's barely move.
Variance measures that spread. Standard deviation is the same thing, just in the original units (scores, dollars, meters) rather than squared units.
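A quick numeric check of the two students above (a sketch using NumPy's population variance):

```python
import numpy as np

# The two students' exam scores from above
student_a = np.array([68., 70., 72., 69., 71.])
student_b = np.array([30., 100., 50., 90., 80.])

# Identical means...
print(student_a.mean(), student_b.mean())   # 70.0 70.0

# ...but very different spread (population variance and std)
var_a, var_b = np.var(student_a), np.var(student_b)
print(var_a, var_b)                         # 2.0 680.0
print(np.sqrt(var_a), np.sqrt(var_b))       # ~1.41 vs ~26.08
```

The standard deviations (about 1.4 vs 26 points) are directly comparable to the scores themselves, which is why std is usually the more readable number.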
What's Actually Happening
Variance asks: on average, how far are values from the mean?
But you can't just average the differences from the mean — they cancel out (the positive and negative deviations balance to zero). So you square them first (making everything positive), average, then optionally take the square root to get back to the original scale.
The larger the variance, the more "spread out" the distribution is. Low variance = values cluster tightly. High variance = values scattered widely.
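Squaring before averaging is the whole trick; a minimal sketch:

```python
import numpy as np

x = np.array([2., 4., 6., 8.])
devs = x - x.mean()            # [-3., -1., 1., 3.]

# Raw deviations always cancel to zero, so their average is useless
print(devs.sum())              # 0.0

# Square first so every term is non-negative, then average
variance = (devs ** 2).mean()
std_dev = np.sqrt(variance)    # back in the original units
print(variance, std_dev)       # 5.0 ~2.24
```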
Build the Idea Step-by-Step
Formal Explanation
Variance:
Var(X) = E[(X − μ)²] where μ = E[X]
Equivalent form (often easier to compute):
Var(X) = E[X²] − (E[X])²
Standard deviation:
σ = √Var(X)
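A quick check that the definition and the shortcut form give the same number:

```python
import numpy as np

x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

var_def = ((x - x.mean()) ** 2).mean()        # E[(X - mu)^2]
var_shortcut = (x ** 2).mean() - x.mean()**2  # E[X^2] - (E[X])^2

print(var_def, var_shortcut)                  # 4.0 4.0
print(np.sqrt(var_def))                       # sigma = 2.0
```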
Key rules:
Var(aX + b) = a² · Var(X) (constants multiply as a², additive shifts don't matter)
Var(X + Y) = Var(X) + Var(Y) ONLY if X and Y are independent
Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y) (general case)
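The covariance term is easy to verify empirically; a sketch with two deliberately correlated variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Y shares a component with X, so Cov(X, Y) > 0
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

lhs = np.var(x + y)
cov_xy = np.cov(x, y, ddof=0)[0, 1]
rhs = np.var(x) + np.var(y) + 2 * cov_xy

print(lhs, rhs)                 # equal: the identity holds exactly for samples
print(np.var(x) + np.var(y))    # smaller: naive sum misses the covariance term
```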
Sample variance (from data, not a distribution):
s² = (1/(n−1)) · Σᵢ (xᵢ − x̄)²
The (n−1) instead of n corrects for bias (Bessel's correction).
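The bias is visible by simulation: draw many small samples from a distribution with known variance and compare the two estimators (a sketch; exact numbers vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 5, 50_000                          # many small samples, true var = 1

samples = rng.normal(size=(trials, n))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(biased)     # ≈ 0.8 = (n-1)/n of the truth: systematically low
print(unbiased)   # ≈ 1.0: unbiased
```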
Key Properties / Rules
| Property | Formula | Notes |
|---|---|---|
| Variance | E[(X−μ)²] = E[X²]−(E[X])² | Always ≥ 0 |
| Std deviation | σ = √Var(X) | Same units as X |
| Shift | Var(X+b) = Var(X) | Adding a constant changes nothing |
| Scale | Var(aX) = a²Var(X) | Scaling multiplies variance by a² |
| Sum (independent) | Var(X+Y) = Var(X)+Var(Y) | Only if independent |
| Normal | Var(N(μ,σ²)) = σ² | The second parameter is the variance directly |
| Bernoulli | Var(Bernoulli(p)) = p(1−p) | Maximum at p=0.5 |
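The distribution rows in the table can be sanity-checked by simulation (approximate, seed-dependent):

```python
import numpy as np

rng = np.random.default_rng(1)

# Normal: Var(N(mu, sigma^2)) = sigma^2, independent of mu
normal_var = np.var(rng.normal(loc=2.0, scale=3.0, size=200_000))
print(normal_var)                  # ≈ 9.0

# Bernoulli: Var = p(1 - p); compare simulated vs exact
p = 0.3
bern_var = np.var(rng.binomial(1, p, size=200_000))
print(bern_var, p * (1 - p))       # both ≈ 0.21
```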
Why It Matters
Batch normalization standardizes activations to have zero mean and unit variance:
x_norm = (x − mean) / std
This stabilizes training by ensuring activations don't drift. Without it, small variations in early layers get amplified through the network (exploding/vanishing activations).
Weight initialization (He, Xavier) carefully sets the variance of initial weights so that the variance of activations stays roughly constant across layers. Too large: activations explode. Too small: gradients vanish.
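A toy sketch of why the weight scale matters (the `relu_stack_var` helper and its scales are illustrative assumptions, not a real He/Xavier implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_stack_var(x, weight_std, n_layers=20, fan=256):
    """Push x through n_layers of random linear + ReLU layers and
    return the final activation variance (illustrative helper)."""
    for _ in range(n_layers):
        w = rng.normal(0.0, weight_std, size=(fan, fan))
        x = np.maximum(0.0, x @ w)
    return x.var()

x0 = rng.normal(size=(64, 256))

# He-style init: std = sqrt(2 / fan_in) keeps ReLU activation variance stable
he_var = relu_stack_var(x0, np.sqrt(2.0 / 256))
# Too-small init: variance collapses toward zero layer by layer
tiny_var = relu_stack_var(x0, 0.01)

print(he_var)    # stays on the order of 1 across 20 layers
print(tiny_var)  # vanishingly small
```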
Uncertainty quantification: a model that outputs both a mean and variance for its prediction is being explicit about its confidence. Low variance prediction = confident. High variance = uncertain.
The bias-variance tradeoff: model error = bias² + variance + noise. Increasing model capacity reduces bias (it can fit more patterns) but increases variance (it's more sensitive to which training data it saw). Regularization reduces variance at the cost of some bias.
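The variance half of the tradeoff can be sketched directly: refit models of different capacity on resampled training sets and measure how much the prediction at a fixed point moves (`prediction_spread` is an illustrative helper, not a full bias-variance decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)

def prediction_spread(degree, trials=300, n=20, x_test=0.5):
    """Variance of a degree-`degree` polynomial fit's prediction at
    x_test across freshly resampled noisy training sets."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.var(preds)

low = prediction_spread(degree=1)    # rigid model: predictions barely move
high = prediction_spread(degree=10)  # flexible model: predictions swing widely
print(low, high)
```

The high-degree fit has lower bias but its prediction at the same point varies far more from one training sample to the next, which is exactly the variance term in the error decomposition.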
Common Pitfalls
- Std deviation is not variance. Variance is in squared units (dollars²); std deviation is in original units (dollars). Always check which one is appropriate for a given context — std deviation is usually more interpretable.
- Var(X + Y) ≠ Var(X) + Var(Y) in general. This only holds when X and Y are independent. For correlated variables, the covariance term matters. Forgetting this underestimates total uncertainty in models with correlated errors.
- Sample variance uses n−1, not n. Dividing by n gives a biased underestimate. NumPy's `np.var()` uses n by default (population variance); pass `ddof=1` for the unbiased sample variance, or use `np.std(x, ddof=1)` for the sample standard deviation.
Examples
```python
import numpy as np

# Manual variance calculation
x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])
mu = x.mean()                       # 5.0
deviations = x - mu
squared_devs = deviations ** 2
variance = squared_devs.mean()      # population variance (divide by n)
std_dev = np.sqrt(variance)
print(f"Mean: {mu}")
print(f"Variance: {variance}")      # 4.0
print(f"Std deviation: {std_dev}")  # 2.0

# Using numpy
print(f"\nnp.var (population): {np.var(x)}")    # 4.0 (divides by n)
print(f"np.var (sample): {np.var(x, ddof=1)}")  # 4.571... (divides by n-1)
print(f"np.std (sample): {np.std(x, ddof=1)}")  # 2.138...

# Key rule: Var(aX + b) = a²·Var(X)
a, b = 3, 10
transformed = a * x + b
print(f"\nVar(X) = {np.var(x):.3f}")
print(f"Var(3X+10) = {np.var(transformed):.3f}")  # 9 * 4.0 = 36.0
print(f"a²·Var(X) = {a**2 * np.var(x):.3f}")      # 36.0 — matches

# Bernoulli variance: maximum at p = 0.5
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    var = p * (1 - p)
    print(f"Bernoulli(p={p}): Var = {var:.3f}")
# Maximum is at p=0.5: Var = 0.25

# Batch normalization example
activations = np.random.normal(loc=5, scale=3, size=128)
normalized = (activations - activations.mean()) / activations.std()
print(f"\nNormalized mean: {normalized.mean():.6f}")  # ≈ 0
print(f"Normalized std: {normalized.std():.6f}")      # ≈ 1
```