Distributions
The Normal, Bernoulli, and Binomial distributions are three fundamental shapes of randomness. They appear constantly in ML — as model outputs, noise models, and the implicit assumption behind loss functions.
Intuition First
Not all randomness looks the same. Some things cluster around a middle value (heights of people). Some things are just yes or no (did the user click?). Some things count how many successes happen (how many emails in a batch are spam?).
Each of these has a natural mathematical shape — a distribution that describes the pattern.
Three distributions cover a huge portion of what you encounter in ML:
- Normal (Gaussian): the bell curve. Continuous values clustering around a mean.
- Bernoulli: a single coin flip. Binary outcome (0 or 1), where 1 occurs with probability p.
- Binomial: n independent coin flips. Counts how many came up heads.
What's Actually Happening
Normal Distribution
The bell curve. Most values cluster near the center; extreme values are rare. Described entirely by two numbers: mean μ (where it's centered) and standard deviation σ (how wide it is).
The 68-95-99.7 rule: 68% of values fall within 1σ of the mean, 95% within 2σ, 99.7% within 3σ.
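A quick numerical check of that rule, as a minimal sketch (the seed and sample count are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # standard Normal draws

for k in (1, 2, 3):
    within = np.mean(np.abs(samples) <= k)  # fraction within k standard deviations
    print(f"within {k} sigma: {within:.3f}")  # roughly 0.683, 0.954, 0.997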
Bernoulli Distribution
A single binary trial with probability p of success. Like flipping a coin: heads (1) with probability p, tails (0) with probability (1−p).
P(X = 1) = p
P(X = 0) = 1 − p
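A minimal simulation sketch (p = 0.3 is an arbitrary choice) showing that the empirical frequency of 1s matches p:
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
flips = rng.random(50_000) < p                # each trial is 1 with probability p
print(f"fraction of 1s: {flips.mean():.3f}")  # ≈ 0.3 = p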
Binomial Distribution
Flip n independent coins, each with probability p. Count how many land heads. The Binomial is the sum of n independent Bernoulli trials.
P(X = k) = C(n,k) · pᵏ · (1−p)^(n−k)
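The formula can be evaluated directly; here is a small sketch with arbitrary values n = 10, p = 0.7, k = 7 (the same numbers used in the Examples section below):
from math import comb

n, p, k = 10, 0.7, 7
pmf = comb(n, k) * p**k * (1 - p)**(n - k)  # C(n,k) · p^k · (1−p)^(n−k)
print(f"P(X = {k}) = {pmf:.4f}")            # ≈ 0.2668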
Build the Idea Step-by-Step
Formal Explanation
Normal distribution X ~ N(μ, σ²):
PDF: f(x) = (1/√(2πσ²)) · exp(−(x−μ)²/(2σ²))
Mean = μ, Variance = σ²
Standard Normal: N(0, 1) — zero mean, unit variance.
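A sketch that codes the PDF formula directly and checks it against scipy.stats.norm (the evaluation point x = 1 and parameters μ = 2, σ = 1.5 are arbitrary):
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    # f(x) = 1/√(2πσ²) · exp(−(x−μ)²/(2σ²))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(normal_pdf(1.0, mu=2.0, sigma=1.5))       # formula by hand
print(stats.norm(loc=2.0, scale=1.5).pdf(1.0))  # scipy gives the same value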
Bernoulli distribution X ~ Bernoulli(p):
PMF: P(X=1) = p, P(X=0) = 1−p
Mean = p, Variance = p(1−p)
Binomial distribution X ~ Binomial(n, p):
PMF: P(X=k) = C(n,k) · pᵏ · (1−p)^(n−k)
Mean = np, Variance = np(1−p)
Relationship: Binomial(n, p) is the sum of n independent Bernoulli(p) variables. As n→∞, it approximates a Normal distribution (Central Limit Theorem).
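A sketch of the Normal approximation: compare the Binomial PMF with a Normal PDF that has the same mean np and variance np(1−p) (p = 0.3 is an arbitrary choice):
import numpy as np
from scipy import stats

p = 0.3
for n in (10, 100, 1000):
    k = np.arange(n + 1)
    binom_pmf = stats.binom(n, p).pmf(k)
    normal_pdf = stats.norm(loc=n * p, scale=np.sqrt(n * p * (1 - p))).pdf(k)
    gap = np.abs(binom_pmf - normal_pdf).max()
    print(f"n={n:4d}  max |Binomial PMF − Normal PDF| = {gap:.5f}")
# The gap shrinks as n grows, as the Central Limit Theorem predicts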
Key Properties / Rules
| Distribution | Parameters | Support | Mean | Variance |
|---|---|---|---|---|
| Normal | μ, σ | ℝ (all reals) | μ | σ² |
| Bernoulli | p | {0, 1} | p | p(1−p) |
| Binomial | n, p | {0, 1, ..., n} | np | np(1−p) |
Why It Matters
Loss function choice is a distributional assumption:
| Model output | Distribution assumed | Loss function |
|---|---|---|
| Continuous value (regression) | Normal | MSE |
| Binary classification | Bernoulli | Binary cross-entropy |
| Multi-class classification | Categorical | Cross-entropy |
Using MSE for a binary classification task implicitly assumes a Gaussian output — which is wrong. The model can output values outside [0,1], and gradients behave poorly. Binary cross-entropy is correct because it assumes Bernoulli.
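A numerical sketch of that correspondence: binary cross-entropy is the negative log-likelihood of a Bernoulli model, and the Gaussian negative log-likelihood (unit variance) is MSE up to a scale and an additive constant (all numbers below are made up for illustration):
import numpy as np

# Binary labels and predicted probabilities (arbitrary illustrative values)
y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.8])
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(f"binary cross-entropy = Bernoulli NLL: {bce:.4f}")

# Regression targets and predictions (arbitrary illustrative values)
y_true = np.array([1.2, -0.3, 2.5])
y_pred = np.array([1.0, 0.1, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
gauss_nll = np.mean(0.5 * (y_true - y_pred) ** 2 + 0.5 * np.log(2 * np.pi))
print(f"MSE: {mse:.4f}, Gaussian NLL: {gauss_nll:.4f}")  # NLL = 0.5·MSE + 0.5·log(2π)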
Weight initialization often uses a Normal distribution: W ~ N(0, σ²). The variance σ² is carefully tuned (e.g., He init, Xavier init) so activations don't explode or vanish.
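A minimal sketch of the two common variance choices (the layer sizes are arbitrary, and exact conventions differ slightly across frameworks):
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out = 512, 256  # arbitrary layer dimensions

# He initialization (common with ReLU): Var = 2 / fan_in
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
# Xavier/Glorot initialization: Var = 2 / (fan_in + fan_out)
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

print(f"He std:     {W_he.std():.4f}  (target {np.sqrt(2.0 / fan_in):.4f})")
print(f"Xavier std: {W_xavier.std():.4f}  (target {np.sqrt(2.0 / (fan_in + fan_out)):.4f})")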
The Central Limit Theorem says the average of many independent random variables approaches a Normal distribution. This is why gradient estimates (averages over mini-batches) are well-behaved.
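A sketch of that effect: even if per-example gradients are strongly skewed, their mini-batch averages concentrate and look roughly Normal (the exponential noise model and batch size 64 are arbitrary):
import numpy as np

rng = np.random.default_rng(7)
# Pretend each per-example "gradient" is a skewed exponential draw with mean 1
per_example = rng.exponential(scale=1.0, size=(10_000, 64))  # 10,000 mini-batches of 64
batch_means = per_example.mean(axis=1)

print(f"mean of batch means: {batch_means.mean():.3f}")  # ≈ 1.0
print(f"std of batch means:  {batch_means.std():.3f}")   # ≈ 1/√64 = 0.125
within = np.mean(np.abs(batch_means - 1.0) <= batch_means.std())
print(f"fraction within 1 sigma: {within:.3f}")          # ≈ 0.68 if approximately Normal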
Common Pitfalls
- Normal distribution has infinite support. It can produce any real number, including negatives. Don't model probabilities (which must be in [0,1]) or counts (must be ≥ 0) with a Normal unless you're making a deliberate approximation.
- Bernoulli variance is maximized at p=0.5. The most uncertain case is a fair coin. Var(Bernoulli(p)) = p(1−p), which equals 0.25 at p=0.5 and 0 at p=0 or p=1.
- Binomial assumes independence. The n trials must be independent. If they're positively correlated (e.g., emails in the same batch coming from one spam campaign), the variance of the count is higher than np(1−p), so the Binomial model understates uncertainty; the sketch after this list illustrates the effect.
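A sketch of the independence pitfall: give the trials a shared latent success probability and the count's variance exceeds np(1−p) (the Beta(2, 2) latent model and n = 10 are arbitrary choices):
import numpy as np

rng = np.random.default_rng(3)
n, p, trials = 10, 0.5, 100_000

# Independent trials: Var(count) = n·p·(1−p) = 2.5
independent = rng.binomial(n=1, p=p, size=(trials, n)).sum(axis=1)

# Positively correlated trials: each group of n shares a latent success probability
shared_p = rng.beta(2, 2, size=(trials, 1))          # mean 0.5, varies group to group
correlated = (rng.random((trials, n)) < shared_p).sum(axis=1)

print(f"independent variance: {independent.var():.2f}")  # ≈ 2.5
print(f"correlated variance:  {correlated.var():.2f}")   # noticeably larger (about 7 here)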
Examples
import numpy as np
from scipy import stats
# Normal distribution
normal = stats.norm(loc=2.0, scale=1.5) # mean=2, std=1.5
print(f"P(X ≤ 2) = {normal.cdf(2):.4f}") # 0.5 — median = mean
print(f"P(0 ≤ X ≤ 4) = {normal.cdf(4) - normal.cdf(0):.4f}") # ≈ 0.817
print(f"68% rule: {normal.cdf(2+1.5) - normal.cdf(2-1.5):.4f}") # ≈ 0.683
# Bernoulli distribution
p = 0.7
bernoulli = stats.bernoulli(p)
print(f"\nBernoulli(p=0.7):")
print(f"P(X=1) = {bernoulli.pmf(1):.2f}") # 0.7
print(f"Mean = {bernoulli.mean():.2f}") # 0.7
print(f"Variance = {bernoulli.var():.4f}") # p*(1-p) = 0.21
# Binomial distribution: 10 coin flips with p=0.7
n, p = 10, 0.7
binom = stats.binom(n, p)
print(f"\nBinomial(n=10, p=0.7):")
print(f"P(X=7) = {binom.pmf(7):.4f}") # most likely outcome
print(f"Mean = {binom.mean()}") # n*p = 7.0
print(f"Variance = {binom.var()}") # n*p*(1-p) = 2.1
# Connection: Binomial = sum of Bernoullis
samples = np.random.binomial(n=1, p=0.7, size=(1000, 10)) # 1000 × 10 Bernoullis
binomial_samples = samples.sum(axis=1) # sum across 10 flips
print(f"\nEmpirical mean: {binomial_samples.mean():.2f}") # ≈ 7.0
print(f"Empirical std: {binomial_samples.std():.2f}") # ≈ √2.1 ≈ 1.45