Mnemosyne

Distributions

The Normal, Bernoulli, and Binomial distributions are three fundamental shapes of randomness. They appear constantly in ML — as model outputs, noise models, and the implicit assumption behind loss functions.

Intuition First

Not all randomness looks the same. Some things cluster around a middle value (heights of people). Some things are just yes or no (did the user click?). Some things count how many successes happen (how many emails in a batch are spam?).

Each of these has a natural mathematical shape — a distribution that describes the pattern.

Three distributions cover a huge portion of what you encounter in ML:

  • Normal (Gaussian): the bell curve. Continuous values clustering around a mean.
  • Bernoulli: a single coin flip. Binary outcome (0 or 1) with probability p.
  • Binomial: n independent coin flips. Counts how many came up heads.

What's Actually Happening

Normal Distribution

The bell curve. Most values cluster near the center; extreme values are rare. Described entirely by two numbers: mean μ (where it's centered) and standard deviation σ (how wide it is).

The 68-95-99.7 rule: 68% of values fall within 1σ of the mean, 95% within 2σ, 99.7% within 3σ.
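
A quick numeric check of the rule with scipy. Coverage within k·σ is the same for every Normal, so the standard Normal N(0, 1) suffices:

from scipy import stats

# Probability mass within k standard deviations of the mean, k = 1, 2, 3
z = stats.norm(0, 1)
for k in (1, 2, 3):
    print(f"within {k}σ: {z.cdf(k) - z.cdf(-k):.4f}")  # 0.6827, 0.9545, 0.9973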

Bernoulli Distribution

A single binary trial with probability p of success. Like flipping a coin: heads (1) with probability p, tails (0) with probability (1−p).

P(X = 1) = p
P(X = 0) = 1 − p

Binomial Distribution

Flip n independent coins, each with probability p. Count how many land heads. The Binomial is the sum of n independent Bernoulli trials.

P(X = k) = C(n,k) · pᵏ · (1−p)^(n−k)
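
The formula is easy to evaluate by hand. As a worked example (the values are illustrative), the chance of exactly 3 heads in 5 fair flips:

from math import comb

n, p, k = 5, 0.5, 3
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(prob)  # C(5,3) · 0.5³ · 0.5² = 10/32 = 0.3125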

Build the Idea Step-by-Step

  • Single yes/no outcome → Bernoulli(p)
  • n independent yes/no outcomes → Binomial(n, p)
  • Continuous values around a center → Normal(μ, σ)
  • Bernoulli output → binary cross-entropy loss
  • Normal output → MSE loss (the sketch after this list shows why these last two pairings hold)
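
Those last two pairings are not arbitrary: each loss is the negative log-likelihood of the matching distribution. A minimal sketch with illustrative values:

import numpy as np

y, y_hat = 1.0, 0.8  # true label, predicted probability (illustrative)

# Bernoulli negative log-likelihood is exactly binary cross-entropy:
# -log P(y | p) = -[y·log(p) + (1-y)·log(1-p)]
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(f"BCE: {bce:.4f}")  # -log(0.8) ≈ 0.2231

# Gaussian negative log-likelihood with fixed σ is MSE plus a constant:
# -log N(t | μ, σ²) = (t - μ)²/(2σ²) + log(σ√(2π))
t, mu = 3.0, 2.5  # target, prediction (illustrative)
print(f"Gaussian NLL term (σ=1): {(t - mu) ** 2 / 2:.4f}")  # 0.1250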

Formal Explanation

Normal distribution X ~ N(μ, σ²):

PDF: f(x) = (1/√(2πσ²)) · exp(−(x−μ)²/(2σ²))

Mean = μ,  Variance = σ²

Standard Normal: N(0, 1) — zero mean, unit variance.
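
The PDF formula can be checked directly against scipy (μ, σ, and x below are illustrative):

import numpy as np
from scipy import stats

mu, sigma, x = 2.0, 1.5, 3.0
by_hand = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(f"{by_hand:.6f}")                       # 0.212965
print(f"{stats.norm(mu, sigma).pdf(x):.6f}")  # same value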

Bernoulli distribution X ~ Bernoulli(p):

PMF: P(X=1) = p,  P(X=0) = 1−p

Mean = p,  Variance = p(1−p)
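
Both follow in one line from the definition. Since X takes only the values 0 and 1, X² = X, so E[X²] = E[X] = p:

E[X] = 1·p + 0·(1−p) = p
Var(X) = E[X²] − (E[X])² = p − p² = p(1−p)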

Binomial distribution X ~ Binomial(n, p):

PMF: P(X=k) = C(n,k) · pᵏ · (1−p)^(n−k)

Mean = np,  Variance = np(1−p)

Relationship: Binomial(n, p) is the sum of n independent Bernoulli(p) variables. For large n it is well approximated by a Normal with matching mean and variance, N(np, np(1−p)); this is a consequence of the Central Limit Theorem.
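
The sum structure is also where the moments come from: means and variances add across independent trials, giving np and np(1−p). The Normal approximation can be checked numerically (n and p here are illustrative):

from scipy import stats

n, p = 100, 0.3
binom = stats.binom(n, p)
approx = stats.norm(n * p, (n * p * (1 - p)) ** 0.5)  # match mean and std

# Compare P(X ≤ 35) under the exact Binomial and its Normal approximation
print(f"Binomial: {binom.cdf(35):.4f}")     # ≈ 0.884
print(f"Normal:   {approx.cdf(35.5):.4f}")  # ≈ 0.885 (continuity correction)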


Key Properties / Rules

Distribution   Parameters   Support            Mean   Variance
Normal         μ, σ         ℝ (all reals)      μ      σ²
Bernoulli      p            {0, 1}             p      p(1−p)
Binomial       n, p         {0, 1, ..., n}     np     np(1−p)

Why It Matters

Loss function choice is a distributional assumption:

Model output                    Distribution assumed   Loss function
Continuous value (regression)   Normal                 MSE
Binary classification           Bernoulli              Binary cross-entropy
Multi-class classification      Categorical            Cross-entropy

Using MSE for a binary classification task implicitly assumes a Gaussian output, which is wrong. A raw regression head can produce values outside [0, 1], and even with a sigmoid the gradients behave poorly. Binary cross-entropy is the right choice because it assumes a Bernoulli output.
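
A minimal numeric sketch of the gradient claim (the logit value is illustrative): with a sigmoid output, the MSE gradient with respect to the logit carries an extra σ(z)(1−σ(z)) factor, so it vanishes precisely when the model is confidently wrong; the BCE gradient does not:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

y = 1.0   # true label
z = -8.0  # logit: model is confidently wrong, σ(z) ≈ 0.0003
p = sigmoid(z)

# d/dz of MSE loss (p - y)² is 2(p - y)·σ'(z), with σ'(z) = p(1 - p)
grad_mse = 2 * (p - y) * p * (1 - p)
# d/dz of BCE loss is simply p - y (the sigmoid derivative cancels)
grad_bce = p - y

print(f"MSE gradient: {grad_mse:.6f}")  # ≈ -0.000670 (almost no learning signal)
print(f"BCE gradient: {grad_bce:.6f}")  # ≈ -0.999665 (strong corrective signal)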

Weight initialization often uses a Normal distribution: W ~ N(0, σ²). The variance σ² is carefully tuned (e.g., He init, Xavier init) so activations don't explode or vanish.
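
A sketch of why the tuning matters, using He initialization (σ² = 2/fan_in) for a ReLU layer; the layer sizes and seed are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512

x = rng.standard_normal((1000, fan_in))                      # unit-scale inputs
W = rng.normal(0.0, np.sqrt(2 / fan_in), (fan_in, fan_out))  # He init
pre = x @ W
h = np.maximum(pre, 0)                                       # ReLU

# He init gives pre-activation variance ≈ 2; ReLU zeroes half the signal,
# halving the second moment back to ≈ 1, so the scale carries forward.
print(f"pre-activation variance: {pre.var():.3f}")      # ≈ 2.0
print(f"post-ReLU second moment: {(h**2).mean():.3f}")  # ≈ 1.0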

The Central Limit Theorem says the average of many independent random variables approaches a Normal distribution. This is why gradient estimates (averages over mini-batches) are well-behaved.
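
A quick simulation of that claim: per-example values drawn from a heavily skewed distribution average out to something nearly symmetric at batch size 256 (the exponential distribution and the sizes are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10,000 "mini-batches" of 256 skewed per-example values each
per_example = rng.exponential(scale=1.0, size=(10_000, 256))
batch_means = per_example.mean(axis=1)

print(f"skewness of raw values:  {stats.skew(per_example.ravel()):.2f}")  # ≈ 2.0
print(f"skewness of batch means: {stats.skew(batch_means):.2f}")          # ≈ 0.12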


Common Pitfalls

  • Normal distribution has infinite support. It can produce any real number, including negatives. Don't model probabilities (which must be in [0,1]) or counts (must be ≥ 0) with a Normal unless you're making a deliberate approximation.
  • Bernoulli variance is maximized at p=0.5. The most uncertain case is a fair coin. Var(Bernoulli(p)) = p(1−p), which equals 0.25 at p=0.5 and 0 at p=0 or p=1.
  • Binomial assumes independence. The n trials must be independent. If they're positively correlated, the variance of the count is higher than np(1−p); the simulation below makes this concrete.
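
A simulation of the last pitfall (the construction is illustrative): duplicating one coin across half the slots keeps every marginal Bernoulli(0.5) but inflates the variance of the count well past np(1−p) = 2.5:

import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 0.5

# Independent trials: variance ≈ n·p·(1-p) = 2.5
indep = rng.binomial(1, p, size=(100_000, n)).sum(axis=1)
print(f"independent: {indep.var():.2f}")  # ≈ 2.5

# Correlated trials: one shared coin copied into 5 of the 10 slots
shared = rng.binomial(1, p, size=(100_000, 1))
rest = rng.binomial(1, p, size=(100_000, 5))
corr = np.concatenate([np.repeat(shared, 5, axis=1), rest], axis=1).sum(axis=1)
print(f"correlated:  {corr.var():.2f}")   # ≈ 7.5, three times larger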

Examples

import numpy as np
from scipy import stats

# Normal distribution
normal = stats.norm(loc=2.0, scale=1.5)  # mean=2, std=1.5

print(f"P(X ≤ 2) = {normal.cdf(2):.4f}")          # 0.5 — median = mean
print(f"P(0 ≤ X ≤ 4) = {normal.cdf(4) - normal.cdf(0):.4f}")  # ≈ 0.817
print(f"68% rule: {normal.cdf(2+1.5) - normal.cdf(2-1.5):.4f}")  # ≈ 0.683

# Bernoulli distribution
p = 0.7
bernoulli = stats.bernoulli(p)
print(f"\nBernoulli(p=0.7):")
print(f"P(X=1) = {bernoulli.pmf(1):.2f}")   # 0.7
print(f"Mean = {bernoulli.mean():.2f}")       # 0.7
print(f"Variance = {bernoulli.var():.4f}")    # p*(1-p) = 0.21

# Binomial distribution: 10 coin flips with p=0.7
n, p = 10, 0.7
binom = stats.binom(n, p)
print(f"\nBinomial(n=10, p=0.7):")
print(f"P(X=7) = {binom.pmf(7):.4f}")        # most likely outcome
print(f"Mean = {binom.mean()}")               # n*p = 7.0
print(f"Variance = {binom.var()}")            # n*p*(1-p) = 2.1

# Connection: Binomial = sum of Bernoullis
rng = np.random.default_rng(0)                               # seeded for reproducibility
samples = rng.binomial(n=1, p=0.7, size=(1000, 10))          # 1000 × 10 Bernoullis
binomial_samples = samples.sum(axis=1)                       # sum across 10 flips
print(f"\nEmpirical mean: {binomial_samples.mean():.2f}")    # ≈ 7.0
print(f"Empirical std:  {binomial_samples.std():.2f}")       # ≈ √2.1 ≈ 1.45

Review Questions