Random Variables
A random variable assigns a number to each outcome of a random process. It's the bridge between abstract probability and concrete calculations — and the foundation for every distribution, expectation, and loss function in ML.
Intuition First
You roll a die. The outcome is random — you don't know in advance what it'll be. But you can ask: "what's the probability it's a 4?" or "what number do I expect on average?"
A random variable is just a way of attaching numbers to random outcomes so you can do math on them. The die roll is a random variable X. It takes the values 1 through 6, each with probability 1/6.
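A quick way to see this: simulate many rolls and check that the empirical frequencies approach 1/6. A minimal sketch (the seed and roll count are arbitrary choices):

```python
import numpy as np

# Simulate the die-roll random variable X: 100,000 independent rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # uniform over {1, ..., 6}

# Empirical frequencies should approach the true P(X = k) = 1/6 ≈ 0.167.
for face in range(1, 7):
    print(f"P(X={face}) ≈ {(rolls == face).mean():.3f}")
```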
In machine learning, every prediction your model makes is a random variable. So is every label in the dataset, and every noise term in the data-generating process.
What's Actually Happening
A random variable X has a distribution — a description of which values it can take and how likely each one is.
There are two kinds:
Discrete random variables take a countable set of values (integers, categories). You describe them with a PMF (Probability Mass Function): P(X = x) gives the exact probability of each specific value.
Continuous random variables take any value in a range (temperature, height, a neural network's logit). You describe them with a PDF (Probability Density Function): P(X = x) is always 0 for any exact value — instead you ask for P(a ≤ X ≤ b), the area under the curve.
Both types have a CDF (Cumulative Distribution Function): F(x) = P(X ≤ x). It starts at 0, ends at 1, and never decreases.
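To make those CDF properties concrete, here's a minimal check using scipy (the standard Normal is just an illustrative choice):

```python
from scipy import stats

dist = stats.norm(0, 1)  # standard Normal, an arbitrary example

# F(x) = P(X <= x): near 0 far left, near 1 far right, never decreasing.
for x in [-10, -1, 0, 1, 10]:
    print(f"F({x:>3}) = {dist.cdf(x):.4f}")
```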
Build the Idea Step-by-Step
Formal Explanation
PMF (discrete): P(X = x) gives the probability of each exact value.
- Properties: P(X = x) ≥ 0, and Σ P(X = x) = 1
PDF (continuous): f(x) gives the density at x. Probability in an interval:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
- Properties: f(x) ≥ 0, and ∫ f(x) dx = 1
- Note: f(x) can exceed 1 (it's a density, not a probability)
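A quick numerical check of the interval formula, using scipy.integrate.quad (the standard Normal and the interval [-1, 1] are arbitrary choices):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

f = stats.norm(0, 1).pdf  # density of a standard Normal

# P(-1 <= X <= 1) is the area under f between -1 and 1.
area, _ = quad(f, -1, 1)
print(f"P(-1 <= X <= 1) = {area:.4f}")  # ≈ 0.6827

# The total area is 1, as required of any valid PDF.
total, _ = quad(f, -np.inf, np.inf)
print(f"Total area = {total:.4f}")  # 1.0000
```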
CDF (both types):
F(x) = P(X ≤ x)
- F(-∞) = 0, F(+∞) = 1, F is non-decreasing
- For continuous: F'(x) = f(x) (derivative of CDF = PDF)
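The derivative relationship can also be checked numerically with a finite difference (the evaluation point and step size below are arbitrary):

```python
from scipy import stats

dist = stats.norm(0, 1)
x, h = 0.5, 1e-6  # evaluation point and step size, both illustrative

# A central difference approximates F'(x); it should match the PDF f(x).
deriv = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)
print(f"F'({x}) ≈ {deriv:.4f}, f({x}) = {dist.pdf(x):.4f}")  # both ≈ 0.3521
```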
Key Properties / Rules
| Concept | Discrete | Continuous |
|---|---|---|
| Description | PMF: P(X=x) | PDF: f(x) |
| Probabilities | Exact values sum to 1 | Areas under curve sum to 1 |
| CDF | Sum up to x | Integral up to x |
| P(X = exact value) | Can be > 0 | Always = 0 |
| Example | Die roll, word token | Temperature, model logit |
Why It Matters
Model outputs are distributions, not just numbers. A classification model outputs P(class = k | input) — a discrete random variable over classes. A regression model with uncertainty outputs a continuous distribution. Understanding whether your model should output a PMF or a PDF determines your architecture and loss function.
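For instance, a classifier's softmax layer turns raw scores into exactly this kind of PMF over classes. A minimal sketch (the logits are made-up numbers):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])  # hypothetical raw scores for 3 classes

# Softmax produces a valid PMF: every entry >= 0 and the entries sum to 1.
pmf = np.exp(logits - logits.max())
pmf /= pmf.sum()
print(np.round(pmf, 3), pmf.sum())  # [0.786 0.175 0.039], sum ≈ 1.0
```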
Loss functions assume a distribution. MSE assumes Gaussian noise (continuous, symmetric). Binary cross-entropy assumes Bernoulli (discrete, 0 or 1). Cross-entropy assumes categorical (discrete, multiple classes). Using the wrong loss = wrong distributional assumption.
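To see that link concretely, binary cross-entropy is just the negative log-likelihood of a Bernoulli random variable. A sketch with made-up values for the label and the predicted probability:

```python
import numpy as np

y, p = 1.0, 0.9  # hypothetical label and predicted P(y = 1)

# Bernoulli likelihood of the label is p^y * (1 - p)^(1 - y);
# its negative log is exactly the binary cross-entropy loss.
nll = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"BCE = Bernoulli NLL = {nll:.4f}")  # ≈ 0.1054
```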
Sampling from a distribution is what text generation is. The model outputs a PMF over vocabulary; you sample a token from it. Temperature scaling stretches or sharpens that distribution before sampling.
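A sketch of that generation step, assuming a tiny made-up vocabulary and made-up logits (the names vocab and pmf_from_logits are illustrative, not a real library API):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat"]        # made-up 3-token vocabulary
logits = np.array([2.0, 1.0, 0.1])   # made-up model scores

def pmf_from_logits(logits, temperature=1.0):
    # Lower temperature sharpens the PMF; higher temperature flattens it.
    z = logits / temperature
    p = np.exp(z - z.max())
    return p / p.sum()

for t in [0.5, 1.0, 2.0]:
    p = pmf_from_logits(logits, t)
    token = rng.choice(vocab, p=p)  # sample one token from the PMF
    print(f"T={t}: pmf={np.round(p, 3)}, sampled '{token}'")
```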
Common Pitfalls
- PDF values can exceed 1. A PDF value of 3 at a point doesn't mean probability 3 — probability is the area under the curve. A Normal with mean 0 and standard deviation 0.1 has a peak density of about 4 at x=0, which is fine because the area under the whole curve still equals 1.
- P(X = exactly 2.71828) = 0 for continuous. For continuous random variables, any exact point has zero probability. You can only ask about intervals. This is why we use PDFs and integrals, not just lookups.
- Don't confuse a distribution with a sample. The distribution describes all possible outcomes. A sample is one specific realization. A model's output distribution is not the same as the single prediction you get from argmax (see the sketch after this list).
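A minimal illustration of that last pitfall, with a made-up three-class PMF:

```python
import numpy as np

rng = np.random.default_rng(1)
pmf = np.array([0.5, 0.3, 0.2])  # hypothetical output distribution

# argmax returns the same class every time; sampling reflects the whole PMF.
print("argmax:", pmf.argmax())                    # always 0
print("samples:", rng.choice(3, size=10, p=pmf))  # typically a mix of 0, 1, 2
```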
Examples
```python
import numpy as np
from scipy import stats

# Discrete random variable: fair die
values = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6
print(f"P(X=3) = {probs[2]:.4f}")        # 1/6 ≈ 0.167
print(f"Sum of PMF = {sum(probs):.4f}")  # 1.0000 — valid distribution

# CDF: P(X <= 4)
cdf_4 = sum(p for v, p in zip(values, probs) if v <= 4)
print(f"P(X <= 4) = {cdf_4:.4f}")  # 4/6 ≈ 0.667

# Continuous random variable: standard Normal
dist = stats.norm(loc=0, scale=1)
print(f"\nPDF at x=0: {dist.pdf(0):.4f}")  # ≈ 0.399 — a density, not a probability
print(f"PDF at x=0 (tight): {stats.norm(0, 0.1).pdf(0):.4f}")  # ≈ 3.99 — exceeds 1!

# P(−1 ≤ X ≤ 1) for standard Normal
p_within_1_std = dist.cdf(1) - dist.cdf(-1)
print(f"P(|X| ≤ 1) = {p_within_1_std:.4f}")  # ≈ 0.683 (68% rule)

# Sampling from a distribution
samples = dist.rvs(size=1000)
print(f"Sample mean: {samples.mean():.3f}")  # ≈ 0
print(f"Sample std: {samples.std():.3f}")    # ≈ 1
```