Random Variables
A random variable assigns a number to each outcome of a random process. It's the bridge between abstract probability and concrete calculations — and the foundation for every distribution, expectation, and loss function in ML.
Intuition First
You roll a die. The outcome is random — you don't know in advance what it'll be. But you can ask: "what's the probability it's a 4?" or "what number do I expect on average?"
A random variable is just a way of attaching numbers to random outcomes so you can do math on them. The die roll is a random variable X. It takes the values 1 through 6, each with probability 1/6.
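A quick way to see this: simulate many rolls and check that the empirical frequencies approach 1/6. A minimal sketch (the seed and roll count are arbitrary choices):

```python
import numpy as np

# Simulate the die-roll random variable X: 100,000 independent rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # uniform over {1, ..., 6}

# Empirical frequencies should approach the true P(X = k) = 1/6 ≈ 0.167.
for face in range(1, 7):
    print(f"P(X={face}) ≈ {(rolls == face).mean():.3f}")
```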
In machine learning, every prediction your model makes is a random variable. So is every label in the dataset, and every noise term in the data-generating process.
What's Actually Happening
A random variable X has a distribution — a description of which values it can take and how likely each one is.
There are two kinds:
Discrete random variables take a countable set of values (integers, categories). You describe them with a PMF (Probability Mass Function): P(X = x) gives the exact probability of each specific value.
Continuous random variables take any value in a range (temperature, height, a neural network's logit). You describe them with a PDF (Probability Density Function): P(X = x) is always 0 for any exact value — instead you ask for P(a ≤ X ≤ b), the area under the curve.
Both types have a CDF (Cumulative Distribution Function): F(x) = P(X ≤ x). It starts at 0, ends at 1, and never decreases.
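To make those CDF properties concrete, here's a minimal check using scipy (the standard Normal is just an illustrative choice):

```python
from scipy import stats

dist = stats.norm(0, 1)  # standard Normal, an arbitrary example

# F(x) = P(X <= x): near 0 far left, near 1 far right, never decreasing.
for x in [-10, -1, 0, 1, 10]:
    print(f"F({x:>3}) = {dist.cdf(x):.4f}")
```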
Build the Idea Step-by-Step
Formal Explanation
PMF (discrete): P(X = x) gives the probability of each exact value.
- Properties: P(X = x) ≥ 0, and Σ P(X = x) = 1
PDF (continuous): f(x) gives the density at x. Probability in an interval:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
- Properties: f(x) ≥ 0, and ∫ f(x) dx = 1
- Note: f(x) can exceed 1 (it's a density, not a probability)
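A quick numerical check of the interval formula, using scipy.integrate.quad (the standard Normal and the interval [-1, 1] are arbitrary choices):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

f = stats.norm(0, 1).pdf  # density of a standard Normal

# P(-1 <= X <= 1) is the area under f between -1 and 1.
area, _ = quad(f, -1, 1)
print(f"P(-1 <= X <= 1) = {area:.4f}")  # ≈ 0.6827

# The total area is 1, as required of any valid PDF.
total, _ = quad(f, -np.inf, np.inf)
print(f"Total area = {total:.4f}")  # 1.0000
```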
CDF (both types):
F(x) = P(X ≤ x)
- F(-∞) = 0, F(+∞) = 1, F is non-decreasing
- For continuous: F'(x) = f(x) (derivative of CDF = PDF)
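The derivative relationship can also be checked numerically with a finite difference (the evaluation point and step size below are arbitrary):

```python
from scipy import stats

dist = stats.norm(0, 1)
x, h = 0.5, 1e-6  # evaluation point and step size, both illustrative

# A central difference approximates F'(x); it should match the PDF f(x).
deriv = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)
print(f"F'({x}) ≈ {deriv:.4f}, f({x}) = {dist.pdf(x):.4f}")  # both ≈ 0.3521
```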
Key Properties / Rules
| Concept | Discrete | Continuous |
|---|---|---|
| Description | PMF: P(X=x) | PDF: f(x) |
| Probabilities | Exact values sum to 1 | Areas under curve sum to 1 |
| CDF | Sum up to x | Integral up to x |
| P(X = exact value) | Can be > 0 | Always = 0 |
| Example | Die roll, word token | Temperature, model logit |
Why It Matters
Model outputs are distributions, not just numbers. A classification model outputs P(class = k | input) — a discrete random variable over classes. A regression model with uncertainty outputs a continuous distribution. Understanding whether your model should output a PMF or a PDF determines your architecture and loss function.
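For instance, a classifier's softmax layer turns raw scores into exactly this kind of PMF over classes. A minimal sketch (the logits are made-up numbers):

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])  # hypothetical raw scores for 3 classes

# Softmax produces a valid PMF: every entry >= 0 and the entries sum to 1.
pmf = np.exp(logits - logits.max())
pmf /= pmf.sum()
print(np.round(pmf, 3), pmf.sum())  # [0.786 0.175 0.039], sum ≈ 1.0
```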
Loss functions assume a distribution. MSE assumes Gaussian noise (continuous, symmetric). Binary cross-entropy assumes Bernoulli (discrete, 0 or 1). Cross-entropy assumes categorical (discrete, multiple classes). Using the wrong loss = wrong distributional assumption.
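To see that link concretely, binary cross-entropy is just the negative log-likelihood of a Bernoulli random variable. A sketch with made-up values for the label and the predicted probability:

```python
import numpy as np

y, p = 1.0, 0.9  # hypothetical label and predicted P(y = 1)

# Bernoulli likelihood of the label is p^y * (1 - p)^(1 - y);
# its negative log is exactly the binary cross-entropy loss.
nll = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"BCE = Bernoulli NLL = {nll:.4f}")  # ≈ 0.1054
```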
Sampling from a distribution is what text generation is. The model outputs a PMF over vocabulary; you sample a token from it. Temperature scaling stretches or sharpens that distribution before sampling.
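A sketch of that generation step, assuming a tiny made-up vocabulary and made-up logits (the names vocab and pmf_from_logits are illustrative, not a real library API):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat"]        # made-up 3-token vocabulary
logits = np.array([2.0, 1.0, 0.1])   # made-up model scores

def pmf_from_logits(logits, temperature=1.0):
    # Lower temperature sharpens the PMF; higher temperature flattens it.
    z = logits / temperature
    p = np.exp(z - z.max())
    return p / p.sum()

for t in [0.5, 1.0, 2.0]:
    p = pmf_from_logits(logits, t)
    token = rng.choice(vocab, p=p)  # sample one token from the PMF
    print(f"T={t}: pmf={np.round(p, 3)}, sampled '{token}'")
```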
Common Pitfalls
- PDF values can exceed 1. A PDF value of 3 at a point doesn't mean probability 3 — probability is the area under the curve. A Normal with mean 0 and standard deviation 0.1 has a peak density of about 4 at x=0, which is fine because the area under the whole curve still equals 1.
- P(X = exactly 2.71828) = 0 for continuous. For continuous random variables, any exact point has zero probability. You can only ask about intervals. This is why we use PDFs and integrals, not just lookups.
- Don't confuse a distribution with a sample. The distribution describes all possible outcomes. A sample is one specific realization. A model's output distribution is not the same as the single prediction you get from argmax (see the sketch after this list).
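A minimal illustration of that last pitfall, with a made-up three-class PMF:

```python
import numpy as np

rng = np.random.default_rng(1)
pmf = np.array([0.5, 0.3, 0.2])  # hypothetical output distribution

# argmax returns the same class every time; sampling reflects the whole PMF.
print("argmax:", pmf.argmax())                    # always 0
print("samples:", rng.choice(3, size=10, p=pmf))  # typically a mix of 0, 1, 2
```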
Examples
```python
import numpy as np
from scipy import stats

# Discrete random variable: fair die
values = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6
print(f"P(X=3) = {probs[2]:.4f}")        # 1/6 ≈ 0.167
print(f"Sum of PMF = {sum(probs):.4f}")  # 1.0000 — valid distribution

# CDF: P(X <= 4)
cdf_4 = sum(p for v, p in zip(values, probs) if v <= 4)
print(f"P(X <= 4) = {cdf_4:.4f}")  # 4/6 ≈ 0.667

# Continuous random variable: standard Normal
dist = stats.norm(loc=0, scale=1)
print(f"\nPDF at x=0: {dist.pdf(0):.4f}")  # ≈ 0.399 — a density, not a probability
print(f"PDF at x=0 (tight): {stats.norm(0, 0.1).pdf(0):.4f}")  # ≈ 3.99 — exceeds 1!

# P(−1 ≤ X ≤ 1) for standard Normal
p_within_1_std = dist.cdf(1) - dist.cdf(-1)
print(f"P(|X| ≤ 1) = {p_within_1_std:.4f}")  # ≈ 0.683 (68% rule)

# Sampling from a distribution
samples = dist.rvs(size=1000)
print(f"Sample mean: {samples.mean():.3f}")  # ≈ 0
print(f"Sample std: {samples.std():.3f}")    # ≈ 1
```