Mnemosyne

Entropy

Entropy measures how unpredictable or surprising a probability distribution is. High entropy means high uncertainty — you can't easily guess what's coming next. It's the foundation for understanding why cross-entropy loss works.

Intuition First

Imagine a coin flip. Heads or tails — 50/50. You genuinely don't know what's coming. That's high entropy: maximum unpredictability.

Now imagine a biased coin that lands heads 99% of the time. You know what's coming before you flip it. That's low entropy: low unpredictability.

Entropy is a single number that captures how surprised you expect to be by the outcome of a random event.


What's Actually Happening

Think of entropy as the answer to: "How many yes/no questions do I need to identify the outcome?"

If there are 8 equally likely outcomes (like 8 letters), you need exactly 3 binary questions to narrow it down. Entropy = 3 bits.

If one outcome is almost certain (like a heavily biased coin with probabilities 0.99/0.01), you barely need to ask — entropy is nearly 0.

Key insight: rare events carry more information when they happen. If your model predicts "dog" with 1% probability and it turns out to be a dog — that's surprising. Surprising = informative = high information content.

The information content of a single outcome with probability p is:

-log₂(p)

When p = 0.5 → 1 bit. When p = 0.01 → ~6.6 bits. When p = 1.0 → 0 bits (no surprise).
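These values are easy to check directly with NumPy (a minimal sketch; the helper name `information_bits` is just for illustration):

```python
import numpy as np

def information_bits(p):
    """Information content -log2(p) of a single outcome with probability p."""
    return -np.log2(p)

print(information_bits(0.5))   # 1.0 bit — a fair coin flip
print(information_bits(0.01))  # ~6.64 bits — a rare, surprising event
print(information_bits(1.0))   # 0.0 bits — no surprise at all
```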

Entropy is the expected information content — the weighted average of how surprising each outcome is.


Build the Idea Step-by-Step

Each outcome has a probability p
Surprise of that outcome = -log(p) — rare events are more surprising
Entropy = average surprise = Σ p · (-log p)
High entropy → flat distribution → hard to predict; Low entropy → peaked distribution → easy to predict

Formal Explanation

For a discrete probability distribution over events x₁, x₂, ..., xₙ with probabilities p₁, p₂, ..., pₙ:

H(P) = -Σᵢ pᵢ · log(pᵢ)

Where:

  • log is typically base 2 (gives entropy in bits) or base e (gives nats, used in ML)
  • By convention: 0 · log(0) = 0 (zero-probability events contribute nothing)

Maximum entropy for n outcomes: all pᵢ = 1/n → H = log(n)

Minimum entropy: one pᵢ = 1, all others 0 → H = 0
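Both extremes can be verified numerically. A small sketch (the helper `entropy_nats` is illustrative and uses the 0 · log(0) = 0 convention by dropping zero-probability entries):

```python
import numpy as np

def entropy_nats(probs):
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]          # convention: 0 * log(0) = 0
    return -np.sum(nz * np.log(nz))

# Maximum entropy: uniform over n outcomes gives H = log(n)
n = 8
uniform = np.full(n, 1.0 / n)
assert np.isclose(entropy_nats(uniform), np.log(n))

# Minimum entropy: a one-hot distribution gives H = 0
one_hot = np.array([1.0, 0.0, 0.0, 0.0])
assert np.isclose(entropy_nats(one_hot), 0.0)
```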


Key Properties / Rules

| Property | Meaning |
|---|---|
| H(P) ≥ 0 | Entropy is always non-negative |
| H(P) = 0 | Perfectly certain — one outcome has probability 1 |
| H(P) = log(n) | Maximum — all n outcomes equally likely |
| H increases with uncertainty | Flatter distribution = higher entropy |
| Rare events have high surprise | -log₂(0.01) ≈ 6.6 bits vs -log₂(0.99) ≈ 0.01 bits |

Why It Matters

Cross-entropy loss (the standard loss for classification) is built on entropy. When you train a neural network to predict categories, you're minimizing the cross-entropy between the true distribution and the predicted distribution. Entropy is the lower bound you're trying to approach.

Language modeling is literally entropy minimization. A good language model assigns high probability to likely next tokens — which means low entropy in its predictions. Perplexity (a standard LM metric) is just 2^H with H measured in bits — entropy in disguise.
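The entropy–perplexity relationship can be sketched in a few lines (the helper `perplexity` is illustrative; it computes 2^H with H in bits):

```python
import numpy as np

def perplexity(probs):
    """Perplexity 2^H of a distribution, with entropy H in bits."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    h_bits = -np.sum(nz * np.log2(nz))
    return 2.0 ** h_bits

# A uniform distribution over n tokens has perplexity exactly n:
# the model is effectively "choosing among n options".
print(perplexity([0.25] * 4))                # 4.0
# A confident (peaked) model has perplexity close to 1.
print(perplexity([0.97, 0.01, 0.01, 0.01]))  # close to 1
```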

Compression. Shannon's source coding theorem says the minimum average bits needed to encode samples from a distribution is exactly the entropy. This is why ZIP files compress some files more than others — low-entropy files (repetitive data) compress better.
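A quick way to see this in practice is with Python's standard zlib: repetitive (low-entropy) bytes compress dramatically, while random (near-maximal-entropy) bytes barely compress at all.

```python
import os
import zlib

repetitive = b"abab" * 2500       # 10,000 bytes, highly predictable
random_data = os.urandom(10_000)  # 10,000 bytes, near-maximal entropy

# zlib cannot beat the entropy bound, so random data stays ~10,000 bytes
# while the repetitive data shrinks to a tiny fraction of its size.
print(len(zlib.compress(repetitive)))   # far below 10,000
print(len(zlib.compress(random_data)))  # roughly 10,000 or slightly more
```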


Common Pitfalls

  • log(0) is undefined — probability 0 would give infinite surprise. In practice, clamp probabilities away from 0 or add a small epsilon before computing log.
  • Entropy of a continuous distribution (differential entropy) can be negative — don't confuse it with discrete entropy, which is always ≥ 0.
  • High entropy ≠ bad. A model's output entropy should be low (confident predictions). But you might want high-entropy data (diverse training set). Context matters.

Examples

import numpy as np

def entropy(probs):
    probs = np.array(probs)
    # Clip to avoid log(0)
    probs = np.clip(probs, 1e-10, 1.0)
    return -np.sum(probs * np.log(probs))  # nats

# Uniform distribution over 4 classes: maximum entropy
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform entropy: {entropy(uniform):.4f} nats")   # ≈ 1.3863

# Peaked distribution: low entropy
peaked = [0.97, 0.01, 0.01, 0.01]
print(f"Peaked entropy:  {entropy(peaked):.4f} nats")    # ≈ 0.1677

# Certain: zero entropy
certain = [1.0, 0.0, 0.0, 0.0]
print(f"Certain entropy: {entropy(certain):.4f} nats")   # ≈ 0.0000

# Binary entropy — max at p=0.5
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    q = 1 - p
    h = -(p * np.log(p) + q * np.log(q))
    print(f"p={p:.1f}: H = {h:.4f} nats")
# p=0.1: H = 0.3251   (low — almost certain heads)
# p=0.5: H = 0.6931   (max — completely uncertain)
# p=0.9: H = 0.3251   (symmetric — same as p=0.1)

Review Questions