Entropy
Entropy measures how unpredictable or surprising a probability distribution is. High entropy means high uncertainty — you can't easily guess what's coming next. It's the foundation for understanding why cross-entropy loss works.
Intuition First
Imagine a coin flip. Heads or tails — 50/50. You genuinely don't know what's coming. That's high entropy: maximum unpredictability.
Now imagine a biased coin that lands heads 99% of the time. You're almost sure what's coming before you flip it. That's low entropy: barely any unpredictability.
Entropy is a single number that captures how surprised you expect to be by the outcome of a random event.
What's Actually Happening
Think of entropy as the answer to: "How many yes/no questions do I need to identify the outcome?"
If there are 8 equally likely outcomes (like 8 letters), you need exactly 3 binary questions to narrow it down. Entropy = 3 bits.
If one outcome is almost certain (like a heavily biased coin with probabilities 0.99/0.01), you barely need to ask — entropy is nearly 0.
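To sanity-check the question-counting picture, here is a minimal sketch (plain NumPy, with an illustrative uniform distribution over 8 outcomes) confirming that the entropy is exactly 3 bits:

```python
import numpy as np

# 8 equally likely outcomes, e.g. 8 letters you are trying to guess
p = np.full(8, 1 / 8)

# Entropy in bits = expected number of yes/no questions needed
print(-np.sum(p * np.log2(p)))  # 3.0
```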
Key insight: rare events carry more information when they happen. If your model predicts "dog" with 1% probability and it turns out to be a dog — that's surprising. Surprising = informative = high information content.
The information content of a single outcome with probability p is:
-log₂(p)
When p = 0.5 → 1 bit. When p = 0.01 → ~6.6 bits. When p = 1.0 → 0 bits (no surprise).
Entropy is the expected information content — the weighted average of how surprising each outcome is.
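The numbers above are easy to reproduce. The sketch below (plain NumPy, illustrative probabilities) computes the surprisal -log₂(p) for each case and then the expected surprisal for a 0.99/0.01 coin:

```python
import numpy as np

# Surprisal (information content) of a single outcome, in bits
for p in [0.5, 0.01, 1.0]:
    print(f"p={p}: {np.log2(1 / p):.2f} bits")  # 1.00, 6.64, 0.00

# Entropy = expected surprisal, e.g. for a 0.99/0.01 biased coin
coin = np.array([0.99, 0.01])
print(f"{-np.sum(coin * np.log2(coin)):.4f} bits")  # ~0.0808
```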
Build the Idea Step-by-Step
Formal Explanation
For a discrete probability distribution over events x₁, x₂, ..., xₙ with probabilities p₁, p₂, ..., pₙ:
H(P) = -Σᵢ pᵢ · log(pᵢ)
Where:
- log is typically base 2 (gives entropy in bits) or base e (gives nats, used in ML)
- By convention, 0 · log(0) = 0 (zero-probability events contribute nothing)
- Maximum entropy for n outcomes: all pᵢ = 1/n → H = log(n)
- Minimum entropy: one pᵢ = 1, all others 0 → H = 0
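To make the formula concrete, here is a small hand-checkable sketch (an illustrative three-outcome distribution, base-2 log), where the weighted sum works out to exactly 1.5 bits:

```python
import numpy as np

# H = -(0.5*log2(0.5) + 0.25*log2(0.25) + 0.25*log2(0.25))
#   =   0.5*1         + 0.25*2          + 0.25*2           = 1.5 bits
p = np.array([0.5, 0.25, 0.25])
print(-np.sum(p * np.log2(p)))  # 1.5
```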
Key Properties / Rules
| Property | Meaning |
|---|---|
| H(P) ≥ 0 | Entropy is always non-negative |
| H(P) = 0 | Perfectly certain — one outcome has probability 1 |
| H(P) = log(n) | Maximum — all n outcomes equally likely |
| H increases with uncertainty | Flatter distribution = higher entropy |
| Rare events have high surprise | -log₂(0.01) ≈ 6.6 bits vs -log₂(0.99) ≈ 0.01 bits |
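These properties are easy to verify numerically. The sketch below (NumPy, with illustrative four-outcome distributions) checks the two bounds and the "flatter means higher" behavior:

```python
import numpy as np

def H(p):
    """Entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(H([1.0, 0.0, 0.0, 0.0]))       # 0.0      (certain)
print(H([0.7, 0.1, 0.1, 0.1]))       # ~0.9404  (peaked)
print(H([0.4, 0.2, 0.2, 0.2]))       # ~1.3322  (flatter, higher)
print(H([0.25, 0.25, 0.25, 0.25]))   # ~1.3863  (uniform = log 4, maximum)
```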
Why It Matters
Cross-entropy loss (the standard loss for classification) is built on entropy. When you train a neural network to predict categories, you're minimizing the cross-entropy between the true distribution and the predicted distribution. Entropy is the lower bound you're trying to approach.
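As a rough illustration of that lower bound, the sketch below (NumPy, with a made-up one-hot target and two hypothetical model predictions) shows that cross-entropy is never smaller than the entropy of the true distribution, and equals it only when the prediction matches the target:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i), in nats
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(np.clip(q, 1e-10, 1.0)))

true = [0.0, 1.0, 0.0]      # one-hot "true" class
good = [0.05, 0.90, 0.05]   # confident, correct prediction
bad  = [0.60, 0.20, 0.20]   # spread-out, wrong prediction

print(cross_entropy(true, true))  # 0.0    (= entropy of a one-hot target)
print(cross_entropy(true, good))  # ~0.105
print(cross_entropy(true, bad))   # ~1.609
```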
Language modeling is literally entropy minimization. A good language model assigns high probability to likely next tokens — which means low entropy in its predictions. Perplexity (a standard LM metric) is just 2^H — entropy in disguise.
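As a minimal sketch of that relationship (illustrative 10,000-token vocabulary, base-2 entropy), a model that spreads probability uniformly has perplexity 10,000 while a perfectly confident one has perplexity 1:

```python
import numpy as np

def perplexity(probs):
    """Perplexity = 2^H, with H measured in bits."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    h_bits = -np.sum(probs * np.log2(probs))
    return 2 ** h_bits

vocab = 10_000
uniform = np.full(vocab, 1 / vocab)  # clueless model: every token equally likely
one_hot = np.zeros(vocab)
one_hot[42] = 1.0                    # perfectly confident model

print(perplexity(uniform))  # ~10000.0
print(perplexity(one_hot))  # ~1.0
```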
Compression. Shannon's source coding theorem says the minimum average bits needed to encode samples from a distribution is exactly the entropy. This is why ZIP files compress some files more than others — low-entropy files (repetitive data) compress better.
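One quick way to see this in practice is to compress a repetitive byte string and a random one of the same length with Python's standard-library zlib (a sketch; exact compressed sizes vary with the zlib version and settings):

```python
import os
import zlib

low_entropy = b"abab" * 25_000      # 100 kB of repetitive, low-entropy data
high_entropy = os.urandom(100_000)  # 100 kB of random, high-entropy data

print(len(zlib.compress(low_entropy)))   # a few hundred bytes at most
print(len(zlib.compress(high_entropy)))  # close to 100,000: barely compresses
```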
Common Pitfalls
- log(0) is undefined — probability 0 would give infinite surprise. In practice, clamp probabilities away from 0 or add a small epsilon before computing log.
- Entropy of a continuous distribution (differential entropy) can be negative — don't confuse it with discrete entropy, which is always ≥ 0.
- High entropy ≠ bad. A model's output entropy should be low (confident predictions). But you might want high-entropy data (diverse training set). Context matters.
Examples
```python
import numpy as np

def entropy(probs):
    probs = np.array(probs)
    # Clip to avoid log(0)
    probs = np.clip(probs, 1e-10, 1.0)
    return -np.sum(probs * np.log(probs))  # nats

# Uniform distribution over 4 classes: maximum entropy
uniform = [0.25, 0.25, 0.25, 0.25]
print(f"Uniform entropy: {entropy(uniform):.4f} nats")  # ≈ 1.3863

# Peaked distribution: low entropy
peaked = [0.97, 0.01, 0.01, 0.01]
print(f"Peaked entropy: {entropy(peaked):.4f} nats")  # ≈ 0.1677

# Certain: zero entropy
certain = [1.0, 0.0, 0.0, 0.0]
print(f"Certain entropy: {entropy(certain):.4f} nats")  # ≈ 0.0000

# Binary entropy — max at p=0.5
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    q = 1 - p
    h = -(p * np.log(p) + q * np.log(q))
    print(f"p={p:.1f}: H = {h:.4f} nats")

# p=0.1: H = 0.3251 (low — one outcome is nearly certain)
# p=0.5: H = 0.6931 (max — completely uncertain)
# p=0.9: H = 0.3251 (symmetric — same as p=0.1)
```