Cross-Entropy
Cross-entropy measures how well a predicted probability distribution matches the true distribution. It's the standard loss function for classification in neural networks — minimizing it teaches the model to assign high probability to correct answers.
Intuition First
You're grading a student's predictions. The student says "I think there's a 90% chance the answer is A." The real answer is A. Great — they were confident and correct.
Now another student says "I think there's a 10% chance it's A." Also correct, but they weren't confident. You want a grade that penalizes confident wrong answers harshly and rewards confident right answers.
Cross-entropy is that grade. It measures: if I coded outcomes using a codebook designed for distribution Q (the model's predictions), what's the actual average cost of encoding samples from distribution P (the true answers)?
What's Actually Happening
Entropy (H(P)) measures the minimum average bits needed to encode a source if you know its true distribution.
Cross-entropy H(P, Q) measures the average bits needed when you think the distribution is Q but it's actually P.
If Q is wrong — if the model is confidently predicting the wrong thing — you need extra bits to encode the truth. The gap is wasteful.
Cross-entropy = entropy + KL divergence
H(P, Q) = H(P) + KL(P ‖ Q)
Since entropy H(P) is fixed (it's the true distribution, not the model), minimizing cross-entropy is exactly the same as minimizing KL divergence — which means making Q as close to P as possible.
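This decomposition can be checked numerically. A small sketch with NumPy (the distributions here are invented for illustration):

```python
import numpy as np

# Example distributions (values invented for illustration)
p = np.array([0.7, 0.2, 0.1])  # true distribution P
q = np.array([0.5, 0.3, 0.2])  # model's predicted distribution Q

h_p = -np.sum(p * np.log(p))    # entropy H(P)
h_pq = -np.sum(p * np.log(q))   # cross-entropy H(P, Q)
kl = np.sum(p * np.log(p / q))  # KL(P ‖ Q)

print(h_pq, h_p + kl)  # identical: H(P, Q) = H(P) + KL(P ‖ Q)
```

Since H(P) doesn't depend on Q, every bit of progress on the cross-entropy comes from shrinking the KL term.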
Build the Idea Step-by-Step
Formal Explanation
For distributions P (true) and Q (predicted) over outcomes x:
H(P, Q) = -Σₓ P(x) · log Q(x)
In classification (one-hot P where only the true class yₜᵣᵤₑ has P = 1):
H(P, Q) = -log Q(yₜᵣᵤₑ)
This simplifies everything: you just take the negative log of the probability your model assigned to the correct class.
For binary classification (two classes, true label y ∈ {0, 1}, predicted probability ŷ):
BCE = -(y · log(ŷ) + (1 - y) · log(1 - ŷ))
This is called Binary Cross-Entropy (BCE). It's the standard loss for logistic regression and binary classifiers.
Key Properties / Rules
| Property | Meaning |
|---|---|
| H(P, Q) ≥ H(P) | Always ≥ true entropy — equality only when P = Q |
| H(P, Q) ≥ 0 | Always non-negative |
| Not symmetric | H(P, Q) ≠ H(Q, P) in general |
| Penalizes confident wrong answers harshly | -log(0.01) ≈ 4.6 — model was 99% wrong, huge loss |
| Rewards confident correct answers | -log(0.99) ≈ 0.01 — model was 99% right, tiny loss |
Why It Matters
Standard classification loss. Every PyTorch/TensorFlow classification model uses CrossEntropyLoss or BCELoss. Understanding what it measures tells you how to interpret your training curves.
Language model training. GPT models are trained on cross-entropy loss over next-token prediction. The model's job is to assign high probability to the actual next token. Minimizing this loss is equivalent to maximizing log-likelihood.
Perplexity is exp(average cross-entropy). A perplexity of 50 means the model is, on average, as confused as if it had to choose uniformly among 50 options. Lower perplexity = better language model.
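The relationship between average cross-entropy and perplexity can be sketched in a few lines (the per-token probabilities are invented for illustration):

```python
import numpy as np

# Probabilities a hypothetical language model assigned to each actual next token
token_probs = np.array([0.10, 0.02, 0.30, 0.05])

avg_ce = -np.mean(np.log(token_probs))  # average cross-entropy, in nats
perplexity = np.exp(avg_ce)
print(f"perplexity: {perplexity:.1f}")

# Sanity check: a uniform model over 50 tokens has perplexity 50
uniform = np.full(50, 1 / 50)
print(np.exp(-np.mean(np.log(uniform))))  # ≈ 50
```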
Soft labels. Cross-entropy works even when P isn't one-hot. Label smoothing replaces [1, 0, 0] with [0.9, 0.05, 0.05] — the model learns a smoother distribution, which improves calibration and generalization.
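Because the formula -Σ P(x) · log Q(x) never requires P to be one-hot, smoothed labels drop straight in. A minimal sketch (the smoothing value 0.1 and the prediction are arbitrary choices for illustration):

```python
import numpy as np

def ce(p_true, q_pred):
    """Cross-entropy H(P, Q) for arbitrary (not necessarily one-hot) P."""
    return -np.sum(np.array(p_true) * np.log(q_pred))

q_pred = np.array([0.9, 0.07, 0.03])  # model's predicted distribution

hard = [1.0, 0.0, 0.0]    # one-hot target: only log Q(class 0) matters
soft = [0.9, 0.05, 0.05]  # smoothed target (ε = 0.1, an arbitrary choice)

print(f"hard: {ce(hard, q_pred):.4f}")  # ≈ 0.1054
print(f"soft: {ce(soft, q_pred):.4f}")  # higher: soft targets also penalize near-zero mass on the other classes
```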
Common Pitfalls
- Predicted probability of 0 → infinite loss. log(0) is undefined. In practice, softmax outputs are never exactly 0, but numerical instability is real. PyTorch's CrossEntropyLoss uses log-sum-exp internally to avoid this.
- Using sigmoid + BCE vs softmax + CE. For multi-class (exactly one class), use softmax + cross-entropy. For multi-label (multiple classes can be true), use sigmoid + BCE per class.
- Cross-entropy is not symmetric. H(P, Q) ≠ H(Q, P). The "true" distribution is always the first argument. Don't swap them.
- Loss going to 0 doesn't mean 100% accuracy. Cross-entropy minimizes to H(P) — if your labels have noise, the minimum is nonzero.
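The last pitfall can be verified directly: with noisy labels, the best possible Q is P itself, and the loss bottoms out at H(P), not 0. A sketch assuming a made-up 10% label-noise rate:

```python
import numpy as np

# Assume 10% of the labels are flipped: the true distribution is soft, not one-hot
p = np.array([0.9, 0.1])

def ce(q):
    """Cross-entropy of prediction q against the noisy true distribution p."""
    return -np.sum(p * np.log(q))

h_p = -np.sum(p * np.log(p))  # H(P) ≈ 0.3251, the floor the loss cannot go below

print(ce(p))                       # equals H(P): the best any model can do
print(ce(np.array([0.99, 0.01])))  # overconfident: worse than H(P)
print(ce(np.array([0.80, 0.20])))  # underconfident: also worse
```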
Examples
```python
import numpy as np

def cross_entropy(p_true, q_pred):
    """Cross-entropy H(P, Q) = -Σ P(x) · log Q(x)."""
    q_pred = np.clip(q_pred, 1e-10, 1.0)  # avoid log(0)
    return -np.sum(np.array(p_true) * np.log(q_pred))

# One-hot true label: class 0 is correct
p_true = [1.0, 0.0, 0.0]

# Good prediction: model is confident and correct
q_good = [0.9, 0.07, 0.03]
print(f"Good prediction loss: {cross_entropy(p_true, q_good):.4f}")  # ≈ 0.1054

# Bad prediction: model is confident but wrong
q_bad = [0.05, 0.90, 0.05]
print(f"Bad prediction loss: {cross_entropy(p_true, q_bad):.4f}")  # ≈ 2.9957

# Uniform: maximum confusion
q_uniform = [1/3, 1/3, 1/3]
print(f"Uniform prediction: {cross_entropy(p_true, q_uniform):.4f}")  # ≈ 1.0986 = ln(3)

# Binary cross-entropy
def bce(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)  # avoid log(0) on both sides
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(f"\nBCE (y=1, ŷ=0.95): {bce(1, 0.95):.4f}")  # ≈ 0.0513 — great prediction
print(f"BCE (y=1, ŷ=0.50): {bce(1, 0.50):.4f}")  # ≈ 0.6931 — uncertain
print(f"BCE (y=1, ŷ=0.05): {bce(1, 0.05):.4f}")  # ≈ 2.9957 — confidently wrong
```

```python
# In PyTorch
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw model outputs (before softmax)
target = torch.tensor([0])                 # correct class is index 0

loss = F.cross_entropy(logits, target)
print(f"\nPyTorch CrossEntropyLoss: {loss.item():.4f}")  # ≈ 0.2413 — handles softmax internally
```