
Cross-Entropy

Cross-entropy measures how well a predicted probability distribution matches the true distribution. It's the standard loss function for classification in neural networks — minimizing it teaches the model to assign high probability to correct answers.

Intuition First

You're grading a student's predictions. The student says "I think there's a 90% chance the answer is A." The real answer is A. Great — they were confident and correct.

Now another student says "I think there's a 10% chance it's A." The answer is still A, but they put little confidence in it. You want a grade that rewards confident correct predictions and harshly penalizes confident wrong ones.

Cross-entropy is that grade. It measures: if I coded outcomes using a codebook designed for distribution Q (the model's predictions), what's the actual average cost of encoding samples from distribution P (the true answers)?


What's Actually Happening

Entropy (H(P)) measures the minimum average bits needed to encode a source if you know its true distribution.

Cross-entropy H(P, Q) measures the average bits needed when you think the distribution is Q but it's actually P.

If Q is wrong — if the model is confidently predicting the wrong thing — you need extra bits to encode the truth. The gap is wasteful.

Cross-entropy = entropy + KL divergence

H(P, Q) = H(P) + KL(P ‖ Q)

Since entropy H(P) is fixed (it's the true distribution, not the model), minimizing cross-entropy is exactly the same as minimizing KL divergence — which means making Q as close to P as possible.
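
This decomposition is easy to check numerically. A minimal sketch, using made-up distributions P and Q (natural log, as in the examples later in this note):

import numpy as np

# Made-up distributions; P is not one-hot, so H(P) > 0
P = np.array([0.7, 0.2, 0.1])   # true distribution
Q = np.array([0.5, 0.3, 0.2])   # model's prediction

entropy = -np.sum(P * np.log(P))        # H(P) ≈ 0.802
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q) ≈ 0.887
kl = np.sum(P * np.log(P / Q))          # KL(P ‖ Q) ≈ 0.085

print(np.isclose(cross_entropy, entropy + kl))  # True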


Build the Idea Step-by-Step

True distribution P: [1, 0, 0] — the correct class is 'cat'
Predicted distribution Q: [0.7, 0.2, 0.1] — model's softmax output
Per-class cost: -P(x) · log Q(x) — only non-zero for true class
Cross-entropy loss = -log Q(correct class) = -log(0.7) ≈ 0.357
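
Those per-class costs can be checked directly; a small sketch of the numbers above:

import numpy as np

p_true = np.array([1.0, 0.0, 0.0])   # true class: 'cat'
q_pred = np.array([0.7, 0.2, 0.1])   # model's softmax output

per_class_cost = -p_true * np.log(q_pred)
print(per_class_cost)         # ≈ [0.357, 0, 0]; only the true class contributes
print(per_class_cost.sum())   # ≈ 0.357, i.e. -log(0.7)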

Formal Explanation

For distributions P (true) and Q (predicted) over outcomes x:

H(P, Q) = -Σₓ P(x) · log Q(x)

In classification (one-hot P where only the true class yₜᵣᵤₑ has P = 1):

H(P, Q) = -log Q(yₜᵣᵤₑ)

This simplifies everything: you just take the negative log of the probability your model assigned to the correct class.

For binary classification (two classes, true label y ∈ {0, 1}, predicted probability ŷ for the positive class):

BCE = -(y · log(ŷ) + (1 - y) · log(1 - ŷ))

This is called Binary Cross-Entropy (BCE). It's the standard loss for logistic regression and binary classifiers.
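
BCE is the same general formula applied to the two-outcome distributions P = [y, 1 - y] and Q = [ŷ, 1 - ŷ]. A quick sketch of that equivalence, with a made-up label and prediction:

import numpy as np

y, y_hat = 1.0, 0.8   # made-up label and predicted probability

bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
general = -np.sum(np.array([y, 1 - y]) * np.log([y_hat, 1 - y_hat]))
print(bce, general)   # both ≈ 0.2231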


Key Properties / Rules

  • H(P, Q) ≥ H(P): always at least the true entropy; equality only when Q = P
  • H(P, Q) ≥ 0: always non-negative
  • Not symmetric: H(P, Q) ≠ H(Q, P) in general
  • Penalizes confident wrong answers harshly: -log(0.01) ≈ 4.6 when the model gave the correct class only 1% probability
  • Rewards confident correct answers: -log(0.99) ≈ 0.01 when the model gave the correct class 99% probability
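
The asymmetry is worth seeing once with numbers; a minimal sketch with made-up distributions:

import numpy as np

P = np.array([0.9, 0.1])
Q = np.array([0.6, 0.4])

print(-np.sum(P * np.log(Q)))   # H(P, Q) ≈ 0.551
print(-np.sum(Q * np.log(P)))   # H(Q, P) ≈ 0.984, a different number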

Why It Matters

Standard classification loss. Every PyTorch/TensorFlow classification model uses CrossEntropyLoss or BCELoss. Understanding what it measures tells you how to interpret your training curves.

Language model training. GPT models are trained on cross-entropy loss over next-token prediction. The model's job is to assign high probability to the actual next token. Minimizing this loss is equivalent to maximizing log-likelihood.

Perplexity is exp(average cross-entropy). A perplexity of 50 means the model is, on average, as confused as if it had to choose uniformly among 50 options. Lower perplexity = better language model.
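
A sketch of that relationship, using made-up probabilities that a model might assign to each actual next token:

import numpy as np

# Made-up probabilities assigned to the true next token at four positions
token_probs = np.array([0.10, 0.02, 0.30, 0.05])

avg_ce = -np.mean(np.log(token_probs))   # average cross-entropy per token ≈ 2.60
perplexity = np.exp(avg_ce)              # ≈ 13.5 "effective choices" per token
print(avg_ce, perplexity)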

Soft labels. Cross-entropy works even when P isn't one-hot. Label smoothing replaces [1, 0, 0] with [0.9, 0.05, 0.05] — the model learns a smoother distribution, which improves calibration and generalization.
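
A sketch comparing a hard one-hot target with a smoothed one (the 0.9 / 0.05 split above):

import numpy as np

q_pred   = np.array([0.7, 0.2, 0.1])    # model's predicted distribution
p_hard   = np.array([1.0, 0.0, 0.0])    # one-hot target
p_smooth = np.array([0.9, 0.05, 0.05])  # label-smoothed target

print(-np.sum(p_hard * np.log(q_pred)))    # ≈ 0.357
print(-np.sum(p_smooth * np.log(q_pred)))  # ≈ 0.517; smoothing never asks for Q(true class) = 1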


Common Pitfalls

  • Predicted probability of 0 → infinite loss. log(0) is undefined. In practice, softmax outputs are never exactly 0, but numerical instability is real. PyTorch's CrossEntropyLoss uses log-sum-exp internally to avoid this.
  • Using sigmoid + BCE vs softmax + CE. For multi-class problems (exactly one class is correct), use softmax + cross-entropy. For multi-label problems (several classes can be true at once), use sigmoid + BCE per class; see the sketch after this list.
  • Cross-entropy is not symmetric. H(P, Q) ≠ H(Q, P). The "true" distribution is always the first argument. Don't swap them.
  • Zero loss isn't the goal. Cross-entropy's minimum over Q is H(P), so if your labels are noisy the best achievable loss is nonzero.
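
A sketch of the two setups in PyTorch, with made-up logits and targets:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw scores for 3 classes

# Multi-class: exactly one correct class, given as an index; softmax + CE
target_class = torch.tensor([0])
print(F.cross_entropy(logits, target_class))

# Multi-label: each class is an independent yes/no; sigmoid + BCE per class
target_multi = torch.tensor([[1.0, 1.0, 0.0]])   # classes 0 and 1 both present
print(F.binary_cross_entropy_with_logits(logits, target_multi))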

Examples

import numpy as np

def cross_entropy(p_true, q_pred):
    """Cross-entropy H(P, Q)."""
    q_pred = np.clip(q_pred, 1e-10, 1.0)  # avoid log(0)
    return -np.sum(np.array(p_true) * np.log(q_pred))

# One-hot true label: class 0 is correct
p_true = [1.0, 0.0, 0.0]

# Good prediction: model is confident and correct
q_good = [0.9, 0.07, 0.03]
print(f"Good prediction loss:  {cross_entropy(p_true, q_good):.4f}")  # ≈ 0.1054

# Bad prediction: model is confident but wrong
q_bad = [0.05, 0.90, 0.05]
print(f"Bad prediction loss:   {cross_entropy(p_true, q_bad):.4f}")   # ≈ 2.9957

# Uniform: maximum confusion
q_uniform = [1/3, 1/3, 1/3]
print(f"Uniform prediction:    {cross_entropy(p_true, q_uniform):.4f}")  # ≈ 1.0986

# Binary cross-entropy
def bce(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(f"\nBCE (y=1, ŷ=0.95): {bce(1, 0.95):.4f}")   # ≈ 0.0513  — great prediction
print(f"BCE (y=1, ŷ=0.50): {bce(1, 0.50):.4f}")    # ≈ 0.6931  — uncertain
print(f"BCE (y=1, ŷ=0.05): {bce(1, 0.05):.4f}")    # ≈ 2.9957  — confidently wrong

# In PyTorch
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw model outputs (before softmax)
target = torch.tensor([0])                  # correct class is index 0

loss = F.cross_entropy(logits, target)
print(f"\nPyTorch CrossEntropyLoss: {loss.item():.4f}")  # handles softmax internally

Review Questions