Softmax and Cross-Entropy
Softmax converts raw scores into probabilities. Cross-entropy measures how wrong those probabilities are. Together, they form the standard output layer + loss function for classification — and they're mathematically designed to complement each other.
Intuition First
Your model scores three candidate answers: [3.0, 1.0, 0.2]. These are raw numbers — no probability yet. Which one is the answer?
Softmax converts these scores into probabilities that sum to 1: maybe [0.84, 0.11, 0.05]. Now you can say "the model is 84% confident it's class 0."
Cross-entropy then measures how far those probabilities are from the truth. If class 0 is correct: loss = -log(0.84) ≈ 0.17. If class 2 is correct but we assigned it only 5%: loss = -log(0.05) ≈ 3.0. The more confident you were in the wrong answer, the higher the penalty.
They're paired because softmax produces exactly what cross-entropy needs: a probability distribution over classes.
What's Actually Happening
Softmax is an "amplifying normalizer." It exponentiates each score (making all values positive) and then divides by the total (making them sum to 1):
softmax(z)ᵢ = eᶻⁱ / Σⱼ eᶻʲ
Why exponentiation? It preserves the ordering of scores (higher score → higher probability) but magnifies the differences. A score of 3 doesn't just become 3× more likely than a score of 1 — it becomes e³/e¹ = e² ≈ 7.4× more likely.
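A quick NumPy check of that ratio (a minimal sketch; the variable names are arbitrary):

import numpy as np

scores = np.array([3.0, 1.0, 0.2])
probs = np.exp(scores) / np.exp(scores).sum()

# The ratio of two softmax outputs depends only on the score difference:
# softmax(z)_a / softmax(z)_b = exp(z_a - z_b)
print(probs[0] / probs[1])   # ≈ 7.389
print(np.exp(3.0 - 1.0))     # ≈ 7.389, i.e. e², as claimed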
Cross-entropy then measures: given the true label, how costly is it to encode that label using the model's predicted distribution? (With the natural log the cost is in nats; with log base 2, bits.) The lower the probability you assigned to the correct answer, the higher the cost.
Loss = -log(softmax(z)_correct_class)
= -log(eᶻᶜ / Σⱼ eᶻʲ)
= -z_c + log(Σⱼ eᶻʲ)
This is the log-sum-exp form — it's what PyTorch actually computes, and it's numerically stable.
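A small numerical check that the two forms agree (an illustrative sketch, not library code):

import numpy as np

z = np.array([3.0, 1.0, 0.2])
c = 0  # index of the correct class

# Form 1: -log(softmax(z)[c])
probs = np.exp(z) / np.exp(z).sum()
loss_direct = -np.log(probs[c])

# Form 2: log-sum-exp(z) - z[c]
loss_lse = np.log(np.exp(z).sum()) - z[c]

print(loss_direct, loss_lse)  # both ≈ 0.179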
Build the Idea Step-by-Step
Formal Explanation
Softmax
Given logit vector z ∈ ℝᵏ for k classes:
softmax(z)ᵢ = eᶻⁱ / Σⱼ₌₁ᵏ eᶻʲ
Properties:
- All outputs are in (0, 1) — strictly between 0 and 1
- Outputs sum to 1 — valid probability distribution
- Relative ordering is preserved (larger logit → larger probability)
- Translation-invariant: softmax(z) = softmax(z + c) for any constant c (checked in the sketch below)
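A quick check of the translation-invariance property (a minimal sketch; the shift of 100 is arbitrary):

import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # stable form, discussed below
    return exp_z / exp_z.sum()

z = np.array([3.0, 1.0, 0.2])
print(np.allclose(softmax(z), softmax(z + 100.0)))  # True: shifting every logit changes nothing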
Cross-Entropy Loss (classification)
For true one-hot label y and predicted probabilities p = softmax(z):
L = -Σᵢ yᵢ · log(pᵢ)
= -log(p_correct) ← since yᵢ = 1 only for the true class
The Combined Formula
Substituting softmax into cross-entropy:
L = -log(eᶻᶜ / Σⱼ eᶻʲ)
= -z_c + log(Σⱼ eᶻʲ)
= log-sum-exp(z) - z_c
where c is the correct class index.
Gradient of the combined loss with respect to logits (the beautiful result):
∂L/∂zᵢ = softmax(z)ᵢ - yᵢ = pᵢ - yᵢ
If i is the correct class: pᵢ - 1 (push probability up)
If i is any other class: pᵢ - 0 = pᵢ (push probability down)
This is one of the cleanest gradients in all of machine learning. The combined softmax + cross-entropy gradient is just "predicted probability minus truth."
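A finite-difference sanity check of that claim (a sketch; loss and numerical_grad are illustrative helpers, not library functions):

import numpy as np

def loss(z, c):
    # combined softmax + cross-entropy: log-sum-exp(z) - z_c
    return np.log(np.exp(z).sum()) - z[c]

def numerical_grad(z, c, eps=1e-5):
    g = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        g[i] = (loss(zp, c) - loss(zm, c)) / (2 * eps)
    return g

z = np.array([3.0, 1.0, 0.2])
p = np.exp(z) / np.exp(z).sum()
y = np.array([1.0, 0.0, 0.0])  # class 0 is correct

print(np.round(p - y, 4))                  # analytic: ≈ [-0.164, 0.1131, 0.0508]
print(np.round(numerical_grad(z, 0), 4))   # numerical: matches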
Numerical Stability
The problem: eᶻ overflows to ∞ for large z (e.g., e¹⁰⁰ = ∞ in float32).
The fix: Subtract the max logit before exponentiating. Since softmax is translation-invariant:
z_shifted = z - max(z) # no overflow: largest value becomes 0
softmax(z) = exp(z_shifted) / sum(exp(z_shifted))
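To see why the shift matters, a small sketch comparing the naive and shifted versions (the naive one emits overflow warnings):

import numpy as np

z = np.array([1000.0, 999.0, 0.0])

naive = np.exp(z) / np.exp(z).sum()                       # exp(1000) overflows to inf, result is nan
stable = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print(naive)    # [nan nan  0.]
print(stable)   # ≈ [0.731, 0.269, 0.0]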
Log-softmax is even better: When you're computing log(softmax(z)) for the loss:
log(softmax(z)ᵢ) = zᵢ - log(Σⱼ eᶻʲ)
= zᵢ - max(z) - log(Σⱼ exp(zⱼ - max(z)))
This avoids computing softmax then log separately (which introduces precision loss). PyTorch's F.cross_entropy does this automatically — pass raw logits, not pre-softmaxed probabilities.
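A short sketch of the precision difference (illustrative values; the second logit is deliberately extreme):

import numpy as np

z = np.array([1000.0, 0.0])
shifted = z - z.max()

# Naive: softmax, then log. The tiny probability underflows to 0, so log gives -inf.
p = np.exp(shifted) / np.exp(shifted).sum()
print(np.log(p))          # [  0. -inf]

# Stable log-softmax: z - max(z) - log(sum(exp(z - max(z))))
log_probs = shifted - np.log(np.exp(shifted).sum())
print(log_probs)          # [    0. -1000.]  (finite and exact)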
Key Properties / Rules
| Concept | Detail |
|---|---|
| Softmax temperature | softmax(z/T): high T → uniform, low T → sharp/argmax-like |
| Gradient of combined loss | pᵢ - yᵢ — prediction minus truth |
| Never apply softmax before CrossEntropyLoss | PyTorch applies log-softmax internally — double softmax breaks training |
| Log-sum-exp trick | Subtract max before exponentiation to avoid overflow |
| Softmax vs sigmoid | Multi-class → softmax; Multi-label (many can be true) → sigmoid per class |
Why It Matters
Temperature scaling changes model confidence. Language models use temperature T to control how "creative" or "random" their outputs are. T = 1.0 is standard softmax. T → 0 always picks the highest logit (greedy). T = 2.0 spreads probability more evenly, allowing more varied outputs.
The gradient is why training works cleanly. Because ∂L/∂zᵢ = pᵢ - yᵢ, the network gets a direct error signal: "you said probability 0.84 for class 0 but the truth is 1.0 — push that probability higher by 0.16." This clean signal makes softmax + cross-entropy much easier to train than, say, sigmoid + MSE on a multi-class problem.
RLHF uses the same structure. Language models produce logits over the vocabulary. Softmax converts to probabilities. The probability of the target token is the cross-entropy loss. Training maximizes the log-likelihood of the correct token — same formula.
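The same loss call scales to a language-model-sized output layer. A toy sketch (vocab_size, the random logits, and the random targets are all invented for illustration):

import torch
import torch.nn.functional as F

vocab_size = 50_000                 # illustrative vocabulary size
batch, seq_len = 2, 8

logits = torch.randn(batch, seq_len, vocab_size)            # one raw score per vocabulary token
targets = torch.randint(0, vocab_size, (batch, seq_len))    # the true next token at each position

# F.cross_entropy expects (N, C) logits and (N,) targets, so flatten batch and sequence dims
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())   # ≈ log(50_000) ≈ 10.8 for random logits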
Common Pitfalls
- Applying softmax before nn.CrossEntropyLoss in PyTorch. Don't. The function applies log_softmax internally. If you pass softmaxed probabilities, it applies softmax again — the math is wrong and the gradient signal is badly weakened.
- Using softmax for multi-label classification. If multiple classes can be true simultaneously (e.g., image tags: cat, outdoor, cute), softmax is wrong — its outputs compete. Use sigmoid on each logit independently (see the sketch after this list).
- Interpreting logits as probabilities. Logits are raw scores. A logit of 5 doesn't mean 50%. Only after softmax do you have probabilities.
- Forgetting numerical stability in custom implementations. If you implement softmax yourself, always subtract the max first.
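A sketch of the multi-label alternative mentioned in the pitfalls above (the tags, logits, and targets are invented for illustration):

import torch
import torch.nn as nn

# Three independent tags: cat, outdoor, cute. More than one can be true at once.
logits = torch.tensor([[2.0, -1.0, 0.5]])
targets = torch.tensor([[1.0, 0.0, 1.0]])    # cat and cute are both true

probs = torch.sigmoid(logits)                # each tag gets its own independent probability
print(probs)                                 # ≈ [[0.881, 0.269, 0.622]]; they need not sum to 1

# The matching loss is BCEWithLogitsLoss: sigmoid + binary cross-entropy per tag
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())   # ≈ 0.305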
Examples
import numpy as np
# --- Softmax with numerical stability ---
def softmax(z):
z = np.array(z, dtype=float)
z -= z.max() # subtract max for stability
exp_z = np.exp(z)
return exp_z / exp_z.sum()
logits = np.array([3.0, 1.0, 0.2])
probs = softmax(logits)
print("Probs:", np.round(probs, 4)) # [0.8437, 0.1142, 0.0421]
print("Sum: ", probs.sum()) # 1.0
# --- Cross-entropy loss ---
def cross_entropy(probs, true_class):
return -np.log(probs[true_class] + 1e-10)
print("Loss (class 0 is correct):", round(cross_entropy(probs, 0), 4)) # ≈ 0.17
print("Loss (class 2 is correct):", round(cross_entropy(probs, 2), 4)) # ≈ 3.17
# --- Gradient: the beautiful result ---
def softmax_crossentropy_gradient(logits, true_class):
p = softmax(logits)
y = np.zeros_like(p)
y[true_class] = 1.0
return p - y # prediction minus truth
grad = softmax_crossentropy_gradient(logits, true_class=0)
print("Gradient:", np.round(grad, 4))
# ≈ [-0.164, 0.1131, 0.0508]
# correct class: -(1 - 0.836) = -0.164 (push up)
# wrong classes: positive (push down)
# Temperature scaling — how confidence changes
def softmax_with_temp(z, temperature=1.0):
z = np.array(z, dtype=float) / temperature
z -= z.max()
exp_z = np.exp(z)
return exp_z / exp_z.sum()
logits = np.array([3.0, 1.0, 0.2])
print("T=0.5 (sharp):", np.round(softmax_with_temp(logits, 0.5), 3))
print("T=1.0 (normal):", np.round(softmax_with_temp(logits, 1.0), 3))
print("T=2.0 (flat): ", np.round(softmax_with_temp(logits, 2.0), 3))
# T=0.5: [0.978, 0.018, 0.004] — very confident
# T=1.0: [0.836, 0.113, 0.051] — standard
# T=2.0: [0.619, 0.228, 0.153] — more spread out
# PyTorch: always pass raw logits to CrossEntropyLoss
import torch
import torch.nn as nn
import torch.nn.functional as F
logits = torch.tensor([[3.0, 1.0, 0.2]])
target = torch.tensor([0]) # class 0 is correct
# ✓ Correct: pass logits directly
loss = nn.CrossEntropyLoss()(logits, target)
print(f"Correct loss: {loss.item():.4f}")
# ✗ Wrong: softmax applied twice
probs = F.softmax(logits, dim=1)
wrong_loss = nn.CrossEntropyLoss()(probs, target)
print(f"Double-softmax loss (wrong): {wrong_loss.item():.4f}") # artificially low