KL Divergence
KL divergence measures how different one probability distribution is from another. It quantifies the "information loss" when you approximate the true distribution with a model, and it is the backbone of VAEs, RLHF, and the reason cross-entropy loss works as a training objective.
Intuition First
You're a spy decoding intercepted messages. Your codebook was built for one language (distribution Q), but the messages actually come from a different language (distribution P).
Every message takes extra effort to decode because the codebook is tuned to the wrong distribution. That extra effort, the inefficiency, is the KL divergence.
KL divergence answers: "How much information do I waste if I use distribution Q to describe data that actually came from distribution P?"
Zero KL divergence means the two distributions are identical. Any divergence > 0 means they differ, and you're losing efficiency.
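This coding story can be checked directly. Below is a minimal sketch with made-up frequencies: the average cost of encoding symbols that follow P using a code built for Q exceeds the best achievable cost by exactly KL(P ‖ Q).

```python
import numpy as np

# Hypothetical symbol frequencies: P is reality, Q is what the codebook assumes
p = np.array([0.7, 0.2, 0.1])   # true distribution P
q = np.array([0.1, 0.2, 0.7])   # assumed distribution Q (badly mismatched)

optimal_cost = -np.sum(p * np.log(p))   # average cost per symbol with a P-optimal code
actual_cost  = -np.sum(p * np.log(q))   # average cost per symbol with the Q-based code
wasted_effort = actual_cost - optimal_cost

kl = np.sum(p * np.log(p / q))
print(wasted_effort, kl)   # identical: the wasted effort is KL(P ‖ Q)
```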
What's Actually Happening
KL divergence is the extra average surprise you experience when you expect distribution Q but reality is distribution P.
- If P(x) = Q(x) for all x → no extra surprise → KL = 0
- If P assigns high probability to an event that Q calls unlikely → that event surprises you → KL is large
It's defined as the expected log-ratio of probabilities:
KL(P ‖ Q) = E_P[ log(P(x) / Q(x)) ]
= Σₓ P(x) · log(P(x) / Q(x))
Read this as: "On average (weighted by P), how different are P and Q in log-probability space?"
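To make the "weighted by P" reading concrete, here is that sum expanded term by term for a small, made-up pair of distributions:

```python
import numpy as np

p = np.array([0.5, 0.4, 0.1])   # reality
q = np.array([0.2, 0.4, 0.4])   # your expectation

for pi, qi in zip(p, q):
    print(f"P={pi:.1f}, Q={qi:.1f}, contribution = {pi * np.log(pi / qi):+.4f}")
# Outcomes where P > Q contribute positive surprise; outcomes where P < Q subtract a little.
# The total, which is never negative, is the KL divergence:
print("KL(P ‖ Q) =", np.sum(p * np.log(p / q)))
```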
Build the Idea Step-by-Step
Formal Explanation
For discrete distributions P and Q:
KL(P ‖ Q) = Σₓ P(x) · log(P(x) / Q(x))
Equivalently, expanding the log:
KL(P ‖ Q) = Σₓ P(x) · log P(x) - Σₓ P(x) · log Q(x)
= -H(P) + H(P, Q)
= H(P, Q) - H(P)
This shows:
- Cross-entropy = Entropy + KL divergence
- Since H(P) is fixed during training, minimizing cross-entropy = minimizing KL divergence
Key facts:
- KL(P ‖ Q) ≥ 0 always (proved by Jensen's inequality)
- KL(P ‖ Q) = 0 if and only if P = Q
- Not symmetric: KL(P ‖ Q) ≠ KL(Q ‖ P)
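The first two facts are easy to sanity-check numerically. The snippet below is an illustrative check, not a proof (Jensen's inequality is the proof): random pairs of distributions never produce a negative KL, and a distribution's KL with itself is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
kls = []
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))   # random distribution over 5 outcomes
    q = rng.dirichlet(np.ones(5))
    kls.append(np.sum(p * np.log(p / q)))

print(min(kls) >= 0)                  # True: never negative
print(np.sum(p * np.log(p / p)))      # 0.0: KL(P ‖ P) = 0
```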
Key Properties / Rules
| Property | Meaning |
|---|---|
| KL ≥ 0 | Always non-negative — guaranteed by Jensen's inequality |
| KL = 0 iff P = Q | Zero divergence ↔ identical distributions |
| Not symmetric | KL(P‖Q) ≠ KL(Q‖P) — order matters |
| Not a metric | No triangle inequality — it's a divergence, not a distance |
| KL(P‖Q) = ∞ | If Q(x) = 0 but P(x) > 0 — Q can't explain what P says happens |
Why Direction Matters
The asymmetry is not a flaw — it's meaningful:
KL(P ‖ Q): "forward KL", averaged under reality P. If P(x) > 0 but Q(x) ≈ 0, KL blows up, so Q must cover everything P says is possible. Minimizing it gives mean-seeking (mass-covering) behavior: Q spreads out to cover all modes of P.
KL(Q ‖ P): "reverse KL", averaged under the model Q. If Q(x) > 0 but P(x) ≈ 0, KL blows up, so Q must avoid regions P says are impossible. Minimizing it gives mode-seeking behavior: Q locks onto one peak of P and ignores the others.
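A small discrete illustration of the two behaviors, with hand-picked numbers: P has two modes, one candidate Q spreads over everything, the other commits to a single mode.

```python
import numpy as np

def kl(p, q):
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), 1e-10, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p_bimodal  = [0.49, 0.01, 0.01, 0.49]   # reality: two modes
q_broad    = [0.25, 0.25, 0.25, 0.25]   # covers everything
q_one_mode = [0.96, 0.02, 0.01, 0.01]   # commits to the first mode only

print("forward KL:", kl(p_bimodal, q_broad), kl(p_bimodal, q_one_mode))   # ≈ 0.60 vs ≈ 1.57
print("reverse KL:", kl(q_broad, p_bimodal), kl(q_one_mode, p_bimodal))   # ≈ 1.27 vs ≈ 0.62
# Forward KL prefers the broad Q (Q must cover both of P's modes);
# reverse KL prefers the single-mode Q (Q must avoid regions where P ≈ 0).
```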
In VAEs, the regularization term is KL(Q ‖ P) where P is the standard Gaussian prior — this pushes the encoder to stay close to N(0,1).
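For the VAE term specifically, when the encoder outputs a diagonal Gaussian N(μ, σ²) and the prior is N(0, 1), KL(Q ‖ P) has a standard closed form, 0.5 · (μ² + σ² - 1 - log σ²) summed over latent dimensions. A minimal sketch (the function and variable names are illustrative, not from any framework):

```python
import numpy as np

def kl_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) ‖ N(0, 1) ), summed over latent dimensions."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

# Hypothetical encoder output for one input, 4 latent dimensions
mu      = np.array([0.0, 0.5, -1.0, 2.0])
log_var = np.array([0.0, 0.0, 0.1, -0.5])

print(kl_gaussian_to_standard_normal(mu, log_var))               # > 0: encoder has drifted from the prior
print(kl_gaussian_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0: the encoder output is exactly N(0, 1)
```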
Why It Matters
Training is KL minimization. Cross-entropy loss is KL(true ‖ model) plus a constant. Every time you minimize cross-entropy loss, you're making the model's distribution closer to the true data distribution in the KL sense.
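One consequence worth checking: in classification the "true" distribution is a one-hot label, so H(P) = 0 and cross-entropy equals the KL divergence exactly (toy numbers below).

```python
import numpy as np

p_onehot = np.array([0.0, 1.0, 0.0])    # true label: class 1
q_model  = np.array([0.2, 0.7, 0.1])    # model's predicted probabilities
q_safe   = np.clip(q_model, 1e-10, 1.0)

cross_entropy = -np.sum(p_onehot * np.log(q_safe))
mask = p_onehot > 0
kl = np.sum(p_onehot[mask] * np.log(p_onehot[mask] / q_safe[mask]))

print(cross_entropy, kl)   # both equal -log(0.7): H(P) = 0, so the constant vanishes
```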
Variational Autoencoders (VAEs). The VAE loss has two terms: reconstruction loss + β · KL(encoder ‖ prior). The KL term forces the encoder to produce latent codes that look like samples from N(0,1). Without it, the encoder would memorize each input as a point with no structure.
RLHF / KL penalties. When fine-tuning a language model with reinforcement learning (e.g., InstructGPT, ChatGPT), the reward function includes -β · KL(policy ‖ reference_model). This prevents the model from drifting too far from its pre-trained behavior while still optimizing the human preference reward.
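A sketch of how that penalty is often estimated in practice, from the log-probabilities each model assigns to the tokens the policy actually sampled. Everything here (function name, β value, the log-probs) is illustrative rather than taken from a specific library; the grounding fact is that log π(x) - log π_ref(x), averaged over samples x from the policy, estimates KL(policy ‖ reference_model).

```python
import numpy as np

def kl_penalty_per_token(policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token penalty: beta * (log pi(token) - log pi_ref(token))."""
    return beta * (np.asarray(policy_logprobs) - np.asarray(ref_logprobs))

# Hypothetical log-probs of the sampled tokens under each model
policy_lp = np.array([-1.2, -0.3, -2.0, -0.8])
ref_lp    = np.array([-1.5, -0.4, -1.0, -0.9])

penalty = kl_penalty_per_token(policy_lp, ref_lp)
print(penalty)        # positive where the policy has drifted from the reference
print(penalty.sum())  # total amount subtracted from the reward for this response
```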
f-divergences and GANs. KL divergence is one instance of a broader family of f-divergences. The original GAN objective corresponds to minimizing Jensen-Shannon divergence, a symmetric, always-finite relative of KL computed against the mixture M = (P + Q)/2, and that mode-seeking flavor is often credited for GANs' sharp samples.
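A minimal sketch of Jensen-Shannon divergence for discrete distributions, using the same clipping convention as the Examples section below. Unlike KL it is symmetric and stays finite even when one distribution assigns zero probability somewhere the other does not.

```python
import numpy as np

def kl(p, q):
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), 1e-10, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of P and Q against their mixture M."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.2, 0.3, 0.5]
q = [0.5, 0.5, 0.0]
print(jsd(p, q), jsd(q, p))   # ≈ 0.219 both ways: symmetric and finite despite the zero in q
print(kl(p, q))               # huge (hits the clip floor): Q assigns 0 where P > 0
```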
Common Pitfalls
- Q(x) = 0 where P(x) > 0 means KL = ∞. Your model must assign non-zero probability to every event that can actually occur. This is why we add epsilon or use log-sum-exp tricks in practice.
- KL is not a distance. You can't use it with nearest-neighbor search or apply the triangle inequality. If you need a symmetric measure, use Jensen-Shannon divergence: JSD(P, Q) = (KL(P‖M) + KL(Q‖M)) / 2 where M = (P+Q)/2.
- Mixing up P and Q. KL(P ‖ Q) ≠ KL(Q ‖ P). In VAEs, the regularization is KL(posterior ‖ prior) — swap those and the math is wrong.
Examples
```python
import numpy as np

def kl_divergence(p, q):
    """KL(P ‖ Q) — how much Q diverges from P."""
    p = np.array(p, dtype=float)
    q = np.clip(np.array(q, dtype=float), 1e-10, 1.0)
    # Only sum over where p > 0 (0 * log(0/q) = 0 by convention)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
# P: true distribution — 3 equally likely classes
p_true = [0.333, 0.333, 0.333]
# Q1: close to P
q_close = [0.300, 0.350, 0.350]
print(f"KL(P ‖ Q_close): {kl_divergence(p_true, q_close):.4f}") # ≈ 0.0054
# Q2: peaked distribution — very different from uniform P
q_peaked = [0.90, 0.05, 0.05]
print(f"KL(P ‖ Q_peaked): {kl_divergence(p_true, q_peaked):.4f}") # ≈ 0.6365
# KL is not symmetric
print(f"KL(Q_peaked ‖ P): {kl_divergence(q_peaked, p_true):.4f}") # ≈ 0.5443
# KL(P ‖ P) should be 0
print(f"KL(P ‖ P): {kl_divergence(p_true, p_true):.4f}") # = 0.0000
# Relationship: H(P, Q) = H(P) + KL(P ‖ Q)
def entropy(p):
    p = np.clip(np.array(p), 1e-10, 1.0)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    q = np.clip(np.array(q), 1e-10, 1.0)
    return -np.sum(np.array(p) * np.log(q))
h_p = entropy(p_true)
kl = kl_divergence(p_true, q_peaked)
ce = cross_entropy(p_true, q_peaked)
print(f"\nH(P) = {h_p:.4f}")
print(f"KL(P‖Q) = {kl:.4f}")
print(f"H(P) + KL = {h_p + kl:.4f}")
print(f"H(P, Q) = {ce:.4f}")
# H(P) + KL should equal H(P, Q) ✓
```