Information Theory
Entropy
Entropy measures how unpredictable or surprising a probability distribution is. High entropy means high uncertainty — you can't easily guess what's coming next. It's the foundation for understanding why cross-entropy loss works.
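As a concrete sketch, entropy can be computed directly from the definition H(p) = -Σ pᵢ log₂ pᵢ (the function name and the example distributions below are illustrative, not from the text):

```python
import math

def entropy(p):
    """Shannon entropy in bits: H(p) = -sum(p_i * log2(p_i))."""
    # Terms with p_i = 0 contribute nothing (by convention 0 * log 0 = 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin is maximally unpredictable; a biased coin is easier to guess.
print(entropy([0.5, 0.5]))  # 1.0 bit
print(entropy([0.9, 0.1]))  # ~0.469 bits
print(entropy([1.0, 0.0]))  # 0.0 bits: no uncertainty at all
```

The fair coin maximizes entropy for two outcomes; as the distribution gets more lopsided, entropy drops toward zero.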
Cross-Entropy
Cross-entropy measures how well a predicted probability distribution matches the true distribution. It's the standard loss function for classification in neural networks — minimizing it teaches the model to assign high probability to correct answers.
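A minimal sketch of cross-entropy as a classification loss, H(p, q) = -Σ pᵢ log qᵢ (natural log, so values are in nats; the names and example probabilities are illustrative):

```python
import math

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum(p_i * log(q_i)); eps guards against log(0)."""
    return -sum(p * math.log(q + eps) for p, q in zip(p_true, q_pred))

# One-hot true label (class 1 of 3) against two candidate predictions.
target    = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]  # most probability on the correct class
uncertain = [0.40, 0.30, 0.30]  # probability mass spread out

print(cross_entropy(target, confident))  # ~0.105: low loss
print(cross_entropy(target, uncertain))  # ~1.204: higher loss
```

With a one-hot target the sum collapses to -log of the probability assigned to the correct class, which is why minimizing it pushes that probability toward 1.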
KL Divergence
KL divergence measures how different one probability distribution is from another. It quantifies the "information loss" when you approximate the true distribution with a model. It's the backbone of VAEs and RLHF, and it explains the cross-entropy loss: since H(p, q) = H(p) + D_KL(p || q), minimizing cross-entropy over the model's predictions is the same as minimizing the KL divergence to the true distribution.
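The same definitions make the relationship concrete: D_KL(p || q) = Σ pᵢ log(pᵢ/qᵢ). A sketch (function names and distributions are illustrative):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum(p_i * log(p_i / q_i)), in nats."""
    # Skip terms where p_i = 0; eps guards against division by zero.
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

p      = [0.5, 0.3, 0.2]   # "true" distribution
q_good = [0.45, 0.35, 0.20]  # close approximation
q_bad  = [0.10, 0.10, 0.80]  # poor approximation

print(kl_divergence(p, p))       # 0.0: no information lost
print(kl_divergence(p, q_good))  # small positive value
print(kl_divergence(p, q_bad))   # much larger value
```

KL divergence is zero only when the distributions match and grows as they diverge; note it is asymmetric, so D_KL(p || q) generally differs from D_KL(q || p).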