Gradient Descent Intuition
Gradient descent is the algorithm that trains neural networks. It takes small steps in the direction of the negative gradient — downhill on the loss surface — until the loss is acceptably low.
Intuition First
Same hilly landscape, same darkness. But now you have a strategy:
- Feel the slope at your current position (compute the gradient)
- Take a small step in the steepest downhill direction (move opposite to gradient)
- Repeat
The "small step" matters. Too large a step and you'll overshoot the valley. Too small and you'll take forever. That step size is the learning rate — the most important hyperparameter in training.
What's Actually Happening
At each training step:
- Run a batch of inputs through the network with the current weights and compute the loss
- Compute the gradient of the loss with respect to every weight (backpropagation)
- Subtract a fraction of the gradient from each weight
That fraction is controlled by the learning rate α (alpha). The result is a weight configuration that's (hopefully) slightly better than before.
Repeat this thousands to millions of times, and the loss gradually decreases.
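As a sketch of that loop on something concrete, here is a hand-rolled version fitting a two-feature linear model with NumPy; the data and settings are invented for illustration, not taken from a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                # inputs
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=100)   # targets

w = np.zeros(2)      # current weights
alpha = 0.1          # learning rate

for step in range(200):
    pred = X @ w                          # 1. forward pass with the current weights
    loss = np.mean((pred - y) ** 2)       #    ... and the loss
    grad = 2 * X.T @ (pred - y) / len(y)  # 2. gradient of the loss w.r.t. every weight
    w = w - alpha * grad                  # 3. subtract a fraction of the gradient

print(w)   # close to [2, 3]
```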
Build the Idea Step-by-Step
Formal Explanation
The update rule:
w ← w - α · ∇L(w)
Where:
- w — current weights (a vector of all parameters)
- α — learning rate (a small positive number, e.g. 0.001)
- ∇L(w) — gradient of the loss at the current weights
- ← — replace w with the updated value
Why does this reduce the loss?
The gradient ∇L points uphill. Moving in the direction -∇L (downhill) decreases L, at least for a small enough step. This is guaranteed locally — the gradient gives exact first-order information about the slope.
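In symbols: to first order, L(w - α·∇L(w)) ≈ L(w) - α·‖∇L(w)‖². The squared norm is never negative, so for a small enough α the loss cannot increase, and it strictly decreases whenever the gradient is nonzero.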
Variants:
| Variant | Description |
|---|---|
| Batch gradient descent | Full dataset gradient each step — accurate but slow |
| Stochastic (SGD) | One sample per step — noisy but fast |
| Mini-batch SGD | Small batch (32–256) — the standard in practice |
| Adam | Adaptive per-weight learning rates + momentum — most common default |
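A minimal sketch of the mini-batch variant, assuming PyTorch is available; the dataset, batch size of 64, and epoch count are illustrative choices, not recommendations.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # or torch.optim.Adam(...)
loss_fn = nn.MSELoss()

X = torch.randn(1000, 2)
y = 2*X[:, 0:1] + 3*X[:, 1:2] + 0.1 * torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

for epoch in range(10):
    for xb, yb in loader:            # each iteration sees one small random batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()              # gradient of the batch loss only (noisy but cheap)
        optimizer.step()
```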
Key Properties / Rules
| Concept | Meaning |
|---|---|
| Learning rate α too high | Overshoots, loss diverges or oscillates |
| Learning rate α too low | Converges too slowly, may get stuck |
| Gradient = 0 | A critical point (minimum, maximum, or saddle); the update becomes zero |
| Mini-batch noise | Random gradients help escape saddle points |
| Momentum | Carries velocity from past steps — smooths trajectory |
| Learning rate schedule | Decrease α over time — take big steps early, fine-tune later |
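A hand-rolled sketch of the momentum row, reusing the toy loss f(w) = (w - 3)²; β = 0.9 and α = 0.1 are illustrative values, not recommendations.

```python
# Momentum: blend the current gradient into a running velocity and step along that
w, v = 0.0, 0.0
alpha, beta = 0.1, 0.9

for step in range(100):
    g = 2 * (w - 3)      # gradient of f(w) = (w - 3)^2
    v = beta * v + g     # velocity accumulates past gradients (smooths the trajectory)
    w = w - alpha * v    # step along the smoothed direction

print(w)   # close to 3.0
```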
Why It Matters
Gradient descent is the engine of all deep learning. Every model you've heard of — GPT, BERT, Stable Diffusion, AlphaFold — was trained by running gradient descent until the loss was low enough.
Connection to other concepts:
- Gradient (previous note): tells you the direction to move
- Chain rule / backprop: how the gradient is computed efficiently
- Loss function: the landscape you're descending
- Optimizer (Adam, SGD): variations on gradient descent with smarter stepping strategies
Once you understand gradient descent, the optimizer hyperparameters start to make sense: learning rate, momentum, weight decay, learning rate schedulers.
Common Pitfalls
- Loss goes up or oscillates wildly. Learning rate is too large — reduce it by 10×.
- Loss decreases then plateaus without reaching a good value. Common causes: learning rate too small, saddle point, architecture issue.
- Training loss is low but validation loss is high. This is overfitting, not an optimization failure — the descent worked, but it optimized to noise.
- All weights update identically. Usually means all weights are initialized to the same value. Random initialization breaks this symmetry.
- Gradients vanish or explode. Chain rule multiplies many derivatives — if they're all < 1 they collapse to zero; if all > 1 they explode. Solved by normalization layers, careful initialization, and gradient clipping.
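As a sketch of the last mitigation, gradient clipping in PyTorch sits between the backward pass and the optimizer step; the tiny model, random data, and max_norm=1.0 below are just for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 2), torch.randn(8, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()                                                   # gradients may be huge
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if the norm exceeds 1.0
optimizer.step()                                                  # update with the clipped gradient
```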
Examples
```python
# Manual gradient descent on a simple function
# Goal: minimize f(w) = (w - 3)^2, true minimum at w = 3

def loss(w):
    return (w - 3)**2

def grad(w):
    return 2 * (w - 3)   # df/dw

w = 0.0       # start far from the minimum
alpha = 0.1   # learning rate

for step in range(20):
    g = grad(w)
    w = w - alpha * g
    print(f"step {step:2d}: w={w:.4f}, loss={loss(w):.4f}, grad={g:.4f}")

# w converges to 3.0
```
```python
# Gradient descent with PyTorch (full batch here: all 100 samples each step)
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Fake data: y = 2*x1 + 3*x2 + noise
X = torch.randn(100, 2)
y = 2*X[:, 0:1] + 3*X[:, 1:2] + 0.1 * torch.randn(100, 1)

for epoch in range(100):
    optimizer.zero_grad()   # clear previous gradients
    pred = model(X)
    loss = loss_fn(pred, y)
    loss.backward()         # compute ∇L via chain rule / backprop
    optimizer.step()        # w ← w - α·∇L
    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss={loss.item():.4f}")

print(f"learned weights: {model.weight.data}")   # should be approaching [2, 3]
```
Choosing a learning rate (practical rules of thumb):
- Start with 1e-3 (Adam) or 1e-2 (SGD)
- If loss diverges: divide by 10
- If loss barely moves: multiply by 10
- Use a learning rate finder or schedule for important training runs
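One way to act on the last point in PyTorch is to attach a scheduler to the optimizer; the StepLR settings and training setup below are illustrative, not a recommendation.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X = torch.randn(100, 2)
y = 2*X[:, 0:1] + 3*X[:, 1:2] + 0.1 * torch.randn(100, 1)

# Halve the learning rate every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(90):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()
    scheduler.step()                            # big steps early, smaller steps later
    if epoch % 30 == 0:
        print(epoch, scheduler.get_last_lr())   # watch α shrink over time
```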