
Gradient Descent Intuition

Gradient descent is the algorithm that trains neural networks. It takes small steps in the direction of the negative gradient — downhill on the loss surface — until the loss is acceptably low.

Intuition First

Picture the same hilly landscape as in the gradient note, the same darkness. But now you have a strategy:

  1. Feel the slope at your current position (compute the gradient)
  2. Take a small step in the steepest downhill direction (move opposite to gradient)
  3. Repeat

The "small step" matters. Too large a step and you'll overshoot the valley. Too small and you'll take forever. That step size is the learning rate — the most important hyperparameter in training.


What's Actually Happening

At each training step:

  1. Run a batch of inputs through the network with the current weights, compute the loss
  2. Compute the gradient of the loss with respect to every weight (backpropagation)
  3. Subtract a fraction of the gradient from each weight

That fraction is controlled by the learning rate α (alpha). The result is a weight configuration that's (hopefully) slightly better than before.

Repeat this thousands to millions of times, and the loss gradually decreases.
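
Here's a minimal sketch of that loop in NumPy, on a hand-rolled linear model so all three steps are visible (the data and the hand-derived MSE gradient are illustrative, not from any particular library):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0])               # targets from true weights [2, 3]

w = np.zeros(2)
alpha = 0.1

for _ in range(200):
    pred = X @ w                           # 1. run the data through the model
    loss = np.mean((pred - y) ** 2)        #    ...and compute the loss
    grad = 2 * X.T @ (pred - y) / len(y)   # 2. gradient of loss w.r.t. each weight
    w = w - alpha * grad                   # 3. subtract a fraction of the gradient

print(w)                                   # converges toward [2, 3]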


Build the Idea Step-by-Step

  1. Start: random weights w
  2. Forward pass: compute loss L(w)
  3. Backward pass: compute ∇L (backprop)
  4. Update: w ← w - α·∇L
  5. New w has lower loss (usually)
  6. Repeat until loss is small enough

Formal Explanation

The update rule:

w ← w - α · ∇L(w)

Where:

  • w — current weights (a vector of all parameters)
  • α — learning rate (a small positive number, e.g. 0.001)
  • ∇L(w) — gradient of the loss at current weights
  • ← — assignment: replace w with the updated value

Why does this reduce the loss?

The gradient ∇L points uphill. Moving in the direction -∇L (downhill) decreases L, at least for a small enough step. This is guaranteed locally — the gradient gives exact first-order information about the slope.
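
To see one update with numbers, take L(w) = (w - 3)^2 starting from w = 0 with α = 0.1 (the same setup as the first example below):

∇L(0) = 2 · (0 - 3) = -6
w ← 0 - 0.1 · (-6) = 0.6
L(0) = 9  →  L(0.6) = 5.76

One step, and the loss already dropped.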

Variants:

Variant                  Description
Batch gradient descent   Full dataset gradient each step — accurate but slow
Stochastic (SGD)         One sample per step — noisy but fast
Mini-batch SGD           Small batch (32–256) — the standard in practice
Adam                     Adaptive per-weight learning rates + momentum — most common default
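
A rough sketch of how the first three variants differ: only which samples feed the gradient changes (the gradient function here is the hand-derived MSE gradient on made-up data, for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0])
w = np.zeros(2)

def grad_fn(Xb, yb):
    # MSE gradient computed on just the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

g_batch = grad_fn(X, y)                     # batch: all 100 samples
i = rng.integers(100)
g_sgd = grad_fn(X[i:i+1], y[i:i+1])         # stochastic: one random sample
idx = rng.permutation(100)[:32]
g_mini = grad_fn(X[idx], y[idx])            # mini-batch: 32 random samples

# All three estimate the same direction; the noise shrinks as the batch grows
print(g_batch, g_sgd, g_mini)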

Key Properties / Rules

Concept                    Meaning
Learning rate α too high   Overshoots, loss diverges or oscillates
Learning rate α too low    Converges too slowly, may get stuck
Gradient = 0               Reached a critical point — stop updating
Mini-batch noise           Random gradients help escape saddle points
Momentum                   Carries velocity from past steps — smooths trajectory
Learning rate schedule     Decrease α over time — take big steps early, fine-tune later
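
Momentum from the table is a two-line change to the plain update; a minimal sketch on the same toy quadratic (β = 0.9 is the conventional default, the rest mirrors the first example):

# Gradient descent with momentum on f(w) = (w - 3)^2
w, v = 0.0, 0.0
alpha, beta = 0.1, 0.9       # learning rate, momentum coefficient

for _ in range(100):
    g = 2 * (w - 3)          # gradient
    v = beta * v + g         # velocity: running blend of past gradients
    w = w - alpha * v        # step along the smoothed direction

print(w)                     # converges to ~3.0, with some early overshoot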

Why It Matters

Gradient descent is the engine of all deep learning. Every model you've heard of — GPT, BERT, Stable Diffusion, AlphaFold — was trained by running gradient descent until the loss was low enough.

Connection to other concepts:

  • Gradient (previous note): tells you the direction to move
  • Chain rule / backprop: how the gradient is computed efficiently
  • Loss function: the landscape you're descending
  • Optimizer (Adam, SGD): variations on gradient descent with smarter stepping strategies
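
In PyTorch, swapping between these optimizers is a one-line change; a sketch (the learning rates are just the common starting points from the rules of thumb below):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)

# Plain SGD, SGD with momentum, and Adam: the training loop is identical either way
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-2)
opt_momentum = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)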

Once you understand gradient descent, the optimizer hyperparameters all make sense: learning rate, momentum, weight decay, learning rate schedules.


Common Pitfalls

  • Loss goes up or oscillates wildly. Learning rate is too large — reduce it by 10×.
  • Loss decreases then plateaus without reaching a good value. Common causes: learning rate too small, saddle point, architecture issue.
  • Training loss is low but validation loss is high. This is overfitting, not an optimization failure — the descent worked, but it optimized to noise.
  • All weights update identically. Usually means all weights are initialized to the same value. Random initialization breaks this symmetry.
  • Gradients vanish or explode. Chain rule multiplies many derivatives — if they're all < 1 they collapse to zero; if all > 1 they explode. Solved by normalization layers, careful initialization, and gradient clipping.
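
For the last pitfall, gradient clipping is one line between backward() and step(); a sketch using PyTorch's torch.nn.utils.clip_grad_norm_ (the model and loss here are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(8, 2)).pow(2).mean()   # placeholder loss
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if norm > 1
optimizer.step()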

Examples

# Manual gradient descent on a simple function
# Goal: minimize f(w) = (w - 3)^2, true minimum at w = 3

def loss(w):
    return (w - 3)**2

def grad(w):
    return 2 * (w - 3)   # df/dw

w = 0.0          # start far from the minimum
alpha = 0.1      # learning rate

for step in range(20):
    g = grad(w)
    w = w - alpha * g
    print(f"step {step:2d}: w={w:.4f}, loss={loss(w):.4f}, grad={g:.4f}")

# w converges to 3.0


# Mini-batch gradient descent with PyTorch
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Fake data: y = 2x1 + 3x2 + noise
X = torch.randn(100, 2)
y = 2*X[:, 0:1] + 3*X[:, 1:2] + 0.1 * torch.randn(100, 1)

batch_size = 32
for epoch in range(100):
    perm = torch.randperm(len(X))          # reshuffle the data each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]       # one mini-batch (up to 32 samples)
        optimizer.zero_grad()              # clear previous gradients
        pred = model(X[idx])
        loss = loss_fn(pred, y[idx])
        loss.backward()                    # compute ∇L via chain rule / backprop
        optimizer.step()                   # w ← w - α·∇L

    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss={loss.item():.4f}")

print(f"learned weights: {model.weight.data}")  # should be close to [2, 3]

Choosing a learning rate (practical rules of thumb):

  • Start with 1e-3 (Adam) or 1e-2 (SGD)
  • If loss diverges: divide by 10
  • If loss barely moves: multiply by 10
  • Use a learning rate finder or schedule for important training runs
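
A decaying schedule is a few lines in PyTorch; a sketch with torch.optim.lr_scheduler.StepLR (the step_size and gamma values are placeholders, not recommendations):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... forward pass, loss.backward() as in the examples above ...
    optimizer.step()       # kept here so the sketch runs standalone
    scheduler.step()       # after 30 epochs lr=0.01, after 60 lr=0.001
    if epoch % 30 == 0:
        print(epoch, optimizer.param_groups[0]["lr"])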

Review Questions