
Gradient Descent Intuition

Gradient descent is the algorithm that trains neural networks. It takes small steps in the direction of the negative gradient — downhill on the loss surface — until the loss is acceptably low.

Intuition First

Picture the same hilly landscape as in the gradient note, the same darkness. But now you have a strategy:

  1. Feel the slope at your current position (compute the gradient)
  2. Take a small step in the steepest downhill direction (move opposite to gradient)
  3. Repeat

The "small step" matters. Too large a step and you'll overshoot the valley. Too small and you'll take forever. That step size is the learning rate — the most important hyperparameter in training.


What's Actually Happening

At each training step:

  1. Run a batch of inputs through the network with the current weights, compute the loss
  2. Compute the gradient of the loss with respect to every weight (backpropagation)
  3. Subtract a fraction of the gradient from each weight

That fraction is controlled by the learning rate α (alpha). The result is a weight configuration that's (hopefully) slightly better than before.

Repeat this thousands to millions of times, and the loss gradually decreases.
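
Here's a minimal sketch of that loop in NumPy, on a hand-rolled linear model so all three steps are visible (the data and the hand-derived MSE gradient are illustrative, not from any particular library):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0])               # targets from true weights [2, 3]

w = np.zeros(2)
alpha = 0.1

for _ in range(200):
    pred = X @ w                           # 1. run the data through the model
    loss = np.mean((pred - y) ** 2)        #    ...and compute the loss
    grad = 2 * X.T @ (pred - y) / len(y)   # 2. gradient of loss w.r.t. each weight
    w = w - alpha * grad                   # 3. subtract a fraction of the gradient

print(w)                                   # converges toward [2, 3]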


Build the Idea Step-by-Step

  1. Start: random weights w
  2. Forward pass: compute loss L(w)
  3. Backward pass: compute ∇L (backprop)
  4. Update: w ← w - α·∇L
  5. New w has lower loss (usually)
  6. Repeat until loss is small enough

Formal Explanation

The update rule:

w ← w - α · ∇L(w)

Where:

  • w — current weights (a vector of all parameters)
  • α — learning rate (a small positive number, e.g. 0.001)
  • ∇L(w) — gradient of the loss at current weights
  • ← — assignment: replace w with the updated value

Why does this reduce the loss?

The gradient ∇L points uphill. Moving in the direction -∇L (downhill) decreases L, at least for a small enough step. This is guaranteed locally — the gradient gives exact first-order information about the slope.
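
To see one update with numbers, take L(w) = (w - 3)^2 starting from w = 0 with α = 0.1 (the same setup as the first example below):

∇L(0) = 2 · (0 - 3) = -6
w ← 0 - 0.1 · (-6) = 0.6
L(0) = 9  →  L(0.6) = 5.76

One step, and the loss already dropped.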

Variants:

Variant                  Description
Batch gradient descent   Full dataset gradient each step — accurate but slow
Stochastic (SGD)         One sample per step — noisy but fast
Mini-batch SGD           Small batch (32–256) — the standard in practice
Adam                     Adaptive per-weight learning rates + momentum — most common default
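
A rough sketch of how the first three variants differ: only which samples feed the gradient changes (the gradient function here is the hand-derived MSE gradient on made-up data, for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0])
w = np.zeros(2)

def grad_fn(Xb, yb):
    # MSE gradient computed on just the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

g_batch = grad_fn(X, y)                     # batch: all 100 samples
i = rng.integers(100)
g_sgd = grad_fn(X[i:i+1], y[i:i+1])         # stochastic: one random sample
idx = rng.permutation(100)[:32]
g_mini = grad_fn(X[idx], y[idx])            # mini-batch: 32 random samples

# All three estimate the same direction; the noise shrinks as the batch grows
print(g_batch, g_sgd, g_mini)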

Key Properties / Rules

Concept                    Meaning
Learning rate α too high   Overshoots, loss diverges or oscillates
Learning rate α too low    Converges too slowly, may get stuck
Gradient = 0               Reached a critical point — stop updating
Mini-batch noise           Random gradients help escape saddle points
Momentum                   Carries velocity from past steps — smooths trajectory
Learning rate schedule     Decrease α over time — take big steps early, fine-tune later
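
Momentum from the table is a two-line change to the plain update; a minimal sketch on the same toy quadratic (β = 0.9 is the conventional default, the rest mirrors the first example):

# Gradient descent with momentum on f(w) = (w - 3)^2
w, v = 0.0, 0.0
alpha, beta = 0.1, 0.9       # learning rate, momentum coefficient

for _ in range(100):
    g = 2 * (w - 3)          # gradient
    v = beta * v + g         # velocity: running blend of past gradients
    w = w - alpha * v        # step along the smoothed direction

print(w)                     # converges to ~3.0, with some early overshoot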

Why It Matters

Gradient descent is the engine of all deep learning. Every model you've heard of — GPT, BERT, Stable Diffusion, AlphaFold — was trained by running gradient descent until the loss was low enough.

Connection to other concepts:

  • Gradient (previous note): tells you the direction to move
  • Chain rule / backprop: how the gradient is computed efficiently
  • Loss function: the landscape you're descending
  • Optimizer (Adam, SGD): variations on gradient descent with smarter stepping strategies
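
In PyTorch, swapping between these optimizers is a one-line change; a sketch (the learning rates are just the common starting points from the rules of thumb below):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)

# Plain SGD, SGD with momentum, and Adam: the training loop is identical either way
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-2)
opt_momentum = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)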

Once you understand gradient descent, the optimizer hyperparameters all make sense: learning rate, momentum, weight decay, learning rate schedules.


Common Pitfalls

  • Loss goes up or oscillates wildly. Learning rate is too large — reduce it by 10×.
  • Loss decreases then plateaus without reaching a good value. Common causes: learning rate too small, saddle point, architecture issue.
  • Training loss is low but validation loss is high. This is overfitting, not an optimization failure — the descent worked, but it optimized to noise.
  • All weights update identically. Usually means all weights are initialized to the same value. Random initialization breaks this symmetry.
  • Gradients vanish or explode. Chain rule multiplies many derivatives — if they're all < 1 they collapse to zero; if all > 1 they explode. Solved by normalization layers, careful initialization, and gradient clipping.
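
For the last pitfall, gradient clipping is one line between backward() and step(); a sketch using PyTorch's torch.nn.utils.clip_grad_norm_ (the model and loss here are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(8, 2)).pow(2).mean()   # placeholder loss
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if norm > 1
optimizer.step()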

Examples

# Manual gradient descent on a simple function
# Goal: minimize f(w) = (w - 3)^2, true minimum at w = 3

def loss(w):
    return (w - 3)**2

def grad(w):
    return 2 * (w - 3)   # df/dw

w = 0.0          # start far from the minimum
alpha = 0.1      # learning rate

for step in range(20):
    g = grad(w)
    w = w - alpha * g
    print(f"step {step:2d}: w={w:.4f}, loss={loss(w):.4f}, grad={g:.4f}")

# w converges to 3.0


# Mini-batch gradient descent with PyTorch
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Fake data: y = 2x1 + 3x2 + noise
X = torch.randn(100, 2)
y = 2*X[:, 0:1] + 3*X[:, 1:2] + 0.1 * torch.randn(100, 1)

batch_size = 32
for epoch in range(100):
    perm = torch.randperm(len(X))          # reshuffle the data each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]       # one mini-batch (up to 32 samples)
        optimizer.zero_grad()              # clear previous gradients
        pred = model(X[idx])
        loss = loss_fn(pred, y[idx])
        loss.backward()                    # compute ∇L via chain rule / backprop
        optimizer.step()                   # w ← w - α·∇L

    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss={loss.item():.4f}")

print(f"learned weights: {model.weight.data}")  # should be close to [2, 3]

Choosing a learning rate (practical rules of thumb):

  • Start with 1e-3 (Adam) or 1e-2 (SGD)
  • If loss diverges: divide by 10
  • If loss barely moves: multiply by 10
  • Use a learning rate finder or schedule for important training runs
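
A decaying schedule is a few lines in PyTorch; a sketch with torch.optim.lr_scheduler.StepLR (the step_size and gamma values are placeholders, not recommendations):

import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... forward pass, loss.backward() as in the examples above ...
    optimizer.step()       # kept here so the sketch runs standalone
    scheduler.step()       # after 30 epochs lr=0.01, after 60 lr=0.001
    if epoch % 30 == 0:
        print(epoch, optimizer.param_groups[0]["lr"])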

Review Questions