Mnemosyne

Gradients

The gradient collects all partial derivatives into a single vector pointing in the direction of steepest increase. Negating it gives the direction to move the weights to reduce loss, which is the core operation in gradient-based neural network training.

Intuition First

You're standing on a hilly landscape — eyes closed, can't see the valley. How do you find the lowest point?

Feel the ground in all directions. The ground slopes differently left-right vs. front-back. If you combine those slopes into one arrow, that arrow points toward the steepest uphill direction. Turn around and walk the other way, and you're heading toward the valley.

The gradient is that arrow. In ML, the "landscape" is the loss surface, and the gradient at any point tells you the direction of steepest increase in loss. Flip it, and you know exactly which way to move your weights to make the loss go down.


What's Actually Happening

Partial derivatives tell you how a function changes in each individual direction. The gradient packages all of them into one vector.

For f(x, y):

  • ∂f/∂x is the slope in the x-direction
  • ∂f/∂y is the slope in the y-direction

The gradient ∇f is just both of these together: [∂f/∂x, ∂f/∂y].

This vector has a direction (pointing uphill) and a magnitude (how steep that uphill is). The steeper the landscape, the longer the gradient vector.
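
To make "packages all of them into one vector" concrete, here is a minimal numerical sketch using central finite differences (the function, evaluation point, and step size h are chosen purely for illustration; the same function reappears in the formal example below):

def f(x, y):
    return x**2 + 2 * y**2

def numerical_gradient(x, y, h=1e-5):
    # Each partial derivative: nudge one input while holding the other fixed.
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [df_dx, df_dy]  # the partials stacked into one vector

print(numerical_gradient(1.0, 2.0))  # ≈ [2.0, 8.0]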


Build the Idea Step-by-Step

  1. Start with a multi-input function f(x₁, ..., xₙ)
  2. Compute each partial derivative ∂f/∂xᵢ
  3. Stack them into a vector: ∇f = [∂f/∂x₁, ..., ∂f/∂xₙ]
  4. ∇f points toward the steepest increase
  5. -∇f points toward the steepest decrease
  6. Move the weights by -α·∇L to reduce the loss (a loop sketch follows)
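
A minimal sketch of that last step as a loop, in plain Python (the learning rate α = 0.1 and iteration count are arbitrary choices for illustration):

# Gradient descent on f(x, y) = x^2 + 2y^2, whose gradient is [2x, 4y].
x, y = 1.0, 2.0
alpha = 0.1

for step in range(50):
    grad = [2 * x, 4 * y]   # ∇f at the current point
    x -= alpha * grad[0]    # move against the gradient...
    y -= alpha * grad[1]    # ...in every coordinate simultaneously

print(x, y)  # both approach 0, the minimizer of f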

Formal Explanation

For a function f(x₁, x₂, ..., xₙ):

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

This vector lives in the same space as the inputs. It has one component per input dimension.

Example: f(x, y) = x² + 2y²

∂f/∂x = 2x
∂f/∂y = 4y

∇f(x,y) = [2x, 4y]

At (1, 2): ∇f = [2, 8]

The gradient [2, 8] says: at this point, the function is four times as sensitive to a change in y as to a change in x.

Key property: The gradient vector always points in the direction of steepest ascent. Its magnitude is the rate of that ascent.
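
Why this is true (a standard one-line argument from multivariable calculus, included here for completeness): the rate of change of f in a unit direction u is the directional derivative

D_u f = ∇f · u = ‖∇f‖ ‖u‖ cos θ = ‖∇f‖ cos θ

where θ is the angle between u and ∇f. This is maximized when θ = 0, i.e., when u points along ∇f, and the maximum rate is exactly ‖∇f‖. Pointing u the opposite way (θ = 180°) gives the steepest decrease, which is why -∇f is the descent direction.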


Key Properties / Rules

Property                         Meaning
∇f is a vector                   One entry per input dimension
Points uphill                    Maximally increases f
-∇f points downhill              Used in gradient descent
‖∇f‖ is large                    Steep landscape (big changes ahead)
‖∇f‖ = 0                         At a critical point (potential minimum)
Perpendicular to level curves    The gradient crosses isolines at 90° (checked numerically below)
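
A quick numerical check of the last row (the function, point, and ellipse parametrization are chosen here for illustration):

import math

# f(x, y) = x^2 + 2y^2: its level curve through (1, 2) is the ellipse
# x^2 + 2y^2 = 9, which we can parametrize as (3 cos t, (3/√2) sin t).
t = math.acos(1 / 3)                       # parameter value at the point (1, 2)
tangent = (-3 * math.sin(t), (3 / math.sqrt(2)) * math.cos(t))
gradient = (2 * 1, 4 * 2)                  # ∇f at (1, 2) = [2x, 4y] = [2, 8]

dot = gradient[0] * tangent[0] + gradient[1] * tangent[1]
print(dot)  # ≈ 0: the gradient is perpendicular to the level curve's tangent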

Why It Matters

In neural network training:

  • The loss L(w) is a function of all weights w = [w₁, w₂, ..., wₙ]
  • The gradient ∇L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ] tells you how each weight affects the loss
  • The update rule w ← w - α·∇L moves every weight slightly downhill simultaneously

The gradient is the mechanism that makes learning possible. Without it, you'd have no signal about which direction to change millions of parameters.
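
A minimal sketch of that update rule applied by hand (essentially what torch.optim.SGD without momentum does; the model, data, and learning rate below are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
x, y = torch.randn(10, 3), torch.randn(10, 1)
alpha = 0.01

loss = nn.MSELoss()(model(x), y)
loss.backward()                  # populates p.grad for every parameter

with torch.no_grad():            # the update itself should not be differentiated
    for p in model.parameters():
        p -= alpha * p.grad      # w ← w - α·∇L, every weight at once
        p.grad = None            # clear gradients before the next step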

Gradient magnitude as a diagnostic (a monitoring sketch follows this list):

  • Exploding gradients: ‖∇L‖ is enormous — training diverges
  • Vanishing gradients: ‖∇L‖ ≈ 0 — early layers learn nothing
  • Normal: gradients stay at a reasonable scale throughout training
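
A minimal way to monitor this during training (the model and loss below are placeholders; this assumes backward() has already populated the .grad attributes):

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
loss = nn.MSELoss()(model(torch.randn(10, 3)), torch.randn(10, 1))
loss.backward()

# Global gradient norm: square root of the sum of squared entries
# across all parameters. Log this once per training step.
total_sq = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
grad_norm = total_sq.sqrt().item()
print(grad_norm)  # enormous → exploding; ≈ 0 → vanishing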

Common Pitfalls

  • Confusing gradient with a single derivative. The derivative df/dx is a scalar. The gradient ∇f is a vector. They're the same concept in 1D; in higher dimensions, the gradient is the natural extension.
  • Forgetting the negative. The gradient points uphill. To decrease loss, you use -∇L. "Gradient descent" walks in the direction of -∇L.
  • Large gradient doesn't mean you're far from the minimum. A loss function can have steep slopes everywhere, including near the minimum. Gradient magnitude tells you local steepness, not distance to the minimum (see the sketch after this list).
  • The gradient is a local concept. It describes the slope at one specific point — not the global shape of the landscape.
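
A concrete illustration of the third pitfall (the toy function f(x) = |x| is chosen because its slope never flattens):

import torch

# |x| has slope of magnitude 1 everywhere except exactly at 0, so the
# gradient is just as "steep" right next to the minimum as far from it.
for x0 in (100.0, 1.0, 0.001):
    x = torch.tensor(x0, requires_grad=True)
    x.abs().backward()
    print(x0, x.grad.item())  # gradient magnitude is 1.0 at every distance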

Examples

import torch

# f(x, y) = x^2 + 2y^2
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

f = x**2 + 2*y**2
f.backward()

print(x.grad)   # ∂f/∂x = 2x = 2.0
print(y.grad)   # ∂f/∂y = 4y = 8.0
# Gradient: [2.0, 8.0] — y-direction is 4x steeper here

# In gradient descent with learning rate lr = 0.1, you'd update:
# x = x - lr * x.grad  → 1.0 - 0.1 * 2.0 = 0.8 (moved toward 0)
# y = y - lr * y.grad  → 2.0 - 0.1 * 8.0 = 1.2 (moved toward 0 faster)

# Second example: neural network gradients after a backward pass
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
x = torch.randn(10, 3)
y = torch.randn(10, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()

# Every parameter now has a .grad attribute
print(model.weight.grad)  # shape (1, 3): ∂L/∂W for each weight
print(model.bias.grad)    # shape (1,):   ∂L/∂b

# This is the gradient — it tells the optimizer which way to move each weight

Reading the gradient (a sign check follows this list):

  • Large |∂L/∂wᵢ| → weight wᵢ strongly influences loss right now
  • ∂L/∂wᵢ > 0 → increase in wᵢ increases loss → decrease wᵢ
  • ∂L/∂wᵢ < 0 → increase in wᵢ decreases loss → increase wᵢ
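
A quick sign check of these rules (toy loss chosen for illustration):

import torch

# L(w) = (w - 3)^2 has its minimum at w = 3.
w = torch.tensor(5.0, requires_grad=True)
loss = (w - 3) ** 2
loss.backward()
print(w.grad)           # 4.0: positive, so increasing w would increase loss

with torch.no_grad():
    w -= 0.1 * w.grad   # the update therefore decreases w...
print(w)                # 4.6: ...moving it toward the minimum at 3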

Review Questions