Backpropagation
Backpropagation is how neural networks learn — it applies the chain rule backwards through every layer to compute how much each weight contributed to the error, then nudges each weight in the right direction.
Intuition First
Imagine you're running a factory. The final product came out wrong. You need to figure out who to blame — and by how much — so you can tell each worker to adjust their output.
You start at the end: the quality inspector says the product scored 3 when it should have scored 10. You trace backwards: the assembly line worker got bad parts. The parts came from the machinist, who used bad metal. The metal came from a supplier who shipped the wrong grade.
Each link in the chain carries some responsibility. You multiply the responsibilities together as you trace backwards. That's backpropagation.
What's Actually Happening
A neural network is a chain of composed functions:
output = fₗ(fₗ₋₁(... f₂(f₁(x))))
Training requires computing ∂Loss/∂w for every weight w in every layer. Backpropagation does this efficiently in two passes:
- Forward pass — compute the output, saving intermediate values at every layer
- Backward pass — walk backwards from the loss, applying the chain rule at each layer, reusing the saved values
The key insight: gradients are just products of derivatives through the chain. Once you've computed the gradient at layer L, you can multiply by the local derivative to get the gradient at layer L-1.
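To make "products of derivatives" concrete, here is a minimal sketch with three made-up scalar functions (f₁(x) = 3x, f₂(u) = u², f₃(v) = sin v, chosen purely for illustration). Multiplying each stage's local derivative, from the output back to the input, gives the full derivative:

```python
import numpy as np

# Forward pass through a tiny chain y = f3(f2(f1(x))), saving intermediates.
x = 0.5
u = 3.0 * x        # f1: local derivative du/dx = 3
v = u ** 2         # f2: local derivative dv/du = 2u
y = np.sin(v)      # f3: local derivative dy/dv = cos(v)

# Backward pass: multiply local derivatives from the output back to the input.
dy_dv = np.cos(v)
dy_du = dy_dv * (2 * u)
dy_dx = dy_du * 3.0

# Sanity check against a finite-difference estimate of dy/dx.
eps = 1e-6
f = lambda t: np.sin((3.0 * t) ** 2)
print(dy_dx, (f(x + eps) - f(x - eps)) / (2 * eps))  # the two values should agree closely
```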
Build the Idea Step-by-Step
Formal Explanation
A minimal 2-layer network:
z₁ = W₁x + b₁ (linear transform, layer 1)
a₁ = ReLU(z₁) (activation, layer 1)
z₂ = W₂a₁ + b₂ (linear transform, layer 2)
Loss = MSE(z₂, y) (loss — mean squared error)
Forward pass — just compute and save everything:
z₁ = W₁x + b₁
a₁ = max(0, z₁)
z₂ = W₂a₁ + b₂
L = (z₂ - y)²
Backward pass — chain rule from Loss back to W₁:
∂L/∂z₂ = 2(z₂ - y) (loss → layer 2 pre-activation)
∂L/∂W₂ = ∂L/∂z₂ · a₁ᵀ (gradient for W₂)
∂L/∂b₂ = ∂L/∂z₂ (gradient for b₂)
∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂ (chain: propagate through W₂)
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁) (chain: through activation — ⊙ = elementwise)
∂L/∂W₁ = ∂L/∂z₁ · xᵀ (gradient for W₁)
∂L/∂b₁ = ∂L/∂z₁ (gradient for b₁)
The pattern: every layer receives a "gradient signal" from the layer above, multiplies it by its own local derivative, then passes it down.
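That pattern translates almost line-for-line into code. The sketch below uses illustrative `Linear` and `ReLU` classes (not any specific library): each `backward` takes the incoming gradient signal, stores its own parameter gradients, and returns the signal for the layer below.

```python
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.1
        self.b = np.zeros(out_dim)

    def forward(self, x):
        self.x = x                       # save the input; backward needs it
        return x @ self.W + self.b

    def backward(self, grad_out):        # grad_out = gradient signal from the layer above
        self.dW = self.x.T @ grad_out    # gradient for this layer's weights
        self.db = grad_out.sum(axis=0)   # gradient for this layer's bias
        return grad_out @ self.W.T       # signal passed down to the layer below

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask      # local derivative: 1 where active, 0 elsewhere

# Chain the layers forward, then walk the gradient signal backwards.
np.random.seed(0)
lin1, act, lin2 = Linear(2, 4), ReLU(), Linear(4, 1)
x = np.array([[1.0, 2.0]])
y = np.array([[1.0]])
z2 = lin2.forward(act.forward(lin1.forward(x)))
lin1.backward(act.backward(lin2.backward(2 * (z2 - y))))  # 2(z2 - y) is ∂L/∂z2
print(lin1.dW.shape)  # (2, 4), same shape as lin1.W
```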
Key Properties / Rules
| Concept | What It Means |
|---|---|
| Forward pass stores activations | Backward pass needs them — don't recompute |
| Gradients are products of derivatives | Each layer multiplies the signal by its local slope |
| ∂L/∂W = δ · aᵀ | Outer product: gradient for any weight matrix |
| ∂L/∂a = Wᵀ · δ | Propagating the signal to the layer below |
| Vanishing gradient | If activations have small derivatives (e.g. sigmoid), gradients shrink multiplicatively and early layers learn nothing |
| Exploding gradient | If derivatives are large, gradients grow and destabilize training — clipping or normalization fixes this |
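For the exploding-gradient row, the usual PyTorch fix is to clip gradients between `loss.backward()` and `optimizer.step()`. A minimal sketch (the model and data here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 2), torch.randn(8, 1)

optimizer.zero_grad()
loss = ((model(x) - y) ** 2).mean()
loss.backward()
# Rescale all gradients so their combined norm is at most 1.0, then update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```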
Why It Matters
It's how deep networks actually get trained. Gradient descent updates weights by w ← w - η · ∂L/∂w. Backprop is what computes ∂L/∂w for thousands (or millions) of weights efficiently.
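A one-step sketch of that update on a single made-up scalar weight (the numbers are purely illustrative):

```python
# One gradient-descent step: w ← w - η · ∂L/∂w
w, eta = 0.5, 0.1
x, y = 2.0, 3.0              # one training example, loss L = (w·x - y)²
grad = 2 * (w * x - y) * x   # ∂L/∂w by the chain rule
w = w - eta * grad
print(w)                     # 1.3, moved toward the optimum y / x = 1.5
```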
Computational graph. PyTorch and TensorFlow build a computation graph during the forward pass and traverse it in reverse during .backward(). You get automatic differentiation of any differentiable function — not just standard network architectures.
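For instance, autograd differentiates arbitrary expressions, not just `nn` layers; a minimal sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + torch.sin(x)   # forward pass builds the computation graph
y.backward()                # reverse traversal fills x.grad
print(x.grad)               # 3x² + cos(x) evaluated at x = 2, about 11.58
```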
Why some activations fail. Sigmoid's derivative is at most 0.25 and approaches 0 once the input saturates. Those per-layer factors multiply: in a 10-layer network, 0.25¹⁰ ≈ 0.000001, so the gradient reaching the first layer is essentially zero. ReLU's derivative is 1 wherever the unit is active, which is why it largely avoids the vanishing-gradient problem in practical networks.
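You can see the effect empirically. The 10-layer toy comparison below is illustrative (not from the text): it prints the gradient norm at the very first layer for sigmoid vs. ReLU activations; the sigmoid version is typically orders of magnitude smaller.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation):
    torch.manual_seed(0)
    layers = []
    for _ in range(10):
        layers += [nn.Linear(32, 32), activation()]
    layers += [nn.Linear(32, 1)]
    net = nn.Sequential(*layers)
    loss = net(torch.randn(16, 32)).pow(2).mean()
    loss.backward()
    return net[0].weight.grad.norm().item()  # gradient norm at layer 1

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid))
print("relu:   ", first_layer_grad_norm(nn.ReLU))
```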
Common Pitfalls
- Forgetting to zero gradients in PyTorch. Gradients accumulate by default. Always call `optimizer.zero_grad()` before each backward pass, or you're adding gradients from the previous batch (demonstrated in the sketch after this list).
- Wrong gradient shape. `∂L/∂W` must have the same shape as `W`. A mismatch usually means a transposition error in the chain rule: check where the transpose goes (in the example below, `dL_dW2 = a1.T @ dL_dz2`).
- Computing gradients through non-differentiable ops. `argmax`, `round`, and hard comparisons have zero derivative almost everywhere, so they break backprop. Use differentiable approximations (softmax instead of argmax, straight-through estimators, etc.).
- Detaching tensors you need for the backward pass. If you detach an activation during the forward pass, its gradient won't flow. This is a common source of bugs in custom architectures.
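A minimal sketch of the first pitfall: calling `.backward()` twice without zeroing doubles the stored gradient.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)
x = torch.ones(1, 2)

layer(x).sum().backward()
first = layer.weight.grad.clone()

layer(x).sum().backward()                              # no zero_grad() in between
print(torch.allclose(layer.weight.grad, 2 * first))    # True: gradients accumulated

layer.zero_grad()                                      # or optimizer.zero_grad()
layer(x).sum().backward()
print(torch.allclose(layer.weight.grad, first))        # True: fresh gradient again
```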
Examples
# Manual backprop on a tiny network — matching PyTorch's autograd
import numpy as np
np.random.seed(0)
x = np.array([[1.0, 2.0]]) # (1, 2)
y = np.array([[1.0]]) # target
W1 = np.random.randn(2, 4) # (2, 4)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) # (4, 1)
b2 = np.zeros((1, 1))
# --- Forward pass ---
z1 = x @ W1 + b1 # (1, 4)
a1 = np.maximum(0, z1) # ReLU
z2 = a1 @ W2 + b2 # (1, 1)
loss = ((z2 - y) ** 2).mean()
# --- Backward pass ---
dL_dz2 = 2 * (z2 - y) / z2.size # ∂L/∂z2
dL_dW2 = a1.T @ dL_dz2 # (4, 1) — outer product
dL_db2 = dL_dz2.sum(axis=0)
dL_da1 = dL_dz2 @ W2.T # (1, 4) — propagate back through W2
dL_dz1 = dL_da1 * (z1 > 0) # ReLU derivative: 1 if active, 0 if not
dL_dW1 = x.T @ dL_dz1 # (2, 4)
dL_db1 = dL_dz1.sum(axis=0)
print(f"Loss: {loss:.6f}")
print(f"dL/dW2 shape: {dL_dW2.shape}") # (4, 1) ✓
print(f"dL/dW1 shape: {dL_dW1.shape}") # (2, 4) ✓
# PyTorch does the same thing automatically
import torch
import torch.nn as nn
torch.manual_seed(0)
model = nn.Sequential(
nn.Linear(2, 4),
nn.ReLU(),
nn.Linear(4, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
optimizer.zero_grad() # clear accumulated gradients
pred = model(x) # forward pass — builds computation graph
loss = ((pred - y) ** 2).mean()
loss.backward() # backprop — fills .grad for every parameter
optimizer.step() # update weights using gradients
for name, p in model.named_parameters():
    print(f"{name}: grad = {p.grad}")
What `.backward()` does internally:
- Calls each layer's backward function in reverse order
- Each backward function multiplies the incoming gradient by the local derivative
- Accumulates the result into `.grad` for any leaf parameters
- Passes the signal down to the next layer in the chain
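That reverse walk is easiest to see by writing one backward function by hand. A minimal sketch using `torch.autograd.Function` (the `Square` op is illustrative):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)     # forward pass stores what backward will need
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x   # incoming gradient × local derivative

x = torch.tensor([3.0], requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)                        # tensor([6.]), accumulated into the leaf's .grad
```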