Backpropagation
Backpropagation is how neural networks learn — it applies the chain rule backwards through every layer to compute how much each weight contributed to the error, then nudges each weight in the right direction.
Intuition First
Imagine you're running a factory. The final product came out wrong. You need to figure out who to blame — and by how much — so you can tell each worker to adjust their output.
You start at the end: the quality inspector says the product scored 3 when it should have scored 10. You trace backwards: the assembly line worker got bad parts. The parts came from the machinist, who used bad metal. The metal came from a supplier who shipped the wrong grade.
Each link in the chain carries some responsibility. You multiply the responsibilities together as you trace backwards. That's backpropagation.
What's Actually Happening
A neural network is a chain of composed functions:
output = fₗ(fₗ₋₁(... f₂(f₁(x))))
Training requires computing ∂Loss/∂w for every weight w in every layer. Backpropagation does this efficiently in two passes:
- Forward pass — compute the output, saving intermediate values at every layer
- Backward pass — walk backwards from the loss, applying the chain rule at each layer, reusing the saved values
The key insight: gradients are just products of derivatives through the chain. Once you've computed the gradient at layer L, you can multiply by the local derivative to get the gradient at layer L-1.
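To make "products of derivatives" concrete, here is a minimal sketch with three made-up scalar functions (f₁(x) = 3x, f₂(u) = u², f₃(v) = sin v, chosen purely for illustration). Multiplying each stage's local derivative, from the output back to the input, gives the full derivative:

```python
import numpy as np

# Forward pass through a tiny chain y = f3(f2(f1(x))), saving intermediates.
x = 0.5
u = 3.0 * x        # f1: local derivative du/dx = 3
v = u ** 2         # f2: local derivative dv/du = 2u
y = np.sin(v)      # f3: local derivative dy/dv = cos(v)

# Backward pass: multiply local derivatives from the output back to the input.
dy_dv = np.cos(v)
dy_du = dy_dv * (2 * u)
dy_dx = dy_du * 3.0

# Sanity check against a finite-difference estimate of dy/dx.
eps = 1e-6
f = lambda t: np.sin((3.0 * t) ** 2)
print(dy_dx, (f(x + eps) - f(x - eps)) / (2 * eps))  # the two values should agree closely
```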
Build the Idea Step-by-Step
Formal Explanation
A minimal 2-layer network:
z₁ = W₁x + b₁ (linear transform, layer 1)
a₁ = ReLU(z₁) (activation, layer 1)
z₂ = W₂a₁ + b₂ (linear transform, layer 2)
Loss = MSE(z₂, y) (loss — mean squared error)
Forward pass — just compute and save everything:
z₁ = W₁x + b₁
a₁ = max(0, z₁)
z₂ = W₂a₁ + b₂
L = (z₂ - y)²
Backward pass — chain rule from Loss back to W₁:
∂L/∂z₂ = 2(z₂ - y) (loss → layer 2 pre-activation)
∂L/∂W₂ = ∂L/∂z₂ · a₁ᵀ (gradient for W₂)
∂L/∂b₂ = ∂L/∂z₂ (gradient for b₂)
∂L/∂a₁ = W₂ᵀ · ∂L/∂z₂ (chain: propagate through W₂)
∂L/∂z₁ = ∂L/∂a₁ ⊙ ReLU'(z₁) (chain: through activation — ⊙ = elementwise)
∂L/∂W₁ = ∂L/∂z₁ · xᵀ (gradient for W₁)
∂L/∂b₁ = ∂L/∂z₁ (gradient for b₁)
The pattern: every layer receives a "gradient signal" from the layer above, multiplies it by its own local derivative, then passes it down.
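That pattern translates almost line-for-line into code. The sketch below uses illustrative `Linear` and `ReLU` classes (not any specific library): each `backward` takes the incoming gradient signal, stores its own parameter gradients, and returns the signal for the layer below.

```python
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.1
        self.b = np.zeros(out_dim)

    def forward(self, x):
        self.x = x                       # save the input; backward needs it
        return x @ self.W + self.b

    def backward(self, grad_out):        # grad_out = gradient signal from the layer above
        self.dW = self.x.T @ grad_out    # gradient for this layer's weights
        self.db = grad_out.sum(axis=0)   # gradient for this layer's bias
        return grad_out @ self.W.T       # signal passed down to the layer below

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask      # local derivative: 1 where active, 0 elsewhere

# Chain the layers forward, then walk the gradient signal backwards.
np.random.seed(0)
lin1, act, lin2 = Linear(2, 4), ReLU(), Linear(4, 1)
x = np.array([[1.0, 2.0]])
y = np.array([[1.0]])
z2 = lin2.forward(act.forward(lin1.forward(x)))
lin1.backward(act.backward(lin2.backward(2 * (z2 - y))))  # 2(z2 - y) is ∂L/∂z2
print(lin1.dW.shape)  # (2, 4), same shape as lin1.W
```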
Key Properties / Rules
| Concept | What It Means |
|---|---|
| Forward pass stores activations | Backward pass needs them — don't recompute |
| Gradients are products of derivatives | Each layer multiplies the signal by its local slope |
| ∂L/∂W = δ · aᵀ | Outer product: gradient for any weight matrix |
| ∂L/∂a = Wᵀ · δ | Propagating the signal to the layer below |
| Vanishing gradient | If activations have small derivatives (e.g. sigmoid), gradients shrink multiplicatively and early layers learn nothing |
| Exploding gradient | If derivatives are large, gradients grow and destabilize training — clipping or normalization fixes this |
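For the exploding-gradient row, the usual PyTorch fix is to clip gradients between `loss.backward()` and `optimizer.step()`. A minimal sketch (the model and data here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 2), torch.randn(8, 1)

optimizer.zero_grad()
loss = ((model(x) - y) ** 2).mean()
loss.backward()
# Rescale all gradients so their combined norm is at most 1.0, then update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```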
Why It Matters
It's how deep networks actually get trained. Gradient descent updates weights by w ← w - η · ∂L/∂w. Backprop is what computes ∂L/∂w for thousands (or millions) of weights efficiently.
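A one-step sketch of that update on a single made-up scalar weight (the numbers are purely illustrative):

```python
# One gradient-descent step: w ← w - η · ∂L/∂w
w, eta = 0.5, 0.1
x, y = 2.0, 3.0              # one training example, loss L = (w·x - y)²
grad = 2 * (w * x - y) * x   # ∂L/∂w by the chain rule
w = w - eta * grad
print(w)                     # 1.3, moved toward the optimum y / x = 1.5
```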
Computational graph. PyTorch and TensorFlow build a computation graph during the forward pass and traverse it in reverse during .backward(). You get automatic differentiation of any differentiable function — not just standard network architectures.
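For instance, autograd differentiates arbitrary expressions, not just `nn` layers; a minimal sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + torch.sin(x)   # forward pass builds the computation graph
y.backward()                # reverse traversal fills x.grad
print(x.grad)               # 3x² + cos(x) evaluated at x = 2, about 11.58
```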
Why some activations fail. Sigmoid's derivative is at most 0.25 and approaches 0 once the input saturates. Those per-layer factors multiply: in a 10-layer network, 0.25¹⁰ ≈ 0.000001, so the gradient reaching the first layer is essentially zero. ReLU's derivative is 1 wherever the unit is active, which is why it largely avoids the vanishing-gradient problem in practical networks.
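You can see the effect empirically. The 10-layer toy comparison below is illustrative (not from the text): it prints the gradient norm at the very first layer for sigmoid vs. ReLU activations; the sigmoid version is typically orders of magnitude smaller.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation):
    torch.manual_seed(0)
    layers = []
    for _ in range(10):
        layers += [nn.Linear(32, 32), activation()]
    layers += [nn.Linear(32, 1)]
    net = nn.Sequential(*layers)
    loss = net(torch.randn(16, 32)).pow(2).mean()
    loss.backward()
    return net[0].weight.grad.norm().item()  # gradient norm at layer 1

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid))
print("relu:   ", first_layer_grad_norm(nn.ReLU))
```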
Common Pitfalls
- Forgetting to zero gradients in PyTorch. Gradients accumulate by default. Always call `optimizer.zero_grad()` before each backward pass, or you're adding gradients from the previous batch (demonstrated in the sketch after this list).
- Wrong gradient shape. `∂L/∂W` must have the same shape as `W`. A mismatch usually means a transposition error in the chain rule: check where the transpose goes (in the example below, `dL_dW2 = a1.T @ dL_dz2`).
- Computing gradients through non-differentiable ops. `argmax`, `round`, and hard comparisons have zero derivative almost everywhere, so they break backprop. Use differentiable approximations (softmax instead of argmax, straight-through estimators, etc.).
- Detaching tensors you need for the backward pass. If you detach an activation during the forward pass, its gradient won't flow. This is a common source of bugs in custom architectures.
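A minimal sketch of the first pitfall: calling `.backward()` twice without zeroing doubles the stored gradient.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)
x = torch.ones(1, 2)

layer(x).sum().backward()
first = layer.weight.grad.clone()

layer(x).sum().backward()                              # no zero_grad() in between
print(torch.allclose(layer.weight.grad, 2 * first))    # True: gradients accumulated

layer.zero_grad()                                      # or optimizer.zero_grad()
layer(x).sum().backward()
print(torch.allclose(layer.weight.grad, first))        # True: fresh gradient again
```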
Examples
# Manual backprop on a tiny network — matching PyTorch's autograd
import numpy as np
np.random.seed(0)
x = np.array([[1.0, 2.0]]) # (1, 2)
y = np.array([[1.0]]) # target
W1 = np.random.randn(2, 4) # (2, 4)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) # (4, 1)
b2 = np.zeros((1, 1))
# --- Forward pass ---
z1 = x @ W1 + b1 # (1, 4)
a1 = np.maximum(0, z1) # ReLU
z2 = a1 @ W2 + b2 # (1, 1)
loss = ((z2 - y) ** 2).mean()
# --- Backward pass ---
dL_dz2 = 2 * (z2 - y) / z2.size # ∂L/∂z2
dL_dW2 = a1.T @ dL_dz2 # (4, 1) — outer product
dL_db2 = dL_dz2.sum(axis=0)
dL_da1 = dL_dz2 @ W2.T # (1, 4) — propagate back through W2
dL_dz1 = dL_da1 * (z1 > 0) # ReLU derivative: 1 if active, 0 if not
dL_dW1 = x.T @ dL_dz1 # (2, 4)
dL_db1 = dL_dz1.sum(axis=0)
print(f"Loss: {loss:.6f}")
print(f"dL/dW2 shape: {dL_dW2.shape}") # (4, 1) ✓
print(f"dL/dW1 shape: {dL_dW1.shape}") # (2, 4) ✓
# PyTorch does the same thing automatically
import torch
import torch.nn as nn
torch.manual_seed(0)
model = nn.Sequential(
nn.Linear(2, 4),
nn.ReLU(),
nn.Linear(4, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
optimizer.zero_grad() # clear accumulated gradients
pred = model(x) # forward pass — builds computation graph
loss = ((pred - y) ** 2).mean()
loss.backward() # backprop — fills .grad for every parameter
optimizer.step() # update weights using gradients
for name, p in model.named_parameters():
    print(f"{name}: grad = {p.grad}")
What `.backward()` does internally:
- Calls each layer's backward function in reverse order
- Each backward function multiplies the incoming gradient by the local derivative
- Accumulates the result into `.grad` for any leaf parameters
- Passes the signal down to the next layer in the chain
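That reverse walk is easiest to see by writing one backward function by hand. A minimal sketch using `torch.autograd.Function` (the `Square` op is illustrative):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)     # forward pass stores what backward will need
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x   # incoming gradient × local derivative

x = torch.tensor([3.0], requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)                        # tensor([6.]), accumulated into the leaf's .grad
```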