# Partial Derivatives
When a function takes multiple inputs, partial derivatives measure the effect of changing one input while holding all others fixed. Every weight in a neural network gets its own partial derivative of the loss.
## Intuition First
Imagine you're adjusting the temperature and humidity in a greenhouse to maximize plant growth. If you change only the temperature (keeping humidity fixed), you can measure how growth responds to that single change. Then hold temperature fixed and change humidity. Each isolated measurement is a partial derivative.
Neural networks have millions of weights. Backprop computes the partial derivative of the loss with respect to each individual weight — holding all others fixed. That's how the network learns which weights to adjust.
## What's Actually Happening
With a single-input function f(x), there's only one direction to change. With f(x, y), you can move in the x-direction or the y-direction independently.
The partial derivative ∂f/∂x asks: "if I change only x by a tiny amount (keeping y constant), how much does f change?"
You compute it exactly like an ordinary derivative, but treat every other variable as if it were a fixed constant.
## Build the Idea Step-by-Step
### Formal Explanation
For f(x, y):
∂f/∂x = lim_{Δx→0} [f(x+Δx, y) - f(x, y)] / Δx (y held fixed)
∂f/∂y = lim_{Δy→0} [f(x, y+Δy) - f(x, y)] / Δy (x held fixed)
Notation: ∂f/∂x is read "the partial of f with respect to x." The curly ∂ (vs straight d) signals that other variables exist.
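To make the definitions concrete, here is a minimal numerical sketch: it approximates both partials of f(x, y) = x²y with a small but finite step in place of the limit (the function, point, and step size are arbitrary choices for illustration):

```python
# Approximate the partials of f(x, y) = x**2 * y at (x, y) = (2, 3)
# by plugging a small finite step into the limit definitions above.
def f(x, y):
    return x**2 * y

x, y, h = 2.0, 3.0, 1e-6

df_dx = (f(x + h, y) - f(x, y)) / h  # y held fixed; exact value is 2xy = 12
df_dy = (f(x, y + h) - f(x, y)) / h  # x held fixed; exact value is x² = 4

print(df_dx, df_dy)  # ≈ 12.000003 and 4.0 (small finite-step error)
```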
Examples:
f(x, y) = x² + 3y → ∂f/∂x = 2x, ∂f/∂y = 3
f(x, y) = x²y → ∂f/∂x = 2xy, ∂f/∂y = x²
f(x, y) = x³ + 5xy + y² → ∂f/∂x = 3x² + 5y, ∂f/∂y = 5x + 2y
The rule: when computing ∂f/∂x, treat all other variables as numbers.
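As a quick sanity check, PyTorch's autograd reproduces the third example above (the evaluation point is an arbitrary choice):

```python
import torch

# f(x, y) = x³ + 5xy + y², evaluated at (x, y) = (2, 1)
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

f = x**3 + 5*x*y + y**2
f.backward()

print(x.grad)  # ∂f/∂x = 3x² + 5y = 12 + 5 = 17.0
print(y.grad)  # ∂f/∂y = 5x + 2y = 10 + 2 = 12.0
```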
### Key Properties / Rules
| Concept | Meaning |
|---|---|
| ∂f/∂x | Rate of change in the x-direction only |
| Treat others as constants | Exactly like regular derivative, but ignore other variables |
| Order of partial derivatives | ∂²f/∂x∂y = ∂²f/∂y∂x for smooth functions (mixed partials are equal) |
| Multiple variables | A function of n inputs has n partial derivatives |
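The equality of mixed partials can be checked with second-order autograd. This is a sketch using an arbitrarily chosen smooth function, not a canonical example:

```python
import torch

# f(x, y) = x³y²: both mixed partials equal 6x²y, whichever order you differentiate
x = torch.tensor(1.5, requires_grad=True)
y = torch.tensor(-2.0, requires_grad=True)
f = x**3 * y**2

# First partials; create_graph=True keeps them differentiable
df_dx, = torch.autograd.grad(f, x, create_graph=True)
df_dy, = torch.autograd.grad(f, y, create_graph=True)

# Differentiate each first partial with respect to the *other* variable
d2f_dxdy, = torch.autograd.grad(df_dx, y, retain_graph=True)
d2f_dydx, = torch.autograd.grad(df_dy, x)

print(d2f_dxdy, d2f_dydx)  # both 6x²y = 6 * 2.25 * (-2) = -27.0
```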
## Why It Matters
A neural network's loss L depends on every weight: L(w₁, w₂, ..., wₙ). Training requires knowing ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ — one partial per weight.
These partials answer: "for each individual weight, if I increase it slightly, does loss go up or down?" This is the information backpropagation computes.
Collecting all these partials into a single vector gives you the gradient (see next note).
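A minimal sketch of that idea, using a made-up loss over a small weight vector: after backward(), w.grad holds all n partials at once, which is exactly the gradient.

```python
import torch

# Toy loss over a vector of 4 weights: L(w₁..w₄) = Σ wᵢ²
w = torch.tensor([0.5, -1.0, 2.0, 0.0], requires_grad=True)
loss = (w**2).sum()
loss.backward()

print(w.grad)  # tensor([ 1., -2.,  4.,  0.]): one partial ∂L/∂wᵢ = 2wᵢ per weight
```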
## Common Pitfalls
- Forgetting to treat other variables as constants. For `f(x, y) = x²y`, the partial `∂f/∂x = 2xy`, not just `2x`. The `y` stays because it's multiplied into the expression: it's a constant factor, not zero.
- Confusing `d` and `∂`. `df/dx` means total derivative (only one variable). `∂f/∂x` means partial derivative (other variables exist but are held fixed). Using the wrong one is a notation error.
- `∂f/∂x = 0` for variables not present. If `f(x, y) = y² + 5`, then `∂f/∂x = 0`: `x` doesn't appear, so changing it does nothing.
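The last pitfall is easy to verify. Note that PyTorch reports `None` rather than an explicit zero for a leaf the output never touched:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = y**2 + 5  # x never appears in f
f.backward()

print(y.grad)  # ∂f/∂y = 2y = 6.0
print(x.grad)  # None: autograd never saw x, its way of reporting ∂f/∂x = 0
```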
## Examples
```python
import torch

# f(x, y) = x^2 * y + y^3
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x**2 * y + y**3
f.backward()

print(x.grad)  # ∂f/∂x = 2xy = 2*2*3 = 12.0
print(y.grad)  # ∂f/∂y = x² + 3y² = 4 + 27 = 31.0
```
Neural network perspective:
```python
# Loss depends on 2 weights
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)

# Simplified loss: L = (w1 - 1)^2 + (w2 + 0.5)^2
loss = (w1 - 1)**2 + (w2 + 0.5)**2
loss.backward()

print(w1.grad)  # ∂L/∂w1 = 2*(w1-1) = 2*(0.5-1) = -1.0 → should increase w1
print(w2.grad)  # ∂L/∂w2 = 2*(w2+0.5) = 2*(-1+0.5) = -1.0 → should increase w2
```
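To connect the partials to an actual update, here is a hypothetical single gradient-descent step using the two gradients above (the learning rate 0.1 is an arbitrary choice):

```python
# One gradient-descent step: move each weight against its partial derivative
lr = 0.1
with torch.no_grad():
    w1 -= lr * w1.grad  # 0.5 - 0.1 * (-1.0) = 0.6, toward the minimum at w1 = 1.0
    w2 -= lr * w2.grad  # -1.0 - 0.1 * (-1.0) = -0.9, toward the minimum at w2 = -0.5
print(w1.item(), w2.item())  # ≈ 0.6, ≈ -0.9
```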
Each partial derivative tells the training loop exactly how sensitive the loss is to one specific weight.