# Partial Derivatives
When a function takes multiple inputs, partial derivatives measure the effect of changing one input while holding all others fixed. Every weight in a neural network gets its own partial derivative of the loss.
## Intuition First
Imagine you're adjusting the temperature and humidity in a greenhouse to maximize plant growth. If you change only the temperature (keeping humidity fixed), you can measure how growth responds to that single change. Then hold temperature fixed and change humidity. Each isolated measurement is a partial derivative.
Neural networks have millions of weights. Backprop computes the partial derivative of the loss with respect to each individual weight — holding all others fixed. That's how the network learns which weights to adjust.
## What's Actually Happening
With a single-input function f(x), there's only one direction to change. With f(x, y), you can move in the x-direction or the y-direction independently.
The partial derivative ∂f/∂x asks: "if I change only x by a tiny amount (keeping y constant), how much does f change?"
You compute it exactly like an ordinary derivative, but treat every other variable as if it were a fixed constant.
## Build the Idea Step-by-Step
### Formal Explanation
For f(x, y):
∂f/∂x = lim_{Δx→0} [f(x+Δx, y) - f(x, y)] / Δx (y held fixed)
∂f/∂y = lim_{Δy→0} [f(x, y+Δy) - f(x, y)] / Δy (x held fixed)
Notation: ∂f/∂x is read "the partial of f with respect to x." The curly ∂ (vs straight d) signals that other variables exist.
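To make the definitions concrete, here is a minimal numerical sketch: it approximates both partials of f(x, y) = x²y with a small but finite step in place of the limit (the function, point, and step size are arbitrary choices for illustration):

```python
# Approximate the partials of f(x, y) = x**2 * y at (x, y) = (2, 3)
# by plugging a small finite step into the limit definitions above.
def f(x, y):
    return x**2 * y

x, y, h = 2.0, 3.0, 1e-6

df_dx = (f(x + h, y) - f(x, y)) / h  # y held fixed; exact value is 2xy = 12
df_dy = (f(x, y + h) - f(x, y)) / h  # x held fixed; exact value is x² = 4

print(df_dx, df_dy)  # ≈ 12.000003 and 4.0 (small finite-step error)
```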
Examples:
f(x, y) = x² + 3y → ∂f/∂x = 2x, ∂f/∂y = 3
f(x, y) = x²y → ∂f/∂x = 2xy, ∂f/∂y = x²
f(x, y) = x³ + 5xy + y² → ∂f/∂x = 3x² + 5y, ∂f/∂y = 5x + 2y
The rule: when computing ∂f/∂x, treat all other variables as numbers.
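As a quick sanity check, PyTorch's autograd reproduces the third example above (the evaluation point is an arbitrary choice):

```python
import torch

# f(x, y) = x³ + 5xy + y², evaluated at (x, y) = (2, 1)
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

f = x**3 + 5*x*y + y**2
f.backward()

print(x.grad)  # ∂f/∂x = 3x² + 5y = 12 + 5 = 17.0
print(y.grad)  # ∂f/∂y = 5x + 2y = 10 + 2 = 12.0
```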
### Key Properties / Rules
| Concept | Meaning |
|---|---|
| ∂f/∂x | Rate of change in the x-direction only |
| Treat others as constants | Exactly like regular derivative, but ignore other variables |
| Order of partial derivatives | ∂²f/∂x∂y = ∂²f/∂y∂x for smooth functions (mixed partials are equal) |
| Multiple variables | A function of n inputs has n partial derivatives |
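The equality of mixed partials can be checked with second-order autograd. This is a sketch using an arbitrarily chosen smooth function, not a canonical example:

```python
import torch

# f(x, y) = x³y²: both mixed partials equal 6x²y, whichever order you differentiate
x = torch.tensor(1.5, requires_grad=True)
y = torch.tensor(-2.0, requires_grad=True)
f = x**3 * y**2

# First partials; create_graph=True keeps them differentiable
df_dx, = torch.autograd.grad(f, x, create_graph=True)
df_dy, = torch.autograd.grad(f, y, create_graph=True)

# Differentiate each first partial with respect to the *other* variable
d2f_dxdy, = torch.autograd.grad(df_dx, y, retain_graph=True)
d2f_dydx, = torch.autograd.grad(df_dy, x)

print(d2f_dxdy, d2f_dydx)  # both 6x²y = 6 * 2.25 * (-2) = -27.0
```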
## Why It Matters
A neural network's loss L depends on every weight: L(w₁, w₂, ..., wₙ). Training requires knowing ∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ — one partial per weight.
These partials answer: "for each individual weight, if I increase it slightly, does loss go up or down?" This is the information backpropagation computes.
Collecting all these partials into a single vector gives you the gradient (see next note).
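A minimal sketch of that idea, using a made-up loss over a small weight vector: after backward(), w.grad holds all n partials at once, which is exactly the gradient.

```python
import torch

# Toy loss over a vector of 4 weights: L(w₁..w₄) = Σ wᵢ²
w = torch.tensor([0.5, -1.0, 2.0, 0.0], requires_grad=True)
loss = (w**2).sum()
loss.backward()

print(w.grad)  # tensor([ 1., -2.,  4.,  0.]): one partial ∂L/∂wᵢ = 2wᵢ per weight
```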
## Common Pitfalls
- Forgetting to treat other variables as constants. For `f(x, y) = x²y`, the partial `∂f/∂x = 2xy`, not just `2x`. The `y` stays because it's multiplied into the expression: it's a constant factor, not zero.
- Confusing `d` and `∂`. `df/dx` means total derivative (only one variable). `∂f/∂x` means partial derivative (other variables exist but are held fixed). Using the wrong one is a notation error.
- `∂f/∂x = 0` for variables not present. If `f(x, y) = y² + 5`, then `∂f/∂x = 0`: `x` doesn't appear, so changing it does nothing.
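The last pitfall is easy to verify. Note that PyTorch reports `None` rather than an explicit zero for a leaf the output never touched:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = y**2 + 5  # x never appears in f
f.backward()

print(y.grad)  # ∂f/∂y = 2y = 6.0
print(x.grad)  # None: autograd never saw x, its way of reporting ∂f/∂x = 0
```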
## Examples
```python
import torch

# f(x, y) = x^2 * y + y^3
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x**2 * y + y**3
f.backward()

print(x.grad)  # ∂f/∂x = 2xy = 2*2*3 = 12.0
print(y.grad)  # ∂f/∂y = x² + 3y² = 4 + 27 = 31.0
```
Neural network perspective:
```python
# Loss depends on 2 weights
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)

# Simplified loss: L = (w1 - 1)^2 + (w2 + 0.5)^2
loss = (w1 - 1)**2 + (w2 + 0.5)**2
loss.backward()

print(w1.grad)  # ∂L/∂w1 = 2*(w1-1) = 2*(0.5-1) = -1.0 → should increase w1
print(w2.grad)  # ∂L/∂w2 = 2*(w2+0.5) = 2*(-1+0.5) = -1.0 → should increase w2
```
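To connect the partials to an actual update, here is a hypothetical single gradient-descent step using the two gradients above (the learning rate 0.1 is an arbitrary choice):

```python
# One gradient-descent step: move each weight against its partial derivative
lr = 0.1
with torch.no_grad():
    w1 -= lr * w1.grad  # 0.5 - 0.1 * (-1.0) = 0.6, toward the minimum at w1 = 1.0
    w2 -= lr * w2.grad  # -1.0 - 0.1 * (-1.0) = -0.9, toward the minimum at w2 = -0.5
print(w1.item(), w2.item())  # ≈ 0.6, ≈ -0.9
```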
Each partial derivative tells the training loop exactly how sensitive the loss is to one specific weight.