Derivatives
A derivative measures how much a function's output changes when you nudge its input. It's the foundation of gradient-based learning — without it, there's no way to know which way to adjust the weights.
Intuition First
You're driving on a highway. Your speedometer reads 60 mph. That's a derivative — the rate at which your position is changing right now. Not how far you've traveled total, but how fast things are changing at this instant.
In ML, you're not measuring speed. You're asking: "if I change this weight slightly, how much does the loss change?" That rate-of-change is what a derivative tells you — and it's exactly what you need to improve the model.
What's Actually Happening
A derivative is the slope of a curve at a single point.
Take f(x) = x². If you plot it, it's a parabola. At x = 1, the curve is rising gently. At x = 3, it's rising steeply. The slope is different at every point — and the derivative f'(x) is the function that tells you that slope.
The key insight: the slope at a point is found by zooming in until the curve looks like a straight line. That line's slope is the derivative.
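To make "zooming in" concrete, here's a quick sketch (the choice of f(x) = x², the point x = 1, and the interval widths are just illustrative). As the interval shrinks, the slope of the line through two nearby points settles toward a single number: the derivative at that point.

```python
# Slope of the line through (x, f(x)) and (x + dx, f(x + dx)) for f(x) = x**2 at x = 1.
# As dx shrinks, this secant slope approaches the tangent slope f'(1) = 2.
f = lambda x: x**2
x = 1.0
for dx in [1.0, 0.1, 0.01, 0.001]:
    slope = (f(x + dx) - f(x)) / dx
    print(f"dx = {dx:<6} slope ≈ {slope:.4f}")
# dx = 1.0    slope ≈ 3.0000
# dx = 0.1    slope ≈ 2.1000
# dx = 0.01   slope ≈ 2.0100
# dx = 0.001  slope ≈ 2.0010
```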
Build the Idea Step-by-Step
Formal Explanation
The derivative is defined as:
f'(x) = lim_{Δx→0} [f(x + Δx) - f(x)] / Δx
You don't need to evaluate this limit by hand; a handful of standard rules give you shortcuts:
Common derivatives (a few of these are checked numerically below):
- f(x) = c → f'(x) = 0 (constant doesn't change)
- f(x) = x → f'(x) = 1 (slope is always 1)
- f(x) = x² → f'(x) = 2x (slope depends on x)
- f(x) = x³ → f'(x) = 3x²
- f(x) = xⁿ → f'(x) = n·xⁿ⁻¹ (power rule)
- f(x) = eˣ → f'(x) = eˣ (its own derivative)
- f(x) = ln(x) → f'(x) = 1/x
Notation: f'(x), df/dx, and d/dx f(x) all mean the same thing.
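A quick way to convince yourself these entries are right (a sanity check, not a derivation) is to compare each rule against a finite-difference slope built straight from the limit definition above. The `approx_slope` helper below is just an illustrative one-liner for that purpose:

```python
import math

# Finite-difference slope: approximates the limit definition with a small dx.
def approx_slope(f, x, dx=1e-6):
    return (f(x + dx) - f(x)) / dx

x = 2.0
print(approx_slope(lambda t: t**3, x), 3 * x**2)  # ≈ 12.0 vs 12.0   (x³ → 3x²)
print(approx_slope(math.exp, x), math.exp(x))     # ≈ 7.389 vs 7.389 (eˣ → eˣ)
print(approx_slope(math.log, x), 1 / x)           # ≈ 0.5 vs 0.5     (ln x → 1/x)
```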
Key Properties / Rules
| Rule | Formula | When to use |
|---|---|---|
| Power rule | (xⁿ)' = n·xⁿ⁻¹ | Any polynomial |
| Sum rule | (f+g)' = f' + g' | Adding functions |
| Constant multiple | (c·f)' = c·f' | Scaling |
| Product rule | (f·g)' = f'g + fg' | Multiplied functions |
| Chain rule | see Chain Rule note | Composed functions |
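The product rule is the least obvious entry in the table, so here's a small numerical check (the specific functions and evaluation point are arbitrary choices for illustration):

```python
import math

# By the product rule, d/dx [x² · sin(x)] should equal 2x·sin(x) + x²·cos(x).
def approx_slope(f, x, dx=1e-6):
    return (f(x + dx) - f(x)) / dx

x = 1.5
numeric = approx_slope(lambda t: t**2 * math.sin(t), x)
analytic = 2 * x * math.sin(x) + x**2 * math.cos(x)
print(numeric, analytic)  # both ≈ 3.1516
```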
Why It Matters
Training a neural network means adjusting weights to minimize the loss. To do that, you need to know: "if I increase this weight by a tiny amount, does the loss go up or down, and by how much?"
That's the derivative of the loss with respect to the weight: ∂L/∂w. If ∂L/∂w > 0, the loss increases when w increases — so you decrease w. If it's negative, increase w. The derivative tells you exactly which direction to move.
Without derivatives, training is blind.
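As a minimal sketch of that idea (a made-up one-parameter "model", not a real training setup), the sign of ∂L/∂w is all you need to pick a direction:

```python
import torch

# One weight, squared-error loss. dL/dw > 0 means "loss rises if w rises", so step down.
w = torch.tensor(2.0, requires_grad=True)
loss = (3.0 * w - 5.0) ** 2   # pretend the prediction is 3·w and the target is 5
loss.backward()

print(w.grad)                 # tensor(6.): positive, so decrease w
with torch.no_grad():
    w -= 0.1 * w.grad         # move against the slope
print(w)                      # tensor(1.4000, requires_grad=True)
```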
Common Pitfalls
- Confusing f(x) with f'(x). f(x) = x² is the function; f'(x) = 2x is its rate of change. Different objects.
- f'(x) = 0 doesn't always mean minimum. It could be a maximum, or a saddle point. A flat slope means "critical point" — could go either way (see the sketch after this list).
- Derivative is a local concept. f'(x) is the slope at x, not the slope everywhere. The parabola x² has a different slope at every point.
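For the second pitfall, here's a quick numerical illustration (`slope_at` is a hypothetical helper using the same finite-difference idea as the Examples below). All three functions have slope 0 at x = 0, but only the first has a minimum there:

```python
# Finite-difference slope at a point.
def slope_at(f, x, dx=1e-5):
    return (f(x + dx) - f(x)) / dx

print(slope_at(lambda x: x**2, 0.0))    # ≈ 0: x² has a minimum at 0
print(slope_at(lambda x: -x**2, 0.0))   # ≈ 0: -x² has a maximum at 0
print(slope_at(lambda x: x**3, 0.0))    # ≈ 0: x³ just flattens out and keeps rising
```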
Examples
```python
# Finite difference approximation of derivative (conceptual)
def numerical_derivative(f, x, dx=1e-5):
    return (f(x + dx) - f(x)) / dx

f = lambda x: x**2
print(numerical_derivative(f, x=3))   # ≈ 6.0  (matches f'(3) = 2·3 = 6)
print(numerical_derivative(f, x=0))   # ≈ 0.0  (flat at the bottom of parabola)
print(numerical_derivative(f, x=-2))  # ≈ -4.0 (negative slope on left side)
```
```python
# PyTorch computes exact derivatives for you
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2
y.backward()   # compute dy/dx
print(x.grad)  # tensor(6.) — exactly 2x at x=3
```
Reading the slope:
- f'(x) > 0 → function is increasing at x (moving right goes up)
- f'(x) < 0 → function is decreasing at x (moving right goes down)
- f'(x) = 0 → flat at x (critical point — could be min, max, or saddle)
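Putting those three cases to work (a toy sketch; the starting point and the step size 0.25 are arbitrary): repeatedly stepping against the sign of the slope walks x toward the minimum of f(x) = x².

```python
# f(x) = x**2, so f'(x) = 2x. Positive slope → step left, negative slope → step right.
f_prime = lambda x: 2 * x
x = 3.0
for _ in range(5):
    x -= 0.25 * f_prime(x)
    print(x)
# 1.5, 0.75, 0.375, 0.1875, 0.09375
```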