Mnemosyne

Derivatives

A derivative measures how much a function's output changes when you nudge its input. It's the foundation of every learning algorithm — without it, there's no way to know which way to adjust weights.

Intuition First

You're driving on a highway. Your speedometer reads 60 mph. That's a derivative — the rate at which your position is changing right now. Not how far you've traveled total, but how fast things are changing at this instant.

In ML, you're not measuring speed. You're asking: "if I change this weight slightly, how much does the loss change?" That rate-of-change is what a derivative tells you — and it's exactly what you need to improve the model.


What's Actually Happening

A derivative is the slope of a curve at a single point.

Take f(x) = x². If you plot it, it's a parabola. At x = 1, the curve is rising gently. At x = 3, it's rising steeply. The slope is different at every point — and the derivative f'(x) is the function that tells you that slope.

The key insight: the slope at a point is found by zooming in until the curve looks like a straight line. That line's slope is the derivative.


Build the Idea Step-by-Step

f(x): some curved function
Pick a point x
Nudge input by a tiny Δx
Measure Δy / Δx
Let Δx → 0 → exact slope at x
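The steps above can be run numerically: shrink Δx and watch the difference quotient settle on the exact slope. A minimal sketch for f(x) = x² at x = 3, where the exact answer is 2·3 = 6:

```python
# Difference quotient Δy/Δx for f(x) = x² at x = 3, with shrinking Δx
f = lambda x: x**2
x = 3.0
for dx in (1.0, 0.1, 0.01, 0.001):
    slope = (f(x + dx) - f(x)) / dx
    print(f"dx={dx:<6} slope={slope}")
# slope: 7.0 → 6.1 → 6.01 → 6.001, converging to the exact derivative 6
```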

Formal Explanation

The derivative is defined as:

f'(x) = lim_{Δx→0}  [f(x + Δx) - f(x)] / Δx

You don't need to evaluate this limit by hand. Instead, rules give you shortcuts:

Common derivatives:

f(x) = c        →  f'(x) = 0        (constant doesn't change)
f(x) = x        →  f'(x) = 1        (slope is always 1)
f(x) = x²       →  f'(x) = 2x       (slope depends on x)
f(x) = x³       →  f'(x) = 3x²
f(x) = xⁿ       →  f'(x) = n·xⁿ⁻¹  (power rule)
f(x) = eˣ       →  f'(x) = eˣ       (its own derivative)
f(x) = ln(x)    →  f'(x) = 1/x
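Each row of this table can be spot-checked numerically. A sketch using a central difference (the helper `approx_deriv` is hypothetical, just for this check, not part of the note's examples):

```python
import math

def approx_deriv(f, x, dx=1e-6):
    # Central difference: more accurate than the one-sided version
    return (f(x + dx) - f(x - dx)) / (2 * dx)

x = 2.0
assert abs(approx_deriv(lambda t: t**3, x) - 3 * x**2) < 1e-4   # power rule: (x³)' = 3x²
assert abs(approx_deriv(math.exp, x) - math.exp(x)) < 1e-4      # eˣ is its own derivative
assert abs(approx_deriv(math.log, x) - 1 / x) < 1e-4            # ln'(x) = 1/x
```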

Notation: f'(x), df/dx, and d/dx f(x) all mean the same thing.


Key Properties / Rules

Rule               Formula                When to use
Power rule         (xⁿ)' = n·xⁿ⁻¹         Any polynomial
Sum rule           (f+g)' = f' + g'       Adding functions
Constant multiple  (c·f)' = c·f'          Scaling
Product rule       (f·g)' = f'g + fg'     Multiplied functions
Chain rule         see Chain Rule note    Composed functions
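As a quick sanity check of the product rule: for f(x) = x² and g(x) = x³, the product is x⁵, so f'g + fg' should agree with the direct derivative 5x⁴:

```python
# Product rule check: (f·g)' = f'g + fg' with f(x) = x², g(x) = x³
x = 2.0
lhs = 5 * x**4                             # derivative of x⁵ computed directly
rhs = (2 * x) * x**3 + x**2 * (3 * x**2)   # f'g + fg' = 2x·x³ + x²·3x²
print(lhs, rhs)                            # 80.0 80.0 — same thing
```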

Why It Matters

Training a neural network means adjusting weights to minimize the loss. To do that, you need to know: "if I increase this weight by a tiny amount, does the loss go up or down, and by how much?"

That's the derivative of the loss with respect to the weight: ∂L/∂w. If ∂L/∂w > 0, the loss increases when w increases — so you decrease w. If it's negative, increase w. The derivative tells you exactly which direction to move.

Without derivatives, training is blind.
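That update rule is one line of code. A sketch on a toy loss L(w) = (w − 4)², whose minimum sits at w = 4 (the learning rate here is a made-up value, not from the note):

```python
# One gradient step on L(w) = (w - 4)²: the sign of dL/dw picks the direction
w = 10.0
lr = 0.1                 # hypothetical learning rate
grad = 2 * (w - 4)       # dL/dw = 2(w - 4) = 12 > 0, so w should decrease
w = w - lr * grad        # w moves from 10.0 to 8.8, toward the minimum at 4
print(w)
```

Repeating this step drives w toward 4; a negative gradient would have pushed w up instead.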


Common Pitfalls

  • Confusing f(x) with f'(x). f(x) = x² is the function. f'(x) = 2x is its rate of change. Different objects.
  • f'(x) = 0 doesn't always mean minimum. It could be a maximum, or a saddle point. A flat slope means "critical point" — could go either way.
  • Derivative is a local concept. f'(x) is the slope at x, not the slope everywhere. The parabola has a different slope at every point.

Examples

# Finite difference approximation of derivative (conceptual)
def numerical_derivative(f, x, dx=1e-5):
    return (f(x + dx) - f(x)) / dx

f = lambda x: x**2
print(numerical_derivative(f, x=3))   # ≈ 6.0  (matches f'(3) = 2·3 = 6)
print(numerical_derivative(f, x=0))   # ≈ 0.0  (flat at the bottom of the parabola)
print(numerical_derivative(f, x=-2))  # ≈ -4.0 (negative slope on left side)

# PyTorch computes exact derivatives for you
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2
y.backward()          # compute dy/dx
print(x.grad)         # tensor(6.0) — exactly 2x at x=3

Reading the slope:

  • f'(x) > 0 → function is increasing at x (moving right goes up)
  • f'(x) < 0 → function is decreasing at x (moving right goes down)
  • f'(x) = 0 → flat at x (critical point — could be min, max, or saddle)

Review Questions