Optimization Basics
Optimization means finding the input that minimizes (or maximizes) a function. In ML, the function is the loss and the inputs are the weights — and the gradient is the compass that guides the search.
Intuition First
You're lost in a hilly landscape at night and want to reach the lowest valley. You can't see ahead, but you can feel the slope under your feet.
Your strategy: always step in whichever direction slopes downward. Keep stepping until the ground is flat in all directions — you've found a low point.
This is optimization. The "landscape" is the loss function. The position you stand on is your current weight values. The valley is the minimum. The slope under your feet is the gradient.
What's Actually Happening
Minimizing a function means finding where its output is smallest. The key insight: at a minimum, the function is flat — the slope is zero in every direction. So gradient = 0 is a necessary condition for a minimum.
But "gradient = 0" is also true at maxima and saddle points (which curve up in some directions and down in others). The gradient alone doesn't tell you which kind of critical point you've found — just that you've found some kind of flat spot.
Build the Idea Step-by-Step
Formal Explanation
Critical points are where the derivative (or gradient) equals zero.
For single-variable f(x):
f'(x*) = 0 → x* is a critical point
f''(x*) > 0 → local minimum (concave up)
f''(x*) < 0 → local maximum (concave down)
f''(x*) = 0 → inconclusive (could be a minimum, maximum, or inflection point)
For multi-variable L(w) (the real ML case):
∇L(w*) = 0 → critical point
In high dimensions, checking whether it's a min/max/saddle requires the Hessian matrix (matrix of second derivatives). In practice, neural network training doesn't check — it just runs gradient descent and relies on the loss decreasing.
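As a small sketch of that Hessian test (the function here is a toy example, not anything from a real network): the signs of the Hessian's eigenvalues classify a critical point.

```python
import numpy as np

# Toy example: L(w) = w0^2 - w1^2 has a critical point at the origin.
# Its Hessian is constant: [[2, 0], [0, -2]].
# Eigenvalue signs classify the point:
#   all positive -> local minimum, all negative -> local maximum,
#   mixed signs  -> saddle point.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(H)

if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = "saddle point"

print(kind)  # saddle point: one eigenvalue positive, one negative
```

For a trained network with millions of weights, forming this matrix is far too expensive, which is one reason training skips the check entirely.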
Global vs local minimum:
- Global minimum: the absolute lowest point on the entire landscape
- Local minimum: a low point where all nearby directions go up, but lower points may exist elsewhere
- Neural networks typically find local minima — and that's usually fine
Key Properties / Rules
| Concept | Meaning |
|---|---|
| ∇L = 0 | Critical point — candidate for minimum |
| f''(x) > 0 | Local minimum (in 1D) |
| f''(x) < 0 | Local maximum (in 1D) |
| Local minimum | Lowest nearby, but not necessarily globally |
| Global minimum | Absolute lowest — usually unreachable in deep learning |
| Saddle point | Curves up in one direction, down in another |
| Loss landscape | The surface traced by L(w) as w varies |
Why It Matters
Training = optimization. Every epoch of gradient descent is an attempt to move w toward a point where ∇L(w) ≈ 0 (or at least smaller than before).
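A minimal sketch of that loop, using a least-squares loss as a stand-in for a real network loss (the matrix A, vector b, and learning rate below are invented for illustration):

```python
import numpy as np

# Gradient descent on L(w) = ||A w - b||^2, a convex stand-in for a
# real loss. Each step moves w against the gradient; the gradient norm
# shrinks toward zero as w approaches the minimum.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

def grad(w):
    return 2 * A.T @ (A @ w - b)

w = np.zeros(3)
lr = 0.01
for step in range(5000):
    w -= lr * grad(w)

print(np.linalg.norm(grad(w)))  # tiny: w is near a critical point
```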
The loss landscape of neural networks:
- Is extremely high-dimensional (millions of weight dimensions)
- Has many local minima — but they often have similar loss values
- Has many saddle points — gradient descent usually escapes them due to noise in mini-batch gradients
- Does NOT have a single global minimum that training can find exactly
Practical implication: don't chase the "optimal" solution. Chase a solution that generalizes well to new data. Stopping early, regularization, and data augmentation are all strategies that accept a worse training loss in exchange for better generalization.
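Early stopping, for instance, can be sketched as a simple rule (the `early_stopping` helper and the loss values below are hypothetical, not a library API): stop once the validation loss hasn't improved for `patience` epochs, even if training loss is still falling.

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training should stop."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch       # stop: patience exhausted
    return len(val_losses) - 1     # never triggered: train to the end

# Validation loss bottoms out at epoch 2, then creeps up.
print(early_stopping([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]))  # 5
```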
Common Pitfalls
- Assuming local minimum = global minimum. In neural networks, they almost never coincide. The good news: many local minima generalize similarly well, so finding the global minimum isn't necessary.
- Gradient = 0 at the starting point is a red flag. If all weights are initialized to zero, every hidden unit computes the same thing and receives the same gradient (often exactly zero), so nothing useful is learned. Symmetry breaking via random initialization is essential.
- Saddle points look like convergence. The loss may plateau because the gradient is small near a saddle point, not because you've found a minimum. More training (or a different learning rate) often escapes these.
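The zero-initialization pitfall is easy to demonstrate with a hand-rolled two-layer tanh network (shapes and data invented for the demo):

```python
import numpy as np

# A tiny 2-layer tanh network with all-zero weights: every gradient
# computed by backprop comes out exactly zero, so no step is ever taken.
x = np.array([1.0, 2.0])      # one input sample
y = 1.0                       # target
W1 = np.zeros((3, 2))         # hidden layer: 3 units
W2 = np.zeros(3)              # output layer

h = np.tanh(W1 @ x)           # hidden activations: tanh(0) = 0
pred = W2 @ h                 # prediction: 0
err = pred - y                # gradient of 0.5 * (pred - y)^2 w.r.t. pred

grad_W2 = err * h                              # zeros, because h = 0
grad_W1 = np.outer(err * W2 * (1 - h**2), x)   # zeros, because W2 = 0

print(grad_W2.any() or grad_W1.any())  # False: nothing would update
```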
Examples
# Simple 1D optimization: find the minimum of f(x) = x^2 - 4x + 5
# Analytically: f'(x) = 2x - 4 = 0 → x* = 2, f(2) = 1
def f(x):
    return x**2 - 4*x + 5

def f_prime(x):
    return 2*x - 4

print(f"f'(2) = {f_prime(2)}")  # 0 — confirmed critical point
print(f"f(2) = {f(2)}")         # 1 — minimum value
print(f"f(0) = {f(0)}")         # 5 — higher than the minimum
print(f"f(4) = {f(4)}")         # 5 — symmetric, also higher
# Verify it's a minimum: f''(x) = 2 > 0 → concave up → minimum ✓
# Multiple critical points: f(x) = x^4 - 4x^2 + 1 (two local minima, one local max)
# f'(x) = 4x^3 - 8x = 4x(x^2 - 2) = 0 → x = 0, x = ±√2
# f''(x) = 12x^2 - 8
# f''(0) = -8 < 0 → LOCAL MAX at x=0
# f''(±√2) = 24-8 = 16 > 0 → LOCAL MINIMA at x=±√2
# In ML: the loss landscape has many such local minima
# Gradient descent may find either one, depending on initialization
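The comments above can be turned into a quick experiment: a bare-bones gradient descent on f(x) = x^4 - 4x^2 + 1, started from two different points, lands in two different valleys.

```python
def grad(x):
    return 4 * x**3 - 8 * x   # f'(x) for f(x) = x^4 - 4x^2 + 1

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Same algorithm, different starting points, different minima:
print(descend(1.0))    # ≈ +1.4142 (converges to +sqrt(2))
print(descend(-0.5))   # ≈ -1.4142 (converges to -sqrt(2))
```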
The optimization goal in ML:
Not: find w* = argmin L(w) exactly
But: find w such that L(w) is small AND generalizes to new data
This is why early stopping, regularization, and validation sets exist — they prevent over-optimizing the training loss.