Mnemosyne

Optimization Basics

Optimization means finding the input that minimizes (or maximizes) a function. In ML, the function is the loss and the inputs are the weights — and the gradient is the compass that guides the search.

Intuition First

You're lost in a hilly landscape at night and want to reach the lowest valley. You can't see ahead, but you can feel the slope under your feet.

Your strategy: always step in whichever direction slopes downward. Keep stepping until the ground is flat in all directions — you've found a low point.

This is optimization. The "landscape" is the loss function. The position you stand on is your current weight values. The valley is the minimum. The slope under your feet is the gradient.
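The night-hike strategy above is literally a loop: feel the slope, step downhill, repeat. A minimal sketch on a toy one-dimensional "landscape" (the bowl f(w) = w**2 and the step size are assumptions for illustration):

```python
def grad(w):
    # Slope of the toy landscape f(w) = w**2 under our feet
    return 2 * w

w = 5.0      # starting position on the hillside
step = 0.1   # how far each step moves

for _ in range(100):
    w = w - step * grad(w)   # always step in the downhill direction

print(round(w, 4))  # close to 0.0, the bottom of the valley
```

Each iteration shrinks w toward the flat point at 0, where the slope vanishes and the steps stop making progress.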


What's Actually Happening

Minimizing a function means finding where its output is smallest. The key insight: at a minimum, the function is flat — the slope is zero in every direction. So gradient = 0 is a necessary condition for a minimum.

But "gradient = 0" is also true at maxima and at saddle points (which curve up in some directions and down in others). The gradient alone doesn't tell you which kind of critical point you've found — just that you've found some kind of flat spot.
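A quick numeric check of this ambiguity (the three functions and the finite-difference helper are illustrative assumptions): all three have zero slope at x = 0, even though one has a minimum there, one a maximum, and one neither.

```python
def numerical_derivative(f, x, h=1e-6):
    # Central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

fns = {
    "x**2  (minimum at 0)":     lambda x: x**2,
    "-x**2 (maximum at 0)":     lambda x: -x**2,
    "x**3  (inflection at 0)":  lambda x: x**3,
}

for name, f in fns.items():
    print(name, "-> slope at 0 =", round(numerical_derivative(f, 0.0), 8))
```

All three print a slope of (essentially) zero, which is exactly why a zero gradient alone cannot classify the critical point.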


Build the Idea Step-by-Step

Loss landscape L(w)
Gradient ∇L: slope at current w
∇L = 0: critical point found
Types: local min, local max, saddle
Neural nets: saddle points are common
Gradient descent finds good local solutions

Formal Explanation

Critical points are where the derivative (or gradient) equals zero.

For single-variable f(x):

f'(x*) = 0   →  x* is a critical point
f''(x*) > 0  →  local minimum (concave up)
f''(x*) < 0  →  local maximum (concave down)
f''(x*) = 0  →  inconclusive (could be a min, max, or inflection point; higher-order test needed)

For multi-variable L(w) (the real ML case):

∇L(w*) = 0   →  critical point

In high dimensions, checking whether it's a min/max/saddle requires the Hessian matrix (matrix of second derivatives). In practice, neural network training doesn't check — it just runs gradient descent and relies on the loss decreasing.
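A sketch of what that Hessian check looks like in 2D (the toy loss L(w) = w0**2 - w1**2, chosen to have a saddle at the origin, and the finite-difference Hessian are assumptions): the signs of the Hessian's eigenvalues classify the critical point — all positive means a minimum, all negative a maximum, mixed signs a saddle.

```python
import numpy as np

def L(w):
    # Toy loss with a saddle point at w = (0, 0)
    return w[0]**2 - w[1]**2

def hessian(f, w, h=1e-4):
    # Numerical Hessian (matrix of second derivatives) via central differences
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            wpp = w.copy(); wpp[i] += h; wpp[j] += h
            wpm = w.copy(); wpm[i] += h; wpm[j] -= h
            wmp = w.copy(); wmp[i] -= h; wmp[j] += h
            wmm = w.copy(); wmm[i] -= h; wmm[j] -= h
            H[i, j] = (f(wpp) - f(wpm) - f(wmp) + f(wmm)) / (4 * h**2)
    return H

H = hessian(L, np.array([0.0, 0.0]))
eig = np.linalg.eigvalsh(H)
print(eig)  # one negative and one positive eigenvalue → saddle point
```

In millions of dimensions this check is far too expensive, which is why training loops never perform it.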

Global vs local minimum:

  • Global minimum: the absolute lowest point on the entire landscape
  • Local minimum: a low point where all nearby directions go up, but lower points may exist elsewhere
  • Neural networks typically find local minima — and that's usually fine

Key Properties / Rules

Concept           Meaning
∇L = 0            Critical point — candidate for minimum
f''(x) > 0        Local minimum (in 1D)
f''(x) < 0        Local maximum (in 1D)
Local minimum     Lowest nearby, but not necessarily globally
Global minimum    Absolute lowest — usually unreachable in deep learning
Saddle point      Curves up in some directions, down in others
Loss landscape    The surface traced by L(w) as w varies

Why It Matters

Training = optimization. Every epoch of gradient descent is an attempt to move w toward a point where ∇L(w) ≈ 0 (or at least smaller than before).

The loss landscape of neural networks:

  • Is extremely high-dimensional (millions of weight dimensions)
  • Has many local minima — but they often have similar loss values
  • Has many saddle points — gradient descent usually escapes them due to noise in mini-batch gradients
  • Does NOT have a single global minimum that training can find exactly
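The noise-driven escape from saddle points can be demonstrated on a tiny example (the saddle L(x, y) = x**2 - y**2, the noise scale, and the learning rate are assumptions for illustration): exact gradient descent started on the ridge y = 0 walks straight into the saddle and stays, while noisy gradients — mimicking mini-batch sampling — push the iterate off the ridge and away.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w):
    # Gradient of L(x, y) = x**2 - y**2, which has a saddle at the origin
    return np.array([2 * w[0], -2 * w[1]])

def descend(noisy, steps=100, lr=0.05):
    w = np.array([1.0, 0.0])   # start exactly on the ridge y = 0
    for _ in range(steps):
        g = grad(w)
        if noisy:
            g = g + rng.normal(scale=0.1, size=2)  # mini-batch-style noise
        w = w - lr * g
    return w

print("exact gradient:", descend(noisy=False))  # y stays exactly 0: stuck at the saddle
print("noisy gradient:", descend(noisy=True))   # |y| grows: the iterate escapes
```

The exact-gradient run converges to the saddle because nothing ever perturbs it off the unstable direction; any noise at all is amplified along -y**2 and carries the iterate away.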

Practical implication: don't chase the "optimal" solution. Chase a solution that generalizes well to new data. Stopping early, regularization, and data augmentation are all strategies that accept a worse training loss in exchange for better generalization.


Common Pitfalls

  • Assuming local minimum = global minimum. In neural networks, they almost never coincide. The good news: many local minima generalize similarly well, so finding the global minimum isn't necessary.
  • Gradient = 0 at the starting point is a red flag. If weights are all initialized to zero, all gradients are zero and nothing learns. Symmetry breaking (random initialization) is essential.
  • Saddle points look like convergence. The loss may plateau because the gradient is small near a saddle point, not because you've found a minimum. More training (or a different learning rate) often escapes these.
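The zero-initialization pitfall above can be verified in a few lines (the tiny two-layer tanh network and the random data are illustrative assumptions): with all-zero weights, every gradient in both layers is exactly zero, so no update ever happens.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))   # toy inputs
y = rng.normal(size=(8, 1))   # toy targets

W1 = np.zeros((3, 4))         # all-zero initialization (the pitfall)
W2 = np.zeros((4, 1))

# Forward pass: tanh hidden layer, linear output
h = np.tanh(X @ W1)           # tanh(0) = 0, so h is all zeros
pred = h @ W2
err = pred - y

# Backward pass (gradients of mean squared error)
gW2 = h.T @ err / len(X)              # zero, because h is zero
gh = (err @ W2.T) * (1 - h**2)        # zero, because W2 is zero
gW1 = X.T @ gh / len(X)               # zero as well

print(np.abs(gW1).max(), np.abs(gW2).max())  # both 0.0: nothing can learn
```

Random initialization breaks this symmetry: with distinct nonzero weights, hidden units receive distinct gradients and can specialize.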

Examples

# Simple 1D optimization: find the minimum of f(x) = x^2 - 4x + 5
# Analytically: f'(x) = 2x - 4 = 0  →  x* = 2, f(2) = 1

def f(x):
    return x**2 - 4*x + 5

def f_prime(x):
    return 2*x - 4

print(f"f'(2) = {f_prime(2)}")   # 0 — confirmed critical point
print(f"f(2)  = {f(2)}")         # 1 — minimum value
print(f"f(0)  = {f(0)}")         # 5 — higher than minimum
print(f"f(4)  = {f(4)}")         # 5 — symmetric, also higher

# Verify it's a minimum: f''(x) = 2 > 0 → concave up → minimum ✓
# Multiple critical points: f(x) = x^4 - 4x^2 + 1 (two local minima, one local max)
# f'(x) = 4x^3 - 8x = 4x(x^2 - 2) = 0  →  x = 0, x = ±√2
# f''(x) = 12x^2 - 8
# f''(0) = -8 < 0  →  LOCAL MAX at x=0
# f''(±√2) = 24-8 = 16 > 0  →  LOCAL MINIMA at x=±√2

# In ML: the loss landscape has many such local minima
# Gradient descent may find either one, depending on initialization
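That initialization-dependence can be run directly on the quartic above (the learning rate, step count, and starting points are illustrative choices): gradient descent lands in a different local minimum depending on where it starts.

```python
def f_prime(x):
    # Derivative of f(x) = x^4 - 4x^2 + 1
    return 4 * x**3 - 8 * x

def gradient_descent(x, lr=0.01, steps=500):
    for _ in range(steps):
        x = x - lr * f_prime(x)   # step downhill
    return x

print(gradient_descent(-1.0))  # ≈ -1.4142: converges to the minimum at -√2
print(gradient_descent(+1.0))  # ≈ +1.4142: converges to the minimum at +√2
```

Both runs satisfy f'(x) ≈ 0 at the end; neither is "wrong", they simply found different valleys.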

The optimization goal in ML:

Not: find w* = argmin L(w) exactly
But: find w such that L(w) is small AND generalizes to new data

This is why early stopping, regularization, and validation sets exist — they prevent over-optimizing the training loss.
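Early stopping is the simplest of those strategies to sketch in code. The loop below is a minimal illustration, not a real training setup: the quadratic "training loss", the synthetic validation loss that eventually degrades (imitating overfitting), and the patience value are all assumptions.

```python
def train_step(w, lr=0.1):
    # One gradient step on a toy training loss L(w) = w**2
    return w - lr * 2 * w

def val_loss(w, step):
    # Synthetic validation loss: improves at first, then slowly degrades
    return w**2 + 0.01 * step

w, best_w, best_val = 5.0, 5.0, float("inf")
patience, bad_steps = 3, 0

for step in range(200):
    w = train_step(w)
    v = val_loss(w, step)
    if v < best_val:
        best_val, best_w, bad_steps = v, w, 0   # new best: remember these weights
    else:
        bad_steps += 1
        if bad_steps >= patience:
            break  # validation loss hasn't improved for `patience` steps: stop

print("stopped at step", step, "keeping weight", round(best_w, 4))
```

Note that training loss would keep falling past the stopping point; the loop deliberately accepts that in exchange for the best validation score.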

Review Questions