Convex vs Non-Convex Functions
Convex functions have a single global minimum, and gradient descent (with a suitable learning rate) is guaranteed to find it. Neural network loss surfaces are non-convex, but understanding why this is actually okay is key to trusting that deep learning works.
Intuition First
Imagine two landscapes:
Convex: A single smooth bowl. No matter where you start, rolling downhill always leads to the same lowest point — the global minimum. Gradient descent on this landscape is guaranteed to work.
Non-convex: A mountain range with many valleys, ridges, and flat plateaus. Rolling downhill might land you in a small local valley, not the deepest one. And sometimes you might hit a saddle — flat in one direction, curved in another — where you can get temporarily stuck.
Neural network loss surfaces are non-convex. But the bad news is less catastrophic than it sounds.
What's Actually Happening
The shape of a function determines how hard it is to minimize:
- Convex functions: one minimum, gradient always points toward it
- Non-convex functions: many critical points, gradient can mislead you locally
The formal definition: a function f is convex if the line segment between any two points on its graph lies on or above the graph. Visually, it curves upward everywhere (like a bowl).
For neural networks, the loss L(w) has millions of weight dimensions. The "landscape" is a surface in this high-dimensional space — impossible to visualize, but the math still applies.
Build the Idea Step-by-Step
Formal Explanation
Convex function:
f is convex if: f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)
for all x, y and λ ∈ [0, 1]
The function value at any interpolated point is at most the interpolation of the endpoint values: the chord connecting two points on the graph lies on or above the graph.
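To make the definition concrete, here is a quick numerical check of the inequality on sampled point pairs (a minimal sketch; the helper name, the test functions, and the sample grid are arbitrary illustration choices, and passing the check is only evidence of convexity, not a proof):

```python
import numpy as np

def is_convex_on_samples(f, xs, n_lambdas=25):
    """Check f(λx + (1-λ)y) <= λf(x) + (1-λ)f(y) on sampled pairs."""
    lambdas = np.linspace(0, 1, n_lambdas)
    for x in xs:
        for y in xs:
            for lam in lambdas:
                lhs = f(lam*x + (1 - lam)*y)
                rhs = lam*f(x) + (1 - lam)*f(y)
                if lhs > rhs + 1e-9:  # small tolerance for float error
                    return False
    return True

xs = np.linspace(-3, 3, 13)
print(is_convex_on_samples(lambda x: x**2, xs))           # True  (convex)
print(is_convex_on_samples(lambda x: x**4 - 4*x**2, xs))  # False (non-convex)
```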
Detecting convexity (1D, twice-differentiable f):
f is convex iff f''(x) ≥ 0 for all x
f is concave iff f''(x) ≤ 0 for all x
Detecting convexity (multi-dimensional):
The Hessian matrix H = ∇²L(w) must be positive semi-definite (all eigenvalues ≥ 0) at every point w.
If any eigenvalue is negative, the function curves downward in some direction — non-convex.
Types of critical points (where ∇L = 0):
| Type | Description | Hessian |
|---|---|---|
| Local minimum | All directions go up locally | All eigenvalues > 0 |
| Local maximum | All directions go down locally | All eigenvalues < 0 |
| Saddle point | Some directions up, some down | Mixed eigenvalues |
| Global minimum | Lowest point in entire space | Same as a local minimum (curvature alone can't tell you it's global) |
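The table translates directly into code. A minimal sketch of classifying a critical point from its Hessian eigenvalues (the `classify_critical_point` helper and the example Hessians are illustrative, not from any library):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eigs = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix
    if np.all(eigs > tol):
        return "local minimum"
    if np.all(eigs < -tol):
        return "local maximum"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"
    return "degenerate (some eigenvalues ≈ 0; test inconclusive)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))    # f = x^2 + y^2 → local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # f = x^2 - y^2 → saddle point
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -2.0]])))  # f = -(x^2 + y^2) → local maximum
```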
Key Properties / Rules
| Property | Convex | Non-Convex |
|---|---|---|
| Number of local minima | 1 (= global) | Many |
| Saddle points | None | Common in high dimensions |
| GD guarantee | Converges to global minimum (with suitable step size) | No guarantee |
| In practice (deep nets) | — | Local minima often similar quality |
Why deep learning works despite non-convexity:
In very high dimensions (millions of weights), local minima that are much worse than the global minimum are rare. Critical points with high loss values are overwhelmingly saddle points rather than local minima, while the local minima that do exist tend to have loss close to the global minimum. Mini-batch noise also helps gradient descent escape saddle points: the gradient computed from a random sample often has a downhill component even when the full-data gradient is near zero.
Research finding (Dauphin et al., 2014; Goodfellow et al., 2015): most "local minima" found by gradient descent in deep nets have loss values very close to one another. Getting stuck in a genuinely bad local minimum is unusual in practice.
Why It Matters
Understanding convexity shapes how you reason about training:
- Logistic regression, linear regression → convex loss → gradient descent (with a suitable learning rate) finds the global minimum (see the sketch after this list)
- Neural networks → non-convex loss → training might get stuck, but usually finds good solutions
- Convex sub-problems → batch normalization and layer normalization are partly motivated by making the loss landscape "smoother" (more nearly convex locally)
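To see the first bullet concretely: for linear regression with mean-squared error, L(w) = ||Xw − y||²/n, the Hessian is 2XᵀX/n, which is positive semi-definite for any data matrix X (since vᵀXᵀXv = ||Xv||² ≥ 0), so the loss is convex in w. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # arbitrary random design matrix
H = 2 * X.T @ X / len(X)        # Hessian of the MSE loss (independent of w)
eigs = np.linalg.eigvalsh(H)
print(f"smallest Hessian eigenvalue: {eigs.min():.4f}")  # >= 0 → convex loss
```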
Saddle points are the real enemy, not local minima:
Saddle points have ∇L ≈ 0 (the gradient is tiny) even though the loss is not at a good minimum, so training plateaus. The fixes (momentum is sketched below):
- Momentum: accumulated velocity carries the optimizer through the flat region
- SGD noise: random mini-batch gradients usually have a component along an escape direction even when the full gradient is near zero
- Larger LR: moves through the flat region faster (at some cost to stability elsewhere)
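Here is a minimal sketch of the momentum fix on the saddle function f(x, y) = x² − y² (the same function used in the Examples section below); the starting point, learning rate, and escape threshold are arbitrary illustration values:

```python
import numpy as np

def grad(w):  # saddle: f(x, y) = x^2 - y^2, gradient [2x, -2y]
    return np.array([2*w[0], -2*w[1]])

def steps_to_escape(momentum):
    w = np.array([0.5, 1e-8])  # almost exactly on the saddle ridge in y
    v = np.zeros(2)
    for step in range(1, 10_000):
        v = momentum * v - 0.1 * grad(w)  # accumulate velocity
        w = w + v
        if abs(w[1]) > 1.0:               # escaped the flat region in y
            return step
    return None

print("plain GD:     ", steps_to_escape(0.0))  # ≈ 100 steps
print("with momentum:", steps_to_escape(0.9))  # roughly half as many
```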
Common Pitfalls
- Assuming the model is "converged" because the loss plateaued. It might be on a saddle point. Try increasing the LR temporarily, changing the optimizer, or adding noise.
- Confusing local minimum with global minimum for non-convex problems. You'll almost never find the global minimum of a deep neural net — and that's fine.
- Thinking convex = easy to train. Convex is theoretically easier, but poorly scaled features or a very high condition number can still make convex optimization slow in practice (see the sketch after this list).
- Weight initialization matters because of non-convexity. Different starting points (different random seeds) can lead to different local minima, which is why experiments are often repeated across several random seeds.
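To illustrate the third pitfall, a minimal sketch of gradient descent on a perfectly convex but badly conditioned quadratic (the condition number 1000 and the tolerance are arbitrary illustration choices):

```python
import numpy as np

# Convex quadratic f(x, y) = x^2 + 1000*y^2 (condition number 1000).
# The steep y-direction caps the stable learning rate at lr < 2/2000 = 0.001,
# so progress along the shallow x-direction is painfully slow.
def grad(w):
    return np.array([2*w[0], 2000*w[1]])

w = np.array([1.0, 1.0])
lr = 0.0009  # just under the stability limit
steps = 0
while abs(w[0]) > 1e-3:
    w = w - lr * grad(w)
    steps += 1
print(f"steps until |x| < 1e-3: {steps}")  # ≈ 3800 steps for a 2-D convex problem
```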
Examples
```python
import numpy as np

# Convex function: f(x) = x^2 (bowl shape)
# Second derivative: 2 > 0 everywhere → strictly convex
def f_convex(x): return x**2
def grad_convex(x): return 2*x

# Gradient descent finds x* = 0 regardless of the starting point
for x0 in [-10, -5, 0, 5, 10]:
    x = x0
    for _ in range(100):
        x -= 0.1 * grad_convex(x)
    print(f"start={x0:4.0f} → converged to x={x:.6f}")  # always 0.0
```
```python
# Non-convex function: f(x) = x^4 - 4x^2 (two local minima)
# f'(x) = 4x^3 - 8x = 4x(x^2 - 2) = 0 → x = 0, ±√2
# Local minima at x = ±√2 ≈ ±1.414, local max at x = 0
def f_nonconvex(x): return x**4 - 4*x**2
def grad_nonconvex(x): return 4*x**3 - 8*x

for x0 in [-3.0, -0.5, 0.5, 3.0]:
    x = x0
    for _ in range(1000):
        x -= 0.01 * grad_nonconvex(x)
    print(f"start={x0:+.1f} → converged to x={x:.4f}")

# Negative start → converges to -1.4142
# Positive start → converges to +1.4142
# Which minimum you reach depends entirely on where you start; exactly at
# x = 0 (the local max) the gradient is zero and plain GD never moves
```
```python
import numpy as np

# Saddle point example: f(x, y) = x^2 - y^2
# Critical point at (0, 0): minimum along x, maximum along y
# ∇f = [2x, -2y] = [0, 0] at the origin, but the origin is neither min nor max
def f_saddle(w):
    x, y = w
    return x**2 - y**2

def grad_saddle(w):
    x, y = w
    return np.array([2*x, -2*y])

# Starting near (but not exactly at) the saddle point
w = np.array([0.001, 0.001])  # tiny perturbation from (0, 0)
for step in range(50):
    g = grad_saddle(w)
    w = w - 0.1 * g
    print(f"step {step:2d}: w={w}, f={f_saddle(w):.4f}")

# x shrinks toward 0 while |y| grows exponentially: GD slides off the saddle
# along y and f decreases without bound (classic saddle behavior). Started
# exactly at (0, 0), the gradient would be zero and GD would never move;
# mini-batch noise supplies the kick that breaks that tie (sketched next)
```
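Making that last comment concrete: a minimal sketch of "mini-batch style" noise, modeled here as a small Gaussian kick added to each gradient (the noise scale 0.01 is an arbitrary illustration, not an actual mini-batch estimate; `f_saddle` and `grad_saddle` are reused from the block above):

```python
# Start EXACTLY at the saddle: the true gradient is zero, so plain GD
# would be frozen here forever. A noisy gradient breaks the tie quickly.
rng = np.random.default_rng(42)
w = np.array([0.0, 0.0])
for step in range(50):
    g = grad_saddle(w) + rng.normal(scale=0.01, size=2)  # noisy gradient
    w = w - 0.1 * g
print(f"after 50 noisy steps: w={w}, f={f_saddle(w):.4f}")
# |y| has grown and f < 0: the noise found the escape direction on its own
```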
```python
import numpy as np

# Checking convexity numerically: second-derivative test
# If f''(x) >= 0 everywhere, f is convex (in 1D, the "Hessian" is just f'')
def numerical_hessian_1d(f, x, h=1e-5):
    return (f(x + h) - 2*f(x) + f(x - h)) / h**2

# Convex: x^2
xs = np.linspace(-5, 5, 100)
hessians = [numerical_hessian_1d(lambda x: x**2, x) for x in xs]
print(f"x^2 hessian range: [{min(hessians):.2f}, {max(hessians):.2f}]")  # all ≈ 2.0 > 0 ✓

# Non-convex: x^4 - 4x^2
hessians_nc = [numerical_hessian_1d(lambda x: x**4 - 4*x**2, x) for x in xs]
print(f"x^4-4x^2 hessian range: [{min(hessians_nc):.2f}, {max(hessians_nc):.2f}]")  # negative near x = 0 ✗
```