Convex vs Non-Convex Functions
Convex functions have a single global minimum, and gradient descent (with a suitable learning rate) is guaranteed to find it. Neural network loss surfaces are non-convex, but understanding why this is actually okay is key to trusting that deep learning works.
Intuition First
Imagine two landscapes:
Convex: A single smooth bowl. No matter where you start, rolling downhill always leads to the same lowest point — the global minimum. Gradient descent on this landscape is guaranteed to work.
Non-convex: A mountain range with many valleys, ridges, and flat plateaus. Rolling downhill might land you in a small local valley, not the deepest one. And sometimes you might hit a saddle — flat in one direction, curved in another — where you can get temporarily stuck.
Neural network loss surfaces are non-convex. But the bad news is less catastrophic than it sounds.
What's Actually Happening
The shape of a function determines how hard it is to minimize:
- Convex functions: one minimum, gradient always points toward it
- Non-convex functions: many critical points, gradient can mislead you locally
The formal definition: a function f is convex if the line segment between any two points on its graph lies on or above the graph. Visually, it curves upward everywhere (like a bowl).
For neural networks, the loss L(w) has millions of weight dimensions. The "landscape" is a surface in this high-dimensional space — impossible to visualize, but the math still applies.
Build the Idea Step-by-Step
Formal Explanation
Convex function:
f is convex if: f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)
for all x, y and λ ∈ [0, 1]
The function value at any interpolated point is at most the interpolation of the endpoint values: the chord connecting two points on the graph lies on or above the graph.
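To make the definition concrete, here is a quick numerical check of the inequality on sampled point pairs (a minimal sketch; the helper name, the test functions, and the sample grid are arbitrary illustration choices, and passing the check is only evidence of convexity, not a proof):

```python
import numpy as np

def is_convex_on_samples(f, xs, n_lambdas=25):
    """Check f(λx + (1-λ)y) <= λf(x) + (1-λ)f(y) on sampled pairs."""
    lambdas = np.linspace(0, 1, n_lambdas)
    for x in xs:
        for y in xs:
            for lam in lambdas:
                lhs = f(lam*x + (1 - lam)*y)
                rhs = lam*f(x) + (1 - lam)*f(y)
                if lhs > rhs + 1e-9:  # small tolerance for float error
                    return False
    return True

xs = np.linspace(-3, 3, 13)
print(is_convex_on_samples(lambda x: x**2, xs))           # True  (convex)
print(is_convex_on_samples(lambda x: x**4 - 4*x**2, xs))  # False (non-convex)
```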
Detecting convexity (1D, twice-differentiable f):
f is convex iff f''(x) ≥ 0 for all x
f is concave iff f''(x) ≤ 0 for all x
Detecting convexity (multi-dimensional):
The Hessian matrix H = ∇²L(w) must be positive semi-definite (all eigenvalues ≥ 0) at every point w.
If any eigenvalue is negative, the function curves downward in some direction — non-convex.
Types of critical points (where ∇L = 0):
| Type | Description | Hessian |
|---|---|---|
| Local minimum | All directions go up locally | All eigenvalues > 0 |
| Local maximum | All directions go down locally | All eigenvalues < 0 |
| Saddle point | Some directions up, some down | Mixed eigenvalues |
| Global minimum | Lowest point in entire space | Same as a local minimum (curvature alone can't tell you it's global) |
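The table translates directly into code. A minimal sketch of classifying a critical point from its Hessian eigenvalues (the `classify_critical_point` helper and the example Hessians are illustrative, not from any library):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eigs = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix
    if np.all(eigs > tol):
        return "local minimum"
    if np.all(eigs < -tol):
        return "local maximum"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"
    return "degenerate (some eigenvalues ≈ 0; test inconclusive)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))    # f = x^2 + y^2 → local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # f = x^2 - y^2 → saddle point
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -2.0]])))  # f = -(x^2 + y^2) → local maximum
```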
Key Properties / Rules
| Property | Convex | Non-Convex |
|---|---|---|
| Number of local minima | 1 (= global) | Many |
| Saddle points | None | Common in high dimensions |
| GD guarantee | Converges to global minimum (with suitable step size) | No guarantee |
| In practice (deep nets) | — | Local minima often similar quality |
Why deep learning works despite non-convexity:
In very high dimensions (millions of weights), local minima that are much worse than the global minimum are rare. Critical points with high loss values are overwhelmingly saddle points rather than local minima, while the local minima that do exist tend to have loss close to the global minimum. Mini-batch noise also helps gradient descent escape saddle points: the gradient computed from a random sample often has a downhill component even when the full-data gradient is near zero.
Research finding (Dauphin et al., 2014; Goodfellow et al., 2015): most "local minima" found by gradient descent in deep nets have loss values very close to one another. Getting stuck in a genuinely bad local minimum is unusual in practice.
Why It Matters
Understanding convexity shapes how you reason about training:
- Logistic regression, linear regression → convex loss → gradient descent (with a suitable learning rate) finds the global minimum (see the sketch after this list)
- Neural networks → non-convex loss → training might get stuck, but usually finds good solutions
- Convex sub-problems → batch normalization and layer normalization are partly motivated by making the loss landscape "smoother" (more nearly convex locally)
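To see the first bullet concretely: for linear regression with mean-squared error, L(w) = ||Xw − y||²/n, the Hessian is 2XᵀX/n, which is positive semi-definite for any data matrix X (since vᵀXᵀXv = ||Xv||² ≥ 0), so the loss is convex in w. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # arbitrary random design matrix
H = 2 * X.T @ X / len(X)        # Hessian of the MSE loss (independent of w)
eigs = np.linalg.eigvalsh(H)
print(f"smallest Hessian eigenvalue: {eigs.min():.4f}")  # >= 0 → convex loss
```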
Saddle points are the real enemy, not local minima:
Saddle points have ∇L ≈ 0 (the gradient is tiny) even though the loss is not at a good minimum, so training plateaus. The fixes (momentum is sketched below):
- Momentum: accumulated velocity carries the optimizer through the flat region
- SGD noise: random mini-batch gradients usually have a component along an escape direction even when the full gradient is near zero
- Larger LR: moves through the flat region faster (at some cost to stability elsewhere)
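Here is a minimal sketch of the momentum fix on the saddle function f(x, y) = x² − y² (the same function used in the Examples section below); the starting point, learning rate, and escape threshold are arbitrary illustration values:

```python
import numpy as np

def grad(w):  # saddle: f(x, y) = x^2 - y^2, gradient [2x, -2y]
    return np.array([2*w[0], -2*w[1]])

def steps_to_escape(momentum):
    w = np.array([0.5, 1e-8])  # almost exactly on the saddle ridge in y
    v = np.zeros(2)
    for step in range(1, 10_000):
        v = momentum * v - 0.1 * grad(w)  # accumulate velocity
        w = w + v
        if abs(w[1]) > 1.0:               # escaped the flat region in y
            return step
    return None

print("plain GD:     ", steps_to_escape(0.0))  # ≈ 100 steps
print("with momentum:", steps_to_escape(0.9))  # roughly half as many
```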
Common Pitfalls
- Assuming the model is "converged" because the loss plateaued. It might be on a saddle point. Try increasing the LR temporarily, changing the optimizer, or adding noise.
- Confusing local minimum with global minimum for non-convex problems. You'll almost never find the global minimum of a deep neural net — and that's fine.
- Thinking convex = easy to train. Convex is theoretically easier, but poorly scaled features or a very high condition number can still make convex optimization slow in practice (see the sketch after this list).
- Weight initialization matters because of non-convexity. Different starting points (different random seeds) can lead to different local minima, which is why experiments are often repeated across several random seeds.
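To illustrate the third pitfall, a minimal sketch of gradient descent on a perfectly convex but badly conditioned quadratic (the condition number 1000 and the tolerance are arbitrary illustration choices):

```python
import numpy as np

# Convex quadratic f(x, y) = x^2 + 1000*y^2 (condition number 1000).
# The steep y-direction caps the stable learning rate at lr < 2/2000 = 0.001,
# so progress along the shallow x-direction is painfully slow.
def grad(w):
    return np.array([2*w[0], 2000*w[1]])

w = np.array([1.0, 1.0])
lr = 0.0009  # just under the stability limit
steps = 0
while abs(w[0]) > 1e-3:
    w = w - lr * grad(w)
    steps += 1
print(f"steps until |x| < 1e-3: {steps}")  # ≈ 3800 steps for a 2-D convex problem
```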
Examples
```python
import numpy as np

# Convex function: f(x) = x^2 (bowl shape)
# Second derivative: 2 > 0 everywhere → strictly convex
def f_convex(x): return x**2
def grad_convex(x): return 2*x

# Gradient descent finds x* = 0 regardless of the starting point
for x0 in [-10, -5, 0, 5, 10]:
    x = x0
    for _ in range(100):
        x -= 0.1 * grad_convex(x)
    print(f"start={x0:4.0f} → converged to x={x:.6f}")  # always 0.0
```
```python
# Non-convex function: f(x) = x^4 - 4x^2 (two local minima)
# f'(x) = 4x^3 - 8x = 4x(x^2 - 2) = 0 → x = 0, ±√2
# Local minima at x = ±√2 ≈ ±1.414, local max at x = 0
def f_nonconvex(x): return x**4 - 4*x**2
def grad_nonconvex(x): return 4*x**3 - 8*x

for x0 in [-3.0, -0.5, 0.5, 3.0]:
    x = x0
    for _ in range(1000):
        x -= 0.01 * grad_nonconvex(x)
    print(f"start={x0:+.1f} → converged to x={x:.4f}")

# Negative start → converges to -1.4142
# Positive start → converges to +1.4142
# Which minimum you reach depends entirely on where you start; exactly at
# x = 0 (the local max) the gradient is zero and plain GD never moves
```
```python
import numpy as np

# Saddle point example: f(x, y) = x^2 - y^2
# Critical point at (0, 0): minimum along x, maximum along y
# ∇f = [2x, -2y] = [0, 0] at the origin, but the origin is neither min nor max
def f_saddle(w):
    x, y = w
    return x**2 - y**2

def grad_saddle(w):
    x, y = w
    return np.array([2*x, -2*y])

# Starting near (but not exactly at) the saddle point
w = np.array([0.001, 0.001])  # tiny perturbation from (0, 0)
for step in range(50):
    g = grad_saddle(w)
    w = w - 0.1 * g
    print(f"step {step:2d}: w={w}, f={f_saddle(w):.4f}")

# x shrinks toward 0 while |y| grows exponentially: GD slides off the saddle
# along y and f decreases without bound (classic saddle behavior). Started
# exactly at (0, 0), the gradient would be zero and GD would never move;
# mini-batch noise supplies the kick that breaks that tie (sketched next)
```
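Making that last comment concrete: a minimal sketch of "mini-batch style" noise, modeled here as a small Gaussian kick added to each gradient (the noise scale 0.01 is an arbitrary illustration, not an actual mini-batch estimate; `f_saddle` and `grad_saddle` are reused from the block above):

```python
# Start EXACTLY at the saddle: the true gradient is zero, so plain GD
# would be frozen here forever. A noisy gradient breaks the tie quickly.
rng = np.random.default_rng(42)
w = np.array([0.0, 0.0])
for step in range(50):
    g = grad_saddle(w) + rng.normal(scale=0.01, size=2)  # noisy gradient
    w = w - 0.1 * g
print(f"after 50 noisy steps: w={w}, f={f_saddle(w):.4f}")
# |y| has grown and f < 0: the noise found the escape direction on its own
```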
```python
import numpy as np

# Checking convexity numerically: second-derivative test
# If f''(x) >= 0 everywhere, f is convex (in 1D, the "Hessian" is just f'')
def numerical_hessian_1d(f, x, h=1e-5):
    return (f(x + h) - 2*f(x) + f(x - h)) / h**2

# Convex: x^2
xs = np.linspace(-5, 5, 100)
hessians = [numerical_hessian_1d(lambda x: x**2, x) for x in xs]
print(f"x^2 hessian range: [{min(hessians):.2f}, {max(hessians):.2f}]")  # all ≈ 2.0 > 0 ✓

# Non-convex: x^4 - 4x^2
hessians_nc = [numerical_hessian_1d(lambda x: x**4 - 4*x**2, x) for x in xs]
print(f"x^4-4x^2 hessian range: [{min(hessians_nc):.2f}, {max(hessians_nc):.2f}]")  # negative near x = 0 ✗
```