Optimization
Convex vs Non-Convex Functions
For a convex function, every local minimum is a global minimum, so gradient descent with a suitable step size reliably converges to an optimal point. Neural network loss surfaces are non-convex, and understanding why this is usually fine in practice is key to trusting that deep learning works.
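A minimal one-dimensional sketch of the difference, using hypothetical example functions: on a convex quadratic, gradient descent reaches the unique minimum from any starting point, while on a non-convex quartic the basin it settles into depends on the initialization.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Convex case: f(x) = (x - 3)^2 has one minimum, at x = 3;
# gradient descent finds it from any starting point.
convex_grad = lambda x: 2 * (x - 3)
print(gradient_descent(convex_grad, x0=-10.0))          # ~3.0

# Non-convex case: f(x) = x^4 - 3x^2 + x has two local minima;
# which one we end up in depends on where we start.
nonconvex_grad = lambda x: 4 * x**3 - 6 * x + 1
print(gradient_descent(nonconvex_grad, x0=-2.0, lr=0.01))  # left basin, ~-1.30
print(gradient_descent(nonconvex_grad, x0=2.0, lr=0.01))   # right basin, ~1.13
```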
Gradient Descent Variants
SGD, mini-batch updates, momentum, RMSProp, and Adam: what each variant fixes about its predecessor, and why Adam became the default optimizer for deep learning. A sketch of the update rules follows below.
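A minimal NumPy sketch of the core update rules, with hyperparameter values chosen only for illustration: momentum smooths the gradient direction, RMSProp adapts the step size per parameter, and Adam combines both ideas with bias correction.

```python
import numpy as np

# One update step per optimizer, acting on a parameter vector `w` given
# gradient `g`. Optimizer state (velocity, moment estimates) is carried
# by the caller and passed back in on the next step.

def sgd_step(w, g, lr=0.01):
    """Vanilla SGD: step in the negative gradient direction."""
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    """Momentum: accumulate an exponentially decayed sum of past gradients,
    damping oscillations and speeding up consistent directions."""
    v = beta * v + g
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: scale each step by the root-mean-square of recent gradients,
    so parameters with consistently large gradients take smaller steps."""
    s = beta * s + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum on the gradient (m) plus per-parameter scaling by the
    RMS of recent gradients (v), with bias correction for early steps."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Example: one Adam step on a toy problem, minimizing ||w||^2.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
g = 2 * w                               # gradient of ||w||^2
w, m, v = adam_step(w, g, m, v, t=1)
```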
Learning Rate Effects
The learning rate controls how big each optimization step is. Too high and training explodes; too low and it stalls. Schedules, warmup, and the LR finder are the tools practitioners use to get it right.
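A minimal sketch of one common schedule shape, linear warmup followed by cosine decay; the base rate and step counts here are placeholder values to tune per model.

```python
import math

def lr_schedule(step, base_lr=3e-4, warmup_steps=1_000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Ramp up from ~0 to base_lr over the warmup period.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# The schedule would be queried once per training step, e.g.:
# for step in range(total_steps):
#     lr = lr_schedule(step)
#     ... apply optimizer update with this lr ...
```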