Optimization
Convex vs Non-Convex Functions
For a convex function, every local minimum is a global minimum, so gradient descent with a suitable step size reliably converges to an optimal point. Neural network loss surfaces are non-convex, and understanding why this is usually fine in practice is key to trusting that deep learning works.
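A minimal one-dimensional sketch of the difference, using hypothetical example functions: on a convex quadratic, gradient descent reaches the unique minimum from any starting point, while on a non-convex quartic the basin it settles into depends on the initialization.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Convex case: f(x) = (x - 3)^2 has one minimum, at x = 3;
# gradient descent finds it from any starting point.
convex_grad = lambda x: 2 * (x - 3)
print(gradient_descent(convex_grad, x0=-10.0))          # ~3.0

# Non-convex case: f(x) = x^4 - 3x^2 + x has two local minima;
# which one we end up in depends on where we start.
nonconvex_grad = lambda x: 4 * x**3 - 6 * x + 1
print(gradient_descent(nonconvex_grad, x0=-2.0, lr=0.01))  # left basin, ~-1.30
print(gradient_descent(nonconvex_grad, x0=2.0, lr=0.01))   # right basin, ~1.13
```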
Gradient Descent Variants
SGD, mini-batch updates, momentum, RMSProp, and Adam: what each variant fixes about its predecessor, and why Adam became the default optimizer for deep learning. A sketch of the update rules follows below.
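A minimal NumPy sketch of the core update rules, with hyperparameter values chosen only for illustration: momentum smooths the gradient direction, RMSProp adapts the step size per parameter, and Adam combines both ideas with bias correction.

```python
import numpy as np

# One update step per optimizer, acting on a parameter vector `w` given
# gradient `g`. Optimizer state (velocity, moment estimates) is carried
# by the caller and passed back in on the next step.

def sgd_step(w, g, lr=0.01):
    """Vanilla SGD: step in the negative gradient direction."""
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    """Momentum: accumulate an exponentially decayed sum of past gradients,
    damping oscillations and speeding up consistent directions."""
    v = beta * v + g
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: scale each step by the root-mean-square of recent gradients,
    so parameters with consistently large gradients take smaller steps."""
    s = beta * s + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum on the gradient (m) plus per-parameter scaling by the
    RMS of recent gradients (v), with bias correction for early steps."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Example: one Adam step on a toy problem, minimizing ||w||^2.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
g = 2 * w                               # gradient of ||w||^2
w, m, v = adam_step(w, g, m, v, t=1)
```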
Learning Rate Effects
The learning rate controls how big each optimization step is. Too high and training explodes; too low and it stalls. Schedules, warmup, and the LR finder are the tools practitioners use to get it right.
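A minimal sketch of one common schedule shape, linear warmup followed by cosine decay; the base rate and step counts here are placeholder values to tune per model.

```python
import math

def lr_schedule(step, base_lr=3e-4, warmup_steps=1_000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Ramp up from ~0 to base_lr over the warmup period.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# The schedule would be queried once per training step, e.g.:
# for step in range(total_steps):
#     lr = lr_schedule(step)
#     ... apply optimizer update with this lr ...
```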