Regularization (L1 and L2)
Regularization adds a penalty for model complexity to prevent overfitting. L1 produces sparse models by zeroing out weak weights. L2 shrinks all weights smoothly toward zero. Both discourage the model from relying too heavily on any single feature.
Intuition First
Imagine you're grading a student's essay, but you also penalize them for using long, complicated sentences. They still want a good grade, so they'll only use a complex sentence when it truly helps — not just to seem smart.
Regularization does the same to weights: the model gets penalized for having large weights, so it only makes them large when they genuinely improve predictions.
The core idea: complexity has a cost. Add that cost to the loss function, and the optimizer will trade some prediction accuracy for simpler weights.
What's Actually Happening
Without regularization, a model can assign huge weights to features — even noisy or coincidental ones — because doing so perfectly fits the training data.
With regularization, the loss becomes:
Total Loss = Prediction Loss + λ × Complexity Penalty
λ (lambda) controls the regularization strength:
- λ = 0: no regularization, free to overfit
- λ very large: heavily penalized, model collapses to near-zero weights (underfits)
- Good λ: chosen via cross-validation
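Concretely, the penalty is just another term added to whatever loss you already compute; a minimal sketch with made-up numbers:

import torch

w = torch.tensor([3.0, -0.5, 0.01])               # model weights
prediction_loss = torch.tensor(0.42)               # stand-in for the MSE on a batch
lam = 0.1                                          # regularization strength λ
l2_total = prediction_loss + lam * (w ** 2).sum()  # Total Loss with L2 penalty: λ × Σ wᵢ²
l1_total = prediction_loss + lam * w.abs().sum()   # Total Loss with L1 penalty: λ × Σ |wᵢ|
print(l2_total.item(), l1_total.item())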
Formal Explanation
L2 Regularization (Ridge / Weight Decay)
Total Loss = L(y_pred, y_true) + λ × Σ wᵢ²
The gradient of the penalty with respect to weight w is 2λw. This is added to the gradient at every step, which shrinks every weight by a constant factor:
w ← w - α × (∇L + 2λw)
= w × (1 - 2αλ) - α × ∇L
The factor (1 - 2αλ) is slightly less than 1 — called weight decay. Every weight gets multiplied by this factor each step. Large and small weights both shrink proportionally. No weight reaches exactly zero, but all become small.
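You can check this algebra against a single plain-SGD step with weight_decay; a sketch with made-up numbers (no momentum, and note that PyTorch's weight_decay multiplies w once, so it stands in for 2λ in the formula above):

import torch
import torch.nn as nn
import torch.optim as optim

alpha, lam = 0.1, 0.05
w = nn.Parameter(torch.tensor([2.0, -1.0, 0.5]))
opt = optim.SGD([w], lr=alpha, weight_decay=2 * lam)

loss = (w * torch.tensor([1.0, 2.0, 3.0])).sum()   # gives ∇L = [1, 2, 3]
loss.backward()
opt.step()

# Hand-derived update: w × (1 - 2αλ) - α × ∇L
manual = torch.tensor([2.0, -1.0, 0.5]) * (1 - 2 * alpha * lam) - alpha * torch.tensor([1.0, 2.0, 3.0])
print(torch.allclose(w.detach(), manual))   # True: weight decay is just L2 folded into the update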
L1 Regularization (Lasso)
Total Loss = L(y_pred, y_true) + λ × Σ |wᵢ|
The gradient of |w| is sign(w) — a constant ±1. This subtracts a fixed amount from every weight each step, regardless of its size:
w ← w - α × (∇L + λ × sign(w))
A weight of 0.001 and a weight of 5 both get the same push toward zero. Small weights get pushed past zero and "snap to" exactly 0. This is why L1 produces sparse weights.
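In practice, solvers make weights land exactly on zero with a soft-threshold (proximal) version of this step rather than the raw sign push; a small sketch with made-up numbers:

import numpy as np

def l1_prox_step(w, alpha, lam):
    # Shrink every weight toward zero by αλ, clamping to 0 if it would cross zero
    return np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)

w = np.array([5.0, 0.001, 2.0, -3.0])
for _ in range(10):
    w = l1_prox_step(w, alpha=0.1, lam=0.1)
print(w)  # ≈ [4.9, 0.0, 1.9, -2.9]: the 0.001 weight is exactly zero after the first step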
Key Properties / Rules
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty form | Σ|wᵢ| | Σwᵢ² |
| Effect on weights | Zeros out small weights | Shrinks all weights smoothly |
| Resulting sparsity | Sparse (many zeros) | Dense (all small, few zero) |
| Use case | Feature selection, sparse models | General regularization (most common) |
| PyTorch usage | add penalty to the loss manually | weight_decay in the optimizer |
| Gradient behavior | Constant push (sign of w) | Proportional push (value of w) |
Why It Matters
L2 is the default. PyTorch's weight_decay parameter in optimizers (SGD, Adam) implements L2. It's computationally cheap and works well in almost all settings.
L1 for sparsity. If you need to know which features matter (e.g., interpretable models, reducing feature dimensionality), L1 automatically zeros out irrelevant features.
In neural networks: L2 regularization is equivalent to adding weight decay to every parameter update. It prevents individual neurons from over-specializing to specific training examples.
Elastic Net combines both: λ₁Σ|wᵢ| + λ₂Σwᵢ² — you get some sparsity with smooth shrinkage.
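A minimal sketch of what that combined penalty looks like added to a PyTorch loss (the elastic_net_penalty helper and the λ values are illustrative, not from the text):

import torch
import torch.nn as nn

def elastic_net_penalty(model, lambda_l1, lambda_l2):
    # λ₁ Σ|wᵢ| + λ₂ Σwᵢ² over weight tensors only (biases excluded)
    l1 = sum(p.abs().sum() for name, p in model.named_parameters() if "bias" not in name)
    l2 = sum(p.pow(2).sum() for name, p in model.named_parameters() if "bias" not in name)
    return lambda_l1 * l1 + lambda_l2 * l2

model = nn.Linear(10, 1)
pred, target = model(torch.randn(8, 10)), torch.randn(8, 1)
loss = nn.MSELoss()(pred, target) + elastic_net_penalty(model, lambda_l1=1e-4, lambda_l2=1e-4)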
Common Pitfalls
- Regularizing biases. Only regularize weights, not biases. Biases don't overfit the same way; they just shift the output (see the parameter-group sketch after this list).
- λ too large. The model underfits. All weights approach zero. Loss stops decreasing because the penalty dominates.
- λ too small. No meaningful regularization effect — might as well have none.
- Forgetting that L1 isn't built in. PyTorch's weight_decay gives you L2 only; L1 penalties must be added to the loss manually unless you use a library that supports them natively, and the manual loop still needs optimizer.zero_grad() every step.
- Using L1 when you don't need sparsity. L1's non-smooth gradient (sign function) makes optimization slightly harder. Use L2 by default; switch to L1 only if sparsity matters.
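To keep weight_decay off the biases (the first pitfall above), a common pattern is to put weights and biases into separate optimizer parameter groups; a hedged sketch:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if "bias" in name else decay).append(p)

optimizer = optim.Adam([
    {"params": decay, "weight_decay": 1e-4},    # L2 applied to weight matrices
    {"params": no_decay, "weight_decay": 0.0},  # biases left unregularized
], lr=1e-3)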
Examples
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(10, 1)
# L2 regularization: just set weight_decay in the optimizer
optimizer_l2 = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# weight_decay=1e-4 sets the regularization strength: Adam adds 1e-4 × w to the gradient.
# L1 regularization: must add penalty manually
def l1_penalty(model, lambda_l1):
    # Sum |w| over weight tensors only; biases are left unregularized, per the pitfall above
    return lambda_l1 * sum(p.abs().sum() for name, p in model.named_parameters() if "bias" not in name)
optimizer_l1 = optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(100, 10)
y = torch.randn(100, 1)
loss_fn = nn.MSELoss()
for epoch in range(100):
optimizer_l1.zero_grad()
pred = model(X)
loss = loss_fn(pred, y) + l1_penalty(model, lambda_l1=1e-4)
loss.backward()
optimizer_l1.step()
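After training, one way to see whether L1 actually produced sparsity is to count near-zero weights; a small follow-up to the loop above (the 1e-3 threshold is arbitrary):

with torch.no_grad():
    n_zero = (model.weight.abs() < 1e-3).sum().item()
print(f"{n_zero} of {model.weight.numel()} weights are (near) zero")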
Visualizing the effect on weights:
import numpy as np
# Simulate weight decay (L2) step
w = np.array([5.0, 0.01, 2.0, -3.0])
alpha, lambda_ = 0.1, 0.1
grad = np.zeros(4)  # assume a zero prediction gradient for illustration
# L2: w × (1 - 2αλ) - α × ∇L
w_after_l2 = w * (1 - 2 * alpha * lambda_) - alpha * grad
print("L2:", w_after_l2)  # ≈ [4.9, 0.0098, 1.96, -2.94]: all shrink by the same 2% factor
# L1: w - α × (∇L + λ × sign(w))
w_after_l1 = w - alpha * (grad + lambda_ * np.sign(w))
print("L1:", w_after_l1)  # ≈ [4.99, 0.0, 1.99, -2.99]: all shift toward zero by the same 0.01
# The 0.01 weight is already (numerically) zero after one L1 step, so the vector becomes sparse;
# L2 only ever shrinks it, never reaching exactly zero.