Regularization (L1 and L2)
Regularization adds a penalty for model complexity to prevent overfitting. L1 produces sparse models by zeroing out weak weights. L2 shrinks all weights smoothly toward zero. Both discourage the model from relying too heavily on any single feature.
Intuition First
Imagine you're grading a student's essay, but you also penalize them for using long, complicated sentences. They still want a good grade, so they'll only use a complex sentence when it truly helps — not just to seem smart.
Regularization does the same to weights: the model gets penalized for having large weights, so it only makes them large when they genuinely improve predictions.
The core idea: complexity has a cost. Add that cost to the loss function, and the optimizer will trade some prediction accuracy for simpler weights.
What's Actually Happening
Without regularization, a model can assign huge weights to features — even noisy or coincidental ones — because doing so perfectly fits the training data.
With regularization, the loss becomes:
Total Loss = Prediction Loss + λ × Complexity Penalty
λ (lambda) controls the regularization strength:
- λ = 0: no regularization, free to overfit
- λ very large: heavily penalized, model collapses to near-zero weights (underfits)
- Good λ: chosen via cross-validation
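Concretely, the penalty is just another term added to whatever loss you already compute; a minimal sketch with made-up numbers:

import torch

w = torch.tensor([3.0, -0.5, 0.01])               # model weights
prediction_loss = torch.tensor(0.42)               # stand-in for the MSE on a batch
lam = 0.1                                          # regularization strength λ
l2_total = prediction_loss + lam * (w ** 2).sum()  # Total Loss with L2 penalty: λ × Σ wᵢ²
l1_total = prediction_loss + lam * w.abs().sum()   # Total Loss with L1 penalty: λ × Σ |wᵢ|
print(l2_total.item(), l1_total.item())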
Formal Explanation
L2 Regularization (Ridge / Weight Decay)
Total Loss = L(y_pred, y_true) + λ × Σ wᵢ²
The gradient of the penalty with respect to weight w is 2λw. This is added to the gradient at every step, which shrinks every weight by a constant factor:
w ← w - α × (∇L + 2λw)
= w × (1 - 2αλ) - α × ∇L
The factor (1 - 2αλ) is slightly less than 1 — called weight decay. Every weight gets multiplied by this factor each step. Large and small weights both shrink proportionally. No weight reaches exactly zero, but all become small.
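You can check this algebra against a single plain-SGD step with weight_decay; a sketch with made-up numbers (no momentum, and note that PyTorch's weight_decay multiplies w once, so it stands in for 2λ in the formula above):

import torch
import torch.nn as nn
import torch.optim as optim

alpha, lam = 0.1, 0.05
w = nn.Parameter(torch.tensor([2.0, -1.0, 0.5]))
opt = optim.SGD([w], lr=alpha, weight_decay=2 * lam)

loss = (w * torch.tensor([1.0, 2.0, 3.0])).sum()   # gives ∇L = [1, 2, 3]
loss.backward()
opt.step()

# Hand-derived update: w × (1 - 2αλ) - α × ∇L
manual = torch.tensor([2.0, -1.0, 0.5]) * (1 - 2 * alpha * lam) - alpha * torch.tensor([1.0, 2.0, 3.0])
print(torch.allclose(w.detach(), manual))   # True: weight decay is just L2 folded into the update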
L1 Regularization (Lasso)
Total Loss = L(y_pred, y_true) + λ × Σ |wᵢ|
The gradient of |w| is sign(w) — a constant ±1. This subtracts a fixed amount from every weight each step, regardless of its size:
w ← w - α × (∇L + λ × sign(w))
A weight of 0.001 and a weight of 5 both get the same push toward zero. Small weights get pushed past zero and "snap to" exactly 0. This is why L1 produces sparse weights.
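In practice, solvers make weights land exactly on zero with a soft-threshold (proximal) version of this step rather than the raw sign push; a small sketch with made-up numbers:

import numpy as np

def l1_prox_step(w, alpha, lam):
    # Shrink every weight toward zero by αλ, clamping to 0 if it would cross zero
    return np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)

w = np.array([5.0, 0.001, 2.0, -3.0])
for _ in range(10):
    w = l1_prox_step(w, alpha=0.1, lam=0.1)
print(w)  # ≈ [4.9, 0.0, 1.9, -2.9]: the 0.001 weight is exactly zero after the first step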
Key Properties / Rules
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty form | Σ|wᵢ| | Σwᵢ² |
| Effect on weights | Zeros out small weights | Shrinks all weights smoothly |
| Resulting sparsity | Sparse (many zeros) | Dense (all small, few zero) |
| Use case | Feature selection, sparse models | General regularization (most common) |
| PyTorch usage | add penalty to the loss manually | weight_decay in the optimizer |
| Gradient behavior | Constant push (sign of w) | Proportional push (value of w) |
Why It Matters
L2 is the default. PyTorch's weight_decay parameter in optimizers (SGD, Adam) implements L2. It's computationally cheap and works well in almost all settings.
L1 for sparsity. If you need to know which features matter (e.g., interpretable models, reducing feature dimensionality), L1 automatically zeros out irrelevant features.
In neural networks: L2 regularization is equivalent to adding weight decay to every parameter update. It prevents individual neurons from over-specializing to specific training examples.
Elastic Net combines both: λ₁Σ|wᵢ| + λ₂Σwᵢ² — you get some sparsity with smooth shrinkage.
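A minimal sketch of what that combined penalty looks like added to a PyTorch loss (the elastic_net_penalty helper and the λ values are illustrative, not from the text):

import torch
import torch.nn as nn

def elastic_net_penalty(model, lambda_l1, lambda_l2):
    # λ₁ Σ|wᵢ| + λ₂ Σwᵢ² over weight tensors only (biases excluded)
    l1 = sum(p.abs().sum() for name, p in model.named_parameters() if "bias" not in name)
    l2 = sum(p.pow(2).sum() for name, p in model.named_parameters() if "bias" not in name)
    return lambda_l1 * l1 + lambda_l2 * l2

model = nn.Linear(10, 1)
pred, target = model(torch.randn(8, 10)), torch.randn(8, 1)
loss = nn.MSELoss()(pred, target) + elastic_net_penalty(model, lambda_l1=1e-4, lambda_l2=1e-4)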
Common Pitfalls
- Regularizing biases. Only regularize weights, not biases. Biases don't overfit the same way; they just shift the output (see the parameter-group sketch after this list).
- λ too large. The model underfits. All weights approach zero. Loss stops decreasing because the penalty dominates.
- λ too small. No meaningful regularization effect — might as well have none.
- Forgetting that L1 isn't built in. PyTorch's weight_decay gives you L2 only; L1 penalties must be added to the loss manually unless you use a library that supports them natively, and the manual loop still needs optimizer.zero_grad() every step.
- Using L1 when you don't need sparsity. L1's non-smooth gradient (sign function) makes optimization slightly harder. Use L2 by default; switch to L1 only if sparsity matters.
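To keep weight_decay off the biases (the first pitfall above), a common pattern is to put weights and biases into separate optimizer parameter groups; a hedged sketch:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if "bias" in name else decay).append(p)

optimizer = optim.Adam([
    {"params": decay, "weight_decay": 1e-4},    # L2 applied to weight matrices
    {"params": no_decay, "weight_decay": 0.0},  # biases left unregularized
], lr=1e-3)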
Examples
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(10, 1)
# L2 regularization: just set weight_decay in the optimizer
optimizer_l2 = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# weight_decay=1e-4 sets the regularization strength: Adam adds 1e-4 × w to the gradient.
# L1 regularization: must add penalty manually
def l1_penalty(model, lambda_l1):
    # Sum |w| over weight tensors only; biases are left unregularized, per the pitfall above
    return lambda_l1 * sum(p.abs().sum() for name, p in model.named_parameters() if "bias" not in name)
optimizer_l1 = optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(100, 10)
y = torch.randn(100, 1)
loss_fn = nn.MSELoss()
for epoch in range(100):
optimizer_l1.zero_grad()
pred = model(X)
loss = loss_fn(pred, y) + l1_penalty(model, lambda_l1=1e-4)
loss.backward()
optimizer_l1.step()
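After training, one way to see whether L1 actually produced sparsity is to count near-zero weights; a small follow-up to the loop above (the 1e-3 threshold is arbitrary):

with torch.no_grad():
    n_zero = (model.weight.abs() < 1e-3).sum().item()
print(f"{n_zero} of {model.weight.numel()} weights are (near) zero")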
Visualizing the effect on weights:
import numpy as np
# Simulate weight decay (L2) step
w = np.array([5.0, 0.01, 2.0, -3.0])
alpha, lambda_ = 0.1, 0.1
grad = np.zeros(4)  # assume a zero prediction gradient for illustration
# L2: w × (1 - 2αλ) - α × ∇L
w_after_l2 = w * (1 - 2 * alpha * lambda_) - alpha * grad
print("L2:", w_after_l2)  # ≈ [4.9, 0.0098, 1.96, -2.94]: all shrink by the same 2% factor
# L1: w - α × (∇L + λ × sign(w))
w_after_l1 = w - alpha * (grad + lambda_ * np.sign(w))
print("L1:", w_after_l1)  # ≈ [4.99, 0.0, 1.99, -2.99]: all shift toward zero by the same 0.01
# The 0.01 weight is already (numerically) zero after one L1 step, so the vector becomes sparse;
# L2 only ever shrinks it, never reaching exactly zero.