Activation Functions
Activation functions introduce non-linearity into neural networks. Without them, stacking layers is mathematically equivalent to a single layer — the network can't learn curves, boundaries, or complex patterns.
Intuition First
Imagine you're trying to draw a circle on paper using only straight lines. No matter how many straight lines you draw or combine, you can't make a circle — lines are linear.
A neural network without activation functions is the same. Every linear layer just does Wx + b. Stack three linear layers together and the math collapses: it's still just one big linear transformation. You've gained nothing.
Activation functions bend the output. They introduce "kinks" or "curves" that let the network learn any shape — spirals, boundaries, non-linear relationships. This is why they're essential.
What's Actually Happening
Without activation functions:
Layer 3 output = W₃(W₂(W₁x + b₁) + b₂) + b₃
= W₃W₂W₁x + (W₃W₂b₁ + W₃b₂ + b₃)
= Ax + c ← still just one linear function
No matter how many layers you add, a composition of linear functions is still linear. The network has no more expressive power than logistic regression.
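A quick NumPy check of this collapse (the layer sizes and random values are arbitrary, chosen only for illustration):

import numpy as np
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)
stacked = W2 @ (W1 @ x + b1) + b2        # two linear layers, no activation
A, c = W2 @ W1, W2 @ b1 + b2             # the single layer they collapse to
print(np.allclose(stacked, A @ x + c))   # True: stacking gained nothing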
With an activation function σ between layers:
a₁ = σ(W₁x + b₁) ← output is now curved
a₂ = σ(W₂a₁ + b₂) ← curves applied again
The composition is no longer linear. Given enough neurons, the network can approximate any continuous function (Universal Approximation Theorem).
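A small PyTorch sketch of this in action, assuming an arbitrary one-hidden-layer network and an arbitrary non-linear target (sin): with enough ReLU pieces, the network bends itself around the curve.

import torch
import torch.nn as nn
x = torch.linspace(-3, 3, 200).unsqueeze(1)          # inputs, shape (200, 1)
y = torch.sin(x)                                      # a clearly non-linear target
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final MSE: {loss.item():.4f}")                # typically well under 0.01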
Formal Explanation
ReLU (Rectified Linear Unit)
ReLU(x) = max(0, x)
Derivative:
ReLU'(x) = 1 if x > 0
0 if x ≤ 0
- Output: 0 for all negative inputs, identity for positive inputs
- Creates sparse activations — only some neurons fire (checked numerically below)
- Simple derivative makes backprop fast and clean
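A quick check of the sparsity claim (plain NumPy; the zero-mean random inputs are only an illustrative assumption):

import numpy as np
rng = np.random.default_rng(0)
pre = rng.normal(size=10_000)                                  # zero-mean pre-activations
post = np.maximum(0, pre)                                      # ReLU
print(f"fraction of exact zeros: {(post == 0).mean():.2f}")    # ≈ 0.50, about half the neurons stay silent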
Sigmoid
σ(x) = 1 / (1 + e⁻ˣ)
Range: (0, 1)
Derivative:
σ'(x) = σ(x) · (1 - σ(x)) ← maximum value is 0.25
- Squashes input to (0, 1) — good for binary probabilities
- Saturates for large |x|: gradient becomes ~0 → vanishing gradient problem (see the quick check below)
- Rarely used in hidden layers; still used for binary output layers
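The saturation is easy to see numerically using the derivative formula above (printed values are approximate):

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.array([0.0, 2.0, 5.0, 10.0])
grad = sigmoid(x) * (1 - sigmoid(x))
print(np.round(grad, 6))   # ≈ [0.25, 0.105, 0.0066, 0.000045]; peaks at 0.25, collapses as |x| grows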
Tanh
tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Range: (-1, 1)
Derivative:
tanh'(x) = 1 - tanh²(x) ← maximum value is 1
- Zero-centered (unlike sigmoid): output averages near 0, which helps gradient flow (compared numerically below)
- Still saturates at large inputs — vanishing gradient persists
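A quick comparison of the zero-centering claim (NumPy; the zero-mean random inputs are an illustrative assumption):

import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print(f"mean tanh output:    {np.tanh(x).mean():+.3f}")              # ≈ 0 (zero-centered)
print(f"mean sigmoid output: {(1 / (1 + np.exp(-x))).mean():+.3f}")  # ≈ +0.50 (always positive)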
Softmax (output layer only)
softmax(z)ᵢ = eᶻⁱ / Σⱼ eᶻʲ
Outputs sum to 1 → a probability distribution (verified in the short check below)
- Takes a vector of raw scores ("logits") and turns them into probabilities
- Used at the final layer for multi-class classification
- Combined with cross-entropy loss for training (see the Softmax + Cross-Entropy note)
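A short check that softmax really produces a distribution, plus the property behind the "subtract the max" stability trick used in the Examples section: adding a constant to every logit changes nothing.

import numpy as np
def softmax(z):
    e = np.exp(z - z.max())                     # subtract max for numerical stability
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
print(np.round(softmax(logits), 3))             # [0.659 0.242 0.099], sums to 1
print(np.round(softmax(logits + 100.0), 3))     # identical: a constant shift cancels out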
GELU (Gaussian Error Linear Unit)
GELU(x) ≈ x · Φ(x) where Φ is the standard normal CDF
Simplified approximation:
GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
- Smooth approximation of ReLU — no sharp kink at zero (both forms compared below)
- Used in modern transformers (BERT, GPT-3+) — empirically outperforms ReLU
- Differentiable everywhere, which can improve gradient flow
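A quick comparison of the exact form, the tanh approximation above, and PyTorch's built-in GELU (the sample inputs are arbitrary; printed values are approximate):

import math
import torch
import torch.nn.functional as F
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))       # standard normal CDF Φ(x)
exact = x * phi
approx = 0.5 * x * (1 + torch.tanh((2 / math.pi) ** 0.5 * (x + 0.044715 * x ** 3)))
print(exact)        # ≈ [-0.0455, -0.1587, 0.0000, 0.8413, 1.9545]
print(approx)       # nearly identical to the exact form
print(F.gelu(x))    # PyTorch's built-in GELU (exact by default) matches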
Key Properties / Rules
| Activation | Range | Vanishing Gradient? | Common Use Case |
|---|---|---|---|
| ReLU | [0, ∞) | No (for positive inputs) | Default for hidden layers |
| Sigmoid | (0, 1) | Yes | Binary output layer |
| Tanh | (-1, 1) | Yes (but milder) | RNNs, some gates |
| Softmax | (0, 1), sums to 1 | N/A | Multi-class output layer |
| GELU | ≈ [-0.17, ∞) | No | Transformers |
| Leaky ReLU | (-∞, ∞) | No | Drop-in ReLU replacement (avoids dead neurons) |
Why It Matters
Vanishing gradients kill training in deep networks. Sigmoid's max derivative is 0.25. In a 10-layer network: 0.25¹⁰ ≈ 10⁻⁶. The gradient for the first layer is essentially zero — it stops learning. This is why ReLU dominates: its derivative is 1 for active neurons, so gradients don't shrink as they travel backward.
Dead ReLU problem. If a neuron's input is always negative, its output is always 0 and its gradient is always 0. The neuron is "dead" — it can never recover. Large learning rates can cause this. Leaky ReLU (max(0.01x, x)) avoids it by keeping a small negative slope.
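A minimal illustration of the difference, using PyTorch's built-in modules (the sample inputs are arbitrary):

import torch
import torch.nn as nn
x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(nn.ReLU()(x))            # [0.00, 0.00, 0.00, 0.50, 3.00]: negatives flattened to 0, so their gradient is 0 too
print(nn.LeakyReLU(0.01)(x))   # [-0.03, -0.005, 0.00, 0.50, 3.00]: the small slope keeps a gradient alive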
Softmax is only for the output layer. It computes global probabilities that sum to 1, so every output depends on every other neuron in the layer. In a hidden layer, this would create unwanted dependencies: each neuron's activation would shift whenever any other neuron's input changed. Hidden layers use ReLU or GELU, which act on each neuron independently.
Common Pitfalls
- Using sigmoid in hidden layers. It worked in the 1990s on shallow nets. In deep nets, it kills gradients. Use ReLU or GELU instead.
- Forgetting activation functions entirely. A network with no activations is just matrix multiplication — it can only learn linear relationships, no matter how many layers it has.
- Applying softmax before CrossEntropyLoss in PyTorch. PyTorch's nn.CrossEntropyLoss applies log-softmax internally. If you apply softmax first, it gets applied twice, which breaks the math. Pass raw logits to CrossEntropyLoss (a quick check is appended to the PyTorch example below).
- Very large inputs to sigmoid/tanh. The gradient will be nearly 0 and the neuron won't learn. Initialize weights carefully (Xavier/He initialization) to keep activations in the sensitive range near 0, as sketched right after this list.
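A minimal sketch of that last fix in PyTorch (the layer sizes are arbitrary; kaiming_normal_ is the He scheme, xavier_uniform_ the Xavier scheme):

import torch.nn as nn
relu_layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')                    # He init, pairs with ReLU
tanh_layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(tanh_layer.weight, gain=nn.init.calculate_gain('tanh'))    # Xavier, pairs with tanh/sigmoid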
Examples
import numpy as np
import torch
import torch.nn as nn
# --- All four activations in NumPy ---
def relu(x):
    return np.maximum(0, x)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def tanh(x):
    return np.tanh(x)
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Input: ", x)
print("ReLU: ", relu(x)) # [0, 0, 0, 1, 2]
print("Sigmoid: ", np.round(sigmoid(x), 3)) # [0.119, 0.269, 0.5, 0.731, 0.881]
print("Tanh: ", np.round(tanh(x), 3)) # [-0.964, -0.762, 0, 0.762, 0.964]
logits = np.array([2.0, 1.0, 0.1])
print("Softmax:", np.round(softmax(logits), 3)) # sums to 1
# Demonstrating the vanishing gradient — why sigmoid fails in deep nets
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)   # max is 0.25
def relu_grad(x):
    return float(x > 0)  # 1.0 for positive inputs, 0.0 otherwise
x = 0.5  # a typical pre-activation value
layers = 10
# Gradient after 10 layers (starting from 1.0)
sigmoid_remaining = 1.0
relu_remaining = 1.0
for _ in range(layers):
    sigmoid_remaining *= sigmoid_grad(x)  # multiply by local gradient
    relu_remaining *= relu_grad(x)
print(f"Sigmoid gradient after {layers} layers: {sigmoid_remaining:.8f}") # ≈ 0.0000009
print(f"ReLU gradient after {layers} layers: {relu_remaining:.6f}") # 1.0
# Using activations in PyTorch
model = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),           # hidden layer: ReLU
    nn.Linear(64, 64),
    nn.GELU(),           # or GELU for transformer-style
    nn.Linear(64, 3),    # 3 classes
    # NO softmax here — CrossEntropyLoss handles it
)
logits = model(torch.randn(8, 4)) # (batch=8, classes=3)
loss_fn = nn.CrossEntropyLoss()
targets = torch.randint(0, 3, (8,))
loss = loss_fn(logits, targets) # passes logits directly ✓
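# --- Quick check of pitfall 3 (continuing the snippet above): softmax before CrossEntropyLoss ---
probs = torch.softmax(logits, dim=1)                   # the mistake: pre-softmaxed "probabilities"
wrong_loss = loss_fn(probs, targets)                   # CrossEntropyLoss now applies log-softmax a second time
print(f"loss from raw logits: {loss.item():.4f}")
print(f"loss from probs:      {wrong_loss.item():.4f}")  # a different value; always pass raw logits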