Activation Functions
Activation functions introduce non-linearity into neural networks. Without them, stacking layers is mathematically equivalent to a single layer — the network can't learn curves, boundaries, or complex patterns.
Intuition First
Imagine you're trying to draw a circle on paper using only straight lines. No matter how many straight lines you draw or combine, you can't make a circle — lines are linear.
A neural network without activation functions is the same. Every linear layer just does Wx + b. Stack three linear layers together and the math collapses: it's still just one big linear transformation. You've gained nothing.
Activation functions bend the output. They introduce "kinks" or "curves" that let the network learn any shape — spirals, boundaries, non-linear relationships. This is why they're essential.
What's Actually Happening
Without activation functions:
Layer 3 output = W₃(W₂(W₁x + b₁) + b₂) + b₃
= W₃W₂W₁x + (W₃W₂b₁ + W₃b₂ + b₃)
= Ax + c ← still just one linear function
No matter how many layers you add, a composition of linear functions is still linear. The network has no more expressive power than logistic regression.
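A quick NumPy check of this collapse (the layer sizes and random values are arbitrary, chosen only for illustration):

import numpy as np
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)
stacked = W2 @ (W1 @ x + b1) + b2        # two linear layers, no activation
A, c = W2 @ W1, W2 @ b1 + b2             # the single layer they collapse to
print(np.allclose(stacked, A @ x + c))   # True: stacking gained nothing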
With an activation function σ between layers:
a₁ = σ(W₁x + b₁) ← output is now curved
a₂ = σ(W₂a₁ + b₂) ← curves applied again
The composition is no longer linear. Given enough neurons, the network can approximate any continuous function (Universal Approximation Theorem).
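A small PyTorch sketch of this in action, assuming an arbitrary one-hidden-layer network and an arbitrary non-linear target (sin): with enough ReLU pieces, the network bends itself around the curve.

import torch
import torch.nn as nn
x = torch.linspace(-3, 3, 200).unsqueeze(1)          # inputs, shape (200, 1)
y = torch.sin(x)                                      # a clearly non-linear target
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final MSE: {loss.item():.4f}")                # typically well under 0.01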
Formal Explanation
ReLU (Rectified Linear Unit)
ReLU(x) = max(0, x)
Derivative:
ReLU'(x) = 1 if x > 0
0 if x ≤ 0
- Output: 0 for all negative inputs, identity for positive inputs
- Creates sparse activations — only some neurons fire (checked numerically below)
- Simple derivative makes backprop fast and clean
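A quick check of the sparsity claim (plain NumPy; the zero-mean random inputs are only an illustrative assumption):

import numpy as np
rng = np.random.default_rng(0)
pre = rng.normal(size=10_000)                                  # zero-mean pre-activations
post = np.maximum(0, pre)                                      # ReLU
print(f"fraction of exact zeros: {(post == 0).mean():.2f}")    # ≈ 0.50, about half the neurons stay silent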
Sigmoid
σ(x) = 1 / (1 + e⁻ˣ)
Range: (0, 1)
Derivative:
σ'(x) = σ(x) · (1 - σ(x)) ← maximum value is 0.25
- Squashes input to (0, 1) — good for binary probabilities
- Saturates for large |x|: gradient becomes ~0 → vanishing gradient problem (see the quick check below)
- Rarely used in hidden layers; still used for binary output layers
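The saturation is easy to see numerically using the derivative formula above (printed values are approximate):

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.array([0.0, 2.0, 5.0, 10.0])
grad = sigmoid(x) * (1 - sigmoid(x))
print(np.round(grad, 6))   # ≈ [0.25, 0.105, 0.0066, 0.000045]; peaks at 0.25, collapses as |x| grows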
Tanh
tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Range: (-1, 1)
Derivative:
tanh'(x) = 1 - tanh²(x) ← maximum value is 1
- Zero-centered (unlike sigmoid): output averages near 0, which helps gradient flow (compared numerically below)
- Still saturates at large inputs — vanishing gradient persists
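A quick comparison of the zero-centering claim (NumPy; the zero-mean random inputs are an illustrative assumption):

import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
print(f"mean tanh output:    {np.tanh(x).mean():+.3f}")              # ≈ 0 (zero-centered)
print(f"mean sigmoid output: {(1 / (1 + np.exp(-x))).mean():+.3f}")  # ≈ +0.50 (always positive)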
Softmax (output layer only)
softmax(z)ᵢ = eᶻⁱ / Σⱼ eᶻʲ
Outputs sum to 1 → a probability distribution (verified in the short check below)
- Takes a vector of raw scores ("logits") and turns them into probabilities
- Used at the final layer for multi-class classification
- Combined with cross-entropy loss for training (see the Softmax + Cross-Entropy note)
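A short check that softmax really produces a distribution, plus the property behind the "subtract the max" stability trick used in the Examples section: adding a constant to every logit changes nothing.

import numpy as np
def softmax(z):
    e = np.exp(z - z.max())                     # subtract max for numerical stability
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
print(np.round(softmax(logits), 3))             # [0.659 0.242 0.099], sums to 1
print(np.round(softmax(logits + 100.0), 3))     # identical: a constant shift cancels out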
GELU (Gaussian Error Linear Unit)
GELU(x) ≈ x · Φ(x) where Φ is the standard normal CDF
Simplified approximation:
GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
- Smooth approximation of ReLU — no sharp kink at zero (both forms compared below)
- Used in modern transformers (BERT, GPT-3+) — empirically outperforms ReLU
- Differentiable everywhere, which can improve gradient flow
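A quick comparison of the exact form, the tanh approximation above, and PyTorch's built-in GELU (the sample inputs are arbitrary; printed values are approximate):

import math
import torch
import torch.nn.functional as F
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))       # standard normal CDF Φ(x)
exact = x * phi
approx = 0.5 * x * (1 + torch.tanh((2 / math.pi) ** 0.5 * (x + 0.044715 * x ** 3)))
print(exact)        # ≈ [-0.0455, -0.1587, 0.0000, 0.8413, 1.9545]
print(approx)       # nearly identical to the exact form
print(F.gelu(x))    # PyTorch's built-in GELU (exact by default) matches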
Key Properties / Rules
| Activation | Range | Vanishing Gradient? | Common Use Case |
|---|---|---|---|
| ReLU | [0, ∞) | No (for positive inputs) | Default for hidden layers |
| Sigmoid | (0, 1) | Yes | Binary output layer |
| Tanh | (-1, 1) | Yes (but milder) | RNNs, some gates |
| Softmax | (0, 1), sums to 1 | N/A | Multi-class output layer |
| GELU | ≈ [-0.17, ∞) | No | Transformers |
| Leaky ReLU | (-∞, ∞) | No | Drop-in ReLU replacement (avoids dead neurons) |
Why It Matters
Vanishing gradients kill training in deep networks. Sigmoid's max derivative is 0.25. In a 10-layer network: 0.25¹⁰ ≈ 10⁻⁶. The gradient for the first layer is essentially zero — it stops learning. This is why ReLU dominates: its derivative is 1 for active neurons, so gradients don't shrink as they travel backward.
Dead ReLU problem. If a neuron's input is always negative, its output is always 0 and its gradient is always 0. The neuron is "dead" — it can never recover. Large learning rates can cause this. Leaky ReLU (max(0.01x, x)) avoids it by keeping a small negative slope.
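A minimal illustration of the difference, using PyTorch's built-in modules (the sample inputs are arbitrary):

import torch
import torch.nn as nn
x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(nn.ReLU()(x))            # [0.00, 0.00, 0.00, 0.50, 3.00]: negatives flattened to 0, so their gradient is 0 too
print(nn.LeakyReLU(0.01)(x))   # [-0.03, -0.005, 0.00, 0.50, 3.00]: the small slope keeps a gradient alive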
Softmax is only for the output layer. It computes global probabilities that sum to 1, so every output depends on every other neuron in the layer. In a hidden layer, this would create unwanted dependencies: each neuron's activation would shift whenever any other neuron's input changed. Hidden layers use ReLU or GELU, which act on each neuron independently.
Common Pitfalls
- Using sigmoid in hidden layers. It worked in the 1990s on shallow nets. In deep nets, it kills gradients. Use ReLU or GELU instead.
- Forgetting activation functions entirely. A network with no activations is just matrix multiplication — it can only learn linear relationships, no matter how many layers it has.
- Applying softmax before CrossEntropyLoss in PyTorch. PyTorch's nn.CrossEntropyLoss applies log-softmax internally. If you apply softmax first, it gets applied twice, which breaks the math. Pass raw logits to CrossEntropyLoss (a quick check is appended to the PyTorch example below).
- Very large inputs to sigmoid/tanh. The gradient will be nearly 0 and the neuron won't learn. Initialize weights carefully (Xavier/He initialization) to keep activations in the sensitive range near 0, as sketched right after this list.
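A minimal sketch of that last fix in PyTorch (the layer sizes are arbitrary; kaiming_normal_ is the He scheme, xavier_uniform_ the Xavier scheme):

import torch.nn as nn
relu_layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')                    # He init, pairs with ReLU
tanh_layer = nn.Linear(256, 256)
nn.init.xavier_uniform_(tanh_layer.weight, gain=nn.init.calculate_gain('tanh'))    # Xavier, pairs with tanh/sigmoid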
Examples
import numpy as np
import torch
import torch.nn as nn
# --- All four activations in NumPy ---
def relu(x):
    return np.maximum(0, x)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def tanh(x):
    return np.tanh(x)
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Input: ", x)
print("ReLU: ", relu(x)) # [0, 0, 0, 1, 2]
print("Sigmoid: ", np.round(sigmoid(x), 3)) # [0.119, 0.269, 0.5, 0.731, 0.881]
print("Tanh: ", np.round(tanh(x), 3)) # [-0.964, -0.762, 0, 0.762, 0.964]
logits = np.array([2.0, 1.0, 0.1])
print("Softmax:", np.round(softmax(logits), 3)) # sums to 1
# Demonstrating the vanishing gradient — why sigmoid fails in deep nets
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)   # max is 0.25
def relu_grad(x):
    return float(x > 0)  # 1.0 for positive inputs, 0.0 otherwise
x = 0.5  # a typical pre-activation value
layers = 10
# Gradient after 10 layers (starting from 1.0)
sigmoid_remaining = 1.0
relu_remaining = 1.0
for _ in range(layers):
    sigmoid_remaining *= sigmoid_grad(x)  # multiply by local gradient
    relu_remaining *= relu_grad(x)
print(f"Sigmoid gradient after {layers} layers: {sigmoid_remaining:.8f}") # ≈ 0.0000009
print(f"ReLU gradient after {layers} layers: {relu_remaining:.6f}") # 1.0
# Using activations in PyTorch
model = nn.Sequential(
    nn.Linear(4, 64),
    nn.ReLU(),           # hidden layer: ReLU
    nn.Linear(64, 64),
    nn.GELU(),           # or GELU for transformer-style
    nn.Linear(64, 3),    # 3 classes
    # NO softmax here — CrossEntropyLoss handles it
)
logits = model(torch.randn(8, 4)) # (batch=8, classes=3)
loss_fn = nn.CrossEntropyLoss()
targets = torch.randint(0, 3, (8,))
loss = loss_fn(logits, targets) # passes logits directly ✓
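# --- Quick check of pitfall 3 (continuing the snippet above): softmax before CrossEntropyLoss ---
probs = torch.softmax(logits, dim=1)                   # the mistake: pre-softmaxed "probabilities"
wrong_loss = loss_fn(probs, targets)                   # CrossEntropyLoss now applies log-softmax a second time
print(f"loss from raw logits: {loss.item():.4f}")
print(f"loss from probs:      {wrong_loss.item():.4f}")  # a different value; always pass raw logits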