Activation Functions: The Non-Linear Gatekeepers
Activation functions decide whether, and how strongly, a neuron fires. They inject non-linearity into neural networks, enabling universal function approximation. From the classic Sigmoid to the modern GELU — a complete reference with code.
Sigmoid — 0 to 1, probabilistic
ReLU — max(0, x), sparse
Tanh — -1 to 1, zero-centered
Softmax — multi-class probability
Why do we need activation functions?
Without activation functions, neural networks would just be linear transformations. No matter how many layers, a linear combination of linear functions is still linear. Activation functions introduce non-linearity, allowing the network to learn complex patterns, decision boundaries, and hierarchical representations.
Every neuron applies an activation function to its weighted input.
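The linearity collapse is easy to verify numerically. A minimal sketch (with arbitrary random weights, just for illustration) showing that two stacked linear layers without an activation are exactly one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal a single linear layer with combined weights.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)
```

No matter how many layers you stack this way, the result is always expressible as one `W @ x + b` — which is exactly why a non-linearity between layers is required.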
Classic Activation Functions: Sigmoid & Tanh
Sigmoid (σ)
Formula: σ(x) = 1 / (1 + e^(-x))
Derivative: σ'(x) = σ(x)(1 - σ(x))
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)
Output range (0,1). Saturates for large |x|, causing vanishing gradients. Commonly used in the output layer for binary classification.
Tanh (Hyperbolic Tangent)
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Derivative: 1 - tanh²(x)
def tanh(x):
    return np.tanh(x)  # or implement manually from the formula above

def tanh_derivative(x):
    return 1 - np.tanh(x)**2
Zero-centered output in (-1,1) gives stronger gradients than sigmoid, but tanh still saturates for large |x|.
ReLU & Family: Solving Vanishing Gradient
ReLU (Rectified Linear Unit)
f(x) = max(0, x)
def relu(x):
    return np.maximum(0, x)

# derivative: 1 if x > 0 else 0 (undefined at x = 0; implementations pick 0 or 1)
Pros: computationally cheap, induces sparsity, no saturation for x > 0. Cons: dying ReLU — a neuron whose pre-activation stays negative outputs 0, receives zero gradient, and stops learning.
Leaky ReLU
f(x) = x if x>0 else αx (α small, e.g., 0.01)
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
Fixes dying ReLU; allows gradient flow for negative values.
ELU (Exponential Linear Unit)
f(x) = x if x > 0 else α(e^x - 1)
Smooth; its negative outputs push the mean activation closer to zero, which can speed up learning.
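ELU has no code sample above; a minimal NumPy sketch mirroring the formula:

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for positives; alpha * (e^x - 1) for negatives.
    # np.expm1(x) computes e^x - 1 accurately; np.minimum guards the exp
    # against overflow, since np.where evaluates both branches.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0)))
```

Note that the negative branch approaches -α as x → -∞, which is what bounds ELU's negative saturation.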
PReLU (Parametric ReLU)
α is learned during training.
# TensorFlow: tf.keras.layers.PReLU()
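In NumPy terms, the forward pass and the gradient that makes α trainable look roughly like this (a sketch, not the library implementation):

```python
import numpy as np

def prelu(x, alpha):
    # identical to Leaky ReLU, except alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d(output)/d(alpha) is x on the negative side, 0 elsewhere;
    # this non-zero gradient is what lets backprop update alpha
    return np.where(x > 0, 0.0, x)
```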
Softmax: From Logits to Probabilities
Softmax is used in the output layer for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution over classes.
def softmax(logits):
    # subtract the max for numerical stability (prevents overflow in exp);
    # axis=-1 keeps this correct for batched inputs too
    exp_shifted = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

# Example: softmax([2.0, 1.0, 0.1]) -> probabilities that sum to 1
Key property
All outputs ∈ (0,1) and sum to 1.
Ideal for mutually exclusive classes.
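These properties can be checked directly (using a 1-D copy of the softmax above for simplicity):

```python
import numpy as np

def softmax(logits):
    exp_shifted = np.exp(logits - np.max(logits))  # numerical stability
    return exp_shifted / np.sum(exp_shifted)

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)        # probabilities sum to 1
assert np.all((p > 0) & (p < 1))       # each output lies in (0, 1)
assert p[0] > p[1] > p[2]              # ordering follows the logits
# shift-invariance: adding a constant to every logit changes nothing
assert np.allclose(softmax(np.array([2.0, 1.0, 0.1]) + 100.0), p)
```

The shift-invariance is exactly why subtracting the max is safe: it changes no output, only the intermediate magnitudes.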
Modern Activation Functions (Transformers, CNNs)
Swish (Google, 2017)
f(x) = x * sigmoid(βx) (β learnable, or fixed at 1)
Smooth, non-monotonic. Outperforms ReLU in deep nets.
# TF: tf.keras.activations.swish
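A NumPy sketch of the formula with β fixed at 1 (this special case is also called SiLU):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x), written algebraically as x / (1 + e^(-beta * x))
    return x / (1 + np.exp(-beta * x))
```

For large positive x it approaches the identity, for large negative x it approaches 0, and unlike ReLU it dips slightly below zero in between — the non-monotonic "bump".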
GELU (Gaussian Error Linear Unit)
GELU(x) = x * Φ(x), where Φ is the CDF of the standard Gaussian.
Used in BERT, GPT, ViT. Smooth ReLU variant.
# PyTorch: torch.nn.GELU (used throughout Hugging Face transformers)
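The exact form needs the Gaussian CDF; the tanh approximation popularized by the BERT codebase is easy to write in NumPy (a sketch, not the library implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of x * Phi(x), as used in BERT/GPT implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
```

Like Swish, it is smooth and slightly negative for small negative inputs, then converges to the identity for large positive x.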
Mish (2019)
f(x) = x * tanh(softplus(x)).
Self-regularized, slightly better than Swish on some benchmarks.
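A direct NumPy transcription of the formula (for large positive x a numerically stable softplus would be needed; this sketch keeps it simple):

```python
import numpy as np

def mish(x):
    # softplus(x) = ln(1 + e^x); np.log1p computes ln(1 + t) accurately
    return x * np.tanh(np.log1p(np.exp(x)))
```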
Activation Function Selection Guide
| Function | Range | When to use | Common in |
|---|---|---|---|
| Sigmoid | (0,1) | Binary output, probabilistic gate | Logistic regression, some attention |
| Tanh | (-1,1) | Zero-centered hidden layers (older RNNs) | LSTM candidate gates |
| ReLU | [0,∞) | Default for hidden layers (CNNs, MLPs) | ResNet, VGG, YOLO |
| Leaky ReLU | (-∞,∞) | Avoid dead neurons | GANs, some detection models |
| Softmax | (0,1) sum=1 | Multi-class classification output | Classification heads |
| Swish / SiLU | (-∞,∞) | Deep transformer-style models | EfficientNet, RL |
| GELU | (-∞,∞) | NLP Transformers (BERT, GPT) | Hugging Face models |
Rule of thumb:
- Start with ReLU for hidden layers.
- For output: Sigmoid (binary), Softmax (multi-class), Linear (regression).
- If ReLU causes dead neurons → Leaky ReLU / ELU.
- For Transformers: GELU.
- For very deep nets: consider Swish.
Activation Functions in TensorFlow & PyTorch
TensorFlow / Keras
import tensorflow as tf

# Activation as a string argument or as a separate layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='leaky_relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Advanced: tf.keras.activations.gelu, tf.nn.swish
PyTorch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.act2 = nn.LeakyReLU(0.02)
        self.out = nn.Linear(64, 10)

    def forward(self, x):
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        return self.out(x)  # with CrossEntropyLoss, no softmax needed
Activation shapes at a glance
Sigmoid: S-curve squashing to (0,1)
Tanh: S-curve squashing to (-1,1), zero-centered
ReLU: 0 for x < 0, identity for x > 0
Leaky ReLU: like ReLU, with a small negative slope for x < 0
Softmax: maps logits to a probability vector, e.g. [0.2, 0.7, 0.1]