
Activation Functions: The Non-Linear Gatekeepers

Activation functions decide whether a neuron should fire or not. They inject non-linearity into neural networks, enabling universal function approximation. From classic Sigmoid to modern GELU — complete reference with code.

At a glance:

  • Sigmoid — (0, 1), probabilistic
  • ReLU — max(0, x), sparse
  • Tanh — (-1, 1), zero-centered
  • Softmax — multi-class probability distribution

Why do we need activation functions?

Without activation functions, neural networks would just be linear transformations. No matter how many layers, a linear combination of linear functions is still linear. Activation functions introduce non-linearity, allowing the network to learn complex patterns, decision boundaries, and hierarchical representations.

Weighted sum (z = w·x + b) → Activation f(z) → Output (non-linear)

Every neuron applies an activation function to its weighted input.
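To see why depth alone adds no expressive power, here is a quick NumPy sketch (random weights, purely for illustration) showing that two stacked linear layers collapse into a single linear layer:

```python
import numpy as np

# Two stacked linear layers with no activation in between...
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer:
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)
```

Inserting any non-linear f between the two layers breaks this collapse, which is what lets the network model curved decision boundaries.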

Classic Activation Functions: Sigmoid & Tanh

Sigmoid (σ)

Formula: σ(x) = 1 / (1 + e^(-x))

Derivative: σ'(x) = σ(x)(1 - σ(x))

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

Output in (0, 1). Prone to vanishing gradients. Typically used in the output layer for binary classification.

Tanh (Hyperbolic Tangent)

Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Derivative: tanh'(x) = 1 - tanh²(x)

def tanh(x):
    return np.tanh(x)  # NumPy built-in; same as (e^x - e^-x) / (e^x + e^-x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

Zero-centered output in (-1, 1). Stronger gradients than sigmoid, but still saturates for large |x|.

Vanishing Gradient: Both Sigmoid and Tanh squash large inputs, making gradients near zero. Deep networks struggle to learn.
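A small numeric check makes the problem concrete: the sigmoid gradient shrinks rapidly as inputs grow, and these factors multiply across layers during backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Gradient of sigmoid at increasingly large inputs
for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    grad = s * (1 - s)
    print(f"x = {x:5.1f}   sigmoid' = {grad:.2e}")
# The gradient peaks at 0.25 (x = 0) and is already ~4.5e-05 at x = 10;
# chained through many saturated layers, early-layer updates vanish.
```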

ReLU & Family: Solving Vanishing Gradient

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

def relu(x):
    return np.maximum(0, x)
# derivative: 1 if x>0 else 0

Pros: Computationally cheap, sparse, no saturation for x>0. Cons: Dying ReLU (neurons stuck at 0).
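A sketch of the (sub)gradient makes the dying-ReLU risk visible: every non-positive input gets exactly zero gradient, so a neuron whose pre-activations stay negative never updates.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # Subgradient: 1 for x > 0, 0 otherwise (we pick 0 at x == 0)
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.] -- zero gradient for all x <= 0
```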

Leaky ReLU

f(x) = x if x>0 else αx (α small, e.g., 0.01)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

Fixes dying ReLU; allows gradient flow for negative values.

ELU (Exponential Linear Unit)

f(x) = x if x > 0 else α(e^x - 1)

Smooth, negative values push mean closer to zero. Faster learning.
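No code is shown for ELU above; a minimal NumPy sketch might look like:

```python
import numpy as np

def elu(x, alpha=1.0):
    # np.where evaluates both branches, so clip x before exp
    # to avoid overflow warnings for large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

print(elu(np.array([-3.0, -1.0, 0.0, 2.0])))  # ≈ [-0.95, -0.632, 0.0, 2.0]
```

Note the smooth saturation toward -α for very negative inputs, unlike Leaky ReLU's unbounded negative slope.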

PReLU (Parametric ReLU)

α is learned during training.

# TensorFlow: tf.keras.layers.PReLU()

Softmax: From Logits to Probabilities

Softmax is used in the output layer for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution over classes.

Softmax + CrossEntropy
def softmax(logits):
    exp_shifted = np.exp(logits - np.max(logits))  # numerical stability
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

# Example: logits = [2.0, 1.0, 0.1] -> probabilities sum=1
Key property

All outputs ∈ (0,1) and sum to 1.
Ideal for mutually exclusive classes.
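Running the softmax above on the example logits illustrates both properties:

```python
import numpy as np

def softmax(logits):
    exp_shifted = np.exp(logits - np.max(logits))  # numerical stability
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # every entry in (0, 1); highest logit -> highest probability
print(probs.sum())  # 1.0
```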

Modern Activation Functions (Transformers, CNNs)

Swish (Google, 2017)

f(x) = x * sigmoid(βx) (β learnable, or fixed at 1)

Smooth, non-monotonic. Outperforms ReLU in deep nets.

# TF: tf.keras.activations.swish
GELU (Gaussian Error Linear Unit)

GELU(x) = x * Φ(x), where Φ is the CDF of the standard Gaussian.

Used in BERT, GPT, ViT. Smooth ReLU variant.

# PyTorch: torch.nn.GELU(); used throughout Hugging Face transformers
Mish (2019)

f(x) = x * tanh(softplus(x)).

Self-regularized, slightly better than Swish on some benchmarks.
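For reference, rough NumPy sketches of all three (the GELU here uses the common tanh approximation rather than the exact Gaussian CDF):

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1 + np.exp(-beta * x))   # x * sigmoid(beta * x)

def gelu(x):
    # tanh approximation, as used in BERT-style implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mish(x):
    return x * np.tanh(np.log1p(np.exp(x)))  # softplus(x) = log(1 + e^x)

x = np.linspace(-3, 3, 7)
for name, f in [("swish", swish), ("gelu", gelu), ("mish", mish)]:
    print(name, np.round(f(x), 3))
# All three are smooth, pass through 0, and dip slightly below zero
# for negative inputs (non-monotonic), unlike ReLU.
```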

Trend: Smooth, non-monotonic, often with no saturation. Swish & GELU are default in many modern architectures.

Activation Function Selection Guide

Function     | Range          | When to use                               | Common in
Sigmoid      | (0, 1)         | Binary output, probabilistic gate         | Logistic regression, some attention
Tanh         | (-1, 1)        | Zero-centered hidden layers (older RNNs)  | LSTM candidate gates
ReLU         | [0, ∞)         | Default for hidden layers (CNNs, MLPs)    | ResNet, VGG, YOLO
Leaky ReLU   | (-∞, ∞)        | Avoid dead neurons                        | GANs, some detection models
Softmax      | (0, 1), sum=1  | Multi-class classification output         | Classification heads
Swish / SiLU | (-∞, ∞)        | Deep transformer-style models             | EfficientNet, RL
GELU         | (-∞, ∞)        | NLP Transformers (BERT, GPT)              | Hugging Face models

Rule of thumb:

  • Start with ReLU for hidden layers.
  • For output: Sigmoid (binary), Softmax (multi-class), Linear (regression).
  • If ReLU causes dead neurons → Leaky ReLU / ELU.
  • For Transformers: GELU.
  • For very deep nets: consider Swish.

Activation Functions in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf
# As activation string or layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='leaky_relu'),  # string form needs recent Keras; otherwise use tf.keras.layers.LeakyReLU()
    tf.keras.layers.Dense(10, activation='softmax')
])
# Advanced: tf.keras.activations.gelu, tf.nn.swish
PyTorch
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.act2 = nn.LeakyReLU(0.02)
        self.out = nn.Linear(64, 10)
    def forward(self, x):
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        return self.out(x)  # with CrossEntropyLoss, no softmax needed

Activation shapes at a glance

    Sigmoid:    S-curve, squashes any input into (0, 1)
    Tanh:       S-curve through the origin, range (-1, 1)
    ReLU:       flat at 0 for x < 0, identity for x > 0
    Leaky ReLU: small negative slope for x < 0, identity for x > 0
    Softmax:    vector -> probability vector, e.g. [0.2, 0.7, 0.1]

Activation Pitfalls & Best Practices

⚠️ Vanishing gradient: Avoid Sigmoid/Tanh in deep hidden layers. Use ReLU or variants.
🎯 Dead ReLU: Use Leaky ReLU or ELU if many neurons output zero forever.
💡 Numerical stability: For softmax, always subtract max(logits) before exponentiation.
🧠 Output layer: Match activation to task: linear (regression), sigmoid (binary), softmax (multi-class).
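The stability pitfall is easy to reproduce: with large logits the naive softmax overflows to inf and yields NaNs, while the shifted version is mathematically identical and well-behaved.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive: np.exp(1000) overflows to inf -> inf/inf = nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits))

# Stable: subtracting max(logits) shifts the largest exponent to 0
shifted = logits - np.max(logits)
stable = np.exp(shifted) / np.sum(np.exp(shifted))

print(naive)   # [nan nan nan]
print(stable)  # valid probabilities summing to 1
```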

Activation Function Cheatsheet

  • Sigmoid — (0, 1)
  • Tanh — (-1, 1)
  • ReLU — max(0, x)
  • Leaky ReLU — αx for x < 0 (α ≈ 0.01)
  • ELU — α(e^x - 1) for x < 0
  • Swish — x·sigmoid(x)
  • GELU — x·Φ(x)
  • Softmax — outputs sum to 1

Next Up: Loss Functions – How neural networks learn.