Deep Learning Activation Functions: 20 Interview Questions
Master sigmoid, tanh, ReLU, Leaky ReLU, ELU, Swish, GeLU, Softmax and more. Vanishing gradients, dying neurons, output layers, mathematical derivatives – all with concise, interview-ready answers.
Topics: Sigmoid · Tanh · ReLU · Leaky ReLU · Softmax · Swish/GeLU
1
What is an activation function? Why is it non-linear?
⚡ Easy
Answer: An activation function decides whether a neuron should fire. It introduces non-linearity – without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary complex functions (universal approximation theorem).
y = activation(W·x + b) ; Without activation: y = W₂(W₁x + b₁) + b₂ = W'x + b' (linear)
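A minimal NumPy sketch (shapes chosen arbitrarily for illustration) showing that two stacked linear layers without an activation collapse into a single linear map:
import numpy as np
W1, b1 = np.random.randn(4, 3), np.random.randn(4)
W2, b2 = np.random.randn(2, 4), np.random.randn(2)
x = np.random.randn(3)
two_layer = W2 @ (W1 @ x + b1) + b2          # no activation between the layers
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)   # one equivalent linear layer
print(np.allclose(two_layer, collapsed))     # True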
2
Explain Sigmoid activation. Where is it used? Main drawback?
⚡ Easy
Answer: Sigmoid: σ(x) = 1/(1+e⁻ˣ), output in (0,1). Used in the binary classification output layer (probability) and in LSTM gates. Drawbacks: vanishing gradient (saturated regions), not zero-centered, relatively expensive exp() computation.
Pros: smooth, probabilistic output ; Cons: vanishing gradient, not zero-centered
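A quick NumPy sketch of sigmoid and its derivative, showing how the gradient saturates for large |x|:
import numpy as np
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x): s = sigmoid(x); return s * (1 - s)   # peaks at 0.25 for x = 0
print(sigmoid_grad(0.0), sigmoid_grad(10.0))   # 0.25 vs ~4.5e-05 (saturation)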
3
How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers?
📊 Medium
Answer: Tanh = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ), output range (-1,1). It is zero-centered, which helps gradient flow. Still suffers from vanishing gradient. Preferred over sigmoid in hidden units because zero-centered outputs prevent biased gradients.
sigmoid: (0,1) ; tanh: (-1,1) (zero-centered)
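A small illustrative check (input values arbitrary): tanh outputs average around zero while sigmoid outputs stay positive:
import numpy as np
x = np.linspace(-3, 3, 7)
print(np.tanh(x).mean())              # ≈ 0 – zero-centered outputs
print((1 / (1 + np.exp(-x))).mean())  # ≈ 0.5 – always-positive outputs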
4
What is ReLU? Why is it so widely used?
⚡ Easy
Answer: ReLU = max(0, x). Advantages: Non-saturating, cheap computation, sparse activation, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.
import numpy as np
def relu(x): return np.maximum(0, x) # derivative: 1 if x>0 else 0
5
What is the "dying ReLU" problem? Solutions?
📊 Medium
Answer: Neurons whose pre-activation stays negative for all inputs output 0 permanently – the gradient is zero, so they never recover. Solutions: Leaky ReLU, PReLU, ELU, or a smaller learning rate.
Problem: dead neurons ; Fix: Leaky ReLU
6
Differentiate Leaky ReLU, PReLU, and RReLU.
🔥 Hard
Answer: Leaky ReLU: f(x) = max(αx, x) with α fixed (e.g., 0.01). PReLU: α is learned during training. RReLU: α is sampled randomly during training and fixed at test time. All address dying ReLU.
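A minimal NumPy sketch of Leaky ReLU; the same formula with a learnable α would be PReLU:
import numpy as np
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps gradients alive for x < 0
print(leaky_relu(np.array([-2.0, 3.0])))   # [-0.02  3.  ]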
7
What are ELU and SELU? When to use SELU?
🔥 Hard
Answer: ELU: f(x) = x if x > 0 else α(eˣ-1). Smooth, negative saturation, robust to noise. SELU is self-normalizing: with proper (LeCun-normal) initialization, activations automatically tend toward zero mean/unit variance – suited to deep fully-connected networks.
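A NumPy sketch of ELU and SELU (SELU's fixed self-normalizing constants shown rounded):
import numpy as np
def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))   # saturates smoothly to -alpha
def selu(x, scale=1.0507, alpha=1.6733):                 # fixed constants from the SELU paper
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))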
8
Explain Softmax. Why use exponentials?
📊 Medium
Answer: Softmax converts logits to a probability distribution: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). Exponentials amplify differences and ensure positivity. Used in the multi-class output layer.
P(y=i) = exp(z_i) / Σⱼ exp(z_j)
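A numerically stable NumPy implementation (subtracting the max is the standard trick to avoid exp overflow):
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())     # shift by max(z) for numerical stability
    return e / e.sum()
print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1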
9
What are Swish and GeLU? Why do they outperform ReLU in Transformers?
🔥 Hard
Answer: Swish = x·sigmoid(x) (smooth, non-monotonic). GeLU = x·Φ(x) (Gaussian CDF). Both allow negative values with a smooth bump. Used in BERT, GPT, ViT; provide better gradient flow and often higher accuracy.
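A NumPy sketch of Swish and the tanh approximation of GeLU (the approximation commonly used in Transformer implementations):
import numpy as np
def swish(x): return x / (1 + np.exp(-x))   # x * sigmoid(x)
def gelu(x):  # tanh approximation of x * Φ(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))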
10
Which activation functions are prone to vanishing gradient? Why?
📊 Medium
Answer: Sigmoid and Tanh – their gradients approach 0 for large |x| (saturation). ReLU does not saturate for positive inputs, but it can produce dead neurons (zero gradient for x < 0); ELU/Swish keep small but non-zero gradients on the negative side.
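A back-of-the-envelope illustration: sigmoid's gradient is at most 0.25, so chaining it across many layers shrinks gradients multiplicatively:
import numpy as np
def sigmoid_grad(x): s = 1 / (1 + np.exp(-x)); return s * (1 - s)
print(0.25 ** 10)          # ~9.5e-07 after 10 layers, even in the best case
print(sigmoid_grad(5.0))   # ~0.0066 once a unit saturates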
11
What activation function for regression output?
⚡ Easy
Answer: Linear (identity). No activation or f(x)=x. Allows unbounded values. For positive-only regression, use ReLU or Softplus.
12
What is Softplus? Relation to ReLU?
⚡ Easy
Answer: Softplus = ln(1+eˣ). Smooth approximation of ReLU. Used in some variational autoencoders (e.g., to keep a predicted variance positive). Its output is never exactly zero (no sparsity) and it is computationally heavier.
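A numerically stable NumPy sketch (the max/log1p form avoids overflow for large x):
import numpy as np
def softplus(x):
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))   # = ln(1 + e^x), overflow-safe
print(softplus(np.array([-5.0, 0.0, 5.0])))   # ≈ [0.0067, 0.693, 5.0067] – a smooth ReLU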
13
Output activation for binary classification?
⚡ Easy
Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.
14
Activation for multi-label classification?
📊 Medium
Answer: Sigmoid per output (independent probabilities). Not softmax (which forces one class). Use sigmoid + binary cross-entropy.
15
Why not use step function as activation?
📊 Medium
Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.
16
Why is zero-centered activation desirable?
🔥 Hard
Answer: If activations are always positive (sigmoid, ReLU), the gradients of all weights feeding a given neuron share the same sign. This causes zig-zagging updates and slower convergence. Zero-centered activations (tanh) avoid this.
17
Write derivative of ReLU and Leaky ReLU.
📊 Medium
Answer: ReLU'(x) = 1 if x > 0 else 0 (undefined at 0; conventionally taken as 0). Leaky ReLU'(x) = 1 if x > 0 else α (e.g., 0.01).
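The same derivatives as NumPy one-liners:
import numpy as np
def relu_grad(x): return (x > 0).astype(float)                       # subgradient 0 at x = 0
def leaky_relu_grad(x, alpha=0.01): return np.where(x > 0, 1.0, alpha)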
18
Which activations are used in RNNs/LSTMs? Why?
📊 Medium
Answer: LSTM: sigmoid for gates (values in (0,1) act as soft switches), tanh for cell/hidden updates. Gating mitigates vanishing/exploding gradients. Modern RNNs sometimes use ReLU with careful initialization.
19
What is Maxout activation? Pros/cons?
🔥 Hard
Answer: Maxout takes the max of k linear combinations of the input. Can approximate any convex function. No vanishing gradient, but multiplies the parameter count by k (doubles it for k = 2). Rarely used today.
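A minimal NumPy sketch of a maxout layer (shapes are illustrative: k pieces, out units, in inputs):
import numpy as np
def maxout(x, W, b):      # W: (k, out, in), b: (k, out)
    return np.max(np.einsum('koi,i->ko', W, x) + b, axis=0)   # elementwise max over the k pieces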
20
Heuristic: which activation for hidden layers?
⚡ Easy
Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.
ReLU → Leaky ReLU → ELU → Swish (increasing complexity)
Activation Functions – Interview Cheat Sheet
Output layers
- Binary → Sigmoid
- Multi-class → Softmax
- Multi-label → Sigmoid (per unit)
- Regression → Linear
Vanishing gradient
- Sigmoid, Tanh (avoid in hidden layers)
Hidden units
- 1st choice: ReLU (fast, sparse)
- 2nd choice: Leaky ReLU / ELU
- 3rd choice: Swish / GeLU (SOTA)
Dying ReLU fix
- Leaky ReLU, PReLU, ELU
Verdict: "ReLU first, then tune. Know your gradients!"