Deep Learning Activation Functions: 20 Interview Questions
Master sigmoid, tanh, ReLU, Leaky ReLU, ELU, Swish, GeLU, Softmax and more. Vanishing gradients, dying neurons, output layers, mathematical derivatives – all with concise, interview-ready answers.
Topics: Sigmoid · Tanh · ReLU · Leaky ReLU · Softmax · Swish/GeLU
1
What is an activation function? Why is it non-linear?
⚡ Easy
Answer: An activation function decides whether a neuron should fire. It introduces non-linearity – without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary complex functions (universal approximation theorem).
y = activation(W·x + b) ; Without activation: y = W₂(W₁x + b₁) + b₂ = W'x + b' (linear)
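A minimal NumPy sketch (shapes chosen arbitrarily for illustration) showing that two stacked linear layers without an activation collapse into a single linear map:
import numpy as np
W1, b1 = np.random.randn(4, 3), np.random.randn(4)
W2, b2 = np.random.randn(2, 4), np.random.randn(2)
x = np.random.randn(3)
two_layer = W2 @ (W1 @ x + b1) + b2          # no activation between the layers
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)   # one equivalent linear layer
print(np.allclose(two_layer, collapsed))     # True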
2
Explain Sigmoid activation. Where is it used? Main drawback?
⚡ Easy
Answer: Sigmoid: σ(x) = 1/(1+e⁻ˣ), output in (0,1). Used in the binary classification output layer (probability) and in LSTM gates. Drawbacks: vanishing gradient (saturated regions), not zero-centered, relatively expensive exp() computation.
Pros: smooth, probabilistic output ; Cons: vanishing gradient, not zero-centered
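A quick NumPy sketch of sigmoid and its derivative, showing how the gradient saturates for large |x|:
import numpy as np
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x): s = sigmoid(x); return s * (1 - s)   # peaks at 0.25 for x = 0
print(sigmoid_grad(0.0), sigmoid_grad(10.0))   # 0.25 vs ~4.5e-05 (saturation)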
3
How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers?
📊 Medium
Answer: Tanh = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ), output range (-1,1). It is zero-centered, which helps gradient flow. Still suffers from vanishing gradient. Preferred over sigmoid in hidden units because zero-centered outputs prevent biased gradients.
sigmoid: (0,1) ; tanh: (-1,1) (zero-centered)
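A small illustrative check (input values arbitrary): tanh outputs average around zero while sigmoid outputs stay positive:
import numpy as np
x = np.linspace(-3, 3, 7)
print(np.tanh(x).mean())              # ≈ 0 – zero-centered outputs
print((1 / (1 + np.exp(-x))).mean())  # ≈ 0.5 – always-positive outputs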
4
What is ReLU? Why is it so widely used?
⚡ Easy
Answer: ReLU = max(0, x). Advantages: Non-saturating, cheap computation, sparse activation, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.
import numpy as np
def relu(x): return np.maximum(0, x) # derivative: 1 if x>0 else 0
5
What is the "dying ReLU" problem? Solutions?
📊 Medium
Answer: Neurons whose pre-activation stays negative for all inputs output 0 permanently – the gradient is zero, so they never recover. Solutions: Leaky ReLU, PReLU, ELU, or a smaller learning rate.
Problem: dead neurons ; Fix: Leaky ReLU
6
Differentiate Leaky ReLU, PReLU, and RReLU.
🔥 Hard
Answer: Leaky ReLU: f(x) = max(αx, x) with α fixed (e.g., 0.01). PReLU: α is learned during training. RReLU: α is sampled randomly during training and fixed at test time. All address dying ReLU.
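A minimal NumPy sketch of Leaky ReLU; the same formula with a learnable α would be PReLU:
import numpy as np
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope keeps gradients alive for x < 0
print(leaky_relu(np.array([-2.0, 3.0])))   # [-0.02  3.  ]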
7
What are ELU and SELU? When to use SELU?
🔥 Hard
Answer: ELU: f(x) = x if x > 0 else α(eˣ-1). Smooth, negative saturation, robust to noise. SELU is self-normalizing: with proper (LeCun-normal) initialization, activations automatically tend toward zero mean/unit variance – suited to deep fully-connected networks.
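A NumPy sketch of ELU and SELU (SELU's fixed self-normalizing constants shown rounded):
import numpy as np
def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))   # saturates smoothly to -alpha
def selu(x, scale=1.0507, alpha=1.6733):                 # fixed constants from the SELU paper
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))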
8
Explain Softmax. Why use exponentials?
📊 Medium
Answer: Softmax converts logits to a probability distribution: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). Exponentials amplify differences and ensure positivity. Used in the multi-class output layer.
P(y=i) = exp(z_i) / Σⱼ exp(z_j)
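A numerically stable NumPy implementation (subtracting the max is the standard trick to avoid exp overflow):
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())     # shift by max(z) for numerical stability
    return e / e.sum()
print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1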
9
What are Swish and GeLU? Why do they outperform ReLU in Transformers?
🔥 Hard
Answer: Swish = x·sigmoid(x) (smooth, non-monotonic). GeLU = x·Φ(x) (Gaussian CDF). Both allow negative values with a smooth bump. Used in BERT, GPT, ViT; provide better gradient flow and often higher accuracy.
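A NumPy sketch of Swish and the tanh approximation of GeLU (the approximation commonly used in Transformer implementations):
import numpy as np
def swish(x): return x / (1 + np.exp(-x))   # x * sigmoid(x)
def gelu(x):  # tanh approximation of x * Φ(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))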
10
Which activation functions are prone to vanishing gradient? Why?
📊 Medium
Answer: Sigmoid and Tanh – their gradients approach 0 for large |x| (saturation). ReLU does not saturate for positive inputs, but it can produce dead neurons (zero gradient for x < 0); ELU/Swish keep small but non-zero gradients on the negative side.
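A back-of-the-envelope illustration: sigmoid's gradient is at most 0.25, so chaining it across many layers shrinks gradients multiplicatively:
import numpy as np
def sigmoid_grad(x): s = 1 / (1 + np.exp(-x)); return s * (1 - s)
print(0.25 ** 10)          # ~9.5e-07 after 10 layers, even in the best case
print(sigmoid_grad(5.0))   # ~0.0066 once a unit saturates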
11
What activation function for regression output?
⚡ Easy
Answer: Linear (identity). No activation or f(x)=x. Allows unbounded values. For positive-only regression, use ReLU or Softplus.
12
What is Softplus? Relation to ReLU?
⚡ Easy
Answer: Softplus = ln(1+eˣ). Smooth approximation of ReLU. Used in some variational autoencoders (e.g., to keep a predicted variance positive). Its output is never exactly zero (no sparsity) and it is computationally heavier.
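A numerically stable NumPy sketch (the max/log1p form avoids overflow for large x):
import numpy as np
def softplus(x):
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))   # = ln(1 + e^x), overflow-safe
print(softplus(np.array([-5.0, 0.0, 5.0])))   # ≈ [0.0067, 0.693, 5.0067] – a smooth ReLU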
13
Output activation for binary classification?
⚡ Easy
Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.
14
Activation for multi-label classification?
📊 Medium
Answer: Sigmoid per output (independent probabilities). Not softmax (which forces one class). Use sigmoid + binary cross-entropy.
15
Why not use step function as activation?
📊 Medium
Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.
16
Why is zero-centered activation desirable?
🔥 Hard
Answer: If activations are always positive (sigmoid, ReLU), the gradients of all weights feeding a given neuron share the same sign. This causes zig-zagging updates and slower convergence. Zero-centered activations (tanh) avoid this.
17
Write derivative of ReLU and Leaky ReLU.
📊 Medium
Answer: ReLU'(x) = 1 if x > 0 else 0 (undefined at 0; conventionally taken as 0). Leaky ReLU'(x) = 1 if x > 0 else α (e.g., 0.01).
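The same derivatives as NumPy one-liners:
import numpy as np
def relu_grad(x): return (x > 0).astype(float)                       # subgradient 0 at x = 0
def leaky_relu_grad(x, alpha=0.01): return np.where(x > 0, 1.0, alpha)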
18
Which activations are used in RNNs/LSTMs? Why?
📊 Medium
Answer: LSTM: sigmoid for gates (values in (0,1) act as soft switches), tanh for cell/hidden updates. Gating mitigates vanishing/exploding gradients. Modern RNNs sometimes use ReLU with careful initialization.
19
What is Maxout activation? Pros/cons?
🔥 Hard
Answer: Maxout takes the max of k linear combinations of the input. Can approximate any convex function. No vanishing gradient, but multiplies the parameter count by k (doubles it for k = 2). Rarely used today.
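A minimal NumPy sketch of a maxout layer (shapes are illustrative: k pieces, out units, in inputs):
import numpy as np
def maxout(x, W, b):      # W: (k, out, in), b: (k, out)
    return np.max(np.einsum('koi,i->ko', W, x) + b, axis=0)   # elementwise max over the k pieces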
20
Heuristic: which activation for hidden layers?
⚡ Easy
Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.
ReLU → Leaky ReLU → ELU → Swish (increasing complexity)
Activation Functions – Interview Cheat Sheet
Output layers
- Binary → Sigmoid
- Multi-class → Softmax
- Multi-label → Sigmoid (per unit)
- Regression → Linear
Vanishing gradient
- Sigmoid, Tanh (avoid in hidden layers)
Hidden units
- 1st choice: ReLU (fast, sparse)
- 2nd choice: Leaky ReLU / ELU
- 3rd choice: Swish / GeLU (SOTA)
Dying ReLU fix
- Leaky ReLU, PReLU, ELU
Verdict: "ReLU first, then tune. Know your gradients!"