Neural Networks: 20 Interview Questions
Master the fundamentals of neural networks — from perceptrons to backpropagation, activation functions, regularization, and modern architectures. Concise answers for FAANG-level interviews.
1
What is a perceptron? How is it different from a neuron in deep learning?
⚡ Easy
Answer: A perceptron is the simplest artificial neuron, introduced by Frank Rosenblatt (1958). It takes inputs, applies weights, sums them, and passes the sum through a step activation function (output 0 or 1). Modern deep-learning neurons use continuous, differentiable activations (ReLU, sigmoid, tanh) and are stacked in multiple layers.
Perceptron: Step function, binary output, linear separator
Modern Neuron: Differentiable, continuous output, stackable
2
Why do we need non-linear activation functions?
⚡ Easy
Answer: Without non-linearity, stacking multiple linear layers collapses into a single linear transformation. Non-linear activations (ReLU, sigmoid, tanh) allow neural networks to approximate any complex, non-linear function (universal approximation theorem).
# Linear composition: W2(W1*x) = (W2*W1)*x → Still linear!
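A quick numerical check of this collapse (an illustrative NumPy sketch; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer"
W2 = rng.standard_normal((2, 4))   # second "layer"
x = rng.standard_normal(3)

stacked = W2 @ (W1 @ x)            # two linear layers, no activation
collapsed = (W2 @ W1) @ x          # one equivalent linear layer

assert np.allclose(stacked, collapsed)  # identical: no extra expressive power
```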
3
Compare ReLU, Sigmoid, and Tanh activations. When to use each?
📊 Medium
Answer:
- ReLU (max(0,x)): Default for hidden layers. Fast, sparse, mitigates vanishing gradient. Dead neurons issue.
- Sigmoid (0 to 1): Output layer for binary classification. Prone to vanishing gradient.
- Tanh (-1 to 1): Zero-centered, often used in RNNs/classical nets. Still suffers saturation.
✅ ReLU: Most common for CNNs/Transformers
⚠️ Sigmoid/Tanh: Used in specific gates (LSTM) or binary output
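The three functions side by side, as a minimal sketch (helper names are illustrative):

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))      # [0. 0. 2.] — sparse: exact zeros for negative inputs
print(sigmoid(x))   # values in (0, 1); saturates for large |x|
print(np.tanh(x))   # values in (-1, 1); zero-centered
```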
4
Explain backpropagation in simple terms.
📊 Medium
Answer: Backpropagation computes gradients of the loss with respect to each weight using the chain rule. It propagates error backward from output to input, layer by layer. These gradients are used by optimizers (SGD, Adam) to update weights and minimize loss.
∂L/∂w = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w
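The chain rule above, worked out for a single sigmoid neuron with squared-error loss and verified against a numerical gradient (toy values, assumed for illustration):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

w, x, y = 0.5, 2.0, 1.0          # weight, input, target (toy values)

# Forward: z = w*x, y_hat = sigmoid(z), L = (y_hat - y)^2
z = w * x
y_hat = sigmoid(z)

# Backward via the chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = 2.0 * (y_hat - y)
dyhat_dz = y_hat * (1.0 - y_hat)
dz_dw = x
analytic = dL_dyhat * dyhat_dz * dz_dw

# Sanity check with central finite differences
eps = 1e-6
L = lambda w: (sigmoid(w * x) - y) ** 2
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)

assert abs(analytic - numeric) < 1e-6
```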
5
What is vanishing gradient? How do you fix it?
🔥 Hard
Answer: Vanishing gradient occurs when gradients become extremely small in early layers, preventing learning. Causes: deep networks with sigmoid/tanh. Fixes: Use ReLU, residual connections (ResNet), batch normalization, proper weight initialization (Xavier/He), LSTM gates.
✅ Fixes: ReLU, residual connections, BatchNorm
⚠️ Risky: deep sigmoid/tanh stacks
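Why deep sigmoid stacks vanish: sigmoid'(z) peaks at 0.25, and backprop multiplies one such factor per layer. A small sketch of the effect:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Best case for sigmoid: derivative is maximal at z = 0
z = 0.0
d = sigmoid(z) * (1.0 - sigmoid(z))   # 0.25

grad_20_layers = d ** 20              # chain rule multiplies the per-layer factors
print(grad_20_layers)                 # ~9e-13: early layers get almost no signal

# ReLU passes gradient 1 for active units, so the product does not shrink
relu_grad = 1.0 ** 20
```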
6
What is the difference between batch gradient descent, SGD, and mini-batch?
⚡ Easy
Answer:
- Batch GD: Full dataset – accurate but slow, memory heavy.
- SGD: One sample at a time – fast updates, high variance.
- Mini-batch: Subset (e.g., 32, 64) – balance between speed and stability. Most common.
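A minimal mini-batch SGD loop on a least-squares toy problem (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                     # noise-free targets

w = np.zeros(3)
batch_size, lr = 32, 0.1
for epoch in range(50):
    perm = rng.permutation(len(X))                 # shuffle each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]               # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad                             # one iteration = one update

print(w)  # close to true_w
```

Setting `batch_size = len(X)` recovers batch GD; `batch_size = 1` recovers plain SGD.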
7
Explain Dropout. Why does it work?
📊 Medium
Answer: Dropout randomly deactivates each neuron during training with probability p. It prevents co-adaptation of neurons, forces redundant representations, and acts as implicit ensemble learning. At inference all neurons are active; activations are rescaled to compensate (inverted dropout divides by the keep probability 1−p during training, so no change is needed at test time).
model.add(Dropout(0.5)) # each unit dropped with probability 0.5 during training
8
What is Batch Normalization? How does it help?
📊 Medium
Answer: BatchNorm normalizes layer outputs to zero mean and unit variance within each mini-batch. It stabilizes training, allows higher learning rates, reduces internal covariate shift, and acts as a regularizer.
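A sketch of the BatchNorm forward pass at training time (function name and values are illustrative; running statistics for inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # learnable scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((64, 4))   # shifted, scaled activations
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```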
9
What is weight initialization? Why is it important?
📊 Medium
Answer: Weight initialization sets initial values of weights. Poor init causes vanishing/exploding gradients. Xavier init for tanh/sigmoid, He init for ReLU. Proper init speeds convergence.
Xavier: Var = 1/n_in
He: Var = 2/n_in
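The two variance rules above as a quick sketch (helper names are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Xavier: Var = 1 / n_in, suited to tanh/sigmoid
    return rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out, rng):
    # He: Var = 2 / n_in, suited to ReLU (half the units output zero)
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.var())  # ~2/512 ≈ 0.0039
```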
10
What is the Universal Approximation Theorem?
🔥 Hard
Answer: A feedforward network with a single hidden layer and non-linear activation can approximate any continuous function on a compact domain, given enough neurons. Depth improves parameter efficiency, not just theoretical capacity.
11
What is the difference between epoch, batch, and iteration?
⚡ Easy
Answer:
- Epoch: One full pass of entire training data.
- Batch: Number of samples processed before update.
- Iteration: One weight update on one batch. Iterations per epoch = dataset size ÷ batch size.
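The arithmetic in one short sketch (an MNIST-sized training set is assumed for illustration):

```python
import math

n_samples, batch_size, n_epochs = 60_000, 128, 10

iterations_per_epoch = math.ceil(n_samples / batch_size)   # 469 updates per epoch
total_iterations = n_epochs * iterations_per_epoch         # 4690 updates overall

print(iterations_per_epoch, total_iterations)
```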
12
What is cross-entropy loss? When do you use it?
⚡ Easy
Answer: Cross-entropy measures difference between predicted probability distribution and true labels. Used for classification (binary: binary cross-entropy, multi-class: categorical cross-entropy). Preferred over MSE for classification.
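Categorical cross-entropy in a few lines (function name and probabilities are illustrative):

```python
import numpy as np

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Categorical cross-entropy: -sum(true * log(pred))."""
    return -np.sum(p_true * np.log(p_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])            # one-hot label: class 1
confident = np.array([0.05, 0.90, 0.05])
uncertain = np.array([0.30, 0.40, 0.30])

print(cross_entropy(y_true, confident))  # ~0.105: confident + correct → low loss
print(cross_entropy(y_true, uncertain))  # ~0.916: hedging → higher loss
```

Note how the loss only depends on the probability assigned to the true class, and punishes confident wrong answers heavily.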
13
Explain underfitting and overfitting in neural networks.
⚡ Easy
Answer:
- Underfitting: Model too simple, high bias – fails on training data. Fix: increase capacity, train longer.
- Overfitting: Model memorizes noise, high variance – low train error, high test error. Fix: dropout, regularization, more data.
14
What is the role of the learning rate?
⚡ Easy
Answer: Learning rate controls step size during gradient descent. Too high: overshoot, divergence. Too low: slow convergence, gets stuck. Use learning rate schedules or adaptive optimizers (Adam).
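Both failure modes can be seen on the toy objective f(w) = w², whose gradient is 2w (an illustrative sketch):

```python
# Gradient descent on f(w) = w^2: update is w -= lr * 2w
def descend(lr, w=1.0, steps=20):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(lr=0.1))    # shrinks by factor 0.8 per step: converges toward 0
print(descend(lr=1.1))    # multiplies by -1.2 per step: oscillates and diverges
```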
15
Compare Adam vs SGD optimizer.
📊 Medium
Answer:
- SGD: Simple, requires manual LR tuning, may need momentum.
- Adam: Adaptive LR + momentum, works well out of the box, less sensitive to hyperparameters. Tends to generalize slightly worse than well-tuned SGD.
16
What is gradient clipping? When is it needed?
📊 Medium
Answer: Gradient clipping caps gradients to a threshold value during backprop. Prevents exploding gradients, common in RNNs and Transformers. Maintains stable training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
17
What is the difference between a neural network and a deep neural network?
⚡ Easy
Answer: "Neural network" is a broad term. Deep neural network (DNN) typically has more than 2-3 hidden layers. Depth allows hierarchical feature learning. Shallow nets may suffice for simple tasks.
18
What are skip connections? Why are they useful?
🔥 Hard
Answer: Skip connections (ResNet) add input of a layer to its output (F(x) + x). They alleviate vanishing gradient, enable training of very deep networks (>100 layers), and act as gradient superhighways.
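A minimal residual block in NumPy (names and sizes are illustrative). With small inner weights the block starts out close to the identity, which is part of why very deep residual stacks remain trainable:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, with F a small two-layer ReLU transform."""
    h = np.maximum(0.0, x @ W1)   # inner ReLU layer
    return h @ W2 + x             # identity shortcut added back

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W1 = 0.01 * rng.standard_normal((d, d))   # tiny weights: F(x) ≈ 0 at init
W2 = 0.01 * rng.standard_normal((d, d))

y = residual_block(x, W1, W2)
print(np.abs(y - x).max())  # small: the block is near-identity at init
```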
19
What is the F1 score? When is it better than accuracy?
📊 Medium
Answer: F1 is the harmonic mean of precision and recall. Better than accuracy for imbalanced datasets. For example, in fraud detection (99.9% negative), a model that always predicts "not fraud" has high accuracy but is useless; F1 reflects performance on the minority class.
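The fraud example in numbers (counts are illustrative):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1000 transactions, 10 of them fraud.
# Model A always predicts "not fraud": accuracy 990/1000 = 99.0%, catches nothing.
# Model B catches 8 frauds (tp=8) with 4 false alarms (fp=4), misses 2 (fn=2):
accuracy_b = (8 + 986) / 1000     # 99.4% — barely distinguishable from Model A
print(f1_score(8, 4, 2))          # ~0.727 — reflects real detection skill
```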
20
How do you decide the number of layers and neurons?
🔥 Hard
Answer: No fixed rule. Start with architecture proven for similar tasks. Use validation error: increase capacity until overfitting, then add regularization. Heuristic: more data → deeper/wider. Automated via hyperparameter search (Grid/Random/Bayesian).
Start simple, scale up
Avoid guessing randomly
Quick Recap: Neural Networks Interview
Core Concepts
- Perceptron → Deep Neuron
- Non-linearity = Universal Approx
- Backprop = Chain Rule
- ReLU for hidden, Sigmoid for binary out
Problems & Fixes
- Vanishing Gradient → ReLU, ResNet
- Overfitting → Dropout, BatchNorm
- Exploding Gradient → Clipping
- Slow Convergence → Adam, He init
Verdict: Master these 20 Q&A to ace neural networks interviews.